
How can I make searches with grep on a large number of files run faster? My first attempt uses parallel (improvements or other approaches are welcome).

The first grep simply gives the list of files, which are then passed to parallel, which runs grep again to output matches.

The parallel command is supposed to wait for each grep to finish so that I get the results from each file together. Otherwise I get a mix-up of the results from different files.

I also use sed to skip files if necessary through the command

sed -z "${ista}~${istp}!d"
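GNU sed's first~step address selects every istp-th record starting from record ista; with -z the records are the NUL-separated file names. A minimal illustration, with hypothetical ista=2 and istp=3:

# Keep every 3rd NUL-separated name starting from the 2nd (records 2, 5, 8, ...)
printf '%s\0' f1 f2 f3 f4 f5 f6 | sed -z '2~3!d' | tr '\0' '\n'
# Prints: f2 and f5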

Multiple patterns are stored in the array ${ptrn[@]}, whilst the trailing context after matching lines is defined in ${ictx[@]}.

ptrn=("FN" "DS")   # search patterns, fed one per line to grep -f below
ictx=(-A 8)        # trailing context: print 8 lines after each match

grep --null -r -l "${ictx[@]}"  \
  -f <(printf "%s\n" "${ptrn[@]}") -- "${fdir[@]}"  \
    | sed -z "${ista}~${istp}!d"  \
    | PARALLEL_SHELL=bash psgc=$sgc psgr=$sgr  \
        parallel -m0kj"$procs"  \
          'for fl in {}; do
            printf "\n%s\n\n" "${psgc}==> $fl <==${psgr}"
            grep -ni '"${ictx[@]@Q}"' \
                 -f <(printf "%s\n" '"${ptrn[@]@Q}"') -- "$fl"
          done'
Fatipati
  • If you're grep-ing through files you may want to look at the other apps designed for that: ack-grep `ack`; "the silver searcher" `ag`; ripgrep `rg`; hyperscan; findrepo; ... and a bunch of others. – pbhj Mar 30 '23 at 18:29
  • Does `ripgrep` include concurrent execution on many files? – Fatipati Mar 30 '23 at 19:10

2 Answers


grep is one of the most refined and time-proven tools performance-wise ... See, for example, the speed comparison of grep with other text-processing tools on very large 1G+ files with 8M+ lines here: https://askubuntu.com/a/1420653 ... Also, proper text-processing (i.e. keeping each file's output separate and its lines in the correct order) is not, IMHO, a suitable task for parallel because, as you noticed, it will mix the results from different files and shift their line order ... You did use parallel's -k option to keep the output order the same as the input order, but that will only work as intended if:

  1. You limit the parallel jobs to 1, i.e. -j 1 (the same option is also written --max-procs 1 or -P 1).
  2. You make sure the text is passed in the right order, e.g. by piping the actual text (in the right order/sequence) to parallel and using its --pipe option to pipe the text on to grep afterwards, as in the sketch after this list.
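A minimal sketch of that combination (PATTERN and the file names are placeholders):

# Feed the text in order; parallel splits it into blocks and pipes each block to grep
cat file1 file2 file3 | parallel --pipe -k -j 1 grep 'PATTERN'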

That, however, would defeat your intended purpose of running multiple jobs in parallel, and therefore the added speed gain (if any) would be negligible.

Also, using a for loop requires a full grep run for each file listed in the loop's head, with virtually the same pattern(s) applied every time ... So it might not be the best approach when you are trying to speed things up ... You might be better off using e.g. grep's --recursive option in that case.
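For instance, a single recursive call covers a whole tree in one grep run (a minimal sketch using the patterns from the question; ./somedir is a placeholder):

# One recursive grep instead of a per-file loop
grep -rn -A 8 -e 'FN' -e 'DS' -- ./somedir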

However, you can run multiple jobs in the background by sending each grep call inside your for loop to the background and redirecting its output to a separate file, i.e. grep ... > file1 &, then joining the resulting output files into one file afterwards if you want ... That runs multiple instances of grep concurrently and can greatly speed up the loop ... Please see the demonstration below.
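Applied to grep, that pattern might look something like this (a sketch; the files array and the out_ file names are hypothetical):

for fl in "${files[@]}"; do
  # One background grep per file, each writing to its own output file
  grep -n -e 'FN' -e 'DS' -- "$fl" > "out_${fl##*/}" &
done
wait         # Block until every background grep has finished
cat out_*    # Then combine the per-file results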

For demonstration purposes I will use (sleep N; echo "something" > fileN) & in place of grep ... > file1 & ... The sub-shell syntax (...; ...) is necessary when sending multiple nested commands to the background, but is not needed for a single command:

$ # Creating some background jobs/processes
i=0
for f in file1 file2 file3
  do
  # Start incrementing a counter to use in filenames and calculating sleep seconds.
  ((i++))
  # Send command/s to background
  (sleep $((10*i)); echo "$f $(date)" > "${f}_${i}") &
  # Add background PID to array
  pids+=( "$!" )
  done

# Output:
[1] 31335
[2] 31336
[3] 31338

$ # Monitoring and controlling the background jobs/processes
while sleep 5;
  do
  echo "Background PIDs are: ${pids[@]}"  
  for index in "${!pids[@]}"
    do
    if kill -0 "${pids[index]}" &> /dev/null;
      then
      echo "${pids[index]} is running"
      # Do whatever you want here if the process is running ... e.g. kill "${pids[index]}" to kill that process.
      else
      echo "${pids[index]} is not running"
      unset 'pids[index]'
      # Do whatever you want here if the process is not running.
      fi
    done
  if [[ "${#pids[@]}" -eq 0 ]]
    then
    echo "Combined output files contents:"
    cat file*
    unset i
    unset pids
    break
    fi
  done

# Output:
Background PIDs are: 31335 31336 31338
31335 is running
31336 is running
31338 is running
[1]   Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31335 31336 31338
31335 is not running
31336 is running
31338 is running
Background PIDs are: 31336 31338
31336 is running
31338 is running
[2]-  Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31336 31338
31336 is not running
31338 is running
Background PIDs are: 31338
31338 is running
[3]+  Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31338
31338 is not running
Combined output files contents:
file1 Fri Mar 31 12:20:47 AM +03 2023
file2 Fri Mar 31 12:20:57 AM +03 2023
file3 Fri Mar 31 12:21:07 AM +03 2023

Please also see Bash Job Control.
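And to stop all of the tracked background jobs at once, something like this works (using the pids array from the demonstration):

# Kill every still-tracked background job in one go
kill "${pids[@]}" 2> /dev/null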

Raffa
  • Yes, you are correct. It is problematic when matches from different files get mixed. – Fatipati Mar 30 '23 at 18:43
  • I could use `--recursive` but the search would still be sequential. Could there be other solutions? I do not want grep itself to run faster; I want to run multiple instances, with the ability to stop the grep processes easily when needed. – Fatipati Mar 30 '23 at 18:46
  • @Backy You can send `grep` inside the `for` loop to the background redirecting its output to a separate file i.e. `grep … > file1 &` … That would run multiple instances of it in the background and greatly speed-up the loop. – Raffa Mar 30 '23 at 18:54
  • Ok, so I would refrain from an approach that calls parallel. How could I make it easy to stop the processes with a single command? – Fatipati Mar 30 '23 at 19:06
  • @Backy You can kill all background jobs/processes at once with e.g. `kill $(jobs -p)` or selectively see e.g. [Bash Job Control Basics](https://www.gnu.org/software/bash/manual/html_node/Job-Control-Basics.html) – Raffa Mar 30 '23 at 19:13
  • I want to kill selectively the ones I generate from my script only. But I do not want users to search for them to stop them. – Fatipati Mar 30 '23 at 19:21
  • @Backy I updated the answer for that ... And only the owner or the super user can kill a process. – Raffa Mar 30 '23 at 21:37

This is one of GNU Parallel's examples:

https://www.gnu.org/software/parallel/parallel_examples.html#example-parallel-grep
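At the time of writing, the example there is along these lines (STRING is a placeholder pattern):

# Run 1.5 jobs per CPU core, giving up to 1000 file arguments to each grep
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}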

If you are grepping the same files again and again, this may be useful too: https://stackoverflow.com/a/11913999/363028

Ole Tange