
Here is a simple bash script that checks HTTP status codes:

while read -r url; do
    urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5)
    echo "$url  $urlstatus" >> urlstatus.txt
done < "$1"

I am reading URLs from a text file, but it processes only one at a time, which takes too much time. GNU parallel and xargs also process one line at a time (tested).

How can I process URLs simultaneously to improve timing? In other words, I want threading over the URL file rather than over bash commands (which is what GNU parallel and xargs do).

As per an answer below, this code works fine, except that it doesn't process some of the last URLs:

urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 ) && echo "$url  $urlstatus" >> urlstatus.txt &

Maybe adding wait would help... any suggestions?

user7423959
  • You could look into subprocesses for this. That would mean you could start an individual shell/thread for each `curl`. As for your solution using xargs/parallel, it would be worth including it, since you might have just done something wrong. Just reading the file should be fast enough (except if it's really large), but waiting for the answer is probably your problem. – Seth Jan 18 '17 at 12:41
  • Actually, after using parallel it processes a single URL at a time, with the same timing as the normal bash script. – user7423959 Jan 18 '17 at 13:37
  • Why would a single URL be any faster? With a single URL you can do all the parallelization you want; it won't get faster. With multiple URLs, on the other hand, you could request a set of URLs at a time. So the issue might've been how you've called/used parallel. Hence it could be useful to include how you actually tried to use it. – Seth Jan 18 '17 at 13:44
  • Here is an example: cat abc.txt | parallel -j100 --pipe /root/bash.sh abc.txt. Now you have some idea; n1 is also used. It processes one URL at a time, not in parallel, consuming the same time. – user7423959 Jan 18 '17 at 13:47

2 Answers


In bash, you can use the & symbol to run programs in the background. Example:

for i in {1..100}; do
  echo "$i" >> numbers.txt &
done

EDIT: Sorry, but the answer to your question in the comment was wrong, so I have edited the answer. Suggestion with respect to the code:

urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 ) && echo "$url  $urlstatus" >> urlstatus.txt &
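
For completeness, here is a minimal sketch of that line inside the loop from the question, with a trailing wait so the script cannot exit before the backgrounded curl jobs finish (the wait is an addition for illustration, not part of the line above):

while read -r url; do
    # each iteration backgrounds its curl, so requests overlap
    urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5) && echo "$url  $urlstatus" >> urlstatus.txt &
done < "$1"
wait  # block until every background job has completed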
me_alok
  • Can you give a suggestion with respect to the code? Adding this symbol (&) doesn't improve timing. – user7423959 Jan 18 '17 at 13:38
  • Try this urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "${url}" --max-time 5 ) & – me_alok Jan 18 '17 at 13:59
  • Already tried that. – user7423959 Jan 18 '17 at 14:39
  • This worked for me. – ninja Jan 18 '17 at 14:58
  • It works, I tested it before editing the answer. – me_alok Jan 18 '17 at 15:02
  • Your code works fine, but there is one problem: it doesn't process some of the last URLs. It might need a wait added somewhere in the code... any suggestion on this? – user7423959 Jan 19 '17 at 04:15
  • Actually it misses a lot of URLs; only some are shown. – user7423959 Jan 19 '17 at 04:56
  • Adding wait at the end of the file is also not working. – user7423959 Jan 19 '17 at 05:14
  • There is no need to add a wait command here unless you want to limit the number of threads, and then it should be inside the while loop. – me_alok Jan 19 '17 at 07:41
  • For the missing URL issue, what's the output in urlstatus.txt? Is it just the status code that's missing, or the entire URL and status? – me_alok Jan 19 '17 at 07:43
  • The missing URLs are all ones whose status code is 000, so that is not an issue. I want thread control in this script, because a very long text file hangs my system for a while (although it produces results). Any suggestions on adding threading control to this code? – user7423959 Jan 20 '17 at 05:15
  • Can you produce a sample input and output? – me_alok Jan 20 '17 at 06:37
  • Yeah, the output is properly reproduced. Any suggestion on thread control in this script? – user7423959 Jan 20 '17 at 09:00
  • Can you post a sample output (both stdout and urlstatus.txt)? – me_alok Jan 20 '17 at 10:40
  • 1. Here is the input file: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip 2. I am saving your script as bash.sh and executing it from the terminal as ./bash.sh top1m.txt (unzipped from the above). 3. It then produces results in the urlstatus.txt file. 4. I want thread control in this script (you may use some small file as input to test). 5. There are many more files like this; not all are as big as this one, some are 100 KB, 500 KB, etc. 6. Your answer is working; I am just asking if thread control is possible. – user7423959 Jan 20 '17 at 12:51
  • Well, multithreading is working here; use the 'top' command to see it. For thread control, let me see what I can do. – me_alok Jan 21 '17 at 08:22
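
Following up on the thread-control request in the comments above, here is one possible sketch that caps the number of concurrent curl jobs (the limit of 50 is an arbitrary choice, and wait -n requires bash 4.3 or later):

#!/bin/bash
max_jobs=50   # assumed cap on concurrent curl processes; tune to your system

while read -r url; do
    # if the cap is reached, wait for any one background job to finish
    while (( $(jobs -r -p | wc -l) >= max_jobs )); do
        wait -n
    done
    {
        urlstatus=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "$url" --max-time 5)
        echo "$url  $urlstatus" >> urlstatus.txt
    } &
done < "$1"
wait   # let the remaining jobs finish before exiting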

GNU parallel and xargs also process one line at a time (tested)

Can you give an example of this? If you use -j then you should be able to run many processes at a time.
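
For what it's worth, xargs can do much the same with its -P option; a rough sketch (the concurrency level of 50 is an arbitrary choice):

xargs -I {} -P 50 \
    curl -o /dev/null --silent --head --write-out '{}  %{http_code}\n' --max-time 5 '{}' \
    < input.txt >> urlstatus.txt

Unlike parallel -k below, this does not keep the output in input order.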

I would write it like this:

doit() {
    url="$1"
    urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 )
    echo "$url  $urlstatus"
}
export -f doit
cat input.txt | parallel -j0 -k doit

Based on the input.txt:

Input file is txt file and lines are separated  as
ABC.Com
Bcd.Com
Any.Google.Com
Something  like this
www.google.com
pi.dk

I get the output:

Input file is txt file and lines are separated  as  000
ABC.Com  301
Bcd.Com  301
Any.Google.Com  000
Something  like this  000
www.google.com  302
pi.dk  200

Which looks about right:

000 if domain does not exist
301/302 for redirection
200 for success

I must say I am a bit surprised if the input lines you have provided really are parts of the input you actually use. None of these domains exist, and domain names with spaces in them probably never will exist, ever:

Input file is txt file and lines are separated  as
Any.Google.Com
Something  like this

If you have not given input from your actual input file, you really should do that instead of making up stuff, especially if the made-up stuff does not resemble the real data.

Edit

Debugging why it does not work for you.

Please do not write a script, but run this directly in the terminal:

bash # press enter here to make sure you are running this in bash
doit() {
    url="$1"
    urlstatus=$(curl -o /dev/null --silent --head --write-out  '%{http_code}' "${url}" --max-time 5 )
    echo "$url  $urlstatus"
}
export -f doit
echo pi.dk | parallel -j0 -k doit

This should give:

pi.dk  200
Ole Tange
  • Hey, I got the same status code 000. Can you tell me how you are executing your script from the terminal? That may help. – user7423959 Jan 19 '17 at 04:33
  • I put the input lines above into the file `input.txt`. Then I run the exact lines that are written above. My shell is bash. – Ole Tange Jan 19 '17 at 07:49
  • Let me explain the whole process: 1. I copied your bash script, saved it as bash.sh, and gave it execution permissions. 2. My input file is a big file, but I also tested on a small 10-line file; here is the list: www.yahoo.com, www.google.com, facebook.com, amazon.com, bing.com, apple.com, www.microsoft.com, www.windows.com, all separated by lines and saved as top.txt. 3. Then I go to the terminal and type ./bash.sh top.txt. 4. It gives the result 000 for each. 5. Can you assist me further with where I am wrong? Thanks. – user7423959 Jan 19 '17 at 09:19
  • This works fine – user7423959 Jan 20 '17 at 05:12
  • slower than xargs and consumes all PC resources – acgbox Sep 04 '19 at 20:12