Mass grabbing part of HTML source code using shell scripts

Question

I want to download all the recorded shows of a radio show listed on this page: http://www.ellinofreneianet.gr/sounds.php?s=0&p=10&o=l

Each show has its own page of this type: http://www.ellinofreneianet.gr/sound.php?id=7101
There are about 7 thousand of these pages, and on each one I want to grab line 422 of the source code, where the download link is located.
Grabbing by line number is not required; matching the regular expression ".=podcast/." works too.

How can I grab line 422 of every page of that type, or get the "=podcast/****.mp3" part, using shell scripts or commands?

I edited it for better understanding. – NoName Sep 17 '14 at 15:21

Volker Siegel · Accepted Answer · 2014-09-17T16:53:25.317

Something like this?

for i in {7101..7200} ; do  wget -q -O - http://www.ellinofreneianet.gr/sound.php\?id\=$i | grep ".=podcast/." ; done

The wget options are -q (quiet: show no progress or status messages) and -O - (write the downloaded page to stdout).

Not every page has an mp3 link there; some even show a page that could be the 404 error page. The pages starting from 0 also seem to be empty.

The empty pages have URLs ending in podcast/", so we can exclude them by matching only strings that do not have a " at that position:

... | grep ".=podcast/[^\"]"
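
The filter above can be tried offline before hitting the site. This is only a sketch: the two sample lines are made up to imitate the page source (the surrounding embed markup is an assumption; only the bitsnbytesplayer.php part comes from the thread), and the variable names are hypothetical.

```shell
# Made-up sample lines; the <embed ...> wrapper is an assumption about the markup.
sample_with_mp3='<embed src="bitsnbytesplayer.php?w=728&h=30&s=1&f=podcast/209TRITi.mp3">'
sample_empty='<embed src="bitsnbytesplayer.php?w=728&h=30&s=1&f=podcast/">'

# Keep only lines where "podcast/" is followed by something other than a quote.
filtered=$(printf '%s\n' "$sample_with_mp3" "$sample_empty" | grep '.=podcast/[^"]')

# Only the line that actually names an mp3 file should be printed.
printf '%s\n' "$filtered"
```

Single quotes around the pattern avoid having to escape the " inside the bracket expression.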

To get only the .mp3 URLs, use

... | grep -o 'bitsnbytesplayer.php.*\.mp3'
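
As a rough illustration of what grep -o keeps, here is the same pattern run on a single made-up line (the embed markup around the link is an assumption). The second pattern, podcast/[^"]*\.mp3, is not from the answer; it is one possible way to get only the "podcast/....mp3" part the question asks about.

```shell
# Made-up sample line; only the bitsnbytesplayer.php... part comes from the thread.
line='<embed src="bitsnbytesplayer.php?w=728&h=30&s=1&f=podcast/209TRITi.mp3" width="728">'

# -o prints only the matching part of the line, not the whole line.
player=$(printf '%s\n' "$line" | grep -o 'bitsnbytesplayer.php.*\.mp3')

# Assumed variant: stop at the first quote so only the file path is kept.
file=$(printf '%s\n' "$line" | grep -o 'podcast/[^"]*\.mp3')

printf '%s\n' "$player" "$file"
```

Note that .* is greedy, so if a line ever contained two .mp3 links, the first pattern would return everything from the first player URL to the last .mp3 on that line.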

You found out yourself how to output the page URL before each mp3 URL. Here's an optimised variant of that, using only one HTTP request per page:

for i in {7100..7200} ; do \
    wget -q -O - http://www.ellinofreneianet.gr/sound.php\?id\=$i | \
    grep -o 'bitsnbytesplayer.php.*\.mp3' && \
    echo http://www.ellinofreneianet.gr/sound.php\?id\=$i ; done | sed -n 'h;n;p;g;p'

The && echo ... prints the page URL only if the preceding grep found an mp3 URL. The sed command then swaps the two lines of each pair, so the page URL comes first and the mp3 URL second.
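
To see what the sed command does in isolation, it can be fed a few placeholder lines instead of real wget output. The strings mp3-A, page-A and so on are stand-ins, not real data.

```shell
# Input order as produced by the loop: mp3 line first, page URL second.
# h = copy the mp3 line to the hold space, n = read the next line (page URL),
# p = print it, g = bring the mp3 line back, p = print it.
swapped=$(printf '%s\n' 'mp3-A' 'page-A' 'mp3-B' 'page-B' | sed -n 'h;n;p;g;p')

printf '%s\n' "$swapped"
```

This relies on the input arriving in strict pairs. If grep -o ever printed more than one match for a page, the pairs would drift out of step.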

edited Sep 17 '14 at 16:53

answered Sep 17 '14 at 15:12

I get "syntax error near unexpected token `wget'" Sorry but I'm not experienced with Linux. – NoName Sep 17 '14 at 15:27
Oh, sorry, my fault, I'm using `zsh`, you probably use `bash`. I'll change it. – Volker Siegel Sep 17 '14 at 15:30
Is it possible in the second case to print also the url? E.g. `http://www.ellinofreneianet.gr/sound.php?id=7101 bitsnbytesplayer.php?w=728&h=30&s=1&f=podcast/209TRITi.mp3` – NoName Sep 17 '14 at 15:42
What is the second case? Does the last line not work? – Volker Siegel Sep 17 '14 at 15:44
By the second case I mean "To get only the .mp3 urls, use". It works, but I want the first line to be the URL from which the mp3 link was grabbed and the second line to be the grabbed mp3 link. – NoName Sep 17 '14 at 15:47
I found how to do it, `for i in {7100..7200} ; do wget -q -O - http://www.ellinofreneianet.gr/sound.php\?id\=$i | grep -o -q 'bitsnbytesplayer.php.*\.mp3' && echo http://www.ellinofreneianet.gr/sound.php\?id\=$i ; wget -q -O - http://www.ellinofreneianet.gr/sound.php\?id\=$i | grep -o 'bitsnbytesplayer.php.*\.mp3' ; done` Thanks for the answer my friend. – NoName Sep 17 '14 at 16:30
Nice! It could be simplified to `for i in {7100..7200} ; do wget -q -O - http://www.ellinofreneianet.gr/sound.php\?id\=$i | grep -o 'bitsnbytesplayer.php.*\.mp3' && echo http://www.ellinofreneianet.gr/sound.php\?id\=$i ; done` if you can accept that the mp3 url comes first. – Volker Siegel Sep 17 '14 at 16:41
Ha, we can switch the lines back so the mp3 comes second again. Mind you, we're saving 7000 HTTP requests with that. :) I'll add to the answer. – Volker Siegel Sep 17 '14 at 16:48
