I'm trying to recursively retrieve all internal page URLs from a website.
Can you please help me out with wget? Or is there a better alternative to achieve this? I don't want to download any content from the website; I just want to collect the URLs on the same domain.
Thanks!
EDIT
I tried doing this with wget, then grepped the urllog.txt log file afterwards. Not sure if this is the right way to do it, but it works!
$ wget -r -c -R .jpg,.jpeg,.gif,.png,.css -o urllog.txt http://www.example.com/
$ grep -e " http" urllog.txt | awk '{print $3}'
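Since I don't actually want to keep any content, a variation that should work is adding --delete-after (removes each file once it has been retrieved) and -nd (don't recreate the site's directory tree locally), then deduplicating the grep output. This is just a sketch assuming the same wget log format as above, with www.example.com as a placeholder domain:
$ wget -r -nd --delete-after -R .jpg,.jpeg,.gif,.png,.css -o urllog.txt http://www.example.com/
$ grep -e " http" urllog.txt | awk '{print $3}' | grep "^http://www\.example\.com" | sort -u
The second grep keeps only same-domain URLs, and sort -u removes duplicates.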