
I have a list of links here: https://docs.oracle.com/javase/tutorial/reallybigindex.html

I would like to download all of them. Does anyone know how to go about this?

  • What defines "nothing more"? – random May 28 '16 at 18:10
  • @random To me it seems strange to have this marked as a duplicate when the website in question **offers a zip file containing the needed pages** (see my answer). Why go for a general solution when there is a **specific solution** (which is **not** covered in the dupe) to the OP's question? – DavidPostill May 28 '16 at 22:58
  • Either it's a duplicate of how to download a site and all the links but not all the links (because that's still not clarified) or it's out of scope for wanting to download a specific resource @dav – random May 29 '16 at 00:35

3 Answers


You can download Wget for Windows and use that from cmd.exe:

wget -r -l 2 https://docs.oracle.com/javase/tutorial/reallybigindex.html

If you also want the images and CSS files for those pages, then add -p and also -k to change the links in the HTML so you can browse these pages offline.

This tutorial has some screenshots which may help.

The value of -l 2 will get that first page, and all the pages that it links to. You can increase the number to get deeper pages, but I fear it will follow some links away from the tutorials and around the Oracle website.
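Putting those flags together, a command along these lines should fetch the tutorial pages with their images and CSS while staying inside the tutorial directory. The `--no-parent` option tells wget not to ascend above the starting directory, which addresses the worry about wandering off around the Oracle website; treat this as a sketch of one reasonable invocation, not the only correct one:

```shell
# Recurse 2 levels deep (-r -l 2), grab page requisites such as images
# and CSS (-p), rewrite links for offline browsing (-k), and refuse to
# climb above the /javase/tutorial/ directory (--no-parent).
wget -r -l 2 -p -k --no-parent https://docs.oracle.com/javase/tutorial/reallybigindex.html
```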

Alternatively you could try VisualWget which has a UI!

Alternatively you might like to download the tutorials in ebook form.

joeytwiddle

> How can I download a website and the links it references
>
> I have a list of links here: https://docs.oracle.com/javase/tutorial/reallybigindex.html

Instead of downloading all the links in "The Really Big Index", it is easier to just download the latest Java Tutorials bundle.

It is available in a variety of formats: zip, ePub and mobi.

tutorial.zip includes reallybigindex.html and all of the referenced files.

Here are the top-level contents of the expanded zip file:

*(screenshot of the expanded tutorial.zip)*

DavidPostill

There are many ways to approach this. Not knowing your desired end product, I can't be very specific.

  • wget, as suggested by @joeytwiddle
  • curl (similar to wget)
  • Google Sheets
  • browser add-ons for Chrome or Firefox (search scraper)

I'll expand on Google Sheets (I use this for simple one-time projects):

  • create a new sheet
  • put this in cell A1: https://docs.oracle.com/javase/tutorial/reallybigindex.html
  • put this in cell B2: =IMPORTXML(A1, "//a[@href]/text()") (this retrieves the link text)
  • put this in cell E2: =IMPORTXML(A1, "//a[@href]/@href") (this retrieves the URL)

The second parameter of the function is an XPath expression. You'll need to adjust these expressions to get the results you want; there are many online XPath testers to help you do this.
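If you'd rather do the same extraction in a script, Python's standard-library `html.parser` can produce the same two columns (link text and href) that the IMPORTXML formulas return. This is a sketch run against a small inline HTML snippet standing in for the index page; to use it for real you would fetch reallybigindex.html first (e.g. with urllib) and feed that instead:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element."""
    def __init__(self):
        super().__init__()
        self.links = []    # list of (link text, href) tuples
        self._href = None  # href of the <a> we are currently inside, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Small inline sample standing in for reallybigindex.html
sample = ('<a href="java/index.html">Learning the Java Language</a> '
          '<a href="essential/io/index.html">Basic I/O</a>')
parser = LinkExtractor()
parser.feed(sample)
for text, href in parser.links:
    print(text, "->", href)
```

Running this prints each link's text next to its URL, mirroring the two IMPORTXML columns in the sheet.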

Paulb