
I have the following page

http://www.fda.gov/downloads/scienceresearch/fieldscience/laboratorymanual/ucm092156.pdf

I would like to find the pages on www.fda.gov that link to this page. How can I do that?

Norfeldt
  • What do you mean by _links to this page_? Places on the FDA website that point to that particular link? – Tim G. Aug 20 '16 at 15:52
  • Places on the FDA website that point to that particular link, yes – Norfeldt Aug 20 '16 at 16:03
  • Possible duplicate of [Finding pages on a webpage that contain a certain link](http://superuser.com/questions/1034567/finding-pages-on-a-webpage-that-contain-a-certain-link) – Norfeldt Aug 23 '16 at 19:14

1 Answer

  1. You can use wget to recursively download the entire website:

    wget --recursive --page-requisites --html-extension --no-parent --domains www.fda.gov www.fda.gov

  2. You can then use egrep to recursively search through all the files to find which pages link to ucm092156.pdf:

    egrep -r -o 'ucm092156\.pdf' www.fda.gov/
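
If you only want the names of the pages rather than every matching string, step 2 can be wrapped in a small helper. This is a minimal sketch, assuming the mirror from step 1 sits in a local directory; `find_linking_pages` is a hypothetical name, not part of wget or grep.

```shell
# List the files under a mirror directory whose contents mention a file name.
# Usage: find_linking_pages <mirror_dir> <target_name>
# (hypothetical helper, not a standard tool)
find_linking_pages() {
    mirror_dir="$1"
    target="$2"
    # -r: recurse into subdirectories
    # -l: print only the names of files that contain a match
    # -F: treat the target as a fixed string, so "." is not a wildcard
    grep -rlF -- "$target" "$mirror_dir" | sort
}
```

Called as `find_linking_pages www.fda.gov/ ucm092156.pdf`, it should print one path per page that references the PDF. Using `-F` avoids having to escape the dot in the file name.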

Mark Riddell
  • I have a Mac and Windows, no Linux – Norfeldt Aug 20 '16 at 16:04
  • Using Homebrew to get wget... – Norfeldt Aug 20 '16 at 16:08
  • Please note, web admins may not take kindly to you scraping their site, particularly if you have a high bandwidth connection. It is entirely possible that your IP address may be blacklisted. You may want to also include the `--limit-rate` flag to reduce the chances of that happening. For example, `--limit-rate=100k` will reduce your download speed to 100KB/sec – Mark Riddell Aug 20 '16 at 16:17
  • and you tell me this now... it's scraping the site as we speak – Norfeldt Aug 20 '16 at 16:18
  • BTW I found that `grep -rl '*ucm092156.pdf' www.fda.gov/` on mac does the same job. (still waiting for it to finish the download, but looks good so far) – Norfeldt Aug 20 '16 at 16:22
  • is there a way to only scrape `.html` files? It seems to download it all - including `.pdf` files – Norfeldt Aug 20 '16 at 16:32
  • Sort of. You can Accept or Reject certain files, however that process occurs after the file has been downloaded. For example, to only _keep_ htm files: `-A '*.htm'` – Mark Riddell Aug 20 '16 at 16:38
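
Pulling the comments above together, a gentler crawl might look like the sketch below. It assumes that a default of 100k for `--limit-rate`, a one-second `--wait`, and rejecting `*.pdf` are acceptable for your case; `polite_crawl` and the `RUN` variable are hypothetical names used only for illustration. By default it prints the command so you can review it first.

```shell
# Build a rate-limited recursive wget command that skips PDF files.
# Usage: polite_crawl <site> [rate]
# Prints the command by default; set RUN=1 in the environment to execute it.
# (polite_crawl and RUN are hypothetical names, not wget features)
polite_crawl() {
    site="$1"
    rate="${2:-100k}"   # assumed default, taken from the example in the comments
    # Replace the positional parameters with the full wget command line.
    set -- wget --recursive --no-parent --domains "$site" \
        --limit-rate="$rate" --wait=1 --random-wait \
        --reject '*.pdf' "$site"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}
```

For example, `polite_crawl www.fda.gov 50k` prints the wget command with a 50 KB/s cap, and `RUN=1 polite_crawl www.fda.gov` starts the crawl. As far as I know, wget still fetches HTML pages that fail an accept or reject rule so it can follow their links, so check the result on a small section of the site first.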