
I need a solution to export all hyperlinks on a webpage (from a single webpage, not from an entire website) and a way to specify which links to export, for example only hyperlinks starting with https://superuser.com/questions/, excluding everything else.
Exporting to a text file is preferred, and the results should be listed one below another, one URL per line:

https://superuser.com/questions/1  
https://superuser.com/questions/2  
https://superuser.com/questions/3
[...]
user198350
  • @JeffZeitlin: I have tried `Invoke-WebRequest` in PowerShell 5. I use both Windows and Linux; a native terminal/PowerShell method is preferred. – user198350 Feb 01 '17 at 16:57
  • Please note that https://superuser.com is not a free script/code writing service. If you tell us what you have tried so far (include the scripts/code you are already using) and where you are stuck, then we can try to help with specific problems. You should also read [How do I ask a good question?](https://superuser.com/help/how-to-ask). – DavidPostill Feb 01 '17 at 16:58
  • If Invoke-WebRequest is not returning the HTML for the page you are interested in, you will need to troubleshoot that first. Once your Invoke-WebRequest succeeds, you should be able to parse the resulting HTML to extract what you want. Do not expect us to write the script for you, as DavidPostill indicates; you will need to 'show your work'. – Jeff Zeitlin Feb 01 '17 at 16:59

2 Answers


If you are running on a Linux or a Unix system (like FreeBSD or macOS), you can open a terminal session and run this command:

wget -O - http://example.com/webpage.htm | \
sed 's/href=/\nhref=/g' | \
grep 'href="http://specify.com' | \
sed 's/.*href="//g;s/".*//g' > out.txt

There may be multiple <a href> tags on a single line, so you have to split them up first (the first sed inserts a newline before every href so that each line contains at most one of them).
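To see what that first sed does, you can run it on a sample line that contains two links (a minimal illustration; the sample HTML line is made up):

echo '<a href="https://superuser.com/questions/1">q1</a> <a href="https://superuser.com/questions/2">q2</a>' | \
sed 's/href=/\nhref=/g'
# every href= now starts its own line, so the grep and the
# second sed only ever have to deal with one URL per line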
To extract links from multiple similar pages, for example all questions on the first 10 pages of this site, use a for loop:

for i in $(seq 1 10); do
wget -O - "http://superuser.com/questions?page=$i" | \
sed 's/href=/\nhref=/g' | \
grep -E 'href="http://superuser.com/questions/[0-9]+' | \
sed 's/.*href="//g;s/".*//g' >> out.txt
done
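Because the listing can shift while the loop runs, the same question may show up on two different pages, so you may also want to deduplicate the collected URLs afterwards:

sort -u out.txt -o out.txt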

Remember to replace http://example.com/webpage.htm with your actual page URL and http://specify.com with the URL prefix you want to keep.
You can specify not only a literal prefix for the URLs to export, but also a regular expression pattern if you use egrep or grep -E in the command given above.
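As a variation on the same idea, grep's -o flag can match and extract in one step, which makes the splitting sed unnecessary (a sketch only; the pattern is just an example and should be adapted to the prefix or expression you actually want):

wget -O - http://example.com/webpage.htm | \
grep -oE 'href="https?://superuser\.com/questions/[0-9]+[^"]*"' | \
sed 's/^href="//;s/"$//' > out.txt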
If you're running Windows, consider taking advantage of Cygwin. Don't forget to select the Wget, grep, and sed packages.

iBug
  • This is almost the method that I use to batch download music from [KHInsider](http://downloads.khinsider.com) without buying their VIP service. Just manually extract the links and place them in a download manager like [IDM](https://www.internetdownloadmanager.com). – iBug Feb 02 '17 at 02:52

If you are okay with using Firefox for this, you can use the add-on Snap Links Plus:

  1. Hold down the right mouse button and drag a selection around the links.

  2. When they are highlighted, press and hold Control while letting go of the right mouse button.

Yisroel Tech
  • Wouldn't work well due to the selection method; the source page can be hundreds of pages long. – user198350 Feb 01 '17 at 17:01
  • So really no page-based method will work, since the "source page" (https://superuser.com/questions/) is only one page and you want it to save from all "hundreds of pages" (like https://superuser.com/questions?page=2) – Yisroel Tech Feb 01 '17 at 17:05
  • That page was only an example. – user198350 Feb 01 '17 at 17:08
  • But still, what do you mean "hundreds of pages"? If you need to press something to load more pages then it isn't really one page. – Yisroel Tech Feb 01 '17 at 17:09
  • "Approximately", for example this page is that long (though it doesn't have hyperlinks, used as an example due to low size): https://easylist-downloads.adblockplus.org/easylist.txt There are more sites I may want to export links from. – user198350 Feb 01 '17 at 17:15
  • Oh, got you. This extension for Chrome seems to do the job: https://chrome.google.com/webstore/detail/link-klipper-extract-all/fahollcgofmpnehocdgofnhkkchiekoo?hl=en – Yisroel Tech Feb 01 '17 at 17:20