I'd like to preface this by saying that I'm very new to the command prompt and I've only been using it for some WGET and YOUTUBE-DL, and that I'm on a Windows 8 PC.

I'd like to get a bunch of links from an html file. The links all start with

https://s-media-cache-ak0.pinimg.com/originals/

and end with

.jpg

Right now I'm using this:

findstr ^https://s-media-cache-ak0.pinimg.com/originals/.*\.jpg index.html > urls.txt

I did some research and I'm using the "range" function of FINDSTR, as you can see, but I still get a lot of extra text that I'm not interested in. Is there any way to trim it down?

  • [now you have two problems](http://nedbatchelder.com/blog/201204/two_problems.html). HTML is too complex for findstr or regex in general. Any findstr solution will eventually break – Rich Homolka Jun 14 '15 at 20:35

1 Answer

As this Stack Overflow answer states, you really shouldn't attempt to parse [X]HTML with regex. findstr has very limited regex support in any case, and it always prints the entire matching line rather than just the matched portion, which is why you're seeing all that extra text.

Use a proper HTML scraper/parser like Xidel instead. A command like the following will do what you're looking for:

xidel <URL or HTML file name> -q -e "//a/extract(@href/resolve-uri(.), 'https:\/\/s-media-cache-ak0\.pinimg\.com\/originals\/.*?\.jpg')[. != '']"
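For example, assuming Xidel is installed and on your PATH, and using the index.html and urls.txt file names from your question (a sketch; adjust the names to match your setup):

xidel index.html -q -e "//a/extract(@href/resolve-uri(.), 'https:\/\/s-media-cache-ak0\.pinimg\.com\/originals\/.*?\.jpg')[. != '']" > urls.txt

Xidel prints each extracted result on its own line, so urls.txt should end up containing just the matched image URLs with none of the surrounding markup.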
Karan