I'd like to preface this by saying that I'm very new to the command prompt and I've only been using it for some WGET and YOUTUBE-DL, and that I'm on a Windows 8 PC.

I'd like to get a bunch of links from an html file. The links all start with

https://s-media-cache-ak0.pinimg.com/originals/

and end with

.jpg

Right now I'm using this:

findstr ^https://s-media-cache-ak0.pinimg.com/originals/.*\.jpg index.html > urls.txt

I did some research and I'm using the "range" function of FINDSTR, as you can see, but I still get a lot of extra text that I'm not interested in. Is there any way to trim it down?

  • [now you have two problems](http://nedbatchelder.com/blog/201204/two_problems.html). HTML is too complex for findstr or regex in general. Any findstr solution will eventually break – Rich Homolka Jun 14 '15 at 20:35

1 Answer

As this Stack Overflow answer states, you really shouldn't attempt to parse [X]HTML with regex. findstr has very limited regex support in any case, and it always prints the entire matching line rather than just the matched portion, which is why you're seeing all that extra text.

Use a proper HTML scraper/parser like Xidel instead. A command like the following will do what you're looking for:

xidel <URL or HTML file name> -q -e "//a/extract(@href/resolve-uri(.), 'https:\/\/s-media-cache-ak0\.pinimg\.com\/originals\/.*?\.jpg')[. != '']"
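For example, assuming Xidel is installed and on your PATH, and using the index.html and urls.txt file names from your question (a sketch; adjust the names to match your setup):

xidel index.html -q -e "//a/extract(@href/resolve-uri(.), 'https:\/\/s-media-cache-ak0\.pinimg\.com\/originals\/.*?\.jpg')[. != '']" > urls.txt

Xidel prints each extracted result on its own line, so urls.txt should end up containing just the matched image URLs with none of the surrounding markup.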
Karan