Questions tagged [web-crawler]

76 questions
27 votes, 5 answers

Convert web pages to one file for ebook

I want to download HTML pages (example: http://www.brpreiss.com/books/opus6/) and join them into one HTML file, or some other format that I can use on an ebook reader. Sites with free books don't have standard paging; they're not blogs or forums, so I don't know how to…
Hrvoje Hudo
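
A possible two-step sketch, assuming wget and pandoc are available (the output title and the file ordering are guesses to adjust per site): mirror the book's directory, then let pandoc stitch the mirrored pages into one EPUB.

    # Mirror only this book's directory (stay below the start URL).
    wget --recursive --no-parent --convert-links --adjust-extension \
         http://www.brpreiss.com/books/opus6/

    # Stitch the mirrored pages into a single EPUB; shell glob order
    # rarely matches reading order, so the file list usually needs fixing.
    pandoc www.brpreiss.com/books/opus6/*.html \
           --metadata title="Opus 6" -o opus6.epub
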
24 votes, 2 answers

How to crawl using wget to download ONLY HTML files (ignore images, CSS, JS)

Essentially, I want to crawl an entire site with Wget, but I need it to NEVER download other assets (e.g. images, CSS, JS). I only want the HTML files. Google searches have been completely useless. Here's a command I've tried: wget…
Nathan J.B.
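
One commonly suggested shape for this (example.com is a placeholder): let wget recurse but reject the asset extensions outright. A caveat worth hedging: --reject matches file names, so assets served from extensionless URLs can still get through.

    wget --recursive --no-parent \
         --reject 'jpg,jpeg,png,gif,svg,css,js,ico,pdf,zip' \
         https://example.com/
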
16 votes, 4 answers

Using Wget to Recursively Crawl a Site and Download Images

How do you instruct wget to recursively crawl a website and only download certain types of images? I tried using this to crawl a site and only download JPEG images: wget --no-parent --wait=10 --limit-rate=100K --recursive --accept=jpg,jpeg…
Cerin
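
A sketch of one way that command is usually completed (hosts are placeholders). Two things to expect: wget still fetches HTML pages to find links and deletes them afterwards because they fail the --accept filter, and images hosted on another domain require --span-hosts plus a domain whitelist.

    wget --recursive --level=inf --no-parent --wait=10 --limit-rate=100K \
         --accept 'jpg,jpeg' \
         --span-hosts --domains=example.com,images.example.com \
         https://example.com/
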
15 votes, 2 answers

Why is the @ in an email address sometimes written as [at] on webpages?

Why is @ sometimes written as [at] on webpages? Is there any specific reason?
Sai
15 votes, 1 answer

How to save all files/links from a Telegram chat/channel?

I want to save ALL http(s) links and/or files posted to some Telegram chat (private or group) or channel (like a mailing list). I need an analog of TumblOne (for Tumblr), VkOpt (able to save chat history on vk.com), or jDownloader (for file…
WallOfBytes
12 votes, 4 answers

How "legal" is site-scraping using cURL?

Recently I was experimenting with cURL, and I found that a lot is possible with it. I built a small script that crawls a music site, which plays songs online. In the course of my experiments, I found that it is possible to crawl the song source as well…
Chetan Sharma
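
Legality depends on the site's terms of service and local law, not on cURL itself. A minimal courtesy check before scraping (placeholder host) is to read the crawl policy the site itself publishes:

    # robots.txt states which paths the site asks automated clients to avoid.
    curl -s https://example.com/robots.txt
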
7 votes, 1 answer

wget: recursively retrieve urls from specific website

I'm trying to recursively retrieve all possible URLs (internal page URLs) from a website. Can you please help me out with wget? Or is there any better alternative to achieve this? I do not want to download any content from the website, but just…
abhiomkar
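
A common wget-only sketch (placeholder URL): --spider walks the links without saving any content, and the visited URLs are then extracted from the log.

    wget --spider --recursive --no-parent --no-verbose \
         --output-file=wget.log https://example.com/
    grep -oE 'https?://[^ ]+' wget.log | sort -u > urls.txt
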
6 votes, 3 answers

Is it possible to discover all the files and sub-directories of a URL?

I wonder if there is software I can use to discover all the files and sub-directories given a URL? For example, given www.some-website.com/some-directory/, I would like to find all the files in the /some-directory/ directory as well as all…
Mark
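
In general, no: a crawler only sees what is linked or what the server chooses to list, so unlinked files stay invisible. The one easy win is a server with directory listing enabled, which a quick check reveals (reusing the placeholder URL from the question):

    # If listing is enabled, this returns an 'Index of' page whose
    # links name every file in the directory.
    curl -s http://www.some-website.com/some-directory/ | grep -i 'index of'
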
6 votes, 4 answers

What do I use to download all PDFs from a website?

I need to download all the PDF files present on a site. Trouble is, they aren't listed on any one page, so I need something (a program? a framework?) to crawl the site and download the files, or at least get a list of the files. I tried WinHTTrack,…
user385496
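
A frequently cited wget recipe for exactly this (placeholder URL): recurse the whole site but keep only PDFs. HTML pages are still fetched so links can be followed, then discarded because they fail the --accept filter.

    wget --recursive --level=inf --no-parent --accept pdf \
         https://example.com/
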
5 votes, 2 answers

Tool to recursively convert an HTML file to PDF?

Are there any tools which not only convert an HTML file to PDF but also follow links, so that in the end I get one (!) PDF file containing all the HTML files?
user27076
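
One possible pipeline, assuming wget plus wkhtmltopdf are installed (URL and crawl depth are placeholders): mirror the pages first, then hand them all to wkhtmltopdf, which accepts multiple input pages and writes a single PDF.

    wget --recursive --level=2 --no-parent --convert-links \
         --adjust-extension https://example.com/

    # wkhtmltopdf takes input pages first, the output file last.
    # Assumes no spaces in file names; page order usually needs fixing.
    wkhtmltopdf $(find example.com -name '*.html' | sort) combined.pdf
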
4 votes, 1 answer

Extract data from an online atlas

There is an online atlas that I would like to extract values from. The atlas provides a tool ('Query') to extract values when you click a location or enclose a region on the map, or you can specify the latitude/longitude of a point where you want…
KAE
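
If the atlas's Query tool fetches values over HTTP, the request it sends (visible in the browser's developer tools, Network tab) can often be replayed in a loop. Everything below is hypothetical; the endpoint and parameter names must be copied from what the real site actually sends.

    # Hypothetical endpoint and parameters; substitute the captured request.
    for lat in 40.0 40.5 41.0; do
        curl -s "https://atlas.example.com/query?lat=${lat}&lon=-105.0" >> values.txt
    done
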
4 votes, 2 answers

Why can't website copy tools like Cyotek WebCopy and HTTrack find files that search engines like Google can?

I would like to keep the target website private, but here are some details: it's a personal (as in single-author) public documentation / portfolio / blog sort of website; it seems to be hosted using Apache; the contents are static as far as I can…
Den
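
Part of the usual explanation: copy tools only follow links reachable from the start page, while search engines also ingest sitemaps and remember pages that were linked in the past. A quick check (placeholder host) is whether the site publishes a sitemap the copier never reads:

    curl -s https://example.com/sitemap.xml
    curl -s https://example.com/robots.txt   # often names the sitemap URL
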
4 votes, 1 answer

Finding pages on a website that contain a certain link

Google does a good job of finding relevant information. Say I google: FDA's opinion on ISO-9001. Then it finds a link to a PDF on…
Norfeldt
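
Without access to Google's index, one brute-force sketch (URL and link target are placeholders) is to mirror the candidate site and grep the saved pages for the link in question:

    wget --recursive --level=3 --no-parent --adjust-extension \
         https://example.com/
    grep -rl 'target-document.pdf' example.com/
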
3 votes, 5 answers

Website crawler/spider to get site map

I need to retrieve a whole website map, in a format like: http://example.org/ http://example.org/product/ http://example.org/service/ http://example.org/about/ http://example.org/product/viewproduct/ I need it to be link-based (no file or dir…
ack__
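
The --spider trick from the wget question above can be shaped into exactly this list (placeholder URL): crawl without saving anything, then reduce the log to unique page URLs, one per line.

    wget --spider --recursive --no-verbose --output-file=crawl.log \
         http://example.org/
    sed -n 's/.*URL:\([^ ]*\).*/\1/p' crawl.log | sort -u
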
3 votes, 2 answers

Firefox addon to download a whole site and one step more

Do you know of any Firefox add-on that could download a whole website and also download all the sites linked from the first one? I mean including all images and so on.
oneat
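
Not a Firefox add-on, but the same result can be sketched from the command line (placeholder URL): a depth-2 crawl that spans to other hosts (required to fetch the linked sites) and pulls page requisites such as images. Note that --span-hosts without a --domains limit can fetch far more than expected.

    wget --recursive --level=2 --span-hosts --page-requisites \
         --convert-links https://example.com/
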