
I would like to keep the target website private, but here are some details:

  • It's a personal (as in single-author) public documentation / portfolio / blog sort of website
  • It seems to be hosted using Apache
  • The contents are static as far as I can tell
  • When visiting certain paths in a browser, the server responds with Apache's 'Index of' directory listing
  • It does not seem to have a robots.txt
  • It has a root index.html
  • This is not "secret" information (it's on the public web and there are no logins/accounts there)

There are "public" images and HTML files there that are not (ultimately) linked from index.html. The aforementioned tools, Cyotek WebCopy and HTTrack, cannot find those files, yet Google can (a site:example.com search turns them up).

What does Google do that those web copy tools don't?

The idea of this exercise is to both preserve a copy and discover things not yet linked. I am asking this question to ideally both:

  1. Find a way to copy the full website as seen by search engines.
  2. Understand a bit more about the web.
Den

2 Answers


What does Google do that those web copy tools don't?

It follows links from other websites. Tools such as HTTrack are limited to a single starting point, while Google has knowledge of practically the entire web – some of those contents may have been linked in forum posts, mailing list archives, tweets, and so on.

(That's most likely how Google found example.com in the first place – not by learning about a new domain on its own, but by finding a cross-site link to it while indexing another.example.net.)
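To make the difference concrete, here is a minimal sketch (Python, using the requests and BeautifulSoup libraries) of a crawler that accepts extra seed URLs gathered outside the site itself – for instance pages turned up by a site:example.com search – instead of starting only from index.html. The host and seed list are placeholders, not real URLs.

```python
# Minimal breadth-first crawler sketch: unlike a copier with a single
# starting point, it accepts extra seed URLs discovered elsewhere.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

HOST = "example.com"                       # placeholder target host
SEEDS = [
    "https://example.com/index.html",      # the normal starting point
    "https://example.com/old/photo1.html", # hypothetical URL found via a site: search
]

def crawl(seeds):
    seen, queue, found = set(seeds), deque(seeds), []
    while queue:
        url = queue.popleft()
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if r.status_code != 200:
            continue
        found.append(url)
        if "text/html" not in r.headers.get("Content-Type", ""):
            continue                       # don't try to parse images, PDFs, etc.
        for a in BeautifulSoup(r.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == HOST and link not in seen:
                seen.add(link)
                queue.append(link)
    return found

if __name__ == "__main__":
    for url in crawl(SEEDS):
        print(url)
```

Every extra seed you can feed it (or any copy tool that accepts a URL list) opens up whole clusters of pages that are unreachable from the front page alone.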

u1686_grawity
  • Interesting, so beyond that, the only way to get the full website would be essentially brute-forcing all possible page/file names? – Den Jan 12 '23 at 16:54
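(To make the brute-forcing idea from the comment above concrete, here is a rough Python sketch that probes a list of guessed paths and keeps the ones the server answers; the word list is purely hypothetical, and dedicated tools such as ffuf or gobuster do this far more efficiently with large word lists.)

```python
# Sketch of wordlist-based path probing: request each guessed path
# and report the ones the server answers with HTTP 200.
import requests

BASE = "https://example.com"                     # placeholder target
WORDLIST = ["about.html", "cv.html", "photos/",  # hypothetical guesses; a real run
            "drafts/", "old/", "notes.html"]     # would read a large wordlist file

for path in WORDLIST:
    url = f"{BASE}/{path}"
    try:
        r = requests.head(url, allow_redirects=True, timeout=5)
    except requests.RequestException:
        continue
    if r.status_code == 200:
        print("found:", url)
```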

Not all web-copiers were created equal.

My experience copying one website a year ago was that most of the tools failed to do a complete download.

I finally found a utility that managed to scrape the whole website. Since it's a paid service aimed at a single website (the Wayback Machine), I won't mention its name.

However, that utility took a couple of weeks to download the website, and since I was able to follow its progress, I could see it sometimes working for hours to find just one problematic file. The results were perfect and I never noticed anything missing.

I have no idea what algorithms this utility used, but evidently the straightforward algorithms that just follow links are not very effective, except in the simplest cases.

I would expect Google to be at least as good at scanning a website as that utility. I wouldn't expect Cyotek WebCopy and HTTrack to spend as much processing time and bandwidth on downloading a website as that utility did, or as Google can afford to.


Come to think of it, there is another mechanism that is a very good explanation: long memory.

I believe that an efficient scraper would work like this:

  • Scan pages and their links
  • Collect all the URLs, together with some version indicator – perhaps the date/time, or perhaps a checksum
  • If a page hasn't changed since the last scrape, don't index it and don't process its links (that was already done previously).

This way a scraper can very efficiently re-scan a website whose pages are 99% unchanged, skipping almost all of them.
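A rough sketch of that bookkeeping, assuming the scraper keeps a checksum per URL between runs (the state file and the choice of hash are illustrative guesses, not anything Google documents):

```python
# "Long memory" sketch: keep a checksum per URL between runs and only
# re-process the links of pages whose content has actually changed.
import hashlib
import json
import os

import requests

STATE_FILE = "scrape_state.json"   # hypothetical on-disk memory of previous runs

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def rescan(urls):
    state = load_state()
    changed = []
    for url in urls:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        digest = hashlib.sha256(r.content).hexdigest()
        if state.get(url) == digest:
            continue               # unchanged since the last scrape: skip it
        state[url] = digest
        changed.append(url)        # only these pages need their links extracted again
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return changed
```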

I imagine that, infrequently, the scraper will launch a pruning operation to verify which URLs still point to valid pages. I know that Google can indeed sometimes return results for pages that no longer exist. This being a costly operation, pruning is probably not done very frequently.
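A pruning pass over the same remembered state could be as simple as this sketch (again only a guess at how such a scraper might work): check each remembered URL and drop the ones the server no longer serves.

```python
# Pruning sketch: forget remembered URLs that now return 404/410.
import requests

def prune(state):
    for url in list(state):
        try:
            r = requests.head(url, allow_redirects=True, timeout=5)
        except requests.RequestException:
            continue               # network trouble: keep the entry, retry next time
        if r.status_code in (404, 410):
            del state[url]
    return state
```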

The outcome of such an algorithm is that indexed pages that still exist will stay indexed for as long as they exist, and sometimes even afterward.

This seems to me the most likely explanation for pages that are no longer linked still being indexed by Google: the pages were previously linked, but are not any longer.

(Note: I do not believe that, when scraping a website, Google will look at all the trillions of other websites it has indexed to find additional pages to index on the current website.)

harrymc
  • Sounds like an option and makes sense in the context of the other answer too. I wish search engines had an API to copy all the paths they had discovered into a downloader tool. – Den Jan 12 '23 at 17:57
  • Maybe I will try https://github.com/hartator/wayback-machine-downloader – Den Jan 12 '23 at 17:58
  • After thinking more about it, I added another explanation that seems more likely to explain it. – harrymc Jan 12 '23 at 19:22
  • The approach I went for was using Google to mine additional routes that aren't linked from the site itself and feeding those to the web copy tool (a sketch of that step follows below). – Den Jan 13 '23 at 10:47
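For reference, a minimal sketch of that last step, assuming the mined routes are collected in a plain urls.txt (one URL per line): download each one while preserving the URL path as a local directory layout, so the result can be merged with what the copy tool already fetched. From the command line, wget --input-file=urls.txt --force-directories does much the same.

```python
# Sketch: download a list of mined URLs (urls.txt, one per line),
# preserving each URL's path as a local directory structure.
import os
from urllib.parse import urlparse

import requests

with open("urls.txt") as f:          # hypothetical list mined from search results
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    path = urlparse(url).path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"         # directory listings get saved as index.html
    try:
        r = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    if r.status_code != 200:
        continue
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as out:
        out.write(r.content)
    print("saved", path)
```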