Not all web-copiers were created equal.
My experience when copying one website a year ago was that most of
the tools failed to do a complete download.
I finally found a utility that managed to scrape the whole website.
As it's a paid service aimed at one website, the Wayback Machine,
I won't mention the utility's name.
However, that utility took a couple of weeks to download the website,
and since I was able to follow its progress, I could see it sometimes
spend hours on a single problematic file.
The results were perfect and I never noticed anything missing.
I have no idea what algorithms this utility used, but evidently the
straightforward algorithms that just follow links are not very efficient,
except in the simplest cases.
I would expect Google to be at least as good at scanning a website
as that utility was.
I wouldn't expect Cyotek WebCopy and HTTrack to spend as much
processing time and bandwidth on downloading a website as that
utility did, or as Google can afford to.
Come to think of it, there is another mechanism that is a very
good explanation: long memory.
I believe that an efficient scraper would work like this:
- Scan pages and their links
- Collect all the URLs, together with some version indicator,
perhaps the date/time, or perhaps a checksum
- If a page hasn't changed since the last scrape, don't re-index it
and don't process its links (that was already done previously).
This way a scraper can very efficiently re-scan a website in which
99% of the pages haven't changed, since it skips almost all of them.
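To make the idea concrete, here is a minimal Python sketch of such an
incremental scraper. This is only an illustration of the scheme above,
not anything Google has documented; the names (crawl_site, LinkParser)
and the choice of a SHA-256 checksum as the version indicator are my
own assumptions.

```
# Minimal sketch of an incremental scraper with "long memory".
# Assumption: "memory" maps URL -> checksum of that page from a previous
# crawl, and is persisted between runs.
import hashlib
import requests
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href values of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_site(start_url, memory):
    # Re-visit every URL remembered from previous crawls, plus the start
    # page: remembered pages keep getting checked even if nothing links
    # to them any more.
    queue = [start_url] + list(memory)
    seen_this_run = set()
    while queue:
        url = queue.pop()
        if url in seen_this_run:
            continue
        seen_this_run.add(url)
        body = requests.get(url, timeout=10).text
        checksum = hashlib.sha256(body.encode()).hexdigest()
        if memory.get(url) == checksum:
            # Unchanged since the last scrape: don't re-index it and
            # don't process its links (already done previously).
            continue
        memory[url] = checksum  # new or changed page: (re-)index it
        parser = LinkParser()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith(start_url):  # stay within the website
                queue.append(absolute)
    return memory
```

On a mostly unchanged site, each known page costs one fetch and a
checksum comparison and nothing more, and URLs already in memory keep
being visited even when no page links to them any longer, which is
exactly the "long memory" effect.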
I imagine that, infrequently, the scraper will launch a pruning
operation to verify which URLs still point to valid pages.
I know that Google can indeed sometimes return results pointing to
pages which no longer exist.
This being a costly operation, pruning is probably not done very
frequently.
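A hypothetical pruning pass over the same memory store might look like
the following; the prune function and the decision to drop anything
that doesn't answer a HEAD request with HTTP 200 are just illustrative
assumptions on my part.

```
# Occasional pruning pass: forget URLs that no longer point to a valid page.
import requests

def prune(memory):
    for url in list(memory):  # copy the keys so we can delete while iterating
        try:
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            status = None  # network error: treat like a dead page for now
        if status != 200:
            del memory[url]
    return memory
```

Since this touches every remembered URL, it costs a request per page,
which is why I'd expect it to run only rarely.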
The outcome of such an algorithm is that indexed pages that still
exist will stay indexed for as long as they exist, and sometimes
even afterward.
This seems to me to be the most likely explanation for pages that
are no longer linked to still being indexed by Google: the pages
were previously linked, but are no longer.
(Note: I do not believe that when scraping a website,
Google will look at all the trillions of other websites that
it has indexed to find additional pages to index on the current
website.)