How can we know which URLs robots.txt allows us to crawl if we don't know which folder a URL belongs to?

Question

I'm going to write a web crawler, but first I want to know what I will be allowed to crawl.

Correct me if I'm wrong, but in robots.txt, websites specify folders, not URLs, that can and can't be crawled. So how can we know which folder a URL belongs to?

score 0 · Accepted Answer · answered Jan 21 '19 at 14:01


The robots.txt file excludes directory prefixes. For example, if you have a robots.txt excluding a directory /foo, then /foo/bar.html must not be crawled.

For any URL you want to crawl, you have to check whether its path matches one of the directives in the robots file.

See the Google documentation for more info and examples:

The path value is used as a basis to determine whether or not a rule applies to a specific URL on a site. With the exception of wildcards, the path is used to match the beginning of a URL (and any valid URLs that start with the same path).

Note that URLs do not have to indicate actual directories on a server. /download.php?what=thestuff could be functionally equivalent to /download/thestuff and point to the same resource.
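As a rough illustration of the prefix matching described above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` host, the `MyCrawler` user agent, and the `may_crawl` helper are all made up for the example; a real crawler would fetch the site's actual robots.txt instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foo
Disallow: /download/thestuff
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="MyCrawler"):
    """Return True if the parsed rules allow user_agent to fetch url.

    Only the URL path is compared against the Disallow prefixes;
    where the file lives on the server plays no role.
    """
    return parser.can_fetch(user_agent, url)


# Path starts with the disallowed prefix /foo.
print(may_crawl("http://example.com/foo/bar.html"))
# Also starts with /foo, even though it is not "inside a folder".
print(may_crawl("http://example.com/foobar.html"))
# No rule matches this path, whatever the file's location on disk.
print(may_crawl("http://example.com/something.html"))
# The rule for /download/thestuff does not cover this different URL path.
print(may_crawl("http://example.com/download.php?what=thestuff"))
```

Note that `urllib.robotparser` does plain prefix matching and, as far as I know, does not implement the wildcard extensions mentioned in the Google documentation, so treat this as a sketch of the basic rule only.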

answered Jan 21 '19 at 14:01

slhck



What confuses me is the term "directory", because it makes me think that robots.txt excludes directory paths on the server rather than URL paths. Can you confirm that if robots.txt contains `User-agent: *` `Disallow: /download/thestuff`, I can still crawl `http://dl/something.html` even if the path of `something.html` on the server is `/download/thestuff`? – DevAb Jan 21 '19 at 16:45
Yes, you can still crawl that URL. The physical/actual directories on the server do not matter. It's only about matching the URL against the directive in the `robots.txt` file. – slhck Jan 21 '19 at 17:07
