How a web crawler is supposed to work:
1. Reads /robots.txt
3. Parses robots.txt and honors its User-agent, Allow, and Disallow directives
3. Returns periodically to retrieve permitted content
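The well-behaved flow above can be sketched with Python's stdlib `urllib.robotparser`; the robots.txt content and crawler name here are hypothetical, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block /private/, allow everything else.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks can_fetch() before every request.
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/data")) # False
```

A compliant crawler runs this check on every URL before fetching it, and refreshes its cached robots.txt periodically rather than crawling first and reading the rules afterward.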
How AI/LLM training crawlers work:
1. Crawls entire website
2. Reads /robots.txt
3. Returns 10 minutes later
4. GOTO 1.