I started crawling 80,000,000 web pages for a personal project on August 1, after an initial trial run pinging 40,000,000 of them.
Cloudflare rate limiting is triggered at least twice a day, and almost immediately if I raise my own request rate.
I'm at 16% of the total as of September 17.
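For concreteness, here's a minimal sketch of the kind of throttled fetch loop I'm describing (the delay values, retry count, and URL are placeholders, not my actual settings):

```python
import random
import time
import urllib.request
from urllib.error import HTTPError

BASE_DELAY = 1.0   # seconds between requests at the "safe" rate (illustrative)
MAX_RETRIES = 5    # give up on a URL after this many rate-limit responses

def fetch(url):
    """Fetch one URL, backing off exponentially on 429/503 responses."""
    delay = BASE_DELAY
    for _ in range(MAX_RETRIES):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except HTTPError as e:
            if e.code in (429, 503):  # typical rate-limit / challenge responses
                # Sleep with jitter, then double the delay and retry
                time.sleep(delay + random.uniform(0, delay))
                delay *= 2
            else:
                raise
    return None  # repeatedly rate-limited; skip this URL for now

data = fetch("https://example.com/")  # placeholder URL
time.sleep(BASE_DELAY)  # pause before the next URL in the queue
```

Even with backoff like this, the limits kick in well below the throughput I'd need to finish in a reasonable time.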
Has rate limiting severely curtailed web archiving capability with the advent of AI?
Should I be crawling from multiple IP addresses?
What volume of pages is typically crawled in day-to-day archiving operations?