Good morning, Madrid! Third day of #IETF123 https://www.ietf.org/meeting/123/
For me, today, MAPRG (crawlers) and the plenary.
Discussion
MAPRG deals with measurement and analysis of Internet protocols. Today's session is about the impact of #AI crawlers. [Personal opinion: whether a bot crawls for AI or for any other reason does not change the impact.]
This is related to the aipref (AI preferences) working group and the BoF on HTTP bot authentication.
#IETF123
Testimony from the #CommonCrawl https://commoncrawl.org/ people. Blocking bots is more and more common, for instance by Cloudflare, so research projects like CommonCrawl (bot "CCBot") suffer.
147 regular expressions to identify and classify refusals.
HTTP status codes can be wrong, such as the 430 returned by Shopify, or a 429 returned for non-transient refusals.
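To illustrate why regular expressions are needed on top of status codes, here is a minimal Python sketch of a refusal classifier. The three patterns, the labels, and the `classify_response` helper are hypothetical and for illustration only; they are not CommonCrawl's actual list of 147 expressions.

```python
import re

# Hypothetical patterns and labels, for illustration only.
REFUSAL_PATTERNS = [
    ("captcha", re.compile(r"captcha|verify (that )?you are (a )?human", re.I)),
    ("access_denied", re.compile(r"access denied|forbidden|blocked", re.I)),
    ("rate_limit", re.compile(r"too many requests|rate limit", re.I)),
]

# Status codes treated as refusals in this sketch (assumption).
# 430 is included because the thread reports Shopify returning it.
REFUSAL_STATUSES = {401, 403, 429, 430}


def classify_response(status: int, body: str) -> str:
    """Return a refusal label, trusting the body text before the status code."""
    for label, pattern in REFUSAL_PATTERNS:
        if pattern.search(body):
            return label
    if status in REFUSAL_STATUSES:
        return "refused_unspecified"
    return "not_a_refusal"
```

The body is checked first because, as noted above, the status code alone can be misleading: a 429 with an "access denied" page is classified as a denial here, not as a transient rate limit.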
Many sites can be unreachable because they are centralized under one company, like Newfold Digital.
On the other side (the content servers), how to defend against crawlers? [With the usual confusion between the usages we don't like, such as gen AI, and the stress that crawling puts on the server. Those are two different issues.]
robots.txt is not always respected (for instance by TikTok's ByteSpider). The vast majority of artist-related HTTP servers don't use robots.txt (awareness? Also, many platforms do not allow the user to edit robots.txt).
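For reference, this is roughly what "respecting robots.txt" means for a well-behaved crawler, sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the example.org URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse CCBot entirely, hide /private/ from everyone.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.org/gallery/"))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.org/gallery/"))
```

Nothing enforces this check: it only works if the crawler chooses to run it, which is exactly the problem with bots that ignore the file.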
I notice that Cloudflare blocks serious AI bots (those that respect robots.txt) but not the many unknown bots that make up most of the traffic and trouble.
Now the testimony from #Wikimedia: network usage has increased (+50 % in the last year), partly from bots (not always AI bots).
Bots have broader interests than humans (they don't go to the most popular page of the day) so are less often served from cache. Bots make up 35 % of the traffic but 65 % of the expensive [not from caches] traffic.
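A quick back-of-the-envelope calculation on those two percentages, as a Python sketch. It assumes both figures count the same unit (for example requests) and that all traffic is either bot or human; the function name is made up. The overall share of expensive traffic cancels out, so it is not needed.

```python
def relative_miss_likelihood(bot_share_all: float, bot_share_expensive: float) -> float:
    """How much more likely a bot request is to be expensive than a human one."""
    human_share_all = 1.0 - bot_share_all
    human_share_expensive = 1.0 - bot_share_expensive
    # Each rate is proportional to P(expensive | source); the common factor cancels.
    bot_rate = bot_share_expensive / bot_share_all
    human_rate = human_share_expensive / human_share_all
    return bot_rate / human_rate


ratio = relative_miss_likelihood(0.35, 0.65)
print(f"{ratio:.2f}")  # prints 3.45
```

Under these assumptions, a bot request is about 3.4 times more likely than a human request to bypass the caches.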
But like the IETF, Wikimedia does not want to block: the goal is to make knowledge available.
Heavy users should download the dumps?