🤖 STOP AI bots from scraping your data! ✋
Instead of fighting with dozens of robots.txt files (hello WordPress & Gitea), we go on the offensive, centrally. 🛡️
We block GPTBot, ClaudeBot, and the rest right at the door, at the level of our dear NGINX Proxy Manager!
It's cleaner, more efficient, and it makes our CPU happy. 😉
👉 The full method, with the .conf file to create: https://wiki.blablalinux.be/fr/blocage-robots-ia-nginx-proxy-manager
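For reference, a minimal sketch of what such a centralized block can look like; the bot list, variable name, and include points are illustrative assumptions, not the exact file from the linked wiki:

    # http-level snippet (e.g. an NPM custom include; the path depends on your setup)
    map $http_user_agent $is_ai_bot {
        default      0;
        ~*gptbot     1;
        ~*claudebot  1;
        ~*ccbot      1;
        ~*bytespider 1;
    }

    # server/location-level snippet, included for each proxied host
    if ($is_ai_bot) {
        return 403;
    }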
Instagram collects scary amounts of your data 🫣 Mastodon on the other hand - None! ✅
What do we learn? There are great services available that do not abuse your data.
Give your friend the nudge they need - they deserve better than Meta's constant surveillance 👉🏼 https://tuta.com/blog/how-to-delete-an-instagram-account
The developer of Bear Blog on the latest wave of scrapers he faced.
They've depleted all human-created writing on the internet, and are becoming increasingly ravenous for new wells of content.
[...]
I'm still speculating here, but I think [mobile] app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.
sysadmins/webmasters of fedi:
I am looking for suggestions of which search engine crawlers I should consider permitting in my robots.txt file.
There can definitely be value in having a site indexed by a search engine, but I would like to deliberately exclude all of those which use the crawled data to train LLMs and other genAI. More specifically, I would only like to allow those which have an explicit stance against training on others' data in this fashion.
Currently I reject everything other than Marginalia (https://marginalia-search.com/). Are there any others I should consider?
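Not the poster's setup, but a rough sketch of that allow-list pattern in robots.txt, assuming Marginalia's crawler identifies itself with a token like search.marginalia.nu (worth verifying against their current documentation):

    # block everything by default
    User-agent: *
    Disallow: /

    # allow only the crawlers you trust (the token here is an assumption, check their docs)
    User-agent: search.marginalia.nu
    Disallow: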
Yay, I implemented a way to keep #AI #Crawlers from hammering our puny community server w/o relying on JavaScript/Anubis yet:
* hashlimit HTTPS to only a few new connections/min per source (browsers reuse one or a few persistent connections anyway); rough iptables sketch below
* anything above that hashlimit gets hard dropped
* configure the webserver to close the connection when the reply code is >= 400
* make a hidden page with thousands of nonsense links that all end in 404
* link to that hidden page with hidden links from almost every other page
* put that hidden page (and some non-existent links) into robots.txt as 'Disallow'
🍿🎉
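A minimal sketch of the hashlimit step, assuming iptables; the 6/min and burst values are illustrative, not the poster's actual limits:

    # accept only a few NEW HTTPS connections per source IP per minute
    iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate NEW \
      -m hashlimit --hashlimit-name https --hashlimit-mode srcip \
      --hashlimit-upto 6/min --hashlimit-burst 10 -j ACCEPT
    # anything above that limit is hard dropped
    iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate NEW -j DROP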
I've had it with the aggressive #AI #crawlers now. Some bot has been hitting #MacPorts with a legitimate enough user agent that I can't block it without also blocking users.
Yesterday, it sent 377k requests (62% of the total), 369k of them to URLs forbidden by robots.txt, from 274k unique IPs. Most of it was for content that could be analyzed more quickly with svn checkout or git clone.
Dynamic content on the #web is broken. There's just no way to do that anymore. What a waste of energy.
If your crawler is spoofing its identity because I have disallowed it from my sites, I consider that an attempted server breach. You are no better than someone trying to compromise my SSH connection.
🖥️ My ultra-budget server powering http://websysctl.alfonsosiciliano.net has been running smoothly for the past 2 months. So far, so good!
📈 #Crawlers hit tens of thousands of sysctl parameter pages daily. That's fine, since robots.txt allows it. But why keep requesting non-existent pages as if the site were built with WordPress? 😤 Fortunately, the stack (#FreeBSD + #OpenResty 🌐 + #Lapis ✏️ + a custom-built #database 📦) stays well within the limited resources of my $5/month cloud server.
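The post doesn't say how those junk requests are handled; one cheap option under OpenResty/nginx is to close the connection on the usual WordPress probe paths (the paths below are illustrative):

    # drop typical WordPress probes without sending a response
    location ~* ^/(wp-login\.php|xmlrpc\.php|wp-admin|wp-content) {
        return 444;  # nginx-specific: close the connection silently
    }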
The code might soon be #OpenSource. Stay tuned!
#UNIX #sysctl #WebDev #WebServer #ThePowerToServe #coding #Lua #kernel