Well, some idiot bot brought my web server to its knees for a few minutes last night by crawling into places that are off-limits to robots (the swine never even fetched robots.txt). So now I'm reading the #Anubis docs.
You Don't Need Anubis
#HackerNews #You #Need #Anubis #hackernews #techblog #softwaredevelopment #programming #insights
Just updated Anubis to v1.23.0 and it looks like something has changed 
Timelines in Tusky or Tuba would not update, although the web UI of my GoToSocial instance still works fine. Checked the Anubis log, and it looked like Anubis was sending challenges to Tusky, Tuba and even to other fedi servers (Mastodon etc.)
So anyway I fixed it 🤞 by creating a new configuration file: /usr/local/share/doc/anubis/data/apps/allow-api-routes.yaml:
- name: allow-api-routes
  action: ALLOW
  expression:
    all:
      - '!(method == "HEAD")'
      - path.startsWith("/api/")
And another one: /usr/local/share/doc/anubis/data/apps/allow-user-routes.yaml:
- name: allow-user-routes
  action: ALLOW
  expression:
    all:
      - '!(method == "HEAD")'
      - path.startsWith("/users/")
And then adding the new configurations to /etc/anubis/<gotosocial_service_name>.botPolicies.yaml:
bots:
  ...
  - import: /usr/local/share/doc/anubis/data/apps/allow-api-routes.yaml
  - import: /usr/local/share/doc/anubis/data/apps/allow-user-routes.yaml
  ...
Restarted the Anubis service, and it works again after that. Dunno if that is even the correct way to do it though. Hopefully I haven't weakened anything
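A quick way to sanity-check the new rules after the restart (just a sketch; <your_instance_domain> is a placeholder, and /api/v1/instance is only one public GoToSocial client-API route, any non-HEAD request under /api/ should behave the same):

# Use GET rather than HEAD, since the rules above deliberately exclude HEAD.
# With the ALLOW rules loaded, this should print GoToSocial's own status code
# (200) instead of whatever the Anubis challenge page responds with.
curl -s -o /dev/null -w '%{http_code}\n' https://<your_instance_domain>/api/v1/instance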
OK here's a theory: #ChatGPT's #Atlas browser is not really a browser but in fact a way for OpenAI to circumvent scrape blockers. It's a distributed human-powered scraper more than anything else.
Given how widely loathed AI is and how damaging AI scrapers have become, #OpenAI's IP ranges have ended up on quite a lot of block lists, and many servers outright terminate any connection from them. Then there are things like #Anubis or #Iocaine that further frustrate #LLM scraping.
But what if you DIDN'T need to bother with all that? What if you could use civilian IP addresses with "organic" traffic patterns, and have humans solve Captchas, provide proof of work for Anubis, or get around Iocaine? All this for free -- you don't even need to pay people for it?
I would be REALLY interested to see what telemetry Atlas sends back. 100% certain it will send back things like the URL and the rendered HTML output, possibly user interaction patterns ("a normal human on this website moves their mouse to the 'I am not a bot' captcha first, then clicks it"). They do not have to respect robots.txt because, well, the traffic comes from organic visitors...
Am I crazy?
“AI” crawlers do not follow the rules:
https://www.heise.de/en/background/Obituary-Farewell-to-robots-txt-1994-2025-10766991.html?seite=all
In order to keep our academic search engine https://www.base-search.net online, we were forced to lock out all browsers that cannot execute Javascript. We currently use #Anubis for this: https://anubis.techaro.lol/
We are sorry for the delay this causes for our legitimate users. Here is what you can do: Accept and keep our cookies! We do not use them for tracking. They help us remember that you are probably a human being.
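If you are curious what this looks like in practice, something like the snippet below illustrates it (a rough sketch; the exact markers on the challenge page and the cookie details depend on the Anubis version in use):

# A browser-looking client that runs no JavaScript and keeps no cookies gets
# the Anubis interstitial instead of the search results:
curl -s -A 'Mozilla/5.0' https://www.base-search.net/ | grep -ci anubis
# A JavaScript-capable browser solves the proof-of-work challenge once, receives
# a cookie, and is passed straight through to the search engine for as long as
# that cookie is kept.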
I want to prevent LLM web-scraping bots from stealing the content from @cryogenix but it's a Web 1.x website that has a no-JavaScript policy (except for designated sections)...
So I can't use Anubis, I don't and won't use CloudFlare (because of JS injection and privacy concerns), and CrowdSec doesn't seem to protect against it.
What can we realistically do? I wouldn't want to make it a Tor/Onion-only website.