Post · bonfire.cafe

DHeadshot's Alt

@ddlyh@topspicy.social · last week

@rysiek
How will it affect Lynx and Dillo users too?

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · last week

@ddlyh don't they support some minimal CSS?

DHeadshot's Alt

@ddlyh@topspicy.social · last week

@rysiek
I think Dillo might? Not sure though?

Eli Roberson (he/him)

@thatdnaguy@genomic.social · 2 weeks ago

@rysiek I thought this was a clever strategy for dealing with known LLM scrapers

https://idiallo.com/blog/zipbomb-protection

Curtis Carter

@codingcoyote@floss.social · 2 weeks ago

@rysiek consider than an LLM may be able to quite easily see how it's hidden via CSS. Consider obfuscating that somehow to avoid it ignoring whatever trap you put down.

Also consider how someone using an LLM to workaround accessibility issues in your site may react. Remember that they have the right to choose their own tools and if you are going to remove one. You should consider making it clear to them via other accessibility tools that LLM based tools will not work.

young man yells at the cloud

@bamboombibbitybop@mastodon.social · 2 weeks ago

@rysiek you could also make sure the link says something like "this link is a trap for AI scrapers, ignore it if you're a human" when the page is read with a screen reader

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@bamboombibbitybop but then that might tip off the LLM

young man yells at the cloud

@bamboombibbitybop@mastodon.social · 2 weeks ago

@rysiek I highly doubt they're having an actual LLM do the content scraping, there's no reason to. It's probably just a link crawler.

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@bamboombibbitybop there are quite a few responses in the thread suggesting that at least some of the time LLMs are involved.

Leela Torres

@LeelaTorres@ieji.de · 2 weeks ago

@rysiek
Why not catch everything which ignores the robots.txt?
Hm but how to lay a breadcrumb link that will attract bots but not humans? 🤔

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@LeelaTorres yeah, that's part of the plan.

Daniel Blake

@Daniel_Blake@mastodon.top · 2 weeks ago

@rysiek You probably know this anyway, but https://olifant.social/@oli/statuses/01JZDP4Z3YADF5MMWRWY8TKEQB

Bredroll

@Bredroll@mas.to · 2 weeks ago

@rysiek what about if in the "hidden" block you included "people using screen readers must not click the following links"

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@Bredroll I would prefer not to spam them like that.

Also, LLMs could use that to avoid those links as well.

Bredroll

@Bredroll@mas.to · 2 weeks ago

@rysiek my instinct is doing this reliably isn't going to last

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@Bredroll it doesn't need to be 100% reliable, just good enough.

Part of my project is feeding Markov-generated slop to these LLM scrapers. And doing it very very slowly, to waste their resources (open socket, etc) on something that will make them actually worse.

Bredroll

@Bredroll@mas.to · 2 weeks ago

@rysiek good luck

Evelin Wilhelmine

@ew@indieweb.social · 2 weeks ago

@rysiek

Check out this curated list of tools and methods titled "Sabot in the Age of #AI" by @asrg: it might include some approaches that could help your case.
https://algorithmic-sabotage.github.io/asrg/posts/sabot-in-the-age-of-ai/

Jimmy Mac

@stlalpha@mastodon.social · 2 weeks ago

@rysiek Markov chain tar pit for the llm scraper trolling: https://github.com/circa10a/ai-troller

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@stlalpha yeah, I have my own:
https://git.rys.io/libre/markov-tarpit

😺

Jimmy Mac

@stlalpha@mastodon.social · 2 weeks ago

@rysiek aah right on, I should have read through more thoroughly - you’re trying to better induce?

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@stlalpha no worries, I did not mention it in the thread (I should).

Right now I have the code and it works, so the next step is deploying it, and for that I need good ways of trapping the bots.

Hence this thread.

charlag

@charlag@birb.site · 2 weeks ago

@rysiek some screen readers will not read items hidden with CSS, some will. generally it's not a good approach from accessibility and I would look into something else

Cassandrich

@dalias@hachyderm.io · 2 weeks ago

@rysiek Keep in mind browser prefetch is a thing too.

Alex Ștefănescu

@catileptic@chaos.social · 2 weeks ago

@rysiek i don't have anything useful to say. It's been really exciting, reading this thread! And I tip my hat to your tarpit tech, @rysiek

(and I reverse-tip my hat to the idea that a scraper feeding data to train LLMs is not different than a person. Ayn Rand would blush at this. Someone's been trapped in an entirely different tarpit there...)

Pixelcode 🇺🇦

@pixelcode@social.tchncs.de · 2 weeks ago

@rysiek Blocking IP addresses of LLM scrapers is not only largely no use but can even be detrimental, because they often use residential IPs which they get from greedy developers importing shady libraries in their apps. 🤷

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@pixelcode the approach is two-pronged:

- long-term IPs (like the ones mentioned in OpenAI's documentation) are hard-blocked semi-permanently;

- random IPs that are heuristically determined to be LLM scrapers (for example, they "clicked" a link no human had access to) are fail2banned for a certain time.

That way those residential IPs will eventually be able to access the site again.

Pixelcode 🇺🇦

@pixelcode@social.tchncs.de · 2 weeks ago

@rysiek “eventually unblocked” is still bad enough for the end users who have actually been assigned the abused IPs, in my opinion

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@pixelcode we are talking hours, not months here.

Pixelcode 🇺🇦

@pixelcode@social.tchncs.de · 2 weeks ago

@rysiek Okay, but is that effective at all at blocking those scrapers? 🤔

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@pixelcode we will never know until we try.

Iris Young (he/they/she) (PhD)

@iris@neuromatch.social · 2 weeks ago

@rysiek check out nepthenes or iocaine for this purpose, both summarised and linked here: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

The most recent and imo clever approach I've seen was Anubis though: https://anubis.techaro.lol/

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@iris I mention the first two projects here:
https://git.rys.io/libre/markov-tarpit

I am aware of Anubis but I want to avoid any JS on my website, and as far as I understand Anubis requires JS.

Iris Young (he/they/she) (PhD)

@iris@neuromatch.social · 2 weeks ago

@rysiek awesome, my bad then. I skimmed the first dozen or so replies to you that I could see before responding, but didn't go deeper than that!

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@iris no no, no "bad" here at all! I did not mention Markov Tarpit anywhere in the thread, so naturally you suggested similar projects. I appreciate that.

Krzysztof Hankiewicz (he/him)

@kjhank@social.vivaldi.net · 2 weeks ago

@rysiek Might want to take a look at this amazing approach by @heydon: https://heydonworks.com/article/poisoning-well/

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@kjhank @heydon yeah, the context of my question is my little poison-the-well project:
https://git.rys.io/libre/markov-tarpit

Great minds etc etc.

Dźwiedziu

@dzwiedziu@mastodon.social · 2 weeks ago

@rysiek

Luci the daemon form the Disenchanted animation, a black silhouette of a cartoon daemon, with one big eye. The caption on top is “Dooooooo it!”, on the bottom “Do it, do it, do it!”.

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@dzwiedziu the context of my question is Markov Tarpit:
https://git.rys.io/libre/markov-tarpit

Which is ready for deployment, I just need to figure out how to trap LLM scraper bots into it.

Alda Vigdís

@alda@topspicy.social · 2 weeks ago

@rysiek yeah. I've talked to people who make scrapers like that for a living and those are my takes:

The bots parse content fully rendered, with CSS applied and a couple of seconds after onLoad fires or after the last layout shift.

The bots also use natural language processing to parse the content, so hiding things using CSS generally doesn't affect them.

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@alda interesting, thank you.

As long as it does not negatively affect users using assistive technologies, I am still willing to test it out and see if any LLM scrapers get caught into that net.

Worst case scenario, I will have hard data on this myself.

Alda Vigdís

@alda@topspicy.social · 2 weeks ago

@rysiek as long as there's no labels and an empty alt text tag it should be ignored by assistive tech.

Alda Vigdís

@alda@topspicy.social · 2 weeks ago

@rysiek Perhaps causing a neverending layout shift could help you achieve this.

Laberpferd

@Laberpferd@sueden.social · 2 weeks ago

@alda

Im sorry, but a lot of ADS&ADHS people, autistic people, "power readers" etc. are already VERY pissed about pages which constantl shift and rearrange things, because it messes with their brain in a negative waqy

@rysiek

Bèr Kessels 🐝 🚐 🏄 🌱

@berkes@mastodon.nl · 2 weeks ago

@rysiek you'd need "aria hidden" or some such.

But if you zoom out of the problem: you're never going to pull this off, sorry.

A bot is really just a user. One with "accessibility challenges". A bot may be "visually impaired", or may lack the ability to operate a mouse etc.

So any hurdle, or misdirection aimed at bots, will inevitably hurt "organic users". And will punish those who rely on assistive technology, more than others.

And really: isn't AI, a form of assistive technology?

Bèr Kessels 🐝 🚐 🏄 🌱

@berkes@mastodon.nl · 2 weeks ago

@rysiek personally, I'd just treat such problems on this higher level:

Is there a group of users that I want to forbid reading my work? Do I want to show a group of users false or misleading content?

If yes, then I'll have to define this "group of users": geografic, based on mental and physical abilities, based on technical abilities, usage of soft- and hardware? The easiest discriminatory trait would be "has an account" or "has paid".

But, for me, that's never the web I fought for.

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@berkes

> So any hurdle, or misdirection aimed at bots, will inevitably hurt "organic users". And will punish those who rely on assistive technology, more than others.

This is obviously, demonstrably false.

Looking at my webserver logs I can tell you that I can set a rate-limit that would *never* be reached by any human user, but which would deal with a bunch of bots.

> And really: isn't AI, a form of assistive technology?

No.

Bèr Kessels 🐝 🚐 🏄 🌱

@berkes@mastodon.nl · 6 days ago

@rysiek > And really: isn't AI, a form of assistive technology?

> No.

To me it is. Personally. As to some elderly family members who never got the hang of computers and only little of phones can use chatgpt. A cousin with dyslexia who can interact with internet through voice. An analphabetic foster brother who can now finally be part of whatsapp groups through tts and stt.

Saying it isn't assistive either shows how entrenched you are, or how uniform your social environment is.

Bèr Kessels 🐝 🚐 🏄 🌱

@berkes@mastodon.nl · 6 days ago

@rysiek > I can tell you that I can set a rate-limit that would *never* be reached by any human user, but which would deal with a bunch of bots.

Demonstrably hurting humans. We've had this. Hurt users that shared an IP (university, shared student housing, vpn) or subnet even.
Hurt humans that were offline for a while and when finally had coverage, would barrage events to our API.
Hurt humans that legitimate used TOR.
Hurt humans that automated tasks through our API with amateurish scripts.

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 6 days ago

@berkes I do hope you are as adamant about how a thing is hurting humans when it comes to the very real harms LLM scrapers are causing:
https://pod.geraspora.de/posts/17342163
https://aoir.social/@aram/113811386580314915

Anyway, I had tried to gently suggest you go somewhere else have your opinions – you are more than welcome to have them in your own thread for example.

I am now directly asking you to stop bothering me with them. There is a reason why I wrote that last paragraph in the second toot of my thread.

Ricardo Tavares

@t_var_s@phpc.social · 2 weeks ago

@rysiek Maybe start with unreachable paths that are only described in your robots.txt

Aaron

@aaron@chirp.zadzmo.org · 2 weeks ago

@t_var_s I tried this starting mid 2024 and have never recorded a single hit by any crawler to the particular URL set aside to test the idea.

I suspect they don't think they look at robots.txt at all.
@rysiek

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@t_var_s yeah, that's another part of this.

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

I am obviously doing research about this myself, so no, I am not going to only throw this out into the void and expect free advice. :blobcattea:

But, if you do have advice or suggestions or notes or input here, I would love to hear it. :blobaww:

Especially if you use assistive technologies, or have experience/expertise in that area. 👀

If you are building your own LLM scraper or are otherwise an AI bro, and have Opinions about the Open Web, please feel free to go suck a lemon instead. 🍋

2/3

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

Context is: my Markov Tarpit is ready for deployment, and I need to figure out the heuristics around trapping LLM scrapers in it
https://git.rys.io/libre/markov-tarpit

3/3

Eva Infeld

@quaithe@mastodon.social · 2 weeks ago

@rysiek this is dope

Ariel (🐿 arc)

@arichtman@eigenmagic.net · 2 weeks ago

@rysiek you talked to Algernon about it ?

Chishiki611 (vulnerable arc..) [Serial M3JT-9Q8L]

@Chishiki611@enby.life · 2 weeks ago

@rysiek@mstdn.social For your information, some people also have CSS disabled on their browser or use HTML-only browsers that do not support CSS, so they would also be at risk of being caught by the trap.

Laberpferd

@Laberpferd@sueden.social · 2 weeks ago

@Chishiki611
I quite often do also switch off CSS 😉 to read static text and images, because so many sites are hard to read at all with all the dynamic gimmicks, contents invisible unless some action happening, everything "dynamic" rearranged

What i strongly suggest is a warning text that humans can easily understand, like "do not click the following link, it will blacklist you"

Would be AMAZING is someone could build an CSS-modificator to make pages static

@rysiek

Lnklnx :sdf:

@lnklnx@social.sdf.org · 2 weeks ago

@rysiek

For my site, I opted for a CSS of "visibility:hidden;", and a link that is leads to the Iocaine honeypot.

Thank you for asking the question, because on review with a text browser this shows up, so I expect screen readers would see it as well.

I'll update my honeypot entrance with a human-readable description to reduce confusion.

Michał "rysiek" Woźniak · 🇺🇦

@rysiek@mstdn.social · 2 weeks ago

@lnklnx do read through the responses though. At minimum you want the aria-hidden attribute and tabindex="-1".

Lnklnx :sdf:

@lnklnx@social.sdf.org · 2 weeks ago

@rysiek Thank you, implemented those as well.

Alex Schroeder

@alex@social.alexschroeder.ch · 2 weeks ago

@rysiek Faced with a similar question I opted to have a visible link that goes to a /nobots page, explaining that this is a honeypot for bots and then having an async process (fail2ban) monitor the log file for hits on the /nobots page that only blocks repeated offenders. So now I just need to determine what I would consider an inhuman rate.
(There is a lot more around my setup but that's the part about hiding links to the honeypot or not.)