Dear fedi, I am thinking of building-in some LLM scraper bot traps into my website.

One of the ideas is to put links near the bottom of each blogpost or page, hidden via CSS (so that no human would click them), which, when clicked, immediately put the client IP address on a naughty list.

I want to understand better how CSS-hidden links behave for visitors using screen readers or other assistive technologies.

The last thing I'd want is to inconvenience any human! :blob0w0:
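For concreteness, a minimal sketch of what I have in mind (the class name and trap path are made-up examples, not my actual setup):

```html
<!-- Sketch: a CSS-hidden honeypot link; class name and path are made up. -->
<a class="np-trap" href="/secret-trap-link" rel="nofollow">do not follow this link</a>

<style>
  /* My understanding is that display:none also removes the element from
     the accessibility tree -- which is exactly what I want to verify. */
  .np-trap { display: none; }
</style>
```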

1/3

@rysiek consider that an LLM scraper may quite easily see how a link is hidden via CSS. Consider obfuscating that somehow, so it can't simply ignore whatever trap you put down.

Also consider how someone using an LLM to work around accessibility issues on your site may react. Remember that they have the right to choose their own tools; if you are going to take one away, you should make it clear, via other accessible means, that LLM-based tools will not work.

@rysiek I don't have anything useful to say. It's been really exciting reading this thread! And I tip my hat to your tarpit tech, @rysiek

(and I reverse-tip my hat to the idea that a scraper feeding data to train LLMs is no different from a person. Ayn Rand would blush at this. Someone's been trapped in an entirely different tarpit there...)

@pixelcode the approach is two-pronged:

- known long-term scraper IPs (like the ones listed in OpenAI's documentation) are hard-blocked semi-permanently;

- random IPs that are heuristically determined to be LLM scrapers (for example, they "clicked" a link no human had access to) are fail2banned for a certain time.

That way those residential IPs will eventually be able to access the site again.
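The fail2ban prong could look roughly like this sketch (the filter name, trap path, log path, and ban time are assumptions about a typical nginx setup, not my exact config):

```ini
# /etc/fail2ban/filter.d/llm-trap.conf
# Matches any request for the (made-up) trap path in the access log.
[Definition]
failregex = ^<HOST> .* "GET /secret-trap-link

# /etc/fail2ban/jail.local
[llm-trap]
enabled  = true
filter   = llm-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

With `maxretry = 1`, a single hit on the trap is enough to ban, and `bantime` makes the ban expire, so residential IPs recover access.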

@rysiek yeah. I've talked to people who make scrapers like that for a living and those are my takes:

The bots parse content fully rendered, with CSS applied, a couple of seconds after onload fires or after the last layout shift.

The bots also use natural language processing to parse the content, so hiding things using CSS generally doesn't affect them.

@rysiek you'd need "aria-hidden" or some such.
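Something like this sketch, so assistive tech and keyboard users skip the link no matter how the visual hiding is obfuscated (the path is made up):

```html
<!-- Sketch: keep the trap out of the accessibility tree and tab order,
     independently of whatever CSS obfuscation hides it visually. -->
<a href="/secret-trap-link" aria-hidden="true" tabindex="-1" rel="nofollow">…</a>
```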

But if you zoom out of the problem: you're never going to pull this off, sorry.

A bot is really just a user. One with "accessibility challenges". A bot may be "visually impaired", or may lack the ability to operate a mouse etc.

So any hurdle, or misdirection aimed at bots, will inevitably hurt "organic users". And will punish those who rely on assistive technology more than others.

And really: isn't AI a form of assistive technology?

@rysiek personally, I'd just treat such problems on this higher level:

Is there a group of users that I want to forbid reading my work? Do I want to show a group of users false or misleading content?

If yes, then I'll have to define this "group of users": geographic, based on mental and physical abilities, based on technical abilities, on the software and hardware they use? The easiest discriminatory trait would be "has an account" or "has paid".

But, for me, that's never the web I fought for.

@berkes

> So any hurdle, or misdirection aimed at bots, will inevitably hurt "organic users". And will punish those who rely on assistive technology more than others.

This is obviously, demonstrably false.

Looking at my webserver logs I can tell you that I can set a rate-limit that would *never* be reached by any human user, but which would deal with a bunch of bots.
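For example, a per-IP limit like this nginx sketch (the zone name and numbers are illustrative assumptions, not my actual config) sits far above any human browsing speed:

```nginx
# Goes in the http {} block; sustained 10 req/s per IP is far beyond
# human reading speed, but throttles aggressive scrapers.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # Allow short bursts (page + assets) without delay.
        limit_req zone=perip burst=50 nodelay;
        limit_req_status 429;
    }
}
```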

> And really: isn't AI, a form of assistive technology?

No.

@rysiek

> And really: isn't AI a form of assistive technology?

> No.

To me it is. Personally. The same goes for some elderly family members who never got the hang of computers (and only barely of phones) but can use ChatGPT. For a cousin with dyslexia who can interact with the internet through voice. For an illiterate foster brother who can now finally be part of WhatsApp groups through TTS and STT.

Saying it isn't assistive shows either how entrenched you are, or how uniform your social environment is.

@rysiek

> I can tell you that I can set a rate-limit that would *never* be reached by any human user, but which would deal with a bunch of bots.

Demonstrably hurting humans. We've been there. Rate-limits like that:

- hurt users who shared an IP (university, shared student housing, VPN) or even a subnet;
- hurt humans who were offline for a while and, once they finally had coverage, barraged our API with queued events;
- hurt humans who legitimately used Tor;
- hurt humans who automated tasks through our API with amateurish scripts.

@berkes I do hope you are as adamant about how a thing is hurting humans when it comes to the very real harms LLM scrapers are causing:
https://pod.geraspora.de/posts/17342163
https://aoir.social/@aram/113811386580314915

Anyway, I had tried to gently suggest you go have your opinions somewhere else – you are more than welcome to have them in your own thread, for example.

I am now directly asking you to stop bothering me with them. There is a reason why I wrote that last paragraph in the second toot of my thread.

I am obviously doing research about this myself, so no, I am not going to only throw this out into the void and expect free advice. :blobcattea:

But, if you do have advice or suggestions or notes or input here, I would love to hear it. :blobaww:

Especially if you use assistive technologies, or have experience/expertise in that area. 👀

If you are building your own LLM scraper or are otherwise an AI bro, and have Opinions about the Open Web, please feel free to go suck a lemon instead. 🍋

2/3

@Chishiki611
I quite often switch off CSS 😉 to read static text and images, because so many sites are hard to read with all the dynamic gimmicks: content invisible unless some action happens, everything "dynamically" rearranged.

What I strongly suggest is a warning text that humans can easily understand, like "do not click the following link, it will blacklist you".

It would be AMAZING if someone could build a CSS modifier to make pages static.

@rysiek

@rysiek Faced with a similar question, I opted to have a visible link that goes to a /nobots page explaining that this is a honeypot for bots, and then an async process (fail2ban) monitoring the log file for hits on the /nobots page, blocking only repeat offenders. So now I just need to determine what I would consider an inhuman rate.
(There is a lot more around my setup but that's the part about hiding links to the honeypot or not.)
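The "repeat offenders" check could be sketched along these lines (a Python sketch; the log format, path, and threshold are assumptions — my actual setup differs in detail):

```python
# Sketch: flag IPs that repeatedly hit the /nobots honeypot page.
# Assumes common/combined access-log format; threshold is illustrative.
import re
from collections import Counter

# Capture the client IP of any request for the honeypot path.
TRAP = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "GET /nobots')

def repeat_offenders(log_lines, threshold=3):
    """Return the set of IPs that hit /nobots at least `threshold` times."""
    hits = Counter()
    for line in log_lines:
        m = TRAP.match(line)
        if m:
            hits[m.group(1)] += 1
    return {ip for ip, n in hits.items() if n >= threshold}
```

fail2ban does essentially this with `maxretry`, so in practice tuning the jail's `maxretry` and `findtime` gets the same effect without extra code.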