Discussion
ansuz / ऐरन
@ansuz@social.cryptography.dog  ·  3 days ago

sysadmins/webmasters of fedi:

I am looking for suggestions of which search engine crawlers I should consider permitting in my robots.txt file.

There can definitely be value in having a site indexed by a search engine, but I would like to deliberately exclude any that use the same data to train LLMs and other genAI. More specifically, I would only like to allow those which have an explicit stance against training on others' data in this fashion.

Currently I reject everything other than Marginalia (https://marginalia-search.com/). Are there any others I should consider?

#AI #search #robots #crawlers #scrapers
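For reference, the allow-one, deny-everyone-else pattern is simple to express in robots.txt. A minimal sketch; the user-agent token for Marginalia's crawler is an assumption here, so check the crawler's own documentation for the exact string:

```
# Allow one named crawler; an empty Disallow means "allow everything".
# (Token is an assumption; verify against the crawler's docs.)
User-agent: search.marginalia.nu
Disallow:

# Every other crawler is denied the whole site.
User-agent: *
Disallow: /
```

Note that robots.txt is purely advisory; crawlers that ignore it need to be handled elsewhere.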

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  2 days ago

I think I will probably have to dig through Wikipedia's comparison of search crawlers[1] for an answer.

I'm half-expecting negative results, though.

[1]: https://en.wikipedia.org/wiki/Comparison_of_search_engines#Search_crawlers

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  2 days ago

I looked at Mojeek and learned that they have a dedicated button for searching Substack, so I guess that's off the list.

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  2 days ago

I was just looking at my webserver logs while sipping coffee (as one does) and I noticed that one of my websites was receiving requests for a js file which I had prototyped but never actually deployed.

The script tag is present in the page, but it's commented out. I investigated, and it seems that scrapers see that tag and are trying to grab it even though it's completely non-functional. I guess they just want every bit of code they can find to help train an LLM.

This seems like a promising pattern for catching scrapers that pretend to be normal browsers.
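The trap described above can be as simple as this; the file path is a placeholder, and the referenced file should not actually exist:

```html
<!-- A compliant browser never requests this URL, because the tag below is
     itself inside a comment. Scrapers that regex-scan the raw HTML for src
     attributes will fetch it anyway, making the request a scraper signal. -->
<!-- <script src="/assets/prototype.js"></script> -->
```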

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  2 days ago

On just this one website I count 23 unique IPs with wildly different user agents. Only 4 identify themselves as bots.

This domain receives far less traffic than my others, as I use it mostly as a personal blog and share it only with friends. I assume the trap will be much more effective when applied to my more public sites.
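Pulling counts like these out of an access log is a few lines of scripting. A sketch assuming the Combined Log Format and a hypothetical honeypot path:

```python
import re
from collections import defaultdict

# Combined Log Format:
#   IP ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical honeypot path; substitute whatever your commented-out tag names.
HONEYPOT = "/assets/prototype.js"

def honeypot_hits(lines):
    """Map each IP that requested the honeypot to the user agents it sent."""
    hits = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("path") == HONEYPOT:
            hits[m.group("ip")].add(m.group("agent"))
    return hits

def summarize(hits):
    """Return (unique IPs, how many of them self-identify as bots)."""
    self_identified = sum(
        1 for agents in hits.values()
        if any("bot" in agent.lower() for agent in agents)
    )
    return len(hits), self_identified
```

Feeding it the log file (`honeypot_hits(open("access.log"))`) gives the per-IP user agents, and `summarize` gives the unique-IP and self-identified-bot counts.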

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  yesterday

Several of my websites now feature a commented-out script tag linking to a non-existent file. Any IP requesting this file will be banned at the firewall level for a significant duration.

I'll give this a few days and report back on how many bad bots it catches.
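The post doesn't say which tooling performs the ban; one common way to wire this up is fail2ban watching the access log and banning at the firewall. A sketch assuming nginx logs and a hypothetical honeypot path:

```ini
# /etc/fail2ban/filter.d/js-honeypot.conf
[Definition]
failregex = ^<HOST> .* "GET /assets/prototype\.js

# /etc/fail2ban/jail.d/js-honeypot.local
[js-honeypot]
enabled  = true
port     = http,https
filter   = js-honeypot
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 604800 ; one week, in seconds
```

With `maxretry = 1`, a single request for the honeypot file is enough to trigger the ban.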

bonfire.cafe: A space for Bonfire maintainers and contributors to communicate