Discussion
ansuz / ऐरन
@ansuz@social.cryptography.dog  ·  3 days ago

sysadmins/webmasters of fedi:

I am looking for suggestions of which search engine crawlers I should consider permitting in my robots.txt file.

There can definitely be value in having a site indexed by a search engine, but I would like to deliberately exclude any that use the same data to train LLMs and other genAI. More specifically, I would only like to allow those which have an explicit stance against training on others' data in this fashion.

Currently I reject everything other than Marginalia (https://marginalia-search.com/). Are there any others I should consider?

#AI #search #robots #crawlers #scrapers
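For reference, the allow-one, deny-everyone-else pattern is simple to express in robots.txt. A minimal sketch; the user-agent token for Marginalia's crawler is an assumption here, so check the crawler's own documentation for the exact string:

```
# Allow one named crawler; an empty Disallow means "allow everything".
# (Token is an assumption; verify against the crawler's docs.)
User-agent: search.marginalia.nu
Disallow:

# Every other crawler is denied the whole site.
User-agent: *
Disallow: /
```

Note that robots.txt is purely advisory; crawlers that ignore it need to be handled elsewhere.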

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  2 days ago

I think I will probably have to dig through Wikipedia's comparison of search crawlers[1] for an answer.

I'm half-expecting negative results, though.

[1]: https://en.wikipedia.org/wiki/Comparison_of_search_engines#Search_crawlers

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  2 days ago

I looked at Mojeek and learned that they have a dedicated button for searching Substack, so I guess that's off the list.

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  2 days ago

I was just looking at my webserver logs while sipping coffee (as one does) and I noticed that one of my websites was receiving requests for a js file which I had prototyped but never actually deployed.

The script tag is present in the page, but it's commented out. I investigated, and it seems that scrapers see that tag and are trying to grab it even though it's completely non-functional. I guess they just want every bit of code they can find to help train an LLM.

This seems like a promising pattern for catching scrapers that pretend to be normal browsers.
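The trap described above can be as simple as this; the file path is a placeholder, and the referenced file should not actually exist:

```html
<!-- A compliant browser never requests this URL, because the tag below is
     itself inside a comment. Scrapers that regex-scan the raw HTML for src
     attributes will fetch it anyway, making the request a scraper signal. -->
<!-- <script src="/assets/prototype.js"></script> -->
```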

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  2 days ago

On just this one website I count 23 unique IPs with wildly different user agents. Only 4 identify themselves as bots.

This domain receives far less traffic than my others, as I use it mostly as a personal blog and share it only with friends. I assume the trap will be much more effective when applied to my more public sites.
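Pulling counts like these out of an access log is a few lines of scripting. A sketch assuming the Combined Log Format and a hypothetical honeypot path:

```python
import re
from collections import defaultdict

# Combined Log Format:
#   IP ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical honeypot path; substitute whatever your commented-out tag names.
HONEYPOT = "/assets/prototype.js"

def honeypot_hits(lines):
    """Map each IP that requested the honeypot to the user agents it sent."""
    hits = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("path") == HONEYPOT:
            hits[m.group("ip")].add(m.group("agent"))
    return hits

def summarize(hits):
    """Return (unique IPs, how many of them self-identify as bots)."""
    self_identified = sum(
        1 for agents in hits.values()
        if any("bot" in agent.lower() for agent in agents)
    )
    return len(hits), self_identified
```

Feeding it the log file (`honeypot_hits(open("access.log"))`) gives the per-IP user agents, and `summarize` gives the unique-IP and self-identified-bot counts.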

ansuz / ऐरन
@ansuz@social.cryptography.dog replied  ·  yesterday

Several of my websites now feature a commented-out script tag linking to a non-existent file. Any IP requesting this file will be banned at the firewall level for a significant duration.

I'll give this a few days and report back on how many bad bots it catches.
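The post doesn't say which tooling performs the ban; one common way to wire this up is fail2ban watching the access log and banning at the firewall. A sketch assuming nginx logs and a hypothetical honeypot path:

```ini
# /etc/fail2ban/filter.d/js-honeypot.conf
[Definition]
failregex = ^<HOST> .* "GET /assets/prototype\.js

# /etc/fail2ban/jail.d/js-honeypot.local
[js-honeypot]
enabled  = true
port     = http,https
filter   = js-honeypot
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 604800 ; one week, in seconds
```

With `maxretry = 1`, a single request for the honeypot file is enough to trigger the ban.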

bonfire.cafe: A space for Bonfire maintainers and contributors to communicate