🤖 STOP AI bots from scraping your data! ✋
Instead of fighting with dozens of robots.txt files (hello WordPress & Gitea), we go on the offensive, centrally. 🛡️
We block GPTBot, ClaudeBot, and the rest right at the door, at the level of our dear NGINX Proxy Manager!
It's cleaner, more efficient, and it makes our CPU happy. 😉
👉 The full method, with the .conf file to create: https://wiki.blablalinux.be/fr/blocage-robots-ia-nginx-proxy-manager
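For reference, a minimal sketch of what such a centralized block can look like; the bot list, variable name, and include points are illustrative assumptions, not the exact file from the linked wiki:

    # http-level snippet (e.g. an NPM custom include; the path depends on your setup)
    map $http_user_agent $is_ai_bot {
        default      0;
        ~*gptbot     1;
        ~*claudebot  1;
        ~*ccbot      1;
        ~*bytespider 1;
    }

    # server/location-level snippet, included for each proxied host
    if ($is_ai_bot) {
        return 403;
    }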
Instagram collects scary amounts of your data 🫣 Mastodon on the other hand - None! ✅
What do we learn? There are great services available that do not abuse your data.
Give your friend the nudge they need - they deserve better than Meta's constant surveillance 👉🏼 https://tuta.com/blog/how-to-delete-an-instagram-account
The developer of Bear Blog on the latest wave of scrapers he faced.
They've depleted all human-created writing on the internet, and are becoming increasingly ravenous for new wells of content.
[...]
I'm still speculating here, but I think [mobile] app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.
sysadmins/webmasters of fedi:
I am looking for suggestions of which search engine crawlers I should consider permitting in my robots.txt file.
There can definitely be value in having a site indexed by a search engine, but I would like to deliberately exclude all of those which use the crawled data to train LLMs and other genAI. More specifically, I would only like to allow those which have an explicit stance against training on others' data in this fashion.
Currently I reject everything other than Marginalia (https://marginalia-search.com/). Are there any others I should consider?
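Not the poster's setup, but a rough sketch of that allow-list pattern in robots.txt, assuming Marginalia's crawler identifies itself with a token like search.marginalia.nu (worth verifying against their current documentation):

    # block everything by default
    User-agent: *
    Disallow: /

    # allow only the crawlers you trust (the token here is an assumption, check their docs)
    User-agent: search.marginalia.nu
    Disallow: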
Yay, I implemented a way to keep #AI #Crawlers from hammering our puny community server w/o relying on JavaScript/Anubis yet:
* hashlimit HTTPS to only a few new connections/min per source (browsers reuse one or a few persistent connections anyway); rough iptables sketch below
* anything above that hashlimit gets hard dropped
* configure the webserver to close the connection when the reply code is >= 400
* make a hidden page with thousands of nonsense links that all end in 404
* link to that hidden page with hidden links from almost every other page
* put that hidden page (and some non-existent links) into robots.txt as 'Disallow'
🍿🎉
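A minimal sketch of the hashlimit step, assuming iptables; the 6/min and burst values are illustrative, not the poster's actual limits:

    # accept only a few NEW HTTPS connections per source IP per minute
    iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate NEW \
      -m hashlimit --hashlimit-name https --hashlimit-mode srcip \
      --hashlimit-upto 6/min --hashlimit-burst 10 -j ACCEPT
    # anything above that limit is hard dropped
    iptables -A INPUT -p tcp --dport 443 -m conntrack --ctstate NEW -j DROP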
I've had it with the aggressive #AI #crawlers now. Some bot has been hitting #MacPorts with a legitimate enough user agent that I can't block it without also blocking users.
Yesterday, it sent 377k requests (62% of the total), 369k of them to URLs forbidden by robots.txt, from 274k unique IPs. Most of it was for content that could be analyzed more quickly with svn checkout or git clone.
Dynamic content on the #web is broken. There's just no way to do that anymore. What a waste of energy.
If your crawler is spoofing its identity because I have disallowed it from my sites, I consider that an attempted server breach. You are no better than someone trying to compromise my SSH connection.
🖥️ My ultra-budget server powering http://websysctl.alfonsosiciliano.net has been running smoothly for the past 2 months. So far, so good!
📈 #Crawlers hit tens of thousands of sysctl parameter pages daily. That's fine, since robots.txt allows it. But why keep requesting non-existent pages as if the site were built with WordPress? 😤 Fortunately, the stack (#FreeBSD + #OpenResty 🌐 + #Lapis ✏️ + a custom-built #database 📦) stays well within the limited resources of my $5/month cloud server.
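The post doesn't say how those junk requests are handled; one cheap option under OpenResty/nginx is to close the connection on the usual WordPress probe paths (the paths below are illustrative):

    # drop typical WordPress probes without sending a response
    location ~* ^/(wp-login\.php|xmlrpc\.php|wp-admin|wp-content) {
        return 444;  # nginx-specific: close the connection silently
    }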
The code might soon be #OpenSource. Stay tuned!
#UNIX #sysctl #WebDev #WebServer #ThePowerToServe #coding #Lua #kernel