Discussion
Robert W. Gehl
@rwg@aoir.social · 2 weeks ago

Every day, I check my website logs, and it's very clear that human visitors to my various websites and blogs are rare. The sites are clearly being scraped, constantly. This is despite the fact that they're all static sites and I write posts maybe twice a month at most.

Must I put it all behind Cloudflare? Or must I introduce "proof of work" style friction? Or do I just give up and let my work be part of the planet-burning machine?

Suggestions welcome.

#askFedi #webAdministration

Andreas Wagner
@anwagnerdreas@hcommons.social replied · 2 weeks ago

@rwg At first I had blocks in the reverse proxy for hard-coded user-agent strings and IP ranges. That soon stopped helping. I have now deployed an Anubis proof-of-work container between the reverse proxy and the backend service, and it was really easy (I am not using any special config). I now get a tenth of the scraping requests compared with before: roughly two per minute make it through to the backend, where it used to be about 20 (a very rough estimate). If scraping traffic increases again, I'll try and see how easy it is to set up iocaine.
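
A minimal sketch of that layout, assuming Docker and Anubis's documented BIND and TARGET settings (the ports, image tag, and backend address here are placeholders; verify against the Anubis docs):

    # Run Anubis between the reverse proxy and the backend service.
    # BIND is where Anubis listens; TARGET is the service it protects.
    docker run -d --name anubis \
      -e BIND=:8923 \
      -e TARGET=http://localhost:3000 \
      --network host \
      ghcr.io/techarohq/anubis:latest
    # Then point the reverse proxy at Anubis instead of the backend,
    # e.g. in nginx: proxy_pass http://localhost:8923;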

But all of this is for a backend that is not very robust or tolerant of many parallel requests. It sounds like your setup, by contrast, can take a fair bit of scraping traffic. In that case, I would either do nothing at all or consider iocaine, if I felt I just wanted to push back a bit as a matter of principle.

Andreas Wagner
@anwagnerdreas@hcommons.social replied · 2 weeks ago

@rwg If you feel like that might be your crowd and platform, there is a nice and busy #bots room on the #Code4lib Slack.

errorquark
@errorquark@ioc.exchange replied · 2 weeks ago

@rwg I’d like to hear good solutions too. Scraping has increased significantly this year, in my experience. Right now I’m using some of the IP lists from FireHOL (http://iplists.firehol.org), which I feed to ipset; specifically firehol_level2 and firehol_webserver. I still have to analyze a botnet myself occasionally.
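
A rough sketch of that pipeline, assuming the netset files from FireHOL's blocklist-ipsets repository (the list name is from the post; the download URL is my assumption, so verify it against the site):

    # Create the set, load the FireHOL level-2 list into it,
    # and drop any traffic from matching sources.
    ipset create firehol_level2 hash:net -exist
    curl -s https://raw.githubusercontent.com/firehol/blocklist-ipsets/master/firehol_level2.netset \
      | grep -Ev '^(#|$)' \
      | while read -r net; do ipset add firehol_level2 "$net" -exist; done
    iptables -I INPUT -m set --match-set firehol_level2 src -j DROP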

4censord :nfp:
@4censord@unfug.social replied · 2 weeks ago

@rwg If you are OK with deploying something extra: iocaine https://iocaine.madhouse-project.org/

Or you can modify your webserver config to, in order:

  • block a bunch of known scraper user agents (https://github.com/ai-robots-txt/ai.robots.txt)
  • then block things pretending to be browsers

That should already cut down on the scrapers significantly (see https://come-from.mad-scientist.club/@algernon/statuses/01K70E97HC46HSBZDA6X5KS6H4); a sketch follows below.
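
A hedged sketch of the user-agent part, assuming nginx (the agent names are a small sample from the ai.robots.txt list; blocking things pretending to be browsers needs rules tuned to your own logs):

    # In the http block: map known scraper user agents to a flag.
    map $http_user_agent $is_ai_scraper {
        default       0;
        ~*GPTBot      1;
        ~*ClaudeBot   1;
        ~*CCBot       1;
        ~*Bytespider  1;
    }

    server {
        listen 80;
        server_name example.com;  # placeholder
        root /var/www/site;       # placeholder

        # Refuse flagged agents before serving anything.
        if ($is_ai_scraper) {
            return 403;
        }
    }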

Alex
@depereo@mastodon.social replied · 2 weeks ago

@rwg iocaine?

https://floss.social/@hywan/115705538559530358

cuan_knaggs
@mensrea@freeradical.zone replied · 2 weeks ago

@rwg @uastronomer was talking the other day about some simple steps that seemed to be working quite well.

Petra van Cronenburg
@NatureMC@mastodon.online replied · 2 weeks ago

@rwg At the moment, I let the crap run, because every so-called protection is too complicated and soon gets circumvented by the scrapers anyway. Meanwhile they even scrape podcasts for voices.

Sabotaging them with nonsense texts doesn't help either; I'd have to do it on a grand scale, with AI.
I prefer to focus on real people online: when someone subscribes to my newsletter, I can see whether it's a bot.

  • Copy link
  • Flag this comment
  • Block
Robert W. Gehl
@rwg@aoir.social replied · 2 weeks ago

@NatureMC I think I'm with you. Why add layers of complexity that will only contribute to an arms race (and more energy wastage)?

