LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI (Including Many Fediverse Instances!!!)

"The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal."

ARTICLE: https://www.dropsitenews.com/p/meta-facebook-tech-copyright-privacy-whistleblower

FULL PDF: https://www.dropsitenews.com/api/v1/file/b3555944-e204-4f5e-9a64-e44281b19a82.pdf

#FediPact #meta #threads #AI

spla

@spla@mastodont.cat replied · 3 weeks ago

@FediPact I applied this nginx config to fight it and many other AI bots and scrapers:

https://github.com/kurren/ai-bots-crawlers

Returning 444 to them seems like a good way to confuse them and decrease server load.
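
For anyone unfamiliar with the approach, here is a minimal sketch of what such a rule can look like. It is not the config from the linked repository: the user-agent patterns and the server name are illustrative assumptions, and a real blocklist would be much longer. 444 is an nginx-specific code that closes the connection without sending any response.

```nginx
# Goes in the http {} context.
# Illustrative patterns only (assumed, not copied from the linked repo).
map $http_user_agent $block_ai_bot {
    default                 0;
    ~*meta-externalagent    1;
    ~*GPTBot                1;
    ~*CCBot                 1;
}

server {
    listen 80;
    server_name example.org;  # placeholder hostname

    # Close the connection without a response for matched user agents.
    if ($block_ai_bot) {
        return 444;
    }
}
```

Note that this only matches on the User-Agent header, so it does nothing against a scraper that lies about its user agent.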

ophiocephalic 🐍

@ophiocephalic@kolektiva.social replied · 3 weeks ago

@FediPact
> Rather than scraping from sites directly, many of the addresses on Meta’s leaked list belong to Content Delivery Networks (CDNs) that are used by websites to cache and store information to improve site performance.

This is a critical point. An instance or website can defend itself in numerous different ways, including actively adversarial strategies, and still succumb to extraction if they're using Cloudflare.

cc: @subMedia

My camera shoots fascists

@Mikal@sfba.social replied · 3 weeks ago

@ophiocephalic @FediPact @subMedia

OK, wait, Cloudflare has a setting you can activate where they will block known scrapers. I have it turned on for my personal sites. How do those two things square?
