Post · bonfire.cafe

Re. Meta scraping copyrighted content on millions of websites, including fedi instances, to train its AI:

https://www.dropsitenews.com/p/meta-facebook-tech-copyright-privacy-whistleblower

While mastodon.art blocks a bunch of crawlers and domains (including Meta) both at an IP level and in our robots.txt, and our instance domain isn't in the above list, unfortunately cdn.masto.host IS in the list. As we're hosted with masto.host, this is where all media uploaded to our instance is stored, and thus it has been part of Meta's scraping :(
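
For anyone wanting to check what a robots.txt block like the one described above actually covers, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the crawler tokens (`meta-externalagent`, `FacebookBot`) and the example URLs are assumptions for illustration, not mastodon.art's real configuration. Note that robots.txt applies per host and is only advisory, so rules on an instance domain say nothing about a separate media host such as a CDN.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only. The user-agent tokens are
# assumptions about which crawlers an admin might want to block.
ROBOTS_TXT = """\
User-agent: meta-externalagent
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt asks user_agent not to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.invalid stands in for an instance domain.
    for agent in ("meta-externalagent", "FacebookBot", "SomeOtherBot"):
        verdict = is_blocked(ROBOTS_TXT, agent, "https://example.invalid/media/a.png")
        print(f"{agent}: {'blocked' if verdict else 'allowed'}")
```

This only tells you what the file requests; a crawler that ignores robots.txt, or that fetches media from a different host, is unaffected.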

Per Helge Berrefjord

@berrefjord@mastodon.art replied · 4 days ago

@Curator Does anyone know what the purpose of this enormous task by Meta is? «Train Its AI» has no meaning per se, at least to me. It seems it's mainly photos being scraped; and then what?

Mastodon•ART 🎨 Curator

@Curator@mastodon.art replied · 3 days ago

@berrefjord So that its AI will produce 'better' results when used; as the linked article says, 'AI models require a tremendous amount of data for their training data to work effectively.' Meta's AI produces text and images; 'Use Meta AI assistant to get things done, create AI-generated images for free, and get answers to any of your questions.'

WhoDisturbsMySlumber

@WhoDisturbsMySlumber@mastodon.social replied · 5 days ago

@Curator so is mastodon.social being scraped as well? I'll use WebGlaze, I just wasn't sure

Wulfy

@n_dimension@infosec.exchange replied · 5 days ago

@Curator

INFORMATION WANTS TO BE FREE!

NOT LIKE THAT !!!!

#hackers vs #AI

Untilted & More :mastoart:

@untilted@mastodon.art replied · 5 days ago

@Curator Hello! Ever since I heard about AI scrapers I have been using https://artshield.io but then I've also heard people recommend Glaze and Nightshade (I didn't try either due to them being resource-intensive on my old laptop), so... to what extent would those images watermarked with ArtShield be impacted?

Mastodon•ART 🎨 Curator

@Curator@mastodon.art replied · 5 days ago

@untilted their site isn't loading for me, but if they work by 'confusing' the AI then it'll do the same here; if you used that on your images before uploading them, then those changes to your image will still apply when being scraped by AI (so it'll do whatever the site says it does)

Untilted & More :mastoart:

@untilted@mastodon.art replied · 5 days ago

@Curator Sorry I meant to type IO not CO 😳 I typed in the wrong TLD because I am still groggy

Lagz | Comms open!

@lagz@mastodon.social replied · 5 days ago

@Curator what do I expect, when they, what? Pirated 8 TB of books without the authors' consent. They can, the laws are 20 years behind, so they will do it.

I hear their dataset is even crumbling, training on its own data slop, because they ALWAYS need NEW human data to train on, like a parasite.

All for the sake of complying with shareholders, rich ppl etc.

ona [she/they]

@ona@systerserver.town replied · 5 days ago

@Curator maybe you could protect your instance with https://github.com/TecharoHQ/anubis

Cmdr Jenny

@jenny753@indiepocalypse.social replied · 5 days ago

@Curator Oh gosh. We're listed here too. I have no idea how to fight something like that. I don't control the CDN.

[Image: indiepocalypse.files.fedi.monster highlighted among the list]

The Gibson 🅅

@thegibson@masto.hackers.town replied · 5 days ago

@Curator That is the real story... and now after just making the move, I have to consider going to self-hosting again.

Mastodon•ART 🎨 Curator

@Curator@mastodon.art replied · 5 days ago

@thegibson I can't even fathom doing that with how big our instance, database and storage are, since we're a media instance. Yet more argument in favour of having loads of tiny instances instead of mega-instances (which needs the software and hosting to be more accessible first)

The Gibson 🅅

@thegibson@masto.hackers.town replied · 5 days ago

@Curator Total agreement.

loude 💫

@loude@mastodon.art replied · 5 days ago

@Curator It's really frustrating how powerless we all are against these businesses. Aside from just not uploading anything anymore. ☹️

Mastodon•ART 🎨 Curator

@Curator@mastodon.art replied · 5 days ago

@loude yup :(

Russ Sharek

@RussSharek@mastodon.art replied · 5 days ago

@Curator

I appreciate the transparency.

Also *redacted* those *censored* Meta bots.

Mastodon•ART 🎨 Curator

@Curator@mastodon.art replied · 5 days ago

There's not really much we can do about this. Masto.host doesn't block anything (it leaves blocks to the discretion of customers), but even public content from our instance that federates to other instances would have been hit if those instances got crawled.

I think I've seen people mention building a lawsuit against them but can't find any info on this right now; I'll update if I do. Meta did just win a lawsuit about doing the same thing with training its AI on millions of books though :/

Davey

@davey_cakes@mastodon.ie replied · 2 days ago

@Curator I wonder, if we got a few folks together from the masto.host customer base, could we make a case for anti-scraping moves?

Moonglow Constellation

@moonglow@mastodon.art replied · 5 days ago

@Curator I wish masto.host did anything at all or even informed their users that they won't so people know what they're getting into

- 💙

tibi

@tibi@mastodon.art replied · 5 days ago

@Curator great, good to know I have to click on captchas all day like a moron and meanwhile these assholes can just shovel up everything into their mouths. bots not allowed unless they're part of a corporate botnet

Mastodon•ART 🎨 Curator

@Curator@mastodon.art replied · 5 days ago

files.fedi.monster is on the list too, so if you host with fedi.monster, even if your instance domain isn't on the leaked list, I *think* all of your instance's public media might have been included in the scrape :(

betalars :antifa:

@betalars@chaos.social replied · 5 days ago

@Curator What about followers-only images?

Mastodon•ART 🎨 Curator

@Curator@mastodon.art replied · 5 days ago

@betalars I don't *think* they'll be scraped, because you need to be logged in to see them, only public ones

eishiya

@eishiya@mastodon.art replied · 5 days ago

@Curator followers-only and private-mention media can be accessed directly without auth by anyone who has or guesses the media URL, unfortunately :/ Auth is required to request the posts from the server, but it'll happily serve up the media. Seems like a design flaw in the Mastodon server software.
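
If you want to check this for your own uploads, a rough sketch along these lines would do it: send a request for the media URL with no login cookie or token and see whether the server answers with a success status. The function name, the `opener` parameter and the placeholder URL are all hypothetical; this is not part of Mastodon or any hosting provider's tooling, just plain `urllib` from the Python standard library.

```python
import urllib.error
import urllib.request


def served_without_auth(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if an unauthenticated HEAD request for url gets a 2xx status.

    No cookies or Authorization header are sent, so a True result means the
    file is readable by anyone who knows (or guesses) the URL. Network-level
    failures (urllib.error.URLError) are deliberately left to propagate.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError:
        # 401, 403, 404 and similar: the server refused to serve the file.
        return False


if __name__ == "__main__":
    # Placeholder: replace with the media URL of one of your own
    # followers-only posts before running.
    media_url = "https://example.invalid/media_attachments/files/placeholder.png"
    print(served_without_auth(media_url))
```

Only run it against media you uploaded yourself; a True result for a followers-only attachment would match the behaviour described above.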