Post · bonfire.cafe

Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!

Why are AI people so monumentally *bad* at copyright?

I'm looking for ethical/copyright-safe training data sets. Common Corpus sells itself as that... but then I go read the paper and they include CC BY-SA scientific papers and GPL stuff from GitHub, and then in models trained on that dataset they proudly state:

> Only trained on open data under a permissible [sic?] license [...] By design, all Pleias model are unable to output copyrighted content.

Um, no?? CC BY-SA is not public domain, it's a copyright license. You can't train on CC BY-SA content and then claim your model is any more copyright-safe than whatever Google and Meta are releasing. It just means you're only violating the copyright of people who release content under open licenses.
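
To make the distinction concrete, here is a minimal sketch of a dataset filter that treats "openly licensed" and "free of copyright conditions" as two separate questions. The table and function names are hypothetical, the keys are SPDX license identifiers, and the condition summaries are deliberately simplified assumptions rather than a full reading of each license.

```python
# Simplified, illustrative summary of what each license asks of reusers.
# Keys are SPDX identifiers; the table itself is an assumption for this sketch.
LICENSE_CONDITIONS = {
    "CC0-1.0": frozenset(),  # public-domain dedication, no conditions
    "CC-BY-4.0": frozenset({"attribution"}),
    "CC-BY-SA-4.0": frozenset({"attribution", "share-alike"}),
    "GPL-3.0-only": frozenset({"notices", "copyleft"}),
}


def has_no_conditions(license_id: str) -> bool:
    """True only for licenses known to impose no conditions; unknown means False."""
    conditions = LICENSE_CONDITIONS.get(license_id)
    return conditions is not None and len(conditions) == 0


def split_by_conditions(records):
    """Partition (name, license_id) pairs into condition-free and encumbered names."""
    free, encumbered = [], []
    for name, license_id in records:
        (free if has_no_conditions(license_id) else encumbered).append(name)
    return free, encumbered


# Hypothetical dataset entries, loosely modelled on the sources named above.
records = [
    ("photo-archive", "CC0-1.0"),
    ("journal-paper", "CC-BY-SA-4.0"),
    ("github-repo", "GPL-3.0-only"),
    ("mystery-scrape", "unknown"),
]
free, encumbered = split_by_conditions(records)
print("no conditions:", free)
print("still under copyright conditions:", encumbered)
```

In this sketch the CC BY-SA paper and the GPL repository land in the second list even though both are openly licensed, which is the point being made: an open license is still a copyright license.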

Jimmy

@sjcooke66@mastodon.social · 19 minutes ago

@lina Because AI 'learns' by scanning the internet - which includes links to copyright works - and is imbued with no way to make reparation: because AI can only learn by scanning the internet and has no moral guide leading it to 'fair recompense' etc...

Мя :sparkles_lesbian:

@mo@mastodon.ml · 7 hours ago

@lina it's considered safe if copyright holders don't have money to sue you

Gustavo

@qgustavor@urusai.social · 8 hours ago

@lina It is great to see people talking about this. Last time I posted something related people were eager to dismiss the issue.

Tim Ward ⭐🇪🇺🔶 #FBPE

@TimWardCam@c.im · 9 hours ago

@lina Let me guess ...

They were too tight to pay a copyright lawyer for a professional opinion so ...

... they asked ChatGPT whether it would be OK.

Marta Threadbare

@cygnathreadbare@retro.pizza · 9 hours ago

@lina yeah same with image data sets. The most supposedly ethical ones just scraped anything Creative Commons from Flickr, not caring if it's CC0 or not. And even then models trained with that aren't public, just closed betas (so the results must be pretty shit anyways).

Speed demon 🇪🇺 🇳🇴🇺🇦

@hakona@im.alstadheim.no · 10 hours ago

@lina
We only eat nice/under-funded people. This makes us super-nice.
- signed, some cannibal.

ShutterBugged

@developing_agent@mastodon.social · 12 hours ago

@lina Before the LLM Madness took hold in 2022, at least in the imaging sphere it was always a little questionable where datasets came from and nobody looked too hard because it was only ever used for training/benchmarking your thing and publishing a paper on it. (though there was at least one dataset which merely _linked_ to images elsewhere to avoid redistributing them >_>)

Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!

@lina@vt.social · 13 hours ago

So far the only one I've seen that credibly can claim to be free of copyright concerns is KL3M (it was created... by lawyers).

Are there any others?

Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!

@lina@vt.social · 12 hours ago

Replies: "This model supposedly is safe"

*Looks inside*: Huge pile of web scraped data without consent and ignoring licenses.

Seriously, this is a tragedy. LLMs are useful. You might not be able to train ChatGPT without stealing, but you can totally train a useful base model for task specific fine tuning without stealing. Why is almost nobody even trying????

Dev Albino :elixir: :python:

@v_raton@bolha.us · 7 hours ago

@lina it's bad but this is real, copyright is not a unique viable way to protect your work.

After all, any license is only as powerful as it is capable of being enforced by legal means.

Now the only difference is that we have a powerful way to do it at large scale, and to make it difficult, even impossible, to prove whether that is ethical.

Sadly, the copyright era is over.

slotos

@slotos@toot.community · 9 hours ago

@lina There’s no money in that.

AI is a wealth extraction tool. Any other utility is inconsequential to that goal.

Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!

@lina@vt.social · 9 hours ago

@slotos No it's not. AI/ML is a vast field that we all rely on every day, and within that LLMs can be used to do useful things in an ethical and sustainable way.

OpenAI and friends would like you to believe their product is the only way to use AI, because they are interested in extracting wealth from everyone. Just because they're saturating the world with their terrible and unethical version of AI doesn't mean that's the only version possible or that exists.

Jimmy Jim

@starchturrets@mastodon.social · 11 hours ago

@lina even with models where the datasets are open they train on synthetic data from proprietary models which I'd argue launders all the problems with them...

Tim Ward ⭐🇪🇺🔶 #FBPE

@TimWardCam@c.im · 11 hours ago

@lina They are. I worked somewhere that had a model trained on the (large, complex) system's documentation. Often it knew the answer (with a link to the source document), sometimes it didn't know the answer in which case it would say so (with links to possibly relevant source documents).

Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!

@lina@vt.social · 10 hours ago

@TimWardCam I mean public ones.

Synnef

@synnef@woof.tech · 11 hours ago

@lina the psycho billionaires can't sell the "ai will change the world and replace workers and bring infinite wisdom" idea to investors like that

it's sad, the unethical ai industry needs to collapse so they get a reality check

Thomas Depierre

@Di4na@hachyderm.io · 12 hours ago

@lina because no one gives you money for it and it is not actually that useful. The reason LLMs like ChatGPT "work" is not because they are useful. It is because they feel like Her to humans. This is partially due to their training but also how they are presented and what they do: generic chatting.

If you make them specific, then you lose the AGI look and feel. And that destroys the main value... which is the look and feel of massive growth.

We will probably do it, but in a few years, once the mania calms down.

Harry Wood

@harry_wood@en.osm.town · 9 hours ago

@Di4na @lina Yeah I think that's right. It's the generic chatting magic trick that has everyone so beguiled. So...
"you can totally train a useful base model for task specific fine tuning without stealing"
I'm not so sure. There's quite a lot of content out there labelled public domain/copyright free, but maybe not *enough* to do the beguiling magic trick.
But I agree it would be nice if more people were trying/discussing this, if only to underline the copyright theft of the main players.

Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!

@lina@vt.social · 11 hours ago

@Di4na Not everything is about massive growth. I know that's what companies run by psychopathic billionaires seek, but that's not the entirety of the AI/ML field. People and entities with different goals exist. They just... don't seem to care about this? Like nobody's even trying...

vineyardsiren

@vineyardsiren@mastodon.social · 12 hours ago

@lina i don't think the developers care for credibility and at least right now it does not sell well to have a "worse" model but good morals.. Which saddens me

Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!

@lina@vt.social · 11 hours ago

@vineyardsiren It's not a worse model. You can do better than ChatGPT at specific tasks ethically and with a much smaller model.

pranabekka

@pranabekka@mastodon.social · 9 hours ago

@lina wasn't there a case where authors were suing some company and the court ruled that using the work was fine as long as it was obtained through legal channels? doesn't that set precedent for taking works that are otherwise available to people for free? such as cc-by-nc-nd and gpl

https://apnews.com/article/anthropic-ai-fair-use-copyright-pirated-libraries-1e5cece51c2e4bd0bb21d94de2abb035

AP News

Anthropic wins ruling on AI training in copyright lawsuit but must face trial on pirated books

In a test case for the artificial intelligence industry, a federal judge has ruled that AI company Anthropic didn’t break the law by training its chatbot Claude on millions of copyrighted books.

Kensan

@Kensan@mastodon.social · 12 hours ago

@lina Supposedly Apertus by ETH: https://huggingface.co/swiss-ai/Apertus-70B-2509#training

Hoshino Lina (星乃リナ) 🩵 3D Yuri Wedding 2026!!!

@lina@vt.social · 12 hours ago

@Kensan Haha no that one is web scraped and only supposedly respects "opt out". Copyright doesn't let you take whatever you want unless the person you took it from asks you to stop, that's not how it works.

E.g. my website doesn't have AI UAs blocked in robots.txt therefore they consider themselves free to use my data. I did not consent to that. Lack of opt out is not consent.

This is even more problematic with UGC where users can't even opt out if they don't control the robots.txt file. Honestly that entire approach is a farce, I have no idea how anyone takes the ethics of this model seriously.
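
The gap between "no opt-out" and "consent" shows up in how robots.txt is evaluated. Below is a minimal sketch using Python's standard `urllib.robotparser`; the crawler name `ExampleAIBot` and the example.org URL are made up for illustration, and real crawlers may interpret rules differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this robots.txt text does not forbid user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A site that never mentions AI crawlers at all.
SILENT = """\
User-agent: *
Disallow: /private/
"""

# The same site after the owner explicitly opts out a (hypothetical) AI crawler.
OPTED_OUT = SILENT + """
User-agent: ExampleAIBot
Disallow: /
"""

URL = "https://example.org/blog/post"

# Silence is treated as permission: nothing here says "yes", yet the answer is True.
print(is_allowed(SILENT, "ExampleAIBot", URL))

# Only an explicit rule flips the default.
print(is_allowed(OPTED_OUT, "ExampleAIBot", URL))

# An empty robots.txt, e.g. on a platform where users cannot edit it, also allows everything.
print(is_allowed("", "ExampleAIBot", URL))
```

The protocol defaults to "allowed" whenever no rule matches, so an opt-out-only policy reads the absence of a rule as a yes that nobody actually gave.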

degenerating degenerate

@hopeless@mas.to · 7 hours ago

@lina @Kensan

"Take" is doing a lot of work here... with LLMs as coding assists, they don't wholesale regurgitate their training content in my experience (with Gemini / Antigravity).

If the content is publicly available on the web, it's fair to imagine the copyright holder is okay with people reading / discussing / remembering / learning from it... readers made a temporary local copy so they could internalize it.

Why's it so out of the question to frame LLM use of it like that?

Janne Moren

@jannem@fosstodon.org · 13 hours ago

@lina
A couple of recent models have been trained on out of copyright works. As you can imagine, their world "knowledge" is a bit out of date.