Typical ML argument: "If I can read something legally, why can't I train an LLM on it?"
Humans can legally read something and later write something similar that is still a copyright violation. If I write a book that follows the plot of Star Wars, that's a copyright violation even if no text is literally the same. If I record the melody of a song on my piano and release the recording without the appropriate mechanical license, that's also a copyright violation.
The reason this doesn't happen often is that, as humans, we know those are copyright violations and that there are rules. Sometimes it happens by accident, and people still get sued and lose; George Harrison famously lost a suit for subconsciously copying "He's So Fine" in "My Sweet Lord".
LLMs have no such awareness, and when appropriately prompted they routinely output blatant copyright violations, sometimes reproducing training text verbatim. If a model can reproduce a work, its weights must encode that work, and the weights are therefore themselves a derivative work.
Your brain encodes a massive amount of copyrighted information. You are not a walking copyright violation because humans aren't data: we can't be copied and distributed en masse, we have human rights, and so on. This is why "mind-reading machines" are a classic dystopian plot point (monetizing your thoughts, etc.).
An LLM is not a human; it has neither human rights nor human privileges. It is data, and if that data encodes copyrighted works, it is a derivative work. If you aren't following the license of the training data, that's a copyright violation.