Post by @mttaggart@infosec.exchange

@mttaggart@infosec.exchange · 2 hours ago

An incredibly potent visualization of what's actually happening when a generative model produces text. I don't think you can look at this and see the result as anything but iterative plagiarism.

https://tuhinjubcse.github.io/granta-ngram-cartography.html?v=2

The Grove Remembers — n-gram cartography

Mx. Aria Stewart

@aredridel@kolektiva.social · 2 hours ago

@mttaggart I want to know what my own writing looks like then for comparison. How different is what human authors do when they're trying to convey something?

Taggart :ifin:

@mttaggart@infosec.exchange · 2 hours ago

@aredridel I think it's going to depend on the n-gram, but I have reason to believe that the output of a good writer will have less overlap with the corpus than this.
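To make "overlap with the corpus" concrete, here is a minimal sketch of one way to measure it. It is not the method behind the linked visualization. It assumes whitespace tokenization and exact matching, and the function names `ngrams` and `overlap_fraction` are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct word n-grams in text.

    Assumption: lowercase + whitespace split stands in for a real tokenizer.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(candidate: str, corpus: Iterable[str], n: int) -> float:
    """Fraction of the candidate's distinct n-grams that also occur in the corpus."""
    cand = ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        corpus_ngrams |= ngrams(doc, n)
    return len(cand & corpus_ngrams) / len(cand)


# Toy data, made up for illustration only.
corpus = ["the grove remembers the rain"]
candidate = "the grove remembers nothing"

for n in (1, 2, 4):
    print(n, overlap_fraction(candidate, corpus, n))
```

On this toy input the overlap falls as n grows, which is the sense in which the answer depends on the n-gram: short n-grams are shared by nearly all text in a language, while long ones are much stronger evidence of reuse.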

Mx. Aria Stewart

@aredridel@kolektiva.social · 2 hours ago

@mttaggart What about mediocre ones?

Taggart :ifin:

@mttaggart@infosec.exchange · 2 hours ago

@aredridel Again, it depends. Your unique tokens will vary based on content and style.

But also, I don't really think that's the point here. Maybe you think that a human being who is writing (not plagiarizing, mind you) is doing nothing different from a model iteratively predicting the next token based on its training corpus. I think they are different.

What I see here is a vast array of authored works that have been stolen, disassembled, and used as an ingredient in a word slurry without the consent or compensation of the creators.


Mx. Aria Stewart

@aredridel@kolektiva.social · 2 hours ago

@mttaggart Yeah, that's the thing. I think that's awful motivated reasoning.

My actual process is that I've read a lot of books and once in a while I catch myself imitating them; and I find myself trying to steer out of cliches constantly.

In fact, an awful lot of fiction writers find themselves struggling with cliches constantly.

It's an awfully similar problem.

Mx. Aria Stewart

@aredridel@kolektiva.social · 1 hour ago

@mttaggart And maybe it's okay to have different standards of what's acceptable! But I don't think what's going on is actually all that different.

But then again maybe the most obvious problem there with the generated text is that it's _bad_.

Taggart :ifin:

@mttaggart@infosec.exchange · 1 hour ago

@aredridel If you rip off the voice of a well-known author and publish, you will be roundly criticized for doing so. If you really steal from them, you'll probably get a letter from an attorney. There's opprobrium and social cost for even accidental, unexamined plagiarism. And legal cost for the real McCoy. At least, there used to be.

Here, there is no cost to theft. Nobody cares that the result sucks. The authors, as lawsuits suggest, really do care that the result comes from their work without credit or compensation.

Mx. Aria Stewart

@aredridel@kolektiva.social · 1 hour ago

@mttaggart What degree constitutes plagiarism? What actually _is_ the difference? What is aping someone's voice vs plagiarizing their work? Some of this is settled law, but there's plenty of murky out there.

I don't mean this as a defense of LLM writing, but every time someone goes for a take, I find it doesn't hold up to scrutiny particularly well. And the interesting thing about the legal parts: transformative use _is_ allowed, training has been called transformative, and I can see the argument for it. Anthropic got in trouble for _copying books to train with_, but did not get in trouble _for training with books_.

Taggart :ifin:

@mttaggart@infosec.exchange · 1 hour ago

@aredridel You're right that it's subjective, but I don't actually think that matters nearly as much as you seem to think.

When the model does this, it's because it's literally pulling from training data that was acquired without the consent or compensation of the authors. I know you know this; I'm just saying it to ground the conversation. I also know what the one Anthropic verdict said. As you note, the rest is unsettled, and I remain of the opinion that a massive-scale theft occurred.

No author is so naively pulling snippets of text they remember and pasting them together like a ransom note. If they stray too far into imitation, there's a claim of plagiarism. If they lift whole ideas, same thing. And if they actually rip excerpts of text, well, I've failed students for varying degrees of that behavior depending on context.