@aredridel Again, it depends. Your unique tokens will vary based on content and style.
But also, I don't really think that's the point here. Maybe you think that a human being who is writing (not plagiarizing, mind you) is doing nothing different from a model iteratively predicting the next token from a corpus. I think they are different.
What I see here is a vast array of authored works that have been stolen, disassembled, and used as an ingredient in a word slurry without the consent or compensation of the creators.