@eloquence@social.coop No compression algorithm can decompress your file into a poem about the complicated romantic relationship between a nematode and a rosemary bush.
Why not?
If the original file contained poems about complicated romantic relationships, and texts about rosemary bushes and about nematodes, a section of the compressed high-dimensional matrix of frequencies can be extracted to maximize its statistical correlation with the prompt vectors.
That's exactly what happens in any #LLM: given a lossy compression of a high-dimensional frequency matrix, the software traverses and decompresses unrelated fragments of the source texts according to their statistical correlation with the prompt (and with the previously generated tokens).
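To make that concrete, here is a deliberately tiny sketch (my own toy illustration, not the code of any real model): a bigram frequency table built from a three-sentence corpus, then traversed token by token, each step conditioned on the prompt and on the tokens already generated. A real #LLM replaces the explicit table with a lossy, learned approximation of much higher order, but the generation loop is the same idea.

```python
# Toy illustration only: an explicit bigram "frequency matrix" sampled
# token by token, conditioned on the prompt and the previously generated
# tokens. Real LLMs learn a lossy, high-order approximation of such
# statistics, but generation is still traversal of them.

import random
from collections import defaultdict

corpus = (
    "the nematode loved the rosemary bush . "
    "the rosemary bush ignored the nematode . "
    "the poem was about a complicated relationship ."
).split()

# "Frequency matrix": how often each token follows each other token.
follow_counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def generate(prompt, n_tokens=10, seed=0):
    """Pick each next token in proportion to how often it followed
    the previous token in the source corpus."""
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(n_tokens):
        options = follow_counts.get(tokens[-1])
        if not options:
            break  # no statistics for this context: stop
        words, counts = zip(*options.items())
        tokens.append(rng.choices(words, weights=counts)[0])
    return " ".join(tokens)

print(generate("the nematode"))
```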
Beyond "lossiness", it's the user-provided context and the prediction based on that context and their own generated text that makes these models functionally useful.
Plausible, not useful.
You can use them if all you need is to fool people about who (or what) they are talking to, for example to spread disinformation or to win the imitation game, but that's all.
Whenever you need something correct, not just plausible to an uninformed human, they stop being useful. In fact, a recent #OpenAI study acknowledges an error rate above 90% for its LLMs on simple questions with verifiable answers.
As always, OpenAI is trying to set up a benchmark it can easily game, but the numbers are clear: even on basic tasks, #GenAI is totally unreliable.
Predictive text generation algorithms are not a "weapon of oppression".
The technology in itself (the algorithms described in papers and textbooks) is not a "weapon of oppression", but all real-world models are, no matter how you use them.
So if you build your LLM from scratch, properly collecting and selecting all of the source texts, you might get a model that is not harmful to people.
But if you hope to just use models from huggingface "for the greater good", you are fooling yourself.