Since the topic came up in the whole "I run models locally so it's all fine" conversation:
"Open Source AI" does not meaningfully exist. It's just openwashing proprietary shit
https://tante.cc/2024/10/16/does-open-source-ai-really-exist/
Discussion
@tante The "Open Source AI Definition" was the final straw for me to recognize OSI for what they are (at least have become): a capitalist apologist shill group.
@tante hey @pallenberg this post by Tante reminded me a lot of your episode with Don on the Techlounge podcast.
Is there such a thing as Open Source AI or not?
@SebasFC by this definition: no!
Quite apart from the fact that huge amounts of the models' training data came with licenses that were anything but free.
Whether you'd rather call it Public Domain, which was quite popular before the freeware movement, especially in the 80s... no idea!
But it's also a fact that there's a difference between the Chinese strategy and that of OpenAI, Anthropic, Google & Co.
The Chinese providers throw their current foundation models out there for download. The US providers don't!
@pallenberg "...too big to collect with care..."
https://dair-community.social/@emilymbender/116109627131276897
@tante thanks for writing this. Actually, just reading your caption ("'Open Source AI' does not meaningfully exist. It's just openwashing proprietary shit") was a clarifying moment because I was wondering how practical a truly Open Source LLM (all the "sources", including the entire training data, bundled together into one big repo) would be
And then I realised: it's not about being "practical". The definition's job is to set the standard, and reaching that or not is the implementer's problem
@tante also, if people did want to make LLMs (or other models) up to those standards, they would—by creating or relicencing datasets, etc. It would be a humongous effort, of course, but nobody claimed earning your own things was *less* effort than stealing somebody else's
Also, this means we don't really *need* a separate definition specifically for LLMs. We can just use the same standards we've always used: full sources, including code and training data and everything 📦
@badrihippo exactly. Having a specific other definition for "AI" only serves the goal of watering down standards
@tante Curious, what do you think of apertus: https://www.swiss-ai.org/apertus ? The Swiss seem to be making a meaningful attempt? "Particular attention has been paid to data integrity and ethical standards: the training corpus builds only on data which is publicly available. It is filtered to respect machine-readable opt-out requests from websites, even retroactively, and to remove personal data, and other undesired content before training begins." (I haven't used it myself.)
@tante @colincornaby Have you looked at https://allenai.org/olmo? For most of the "open weight" models, I'd completely agree - but the Olmo3 work in particular exposes all of the training data as well, which I read as one of the core arguments in that piece. They not only share and show their data, they discuss - in quite some detail - their training processes, including experiments on the pros and cons for techniques on relatively weaker models.
If you haven't seen it, it's very worth looking.
@tante correct. burn the planet down from every desktop, now get to it
@tante There’s Mistral. They have models that have open training data. 🤔
@tante when ai is open source they just mean it's proprietary and not on the cloud
The thing that gave me the greatest joy when I woke up this morning was watching Chris Noland and his wife discuss how people are openly rejecting data centers. They showed a short clip of people in the streets breaking out in cries of joy when one person announced that the data center project had been rejected from their city
cont'd
"Even with open-source AI, you still need huge amounts of data, labor, and infrastructure. They don’t challenge the concentration that includes distribution networks, economies of scale, entrenched reach, the ability to define the tooling and the standards, and so on. Claiming they do these things confuses and distracts us from the type of solutions we need."
/2 /end
Meredith Whittaker explains *why* open source ai is simply a masquerade:
https://ainowinstitute.org/publications/open-source
The key novelty of the current AI moment is the presence of concentrated amounts of data that had not been available before, and powerful distributed computational systems to process that data to train and perform inference on AI models.
1/
@tante olmoe by Allen.ai and some Firefox things
(I know there are niche attempts that work even worse than all the other models)
@tante Ah, so you're counting https://www.swiss-ai.org/apertus in "nice attempt, but ..." ?
@tante Fair enough.
I can see from their white paper that whilst they're being really very transparent... any duplication would still need to do all the scraping and cleaning of data themselves.
But, *given* that proviso, it does seem like one of, if not the, most open attempt. It's a hard problem, but they've certainly tried to be very ethical about it.
@tante Open Source LLMs do not exist. I refuse to limit the definition of "AI" to only GenAI LLM Nonsense. And on the ML side you have a lot of OSS