Tiota Sram
@tiotasram@kolektiva.social · 3 weeks ago

My 4-month-old kid is not DDoSing Wikipedia right now, nor will they ever do so before learning to speak, read, or write. Their entire "training corpus" will not top even 100 million "tokens" before they can speak & understand language, and do so with real intentionality.

Just to emphasize that point: 100 words-per-minute times 60 minutes-per-hour times 12 hours-per-day times 365 days-per-year times 4 years is a mere 105,120,000 words. That's a ludicrously high estimate of both words-per-minute and hours-per-day, and 4 years old (the age of my other kid) is well past the point at which many children develop basic speech. More likely, the available "training data" is at least 1 or 2 orders of magnitude less than this.
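
To spell out that arithmetic (a quick sketch using only the deliberately generous rates above):

    # Generous upper bound on the words a child hears by age 4
    words_per_minute = 100    # far above real child-directed speech
    hours_per_day = 12        # as if someone talked at them every waking hour
    words = words_per_minute * 60 * hours_per_day * 365 * 4
    print(f"{words:,}")       # 105,120,000 -- versus billions of tokens for an LLM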

The point here is that large language models, trained as they are on multiple billions of tokens, are not developing their behavioral capabilities in a way that's remotely similar to humans, even if you believe those capabilities are similar (they are by certain very biased ways of measurement; they very much aren't by others). This idea that humans must be naturally good at acquiring language is an old one (see e.g. https://en.m.wikipedia.org/wiki/Language_acquisition_device). Why should this matter though?

The AI hypelords are trying to argue (because it will benefit them personally, in most cases) that more research into LLMs alone will lead to so-called "Artificial General Intelligence." However, "general intelligence" doesn't have any widely-accepted definition (although Microsoft's contract with OpenAI seems to think the definition involves a certain level of profit: https://techcrunch.com/2024/12/26/microsoft-and-openai-have-a-financial-definition-of-agi-report/). But I think it's pretty fair to claim that a system which is bad at learning probably does not have something we'd want to call "general intelligence" since the capacity to learn is an important part of what intelligence is. It might have particular capabilities we'd call "intelligent" but by missing out on the capacity to learn, its "intelligence" would by definition be narrow.

Although there are definitely people working on LLM training efficiency, the underlying technical approach is fundamentally incompatible with learning language from merely millions of tokens. Any approach that achieves reasonable language capacity without billions of tokens of training data will deviate from the LLM blueprint in one of two ways: either it will start from a pre-trained model that has more data available to it, or it will use other AI techniques to learn more efficiently.

"But wait, don't humans have a pre-trained language model in our DNA?" you might ask. We certainly have some capability that other species lack, but it's more likely a learning capability than just stored linguistic information, for a few reasons. First, any stored information would have to be completely language-agnostic, since genes don't vary by language spoken. Second, the entire human genome has a raw information content of ~700 MB (see: https://medium.com/precision-medicine/how-big-is-the-human-genome-e90caa3409b0). That's not nearly enough to encode a useful amount of pre-training data in modern model terms, and you've got to leave room to encode all of human biology... Just to emphasize this point, the "small" 8-billion-parameter Llama 3.1 model needs ~12 gigabytes of RAM to store the parameters (https://llamaimodel.com/requirements/#Llama3-1).

The point here is that if we want to be serious about a quest for "AGI," or if we're worried about whether "AGI is just around the corner," we can be pretty sure that more fundamental AI research breakthroughs stand between the state of the art and whatever our favorite idea of AGI is, and that these breakthroughs, assuming they happen, will not happen as a result of investing more money in larger language models. In fact, if they truly achieve efficient learning, they won't even need larger and more environmentally questionable datacenters to run. Just the opposite: LLM research and the datacenter arms race are currently sucking researcher time and material resources away from the investments that might actually achieve AGI. To put it more succinctly:

The bullshit-machine peddlers are peddling bullshit when it comes to AGI claims.

Have I mentioned my 4-year-old is also learning to draw?

#AI #LLM #AGI
