
Large language models are, at their core, solutions to huge nonlinear optimization problems: find the parameters that best predict the next token in a text. Mathematically, these problems are quite similar to problems we routinely face in physics, like finding the ground state of a quantum system.
In these physics problems, we observe a recurring pattern: first, someone comes up with a new class of solutions (we call this a "variational ansatz") and there is tremendous progress, allowing us to solve previously hard problems in a near-miraculous way. However, once the low-hanging fruit has been picked, the remaining problems stay hard. Throwing vastly more computing power at them helps a little, but quickly runs into diminishing returns.
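To make "variational ansatz" concrete, here is a minimal toy sketch of my own (not taken from any particular physics code): a one-parameter trial state for the ground state of a tiny Hamiltonian. Finding the best parameter is exactly the same kind of nonlinear minimization as fitting a next-token loss, just at a vastly smaller scale.

```python
# Toy variational calculation: parametrize a trial state |psi(theta)> and
# minimize the energy expectation value <psi|H|psi> over theta.
# This is a hand-picked illustrative example, not a real research setup.
import numpy as np
from scipy.optimize import minimize

# A single spin in a mixed field: H = 0.5*Z + 1.0*X (Pauli matrices).
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
H = 0.5 * Z + 1.0 * X

def trial_state(theta):
    """Variational ansatz: a normalized real state on the Bloch sphere."""
    return np.array([np.cos(theta / 2), np.sin(theta / 2)])

def energy(params):
    """The 'loss': energy expectation value <psi(theta)|H|psi(theta)>."""
    psi = trial_state(params[0])
    return psi @ H @ psi

result = minimize(energy, x0=[0.1])   # nonlinear optimization over theta
exact = np.linalg.eigvalsh(H)[0]      # exact ground-state energy for comparison
print(f"variational: {result.fun:.6f}, exact: {exact:.6f}")
```

For this two-level toy the ansatz can hit the exact answer; the interesting (and hard) cases are the ones where no known ansatz captures the true ground state, no matter how long you optimize.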
I'm pretty sure that's exactly what has happened with the advent of transformers for natural language processing.