LLMs’ surprise teaches us about ourselves
How does a language model know, understand, and become expert?
Apropos with Peter Strömberg was great! He showed us a lot of cool stuff that Calva can do. And also some cool tutorials he’s built into the Clojure IDE.
Apropos is now on break for the Summer. Enjoy!
LLMs’ surprise teaches us about ourselves
I studied AI in grad school. My master’s thesis was about combining Natural Language Processing (NLP) techniques with Machine Learning (ML) to extract very specific information from a large collection of text documents. That was back in 2008. When ChatGPT first came out, I was skeptical. But I gave it a shot and was very surprised. In this essay, I’d like to explore what surprised me before my memory of life before LLMs fades.
When I studied AI, NLP was a very different field from Knowledge Representation (KR) and Information Extraction (IE). NLP was about grammars, understanding sentence structure, and manipulating the structure of text, such as turning a declarative into an interrogative. Knowledge Representation was basically data modeling, but in a form that AI algorithms could use. And IE was taking unstructured text and turning it into structured data.
Each of those parts had its own literature. Sure, any particular project was going to touch on multiple fields, but each had its own difficulties and hence its own literature. LLMs shatter these distinctions.
Take a basic statistical Markov Chain. It builds a statistical model of the probability distribution of the next word, given the last n words. It was clear, even back in 2008, that if you increased n and gave it more text to train on, the text that the Markov Chain generates sounds more and more human. It is more grammatically correct. And it sounds less and less like the source material, as differences between authors start to average out. But the text is also meaningless. Your mind tries to make sense of it because it does sound like language. But if it does mean something, it’s more by coincidence than anything. Part of the joke was that you’d train a Markov Chain on a politician’s speeches and generate babble that sounded like them, often hitting the right topics, but never making any sense. Get it? Politicians talk a lot but don’t say anything of substance.
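To make that concrete, here is a minimal sketch of a word-level Markov Chain in Python. The corpus, the order n, and the function names are all invented for illustration; a real setup would train on far more text.

```python
import random
from collections import defaultdict

def train_markov_chain(text, n=2):
    """Build a table mapping each n-word context to the words that follow it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n])
        model[context].append(words[i + n])
    return model

def generate(model, length=30):
    """Repeatedly sample the next word from the distribution for the current context."""
    context = random.choice(list(model.keys()))
    output = list(context)
    for _ in range(length):
        followers = model.get(tuple(output[-len(context):]))
        if not followers:
            break  # dead end: this context never continues in the training text
        output.append(random.choice(followers))
    return " ".join(output)

# Toy corpus for illustration; the joke version would train on a politician's speeches.
corpus = "the senator said the economy is strong and the economy will grow"
model = train_markov_chain(corpus, n=2)
print(generate(model, length=15))
```

Run on a large corpus with a bigger n, the output starts to sound fluent while still meaning nothing.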
The success of Markov Chains at producing language, but without any reliability in the content, was a kind of proof that you needed to integrate knowledge representation with language generation. You needed a way to generate text under the constraint that it say something from the database.
LLMs ignored that problem and just kept increasing n (now called the context window) and the size of the training set. They also moved from a statistical model to a neural network model, which allows for much better generalization of learning. And there were some structural additions to classical neural nets. The transformer was one of them. It allows the network to attend to different parts of the window, which is now thousands or even millions of words long.
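For the curious, here is a rough sketch of the attention idea at the heart of a transformer, written with NumPy. The dimensions and random inputs are made up for illustration, and it leaves out multi-head attention, positional encoding, and everything else a real model needs.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position scores every other position, turns the scores into weights
    with a softmax, and returns a weighted mix of the values V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each word attends to each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the window
    return weights @ V

# Illustrative only: 5 "words" in the window, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (5, 8)
```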
As I write this, LLMs can output flawless English text, meaning they don’t make grammatical mistakes. They’ve done so for a long time. In fact, they’re so good at language itself that they’re used to do fun linguistic things like converting an essay into pirate speak or a rap in the style of Shakespeare. Or even practical things like summarizing text.
What’s most surprising is that the models have improved their factual correctness. I know it may sound odd given how famous LLMs are for hallucination, but it is the most prominent improvement over classical Markov Chains. Whereas statistical Markov Chains said something factually true 1% of the time, LLMs are correct close to 100% of the time. They’ve gone from babbling, senile uncle to overconfident but smart uncle. Or maybe it’s more like a dancing dog. When a dog dances for a few seconds, we clap. But the dance isn’t very good by human standards. We clap for the fact that it danced at all. But a human dancing, that’s a different matter. We start to judge the faults. They missed a beat. There are some awkward movements. LLMs have crossed over into the level where we notice their hallucinations more than the fact that they work at all.
LLMs began as NLP and mastered it, but in order to keep improving, they had to get good at Knowledge Representation and Information Extraction. That’s the surprise. And, like all AI surprises, it teaches us something about ourselves.
The lines between language and knowledge and reason have always been hard to draw. If you draw up a simple English grammar, then generate sentences using it, most of the sentences will be meaningless. But some will be awkward. It will say things like “Colorless green ideas sleep furiously.” That one is classically nonsensical. But it will also say things like “I ate three rices.” The trouble is that rice is a mass noun, not a count noun. It can’t take a number as a modifier. It’s a grammatical distinction that we can incorporate into the simple grammar. But it’s also a very simple aspect of a model of the world. As we add more and more of these aspects to the grammar, it becomes more and more an understanding of the world. And there are countless fine distinctions in grammar that linguists have teased apart in English. LLMs, in my opinion, have crept into knowledge representation through the kind of world knowledge embedded in grammar.
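Here is a toy sketch of what generating from a simple grammar looks like. The grammar and word lists are invented for illustration, and notice that nothing in it knows that rice can’t be counted or that ideas don’t sleep.

```python
import random

# A tiny made-up grammar: syntactically fine, semantically unconstrained.
grammar = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "Adj", "N"], ["Det", "N"]],
    "VP":  [["V", "NP"], ["V", "Adv"]],
    "Det": [["the"], ["a"]],
    "Adj": [["colorless"], ["green"], ["furious"]],
    "N":   [["idea"], ["dog"], ["rice"]],
    "V":   [["sleeps"], ["eats"], ["admires"]],
    "Adv": [["furiously"], ["quietly"]],
}

def expand(symbol):
    """Recursively expand a symbol by picking a random production."""
    if symbol not in grammar:  # terminal word
        return [symbol]
    production = random.choice(grammar[symbol])
    return [word for part in production for word in expand(part)]

for _ in range(3):
    print(" ".join(expand("S")))
# e.g. "the green idea eats a rice" -- fine by the toy rules, meaningless,
# and wrong in real English because rice is a mass noun.
```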
And they’ve moved far beyond it. However, grammar doesn’t know what the capital cities of countries are. Those are historical accidents. And that’s why LLMs might hallucinate those kinds of facts. It’s still a wonder that they can know any of them at all.
But here’s where the information extraction comes in. In my thesis, I built a system to learn from text. I gave it all of the country names and all of the capital city names. Then I gave it all of the sentences in Wikipedia that contained both a country name and a city name. I parsed the sentences into a graph, then plucked out the path between the city and country names. Then I used good old-fashioned graph matching to figure out which paths were correlated with capital cities.
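Here is a much-simplified sketch of that path-plucking step, using a hand-built dependency graph in place of a real parser. The sentence, the edge labels, and the graph structure are all illustrative; the real thesis system worked on full parses of Wikipedia sentences and did not look like this.

```python
from collections import deque

# Hand-built dependency graph for "Paris is the capital of France"
# (a real system would get this from a parser).
edges = {
    ("Paris", "is"): "nsubj",
    ("capital", "is"): "attr",
    ("the", "capital"): "det",
    ("of", "capital"): "prep",
    ("France", "of"): "pobj",
}

# Treat the graph as undirected for path finding.
neighbors = {}
for (a, b), label in edges.items():
    neighbors.setdefault(a, []).append((b, label))
    neighbors.setdefault(b, []).append((a, label))

def path_between(start, goal):
    """Breadth-first search for the dependency path between two entity mentions."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, label in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label]))
    return None

# The label sequence between the city and country mentions is what gets
# matched against paths from known capital-city pairs.
print(path_between("Paris", "France"))  # ['nsubj', 'attr', 'prep', 'pobj']
```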
The software reached about 80% accuracy given thousands of sentences. But LLMs are different. You can give one a single sentence, like “Paris is the capital of France”, and then ask it: “What is the capital of France?” It will tell you. Or you can ask the other way: “What country is Paris the capital of?” Even if it doesn’t know what the capital of France is without you telling it (it probably does), it can use the information in the sentence you gave it to answer questions. What’s more, you can give it a ten-page document full of information (including the sentence “Paris is the capital of France.”) and ask it those questions. It will still find the answer. Again, LLMs are approaching Information Extraction from an NLP doorway.
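As a sketch of what that looks like in practice, here is one way to ask an LLM a question about a document you hand it, using the OpenAI Python client. The model name and the prompt wording are just illustrative choices, not a recommendation.

```python
from openai import OpenAI  # assumes the openai package is installed and an API key is configured

client = OpenAI()

document = "Paris is the capital of France."  # could just as well be a ten-page document

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {
            "role": "user",
            "content": f"{document}\n\nUsing only the text above: what country is Paris the capital of?",
        },
    ],
)
print(response.choices[0].message.content)
```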
And this also tells us something about ourselves. The knowledge of capital cities is declarative knowledge. It’s the kind of knowledge that is easy to quiz for. I say UK, you say London. But imagine all of the knowledge you have to have to even begin answering the question. What is a country? What is a capital? What if I ask it more subtly, like “Where is the seat of government in the largest country by area?” How could it get that right without a deeper model of the world? Capitals are typically the seat of government. Cities are places. What is the largest country? What does area mean? In my opinion, this has been the key to unlocking Information Extraction—a critical mass of facts filling in an adequate model. And I don’t know how to argue for this, but I believe it: What we call understanding is largely linguistic, captured in our feelings of correct usage. It’s what Wittgenstein was on about. So as the LLM gets better at correct usage, it gets better at interpreting the meaning from text.
There’s one more thing that’s perhaps more subtle. When I first used ChatGPT, it was version 3.5. It hallucinated a lot. But more than that, you’d ask it a difficult question and it would pick a side. For instance: what is the best way to learn a new language? It might say immersion. Or it might say study from books. Or take a class. All plausible. But when version 4 came out, the same question was answered qualitatively differently. What is the best way to learn a new language? “There are differing opinions about what is the most efficient way to learn a language. Some believe immersion is best. Others believe personal study from books. But it depends largely on your goals with learning the language…” In short, it had a theory of knowledge, a kind of epistemology. It didn’t just know facts. It knew that different groups of people believed different facts. In fact, a better way to understand a topic is to understand the controversies of the field—and to be able to argue all sides.
Again, this is surprising. Simply by making the LLM bigger (bigger window, bigger network, bigger training set), it has transcended basic models of truth and found a higher version of expertise. This also tells us about ourselves. As we become experts, we transcend any particular point of view and take on a meta point of view. We can hold multiple, often conflicting models in mind at once, and choose between them as we please, never really believing in them any more than we have to. Expertise is built on conflicting models.
Well, thanks for coming on this tour with me. I think the main point is that language is a doorway into our intelligence. It’s hard to know where language ends and understanding begins. And this has allowed LLMs to do many skillful things. Not everything, mind you. LLMs are not able to walk or drive. But they can do a lot. And like all interesting AI, they teach us about ourselves.
re "It’s hard to know where language ends and understanding begins." So true, I've read sentences of legalese/technobabble knowing each word but not the sentences' meaning. :-/