So, what I think most people don't realize is that the amount of computation an LLM can do in one pass is strictly bounded. You can see that here with the layers. (This applies to a lot of neural networks [1].)
Remember, they feed in the context on one side of the network, pass it through each layer doing matrix multiplication, and get a value on the other end that we convert back into our representation space. You can view the bit in the middle as doing a kind of really fancy compression, if you like. The important thing is that there are only so many layers, and thus only so many operations.
Therefore, past a certain point the model can't revise anything, because it runs out of layers. This is one reason why reasoning can help answer more complicated questions. You can train a special token for this purpose [2].
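A minimal PyTorch sketch of that point (toy dimensions, not any particular model): one forward pass is a fixed-length loop over the layer stack, so the amount of sequential computation per pass is a constant, no matter how hard the question is.

```python
import torch
import torch.nn as nn

# Toy decoder-style stack: a forward pass is exactly n_layers sequential
# blocks of attention + MLP. The loop length is a constant of the
# architecture, not something the model can extend when it "needs more time".
n_layers, d_model, n_heads = 12, 64, 4

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        a, _ = self.attn(x, x, x, need_weights=False)
        return x + self.mlp(x + a)  # residual stream

blocks = nn.ModuleList(Block() for _ in range(n_layers))

x = torch.randn(1, 10, d_model)  # 10 context tokens, already embedded
for block in blocks:             # fixed number of iterations
    x = block(x)
# After the last block the answer must be read off this hidden state;
# there is no way to ask for "a few more layers" mid-pass.
```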
There is no mechanism in the transformer architecture for "internal" thinking ahead, or for hierarchical generation. Attention only looks back from the current token, so the model always falls into a local maximum, even if that only leads to bad outcomes.
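To make the "only looks back" part concrete, here is the standard causal mask used in decoder-only transformers (a minimal sketch, not tied to any specific implementation): position i can attend to positions 0..i and nothing after it.

```python
import torch

# Causal mask: row i marks which positions token i may attend to.
seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])

# In attention, masked-out positions are set to -inf before the softmax,
# so future tokens contribute zero weight:
scores = torch.randn(seq_len, seq_len)
weights = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
```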
Not strictly true: while this was previously believed to be the case, Anthropic demonstrated that transformers can "think ahead" in some sense, for example when planning rhymes in a poem [1]:
> Instead, we found that Claude plans ahead. Before starting the second line, it began "thinking" of potential on-topic words that would rhyme with "grab it". Then, with these plans in mind, it writes a line to end with the planned word.
They described the mechanism that it uses internally for planning [2]:
> Language models are trained to predict the next word, one word at a time. Given this, one might think the model would rely on pure improvisation. However, we find compelling evidence for a planning mechanism.
> Specifically, the model often activates features corresponding to candidate end-of-next-line words prior to writing the line, and makes use of these features to decide how to compose the line.
Thank you for these links! Their "circuits" research is fascinating. In the example you mention, note how the planned rhyme is piggybacking on the newline token. The internal state that the emergent circuits can use is mapped 1:1 to the tokens. The model cannot trigger the insertion of a "null" token just to store this plan-ahead information during inference, nor are there any "registers" available aside from the tokens. The "thinking" LLMs are not quite that either, because the thinking tokens are still forced to become text.
That's what reasoning models are for. You get most of the benefit by having the model state an answer once in the reasoning section, because it can then read it over when it outputs it again in the answer section.
It could also have a "delete and revise" token, though you'd have to figure out how to teach the model to use it.
Given how badly most models degrade once they reach a particular context size (any whitepapers on this welcome), reasoning does seem like a quick hack rather than a thought-out architecture.
LLMs are just the speech-center part of the brain, not a whole brain. It's like when you are speaking on autopilot or reciting something by heart: it just comes out. There is no reflection or inner thought process. Now, thinking models do a bit of inner monologue before showing you the output, so they have this problem to a much lesser degree.
If you did hide its thinking, it could do that. But I'm pretty sure what happens here is that it has to go through those tokens before it becomes clear that it's doing things wrong.
What I think happens:
1. There's a question about a somewhat obscure thing.
2. The LLM never knows the answer for sure; it has access to a sort of statistical, probability-based compressed database of all the facts of the world. This allows it to store more facts by relating things to each other, but never with 100% certainty.
3. There are particular obscure cases where its initial "statistical intuition" says that something is true, so it starts outputting its thoughts as expected for a question where something is likely true. Perhaps you could analyze the probabilities it assigns to "Yes" vs "No" to estimate its confidence (a rough probe of this is sketched after this list). It would probably assign much less likelihood to "Yes" than it would if the question were about a horse emoji, but in this case "Yes" still clears the threshold and beats "No".
4. However, when it has to spell out the exact answer, there is no true answer to output, because the premise is false. E.g. the seahorse emoji does not exist, yet it has to output something: the previous tokens were "Yes, it exists, it's X", so X ends up being something semantically close in meaning.
5. The next token then has the context "Yes, the seahorse emoji exists, it is [HORSE EMOJI]". Now the conflict is visible: the model can see that the horse emoji is not a seahorse emoji, but it had to output it, because the previous tokens statistically required that something be output.
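A rough sketch of the probe mentioned in step 3, using gpt2 purely as a stand-in model (the prompt and the model choice are just assumptions for illustration): look at the probability the model assigns to "Yes" vs "No" as the next token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Compare the next-token probabilities of "Yes" vs "No" after a question.
# gpt2 is only a stand-in; the same probe works on any causal LM.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Question: Is there a seahorse emoji? Answer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # distribution over the next token
probs = torch.softmax(logits, dim=-1)

for word in [" Yes", " No"]:
    token_id = tok.encode(word)[0]           # first sub-token is enough here
    print(f"P({word!r}) = {probs[token_id]:.4f}")
```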
It can't internally revise. Each generation step produces a distribution over next tokens, and sometimes the wrong answer gets sampled.
There is no "backspace" token, although it would be cool and fancy if we had that.
The more interesting thing is why it revises its mistakes at all. The answer is having examples of fixing your own mistakes in the training data, plus some RL to bring out that effect more.
AIUI, they generally do all of that at the beginning. Another approach, I suppose, could be to have it generate a second pass? Though that would probably ~double the inference cost.
If you didn't have the luxury of a delete button, such as when you're just talking directly to someone IRL, you would probably say something like "no, wait, that doesn't make any sense, I think I'm confusing myself" and then either give it another go or just stop there.
I wish LLMs would do this rather than just bluster on ahead.
What I'd like to hear from the AI about seahorse emojis is "my dataset leads me to believe that seahorse emojis exist... but when I go look for one I can't actually find one."
There have been attempts to give LLMs backspace tokens. Since no frontier model uses one, I can only guess it doesn't scale as well as just letting the model correct itself in CoT.
You're describing why reasoning is such a big deal. The model can have this freakout in a safe, internal environment, and once its recent output is confident enough, flip into "actual output" mode.
> The odd thing is why it would output its own mistakes, instead of internally revising until it's actually satisfied.
Happens to me all the time. Sometimes in a fast-paced conversation you have to keep talking while you’re still figuring out what you’re trying to say. So you say something, realize it’s wrong, and correct yourself. Because if you think silently for too long, you lose your turn.
Are you sure? Because LLMs definitely have to respond to user queries in time to avoid being perceived as slow. Therefore, thinking internally for too long isn’t an option either.
LLMs spend a fixed amount of effort on each token they output, and in a feedforward manner. There's no recursion in the network other than conditioning the next prediction on the token that was just output. So it's not really time pressure in the way you might experience it, but it makes sense that sometimes the available compute is not enough for the next token (and sometimes it's excessive). Thinking modes try to improve this by essentially allowing the LLM to 'talk to itself' before sending anything to the user.
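A minimal sketch of that loop (gpt2 only as a stand-in model): each emitted token costs exactly one forward pass through the same fixed stack, and the only feedback is appending the sampled token to the context before the next pass.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # one fixed-cost pass per token
    next_id = torch.argmax(logits).reshape(1, 1) # greedy pick for simplicity
    ids = torch.cat([ids, next_id], dim=1)       # the only "recursion"
print(tok.decode(ids[0]))
```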
There’s no "thinking internally" in LLMs. They literally "think" by outputting tokens. The "thinking modes" supported by online services are just the LLM talking to itself.
That's not what I meant. "Thinking internally" referred to the user experience only, where the user is waiting for a reply from the model. And they are definitely optimised to limit that time.
There's no waiting for a reply; there's only the wait between output tokens, which is fixed and mostly depends on hardware and model size. Inference is slower on larger models, but so is training, which is more of a bottleneck than user experience.
The model cannot think before it starts emitting tokens; the only way for it to "think" privately is for the interface to hide some of its output from the user, which is what happens in "think longer" and "search the web" modes.
If an online LLM doesn't begin emitting a reply immediately, more likely the service is waiting for available GPU time or something like that, and/or prioritizing paying customers. Lag between tokens is also likely caused by heavy demand or throttling.
Of course there are many ways to optimize model speed that also make it less smart, and maybe even SOTA models have such optimizations these days. Difficult to know because they’re black boxes.
It's a lot easier if you (I know, I know) stop thinking of them as algorithms and anthropomorphize them more. People frequently say stuff like this, and it's pretty clear that our minds process thoughts differently when we directly articulate them than when we act on “latent thoughts” or impulses.
Yell at me all you want about how “LLMs don’t think”, if a mental model is useful, I’m gonna use it.
The odd thing is why it would output its own mistakes, instead of internally revising until it's actually satisfied.