You're missing one key point, which is what makes this failure mode unusual.

Namely, that there is (incorrect) knowledge in the training data that "seahorse emoji" exists.

So when prompted "Does [thing you strongly believe exists] exist?", the LLM must answer: "Yes, ..."

(The second nuance is that the LLM is strongly encouraged to explain its answers, so it would receive a lower score for saying only "Yes.")

But I and probably others appreciate your more detailed description of how it enters a repair loop, thank you.

[edit: I disagree that LLMs are not statistical or probabilistic, but I'm not sure this is worth discussing.]

[edit 2: Google is no longer telling me how many results a term returns, but "seahorse emoji" and "lime emoji", both quoted, each return over ten pages of results. The point being that both are 'likely' terms for an LLM, but only the former is a likely continuation of 'Does X exist? Yes, ...']



You're right, "seahorse emoji" is almost certainly in the training data, so we should amend my explanation to say that "seahorse emoji" is not just close to the training manifold, but almost certainly right smack on it. The rest of what I said would still apply, and my explanation would also apply to the cases other commenters note, where this behaviour is emitted to some degree with other similar "plausible" but non-existent emoji (which are less likely to be in the training data, a priori). EDIT FOR THIS PARAGRAPH ONLY: Technically, on reflection, since all fitting procedures employ some form of regularization, it is still unlikely that the fitted manifold passes exactly through all or even most training data points, so saying that "seahorse emoji" is "very close" to the fitted manifold is probably still the most accurate phrasing here.
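To make the regularization point concrete, here's a tiny toy sketch (numbers entirely made up, nothing to do with any real LLM): a ridge-regularized polynomial fit has enough parameters to interpolate its training points exactly, but the penalty keeps it from doing so, so even an in-training point ends up merely "very close" to the fitted curve.

    # Toy sketch: ridge regularization keeps the fit from interpolating the
    # training points exactly, so training data sits near, not on, the fit.
    import numpy as np

    x = np.linspace(0, 1, 6)
    y = np.sin(2 * np.pi * x)                  # 6 training points
    X = np.vander(x, 6)                        # degree-5 polynomial features
    lam = 1e-2                                 # regularization strength
    w = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
    print(np.abs(X @ w - y).max())             # > 0: the fit misses every point slightly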

You're also right that it is a long discussion to say to what extent LLMs are statistical or probabilistic, but I would briefly say that if one looks into issues like calibration, conformal prediction, and Bayesian neural nets, it is clear that most LLMs people are talking about today are not really statistical in any serious sense (softmax values are scores, not probabilities, and nothing about pre-training or tuning of LLMs typically involves calibration, or even estimation).
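To make the calibration point concrete, here's a minimal sketch (all numbers made up) of expected calibration error: if softmax scores really behaved like probabilities, the average confidence inside each bin would roughly match the accuracy in that bin.

    # Toy expected calibration error (ECE) on made-up outputs: high-confidence
    # scores with mediocre accuracy give a large gap, i.e. scores != probabilities.
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(correct[mask].mean() - confidences[mask].mean())
                ece += mask.mean() * gap       # weight the gap by the bin's share
        return ece

    conf = np.array([0.99, 0.97, 0.95, 0.98, 0.96, 0.99])   # hypothetical max-softmax scores
    hit  = np.array([1,    0,    1,    0,    1,    0])      # hypothetical correctness
    print(expected_calibration_error(conf, hit))            # ~0.47, badly miscalibrated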

Yes, you can use statistics to (help) explain the behaviour of deep models or certain layers (usually under assumptions of dubious relevance to actual practice), but geometric analogies, regularization methods, and matrix-conditioning intuitions are what have clearly guided almost all major deep learning advances, with statistical language and theory largely being post-hoc, hand-wavey, and (IMO) there for the purpose of publication / marketing. I really think we could de-mystify a huge amount of deep learning if we were just honest that it is mostly fancy curve fitting, with some intuitive tricks for smoothing and regularization that clearly worked long before any rigorous statistical justification (or that still clearly work in complicated ways despite the absence of statistical understanding; e.g. dropout, norm layers, the attention layer itself, etc.).
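As a tiny illustration of what I mean by "intuitive tricks" (my own sketch, not anyone's reference implementation): dropout and layer norm written out as plain tensor operations read as smoothing / conditioning of an intermediate representation, with no probabilistic model of the data required anywhere.

    # Dropout and layer norm as plain tensor operations: random masking plus
    # rescaling, and per-vector centering plus rescaling. No likelihoods involved.
    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(x, p=0.1, training=True):
        if not training:
            return x
        mask = rng.random(x.shape) >= p       # zero a random fraction p of activations
        return x * mask / (1.0 - p)           # rescale so the expected activation is unchanged

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)   # center each feature vector
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)  # rescale to roughly unit scale

    h = rng.normal(size=(2, 8))
    print(layer_norm(dropout(h)).round(2))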

Granted, it gets more complicated when you get into diffusion models and certain other specific models that are in fact more explicitly driven by, e.g., stochastic differential equations and the like.


"my explanation would also to apply to where other commenters note that this behaviour is emitted to some degree with similar other "plausible" but non-existent emoji (but which are less likely to be in the training data, a priori)."

I agree with you partially. I just want to argue there are several factors that lead to this perverse behavior.

Empirically:

Use web gpt-5-instant in TEMPORARY mode. If you ask for "igloo emoji" it confidently (but ONLY in temporary mode) says that "Yes, igloo emoji is in Unicode 12 and is [house-emoji ice-emoji]." Then it basically stops. But it has satisfied its condition of confidently expressing its false knowledge. (Igloo emoji doesn't exist. gpt-5-instant in non-temporary mode says no. This is also weird because it suggests the temporary mode system prompt is laxer or different.)

The mechanism you describe partially explains why "seahorse emoji" leads to babbling: as it outputs the next token, it realizes that the explanation would be worse off if it next emits a stop token, so instead it apologizes and attempts to repair, and it never satisfies its condition of expressing something confidently.
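A toy caricature of that loop (entirely made-up scoring, just to pin down the mechanism we're both describing): stopping is only chosen if the sequence so far scores well enough, and a confidently wrong claim followed by a stop token never does, so the model keeps appending repairs.

    # Made-up scores, purely illustrative: the wrong claim carries a fixed penalty
    # that apologies can never pay down, so the stop condition is never met.
    def score(seq):                                  # lower is better
        wrong_claim = any("seahorse emoji exists" in s for s in seq)
        apologies = sum(s.startswith("sorry") for s in seq)
        return (5.0 if wrong_claim else 0.0) - 0.1 * apologies

    def generate(prompt, max_steps=6, stop_threshold=1.0):
        seq = [prompt, "Yes, the seahorse emoji exists, it is ..."]
        for _ in range(max_steps):
            if score(seq) < stop_threshold:          # good enough -> emit <EOS>
                seq.append("<EOS>")
                break
            seq.append("sorry, that's not quite right, let me correct that:")
        return seq

    print(generate("Does the seahorse emoji exist?"))
    # Never reaches the stop threshold, so it apologizes until max_steps runs out.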

The upstream failure is poor knowledge. That, combined with being tuned to be helpful and explanatory and having no grounding (e.g. web search), forces it to continue. The token distance from the manifold is then the final piece of the puzzle in this unholy pathological brew.

You're incorrect that statistical language modeling is "post-hoc"; it's rather "pre-hoc" / "pre-hack". Most foundational works in language modeling started as pure statistical models (for example, classic n-gram models and Bengio's original neural language model from 2003), and it was later that hacks got introduced that removed the statistical properties but just worked (Collobert and Weston 2008, as influenced by Bottou and LeCun). Where I agree with you is that we should have done away with the statistical story long ago. LeCun's been on about energy-based models forever. Even on HN last week, punters were criticizing him because JEPA hasn't had impact yet, as if he were behind the curve instead of way ahead of it.

People like statistical stories but, similarly to you, I also think they are a distraction.


Right, I kind of suspect we don't really disagree on anything too fundamental here re: the looping behaviour (or statistics, actually). E.g. when I said earlier:

>> "the algorithm probably is doing something like an equivalent of a random walk on the manifold, staying close to wherever 'seahorse emoji' landed, but never really converging, because the tokenization ensures that you can never really land back 'close enough' to the base position"

"converging" is deeply under-specified. Of course, we mean that a stop or <EOS> token of some kind is generated, and this happens when the generated sequence up to that stop token has some low enough score / loss. When I say "you can never really land back 'close enough' to the base position", this is really that the output tokenization is lossy enough that this threshold is never reached, since, when recursing, we keep getting weird output tokens contaminating the sequence, so that we don't get close enough to the original "seahorse emoji" embedding, and so prevent the score / loss from getting small enough. In your language, the model "cannot satisfy its condition of expressing something confidently".

The way you present the timeline, I think we basically are in agreement re: statistics. Yes, if you go back far enough, statistics did indeed guide model development and successes (and still does in some narrow cases). But also yes, as soon as you get into "modern" neural nets that actually make huge progress on things like MNIST, CIFAR, and language modeling, we are way, way past statistical intuitions being necessary or superior to intuitions based on curve fitting, smoothing, gradient conditioning, and the like.

For dating this shift, I was personally thinking of something like the Hinton dropout paper, which I checked was around 2012 (my work has been more in computer vision), but about 2008, as you say, also seems close enough if you consider NLP.

Really appreciate your comments here. EDIT: and yes, energy models are the bomb.


Yeah, overall I think we agree.

If you want to read some mind-blowing early neural language sequence modeling approaches that everyone completely slept on, look at Pollack's work on "recursive auto-associative memory" (RAAM) and Sperduti's later labeled RAAM (LRAAM) work. Both are from the early 90s. Neither had a probabilistic interpretation, IIRC.

Yoshua was always sort of agnostic about probabilistic approaches and used them when they made sense. Maybe 50% of his work included them, and other work, like his early deep vision papers, motivated the use of deep models purely in terms of circuit theory and compactness / model complexity.

Collobert and Weston taught us we could train Yoshua's NLMs much, much faster using negative sampling and a hinge loss, thus dropping the probabilistic story entirely.
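For anyone following along, a minimal sketch of that idea (my own toy scorer, not the 2008 paper's actual architecture): score a real text window above a corrupted one by a margin. There is no softmax and no normalization over the vocabulary, hence nothing that even pretends to be a probability.

    # Pairwise ranking with a hinge loss and a sampled negative: the loss is a
    # margin violation between unnormalized scores, not a likelihood of anything.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim, window = 1000, 32, 5
    E = rng.normal(scale=0.1, size=(vocab_size, dim))     # token embeddings
    w = rng.normal(scale=0.1, size=window * dim)          # toy linear scorer

    def score(window_ids):
        return float(E[window_ids].reshape(-1) @ w)       # unnormalized scalar score

    def hinge_loss(real_ids, margin=1.0):
        corrupted = real_ids.copy()
        corrupted[window // 2] = rng.integers(vocab_size) # negative sample: swap the center word
        return max(0.0, margin - score(real_ids) + score(corrupted))

    real = rng.integers(vocab_size, size=window)
    print(hinge_loss(real))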

I suspect the historical reason is that in the mid 2000s the NLP community had only just broadly started adopting statistical methods (i.e. grad students began to be more likely to use them than not, which hadn't been true historically, when linguistics rather than stats drove many intuitions, and using a CRF felt sort of next-level). So once everyone got comfortable with stats as table stakes, it felt like whiplash to stop approaching things through that lens.


I would also broadly agree that the overuse of statistical language and explanations is probably more driven by historical trends in NLP. I was always more interested in computer vision (including segmentation) and even deep regression. Especially in the case of deep regression, with the absence of a softmax and the ease of constructing task-specific custom loss functions (or, like you say, the hinge loss example), it always seemed pretty clear to me that none of this was ever really particularly statistical in the first place.
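To spell out the deep regression point with a toy example (my own made-up loss and numbers): the network outputs a real value and you train it with whatever penalty the task actually cares about, e.g. one that charges more for under-prediction; there is no softmax and no likelihood anywhere.

    # A task-specific regression loss: asymmetric squared error that penalizes
    # under-prediction more heavily. It is just a number the optimizer pushes down.
    import numpy as np

    def asymmetric_loss(pred, target, under_weight=3.0):
        err = target - pred
        return float(np.where(err > 0, under_weight * err**2, err**2).mean())

    pred   = np.array([2.0, 4.5, 7.0])    # hypothetical network outputs
    target = np.array([2.5, 4.0, 9.0])    # ground truth
    print(asymmetric_loss(pred, target))  # (3*0.25 + 0.25 + 3*4.0) / 3 ~ 4.33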

I will definitely check out those RAAM and LRAAM papers, thanks for the references. You definitely seem to have a richer historical knowledge of these topics than I do.



