"my explanation would also to apply to where other commenters note that this beh...

D-Machine · 2025-10-06T07:49:36 1759736976

Right, I kind of suspect we don't really disagree on anything too fundamental here re: the looping behaviour (or statistics, actually). E.g. when I said earlier:

>> "the algorithm probably is doing something like an equivalent of a random walk on the manifold, staying close to wherever 'seahorse emoji' landed, but never really converging, because the tokenization ensures that you can never really land back 'close enough' to the base position"

"converging" is deeply under-specified. Of course, we mean that a stop or <EOS> token of some kind is generated, and this happens when the generated sequence up to that stop token has some low enough score / loss. When I say "you can never really land back 'close enough' to the base position", this is really that the output tokenization is lossy enough that this threshold is never reached, since, when recursing, we keep getting weird output tokens contaminating the sequence, so that we don't get close enough to the original "seahorse emoji" embedding, and so prevent the score / loss from getting small enough. In your language, the model "cannot satisfy its condition of expressing something confidently".

The way you present your timelines, I think we basically actually are in agreement re: statistics. Yes, if you go back far enough, statistics did indeed guide model development and successes (and still does in some narrow cases). But, also yes, as soon as you get into "modern" neural nets that actually make huge progress on things like MNIST, CIFAR, and language modeling, yeah, we are way, way past statistical intuitions being necessary or superior to intuitions based on curve fitting and smoothing / gradient conditioning and the like.

For dating this shift, I was personally thinking to something like the Hinton dropout paper which I checked was around 2012 (my work has been more in computer vision), but, yeah, about 2008, as you say, also seems close enough if you consider NLP.

Really appreciate your comments here. EDIT: and yes, energy models are the bomb.

bravura · 2025-10-06T14:42:45 1759761765

Yeah, overall I think we agree.

If you want to read some mind blowing early neural language sequence modeling approaches that everyone completely slept on, look at Pollack's work on "recursive auto-associative memory" (RAAM) and Sperduti's later labeled RAAM (LRAAM) work. Both from the early 90s. Didn't have a probabilistic interpretation IIRC.

Yoshua was always sort of agnostic about probabilistic approaches and used them when they made sense. 50% of his work included them, and other like early deep vision works of his purely motivated the use of deep models in terms of circuit theory and compactness / model complexity.

Collobert and Weston taught us we could train Yoshua's NLM models much much faster using negative sampling and a hinge loss, thus dropping the probabilistic story entirely.

I suspect the historical reason is that in the mid 2000s, the NLP community only very broadly started adopting statistical methods. (i.e. grad started began to be more likely to use them than not, which hadn't been true historically when linguistics not stats drove many intuitions, and using a CRF felt sort of next-level). So once every got comfortable with stats as table-stakes, they felt a sort of whiplash to stop approaching things through this lens.

D-Machine · 2025-10-06T15:33:29 1759764809

I would also broadly agree that the overuse of statistical language and explanations is probably more driven by historical trends in NLP. I was always more interested in computer vision (including segmentation) and even deep regression. Especially in the case of deep regression, with the absence of a softmax and the ease of constructing task-specific custom loss functions (or like you say, the hinge loss example), it always seemed to me pretty clear none of this was all ever really particularly statistical in the first place.

I will definitely check out those RAAM and LRAAM papers, thanks for the references. You definitely seem to have a more rich historical knowledge than I do on these topics.