> it gave me a fairly sensible answer (similar to what is said in this article, ie, trained on language by humans that think it exists, etc)
That's more of a throwaway remark. The article spends its time on a very different explanation.
Within the model, the ultimate output [severed horse head emoji] can be produced by the token sequence "horse [emoji indicator]".
If you specify "horse [emoji indicator]" somewhere in the middle layers, you will get output that is an actual horse emoji.
This also works for other emoji.
It could, in theory, work just as well for "kilimanjaro [emoji indicator]" or "seahorse [emoji indicator]", except that those can't be converted into a Kilimanjaro or seahorse emoji, because no such emoji exist. But it's not a strange idea for the model to have.
So, the model predicts that "there is a seahorse emoji: " will be followed by a demonstration of the seahorse emoji, and encodes that in its internal representation. But every internal representation has to decode to *some* token, so the decoder falls back to the nearest emoji that does exist, and the output is wrong. Then it predicts that "there is a seahorse emoji: [severed terrestrial horse head]" will be followed by something along the lines of "oops!".
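For what it's worth, you can poke at this kind of thing yourself with a "logit lens" style readout: take the hidden state at an intermediate layer, push it through the model's final layer norm and unembedding, and see which token that layer is currently "leaning toward". Here's a minimal sketch, assuming GPT-2 via Hugging Face transformers; the model and prompt are just placeholders, not whatever the article actually probed:

```python
# Logit-lens sketch: decode each layer's hidden state as if it were the final one.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "There is a seahorse emoji:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors, each [1, seq_len, d_model]
for layer_idx, h in enumerate(out.hidden_states):
    last = model.transformer.ln_f(h[0, -1])  # apply the final layer norm to this layer's last-position state
    logits = model.lm_head(last)             # project through the unembedding matrix
    top_id = int(logits.argmax())
    print(f"layer {layer_idx:2d} -> {tokenizer.decode([top_id])!r}")
```

GPT-2 obviously isn't the model the article looks at, but something like this readout is presumably what the "specify 'horse [emoji indicator]' in the middle layers" observation is based on: in the middle of the stack the state looks like "seahorse-ish emoji", and only at the output does it get forced onto a real emoji token.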