> it gave me a fairly sensible answer (similar to what is said in this article, ie, trained on language by humans that think it exists, etc)
That's more of a throwaway remark. The article spends its time on a very different explanation.
Within the model, the ultimate output [severed horse head emoji] can be produced by the token sequence "horse [emoji indicator]".
If you specify "horse [emoji indicator]" somewhere in the middle layers, you will get output that is an actual horse emoji.
This also works for other emoji.
It could, in theory, work just as well for "kilimanjaro [emoji indicator]" or "seahorse [emoji indicator]", except that those can't be converted into a Kilimanjaro or seahorse emoji, because no such emoji exist. But it's not a strange idea for the model to have.
So, the model predicts that "there is a seahorse emoji: " will be followed by a demonstration of the seahorse emoji, and encodes that in its internal representation. But every internal representation has to decode to *some* token, so the decoder falls back to the nearest emoji that does exist, and the output is wrong. Then it predicts that "there is a seahorse emoji: [severed terrestrial horse head]" will be followed by something along the lines of "oops!".
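For what it's worth, you can poke at this kind of thing yourself with a "logit lens" style readout: take the hidden state at an intermediate layer, push it through the model's final layer norm and unembedding, and see which token that layer is currently "leaning toward". Here's a minimal sketch, assuming GPT-2 via Hugging Face transformers; the model and prompt are just placeholders, not whatever the article actually probed:

```python
# Logit-lens sketch: decode each layer's hidden state as if it were the final one.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "There is a seahorse emoji:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors, each [1, seq_len, d_model]
for layer_idx, h in enumerate(out.hidden_states):
    last = model.transformer.ln_f(h[0, -1])  # apply the final layer norm to this layer's last-position state
    logits = model.lm_head(last)             # project through the unembedding matrix
    top_id = int(logits.argmax())
    print(f"layer {layer_idx:2d} -> {tokenizer.decode([top_id])!r}")
```

GPT-2 obviously isn't the model the article looks at, but something like this readout is presumably what the "specify 'horse [emoji indicator]' in the middle layers" observation is based on: in the middle of the stack the state looks like "seahorse-ish emoji", and only at the output does it get forced onto a real emoji token.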