comex
yesterday at 8:08 PM
> Of course we don't know what kind of information the model encodes in the specific token choices - I.e. the tokens might not mean to the model what we think they mean.
But it's probably not that mysterious either. Or at least, this test doesn't show it to be. For example, I doubt that the chain of thought in these examples secretly encodes "I'm going to cheat". It's more that the chain of thought is irrelevant: the model thinks it already knows the correct answer just from looking at the question, so the task shifts to coming up with the best excuse it can find to reach that answer. But that doesn't say much, one way or the other, about how the model treats the chain of thought when it's legitimately relying on it.
It's like a young human taking a math test where they're told to "show your work". What I remember from high school is that the "work" you're supposed to show has strict formatting requirements, and may require you to use a specific method. Often there are other, easier ways to find the correct answer: for example, visual estimation in a geometry problem, or just a different algorithm. So in practice you often figure out the answer first and then come up with the justification. As a result, your "work" ends up pretty disconnected from the final answer. If you don't understand the intended method, the "work" might be pretty BS while mysteriously still leading to the correct answer.
But that only applies if you know an easier method! If you don't, then the work you show will be, essentially, your actual reasoning process. At most you might neglect to write down auxiliary factors that hint towards or away from a specific answer. If some number seems too large, or too difficult to compute for a test meant to be taken by hand, then you might think you've made a mistake; if an equation turns out to unexpectedly simplify, then you might think you're onto something. You're not supposed to write down that kind of intuition, only concrete algorithmic steps. But the concrete steps are still fundamentally an accurate representation of your thought process.
(Incidentally, if you literally tell a CoT model to solve a math problem, it is allowed to write down those types of auxiliary factors, and probably will. But I'm treating this more as an analogy for CoT in general.)
Also, a model has a harder time hiding its work than a human taking a math test. On a math test you can jot down calculations that don't end up in the final shown work. A model can't, so any hidden computation is limited to what it can do "in its head". Though admittedly that is very different from what a human can do in their head.