Model fails the strawberry test
It turns out the model crams the answers rather than thinking about them. How disappointing.
Conversation:
User:
how many r's are in the word strawberry when it is falling from a height of 20m.
Assistant:
<Thought> Alright, I need to determine how many 'r's are in the word "strawberry" when it is falling from a height of 20 meters. Let's break this down step by step to ensure accuracy.
First, let's focus on the word "strawberry." To count the number of 'r's in this word, I'll examine each letter individually:
S – Not an 'r'.
T – Not an 'r'.
R – This is the first 'r'.
A – Not an 'r'.
W – Not an 'r'.
E – Not an 'r'.
B – Not an 'r'.
E – Not an 'r'.
R – This is the second 'r'.
Y – Not an 'r'.
So, in the word "strawberry," there are two 'r's.
Now, considering the height of 20 meters that the word is falling from, I need to understand the context. Is the height affecting the number of 'r's? If we're talking about the word physically falling, the height might represent time or distance, but since the number of letters in a word doesn't change with physical movement, the height doesn't influence the count of 'r's.
Therefore, regardless of the height from which the word is falling, the number of 'r's remains the same.
</Thought> <Output> There are 2 'r's in the word "strawberry." </Output>
Screenshot:
Thank you for your attention.
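For reference, the trace above gets both the spelling and the count wrong; a trivial check outside the model confirms that "strawberry" contains three 'r's:

```python
# Sanity check (ordinary Python, not the model): the canonical spelling of
# "strawberry" and the number of 'r's it actually contains.
word = "strawberry"
print(list(word))       # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
print(word.count("r"))  # 3 -- the trace above both misspells the word and undercounts
```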
We made 5 attempts, and although the model can be disrupted in most cases by '20m' interference, there was still one successful instance where the correct answer was produced. We believe that the model actually has the capability to output the correct answer. These capabilities can be properly optimized in subsequent work, such as PRM-assisted decoding, DPO, PPO, etc.
However, this still demonstrates some limitations of our model at its current stage. Our work is ongoing, and we will continue to optimize and iterate in the future. Thank you again for your attention.
No, the model just happened to give a "right" answer, which doesn't necessarily mean that the model is capable of reasoning about the number of "r"s. Though the final answer is right, the reasoning process is wrong: the model spelled "strawberry" incorrectly. As for the subsequent work, according to the Physics of Language Models, finetuning, as well as techniques applied at even later stages such as sampling and decoding, might not bring as much improvement in reasoning as expected.
Yes, we have found that most of the errors occur due to incorrect spelling of words or skipping the 'letter by letter' step.
The current work is not perfect, and we are striving to improve it. However, just as with the move from ORM to PRM, we believe that this issue can be resolved in future work. Other directions, such as addressing model hallucinations and enhancing instruction-following capabilities, are also worth exploring in my opinion.
Likewise, according to the Physics of Language Models, hallucinations are the result of the LLM learning the language format faster than the knowledge itself, so the solution to this problem might still lie in the pretraining stage. Methods such as mixed pretraining, data augmentation, etc. will be helpful. Thank you for your work, and I'm looking forward to your latest updates 🚀
The challenge here is that models are not good at spelling. It's not what they are designed to do. They have to associate the token for a single letter with a token that represents multiple letters. They do not see a word the same way we do. To further compound the problem, this is a Chinese model, not an English one, even if it can translate. It translates from a token representing a Chinese word to a token representing an English word or part of a word.
So until there is stronger training on individual letters and their associations with each other and the final words, most models will struggle a lot with this.
Frankly, I would prefer they not waste the parameters of a 7B model on spelling. The embedding engine handles it fine.
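To make the tokenization point above concrete, here is a minimal sketch using the GPT-2 tokenizer as a stand-in (the model discussed in this thread has its own vocabulary, so the exact split will differ):

```python
# Illustrative only: a subword tokenizer never exposes individual letters,
# which is why letter-counting questions are awkward for LLMs.
# GPT-2's tokenizer is used as a stand-in; the thread's model has its own vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("strawberry"))   # a handful of subword pieces, not 10 letters
print(tok.tokenize(" strawberry"))  # the split can even change with a leading space
```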
It's very interesting: when I asked it what 5x12 + 80x3/12 equals, it kept telling me the answer was 25 and wouldn't change. So I feel that 7B parameters are not enough for rigorous reasoning, and most of the inference results are unreliable.
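For reference, under standard operator precedence the expression works out to 80, not 25:

```python
# 5x12 = 60 and 80x3/12 = 240/12 = 20 under standard precedence, so the total is 80.
print(5 * 12 + 80 * 3 / 12)  # 80.0
```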
Reasoning and math-solving ability is related to model depth, i.e., the number of layers. Pretraining a deeper model will increase its math and reasoning ability, as long as it can converge. And 7B is large enough to store all English knowledge, as the Physics of Language Models claims.
I would conclude that the model does NOT have the ability to perform math reasoning. According to saisai1053's log, the model failed to solve the DAG, i.e., it failed to determine the variables' dependencies. I suggest the model devs do some model probing to examine whether the model is solving the DAG correctly even before it outputs any token.
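A minimal sketch of the probing idea suggested above, assuming a linear probe on the last layer's hidden state; the checkpoint name and the two toy prompts are placeholders, not anything from this repo:

```python
# Hedged sketch: probe whether dependency (DAG) information is linearly decodable
# from hidden states before any token is generated. Toy data only; a real probe
# needs many labeled (prompt, dependency) pairs and a held-out split.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def last_hidden(prompt: str) -> torch.Tensor:
    """Last layer's hidden state at the final prompt token."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

prompts = ["A = 3. B = A + 2. Does B depend on A?",  # dependency present
           "A = 3. B = 5. Does B depend on A?"]      # no dependency
labels = [1, 0]

X = torch.stack([last_hidden(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.score(X, labels))  # with real data, held-out accuracy indicates decodability
```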
Oh, there's still something I forgot. @Sniper, may I ask what decoding strategy you are using, and what temperature and samplers?
We made 5 attempts, and although the model can be disrupted in most cases by '20m' interference, there was still one successful instance where the correct answer was produced. We believe that the model actually has the capability to output the correct answer. These capabilities can be properly optimized in subsequent work, such as PRM-assisted decoding, DPO, PPO, etc.
Adding randomness to inference doesn't prove that the model is capable of something.
Oh, that reminds me! I was using Koboldcpp and had left the temperature at 0.8 when that error happened... but I think 0.8 is a reasonable enough value, or am I wrong? I reset the settings to the Godlike preset today, which brings the temperature down to 0.7, and now it gets it right.
If the problem was me using the wrong settings, then I'm sorry for all the mess.
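Since the discussion above turns on temperature settings, here is a minimal sketch of the difference between greedy decoding and temperature sampling, using the Hugging Face `generate` API with a placeholder checkpoint (this is not the repo's actual inference setup):

```python
# Greedy decoding is deterministic; temperature sampling is not, which is why
# runs at 0.7 or 0.8 can flip between right and wrong answers.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint; substitute the model under discussion
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("How many r's are in the word strawberry?", return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=32, do_sample=False)
sampled = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.7)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```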
Yes — to answer the question above, I just arbitrarily used a temperature of 0.7 without employing a specific decoding strategy.
I agree that using random sampling doesn't necessarily reflect true performance. Our initial intention for this attempt was to determine our model's test@k. We believe that good test@k performance will be beneficial for subsequent search and optimization.
And we are actively iterating on our model and believe that future versions will perform better on tasks like this.
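As a reference for the test@k mentioned above, here is a minimal sketch of the standard unbiased pass@k-style estimator, plugged in with the 5-attempt, 1-success numbers reported earlier in this thread:

```python
# Unbiased pass@k estimate: given n samples with c correct,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=5, c=1, k=1))  # 0.2
print(pass_at_k(n=5, c=1, k=3))  # 0.6
```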