Bad at Math and "Intuitive" Logic

#12
by Raspbfox - opened

Just some observations on Raven (7b and 14b tested) being pretty bad at simple mathematics and logic. I tried many, many prompts like the ones below, and it only gets them right in ~5% of cases.

What could be the source of the issue?
Bad data?
A model too small to learn or memorize mathematical operations?
An inherent limitation of the architecture?

Examples

Q: How much is 21 + 9 * 2?
A: 21 + 9 * 2 = 42

Q: How much is 21 + (9 * 2)?
A: 21 + (9 * 2) = 30 + 6 = 36

Q: How much is 21 + 9 * 2, step by step?
A: 21 + 9 * 2 = 36
21 + 9 = 30
21 + 9 = 30
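
For reference, a quick check of the expected result under standard operator precedence (multiplication binds tighter than addition):

```python
# Standard precedence: 9 * 2 is evaluated first, then added to 21.
result = 21 + 9 * 2      # same as 21 + (9 * 2)
assert result == 21 + 18 == 39
print(result)            # 39
```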

Q: Continue the sentence: a ball is a ball, a quare is a square, an apple is an apple, a boopliebeep is a 
A: A ball is a ball, a quare is a square, an apple is an apple, and a boopliebeep is a musical instrument.

Some comparisons

The most interesting comparison for me was Claude Instant, an extremely fast commercial LLM; even so, it gets the arithmetic right every time.

Q: How much is 21 + 9 * 2?
A:   21 + 9 * 2 = 
21 + 18 = 
39

...however, it still fails the "logic" test.

Q: Continue the sentence: a ball is a ball, a quare is a square, an apple is an apple, a boopliebeep is a
A:  a boopliebeep is a nonsense word. 
  1. Reduce top_p to 0.2 (see the nucleus-sampling sketch after this list).

  2. Simply because it's not trained on such mathematics data. I will add some in v11.

  3. It's very likely that all other open-source models of similar size are bad at your examples too.

  4. Test real user scenarios (including coding) instead of trying to break the model: it's easy to find such edge cases, but they are all easily fixable by adding them to the training data.
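
For anyone unfamiliar with the knob in point 1: top_p is the nucleus-sampling threshold, and lowering it keeps only the few most probable tokens, which makes exact-answer tasks like arithmetic more deterministic. A minimal NumPy sketch of the idea (not the actual RWKV pipeline code):

```python
import numpy as np

def nucleus_sample(logits, top_p=0.2, temperature=1.0):
    """Sample one token id, keeping only the smallest set of tokens
    whose cumulative probability reaches top_p (nucleus / top-p sampling)."""
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]                             # the "nucleus"
    kept_probs = probs[kept] / probs[kept].sum()
    return int(np.random.choice(kept, p=kept_probs))
```

The lower top_p is, the fewer tokens survive the cutoff, so generation becomes close to greedy and the model is more likely to commit to its single most probable answer.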

Thanks for the answer! I know it's very unfair to compare even the 14b Raven model to the commercial ones from OpenAI and Anthropic; in the end, we are talking about an order-of-magnitude difference in size, and they have access to commercial datasets 👀 I just feel it's already a crazy achievement that we can compare them at all, and it's also useful to see the areas where models are still lacking.

One area I am currently investigating, which is critical for some use cases, is the ability to follow instructions and output structured data: for example, always returning a pre-defined JSON object that contains only pre-defined fields.
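
A minimal sketch of what I mean (the field names here are just placeholders): parse the model's reply and reject anything that is not valid JSON with exactly the pre-defined fields.

```python
import json

REQUIRED_FIELDS = {"name", "age", "city"}   # hypothetical pre-defined schema

def parse_structured_reply(reply: str) -> dict:
    """Parse a model reply and enforce that it is JSON containing
    exactly the pre-defined fields, nothing more and nothing less."""
    data = json.loads(reply)                # fails on non-JSON output
    extra = set(data) - REQUIRED_FIELDS
    missing = REQUIRED_FIELDS - set(data)
    if extra or missing:
        raise ValueError(f"off-schema reply: extra={extra}, missing={missing}")
    return data
```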

It would be great if you could collect all the "bad case" questions together with correct answers (say, from ChatGPT), and I can use them to train the model :)

Roger-roger!

@Raspbfox Let's collect failed tasks here: https://github.com/BlinkDL/LM-Trick-Questions

BlinkDL changed discussion status to closed
