Observation: 4-bit quantization can't answer the Strawberry prompt
Just that, thought you guys might want to know.
tl;dr: The IQ4_XS quant successfully answered it: **There are 3 R's in "strawberry".**
Questions
What were your complete setup details? Which inference engine, and at which exact git commit? What system prompt? Did you use the correct prompt template? What sampling settings and temperatures?
Would be interesting to see the difference in our experiments! Thanks for sharing your findings!
Setup
$ cd llama.cpp
$ git rev-parse --short HEAD
10433e8b
$ ./llama-server \
--model "../models/bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF/Llama-3.1-Nemotron-70B-Instruct-HF-IQ4_XS.gguf" \
--n-gpu-layers 46 \
--parallel 1 \
--ctx-size 8192 \
--cache-type-k f16 \
--cache-type-v f16 \
--threads 16 \
--flash-attn \
--mlock \
--n-predict -1 \
--host 127.0.0.1 \
--port 8080
I'm running llama-cpp-api-client with the Llama-3 prompt format.
Sampling/Temps
{
"temperature": 0.1,
"top_k": 40,
"top_p": 0.95,
"min_p": 0.05,
"repeat_penalty": 1.0,
"n_predict": -1,
"seed": -1
}
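For comparison, here's a minimal Python sketch of sending those same sampling settings. This isn't llama-cpp-api-client itself, just the standard library against llama-server's OpenAI-compatible `/v1/chat/completions` endpoint (which accepts extra llama.cpp keys like `min_p` and `repeat_penalty` in the request body); the helper names are my own:

```python
import json
import urllib.request

def build_payload(system_prompt, user_prompt):
    # Mirrors the sampling/temps JSON above.
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.1,
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
        "repeat_penalty": 1.0,
        "n_predict": -1,
        "seed": -1,
    }

def ask(payload, url="http://127.0.0.1:8080/v1/chat/completions"):
    # Matches the --host/--port flags in the llama-server command above.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the server running, `ask(build_payload("You are an AI Assistant", "How many r in strawberry?"))` returns the reply text.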
System Prompt
You are an expert in mathematics, linguistics, and brain teasers. Answer the user's questions by repeating the question and using Chain of Thought (CoT) and Chain of Preference Optimization (CPO) before giving the final answer.
Prompt
How many r in strawberry?
Response
**Repeating the Question:**
How many R's are in "strawberry"?
**Chain of Thought (CoT) Analysis:**
1. **Initial Assessment**: The question asks for the count of a specific letter, "R", within a given word, "strawberry".
2. **Break Down the Word**: Divide the word into its individual letters to inspect each one: s-t-r-a-w-b-e-r-r-y.
3. **Identify Target Letter**: The target letter is "R".
4. **Scan for "R"**: Inspect each letter in the sequence for "R".
5. **Count Instances of "R"**: Upon inspection, identify how many times "R" appears.
**Chain of Preference Optimization (CPO) for Accuracy:**
1. **Direct Count Preference**: Prioritize a direct, visual count of "R"s over assumptions.
2. **Eliminate Similar-Looking Letters**: Distinguish "R" from other letters that might be confused with it, ensuring accuracy.
3. **Verify Count**: Double-check the count to ensure no "R" is overlooked or miscounted.
**Final Answer:**
After conducting a thorough, step-by-step analysis via CoT and ensuring accuracy through CPO, the answer is:
**There are 3 R's in "strawberry".**
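For what it's worth, the target count is trivial to verify outside the model; a throwaway Python snippet confirms it:

```python
word = "strawberry"
count = word.lower().count("r")  # counts every "r" in the word
print(f"There are {count} R's in \"{word}\".")
# prints: There are 3 R's in "strawberry".
```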
I'm using the same quant (IQ4_XS).
All config is vanilla llama-server (b3883).
System Prompt: "You are an AI Assistant"
In the original model card from NVIDIA they say:
See details on our paper at https://arxiv.org/abs/2410.01257 - as a preview, this model can correctly answer the question How many r in strawberry? without specialized prompting or additional reasoning tokens:
So, that's what I did.
Tested your system prompt and got this:
"User's Question: How many r in strawberry?
Chain of Thought (CoT) and Chain of Preference Optimization (CPO):
- Initially, I will take a broad perspective by considering the possible number of r's in the word "strawberry".
- Upon closer examination, I will break down the word into individual letters and count the number of r's.
- I will also consider the possibility of other letters or patterns that may affect the count of r's.
Chain of Preference Optimization (CPO):
- I will prioritize the most straightforward approach, which is to directly count the number of r's in the word.
- If the count is not clear, I will consider alternative methods, such as using linguistic patterns or mathematical operations.
Final Answer: There are 2 r's in the word "strawberry".
lol... Interesting... Yeah, I re-ran it and it gave me the wrong answer of 2 this time haha...
Oddly, I checked out and did a clean build of llama.cpp@b3883, and it got it right 5 out of 5 tries with the simple system prompt and temp=0.1.
I ran llama-cli and noticed the default sampler params use a higher temperature than I'm using, e.g. temp = 0.800.
So I ran my experiment again with temp=0.8 and it got it right 4 out of 5 times, with more variation in the wording.
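That kind of repeat-and-tally experiment can be sketched like this. The `ask` callable is a stand-in for whatever client sends the prompt and returns the reply text, and the "last number in the reply" heuristic is my own assumption about the response format:

```python
import re

def tally(ask, prompt="How many r in strawberry?", n=5):
    # Send the same prompt n times; count replies whose final
    # number is 3, the correct count of r's in "strawberry".
    correct = 0
    for _ in range(n):
        numbers = re.findall(r"\d+", ask(prompt))
        if numbers and numbers[-1] == "3":
            correct += 1
    return correct
```

e.g. `tally(my_ask)` returning 4 would mean 4 out of 5 right at whatever temperature the client is set to.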
System Prompt
You are an AI Assistant.
Prompt
How many r in strawberry?
Response
A sweet question!
Let's count the "R"s in "Strawberry":
1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y
There are **3** "R"s in the word "Strawberry".
llama-cli defaults
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
I did some more tries (with b3930) and it got it right.
System Prompt: ""
Prompt: "How many Rs in Strawberry?"
Answer:
"Sweet question!
Let's count the R's in "strawberry":
- S
- T
- R
- A
- W
- B
- E
- R
- R
- Y
There are 3 R's in the word "strawberry".
Seems to be a bit stochastic, huh? Excellent model nevertheless. Thanks for the back and forth.
For me it worked on the first try (IQ4_XS). Though I did ask for the number of "a"s in "salamander".
Yeah, it varies. Sometimes it gets it right, sometimes it doesn't. Still super impressive for a model this size and quant, right?
How about you try this simple question that some small models answer correctly. "I have 50 books. I read 7 of them, how many books i have left?"
Sure. Here it goes.
A simple yet fun math problem!
Let's break it down:
Initial Number of Books: 50
Number of Books Read: 7
To find out how many books you have left, subtract the number of books read from the initial number of books:
Books Left = Initial Number of Books - Number of Books Read
= 50 - 7
= 43
So, you have 43 books left to read and enjoy!
Which is a wrong answer. That's why I pointed it out.
Pretty easy one; Qwen 2.5 14B has no problem answering it correctly.
Got any more of those questions lying around? Could be useful.
> How about you try this simple question that some small models answer correctly. "I have 50 books. I read 7 of them, how many books i have left?"
The problem is that this is more of a linguistic than a math problem. In speech you often leave some things unsaid; in newspapers you can sometimes see those additions in brackets. So this question can easily be interpreted from context (so that it makes sense): I have 50 books. I read 7 of them, how many books I have left (to read)?
Honestly, if you asked me on the street I might answer 43 too. In math class I would probably answer 50 (since in math I require formality and don't add what is unsaid).
So I rephrased the question to make it unambiguous (if you ask from a math point of view, you must be precise):
Q: I have 50 books. I read 7 of them, how many books I own?
A: The answer is... still 50!
Reading 7 of your books doesn't change the number of books you own. You've only changed the status of 7 books from "unread" to "read". The total count of books in your possession remains the same: 50.
So, to recap:
Owned books: 50 (no change)
Read books: 7 (newly updated status)
Unread books: 50 - 7 = 43 (if you're curious about the remaining unread count)
This answer is from IQ4_XS with temp 0.9, min_p 0.02, and everything else in the samplers neutral.
Frequently, the error lies in the question itself, as McUH has demonstrated.