TinyLlama-1.1B

We used this version of TinyLlama as a base model: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

The goal was to improve performance on basic algebra (i.e. solving systems of linear equations).

The base model was fine-tuned on 8k rows of synthetic solution data generated by OpenMath-Mistral-7B-v0.1-hf on ALG-514.

We used the NeMo Skills pipeline to generate the synthetic data and to run inference with code execution. Hugging Face's SFTTrainer was used for fine-tuning, as the NeMo Skills training pipeline was too difficult to work with. Fine-tuning took 30 minutes on an RTX 3090.
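For reference, a minimal sketch of what the SFTTrainer setup could look like is below. This is not the exact training script: the data file name, the "text" column layout, and all hyperparameters are illustrative placeholders, and some argument names vary between trl versions.

```python
# Minimal SFT sketch with TRL's SFTTrainer (illustrative, not the exact script used here).
# Assumes a JSONL file where each row has a "text" column containing the full
# prompt + <llm-code> solution + \boxed{} answer string.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset(
    "json",
    data_files="alg514_openmath_solutions.jsonl",  # hypothetical file holding the ~8k synthetic rows
    split="train",
)

trainer = SFTTrainer(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="tinyllama-1.1b-openmath-alg514",
        per_device_train_batch_size=4,   # illustrative values; tune to fit a 24 GB RTX 3090
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
        save_steps=200,                  # saves a checkpoint every 200 steps, as in the eval table below
    ),
)
trainer.train()
```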

Sample Input:
System:
You're an expert Python programmer and mathematician.
Help the user to solve this problem using code when necessary.
Make sure to put the answer (and only answer) inside \boxed{}.

User:
High school graduating class is made up of 466 students. There are 212 more girls than boys. How many boys are in the class?

Assistant:

Sample Output:
Let's solve this problem using Python code.
<llm-code>
total_students = 466
more_girls = 212
number_of_girls = total_students - more_girls
number_of_boys = number_of_girls / 2
number_of_boys
</llm-code>
<llm-code-output>
127.0
</llm-code-output>
Thus the number of boys in the class is \boxed{127}.

The code interpreter halts generation at </llm-code>, inserts the result of the executed code between <llm-code-output> and </llm-code-output>, then continues generation. Without a code interpreter, the model will simply guess numbers for the output.
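For illustration, a rough sketch of such a loop (not the NeMo Skills implementation) is shown below: truncate each generation at the first new </llm-code> tag, run the snippet roughly the way a notebook cell would (echoing the value of a trailing bare expression), splice the result into <llm-code-output> tags, and resume generation. Running model-generated code without a sandbox is unsafe, so treat this purely as a sketch.

```python
# Rough sketch of a generate -> execute -> resume loop (not the NeMo Skills implementation).
import ast
import contextlib
import io

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # stand-in; point this at the fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)


def execute(code: str) -> str:
    """Run a generated snippet and return roughly what a notebook cell would display."""
    tree = ast.parse(code)
    buf, env, result = io.StringIO(), {}, None
    with contextlib.redirect_stdout(buf):
        if tree.body and isinstance(tree.body[-1], ast.Expr):
            # Evaluate a trailing bare expression separately so its value is captured.
            head = ast.Module(body=tree.body[:-1], type_ignores=[])
            exec(compile(head, "<llm-code>", "exec"), env)  # UNSAFE outside a sandbox
            tail = ast.Expression(body=tree.body[-1].value)
            result = eval(compile(tail, "<llm-code>", "eval"), env)
        else:
            exec(compile(tree, "<llm-code>", "exec"), env)
    return buf.getvalue() + ("" if result is None else str(result))


def generate_with_code(prompt: str, max_rounds: int = 3) -> str:
    text = prompt
    for _ in range(max_rounds):
        enc = tok(text, return_tensors="pt")
        out = model.generate(**enc, max_new_tokens=512, do_sample=False)
        new = tok.decode(out[0][enc.input_ids.shape[1]:], skip_special_tokens=True)
        if "</llm-code>" not in new:
            return text + new  # no code block this round, so treat it as the final answer
        # Halt at the closing tag, run the code, splice its output in, then keep generating.
        new = new.split("</llm-code>", 1)[0] + "</llm-code>"
        code = new.rsplit("<llm-code>", 1)[-1].rsplit("</llm-code>", 1)[0]
        text = text + new + "\n<llm-code-output>\n" + execute(code) + "\n</llm-code-output>\n"
    return text
```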

Notes from previous model cards:

We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters. This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint.

Eval

| Checkpoint | Coverage |
|-----------:|---------:|
| 1600 | 0.890244 |
| 2200 | 0.890244 |
| 2400 | 0.890244 |
| 2600 | 0.878049 |
| 1200 | 0.878049 |
| 2800 | 0.853659 |
| 2000 | 0.853659 |
| 800 | 0.841463 |
| 1000 | 0.829268 |
| 1800 | 0.829268 |
| 1400 | 0.817073 |
| mistral | 0.804878 |
| 3000 | 0.780488 |
| 600 | 0.768293 |
| 400 | 0.731707 |
| 200 | 0.682927 |
| 0 | 0.000000 |

Note that checkpoint_0 is the base model and checkpoint_mistral is OpenMath-Mistral-7B-v0.1-hf. Also note that after 800 steps the fine-tuned model had better coverage than the much larger teacher model.
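Coverage here is read as a pass@k-style metric: the fraction of problems for which at least one of the k sampled solutions reaches the reference answer. A sketch of that computation, using a simple regex pull of the \boxed{} answer (the exact matching logic behind the table may differ), could look like this:

```python
# Illustrative pass@k-style coverage: a problem counts as covered if any of its k sampled
# generations puts the reference answer inside \boxed{}. Assumes one numeric answer per problem.
import re


def boxed_answer(completion: str) -> str | None:
    """Pull the contents of the last \\boxed{...} from a generated solution."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None


def is_correct(predicted: str | None, reference: str) -> bool:
    if predicted is None:
        return False
    try:
        return abs(float(predicted) - float(reference)) < 1e-6
    except ValueError:
        return predicted == reference


def coverage(samples_per_problem: list[list[str]], references: list[str]) -> float:
    covered = sum(
        any(is_correct(boxed_answer(s), ref) for s in samples)
        for samples, ref in zip(samples_per_problem, references)
    )
    return covered / len(references)
```

For example, `coverage([[r"Thus the answer is \boxed{127}."]], ["127"])` returns 1.0.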

The zero-shot performance is _not good_™, but this model could be used to quickly generate synthetic data since its coverage is decent. The uploaded model is checkpoint-2600 (best zero-shot performance and top-4 coverage).

People involved in creating this fine-tune:
