TinyGSM: achieving >80% on GSM8k with small language models

Published on Dec 14, 2023
· Featured in Daily Papers on Dec 15, 2023


Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark is 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our model's training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, 2) the use of a verifier, which selects the final outputs from multiple candidate generations.
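As a rough illustration of the second component, the verifier step can be read as best-of-n selection: generate several candidate solutions, score each one, and keep the highest-scoring candidate. The sketch below is not the authors' implementation. The function `select_with_verifier` and the scorer `toy_score` are hypothetical names, and `toy_score` is only a stand-in that checks whether a candidate parses as Python; in the paper that role is played by a trained 1.3B verifier model.

```python
import ast
from typing import Callable, Sequence


def select_with_verifier(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate solution that the scorer rates highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score(question, c))


def toy_score(question: str, candidate: str) -> float:
    """Hypothetical stand-in for a trained verifier: 1.0 if the code parses."""
    try:
        ast.parse(candidate)
        return 1.0
    except SyntaxError:
        return 0.0


# Made-up candidates: the first has a syntax error, the second is valid Python.
candidates = [
    "def solve(:\n    return 7",
    "def solve():\n    return 3 + 4",
]
best = select_with_verifier("What is 3 + 4?", candidates, toy_score)
```

With a real verifier, `score` would return something like the model's estimated probability that the candidate is correct; the selection logic itself stays the same.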


This will open the door to running very specific models locally, making AI accessible for all children everywhere. Instead of needing an AI tutor with high generalization, we can have a tutor that answers the same questions asked many years ago. By using a verifier trained on a tiny amount of data from GSM, we can intentionally contaminate it, resulting in an SLM that is good at answering GSM-like questions. This is indeed a smart move!

Intentional overfitting or contamination can be beneficial, especially for educational AI tutors. For instance, Grade 7 math questions haven't changed significantly over time. A specialized AI tutor for this grade should focus on these specific questions, using overfitting as a tool for precision rather than generalization. This approach aligns with the educational domain's needs, ensuring that the AI remains focused on relevant material.

I wonder if you guys can share the TinyGSM dataset? I would like to try your approach for other STEM topics and different grades, to have many SLMs, each specialized in one topic and one grade.



Thank you for sharing the work!

Regarding the construction of the TinyGSM dataset used for training, I was wondering if some arrangements/checks were made to avoid coincidental leakage of duplicates or near-duplicates of GSM8K's test set. As the scale and diversity were the main objectives in creating the dataset, it might be worth checking.

Once the TinyGSM dataset is available, we can also run the check ourselves, like we did with other math reasoning datasets where we found this to be a common issue.
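One common way to run such a check is word-level n-gram overlap between training questions and test questions. The sketch below assumes that approach; it is not necessarily the check used by the TinyGSM authors or by the commenter. The helper names `ngrams` and `find_overlaps`, the window size `n=8`, and the example questions are all made up for illustration.

```python
import re


def ngrams(text, n):
    """Return the set of word-level n-grams of a lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_overlaps(train_questions, test_questions, n=8):
    """Return sorted (train_idx, test_idx) pairs sharing at least one n-gram."""
    # Index every test n-gram so each training question is scanned only once.
    index = {}
    for j, question in enumerate(test_questions):
        for gram in ngrams(question, n):
            index.setdefault(gram, set()).add(j)
    hits = set()
    for i, question in enumerate(train_questions):
        for gram in ngrams(question, n):
            for j in index.get(gram, ()):
                hits.add((i, j))
    return sorted(hits)


# Made-up examples: the first training question is a near-duplicate of the test one.
test_set = ["A farmer has 12 cows and buys 5 more cows at the market. How many cows now?"]
train_set = [
    "A farmer has 12 cows and buys 5 more cows at the fair. How many does he own?",
    "Lisa reads 20 pages per day for 7 days. How many pages in total?",
]
overlaps = find_overlaps(train_set, test_set, n=8)
```

A smaller `n` catches looser paraphrases at the cost of more false positives, so flagged pairs are best reviewed by hand before being dropped.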

Dear Ronen Eldan, I would like to try to finetune this Math GSM model to improve its performance. If the model is reliable, then it could become a module in the TimeCapsuleTeacher(TM) platform, for teaching Math. Can you give me the and and configs and remote access to fast GPU/TPU to finetune a separate version of the GSM model weights according to my own MathTrain.txt and finetuning methods? Is there a simple way to automatically run performance benchmarks on the finetuned model periodically during finetuning? (One local save of model weights to support an instance of inference mode, while the finetuning still proceeds.)
