---
license: apache-2.0
datasets:
- gsm8k
metrics:
- accuracy
---

# Model Card for math-codet5p-770m-py

We distill the math reasoning ability of the large language model gpt-3.5-turbo into the small open-source code language model [Salesforce/codet5p-770m-py](https://huggingface.co/Salesforce/codet5p-770m-py). The resulting model, math-codet5p-770m-py, achieves 44.88% accuracy on the GSM8K test set.

### Model Description

- **Developed by:** Xunyu Zhu
- **Model type:** encoder-decoder
- **Language(s) (NLP):** python
- **License:** apache-2.0
- **Finetuned from model:** [Salesforce/codet5p-770m-py](https://huggingface.co/Salesforce/codet5p-770m-py)

## Uses

### Direct Use

This model can be loaded with `AutoModelForSeq2SeqLM` and uses the same tokenizer as the original [Salesforce/codet5p-770m-py](https://huggingface.co/Salesforce/codet5p-770m-py).

Given a question, the prompt "\nProgram: Let’s design executable python program (return ans) to solve the question." must be appended to the input to instruct the model to generate a reasoning program. The generated program is then executed, and the value of its `ans` variable is taken as the answer. The example below also requires the `func_timeout` package.

```python
import func_timeout
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def safe_execute(code_string: str, keys=None):
    """Run a generated program with a 5-second timeout and return `ans` (or the requested keys)."""

    def execute(x):
        try:
            # Use an explicit namespace so the program's variables can be read back reliably.
            namespace = {}
            exec(x, namespace)
            if keys is None:
                return namespace.get("ans", None)
            return [namespace.get(k, None) for k in keys]
        except Exception:
            return None

    try:
        ans = func_timeout.func_timeout(5, execute, args=(code_string,))
    except func_timeout.FunctionTimedOut:
        ans = None
    return ans


checkpoint = "zhuxunyu/math-codet5p-770m-py"
device = "cuda"  # use "cpu" for CPU-only inference

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

question = "Question: Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\nProgram: Let’s design executable python program (return ans) to solve the question."

inputs = tokenizer(question, max_length=256, padding="max_length", truncation=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_length=256)

# generate() returns a batch of sequences; decode the first (and only) one.
generation = tokenizer.decode(output[0], skip_special_tokens=True)
ans = safe_execute(generation)
# safe_execute returns None if the program fails, times out, or never sets `ans`.
print(float(ans) if ans is not None else None)
```

## Training Details

### Training Data

We prompt gpt-3.5-turbo to generate reasoning programs that solve the questions in the GSM8K training set, with 4 reasoning programs per question. The questions and their corresponding reasoning programs form the training dataset used to fine-tune the model.

## Evaluation

### Testing Data

The testing data is the GSM8K test set.

### Results

math-codet5p-770m-py achieves 44.88% accuracy on the GSM8K test set.

## Citation

**BibTeX:**

```
@misc{zhu2023mathcodet5plus,
  title={math-codet5p-770m-py},
  author={Xunyu Zhu and Jian Li and Yong Liu and Can Ma and Weiping Wang},
  year={2023}
}
```
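
To make the prompt format from Direct Use and the accuracy metric from Results more concrete, here is a minimal sketch of how a prompt might be assembled and how executed answers might be scored against GSM8K references. It is not the authors' evaluation script: the helper names (`build_prompt`, `extract_gold`, `is_correct`, `accuracy`) and the numeric tolerance are assumptions made for illustration. The sketch relies only on the fact that GSM8K reference answers end with a `#### <number>` line.

```python
import math

# Instruction suffix quoted from the "Direct Use" section of this card.
PROMPT_SUFFIX = (
    "\nProgram: Let’s design executable python program (return ans) "
    "to solve the question."
)


def build_prompt(question: str) -> str:
    """Prefix the question with 'Question: ' and append the instruction suffix."""
    return f"Question: {question.strip()}{PROMPT_SUFFIX}"


def extract_gold(answer_text: str) -> float:
    """Parse the final number after '####' in a GSM8K reference answer."""
    final = answer_text.split("####")[-1].strip().replace(",", "")
    return float(final)


def is_correct(pred, gold: float, tol: float = 1e-3) -> bool:
    """Count None or non-numeric predictions as wrong; tol is an assumed tolerance."""
    try:
        return math.isclose(float(pred), gold, abs_tol=tol)
    except (TypeError, ValueError):
        return False


def accuracy(preds, golds) -> float:
    """Fraction of predictions that match their reference answers."""
    pairs = list(zip(preds, golds))
    if not pairs:
        return 0.0
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)
```

In use, each test question would be passed through `build_prompt`, the generated program executed with `safe_execute`, and the resulting values collected into `preds` before calling `accuracy(preds, golds)`.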