JiuZhang-8B Model
JiuZhang-8B is a math-specific base model obtained by continued pre-training of Llama3-8B on 140B tokens, of which 100B are math-related.
Features
Excellent reasoning performance: JiuZhang-8B achieves accuracy comparable to the Qwen2.5-Math-7B base model on four math benchmarks: GSM8K, MATH, GAOKAO, and ZHONGKAO. On the four-benchmark average, it also surpasses much larger general base models such as Llama3.1-70B and Qwen2-72B.
Good general capabilities: JiuZhang-8B scores 0.622 on MMLU, in line with its base model Llama3-8B, which indicates that general performance on tasks other than math reasoning is maintained.
Self-correction ability: JiuZhang-8B can check its own reasoning and correct errors along the way, a behavior that results from the large proportion of synthetic data used in training. The model has not undergone any post-training, so it can be instruction fine-tuned or format-trained as needed.
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: set this to the local path or Hub ID of the JiuZhang-8B checkpoint.
model_path = "path/to/JiuZhang-8B"

# device_map="auto" places the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

example = "Find $x$ such that $\\lceil x \\rceil + x = \\dfrac{23}{7}$. Express $x$ as a common fraction."
prompt = f"Solve the following problem step by step. Question: {example}\nSolution:"

# Move the inputs to the device that holds the model.
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False); max_length counts prompt plus generated tokens.
generated_ids = model.generate(**model_inputs, do_sample=False, max_length=2048)

# Drop the prompt tokens so that only the generated continuation is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(response)
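
The Quick Start prompt is a plain completion-style template, which suits a base model that has no chat format. The following is a minimal sketch of a helper that reproduces that template and can optionally prepend worked examples for few-shot prompting. The names `build_prompt` and `shots`, and the way worked examples are joined with blank lines, are assumptions for illustration; the few-shot format used in the evaluation is not specified here.

```python
from typing import Sequence, Tuple

# Instruction text copied from the Quick Start prompt.
INSTRUCTION = "Solve the following problem step by step."


def build_prompt(question: str, shots: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a completion-style prompt for a base model.

    With no shots, the result matches the Quick Start template exactly.
    The few-shot layout (worked examples separated by blank lines) is an
    assumption, not a documented evaluation format.
    """
    # Each worked example repeats the template with its solution filled in.
    parts = [f"{INSTRUCTION} Question: {q}\nSolution: {s}" for q, s in shots]
    # The final block leaves the solution open for the model to complete.
    parts.append(f"{INSTRUCTION} Question: {question}\nSolution:")
    return "\n\n".join(parts)


zero_shot = build_prompt("What is 2 + 3?")
few_shot = build_prompt(
    "What is 2 + 3?",
    shots=[("What is 1 + 1?", "1 + 1 = 2. The answer is 2.")],
)
print(few_shot)
```

The returned string can be passed to the tokenizer in place of `prompt` in the Quick Start code.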
Performance
We compared JiuZhang-8B with popular models on four math benchmarks: GSM8K, MATH, GAOKAO, and ZHONGKAO.
- All results in the following table use greedy decoding. For each dataset, the reported accuracy (in %) is the better of the zero-shot and few-shot results.
- A comparison model is used to judge whether each generated answer matches the reference answer.
- The Average column is the arithmetic mean of the accuracies on the four datasets.
Model | GSM8K | MATH | GAOKAO | ZHONGKAO | Average |
---|---|---|---|---|---|
General models | | | | | |
Meta-Llama-3-8B | 58.38 | 17.04 | 13.62 | 42.61 | 32.91 |
Meta-Llama-3-70B | 82.34 | 38.42 | 28.09 | 64.02 | 53.21 |
Meta-Llama-3.1-8B | 56.79 | 19.70 | 11.49 | 44.70 | 33.17 |
Meta-Llama-3.1-70B | 81.73 | 39.66 | 31.06 | 64.77 | 54.31 |
Qwen2-7B | 80.44 | 47.82 | 27.23 | 70.45 | 56.49 |
Qwen2-72B | 86.58 | 56.88 | 45.11 | 73.67 | 65.56 |
Qwen2.5-7B | 84.61 | 53.22 | 45.53 | 80.30 | 65.92 |
Qwen2.5-72B | 90.60 | 59.38 | 56.60 | 82.95 | 72.38 |
Math-specific models | | | | | |
Llemma-7B | 41.47 | 18.94 | 14.89 | 45.08 | 30.10 |
Deepseek-Math-7B-Base | 65.73 | 33.40 | 23.83 | 62.69 | 46.41 |
Qwen2-Math-7B | 80.67 | 53.02 | 42.13 | 77.08 | 63.22 |
Qwen2-Math-72B | 88.63 | 61.88 | 51.91 | 81.25 | 70.92 |
Qwen2.5-Math-7B | 85.44 | 59.10 | 53.19 | 78.79 | 69.13 |
Qwen2.5-Math-72B | 88.70 | 67.10 | 53.62 | 81.63 | 72.76 |
JiuZhang-8B | 81.20 | 60.38 | 60.43 | 80.49 | 70.62 |
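
The Average column can be checked directly from the per-benchmark accuracies. The following is a minimal sketch of that bookkeeping, using the JiuZhang-8B row from the table above. The helper names `best_setting` and `average_accuracy` are hypothetical, and the table reports only the better of the zero-shot and few-shot results, so no per-setting numbers are shown.

```python
from statistics import mean
from typing import Dict

# Per-benchmark accuracies (%) for JiuZhang-8B, copied from the table above.
jiuzhang_8b = {"GSM8K": 81.20, "MATH": 60.38, "GAOKAO": 60.43, "ZHONGKAO": 80.49}


def best_setting(zero_shot: float, few_shot: float) -> float:
    """Return the reported accuracy: the better of the two prompting settings."""
    return max(zero_shot, few_shot)


def average_accuracy(scores: Dict[str, float]) -> float:
    """Unweighted arithmetic mean over the benchmarks."""
    return mean(list(scores.values()))


print(f"{average_accuracy(jiuzhang_8b):.3f}")
```

The mean comes out at about 70.625, in line with the 70.62 reported in the table.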
Acknowledgements
Thanks to all contributors who have helped in developing this model.