
Quantization made by Richard Erkhov.

GitHub | Discord | Request more models

dart-math-dsmath-7b-prop2diff - GGUF

Original model description:

---
language:
- en
license: other
license_name: deepseek
license_link: https://github.com/deepseek-ai/DeepSeek-Math/blob/main/LICENSE-MODEL
library_name: transformers
tags:
- mathematics
datasets:
- hkust-nlp/dart-math-hard
metrics:
- accuracy
pipeline_tag: text-generation
base_model: deepseek-ai/deepseek-math-7b-base
model-index:
- name: dart-math-dsmath-7b-prop2diff
  results:
  - task:
      type: text-generation
      name: Mathematical Problem-Solving
    dataset:
      type: hendrycks/competition_math
      name: MATH
      split: test
    metrics:
    - type: accuracy
      name: Pass@1 (0-shot CoT)
      value: 53.6
  - task:
      type: text-generation
      name: Mathematical Problem-Solving
    dataset:
      type: openai/gsm8k
      name: GSM8K
      config: main
      split: test
    metrics:
    - type: accuracy
      name: Pass@1 (0-shot CoT)
      value: 86.8
  - task:
      type: text-generation
      name: Mathematical Problem-Solving
    dataset:
      type: college-math
      name: CollegeMath
    metrics:
    - type: accuracy
      name: Pass@1 (0-shot CoT)
      value: 40.7
  - task:
      type: text-generation
      name: Mathematical Problem-Solving
    dataset:
      type: deepmind-mathematics
      name: DeepMind-Mathematics
    metrics:
    - type: accuracy
      name: Pass@1 (0-shot CoT)
      value: 61.6
  - task:
      type: text-generation
      name: Mathematical Problem-Solving
    dataset:
      type: Hothan/OlympiadBench
      name: OlympiadBench-OE_TO_maths_en_COMP
      config: OE_TO_maths_en_COMP
      split: train
    metrics:
    - type: accuracy
      name: Pass@1 (0-shot CoT)
      value: 21.7
  - task:
      type: text-generation
      name: Mathematical Problem-Solving
    dataset:
      type: TIGER-Lab/TheoremQA
      name: TheoremQA
      split: test
    metrics:
    - type: accuracy
      name: Pass@1 (0-shot CoT)
      value: 32.2
---

DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

๐Ÿ“ Paper@arXiv | ๐Ÿค— Datasets&Models@HF | ๐Ÿฑ Code@GitHub

๐Ÿฆ Thread@X(Twitter) | ๐Ÿถ ไธญๆ–‡ๅšๅฎข@็ŸฅไนŽ | ๐Ÿ“Š Leaderboard@PapersWithCode | ๐Ÿ“‘ BibTeX

🔥 Excited to find our DART-Math-DSMath-7B (Prop2Diff) comparable to the AIMO winner NuminaMath-7B on CoT, despite being trained solely on the MATH & GSM8K prompt sets, leaving much room for improvement! Besides, our DART method is fully compatible with tool-integrated reasoning. Find more details and join the discussion under this X thread!

Models: DART-Math

DART-Math models achieve performance superior to or competitive with previous SOTAs on 2 in-domain and 4 challenging out-of-domain mathematical reasoning benchmarks, despite using much smaller datasets and no proprietary models like GPT-4.

| Model | MATH | GSM8K | College | DM | Olympiad | Theorem | AVG |
|---|---|---|---|---|---|---|---|
| GPT-4 (0314) | 52.6 | 94.7 | 24.4 | -- | -- | -- | -- |
| Llama-3-70B-MetaMath | 44.9 | 88.0 | 31.9 | 53.2 | 11.6 | 21.9 | 41.9 |
| DART-Math-Llama-3-70B (Uniform) | 54.9 | **90.4** | **38.5** | **64.1** | 19.1 | 27.4 | 49.1 |
| DART-Math-Llama-3-70B (Prop2Diff) | **56.1** | 89.6 | 37.9 | **64.1** | **20.0** | **28.2** | **49.3** |
| DeepSeekMath-7B-MetaMath | 43.7 | 81.8 | 33.7 | 53.0 | 13.6 | 23.2 | 41.5 |
| DeepSeekMath-7B-RL | 53.1 | 88.4 | 41.3 | 58.3 | 18.7 | 35.9 | 49.3 |
| DART-Math-DSMath-7B (Uniform) | 52.9 | **88.2** | 40.1 | 60.2 | 21.3 | **32.5** | 49.2 |
| DART-Math-DSMath-7B (Prop2Diff) | **53.6** | 86.8 | **40.7** | **61.6** | **21.7** | 32.2 | **49.4** |
| Mistral-7B-MetaMath | 29.8 | 76.5 | 19.3 | 28.0 | 5.9 | 14.0 | 28.9 |
| DART-Math-Mistral-7B (Uniform) | 43.5 | **82.6** | 26.9 | 42.0 | 13.2 | 16.4 | 37.4 |
| DART-Math-Mistral-7B (Prop2Diff) | **45.5** | 81.1 | **29.4** | **45.1** | **14.7** | **17.0** | **38.8** |
| Llama-3-8B-MetaMath | 32.5 | 77.3 | 20.6 | 35.0 | 5.5 | 13.8 | 30.8 |
| DART-Math-Llama-3-8B (Uniform) | 45.3 | **82.5** | 27.1 | **48.2** | 13.6 | 15.4 | 38.7 |
| DART-Math-Llama-3-8B (Prop2Diff) | **46.6** | 81.1 | **28.8** | 48.0 | **14.5** | **19.4** | **39.7** |

Abbreviations: College (CollegeMath), DM (DeepMind Mathematics), Olympiad (OlympiadBench-Math), Theorem (TheoremQA). Bold marks the best score achieved by SFT on the respective base model. To reproduce our results, please refer to the DART-Math GitHub repository.

Prompt Template

All the DART-Math models use the Alpaca prompt template:


Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{query}\n\n### Response:\n
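For example, a query can be wrapped into this template before generation. The sketch below is illustrative, not the evaluation harness: it assumes the original `hkust-nlp/dart-math-dsmath-7b-prop2diff` checkpoint and the transformers library, and the generation settings are arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query}\n\n### Response:\n"
)

model_id = "hkust-nlp/dart-math-dsmath-7b-prop2diff"  # original (non-GGUF) repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the math query in the Alpaca template and generate a CoT response.
prompt = PROMPT_TEMPLATE.format(query="What is the greatest common divisor of 84 and 144?")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```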

Training Dataset

We construct our training datasets by applying Difficulty-Aware Rejection Sampling (DARS) to the MATH and GSM8K training sets.

DARS tackles the severe bias towards easy queries in previous datasets, where sampling frequently fails to produce any correct response for the most challenging queries.

This bias stems from vanilla rejection sampling, which samples the same number of responses for each query even though the likelihood of obtaining a correct response for a difficult query is significantly lower, sometimes zero.
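To make the contrast concrete, here is a minimal, illustrative Python sketch of the difference; `generate_response` and `is_correct` are hypothetical stubs standing in for a real sampler and answer checker, and the budget logic is a simplification of the paper's actual DARS strategies.

```python
import random

def generate_response(query: str) -> str:
    # Hypothetical stub: sample one chain-of-thought response from the model.
    return f"response to {query!r}"

def is_correct(query: str, response: str) -> bool:
    # Hypothetical stub: check the final answer against the reference.
    return random.random() < 0.5

def vanilla_rejection_sampling(queries, k=4):
    # Same budget k for every query: hard queries often end up with zero
    # correct responses, biasing the dataset towards easy queries.
    data = {}
    for q in queries:
        responses = [generate_response(q) for _ in range(k)]
        data[q] = [r for r in responses if is_correct(q, r)]
    return data

def difficulty_aware_rejection_sampling(queries, target=4, max_tries=64):
    # Keep sampling until each query reaches `target` correct responses
    # (or the try budget runs out), so hard queries are not under-represented.
    data = {}
    for q in queries:
        correct, tries = [], 0
        while len(correct) < target and tries < max_tries:
            tries += 1
            r = generate_response(q)
            if is_correct(q, r):
                correct.append(r)
        data[q] = correct
    return data
```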

Please refer to DART-Math-Hard / DART-Math-Uniform for more details.

Training Setup

We perform standard instruction tuning on several base models, including Llama3-8B, Mistral-7B, and Llama3-70B as representatives of general models and DeepSeekMath-7B as a representative math-specialized model, using our synthetic datasets DART-Math-Hard and DART-Math-Uniform, which yields DART-Math (Prop2Diff) and DART-Math (Uniform) respectively.

For simplicity, we keep most hyper-parameters the same across different models and datasets:

  • Model max length (of packed sequence): 4096
  • Batch size: 64
  • Warm-up ratio: 0.03
  • Learning rate scheduler: cosine
  • Prompt template: Alpaca

Several other key hyper-parameters are tuned as follows:

| Base Model | Max. LR | # of Epochs | # of Grad. Acc. Steps | # of A100 GPUs |
|---|---|---|---|---|
| Mistral-7B | 1e-5 | 3 | 1 | 8 |
| Llama3-8B | 5e-5 | 1 | 2 | 8 |
| Llama3-70B | 2e-5 | 1 | 1 | 32 |
| DeepSeekMath-7B | 5e-5 | 3 | 1 | 8 |
  • For the maximum learning rate, we determine the value by searching over {1e-6, 5e-6, 1e-5, 2e-5, 5e-5, 1e-4} according to MATH performance after training on MMIQC for 1 epoch, except for Llama3-70B, which is too expensive to search for; we instead derive its learning rate from Llama3-8B's, in analogy to the ~2:1 ratio between the (pre-training) learning rates of Llama2-7B and Llama2-70B.
  • For Llama3 models, preliminary experiments indicate that training for 1 epoch consistently outperforms 3 epochs.
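As a rough illustration of how these settings fit together, here is a hedged sketch using transformers' TrainingArguments for the DeepSeekMath-7B row; the field values come from the tables above, but mapping them onto TrainingArguments (and the bf16 choice) is an assumption, not the authors' actual training script.

```python
from transformers import TrainingArguments

# DeepSeekMath-7B row: global batch 64 = 8 GPUs x 8 per-device x 1 grad-acc step.
args = TrainingArguments(
    output_dir="dart-math-dsmath-7b",
    num_train_epochs=3,                 # 3 epochs for DeepSeekMath-7B
    learning_rate=5e-5,                 # maximum learning rate from the table
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    bf16=True,                          # assumption: precision not stated here
)
```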

Please refer to Appendix A.1 of our paper for more details.

Other Details

  • For Mistral-7B-based models, we disable sliding_window by default, following the newest Mistral-7B-Instruct (Flash Attention 2 does not support sliding_window, and the xFormers backend in vLLM had ~10% lower throughput in our experiments).
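If you load a Mistral-7B-based DART-Math model with transformers, disabling the sliding window can be done through the config. A minimal sketch, assuming the `hkust-nlp/dart-math-mistral-7b-prop2diff` repo id and an installed flash-attn package:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "hkust-nlp/dart-math-mistral-7b-prop2diff"  # assumed repo id
config = AutoConfig.from_pretrained(model_id)
config.sliding_window = None  # disable sliding-window attention

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    attn_implementation="flash_attention_2",  # requires flash-attn
)
```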

Citation

If you find our data, model or code useful for your work, please kindly cite our paper:

@article{tong2024dartmath,
  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
  year={2024},
  eprint={2407.13690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.13690},
}
GGUF quantizations available: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit (llama architecture, 6.89B parameters).
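To run one of these quantized files locally, a llama-cpp-python sketch like the following should work; the exact GGUF filename is an assumption and should be checked against the files in this repo.

```python
from llama_cpp import Llama

# Filename is an assumption -- check the repo's file list for the exact quant name.
llm = Llama(model_path="dart-math-dsmath-7b-prop2diff.Q4_K_M.gguf", n_ctx=4096)

# Use the same Alpaca template the DART-Math models were trained with.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nSolve: 12 * 13 = ?\n\n### Response:\n"
)
out = llm(prompt, max_tokens=256, temperature=0.0)
print(out["choices"][0]["text"])
```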
