# Control-LLM-Llama3.1-8B-Math16
This is a fine-tuned version of Llama-3.1-8B-Instruct for mathematical tasks, trained on the OpenMath2 dataset.
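For reference, a minimal inference sketch using the standard `transformers` chat API (the model ID matches this repository; the prompt and generation settings are illustrative, not from the card):

```python
# Minimal inference sketch using the standard transformers chat API.
# Assumes a GPU with enough memory for an 8B model in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Llama-3.1-Instruct-style chat prompt; the question is illustrative.
messages = [{"role": "user", "content": "Compute 17 * 24. Show your steps."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```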
## Evaluation Results
Here is an overview of the evaluation results and findings:
### Benchmark Results Table
The table below summarizes evaluation results across mathematical tasks and original capabilities.

| Model | MH | M | G8K | M-Avg | ARC | GPQA | MLU | MLUP | O-Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1-8B-Inst | 23.7 | 50.9 | 85.6 | 52.1 | 83.4 | 29.9 | 72.4 | 46.7 | 60.5 | 56.3 |
| Control LLM* | 36.0 | 61.7 | 89.7 | 62.5 | 82.5 | 30.8 | 71.6 | 45.4 | 57.6 | 60.0 |
Explanation:
- MH: MathHard
- M: Math
- G8K: GSM8K
- M-Avg: math average across MathHard, Math, and GSM8K
- ARC: AI2 Reasoning Challenge benchmark
- GPQA: Graduate-level Google-Proof Q&A benchmark
- MLU: MMLU (Massive Multitask Language Understanding)
- MLUP: MMLU Pro
- O-Avg: original-capability average across ARC, GPQA, MMLU, and MMLU Pro
- Overall: combined average across all tasks (see the consistency check below)
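The card does not state the aggregation rule for the Overall column, but the numbers are consistent with it being the mean of M-Avg and O-Avg rather than a uniform mean over all eight tasks. A quick check of that assumption:

```python
# Sanity check (assumption): Overall looks like the mean of M-Avg and O-Avg,
# not a uniform mean over the eight individual task scores.
rows = {
    "Llama3.1-8B-Inst": (52.1, 60.5),  # (M-Avg, O-Avg) from the table above
    "Control LLM*": (62.5, 57.6),
}
for name, (m_avg, o_avg) in rows.items():
    overall = (m_avg + o_avg) / 2
    print(f"{name}: {overall:.1f}")  # -> 56.3 and 60.0, matching the Overall column
```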
## Catastrophic Forgetting on OpenMath
The plot below illustrates and compares catastrophic-forgetting mitigation during training.
## Alignment Result
The plot below highlights the alignment result of the model trained with Control LLM.
## Model tree for ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct
- Base model: meta-llama/Llama-3.1-8B
- Fine-tuned from: meta-llama/Llama-3.1-8B-Instruct
## Dataset used to train ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct
- OpenMath2
## Evaluation results (self-reported)

| Metric | Task | Dataset | Value |
|---|---|---|---|
| exact_match,none | — | Math, Math Hard, GSM8K | 0.621 |
| exact_match,none | gsm8k_0shot_instruct | Math, Math Hard, GSM8K | 0.897 |
| exact_match,none | meta_math_0shot_instruct | Math, Math Hard, GSM8K | 0.617 |
| exact_match,none | meta_math_hard_0shot_instruct | Math, Math Hard, GSM8K | 0.360 |
| exact_match,strict-match | — | Llama-3.1-8B-Instruct-evals Dataset | 0.600 |
| exact_match,strict-match | meta_arc_0shot_instruct | Llama-3.1-8B-Instruct-evals Dataset | 0.825 |
| exact_match,strict-match | meta_gpqa_0shot_cot_instruct | Llama-3.1-8B-Instruct-evals Dataset | 0.308 |
| exact_match,strict-match | meta_mmlu_0shot_instruct | Llama-3.1-8B-Instruct-evals Dataset | 0.716 |
| exact_match,strict-match | meta_mmlu_pro_5shot_instruct | Llama-3.1-8B-Instruct-evals Dataset | 0.454 |
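The task names above follow the naming of Meta's Llama eval configs for lm-evaluation-harness. As a hedged sketch, the harness's Python API can score this model on a stock task; the exact task configs behind the self-reported numbers (e.g. `gsm8k_0shot_instruct`) are assumed to be custom and are not bundled with this card:

```python
# Sketch: scoring the model with lm-evaluation-harness (pip install lm-eval).
# The stock gsm8k task is used as a stand-in for the card's custom configs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=0,
)
print(results["results"]["gsm8k"])  # includes exact_match scores
```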