Control-LLM-Llama3.1-8B-Math16

This model is fine-tuned from Llama-3.1-8B-Instruct for mathematical tasks on the OpenMath2 dataset.
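
Below is a minimal usage sketch with the Hugging Face `transformers` library. The repository id is taken from this card; `transformers` compatibility and the chat-template call are assumptions, not confirmed by the card.

```python
# Hedged sketch: assumes the checkpoint loads with standard transformers
# AutoModelForCausalLM / AutoTokenizer classes and ships a Llama-style chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",
)

messages = [{"role": "user", "content": "What is 12 * 17 + 5?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```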

Evaluation Results

Here is an overview of the evaluation results and findings:

Benchmark Results Table

The table below summarizes evaluation results across mathematical tasks and original capabilities.

| Model | MH | M | G8K | M-Avg | ARC | GPQA | MLU | MLUP | O-Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1-8B-Inst | 23.7 | 50.9 | 85.6 | 52.1 | 83.4 | 29.9 | 72.4 | 46.7 | 60.5 | 56.3 |
| Control LLM* | 36.0 | 61.7 | 89.7 | 62.5 | 82.5 | 30.8 | 71.6 | 45.4 | 57.6 | 60.0 |

Explanation:

  • MH: Math Hard
  • M: Math
  • G8K: GSM8K
  • M-Avg: Math average across Math Hard, Math, and GSM8K
  • ARC: ARC benchmark
  • GPQA: Graduate-Level Google-Proof Q&A benchmark
  • MLU: MMLU (Massive Multitask Language Understanding)
  • MLUP: MMLU Pro
  • O-Avg: Original-capability average across ARC, GPQA, MMLU, and MMLU Pro
  • Overall: Combined average across all tasks (a worked check follows this list)
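
As a quick worked check of the last column: the Overall score appears to be the simple mean of M-Avg and O-Avg. This interpretation is an assumption (the card only says "combined average"), but it reproduces both reported values.

```python
# Assumption: Overall = mean of M-Avg and O-Avg from the table above.
rows = {
    "Llama3.1-8B-Inst": {"m_avg": 52.1, "o_avg": 60.5},  # baseline row
    "Control LLM*":     {"m_avg": 62.5, "o_avg": 57.6},  # Control LLM row
}
for name, r in rows.items():
    overall = (r["m_avg"] + r["o_avg"]) / 2
    print(f"{name}: Overall = {overall:.1f}")
# Prints 56.3 and 60.0, matching the Overall column.
```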

Catastrophic Forgetting on OpenMath

The following plot illustrates and compares catastrophic forgetting mitigation during training.

[Figure: Catastrophic forgetting comparison on OpenMath]

Alignment Result

The plot below highlights the alignment result of the model trained with Control LLM.

[Figure: Alignment result]

Model size: 9.78B params · Tensor type: BF16 · Format: Safetensors

Model tree for ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct

Fine-tuned from Llama-3.1-8B-Instruct.

Dataset used to train ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct

OpenMath2

Evaluation results

  • exact_match,none on Math, Math Hard, GSM8K
    self-reported
    0.621
  • exact_match,none (gsm8k_0shot_instruct) on Math, Math Hard, GSM8K
    self-reported
    0.897
  • exact_match,none (meta_math_0shot_instruct) on Math, Math Hard, GSM8K
    self-reported
    0.617
  • exact_match,none (meta_math_hard_0shot_instruct) on Math, Math Hard, GSM8K
    self-reported
    0.360
  • exact_match,strict-match on Llama-3.1-8B-Instruct-evals Dataset
    self-reported
    0.600
  • exact_match,strict-match (meta_arc_0shot_instruct) on Llama-3.1-8B-Instruct-evals Dataset
    self-reported
    0.825
  • exact_match,strict-match (meta_gpqa_0shot_cot_instruct) on Llama-3.1-8B-Instruct-evals Dataset
    self-reported
    0.308
  • exact_match,strict-match (meta_mmlu_0shot_instruct) on Llama-3.1-8B-Instruct-evals Dataset
    self-reported
    0.716
  • exact_match,strict-match (meta_mmlu_pro_5shot_instruct) on Llama-3.1-8B-Instruct-evals Dataset
    self-reported
    0.454
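
For reference, the task names above (e.g. `gsm8k_0shot_instruct`, `meta_mmlu_0shot_instruct`) follow lm-evaluation-harness-style naming for the Llama-3.1 evals. The sketch below is a hedged reproduction recipe: it assumes those task configs are registered in your local harness installation, which this card does not guarantee.

```python
# Hedged sketch: assumes lm-evaluation-harness is installed and that the
# task configs named in this card are available in your environment.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct,dtype=bfloat16",
    tasks=[
        "gsm8k_0shot_instruct",
        "meta_math_0shot_instruct",
        "meta_math_hard_0shot_instruct",
    ],
    batch_size=8,
)

# Print the exact-match score reported for each task.
for task, metrics in results["results"].items():
    print(task, metrics.get("exact_match,none"))
```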