metadata

license: cc-by-nc-4.0
base_model: mlabonne/NeuralMonarch-7B
tags:
  - generated_from_trainer
  - axolotl
  - mistral
  - instruct
  - finetune
  - chatml
  - gpt4
  - synthetic data
  - distillation
model-index:
  - name: AlphaMonarch-laser
    results: []
datasets:
  - argilla/OpenHermes2.5-dpo-binarized-alpha
language:
  - en
library_name: transformers
pipeline_tag: text-generation

AlphaMonarch-laser

AlphaMonarch-laser is a DPO fine-tuned of mlabonne/NeuralMonarch-7B using the argilla/OpenHermes2.5-dpo-binarized-alpha preference dataset but achieves better performance then mlabonne/AlphaMonarch-7B using LaserQLoRA. We have fine-tuned this model only on half of the projections, but have achieved better results as compared to the version released by Maximme Labonne. We have trained this model for 1080 steps.

AlphaMonarch-laser is ranking 1 on YALL - Yet Another LLM Leaderboard.

🏆 Evaluation results

Nous Benchmark

AGIEVAL

Task	Version	Metric	Value	StdErr
agieval_aqua_rat	0	acc	28.35%	2.83%
agieval_aqua_rat	0	acc_norm	26.38%	2.77%
agieval_logiqa_en	0	acc	38.25%	1.91%
agieval_logiqa_en	0	acc_norm	38.10%	1.90%
agieval_lsat_ar	0	acc	23.91%	2.82%
agieval_lsat_ar	0	acc_norm	23.48%	2.80%
agieval_lsat_lr	0	acc	52.75%	2.21%
agieval_lsat_lr	0	acc_norm	53.92%	2.21%
agieval_lsat_rc	0	acc	66.91%	2.87%
agieval_lsat_rc	0	acc_norm	67.29%	2.87%
agieval_sat_en	0	acc	78.64%	2.86%
agieval_sat_en	0	acc_norm	78.64%	2.86%
agieval_sat_en_without_passage	0	acc	45.15%	3.48%
agieval_sat_en_without_passage	0	acc_norm	44.17%	3.47%
agieval_sat_math	0	acc	33.18%	3.18%
agieval_sat_math	0	acc_norm	31.36%	3.14%
Average: 28.41%

GPT4ALL

Task	Version	Metric	Value	StdErr
arc_challenge	0	acc	66.30%	± 1.38%
		acc_norm	68.26%	± 1.36%
arc_easy	0	acc	86.57%	± 0.70%
		acc_norm	80.81%	± 0.81%
boolq	1	acc	87.16%	± 0.59%
hellaswag	0	acc	69.60%	± 0.46%
		acc_norm	87.45%	± 0.33%
openbookqa	0	acc	39.20%	± 2.19%
		acc_norm	49.60%	± 2.24%
piqa	0	acc	83.03%	± 0.88%
		acc_norm	84.87%	± 0.84%
winogrande	0	acc	81.06%	± 1.10%
Average: 76.98%

TRUTHFUL-QA

Task	Version	Metric	Value	StdErr
truthfulqa_mc	1	mc1	63.04%	± 1.69%
truthfulqa_mc	1	mc2	78.39%	± 1.37%
Average: 70.71%

BIGBENCH

Task	Version	Metric	Value	StdErr
bigbench_causal_judgement	0	multiple_choice_grade	60.00%	± 3.56%
bigbench_date_understanding	0	multiple_choice_grade	62.06%	± 2.53%
bigbench_disambiguation_qa	0	multiple_choice_grade	54.26%	± 3.11%
bigbench_geometric_shapes	0	multiple_choice_grade	23.96%	± 2.26%
		exact_str_match	0.00%	± 0.00%
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	32.80%	± 2.10%
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	23.86%	± 1.61%
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	59.33%	± 2.84%
bigbench_movie_recommendation	0	multiple_choice_grade	58.00%	± 2.21%
bigbench_navigate	0	multiple_choice_grade	56.00%	± 1.57%
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	69.20%	± 1.03%
bigbench_ruin_names	0	multiple_choice_grade	55.36%	± 2.35%
bigbench_salient_translation_error_detection	0	multiple_choice_grade	41.48%	± 1.56%
bigbench_snarks	0	multiple_choice_grade	73.48%	± 3.29%
bigbench_sports_understanding	0	multiple_choice_grade	76.06%	± 1.36%
bigbench_temporal_sequences	0	multiple_choice_grade	55.50%	± 1.57%
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	23.28%	± 1.20%
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	19.37%	± 0.94%
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	59.33%	± 2.84%
Average: 55.37%

Openllm Benchmark

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	70.12	±	1.30
		acc_norm	73.27	±	1.29
hellaswag	0	acc	71.80	±	0.44
		acc_norm	89.20	±	0.30
gsm8k	0	acc	66.77	±	1.2
winogrande	0	acc	84.6	±	1.0

Average: 73.5%

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	62.79	±	1.69
		mc2	77.90	±	1.37

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 1
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 8
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 100
training_steps: 1080

📝 Axolotl Configuration

base_model: mlabonne/NeuralMonarch-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: true
strict: false
rl: dpo
chat_template: chatml
datasets:
  - path: mlabonne/chatml-OpenHermes2.5-dpo-binarized-alpha
    split: train
    type: chatml.intel
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./out
adapter: qlora
lora_model_dir:
sequence_len: 1800
sample_packing: false
pad_to_sequence_len: false
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
 - layers.1.self_attn.q_proj
 - layers.0.self_attn.q_proj
 - layers.15.self_attn.q_proj
 - layers.12.self_attn.q_proj
 - layers.11.self_attn.q_proj
 - layers.14.self_attn.q_proj
 - layers.9.self_attn.q_proj
 - layers.16.self_attn.q_proj
 - layers.30.self_attn.q_proj
 - layers.18.self_attn.q_proj
 - layers.13.self_attn.q_proj
 - layers.10.self_attn.q_proj
 - layers.7.self_attn.q_proj
 - layers.8.self_attn.q_proj
 - layers.4.self_attn.q_proj
 - layers.19.self_attn.q_proj
 - layers.27.self_attn.k_proj
 - layers.24.self_attn.k_proj
 - layers.25.self_attn.k_proj
 - layers.22.self_attn.k_proj
 - layers.26.self_attn.k_proj
 - layers.29.self_attn.k_proj
 - layers.23.self_attn.k_proj
 - layers.28.self_attn.k_proj
 - layers.21.self_attn.k_proj
 - layers.31.self_attn.k_proj
 - layers.30.self_attn.k_proj
 - layers.20.self_attn.k_proj
 - layers.5.self_attn.k_proj
 - layers.19.self_attn.k_proj
 - layers.17.self_attn.k_proj
 - layers.18.self_attn.k_proj
 - layers.19.self_attn.v_proj
 - layers.24.self_attn.v_proj
 - layers.18.self_attn.v_proj
 - layers.5.self_attn.v_proj
 - layers.3.self_attn.v_proj
 - layers.16.self_attn.v_proj
 - layers.23.self_attn.v_proj
 - layers.27.self_attn.v_proj
 - layers.25.self_attn.v_proj
 - layers.26.self_attn.v_proj
 - layers.20.self_attn.v_proj
 - layers.6.self_attn.v_proj
 - layers.15.self_attn.v_proj
 - layers.17.self_attn.v_proj
 - layers.29.self_attn.v_proj
 - layers.22.self_attn.v_proj
 - layers.12.self_attn.o_proj
 - layers.9.self_attn.o_proj
 - layers.14.self_attn.o_proj
 - layers.0.self_attn.o_proj
 - layers.6.self_attn.o_proj
 - layers.8.self_attn.o_proj
 - layers.10.self_attn.o_proj
 - layers.11.self_attn.o_proj
 - layers.13.self_attn.o_proj
 - layers.24.self_attn.o_proj
 - layers.7.self_attn.o_proj
 - layers.15.self_attn.o_proj
 - layers.5.self_attn.o_proj
 - layers.17.self_attn.o_proj
 - layers.25.self_attn.o_proj
 - layers.4.self_attn.o_proj
 - layers.31.mlp.gate_proj
 - layers.30.mlp.gate_proj
 - layers.4.mlp.gate_proj
 - layers.3.mlp.gate_proj
 - layers.29.mlp.gate_proj
 - layers.28.mlp.gate_proj
 - layers.6.mlp.gate_proj
 - layers.27.mlp.gate_proj
 - layers.5.mlp.gate_proj
 - layers.26.mlp.gate_proj
 - layers.25.mlp.gate_proj
 - layers.7.mlp.gate_proj
 - layers.2.mlp.gate_proj
 - layers.24.mlp.gate_proj
 - layers.23.mlp.gate_proj
 - layers.10.mlp.gate_proj
 - layers.6.mlp.up_proj
 - layers.4.mlp.up_proj
 - layers.5.mlp.up_proj
 - layers.27.mlp.up_proj
 - layers.25.mlp.up_proj
 - layers.26.mlp.up_proj
 - layers.17.mlp.up_proj
 - layers.24.mlp.up_proj
 - layers.7.mlp.up_proj
 - layers.10.mlp.up_proj
 - layers.3.mlp.up_proj
 - layers.11.mlp.up_proj
 - layers.23.mlp.up_proj
 - layers.9.mlp.up_proj
 - layers.14.mlp.up_proj
 - layers.18.mlp.up_proj
 - layers.19.mlp.down_proj
 - layers.20.mlp.down_proj
 - layers.18.mlp.down_proj
 - layers.21.mlp.down_proj
 - layers.29.mlp.down_proj
 - layers.1.mlp.down_proj
 - layers.22.mlp.down_proj
 - layers.28.mlp.down_proj
 - layers.23.mlp.down_proj
 - layers.30.mlp.down_proj
 - layers.17.mlp.down_proj
 - layers.4.mlp.down_proj
 - layers.2.mlp.down_proj
 - layers.15.mlp.down_proj
 - layers.5.mlp.down_proj
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 5e-7
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 1
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 1080
max_steps: 1080
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Framework versions

Transformers 4.38.0.dev0
Pytorch 2.1.2+cu118
Datasets 2.17.0
Tokenizers 0.15.0
axolotl: 0.4.0