bartowski's picture
Quant for 3.5
f2fdd56 verified
metadata
license: cc-by-nc-4.0
base_model: mlabonne/NeuralMonarch-7B
tags:
  - generated_from_trainer
  - axolotl
  - mistral
  - instruct
  - finetune
  - chatml
  - gpt4
  - synthetic data
  - distillation
model-index:
  - name: AlphaMonarch-laser
    results: []
datasets:
  - argilla/OpenHermes2.5-dpo-binarized-alpha
language:
  - en
library_name: transformers
pipeline_tag: text-generation

AlphaMonarch-laser

image/jpeg

AlphaMonarch-laser is a DPO fine-tuned of mlabonne/NeuralMonarch-7B using the argilla/OpenHermes2.5-dpo-binarized-alpha preference dataset but achieves better performance then mlabonne/AlphaMonarch-7B using LaserQLoRA. We have fine-tuned this model only on half of the projections, but have achieved better results as compared to the version released by Maximme Labonne. We have trained this model for 1080 steps.

AlphaMonarch-laser is ranking 1 on YALL - Yet Another LLM Leaderboard. image/png

🏆 Evaluation results

Nous Benchmark

AGIEVAL

Task Version Metric Value StdErr
agieval_aqua_rat 0 acc 28.35% 2.83%
agieval_aqua_rat 0 acc_norm 26.38% 2.77%
agieval_logiqa_en 0 acc 38.25% 1.91%
agieval_logiqa_en 0 acc_norm 38.10% 1.90%
agieval_lsat_ar 0 acc 23.91% 2.82%
agieval_lsat_ar 0 acc_norm 23.48% 2.80%
agieval_lsat_lr 0 acc 52.75% 2.21%
agieval_lsat_lr 0 acc_norm 53.92% 2.21%
agieval_lsat_rc 0 acc 66.91% 2.87%
agieval_lsat_rc 0 acc_norm 67.29% 2.87%
agieval_sat_en 0 acc 78.64% 2.86%
agieval_sat_en 0 acc_norm 78.64% 2.86%
agieval_sat_en_without_passage 0 acc 45.15% 3.48%
agieval_sat_en_without_passage 0 acc_norm 44.17% 3.47%
agieval_sat_math 0 acc 33.18% 3.18%
agieval_sat_math 0 acc_norm 31.36% 3.14%
Average: 28.41%

GPT4ALL

Task Version Metric Value StdErr
arc_challenge 0 acc 66.30% ± 1.38%
acc_norm 68.26% ± 1.36%
arc_easy 0 acc 86.57% ± 0.70%
acc_norm 80.81% ± 0.81%
boolq 1 acc 87.16% ± 0.59%
hellaswag 0 acc 69.60% ± 0.46%
acc_norm 87.45% ± 0.33%
openbookqa 0 acc 39.20% ± 2.19%
acc_norm 49.60% ± 2.24%
piqa 0 acc 83.03% ± 0.88%
acc_norm 84.87% ± 0.84%
winogrande 0 acc 81.06% ± 1.10%
Average: 76.98%

TRUTHFUL-QA

Task Version Metric Value StdErr
truthfulqa_mc 1 mc1 63.04% ± 1.69%
truthfulqa_mc 1 mc2 78.39% ± 1.37%
Average: 70.71%

BIGBENCH

Task Version Metric Value StdErr
bigbench_causal_judgement 0 multiple_choice_grade 60.00% ± 3.56%
bigbench_date_understanding 0 multiple_choice_grade 62.06% ± 2.53%
bigbench_disambiguation_qa 0 multiple_choice_grade 54.26% ± 3.11%
bigbench_geometric_shapes 0 multiple_choice_grade 23.96% ± 2.26%
exact_str_match 0.00% ± 0.00%
bigbench_logical_deduction_five_objects 0 multiple_choice_grade 32.80% ± 2.10%
bigbench_logical_deduction_seven_objects 0 multiple_choice_grade 23.86% ± 1.61%
bigbench_logical_deduction_three_objects 0 multiple_choice_grade 59.33% ± 2.84%
bigbench_movie_recommendation 0 multiple_choice_grade 58.00% ± 2.21%
bigbench_navigate 0 multiple_choice_grade 56.00% ± 1.57%
bigbench_reasoning_about_colored_objects 0 multiple_choice_grade 69.20% ± 1.03%
bigbench_ruin_names 0 multiple_choice_grade 55.36% ± 2.35%
bigbench_salient_translation_error_detection 0 multiple_choice_grade 41.48% ± 1.56%
bigbench_snarks 0 multiple_choice_grade 73.48% ± 3.29%
bigbench_sports_understanding 0 multiple_choice_grade 76.06% ± 1.36%
bigbench_temporal_sequences 0 multiple_choice_grade 55.50% ± 1.57%
bigbench_tracking_shuffled_objects_five_objects 0 multiple_choice_grade 23.28% ± 1.20%
bigbench_tracking_shuffled_objects_seven_objects 0 multiple_choice_grade 19.37% ± 0.94%
bigbench_tracking_shuffled_objects_three_objects 0 multiple_choice_grade 59.33% ± 2.84%
Average: 55.37%

Openllm Benchmark

Task Version Metric Value Stderr
arc_challenge 0 acc 70.12 ± 1.30
acc_norm 73.27 ± 1.29
hellaswag 0 acc 71.80 ± 0.44
acc_norm 89.20 ± 0.30
gsm8k 0 acc 66.77 ± 1.2
winogrande 0 acc 84.6 ± 1.0

Average: 73.5%

TruthfulQA

Task Version Metric Value Stderr
truthfulqa_mc 1 mc1 62.79 ± 1.69
mc2 77.90 ± 1.37

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • training_steps: 1080

📝 Axolotl Configuration

base_model: mlabonne/NeuralMonarch-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: true
strict: false
rl: dpo
chat_template: chatml
datasets:
  - path: mlabonne/chatml-OpenHermes2.5-dpo-binarized-alpha
    split: train
    type: chatml.intel
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./out
adapter: qlora
lora_model_dir:
sequence_len: 1800
sample_packing: false
pad_to_sequence_len: false
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
 - layers.1.self_attn.q_proj
 - layers.0.self_attn.q_proj
 - layers.15.self_attn.q_proj
 - layers.12.self_attn.q_proj
 - layers.11.self_attn.q_proj
 - layers.14.self_attn.q_proj
 - layers.9.self_attn.q_proj
 - layers.16.self_attn.q_proj
 - layers.30.self_attn.q_proj
 - layers.18.self_attn.q_proj
 - layers.13.self_attn.q_proj
 - layers.10.self_attn.q_proj
 - layers.7.self_attn.q_proj
 - layers.8.self_attn.q_proj
 - layers.4.self_attn.q_proj
 - layers.19.self_attn.q_proj
 - layers.27.self_attn.k_proj
 - layers.24.self_attn.k_proj
 - layers.25.self_attn.k_proj
 - layers.22.self_attn.k_proj
 - layers.26.self_attn.k_proj
 - layers.29.self_attn.k_proj
 - layers.23.self_attn.k_proj
 - layers.28.self_attn.k_proj
 - layers.21.self_attn.k_proj
 - layers.31.self_attn.k_proj
 - layers.30.self_attn.k_proj
 - layers.20.self_attn.k_proj
 - layers.5.self_attn.k_proj
 - layers.19.self_attn.k_proj
 - layers.17.self_attn.k_proj
 - layers.18.self_attn.k_proj
 - layers.19.self_attn.v_proj
 - layers.24.self_attn.v_proj
 - layers.18.self_attn.v_proj
 - layers.5.self_attn.v_proj
 - layers.3.self_attn.v_proj
 - layers.16.self_attn.v_proj
 - layers.23.self_attn.v_proj
 - layers.27.self_attn.v_proj
 - layers.25.self_attn.v_proj
 - layers.26.self_attn.v_proj
 - layers.20.self_attn.v_proj
 - layers.6.self_attn.v_proj
 - layers.15.self_attn.v_proj
 - layers.17.self_attn.v_proj
 - layers.29.self_attn.v_proj
 - layers.22.self_attn.v_proj
 - layers.12.self_attn.o_proj
 - layers.9.self_attn.o_proj
 - layers.14.self_attn.o_proj
 - layers.0.self_attn.o_proj
 - layers.6.self_attn.o_proj
 - layers.8.self_attn.o_proj
 - layers.10.self_attn.o_proj
 - layers.11.self_attn.o_proj
 - layers.13.self_attn.o_proj
 - layers.24.self_attn.o_proj
 - layers.7.self_attn.o_proj
 - layers.15.self_attn.o_proj
 - layers.5.self_attn.o_proj
 - layers.17.self_attn.o_proj
 - layers.25.self_attn.o_proj
 - layers.4.self_attn.o_proj
 - layers.31.mlp.gate_proj
 - layers.30.mlp.gate_proj
 - layers.4.mlp.gate_proj
 - layers.3.mlp.gate_proj
 - layers.29.mlp.gate_proj
 - layers.28.mlp.gate_proj
 - layers.6.mlp.gate_proj
 - layers.27.mlp.gate_proj
 - layers.5.mlp.gate_proj
 - layers.26.mlp.gate_proj
 - layers.25.mlp.gate_proj
 - layers.7.mlp.gate_proj
 - layers.2.mlp.gate_proj
 - layers.24.mlp.gate_proj
 - layers.23.mlp.gate_proj
 - layers.10.mlp.gate_proj
 - layers.6.mlp.up_proj
 - layers.4.mlp.up_proj
 - layers.5.mlp.up_proj
 - layers.27.mlp.up_proj
 - layers.25.mlp.up_proj
 - layers.26.mlp.up_proj
 - layers.17.mlp.up_proj
 - layers.24.mlp.up_proj
 - layers.7.mlp.up_proj
 - layers.10.mlp.up_proj
 - layers.3.mlp.up_proj
 - layers.11.mlp.up_proj
 - layers.23.mlp.up_proj
 - layers.9.mlp.up_proj
 - layers.14.mlp.up_proj
 - layers.18.mlp.up_proj
 - layers.19.mlp.down_proj
 - layers.20.mlp.down_proj
 - layers.18.mlp.down_proj
 - layers.21.mlp.down_proj
 - layers.29.mlp.down_proj
 - layers.1.mlp.down_proj
 - layers.22.mlp.down_proj
 - layers.28.mlp.down_proj
 - layers.23.mlp.down_proj
 - layers.30.mlp.down_proj
 - layers.17.mlp.down_proj
 - layers.4.mlp.down_proj
 - layers.2.mlp.down_proj
 - layers.15.mlp.down_proj
 - layers.5.mlp.down_proj
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 5e-7
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 1
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 1080
max_steps: 1080
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Framework versions

  • Transformers 4.38.0.dev0
  • Pytorch 2.1.2+cu118
  • Datasets 2.17.0
  • Tokenizers 0.15.0
  • axolotl: 0.4.0

Built with Axolotl