
Mistral-v0.3-6B

Brief continued pretraining at ctx 4096 to 'heal' the layer pruning.

Model description

This model is a fine-tuned version of pszemraj/Mistral-7B-v0.3-prune6 on the smorgasbord-tb-quality subset of BEE-spoke-data/knowledge-inoc-concat-v1. It achieves the following results on the evaluation set:

  • Loss: 1.2860
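
The checkpoint is a plain causal language model (no chat template), so it can be used for text completion with Hugging Face transformers. A minimal generation sketch, assuming the hub_model_id from the config below (pszemraj/Mistral-v0.3-6B-ii) and a bfloat16-capable GPU with accelerate installed:

```python
# Minimal generation sketch; the repo id follows hub_model_id in the axolotl config
# below and may differ from the repo this card is hosted under.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/Mistral-v0.3-6B-ii"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint is stored in bf16
    device_map="auto",           # requires `accelerate`
)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```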

Built with Axolotl

See axolotl config

axolotl version: 0.4.0

base_model: pszemraj/Mistral-7B-v0.3-prune6
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

strict: false
seed: 80085
max_steps: 2000
# dataset
datasets:
    - path: BEE-spoke-data/knowledge-inoc-concat-v1
      name: smorgasbord-tb-quality
      type: completion 
      field: text 
val_set_size: 0.01

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: false
train_on_inputs: false
group_by_length: false

# WANDB
wandb_project: llama3-pruning
wandb_entity: pszemraj
wandb_watch: gradients
wandb_name: Mistral-6B-v0.3-v0.1-ii
hub_model_id: pszemraj/Mistral-v0.3-6B-ii
hub_strategy: every_save

gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_32bit
weight_decay: 0.1
lr_scheduler: cosine
learning_rate: 2e-5
warmup_ratio: 0.1

load_in_8bit: false
load_in_4bit: false
bfloat16: true
tf32: true

flash_attention: true
torch_compile: true 
torch_compile_backend: inductor 
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# hyperparams for freq of evals, saving, etc
evals_per_epoch: 5
saves_per_epoch: 5
save_safetensors: true
save_total_limit: 1
output_dir: /workspace/output-axolotl/output-model-6b
logging_steps: 6

deepspeed:

special_tokens:
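
For context on the scale of this "healing" run, the config implies roughly the following token budget. This is a back-of-the-envelope sketch that assumes a single GPU and perfectly dense sample packing; neither the world size nor the packing efficiency is stated above.

```python
# Rough token budget implied by the axolotl config above (single-GPU and
# perfect sample packing assumed; real packing is never 100% dense).
sequence_len = 4096
micro_batch_size = 1
gradient_accumulation_steps = 16
max_steps = 2000

sequences_per_step = micro_batch_size * gradient_accumulation_steps  # 16
tokens_per_step = sequences_per_step * sequence_len                  # 65,536
total_tokens = tokens_per_step * max_steps                           # ~131M
print(f"{tokens_per_step=:,} {total_tokens=:,}")
```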

Quick eval

Quick eval for: pszemraj/Mistral-v0.3-6B-ii

hf (pretrained=pszemraj/Mistral-v0.3-6B-ii, trust_remote_code=True, dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2

| Tasks          | Version | Filter | n-shot | Metric     | Value  |   | Stderr |
|----------------|--------:|--------|-------:|------------|-------:|---|-------:|
| arc_easy       |       1 | none   |      0 | acc        | 0.7109 | ± | 0.0093 |
|                |         | none   |      0 | acc_norm   | 0.6654 | ± | 0.0097 |
| boolq          |       2 | none   |      0 | acc        | 0.7930 | ± | 0.0071 |
| lambada_openai |       1 | none   |      0 | perplexity | 4.9892 | ± | 0.1269 |
|                |         | none   |      0 | acc        | 0.6746 | ± | 0.0065 |
| openbookqa     |       1 | none   |      0 | acc        | 0.2460 | ± | 0.0193 |
|                |         | none   |      0 | acc_norm   | 0.3700 | ± | 0.0216 |
| piqa           |       1 | none   |      0 | acc        | 0.7350 | ± | 0.0103 |
|                |         | none   |      0 | acc_norm   | 0.7350 | ± | 0.0103 |
| winogrande     |       1 | none   |      0 | acc        | 0.6930 | ± | 0.0130 |
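
These numbers were produced with EleutherAI's lm-evaluation-harness. A sketch of reproducing the zero-shot run with the harness's Python API is below; the argument names assume the v0.4.x `simple_evaluate` helper and may differ across versions.

```python
# Sketch: re-run the zero-shot quick eval with lm-evaluation-harness (v0.4.x API assumed).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pszemraj/Mistral-v0.3-6B-ii,trust_remote_code=True,dtype=bfloat16",
    tasks=["arc_easy", "boolq", "lambada_openai", "openbookqa", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=2,
)
print(results["results"])  # per-task metrics as a nested dict
```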

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch of the equivalent optimizer/scheduler setup follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 80085
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 16
  • optimizer: paged AdamW (32-bit) with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 200
  • training_steps: 2000
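
Below is a minimal stand-alone sketch of roughly what these settings correspond to, using bitsandbytes' paged 32-bit AdamW and the cosine schedule helper from transformers; Axolotl's actual training loop differs in the details, so treat this as illustrative only.

```python
# Rough equivalent of the optimizer/scheduler settings above (a sketch,
# not Axolotl's actual training loop).
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/Mistral-7B-v0.3-prune6", torch_dtype=torch.bfloat16
)
optimizer = bnb.optim.PagedAdamW32bit(
    model.parameters(),
    lr=2e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.1,  # from the axolotl config
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=200,    # warmup_ratio 0.1 of 2000 steps
    num_training_steps=2000,
)
```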

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| No log        | 0.0002 | 1    | 1.5980          |
| 1.578         | 0.0955 | 400  | 1.4028          |
| 1.5828        | 0.1911 | 800  | 1.3809          |
| 1.4355        | 0.2866 | 1200 | 1.3152          |
| 1.4618        | 0.3822 | 1600 | 1.2877          |
| 1.4551        | 0.4777 | 2000 | 1.2860          |
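
For intuition, the final validation loss corresponds to a token-level perplexity of about exp(1.2860) ≈ 3.6 on the held-out 1% split:

```python
import math

print(math.exp(1.2860))  # ≈ 3.62, perplexity implied by the final validation loss
```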

Framework versions

  • Transformers 4.40.2
  • Pytorch 2.3.0+cu118
  • Datasets 2.19.1
  • Tokenizers 0.19.1

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 49.23 |
| AI2 Reasoning Challenge (25-Shot) | 45.14 |
| HellaSwag (10-Shot)               | 71.65 |
| MMLU (5-Shot)                     | 51.83 |
| TruthfulQA (0-shot)               | 45.64 |
| Winogrande (5-shot)               | 72.77 |
| GSM8k (5-shot)                    |  8.34 |
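
The reported average appears to be the unweighted mean of the six task scores:

```python
scores = [45.14, 71.65, 51.83, 45.64, 72.77, 8.34]
print(round(sum(scores) / len(scores), 2))  # 49.23
```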