
Mistral-v0.3-6B

Brief continued pretraining at ctx 4096 to 'heal' the layer pruning.

Model description

This model is a fine-tuned version of pszemraj/Mistral-7B-v0.3-prune6 on the smorgasbord-tb-quality subset of BEE-spoke-data/knowledge-inoc-concat-v1. It achieves the following results on the evaluation set:

  • Loss: 1.2860
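
The checkpoint is a plain causal language model (no chat template), so it can be used for text completion with Hugging Face transformers. A minimal generation sketch, assuming the hub_model_id from the config below (pszemraj/Mistral-v0.3-6B-ii) and a bfloat16-capable GPU with accelerate installed:

```python
# Minimal generation sketch; the repo id follows hub_model_id in the axolotl config
# below and may differ from the repo this card is hosted under.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/Mistral-v0.3-6B-ii"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint is stored in bf16
    device_map="auto",           # requires `accelerate`
)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```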

Built with Axolotl

See axolotl config

axolotl version: 0.4.0

base_model: pszemraj/Mistral-7B-v0.3-prune6
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

strict: false
seed: 80085
max_steps: 2000
# dataset
datasets:
    - path: BEE-spoke-data/knowledge-inoc-concat-v1
      name: smorgasbord-tb-quality
      type: completion 
      field: text 
val_set_size: 0.01

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: false
train_on_inputs: false
group_by_length: false

# WANDB
wandb_project: llama3-pruning
wandb_entity: pszemraj
wandb_watch: gradients
wandb_name: Mistral-6B-v0.3-v0.1-ii
hub_model_id: pszemraj/Mistral-v0.3-6B-ii
hub_strategy: every_save

gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_32bit
weight_decay: 0.1
lr_scheduler: cosine
learning_rate: 2e-5
warmup_ratio: 0.1

load_in_8bit: false
load_in_4bit: false
bfloat16: true
tf32: true

flash_attention: true
torch_compile: true 
torch_compile_backend: inductor 
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# hyperparams for freq of evals, saving, etc
evals_per_epoch: 5
saves_per_epoch: 5
save_safetensors: true
save_total_limit: 1
output_dir: /workspace/output-axolotl/output-model-6b
logging_steps: 6

deepspeed:

special_tokens:
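
For context on the scale of this "healing" run, the config implies roughly the following token budget. This is a back-of-the-envelope sketch that assumes a single GPU and perfectly dense sample packing; neither the world size nor the packing efficiency is stated above.

```python
# Rough token budget implied by the axolotl config above (single-GPU and
# perfect sample packing assumed; real packing is never 100% dense).
sequence_len = 4096
micro_batch_size = 1
gradient_accumulation_steps = 16
max_steps = 2000

sequences_per_step = micro_batch_size * gradient_accumulation_steps  # 16
tokens_per_step = sequences_per_step * sequence_len                  # 65,536
total_tokens = tokens_per_step * max_steps                           # ~131M
print(f"{tokens_per_step=:,} {total_tokens=:,}")
```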

Quick eval

Quick eval for: pszemraj/Mistral-v0.3-6B-ii

hf (pretrained=pszemraj/Mistral-v0.3-6B-ii, trust_remote_code=True, dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2

| Tasks          | Version | Filter | n-shot | Metric     | Value  |   | Stderr |
|----------------|--------:|--------|-------:|------------|-------:|---|-------:|
| arc_easy       |       1 | none   |      0 | acc        | 0.7109 | ± | 0.0093 |
|                |         | none   |      0 | acc_norm   | 0.6654 | ± | 0.0097 |
| boolq          |       2 | none   |      0 | acc        | 0.7930 | ± | 0.0071 |
| lambada_openai |       1 | none   |      0 | perplexity | 4.9892 | ± | 0.1269 |
|                |         | none   |      0 | acc        | 0.6746 | ± | 0.0065 |
| openbookqa     |       1 | none   |      0 | acc        | 0.2460 | ± | 0.0193 |
|                |         | none   |      0 | acc_norm   | 0.3700 | ± | 0.0216 |
| piqa           |       1 | none   |      0 | acc        | 0.7350 | ± | 0.0103 |
|                |         | none   |      0 | acc_norm   | 0.7350 | ± | 0.0103 |
| winogrande     |       1 | none   |      0 | acc        | 0.6930 | ± | 0.0130 |
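
These numbers were produced with EleutherAI's lm-evaluation-harness. A sketch of reproducing the zero-shot run with the harness's Python API is below; the argument names assume the v0.4.x `simple_evaluate` helper and may differ across versions.

```python
# Sketch: re-run the zero-shot quick eval with lm-evaluation-harness (v0.4.x API assumed).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pszemraj/Mistral-v0.3-6B-ii,trust_remote_code=True,dtype=bfloat16",
    tasks=["arc_easy", "boolq", "lambada_openai", "openbookqa", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=2,
)
print(results["results"])  # per-task metrics as a nested dict
```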

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch of the equivalent optimizer/scheduler setup follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 80085
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 16
  • optimizer: paged AdamW (32-bit) with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 200
  • training_steps: 2000
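
Below is a minimal stand-alone sketch of roughly what these settings correspond to, using bitsandbytes' paged 32-bit AdamW and the cosine schedule helper from transformers; Axolotl's actual training loop differs in the details, so treat this as illustrative only.

```python
# Rough equivalent of the optimizer/scheduler settings above (a sketch,
# not Axolotl's actual training loop).
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/Mistral-7B-v0.3-prune6", torch_dtype=torch.bfloat16
)
optimizer = bnb.optim.PagedAdamW32bit(
    model.parameters(),
    lr=2e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.1,  # from the axolotl config
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=200,    # warmup_ratio 0.1 of 2000 steps
    num_training_steps=2000,
)
```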

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| No log        | 0.0002 | 1    | 1.5980          |
| 1.578         | 0.0955 | 400  | 1.4028          |
| 1.5828        | 0.1911 | 800  | 1.3809          |
| 1.4355        | 0.2866 | 1200 | 1.3152          |
| 1.4618        | 0.3822 | 1600 | 1.2877          |
| 1.4551        | 0.4777 | 2000 | 1.2860          |
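
For intuition, the final validation loss corresponds to a token-level perplexity of about exp(1.2860) ≈ 3.6 on the held-out 1% split:

```python
import math

print(math.exp(1.2860))  # ≈ 3.62, perplexity implied by the final validation loss
```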

Framework versions

  • Transformers 4.40.2
  • Pytorch 2.3.0+cu118
  • Datasets 2.19.1
  • Tokenizers 0.19.1

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 49.23 |
| AI2 Reasoning Challenge (25-Shot) | 45.14 |
| HellaSwag (10-Shot)               | 71.65 |
| MMLU (5-Shot)                     | 51.83 |
| TruthfulQA (0-shot)               | 45.64 |
| Winogrande (5-shot)               | 72.77 |
| GSM8k (5-shot)                    |  8.34 |
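
The reported average appears to be the unweighted mean of the six task scores:

```python
scores = [45.14, 71.65, 51.83, 45.64, 72.77, 8.34]
print(round(sum(scores) / len(scores), 2))  # 49.23
```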