Llama-3-6.3b-v0.1

This is a layer-pruning experiment based on the original llama-3-8b:

  • 8 layers pruned with PruneMe/MergeKit
  • brief subsequent continued pretraining @ ctx 4096
    • data: 10k rows of FineWeb (different from the pruning data) + some curated data
  • wandb here
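
As a sanity check on the model name, the parameter count after pruning can be worked out by hand. The sketch below assumes the publicly documented Llama-3-8B architecture (32 decoder layers, hidden size 4096, grouped-query attention with 8 KV heads, untied embeddings) and assumes that pruning removes 8 whole decoder layers while leaving the embeddings, final norm, and LM head untouched. Which 8 layers were removed is not stated here and does not affect the count.

```python
# Llama-3-8B architecture values (assumed from the public base-model config).
VOCAB = 128_256
HIDDEN = 4_096
INTERMEDIATE = 14_336
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = HIDDEN // N_HEADS  # 128


def params_per_layer() -> int:
    """Parameters in one decoder layer (Llama uses no bias terms)."""
    kv_dim = N_KV_HEADS * HEAD_DIM
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate_proj, up_proj, down_proj
    norms = 2 * HIDDEN  # two RMSNorm weight vectors
    return attn + mlp + norms


def total_params(n_layers: int) -> int:
    """Input embeddings + untied LM head + final norm + n decoder layers."""
    embeddings = 2 * VOCAB * HIDDEN
    final_norm = HIDDEN
    return embeddings + final_norm + n_layers * params_per_layer()


base = total_params(32)        # original llama-3-8b
pruned = total_params(32 - 8)  # 8 layers removed
print(f"base:   {base / 1e9:.2f}B")
print(f"pruned: {pruned / 1e9:.2f}B")
```

Under these assumptions the pruned model comes out at roughly 6.29B parameters, which is where the "6.3b" in the name comes from.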

quick eval

hf (pretrained=pszemraj/Llama-3-6.3b-v0.1,trust_remote_code=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

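| Tasks          | Version | Filter | n-shot | Metric     | Value  | Stderr   |
|----------------|---------|--------|--------|------------|--------|----------|
| arc_easy       | 1       | none   | 0      | acc        | 0.7109 | ± 0.0093 |
|                |         | none   | 0      | acc_norm   | 0.6843 | ± 0.0095 |
| boolq          | 2       | none   | 0      | acc        | 0.7920 | ± 0.0071 |
| lambada_openai | 1       | none   | 0      | perplexity | 4.5411 | ± 0.1073 |
|                |         | none   | 0      | acc        | 0.6734 | ± 0.0065 |
| openbookqa     | 1       | none   | 0      | acc        | 0.3000 | ± 0.0205 |
|                |         | none   | 0      | acc_norm   | 0.4140 | ± 0.0220 |
| piqa           | 1       | none   | 0      | acc        | 0.7443 | ± 0.0102 |
|                |         | none   | 0      | acc_norm   | 0.7530 | ± 0.0101 |
| winogrande     | 1       | none   | 0      | acc        | 0.7127 | ± 0.0127 |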
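
To rerun these numbers, something like the following should work with lm-evaluation-harness (0.4.x) through its Python entry point. This is a sketch rather than the exact command used for the table: the model arguments and task names are copied from the header line and rows of the quick eval, while the function name `run_quick_eval` is made up for illustration. Results may differ slightly across harness versions and hardware.

```python
# Task names as they appear in the quick eval table.
TASKS = ["arc_easy", "boolq", "lambada_openai", "openbookqa", "piqa", "winogrande"]

# Same model arguments as in the eval header line.
MODEL_ARGS = (
    "pretrained=pszemraj/Llama-3-6.3b-v0.1,"
    "trust_remote_code=True,"
    "dtype=bfloat16"
)


def run_quick_eval() -> dict:
    """Run the zero-shot tasks and return the per-task metrics.

    Hypothetical helper: downloads the model, so it needs network access
    and a GPU with enough memory for a ~6.3B-parameter bf16 model.
    """
    # Imported lazily so the module can be loaded without lm-eval installed.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=MODEL_ARGS,
        tasks=TASKS,
        batch_size=1,
    )
    return results["results"]


if __name__ == "__main__":
    for task, metrics in run_quick_eval().items():
        print(task, metrics)
```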

Details

Built with Axolotl

See axolotl config

axolotl version: 0.4.0

base_model: pszemraj/llama-3-prune_8
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

strict: false
seed: 80085

# dataset
datasets:
    - path: BEE-spoke-data/KI-smorgasbord_fw-small
      type: completion # format from earlier
      field: text # Optional[str] default: text, field to use for completion data
val_set_size: 0.015

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: false
train_on_inputs: false
group_by_length: false

# WANDB
wandb_project: llama3-pruning
wandb_entity: pszemraj
wandb_watch: gradients
wandb_name: Llama-3-6.3b-v0.1
hub_model_id: pszemraj/Llama-3-6.3b-v0.1
hub_strategy: every_save

gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch_fused # paged_adamw_32bit
weight_decay: 0.05
lr_scheduler: cosine
learning_rate: 4e-5
warmup_ratio: 0.1

load_in_8bit: false
load_in_4bit: false
bfloat16: true
tf32: true

flash_attention: true
torch_compile: true # requires >= torch 2.0, may sometimes cause problems
torch_compile_backend: inductor # Optional[str]
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# hyperparams for freq of evals, saving, etc
evals_per_epoch: 5
saves_per_epoch: 3
save_safetensors: true
save_total_limit: 1
output_dir: ./output-axolotl/output-model-6.3b
logging_steps: 8

deepspeed:

special_tokens:
  pad_token: <|end_of_text|>
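
Two quantities implied by this config are easy to compute by hand: the number of tokens per optimizer step and the shape of the learning-rate curve. The sketch below is an illustration and not axolotl's own code. It assumes every packed sequence is completely full, takes the device count as a parameter (the config does not state it), and uses the standard linear-warmup plus half-cosine decay formula. The rounding of the warmup step count is also an assumption.

```python
import math

# Values copied from the config above.
LEARNING_RATE = 4e-5
WARMUP_RATIO = 0.1
GRAD_ACCUM_STEPS = 16
MICRO_BATCH_SIZE = 1
SEQUENCE_LEN = 4096


def tokens_per_optimizer_step(num_devices: int = 1) -> int:
    """Upper bound on tokens per optimizer step (assumes fully packed sequences)."""
    return GRAD_ACCUM_STEPS * MICRO_BATCH_SIZE * SEQUENCE_LEN * num_devices


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = int(WARMUP_RATIO * total_steps)  # rounding is an assumption
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return LEARNING_RATE * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    total = 1600  # hypothetical step count, for illustration only
    print("tokens/step (1 device):", tokens_per_optimizer_step())
    for s in (0, 80, 160, 880, 1600):
        print(f"step {s:>4}: lr = {lr_at(s, total):.2e}")
```

With `gradient_accumulation_steps: 16`, `micro_batch_size: 1`, and `sequence_len: 4096`, each optimizer step covers at most 65,536 tokens per device.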

Training results

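| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| No log        | 0.0006 | 1    | 7.8100          |
| 2.2782        | 0.2002 | 320  | 2.3728          |
| 2.2699        | 0.4004 | 640  | 2.3265          |
| 2.3761        | 0.6006 | 960  | 2.2849          |
| 2.2448        | 0.8008 | 1280 | 2.2702          |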
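
The losses above are easier to compare as perplexities. The short sketch below assumes the validation loss is the mean per-token cross-entropy in nats (the usual convention for Hugging Face style trainers), so perplexity is simply `exp(loss)`. It also estimates the number of steps in one epoch from the logged epoch fractions.

```python
import math

# (epoch, step, validation loss) rows copied from the training results.
EVAL_LOG = [
    (0.0006, 1, 7.8100),
    (0.2002, 320, 2.3728),
    (0.4004, 640, 2.3265),
    (0.6006, 960, 2.2849),
    (0.8008, 1280, 2.2702),
]


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss)


def estimated_steps_per_epoch(epoch: float, step: int) -> int:
    """Extrapolate the step count of a full epoch from one logged row."""
    return round(step / epoch)


if __name__ == "__main__":
    for epoch, step, loss in EVAL_LOG:
        print(f"step {step:>4} (epoch {epoch:.4f}): ppl = {perplexity(loss):.2f}")
    last_epoch, last_step, _ = EVAL_LOG[-1]
    print("steps per epoch (est.):", estimated_steps_per_epoch(last_epoch, last_step))
```

On this reading, the final logged validation loss of 2.2702 corresponds to a perplexity of about 9.7, and one epoch is roughly 1,600 steps.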

Model size: 6.29B params (Safetensors, tensor type BF16)