Summary

Distilled with the Distily library, using gpt2 as the teacher model, on the wikimedia/wikipedia dataset.

Model Architecture:

Architecture: GPT2LMHeadModel
Total Parameters: 124,439,808
Data Type (dtype): torch.bfloat16
Model Size: 0.24 GB
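
The reported size can be roughly cross-checked from the parameter count and dtype. The following is a minimal sketch, assuming the size is dominated by the parameters at 2 bytes each (bfloat16). The helper name `model_size_bytes` is illustrative, not part of Distily. The estimate lands near the reported 0.24 GB; the exact figure depends on the unit convention and on whether non-parameter buffers are counted.

```python
# Rough size estimate: parameter count times bytes per parameter.
TOTAL_PARAMETERS = 124_439_808   # "Total Parameters" from this card
BYTES_PER_BF16 = 2               # bfloat16 is a 16-bit format


def model_size_bytes(num_params: int, bytes_per_param: int) -> int:
    """Return the storage needed for the parameters alone (assumption: no buffers)."""
    return num_params * bytes_per_param


size_bytes = model_size_bytes(TOTAL_PARAMETERS, BYTES_PER_BF16)
size_gb = size_bytes / 1e9       # decimal gigabytes
size_gib = size_bytes / 2**30    # binary gibibytes

print(f"{size_bytes:,} bytes = {size_gb:.3f} GB = {size_gib:.3f} GiB")
```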

Benchmark Metrics Comparison

Metric	attn_layer_mapper=all, attn_loss_fn=logsum, attn_projector=miles	attn_layer_mapper=all, attn_loss_fn=raw_mse, attn_projector=miles	teacher
ai2_arc (acc)	0.228	0.256	0.304
ai2_arc (acc_norm)	0.258	0.267	0.309
arc_challenge (acc)	0.186	0.177	0.184
arc_challenge (acc_norm)	0.227	0.202	0.214
arc_easy (acc)	0.27	0.335	0.424
arc_easy (acc_norm)	0.288	0.332	0.405
boolq (acc)	0.375	0.377	0.541
cola (mcc)	0.0	0.0	0.009
glue (acc)	0.454	0.444	0.41
glue (f1)	0.0	0.279	0.526
glue (mcc)	0.0	0.0	0.009
hellaswag (acc)	0.282	0.302	0.337
hellaswag (acc_norm)	0.275	0.308	0.384
mnli (acc)	0.326	0.331	0.323
mnli_mismatch (acc)	0.295	0.367	0.344
mrpc (acc)	0.316	0.336	0.515
mrpc (f1)	0.0	0.075	0.631
qnli (acc)	0.527	0.519	0.472
qqp (acc)	0.673	0.515	0.34
qqp (f1)	0.0	0.363	0.483
rte (acc)	0.52	0.57	0.516
sst2 (acc)	0.492	0.498	0.511
wikitext (bits_per_byte)	1.888	1.273	0.98
wikitext (byte_perplexity)	3.701	2.416	1.973
wikitext (word_perplexity)	1094.0	111.9	37.82
wnli (acc)	0.437	0.521	0.451
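
The two byte-level wikitext rows above are related: byte perplexity is 2 raised to the bits-per-byte value. The sketch below checks that relationship against the reported numbers. The short column labels are abbreviations introduced here for readability, and the tolerance allows for rounding in the table.

```python
# Check that byte_perplexity == 2 ** bits_per_byte for each column of the table.
# Values are copied from the wikitext rows; labels are shortened for readability.
reported = {
    "logsum": (1.888, 3.701),    # (bits_per_byte, byte_perplexity)
    "raw_mse": (1.273, 2.416),
    "teacher": (0.98, 1.973),
}

TOLERANCE = 0.005  # the table is rounded to about three decimals

implied = {label: 2 ** bpb for label, (bpb, _) in reported.items()}

for label, (bpb, byte_ppl) in reported.items():
    gap = abs(implied[label] - byte_ppl)
    status = "consistent" if gap < TOLERANCE else "inconsistent"
    print(f"{label}: 2**{bpb} = {implied[label]:.3f}, reported {byte_ppl} ({status})")
```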

Resource Usage Comparison

VRAM Use: 7.7871 GB

Distillation (Teacher -> Student) Architecture Difference:

Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
Total Parameters: 124,439,808 -> 124,439,808
Data Type (dtype): torch.bfloat16 -> torch.bfloat16
Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset.

Num Samples: 247,500
Subset: 20231101.en
Split: train
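
The sample count above is consistent with the dataset_sample_size (250000) and dataset_test_size (0.01) hyperparameters listed later in this card. The sketch below reproduces that arithmetic and derives two figures that are not reported in the card: the mean tokens per sample and the number of optimizer steps for one epoch. It assumes the test split is simply held out from the sampled rows and that no samples are dropped or packed.

```python
# Values copied from this card.
DATASET_SAMPLE_SIZE = 250_000
DATASET_TEST_SIZE = 0.01
TRAIN_TOKENS = 145_744_973
TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 1

# Assumption: the test split is held out from the sampled rows.
test_samples = round(DATASET_SAMPLE_SIZE * DATASET_TEST_SIZE)
train_samples = DATASET_SAMPLE_SIZE - test_samples

# Derived figures (not reported in the card).
mean_tokens_per_sample = TRAIN_TOKENS / train_samples
steps_per_epoch = train_samples // (TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)

print(f"train samples: {train_samples:,}")
print(f"mean tokens per sample: {mean_tokens_per_sample:.1f}")
print(f"optimizer steps per epoch: {steps_per_epoch:,}")
```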

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=cos, layer_mapper=layer-2, projector=miles))
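
To illustrate how an objective of this shape combines its parts, here is a minimal plain-Python sketch of a weighted sum of a KL term on logits and a cosine term on attention features. It is not Distily's implementation. The KL direction (teacher to student), the "1 minus cosine similarity" form, and all function names are assumptions, and the layer_mapper and projector settings are not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token's distribution (direction is an assumption)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine_loss(teacher_vec, student_vec):
    """1 - cosine similarity between two flattened feature vectors (assumed form)."""
    dot = sum(t * s for t, s in zip(teacher_vec, student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    return 1.0 - dot / (norm_t * norm_s)


def distillation_loss(teacher_logits, student_logits, teacher_attn, student_attn,
                      logits_weight=1.0, attn_weight=25.0):
    """Weighted sum of the two components, using the weights listed in the card."""
    logits_term = kl_divergence(teacher_logits, student_logits)
    attn_term = cosine_loss(teacher_attn, student_attn)
    return logits_weight * logits_term + attn_weight * attn_term


# Toy inputs: matching logits, but attention features pointing in different directions.
example = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 0.0], [0.0, 1.0])
print(example)
```

With an attention weight of 25.0, a mismatch in the attention term can dominate the total when the logits already agree, which is the behavior the toy call shows.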

Hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_warmup_ratio: 0.5
num_epochs: 1.0
distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=cos, layer_mapper=layer-2, projector=miles))
train_embeddings: True
lr_scheduler: torch.optim.lr_scheduler.LambdaLR
student_model_name_or_path: None
student_config_name_or_path: None
student_model_config: None
reinitialize_weights: None
copy_teacher_modules: [('lm_head', False)]
student_model_as_bitnet: True
dropout: None
teacher_model_name_or_path: gpt2
teacher_load_in_8bit: False
teacher_load_in_4bit: False
dataset_uri: wikimedia/wikipedia
dataset_subset: 20231101.en
dataset_split: train
dataset_column_name: text
dataset_sample_size: 250000
dataset_test_size: 0.01
gradient_accumulation_steps: 1
weight_decay: 0.0
max_grad_norm: 1.0
warmup_ratio: 0.5
warmup_steps: 0
gradient_checkpointing: True
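
For intuition about the schedule these settings imply, the sketch below computes a learning rate per step with linear warmup over the first half of training (warmup_ratio 0.5) followed by cosine decay. It is an approximation under stated assumptions, not the scheduler object used in training. The total step count (247,500 samples at batch size 4), the `min_lr_rate` default of 0.0, and the function name `lr_at_step` are hypothetical, since the card does not report the minimum learning rate.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.5, min_lr_rate=0.0):
    """Linear warmup, then cosine decay toward base_lr * min_lr_rate.

    min_lr_rate is a hypothetical value; the card does not state the minimum.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


TOTAL_STEPS = 61_875  # assumption: 247,500 samples / batch size 4, one epoch
WARMUP_STEPS = math.ceil(TOTAL_STEPS * 0.5)

for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the peak learning rate of 0.0001 is reached only at the midpoint of the run, so half of the training tokens are seen at a below-peak rate during warmup.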

Framework Versions

Distily 0.3.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0

distily/distily_test_attn_miles