---
license: unknown
language:
  - en
---

# Baby Llama

Our submission to the strict-small track of the BabyLM challenge.

Baby Llama is a 58M-parameter model, distilled from an ensemble consisting of LLaMA-360M and GPT2-705M, both trained on the babylm_10M dataset.
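
In general terms, distilling from an ensemble means training the student on a mix of the usual cross-entropy loss and a KL-divergence term towards each teacher's softened output distribution. The sketch below is purely illustrative (the function name, temperature, and weighting are assumptions, not the exact objective used for Baby Llama; see the paper and training repository for that):

```python
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, labels,
                               temperature=2.0, alpha=0.5):
    """Illustrative objective: cross-entropy on the data plus the average
    KL divergence to each teacher's temperature-softened distribution."""
    # Ordinary language-modeling cross-entropy against the gold labels.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))

    # Average KL term over the teachers, with the usual T^2 scaling.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = 0.0
    for teacher_logits in teacher_logits_list:
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        kl = kl + F.kl_div(log_p_student, p_teacher,
                           reduction="batchmean") * temperature ** 2
    kl = kl / len(teacher_logits_list)

    return alpha * ce + (1 - alpha) * kl
```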

See the associated paper for a detailed discussion of the training procedure and of the model's performance. The training code is available at https://github.com/timinar/BabyLlama.
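
A minimal usage sketch with the Hugging Face `transformers` library; the repository id below (`timinar/baby-llama-58m`) is assumed from this model card and may need adjusting:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "timinar/baby-llama-58m"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a short continuation from a prompt.
inputs = tokenizer("The children went to the park and", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```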

## Hyperparameters for the tasks that require fine-tuning

When evaluating the model on the tasks that require fine-tuning, we noticed that the default hyperparameters suggested by the BabyLM organizers led to severe overfitting on a number of tasks. To avoid this, we re-tuned those hyperparameters. The set of hyperparameters selected for each task is listed in the table below.

| Task | Maximum learning rate | Batch size | Maximum epochs | Patience | Evaluate every (steps) | Random seed |
| --- | --- | --- | --- | --- | --- | --- |
| CoLA | 4e-5 | 64 | 3 | 10 | 20 | 12 |
| SST-2 | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| MRPC | 3e-5 | 64 | 3 | 10 | 20 | 12 |
| QQP | 4e-5 | 64 | 10 | 10 | 1000 | 12 |
| MNLI | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| MNLI-mm | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| QNLI | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| RTE | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| BoolQ | 3e-4 | 16 | 10 | 10 | 10 | 12 |
| MultiRC | 1e-4 | 64 | 7 | 10 | 1000 | 42 |
| WSC | 5e-7 | 1 | 10 | 1000 | 2000 | 12 |
| CR (Control) | 5e-5 | 64 | 10 | 10 | 100 | 12 |
| LC (Control) | 1e-3 | 64 | 1 | 2 | 10 | 12 |
| MV (Control) | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| RP (Control) | 1e-3 | 64 | 1 | 10 | 10 | 12 |
| SC (Control) | 1e-3 | 64 | 2 | 10 | 10 | 12 |
| CR_LC | 1e-3 | 64 | 2 | 10 | 10 | 12 |
| CR_RTP | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| MV_LC | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| MV_RTP | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| SC_LC | 1e-3 | 64 | 2 | 10 | 10 | 12 |
| SC_RP | 1e-3 | 64 | 2 | 10 | 10 | 12 |
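
As an illustration, the CoLA row of the table could be mapped onto a standard Hugging Face `Trainer` setup roughly as follows. This is a sketch, not the BabyLM evaluation pipeline itself: the repository id, the use of the public GLUE CoLA split, and the padding handling are assumptions made for the example.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_name = "timinar/baby-llama-58m"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# The base checkpoint is a causal LM; make sure a pad token exists for batching.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Public GLUE CoLA used here for illustration.
dataset = load_dataset("glue", "cola")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# CoLA row: lr 4e-5, batch 64, 3 epochs, patience 10, eval every 20 steps, seed 12.
args = TrainingArguments(
    output_dir="cola-finetune",
    learning_rate=4e-5,
    per_device_train_batch_size=64,
    num_train_epochs=3,
    evaluation_strategy="steps",  # named eval_strategy in recent transformers versions
    eval_steps=20,
    save_strategy="steps",
    save_steps=20,
    load_best_model_at_end=True,  # required for early stopping
    metric_for_best_model="eval_loss",
    seed=12,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
trainer.train()
```

Here "Patience" is interpreted as the number of evaluations without improvement before stopping, which is what `EarlyStoppingCallback` implements.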