DistilBERT
This DistilBERT model, a distilled variant of BERT, is fine-tuned on the
NewsQA question-answering dataset.
Hyperparameters
batch_size = 16
n_epochs = 3
max_seq_len = 512
learning_rate = 2e-5
optimizer = AdamW
lr_schedule = LinearWarmup
weight_decay = 0.01
embeds_dropout_prob = 0.1
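
To make the LinearWarmup entry concrete, here is a minimal pure-Python sketch of how the learning rate might evolve under these settings. It is not the training code used for this model. The names TrainConfig and linear_warmup_lr are hypothetical, the dataset size (10,000 examples) and the 10% warmup fraction are placeholders that do not come from this card, and the linear decay to zero after warmup is an assumption based on how such schedules are commonly implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the hyperparameter list above.
    batch_size: int = 16
    n_epochs: int = 3
    max_seq_len: int = 512
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    embeds_dropout_prob: float = 0.1


def linear_warmup_lr(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Return the learning rate at `step`.

    Ramps linearly from 0 to `base_lr` over `warmup_steps`, then decays
    linearly to 0 at `total_steps`. The decay phase is an ASSUMPTION; the
    card only names the schedule "LinearWarmup".
    """
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


cfg = TrainConfig()

# ASSUMPTION: the example count and warmup fraction below are placeholders,
# chosen only to produce concrete step counts.
n_examples = 10_000
steps_per_epoch = -(-n_examples // cfg.batch_size)  # ceiling division
total_steps = steps_per_epoch * cfg.n_epochs
warmup_steps = total_steps // 10

if __name__ == "__main__":
    for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
        lr = linear_warmup_lr(step, total_steps, warmup_steps, cfg.learning_rate)
        print(f"step {step:5d}  lr {lr:.3e}")
```

With these placeholder numbers, the rate starts at zero, peaks at learning_rate when warmup ends, and returns to zero at the last step. In a real run, the optimizer (AdamW with weight_decay = 0.01) would receive this value at every update.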