Spaces:
Running
Running
Fine-tuning details
For each task (GLUE and PAWS), we perform hyperparam search for each model, and report the mean and standard deviation across 5 seeds of the best model. First, get the datasets following the instructions in RoBERTa fine-tuning README. Alternatively, you can use huggingface datasets to get the task data:
from datasets import load_dataset
import pandas as pd
from pathlib import Path
key2file = {
"paws": {
"loc": "paws_data",
"columns": ["id", "sentence1", "sentence2", "label"],
"train": "train.tsv",
"validation": "dev.tsv",
"test": "test.tsv"
}
}
task_data = load_dataset("paws", "labeled_final")
task_config = key2file["paws"]
save_path = Path(task_config["loc"])
save_path.mkdir(exist_ok=True, parents=True)
for key, fl in task_config.items():
if key in ["loc", "columns"]:
continue
print(f"Reading {key}")
columns = task_config["columns"]
df = pd.DataFrame(task_data[key])
print(df.columns)
df = df[columns]
print(f"Got {len(df)} records")
save_loc = save_path / fl
print(f"Saving to : {save_loc}")
df.to_csv(save_loc, sep="\t", header=None, index=None)
- Preprocess using RoBERTa GLUE preprocessing script, while keeping in mind the column numbers for
sentence1
,sentence2
andlabel
(which is 0,1,2 if you save the data according to the above example.) - Then, fine-tuning is performed similarly to RoBERTa (for example, in case of RTE):
TOTAL_NUM_UPDATES=30875 # 10 epochs through RTE for bsz 16
WARMUP_UPDATES=1852 # 6 percent of the number of updates
LR=2e-05 # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=16 # Batch size.
SHUFFLED_ROBERTA_PATH=/path/to/shuffled_roberta/model.pt
CUDA_VISIBLE_DEVICES=0 fairseq-train RTE-bin/ \
--restore-file $SHUFFLED_ROBERTA_PATH \
--max-positions 512 \
--batch-size $MAX_SENTENCES \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
--max-epoch 10 \
--find-unused-parameters \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
TOTAL_NUM_UPDATES
is computed based on the--batch_size
value and the dataset size.WARMUP_UPDATES
is computed as 6% ofTOTAL_NUM_UPDATES
- Best hyperparam of
--lr
and--batch_size
is reported below:
--lr
name | RTE | MRPC | SST-2 | CoLA | QQP | QNLI | MNLI | PAWS | |
---|---|---|---|---|---|---|---|---|---|
0 | original | 2e-05 | 2e-05 | 1e-05 | 2e-05 | 1e-05 | 1e-05 | 1e-05 | 2e-05 |
1 | n_1 | 2e-05 | 1e-05 | 1e-05 | 1e-05 | 3e-05 | 1e-05 | 2e-05 | 2e-05 |
2 | n_2 | 2e-05 | 2e-05 | 1e-05 | 1e-05 | 2e-05 | 1e-05 | 1e-05 | 3e-05 |
3 | n_3 | 3e-05 | 1e-05 | 2e-05 | 2e-05 | 3e-05 | 1e-05 | 1e-05 | 2e-05 |
4 | n_4 | 3e-05 | 1e-05 | 2e-05 | 2e-05 | 2e-05 | 1e-05 | 1e-05 | 2e-05 |
5 | r512 | 1e-05 | 3e-05 | 2e-05 | 2e-05 | 3e-05 | 2e-05 | 3e-05 | 2e-05 |
6 | rand_corpus | 2e-05 | 1e-05 | 3e-05 | 1e-05 | 3e-05 | 3e-05 | 3e-05 | 2e-05 |
7 | rand_uniform | 2e-05 | 1e-05 | 3e-05 | 2e-05 | 3e-05 | 3e-05 | 3e-05 | 1e-05 |
8 | rand_init | 1e-05 | 1e-05 | 3e-05 | 1e-05 | 1e-05 | 1e-05 | 2e-05 | 1e-05 |
9 | no_pos | 1e-05 | 3e-05 | 2e-05 | 1e-05 | 1e-05 | 1e-05 | 1e-05 | 1e-05 |
--batch_size
name | RTE | MRPC | SST-2 | CoLA | QQP | QNLI | MNLI | PAWS | |
---|---|---|---|---|---|---|---|---|---|
0 | orig | 16 | 16 | 32 | 16 | 16 | 32 | 32 | 16 |
1 | n_1 | 32 | 32 | 16 | 32 | 32 | 16 | 32 | 16 |
2 | n_2 | 32 | 16 | 32 | 16 | 32 | 32 | 16 | 32 |
3 | n_3 | 32 | 32 | 16 | 32 | 32 | 16 | 32 | 32 |
4 | n_4 | 32 | 16 | 32 | 16 | 32 | 32 | 32 | 32 |
5 | r512 | 32 | 16 | 16 | 32 | 32 | 16 | 16 | 16 |
6 | rand_corpus | 16 | 16 | 16 | 16 | 32 | 16 | 16 | 32 |
7 | rand_uniform | 16 | 32 | 16 | 16 | 32 | 16 | 16 | 16 |
8 | rand_init | 16 | 16 | 32 | 16 | 16 | 16 | 32 | 16 |
9 | no_pos | 16 | 32 | 16 | 16 | 32 | 16 | 16 | 16 |
- Perform inference similar to RoBERTa as well:
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained(
'checkpoints/',
checkpoint_file='checkpoint_best.pt',
data_name_or_path='PAWS-bin'
)
label_fn = lambda label: roberta.task.label_dictionary.string(
[label + roberta.task.label_dictionary.nspecial]
)
ncorrect, nsamples = 0, 0
roberta.cuda()
roberta.eval()
with open('paws_data/dev.tsv') as fin:
fin.readline()
for index, line in enumerate(fin):
tokens = line.strip().split('\t')
sent1, sent2, target = tokens[0], tokens[1], tokens[2]
tokens = roberta.encode(sent1, sent2)
prediction = roberta.predict('sentence_classification_head', tokens).argmax().item()
prediction_label = label_fn(prediction)
ncorrect += int(prediction_label == target)
nsamples += 1
print('| Accuracy: ', float(ncorrect)/float(nsamples))