Fine-tuning details

For each task (GLUE and PAWS), we perform a hyperparameter search for each model, and report the mean and standard deviation of the best configuration across 5 seeds. First, get the datasets following the instructions in the RoBERTa fine-tuning README. Alternatively, you can use Hugging Face datasets to get the task data:

from datasets import load_dataset
import pandas as pd
from pathlib import Path

# Map each task to its output directory, column order, and per-split filenames.
key2file = {
"paws": {
        "loc": "paws_data",
        "columns": ["id", "sentence1", "sentence2", "label"],
        "train": "train.tsv",
        "validation": "dev.tsv",
        "test": "test.tsv"
  }
}

# Download PAWS (the "labeled_final" Wikipedia subset) and write each split
# as a headerless TSV in the column order defined above.
task_data = load_dataset("paws", "labeled_final")
task_config = key2file["paws"]
save_path = Path(task_config["loc"])
save_path.mkdir(exist_ok=True, parents=True)
for key, fl in task_config.items():
    if key in ["loc", "columns"]:
        continue
    print(f"Reading {key}")
    columns = task_config["columns"]
    df = pd.DataFrame(task_data[key])
    print(df.columns)
    df = df[columns]
    print(f"Got {len(df)} records")
    save_loc = save_path / fl
    print(f"Saving to : {save_loc}")
    # No header or index column: downstream preprocessing expects raw TSV columns.
    df.to_csv(save_loc, sep="\t", header=False, index=False)
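
Before fine-tuning, each text column of these TSVs has to be BPE-encoded and binarized, as described in the first bullet below. The following is a minimal sketch of that step for the sentence1 column only, assuming it is run from the fairseq repository root and that encoder.json, vocab.bpe and dict.txt have been downloaded as in the RoBERTa GLUE preprocessing script; it is a sketch, not a substitute for the full preprocess_GLUE_tasks.sh flow:

# Sketch only: extract the sentence1 column (column 2 of the headerless TSV),
# BPE-encode it with the GPT-2 encoder, and binarize it with the RoBERTa dictionary.
for SPLIT in train dev; do
    cut -f2 "paws_data/${SPLIT}.tsv" > "paws_data/${SPLIT}.sentence1"
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json encoder.json \
        --vocab-bpe vocab.bpe \
        --inputs "paws_data/${SPLIT}.sentence1" \
        --outputs "paws_data/${SPLIT}.sentence1.bpe" \
        --workers 8 \
        --keep-empty
done

fairseq-preprocess \
    --only-source \
    --trainpref paws_data/train.sentence1.bpe \
    --validpref paws_data/dev.sentence1.bpe \
    --destdir PAWS-bin/input0 \
    --workers 8 \
    --srcdict dict.txt
# Repeat for sentence2 into PAWS-bin/input1, and binarize the raw label
# column (without --srcdict) into PAWS-bin/label.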
  • Preprocess using the RoBERTa GLUE preprocessing script (a minimal sketch of this step is given above), keeping in mind the column numbers for sentence1, sentence2 and label (1, 2 and 3, zero-indexed, if you save the data as in the example above, since column 0 holds the id).
  • Then, fine-tune in the same way as RoBERTa (for example, in the case of RTE):
TOTAL_NUM_UPDATES=30875  # Total number of training updates (computed from dataset size and batch size; see below)
WARMUP_UPDATES=1852      # 6 percent of the number of updates
LR=2e-05                # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=16        # Batch size.
SHUFFLED_ROBERTA_PATH=/path/to/shuffled_roberta/model.pt

CUDA_VISIBLE_DEVICES=0 fairseq-train RTE-bin/ \
    --restore-file $SHUFFLED_ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_large \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 10 \
    --find-unused-parameters \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
  • TOTAL_NUM_UPDATES is computed from the --batch-size value and the dataset size.
  • WARMUP_UPDATES is computed as 6% of TOTAL_NUM_UPDATES (a short sketch of this arithmetic follows the tables below).
  • The best hyperparameters for --lr and --batch-size are reported below:

--lr

| name | RTE | MRPC | SST-2 | CoLA | QQP | QNLI | MNLI | PAWS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| original | 2e-05 | 2e-05 | 1e-05 | 2e-05 | 1e-05 | 1e-05 | 1e-05 | 2e-05 |
| n_1 | 2e-05 | 1e-05 | 1e-05 | 1e-05 | 3e-05 | 1e-05 | 2e-05 | 2e-05 |
| n_2 | 2e-05 | 2e-05 | 1e-05 | 1e-05 | 2e-05 | 1e-05 | 1e-05 | 3e-05 |
| n_3 | 3e-05 | 1e-05 | 2e-05 | 2e-05 | 3e-05 | 1e-05 | 1e-05 | 2e-05 |
| n_4 | 3e-05 | 1e-05 | 2e-05 | 2e-05 | 2e-05 | 1e-05 | 1e-05 | 2e-05 |
| r512 | 1e-05 | 3e-05 | 2e-05 | 2e-05 | 3e-05 | 2e-05 | 3e-05 | 2e-05 |
| rand_corpus | 2e-05 | 1e-05 | 3e-05 | 1e-05 | 3e-05 | 3e-05 | 3e-05 | 2e-05 |
| rand_uniform | 2e-05 | 1e-05 | 3e-05 | 2e-05 | 3e-05 | 3e-05 | 3e-05 | 1e-05 |
| rand_init | 1e-05 | 1e-05 | 3e-05 | 1e-05 | 1e-05 | 1e-05 | 2e-05 | 1e-05 |
| no_pos | 1e-05 | 3e-05 | 2e-05 | 1e-05 | 1e-05 | 1e-05 | 1e-05 | 1e-05 |

--batch-size

| name | RTE | MRPC | SST-2 | CoLA | QQP | QNLI | MNLI | PAWS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| orig | 16 | 16 | 32 | 16 | 16 | 32 | 32 | 16 |
| n_1 | 32 | 32 | 16 | 32 | 32 | 16 | 32 | 16 |
| n_2 | 32 | 16 | 32 | 16 | 32 | 32 | 16 | 32 |
| n_3 | 32 | 32 | 16 | 32 | 32 | 16 | 32 | 32 |
| n_4 | 32 | 16 | 32 | 16 | 32 | 32 | 32 | 32 |
| r512 | 32 | 16 | 16 | 32 | 32 | 16 | 16 | 16 |
| rand_corpus | 16 | 16 | 16 | 16 | 32 | 16 | 16 | 32 |
| rand_uniform | 16 | 32 | 16 | 16 | 32 | 16 | 16 | 16 |
| rand_init | 16 | 16 | 32 | 16 | 16 | 16 | 32 | 16 |
| no_pos | 16 | 32 | 16 | 16 | 32 | 16 | 16 | 16 |
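
As noted in the bullets above, TOTAL_NUM_UPDATES and WARMUP_UPDATES are derived from the dataset size, the batch size, and the number of epochs. The sketch below illustrates the arithmetic; the helper name and the floor-division rounding are illustrative assumptions, not code from this repository:

def update_counts(num_train_examples, batch_size, num_epochs=10, warmup_frac=0.06):
    # Total optimizer steps over num_epochs passes through the training set,
    # plus a warmup of 6% of those steps (the rounding convention is an assumption).
    total_num_updates = (num_train_examples * num_epochs) // batch_size
    warmup_updates = int(warmup_frac * total_num_updates)
    return total_num_updates, warmup_updates

# Example: PAWS (labeled_final) has 49,401 training pairs; at batch size 16
# and 10 epochs this gives (30875, 1852).
print(update_counts(49401, 16))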
  • Perform inference similarly to RoBERTa:
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='PAWS-bin'
)

# Map a predicted class index back to its label string; the first
# `nspecial` entries of the label dictionary are special symbols.
label_fn = lambda label: roberta.task.label_dictionary.string(
    [label + roberta.task.label_dictionary.nspecial]
)
ncorrect, nsamples = 0, 0
roberta.cuda()
roberta.eval()
with open('paws_data/dev.tsv') as fin:
    for index, line in enumerate(fin):
        tokens = line.strip().split('\t')
        # Columns are id, sentence1, sentence2, label (no header row) when the
        # data is saved as in the snippet above.
        sent1, sent2, target = tokens[1], tokens[2], tokens[3]
        tokens = roberta.encode(sent1, sent2)
        prediction = roberta.predict('sentence_classification_head', tokens).argmax().item()
        prediction_label = label_fn(prediction)
        ncorrect += int(prediction_label == target)
        nsamples += 1
print('| Accuracy: ', float(ncorrect) / float(nsamples))
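
Since the reported numbers are the mean and standard deviation across 5 seeds (see the top of this section), the accuracies printed by the inference snippet can be aggregated per task as in this sketch; the list entries are placeholders to be replaced with the per-seed results:

import statistics

# Placeholders: replace with the dev accuracy from each of the 5 fine-tuning seeds.
seed_accuracies = [0.0, 0.0, 0.0, 0.0, 0.0]

mean_acc = statistics.mean(seed_accuracies)
std_acc = statistics.stdev(seed_accuracies)  # sample std; the exact estimator is an assumption
print(f"accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")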