# RoBERTa (large) fine-tuned on Winograd Schema Challenge (WSC) data

Steps reproduced from the original [repo](https://github.com/pytorch/fairseq/blob/master/examples/roberta/wsc/README.md).

The following instructions can be used to finetune RoBERTa on the WSC training
data provided by [SuperGLUE](https://super.gluebenchmark.com/).

Note that there is high variance in the results. For our GLUE/SuperGLUE
submission we swept over the learning rate (1e-5, 2e-5, 3e-5), batch size (16,
32, 64) and total number of updates (500, 1000, 2000, 3000), as well as the
random seed. Out of ~100 runs we chose the best 7 models and ensembled them.

**Approach:** The instructions below use a slightly different loss function than
what's described in the original RoBERTa arXiv paper. In particular,
[Kocijan et al. (2019)](https://arxiv.org/abs/1905.06290) introduce a margin
ranking loss between `(query, candidate)` pairs with tunable hyperparameters
alpha and beta. This is supported in our code as well with the
`--wsc-margin-alpha` and `--wsc-margin-beta` arguments. However, we achieved
slightly better (and more robust) results on the development set by instead
using a single cross entropy loss term over the log-probabilities for the query
and all mined candidates. **The candidates are mined using spaCy from each input
sentence in isolation, so the approach remains strictly pointwise.** This
reduces the number of hyperparameters and our best model achieved 92.3%
development set accuracy, compared to ~90% accuracy for the margin loss. Later
versions of the RoBERTa arXiv paper will describe this updated formulation.

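To make the contrast concrete, here is a minimal, illustrative sketch of the two
loss formulations (this is not the fairseq criterion; `query_lprobs` is assumed
to be the log-probability the model assigns to the correct query span, and
`cand_lprobs` the log-probabilities of the mined candidate spans):

```python
import torch
import torch.nn.functional as F

def wsc_loss(query_lprobs, cand_lprobs, cross_entropy=True, alpha=5.0, beta=0.4):
    """query_lprobs: 0-d tensor, log-prob of the correct (query) span.
    cand_lprobs: 1-d tensor, log-probs of the mined candidate spans."""
    if cross_entropy:
        # Single cross-entropy term over the query and all mined candidates,
        # treating the query as the correct class (index 0); alpha and beta
        # are unused, which is the hyperparameter reduction mentioned above.
        scores = torch.cat([query_lprobs.view(1), cand_lprobs]).unsqueeze(0)
        return F.cross_entropy(scores, scores.new_zeros(1, dtype=torch.long))
    # Margin ranking loss in the spirit of Kocijan et al. (2019): every candidate
    # should score at least `beta` below the query, weighted by `alpha`.
    return -query_lprobs + alpha * (cand_lprobs - query_lprobs + beta).clamp(min=0).sum()
```
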
### 1) Download the WSC data from the SuperGLUE website:
```bash
wget https://dl.fbaipublicfiles.com/glue/superglue/data/v2/WSC.zip
unzip WSC.zip

# we also need to copy the RoBERTa dictionary into the same directory
wget -O WSC/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
```

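The extracted `WSC/` directory contains JSON-lines files. A quick way to
sanity-check the download (the field names here follow the SuperGLUE v2 release
and are an assumption, not part of the fairseq example):

```python
import json

# Peek at the first training example.
with open('WSC/train.jsonl') as f:
    example = json.loads(next(f))

print(example['text'])                  # full sentence
print(example['target']['span1_text'])  # candidate noun phrase
print(example['target']['span2_text'])  # pronoun to resolve
print(example['label'])                 # boolean coreference label (train/val only)
```
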
### 2) Finetune over the provided training data:
```bash
TOTAL_NUM_UPDATES=2000  # Total number of training steps.
WARMUP_UPDATES=250      # Linearly increase LR over this many steps.
LR=2e-05                # Peak LR for polynomial LR scheduler.
MAX_SENTENCES=16        # Batch size per GPU.
SEED=1                  # Random seed.
ROBERTA_PATH=/path/to/roberta/model.pt

# we use the --user-dir option to load the task and criterion
# from the examples/roberta/wsc directory:
FAIRSEQ_PATH=/path/to/fairseq
FAIRSEQ_USER_DIR=${FAIRSEQ_PATH}/examples/roberta/wsc

CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train WSC/ \
    --restore-file $ROBERTA_PATH \
    --reset-optimizer --reset-dataloader --reset-meters \
    --no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --valid-subset val \
    --fp16 --ddp-backend no_c10d \
    --user-dir $FAIRSEQ_USER_DIR \
    --task wsc --criterion wsc --wsc-cross-entropy \
    --arch roberta_large --bpe gpt2 --max-positions 512 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 \
    --lr-scheduler polynomial_decay --lr $LR \
    --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_NUM_UPDATES \
    --max-sentences $MAX_SENTENCES \
    --max-update $TOTAL_NUM_UPDATES \
    --log-format simple --log-interval 100 \
    --seed $SEED
```

The above command assumes training on 4 GPUs, but you can achieve the same
results on a single GPU by adding `--update-freq=4`.

### 3) Evaluate
```python
from fairseq.models.roberta import RobertaModel
from examples.roberta.wsc import wsc_utils  # also loads WSC task and criterion
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'WSC/')
roberta.cuda()
nsamples, ncorrect = 0, 0
for sentence, label in wsc_utils.jsonl_iterator('WSC/val.jsonl', eval=True):
    pred = roberta.disambiguate_pronoun(sentence)
    nsamples += 1
    if pred == label:
        ncorrect += 1
print('Accuracy: ' + str(ncorrect / float(nsamples)))
# Accuracy: 0.9230769230769231
```

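The fine-tuned model can also be queried on a single sentence. In fairseq's WSC
example the pronoun is wrapped in square brackets and a candidate span may be
marked with underscores; a brief sketch reusing the `roberta` object from above
(the printed outputs are the expected behaviour, not a guarantee):

```python
# With a candidate marked by underscores the call returns True/False:
roberta.disambiguate_pronoun('The _trophy_ would not fit in the brown suitcase because [it] was too big.')
# True

# Without a marked candidate it returns the predicted referent as a string:
roberta.disambiguate_pronoun('The trophy would not fit in the brown suitcase because [it] was too big.')
# 'The trophy'
```
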
## RoBERTa training on WinoGrande dataset
We have also provided a `winogrande` task and criterion for finetuning on
[WinoGrande](https://mosaic.allenai.org/projects/winogrande)-like datasets,
where there are always exactly two candidates and one of them is correct.
This is a more efficient implementation for that special case.

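With exactly two candidates per example, the margin criterion configured below
(`--wsc-margin-alpha 5.0 --wsc-margin-beta 0.4`) reduces to a single hinge term
per example, assuming the margin form sketched earlier. A tiny numeric
illustration with made-up log-probabilities:

```python
import torch

alpha, beta = 5.0, 0.4
correct_lprob = torch.tensor(-1.2)  # log-prob of the correct option (made up)
wrong_lprob = torch.tensor(-1.5)    # log-prob of the other option (made up)

# The hinge is zero once the correct option beats the wrong one by at least beta.
loss = -correct_lprob + alpha * torch.clamp(wrong_lprob - correct_lprob + beta, min=0)
print(loss.item())  # 1.2 + 5.0 * max(0, -1.5 + 1.2 + 0.4) = 1.7
```
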
```bash
TOTAL_NUM_UPDATES=23750  # Total number of training steps.
WARMUP_UPDATES=2375      # Linearly increase LR over this many steps.
LR=1e-05                 # Peak LR for polynomial LR scheduler.
MAX_SENTENCES=32         # Batch size per GPU.
SEED=1                   # Random seed.
ROBERTA_PATH=/path/to/roberta/model.pt

# we use the --user-dir option to load the task and criterion
# from the examples/roberta/wsc directory:
FAIRSEQ_PATH=/path/to/fairseq
FAIRSEQ_USER_DIR=${FAIRSEQ_PATH}/examples/roberta/wsc

cd fairseq
CUDA_VISIBLE_DEVICES=0 fairseq-train winogrande_1.0/ \
    --restore-file $ROBERTA_PATH \
    --reset-optimizer --reset-dataloader --reset-meters \
    --no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --valid-subset val \
    --fp16 --ddp-backend no_c10d \
    --user-dir $FAIRSEQ_USER_DIR \
    --task winogrande --criterion winogrande \
    --wsc-margin-alpha 5.0 --wsc-margin-beta 0.4 \
    --arch roberta_large --bpe gpt2 --max-positions 512 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 \
    --lr-scheduler polynomial_decay --lr $LR \
    --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_NUM_UPDATES \
    --max-sentences $MAX_SENTENCES \
    --max-update $TOTAL_NUM_UPDATES \
    --log-format simple --log-interval 100
```

[Original repo](https://github.com/pytorch/fairseq/tree/master/examples/roberta/wsc)