Pengcheng He committed
Commit 5385f0f
1 Parent(s): 701b7f5

Add deepspeed config
Files changed (2)
  1. README.md +37 -5
  2. ds_config.json +23 -0
README.md CHANGED
@@ -36,16 +36,48 @@ We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.
  --------
  #### Notes.
  - <sup>1</sup> Following RoBERTa, for RTE, MRPC, STS-B, we fine-tune the tasks based on [DeBERTa-Large-MNLI](https://huggingface.co/microsoft/deberta-large-mnli), [DeBERTa-XLarge-MNLI](https://huggingface.co/microsoft/deberta-xlarge-mnli), [DeBERTa-V2-XLarge-MNLI](https://huggingface.co/microsoft/deberta-v2-xlarge-mnli), [DeBERTa-V2-XXLarge-MNLI](https://huggingface.co/microsoft/deberta-v2-xxlarge-mnli). The results of SST-2/QQP/QNLI/SQuADv2 will also be slightly improved when starting from MNLI fine-tuned models; however, we only report the numbers fine-tuned from pretrained base models for those 4 tasks.
- - <sup>2</sup> To try the **XXLarge** model with **[HF transformers](https://huggingface.co/transformers/main_classes/trainer.html)**, you need to specify **--sharded_ddp**
-
+ - <sup>2</sup> To try the **XXLarge** model with **[HF transformers](https://huggingface.co/transformers/main_classes/trainer.html)**, we recommend using **deepspeed**, as it is faster and saves memory.
+
+ Run with `Deepspeed`:
+
+ ```bash
+ pip install datasets
+ pip install deepspeed
+
+ # Download the deepspeed config file
+ wget https://huggingface.co/microsoft/deberta-v2-xxlarge-mnli/resolve/main/ds_config.json -O ds_config.json
+
+ export TASK_NAME=rte
+ output_dir="ds_results"
+ num_gpus=8
+ batch_size=4
+ python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
+   run_glue.py \
+   --model_name_or_path microsoft/deberta-v2-xxlarge-mnli \
+   --task_name $TASK_NAME \
+   --do_train \
+   --do_eval \
+   --max_seq_length 256 \
+   --per_device_train_batch_size ${batch_size} \
+   --learning_rate 3e-6 \
+   --num_train_epochs 3 \
+   --output_dir $output_dir \
+   --overwrite_output_dir \
+   --logging_steps 10 \
+   --logging_dir $output_dir \
+   --deepspeed ds_config.json
+ ```
+
+ You can also run with `--sharded_ddp`:
  ```bash
  cd transformers/examples/text-classification/
- export TASK_NAME=mrpc
- python -m torch.distributed.launch --nproc_per_node=8 run_glue.py --model_name_or_path microsoft/deberta-v2-xxlarge \
- --task_name $TASK_NAME --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 4 \
+ export TASK_NAME=rte
+ python -m torch.distributed.launch --nproc_per_node=8 run_glue.py --model_name_or_path microsoft/deberta-v2-xxlarge-mnli \
+ --task_name $TASK_NAME --do_train --do_eval --max_seq_length 256 --per_device_train_batch_size 4 \
  --learning_rate 3e-6 --num_train_epochs 3 --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --sharded_ddp --fp16
  ```

+
  ### Citation

  If you find DeBERTa useful for your work, please cite the following paper:
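For reference, the launch flags in the added hunk determine the effective global batch size. A minimal sketch of that arithmetic, assuming the HF Trainer default of `gradient_accumulation_steps=1` since the flag is not passed above:

```python
# Global batch size implied by the deepspeed launch command above.
num_gpus = 8                      # --nproc_per_node
per_device_train_batch_size = 4   # --per_device_train_batch_size
gradient_accumulation_steps = 1   # HF Trainer default when the flag is unset

global_batch_size = (
    num_gpus * per_device_train_batch_size * gradient_accumulation_steps
)
print(global_batch_size)  # 32 samples per optimizer step
```

Raising `batch_size` or adding `--gradient_accumulation_steps` scales this product accordingly.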
ds_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+ "fp16": {
+
+ "enabled": true,
+
+ "initial_scale_power": 12
+
+ },
+ "zero_optimization": {
+
+ "stage": 2,
+
+ "reduce_bucket_size": 5e7,
+
+ "allgather_bucket_size": 1.25e9,
+
+ "overlap_comm": true,
+
+ "contiguous_gradients": true
+
+ },
+ "zero_allow_untested_optimizer": true
+ }
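A quick way to sanity-check the added config before launching is to parse it and assert the settings you expect. The sketch below embeds the JSON as a literal for self-containment; in practice you would `json.load` the downloaded `ds_config.json`. Note that `initial_scale_power` is an exponent: DeepSpeed's initial dynamic loss scale is 2 to that power.

```python
import json

# The ds_config.json added in this commit, embedded as a literal for a
# self-contained sanity check.
ds_config = json.loads("""
{
  "fp16": {"enabled": true, "initial_scale_power": 12},
  "zero_optimization": {
    "stage": 2,
    "reduce_bucket_size": 5e7,
    "allgather_bucket_size": 1.25e9,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "zero_allow_untested_optimizer": true
}
""")

assert ds_config["fp16"]["enabled"]
# initial_scale_power=12 means the fp16 loss scale starts at 2**12 = 4096
assert 2 ** ds_config["fp16"]["initial_scale_power"] == 4096
assert ds_config["zero_optimization"]["stage"] == 2  # ZeRO stage-2 sharding
print("config OK")
```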