mrm8488 committed
Commit 211d1d2
1 Parent(s): 9c61b9e

Update README.md

Files changed (1)
  1. README.md +0 -58
README.md CHANGED
@@ -41,64 +41,6 @@ It achieves the following results on the evaluation set:
  Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.
  In [DeBERTa V3](https://arxiv.org/abs/2111.09543), we replaced the MLM objective with the RTD (Replaced Token Detection) objective introduced by ELECTRA for pre-training, along with some innovations to be introduced in our upcoming paper. Compared to DeBERTa-V2, our V3 version significantly improves model performance on downstream tasks. You can find a brief introduction to the model in Appendix A11 of our original [paper](https://arxiv.org/abs/2006.03654); we will provide more details in a separate write-up.
  The DeBERTa V3 small model comes with 6 layers and a hidden size of 768. Its total parameter count is 143M, since we use a vocabulary of 128K tokens, which introduces 98M parameters in the embedding layer. This model was trained on the same 160GB of data as DeBERTa V2.
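As a quick sanity check on the parameter budget quoted above: the embedding table alone contributes roughly vocabulary size × hidden size parameters. A minimal sketch, assuming the stated 128K-token vocabulary and hidden size of 768 (the exact DeBERTa-v3 vocabulary size may differ slightly):

```bash
# Rough arithmetic for the figures quoted above (assumption: 128K vocabulary, hidden size 768).
echo $((128000 * 768))   # -> 98304000, i.e. ~98M embedding parameters
echo $((143 - 98))       # -> 45, i.e. ~45M parameters (in millions) left for the 6-layer backbone
```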
- #### Fine-tuning on NLU tasks
- We present the dev results on SQuAD 1.1/2.0 and MNLI tasks.
- | Model                 | SQuAD 1.1 | SQuAD 2.0 | MNLI-m |
- |-----------------------|-----------|-----------|--------|
- | RoBERTa-base          | 91.5/84.6 | 83.7/80.5 | 87.6   |
- | XLNet-base            | -/-       | -/80.2    | 86.8   |
- | DeBERTa-base          | 93.1/87.2 | 86.2/83.1 | 88.8   |
- | **DeBERTa-v3-small**  | -/-       | -/-       | 88.2   |
- | DeBERTa-v3-small+SiFT | -/-       | -/-       | 88.8   |
- #### Fine-tuning with HF transformers
- ```bash
- #!/bin/bash
- cd transformers/examples/pytorch/text-classification/
- pip install datasets
- export TASK_NAME=mnli
- output_dir="ds_results"
- num_gpus=8
- batch_size=8
- python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
-   run_glue.py \
-   --model_name_or_path microsoft/deberta-v3-small \
-   --task_name $TASK_NAME \
-   --do_train \
-   --do_eval \
-   --evaluation_strategy steps \
-   --max_seq_length 256 \
-   --warmup_steps 1000 \
-   --per_device_train_batch_size ${batch_size} \
-   --learning_rate 3e-5 \
-   --num_train_epochs 3 \
-   --output_dir $output_dir \
-   --overwrite_output_dir \
-   --logging_steps 1000 \
-   --logging_dir $output_dir
- ```
- ### Citation
- If you find DeBERTa useful for your work, please cite the following papers:
- ```latex
- @misc{he2021debertav3,
-   title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
-   author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
-   year={2021},
-   eprint={2111.09543},
-   archivePrefix={arXiv},
-   primaryClass={cs.CL}
- }
- ```
- ```latex
- @inproceedings{he2021deberta,
-   title={DeBERTa: Decoding-Enhanced BERT with Disentangled Attention},
-   author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
-   booktitle={International Conference on Learning Representations},
-   year={2021},
-   url={https://openreview.net/forum?id=XPZIaotutsD}
- }
- ```
-
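The fine-tuning script removed above launches distributed training with `python -m torch.distributed.launch`, which PyTorch has since deprecated in favor of `torchrun`. A minimal equivalent is sketched below, assuming PyTorch >= 1.10, a local checkout of the transformers text-classification example, and the same hyperparameters as the removed script:

```bash
# Sketch only: the same MNLI fine-tuning run as the removed script, launched with torchrun
# instead of the deprecated torch.distributed.launch. Assumes 8 GPUs are available.
cd transformers/examples/pytorch/text-classification/
pip install datasets

torchrun --nproc_per_node=8 run_glue.py \
  --model_name_or_path microsoft/deberta-v3-small \
  --task_name mnli \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 1000 \
  --per_device_train_batch_size 8 \
  --learning_rate 3e-5 \
  --num_train_epochs 3 \
  --output_dir ds_results \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir ds_results
```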
 
  ## Intended uses & limitations
 