---
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
license: mit
---

## DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With these two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data.

Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.

This is the DeBERTa V2 xxlarge model (60%) with 48 layers and a hidden size of 1536. The total number of parameters is 1.5B.
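For reference, here is a minimal sketch of loading this checkpoint with the Transformers library to extract hidden states. It assumes `transformers` and `sentencepiece` are installed and uses the model id shown in this card; the example sentence is illustrative only.

```python
# Minimal usage sketch: run a forward pass through the pretrained encoder.
# Assumes the `transformers` and `sentencepiece` packages are installed.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "microsoft/deberta-xxlarge-v2"  # model id as given in this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("DeBERTa uses disentangled attention.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden size is 1536 for this xxlarge model.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 1536])
```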

### Fine-tuning on NLU tasks

We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

| Model | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI-m/mm (Acc) | SST-2 | QNLI | CoLA | RTE | MRPC (Acc/F1) | QQP | STS-B |
|-------|-------------------|-------------------|-----------------|-------|------|------|-----|---------------|-----|-------|
| BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6/- | 93.2 | 92.3 | 60.6 | 70.4 | 88.0/- | 91.3 | 90.0 |
| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2/- | 96.4 | 93.9 | 68.0 | 86.6 | 90.9/- | 92.2 | 92.4 |
| XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8/- | 97.0 | 94.9 | 69.0 | 85.9 | 90.8/- | 92.3 | 92.5 |
| DeBERTa-Large | 95.5/90.1 | 90.7/88.0 | 91.3/91.1 | 96.5 | 95.3 | 69.5 | 86.6 | 92.6/94.6 | 92.3 | 92.5 |
| DeBERTa-XLarge | -/- | -/- | 91.5/91.0 | - | - | - | 89.5 | 92.1/94.3 | - | - |
| DeBERTa-XLarge-V2 | - | - | 91.7/91.6 | - | - | - | - | - | - | - |
| DeBERTa-XXLarge-V2 (60%) | 96.1/91.4 | 92.2/89.7 | 91.7/91.8 | - | - | - | - | - | - | - |
| DeBERTa-XLarge-V2-mnli | - | - | 91.7/91.6 | - | - | - | 93.9 | - | - | - |
| DeBERTa-XXLarge-V2-mnli | - | - | 91.7/91.8 | - | - | - | 93.5 | - | - | - |

#### Note

To try the XXLarge model with Hugging Face Transformers, you need to specify `--sharded_ddp`:

```bash
cd transformers/examples/text-classification/
export TASK_NAME=mnli  # GLUE task to fine-tune on (e.g. mnli)

python -m torch.distributed.launch --nproc_per_node=8 run_glue.py --model_name_or_path microsoft/deberta-xxlarge-v2 \
  --task_name $TASK_NAME --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 4 \
  --learning_rate 3e-6 --num_train_epochs 3 --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --sharded_ddp
```
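
After training, the saved checkpoint can be loaded back for inference. Below is a minimal sketch, assuming the run above used a classification task such as MNLI; the checkpoint path mirrors the `--output_dir` above and is illustrative.

```python
# Minimal inference sketch (illustrative path): load the checkpoint written by
# run_glue.py above and score a premise/hypothesis pair.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "/tmp/mnli/"  # replace with the --output_dir used during training
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Probabilities over the task's label set.
print(logits.softmax(dim=-1))
```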

### Citation

If you find DeBERTa useful for your work, please cite the following paper:

```bibtex
@misc{he2020deberta,
    title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
    year={2020},
    eprint={2006.03654},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```