	---
	thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
	license: mit
	---

	## DeBERTa: Decoding-enhanced BERT with Disentangled Attention

	[DeBERTa](https://arxiv.org/abs/2006.03654) improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With these two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB of training data.

	Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.

	This is the DeBERTa V2 XXLarge model with 48 layers and a hidden size of 1536. It has 1.5B parameters in total and was trained on 160GB of data.
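
As a rough sanity check on the 1.5B figure, the sketch below estimates the parameter count of a standard Transformer encoder with 48 layers and a hidden size of 1536. The 4x feed-forward expansion and the vocabulary of roughly 128K tokens are assumptions not stated above, and the estimate ignores DeBERTa-specific components such as relative-position embeddings, so treat the result as an approximation rather than the exact count.

```python
def estimate_params(num_layers, hidden, vocab_size, ffn_mult=4):
    """Approximate parameter count of a plain Transformer encoder.

    ffn_mult and vocab_size are illustrative assumptions, not values
    taken from the model card.
    """
    # Query, key, value and output projections: weights plus biases.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections (hidden -> ffn_dim -> hidden) with biases.
    ffn_dim = ffn_mult * hidden
    feed_forward = hidden * ffn_dim + ffn_dim + ffn_dim * hidden + hidden
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    # Token embedding matrix only; position-related tables are ignored.
    embeddings = vocab_size * hidden
    return num_layers * per_layer + embeddings


total = estimate_params(num_layers=48, hidden=1536, vocab_size=128_000)
print(f"Estimated parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands near 1.56B, which is in line with the stated 1.5B.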


	#### Fine-tuning on NLU tasks

	We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

	| Model | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI-m/mm | SST-2 | QNLI | CoLA | RTE | MRPC (acc/f1) | QQP | STS-B |
	|-------|-------------------|-------------------|-----------|-------|------|------|-----|---------------|-----|-------|
	| BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6/- | 93.2 | 92.3 | 60.6 | 70.4 | 88.0/- | 91.3 | 90.0 |
	| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2/- | 96.4 | 93.9 | 68.0 | 86.6 | 90.9/- | 92.2 | 92.4 |
	| XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8/- | 97.0 | 94.9 | 69.0 | 85.9 | 90.8/- | 92.3 | 92.5 |
	| [DeBERTa-Large](https://huggingface.co/microsoft/deberta-large) | 95.5/90.1 | 90.7/88.0 | 91.3/91.1 | 96.5 | 95.3 | 69.5 | 86.6 | 92.6/94.6 | 92.3 | 92.5 |
	| [DeBERTa-XLarge](https://huggingface.co/microsoft/deberta-xlarge) | -/- | -/- | 91.5/91.2 | - | - | - | 89.5 | 92.1/94.3 | - | - |
	| [DeBERTa-XLarge-V2](https://huggingface.co/microsoft/deberta-xlarge-v2) | - | - | 91.7/91.6 | - | - | - | - | - | - | - |
	| [DeBERTa-XXLarge-V2](https://huggingface.co/microsoft/deberta-xxlarge-v2) | 96.1/91.4 | 92.2/89.7 | 91.7/91.9 | - | - | - | - | - | - | - |
	| [DeBERTa-XLarge-V2-MNLI](https://huggingface.co/microsoft/deberta-xlarge-v2-mnli) | - | - | 91.7/91.6 | - | - | - | 93.9 | - | - | - |
	| [DeBERTa-XXLarge-V2-MNLI](https://huggingface.co/microsoft/deberta-xxlarge-v2-mnli) | - | - | 91.7/91.9 | - | - | - | 93.5 | - | - | - |
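
To make the comparison concrete, the following sketch counts the tasks on which DeBERTa-Large beats RoBERTa-Large, using values copied from the table above. Taking only the first number of each paired score (for example MNLI-m for MNLI-m/mm) is a simplification chosen for this illustration.

```python
# Dev scores copied from the table; for paired scores only the first number is used.
TASKS = ["SQuAD 1.1", "SQuAD 2.0", "MNLI-m", "SST-2", "QNLI",
         "CoLA", "RTE", "MRPC", "QQP", "STS-B"]
ROBERTA_LARGE = [94.6, 89.4, 90.2, 96.4, 93.9, 68.0, 86.6, 90.9, 92.2, 92.4]
DEBERTA_LARGE = [95.5, 90.7, 91.3, 96.5, 95.3, 69.5, 86.6, 92.6, 92.3, 92.5]


def compare(baseline, candidate):
    """Return (wins, ties, losses) of candidate against baseline."""
    wins = sum(c > b for b, c in zip(baseline, candidate))
    ties = sum(c == b for b, c in zip(baseline, candidate))
    losses = len(baseline) - wins - ties
    return wins, ties, losses


wins, ties, losses = compare(ROBERTA_LARGE, DEBERTA_LARGE)
print(f"DeBERTa-Large vs RoBERTa-Large: {wins} wins, {ties} ties, {losses} losses")

# Per-task difference in points (positive means DeBERTa-Large is higher).
for task, base, cand in zip(TASKS, ROBERTA_LARGE, DEBERTA_LARGE):
    print(f"{task:10s} {cand - base:+.1f}")
```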




	## Note

	To fine-tune the XXLarge model with [HF transformers](https://huggingface.co/transformers/main_classes/trainer.html), you need to pass the `--sharded_ddp` flag:

	```bash
	# Run from the GLUE example directory of a transformers checkout
	cd transformers/examples/text-classification/
	export TASK_NAME=mrpc
	# Launch one process per GPU (8 here) with sharded DDP and mixed precision
	python -m torch.distributed.launch --nproc_per_node=8 run_glue.py \
	  --model_name_or_path microsoft/deberta-xxlarge-v2 \
	  --task_name "$TASK_NAME" --do_train --do_eval \
	  --max_seq_length 128 --per_device_train_batch_size 4 \
	  --learning_rate 3e-6 --num_train_epochs 3 \
	  --output_dir "/tmp/$TASK_NAME/" --overwrite_output_dir \
	  --sharded_ddp --fp16
	```
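
The flags in the command determine the effective batch size and the number of optimizer steps. The sketch below works through that arithmetic. It assumes no gradient accumulation (the command does not set it) and an MRPC training set of 3,668 examples, a figure that is not stated in this document.

```python
import math


def training_schedule(num_gpus, per_device_batch, num_examples, epochs, grad_accum=1):
    """Return (global batch size, steps per epoch, total optimizer steps)."""
    # Each optimizer step consumes one batch from every GPU.
    global_batch = num_gpus * per_device_batch * grad_accum
    # The last, partially filled batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return global_batch, steps_per_epoch, steps_per_epoch * epochs


# Values from the command: --nproc_per_node=8, --per_device_train_batch_size 4,
# --num_train_epochs 3. The dataset size is an assumption for MRPC.
global_batch, steps_per_epoch, total_steps = training_schedule(
    num_gpus=8, per_device_batch=4, num_examples=3668, epochs=3
)
print(f"global batch size: {global_batch}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
```

With these inputs the global batch size is 32, so changing the number of GPUs changes the effective batch size unless `--per_device_train_batch_size` is adjusted to compensate.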

	### Citation

	If you find DeBERTa useful for your work, please cite the following paper:

	```bibtex
	@misc{he2020deberta,
	    title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
	    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
	    year={2020},
	    eprint={2006.03654},
	    archivePrefix={arXiv},
	    primaryClass={cs.CL}
	}
	```