microsoft/xtremedistil-l6-h256-uncased


	---
	language: en
	thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
	tags:
	- text-classification
	license: mit
	---

	# XtremeDistil-Transformers for Distilling Massive Neural Networks

	XtremeDistil is a distilled, task-agnostic transformer model that leverages multi-task distillation techniques from the papers "[XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf)" and "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)". The code is available on [GitHub](https://github.com/microsoft/xtreme-distil-transformers).

	This l6-h256 checkpoint, with 6 layers, a hidden size of 256, and 8 attention heads, has 13 million parameters and gives an 8.7x speedup over BERT-base.

	The following table shows results on the GLUE dev set and SQuAD-v2. Parameter counts are in millions, and speedups are relative to BERT-base.

	| Model | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | SQuAD-v2 | Avg |
	|-----------------------|------|------|------|------|------|------|------|------|------|------|
	| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
	| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
	| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
	| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
	| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
	| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
	| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
	| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |
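The parameter counts in the table can be roughly reproduced from the layer count and hidden size alone. The sketch below is not taken from the XtremeDistil code. It assumes a standard BERT-style encoder with the `bert-base-uncased` vocabulary (30522 tokens), 512 positions, 2 segment types, a feed-forward size of 4x the hidden size, and a pooler layer. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, type_vocab=2, ffn_mult=4):
    """Estimate the parameter count of a BERT-style encoder.

    All defaults are assumptions (bert-base-uncased conventions),
    not values read from the checkpoint itself.
    """
    ffn = ffn_mult * hidden
    layer_norm = 2 * hidden  # scale and bias

    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense pooler applied to the first token.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    for name, layers, hidden in [
        ("BERT-base", 12, 768),
        ("XtremeDistil-l6-h256", 6, 256),
        ("XtremeDistil-l6-h384", 6, 384),
        ("XtremeDistil-l12-h384", 12, 384),
    ]:
        print(f"{name}: {bert_param_count(layers, hidden) / 1e6:.1f}M")
```

Under these assumptions the estimates come out at about 109.5M, 12.8M, 22.7M and 33.4M, in line with the 109, 13, 22 and 33 reported in the table. Note that for the smaller models the embedding matrix accounts for most of the parameters.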

	Tested with `tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0`
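A minimal sketch of loading the checkpoint with the `transformers` library and PyTorch is shown below. It assumes the model ID `microsoft/xtremedistil-l6-h256-uncased` on the Hugging Face Hub. The helper `embed` and its mean pooling over token vectors are illustrative choices, not part of the released code. Because the checkpoint is task-agnostic, it would need fine-tuning before use on a downstream task such as text classification.

```python
MODEL_ID = "microsoft/xtremedistil-l6-h256-uncased"  # assumed Hub model ID


def embed(sentences, model_id=MODEL_ID):
    """Return one mean-pooled vector per sentence (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies to be installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

    # Average token vectors, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed(["XtremeDistil is small.", "It is also fast."])` downloads the checkpoint on first use and returns a tensor with one row per input sentence.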

	If you use this checkpoint in your work, please cite:

	``` latex
	@inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
	title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
	author = "Mukherjee, Subhabrata and
	Hassan Awadallah, Ahmed",
	booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
	month = jul,
	year = "2020",
	address = "Online",
	publisher = "Association for Computational Linguistics",
	url = "https://www.aclweb.org/anthology/2020.acl-main.202",
	doi = "10.18653/v1/2020.acl-main.202",
	pages = "2221--2234",
	}
	```

	``` latex
	@misc{wang2020minilm,
	title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
	author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
	year={2020},
	eprint={2002.10957},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```