---
language: en
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
tags:
- text-classification
license: mit
---

# XtremeDistilTransformers for Distilling Massive Neural Networks

XtremeDistilTransformers is a distilled, task-agnostic transformer model. It uses task transfer to learn a small universal model that can be applied to arbitrary tasks and languages, as outlined in the paper [XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation](https://arxiv.org/abs/2106.04563).

We combine task transfer with the multi-task distillation techniques from the papers [XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf) and [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957). The code is available on [GitHub](https://github.com/microsoft/xtreme-distil-transformers).


This l6-h256 checkpoint has 6 layers and a hidden size of 256, which corresponds to 13 million parameters and an 8.7x speedup over BERT-base.
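The parameter counts quoted for these checkpoints can be sanity-checked from the architecture alone. The sketch below is a rough estimate, not the authors' accounting. It assumes a standard BERT-style encoder with a 30,522-token uncased WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward size of 4x the hidden size, and a pooler layer. None of these values are stated in this card, so they should be checked against the checkpoint's `config.json`. The helper name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4, with_pooler=True):
    """Estimate the parameter count of a BERT-style encoder.

    All defaults are assumptions (standard BERT-base-uncased conventions),
    not values taken from the model card.
    """
    layer_norm = 2 * hidden  # scale and bias vectors
    # Word, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden) with biases, then LayerNorm.
    ffn_size = ffn_mult * hidden
    feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden) + layer_norm
    # Optional pooler: one dense layer on top of the [CLS] position.
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + num_layers * (attention + feed_forward) + pooler


for name, layers, hidden in [("l6-h256", 6, 256), ("l6-h384", 6, 384), ("l12-h384", 12, 384)]:
    print(f"{name}: {bert_param_count(layers, hidden) / 1e6:.1f}M parameters")
```

Under these assumptions the estimates come out at roughly 12.8M, 22.7M, and 33.4M, in line with the 13, 22, and 33 million listed in the results table. The number of attention heads does not enter the count, because the heads only partition the same projection matrices.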

Other available checkpoints: [xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased) and [xtremedistil-l12-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l12-h384-uncased).

The following table shows results on the GLUE dev set and SQuAD-v2. Speedup is measured relative to BERT-base.

| Models | #Params (millions) | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQuAD-v2 | Avg |
|----------------|--------|---------|------|------|------|------|------|------|--------|-------|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |
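The Avg column appears to be the unweighted mean of the seven task scores in each row. The card does not state this explicitly, so treat it as an assumption. A short sketch that recomputes it for the three XtremeDistil rows (the helper `mean_score` is a hypothetical name):

```python
# Per-task scores copied from the table above, in column order:
# MNLI, QNLI, QQP, RTE, SST, MRPC, SQuAD-v2.
scores = {
    "XtremeDistil-l6-h256": [83.9, 89.5, 90.6, 80.1, 91.2, 90.0, 74.1],
    "XtremeDistil-l6-h384": [85.4, 90.3, 91.0, 80.9, 92.3, 90.0, 76.6],
    "XtremeDistil-l12-h384": [87.2, 91.9, 91.3, 85.6, 93.1, 90.4, 80.2],
}

# Values from the Avg column of the same rows.
reported_avg = {
    "XtremeDistil-l6-h256": 85.6,
    "XtremeDistil-l6-h384": 86.6,
    "XtremeDistil-l12-h384": 88.5,
}


def mean_score(values):
    """Unweighted mean, rounded to one decimal like the table."""
    return round(sum(values) / len(values), 1)


for model, values in scores.items():
    print(f"{model}: recomputed {mean_score(values)}, reported {reported_avg[model]}")
```

The recomputed means match the reported values for all three rows, which supports reading Avg as a simple average across tasks.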

Tested with `tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0`.
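A minimal sketch of loading the checkpoint with `transformers` and `torch` to extract sentence embeddings. It assumes the checkpoint loads through the generic `AutoTokenizer` and `AutoModel` classes under the Hub ID `microsoft/xtremedistil-l6-h256-uncased`. The helper `embed` and its mean-pooling step are illustrative choices, not something this card prescribes. The first call downloads the weights from the Hub.

```python
MODEL_ID = "microsoft/xtremedistil-l6-h256-uncased"


def embed(texts, model_id=MODEL_ID):
    """Return one mean-pooled embedding per input text (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies to be installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

    # Mean pooling over real tokens only; padding positions are masked out.
    mask = inputs["attention_mask"].unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed(["a short example sentence"])` should return a tensor with one row per input text and as many columns as the model's hidden size. For GLUE-style tasks like those in the table, the checkpoint would instead be fine-tuned, for example through `AutoModelForSequenceClassification`, which adds a freshly initialized classification head.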

If you use this checkpoint in your work, please cite:

``` latex
@misc{mukherjee2021xtremedistiltransformers,
  title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation},
  author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
  year={2021},
  eprint={2106.04563},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```