# XtremeDistil-Transformers for Distilling Massive Neural Networks

XtremeDistil is a distilled, task-agnostic transformer model. It combines the multi-task distillation techniques from the papers "XtremeDistil: Multi-stage Distillation for Massive Multilingual Models" and "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers". The accompanying code is available on GitHub.

This l6-h384 checkpoint has 6 layers, a hidden size of 384, and 12 attention heads. That amounts to 22 million parameters and a 5.3x speedup over BERT-base.
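As a rough sanity check on the 22 million figure, the parameter count of a BERT-style encoder with these dimensions can be tallied by hand. The sketch below is not taken from the XtremeDistil code. The vocabulary size (30522), maximum sequence length (512), segment vocabulary (2), feed-forward width (4 x hidden size), and the presence of a pooler layer are all assumptions borrowed from the standard BERT-base uncased setup, and `EncoderConfig` and `count_parameters` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EncoderConfig:
    # Values stated for the l6-h384 checkpoint.
    num_layers: int = 6
    hidden_size: int = 384
    num_heads: int = 12
    # Assumptions: standard BERT-base uncased conventions, not stated in this document.
    intermediate_size: int = 4 * 384
    vocab_size: int = 30522
    max_positions: int = 512
    type_vocab_size: int = 2


def count_parameters(cfg: EncoderConfig) -> int:
    """Count weights and biases of a BERT-style encoder with a pooler."""
    h = cfg.hidden_size
    layer_norm = 2 * h  # scale and bias

    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.type_vocab_size) * h + layer_norm

    # Self-attention: query, key, value, and output projections, plus LayerNorm.
    # The number of heads only splits the hidden size; it does not change the count.
    attention = 4 * (h * h + h) + layer_norm

    # Feed-forward block: up-projection, down-projection, plus LayerNorm.
    ffn = (
        (h * cfg.intermediate_size + cfg.intermediate_size)
        + (cfg.intermediate_size * h + h)
        + layer_norm
    )

    # Pooler: one dense layer applied to the first token's hidden state.
    pooler = h * h + h

    return embeddings + cfg.num_layers * (attention + ffn) + pooler


total = count_parameters(EncoderConfig())
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the tally comes to about 22.7 million parameters, roughly half of them in the token embedding matrix, which is consistent with the 22 million quoted above.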

The following table shows results on the GLUE dev set and SQuAD-v2. Parameter counts are in millions, and speedup is measured relative to BERT.

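| Models | #Params | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |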
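
The Avg column appears to be the unweighted mean of the seven task scores (MNLI, QNLI, QQP, RTE, SST, MRPC, SQUAD2). The document does not state this, so treat it as an assumption. The short sketch below recomputes the mean for each row from the numbers in the table and compares it with the reported value; the dictionary layout and variable names are purely illustrative.

```python
# Per-task scores copied from the table, in the order:
# MNLI, QNLI, QQP, RTE, SST, MRPC, SQUAD2. The last value in each entry is the reported Avg.
results = {
    "BERT (109M)":        ([84.5, 91.7, 91.3, 68.6, 93.2, 87.3, 76.8], 84.8),
    "DistilBERT (66M)":   ([82.2, 89.2, 88.5, 59.9, 91.3, 87.5, 70.7], 81.3),
    "TinyBERT (66M)":     ([83.5, 90.5, 90.6, 72.2, 91.6, 88.4, 73.1], 84.3),
    "MiniLM (66M)":       ([84.0, 91.0, 91.0, 71.5, 92.0, 88.4, 76.4], 84.9),
    "MiniLM (22M)":       ([82.8, 90.3, 90.6, 68.9, 91.3, 86.6, 72.9], 83.3),
    "XtremeDistil (22M)": ([85.4, 90.3, 91.0, 80.9, 92.3, 90.0, 76.6], 86.6),
}

# Assumption: Avg is the plain arithmetic mean over the seven tasks.
recomputed = {name: sum(scores) / len(scores) for name, (scores, _) in results.items()}

for name, (_, reported) in results.items():
    print(f"{name:<20} recomputed={recomputed[name]:.2f} reported={reported}")

# Gap between the distilled 22M model and the BERT baseline on the averaged score.
gap_vs_bert = recomputed["XtremeDistil (22M)"] - recomputed["BERT (109M)"]
print(f"XtremeDistil minus BERT: {gap_vs_bert:+.2f} points")
```

Each recomputed mean agrees with the reported Avg to within rounding, which supports reading the column as a simple average.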

If you use this checkpoint in your work, please cite:

@inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
author = "Mukherjee, Subhabrata  and
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.202",
doi = "10.18653/v1/2020.acl-main.202",
pages = "2221--2234",
}
@misc{wang2020minilm,
title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
year={2020},
eprint={2002.10957},
archivePrefix={arXiv},
primaryClass={cs.CL}
}