---
language: en
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
tags:
- text-classification
license: mit
---

# XtremeDistil-Transformers for Distilling Massive Neural Networks

XtremeDistil is a distilled task-agnostic transformer model that leverages the multi-task distillation techniques from the papers "[XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf)" and "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)", with the accompanying [GitHub code](https://github.com/microsoft/xtreme-distil-transformers).

This l6-h384 checkpoint, with **6** layers, a hidden size of **384**, and **12** attention heads, has **22 million** parameters and offers a **5.3x** speedup over BERT-base.
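
A minimal loading and inference sketch, assuming the checkpoint is published on the Hugging Face Hub as `microsoft/xtremedistil-l6-h384-uncased` (the repo id here is an assumption; substitute the actual model id if it differs):

```python
# Minimal loading/inference sketch; the hub id below is an assumption.
from transformers import AutoModel, AutoTokenizer

model_id = "microsoft/xtremedistil-l6-h384-uncased"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("XtremeDistil is a distilled transformer.", return_tensors="pt")
outputs = model(**inputs)
# Last dimension is 384, matching the l6-h384 hidden size described above.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 384])
```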

The following table shows the results on the GLUE dev set and SQuAD-v2.

| Models                | #Params (M) | Speedup | MNLI | QNLI | QQP  | RTE  | SST  | MRPC | SQuAD2 | Avg  |
|-----------------------|-------------|---------|------|------|------|------|------|------|--------|------|
| BERT                  | 109         | 1x      | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8   | 84.8 |
| DistilBERT            | 66          | 2x      | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7   | 81.3 |
| TinyBERT              | 66          | 2x      | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1   | 84.3 |
| MiniLM                | 66          | 2x      | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4   | 84.9 |
| MiniLM                | 22          | 5.3x    | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9   | 83.3 |
| XtremeDistil-l6-h256  | 13          | 8.7x    | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1   | 85.6 |
| XtremeDistil-l6-h384  | 22          | 5.3x    | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6   | 86.6 |
| XtremeDistil-l12-h384 | 33          | 2.7x    | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2   | 88.5 |

Tested with `tensorflow 2.3.1`, `transformers 4.1.1`, and `torch 1.6.0`.
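
Since the checkpoint is evaluated above on GLUE-style classification tasks, here is a sketch of a single fine-tuning step for a sentence-pair task such as MRPC. It is illustrative only: the hub id, label mapping, and hyperparameters are assumptions, not the exact setup used to produce the table.

```python
# Illustrative fine-tuning step for a GLUE-style sentence-pair task.
# Model id and hyperparameters are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "microsoft/xtremedistil-l6-h384-uncased"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy batch standing in for a real GLUE dataset (e.g., MRPC sentence pairs).
batch = tokenizer(
    ["He ate the apple.", "The weather is nice."],
    ["The apple was eaten by him.", "It is sunny today."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])  # 1 = paraphrase, 0 = not a paraphrase

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
outputs = model(**batch, labels=labels)  # returns loss when labels are given
outputs.loss.backward()
optimizer.step()
print(f"loss: {outputs.loss.item():.4f}")
```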

If you use this checkpoint in your work, please cite:

```bibtex
@inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
    title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
    author = "Mukherjee, Subhabrata and
      Hassan Awadallah, Ahmed",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.202",
    doi = "10.18653/v1/2020.acl-main.202",
    pages = "2221--2234",
}
```

```bibtex
@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```