license: mit
---

# XtremeDistilTransformers for Distilling Massive Neural Networks

XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer to learn a small universal model that can be applied to arbitrary tasks and languages, as described in the paper [XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation](https://arxiv.org/abs/2106.04563).

We combine task transfer with multi-task distillation techniques from the papers [XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf) and [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957); the distillation code is available in this [GitHub repository](https://github.com/microsoft/xtreme-distil-transformers).

This l6-h384 checkpoint has **6** layers, a hidden size of **384**, and **12** attention heads, for a total of **22 million** parameters, and offers a **5.3x** speedup over BERT-base.
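The checkpoint can be loaded with the Hugging Face `transformers` library. The snippet below is a minimal usage sketch rather than an official example; it assumes the model is published under the `microsoft/xtremedistil-l6-h384-uncased` identifier (adjust the model ID if this repository uses a different name) and that `transformers` and `torch` are installed.

```python
from transformers import AutoTokenizer, AutoModel

model_id = "microsoft/xtremedistil-l6-h384-uncased"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("XtremeDistil is a small universal student model.", return_tensors="pt")
outputs = model(**inputs)

# Hidden states have shape (batch, sequence_length, 384), matching the h384 hidden size.
print(outputs.last_hidden_state.shape)

# Roughly 22 million parameters, as stated above.
print(sum(p.numel() for p in model.parameters()))
```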