Subhabrata Mukherjee committed on
Commit
6e8c0eb
1 Parent(s): 5e83505

Update README.md

  1. README.md +4 -2
README.md CHANGED
@@ -6,9 +6,11 @@ tags:
  license: mit
  ---
 
- # XtremeDistil-Transformers for Distilling Massive Neural Networks
+ # XtremeDistilTransformers for Distilling Massive Neural Networks
 
- XtremeDistil is a distilled task-agnostic transformer model leveraging multi-task distillation techniques from the paper "[XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf)" and "[MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957)" with the following "[Github code](https://github.com/microsoft/xtreme-distil-transformers)".
+ XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer for learning a small universal model that can be applied to arbitrary tasks and languages as outlined in the paper [XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation](https://arxiv.org/abs/2106.04563).
+
+ We leverage task transfer combined with multi-task distillation techniques from the papers [XtremeDistil: Multi-stage Distillation for Massive Multilingual Models](https://www.aclweb.org/anthology/2020.acl-main.202.pdf) and [MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers](https://arxiv.org/abs/2002.10957) with the following [Github code](https://github.com/microsoft/xtreme-distil-transformers).
 
  This l6-h384 checkpoint with **6** layers, **384** hidden size, **12** attention heads corresponds to **22 million** parameters with **5.3x** speedup over BERT-base.
 
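
Review note: the "22 million parameters" figure in the unchanged context line can be sanity-checked with a back-of-the-envelope count. The sketch below assumes a standard BERT-style encoder layout; the vocabulary size (30522), maximum position count (512), token-type count (2), and 4x feed-forward width are assumptions borrowed from BERT-base uncased, not values stated in this README, and `bert_param_count` is a hypothetical helper written for illustration only.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512,
                     type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT-style encoder.

    All defaults are assumptions taken from BERT-base uncased;
    they are not stated in the README being diffed.
    """
    layer_norm = 2 * hidden  # scale + bias

    # Word, position and token-type embeddings plus one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Self-attention: Q, K, V and output projections (weights + biases).
    # The number of heads only splits the hidden size, so it does not
    # change the parameter count.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: up-projection, down-projection, one LayerNorm.
    ffn = ffn_mult * hidden
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    # Pooler: a single dense layer applied to the first token.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count(layers=6, hidden=384)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes out at about 22.7 million, in line with the stated 22 million; roughly half of that sits in the word embeddings, which is why the assumed vocabulary size matters most.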