
AutoDisProxyT-QNLI for Distilling Massive Neural Networks

AutoDisProxyT is a distilled, task-agnostic transformer that leverages task transfer to learn a small universal model applicable to arbitrary tasks and languages, as described in the paper AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models.

This AutoDisProxyT checkpoint has 7 layers, a hidden size of 160, and 10 attention heads, amounting to 6.88 million parameters and 0.27G FLOPs.
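A minimal sketch of loading this checkpoint with the Hugging Face transformers library is shown below. The repository id is a placeholder (not confirmed by this card), and the sketch assumes the checkpoint loads as a standard BERT-style encoder.

```python
from transformers import AutoTokenizer, AutoModel

# Placeholder Hub repository id; substitute the actual path of this checkpoint.
model_id = "microsoft/AutoDisProxyT-QNLI"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a sentence and inspect the hidden states produced by the distilled encoder.
inputs = tokenizer("Task-agnostic distillation yields a compact universal encoder.",
                   return_tensors="pt")
outputs = model(**inputs)

# Hidden width is 160, matching the architecture described above.
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 160])
```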

The following table shows results on the GLUE dev sets.

| Model | #Params (M) | #FLOPs (G) | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 11.2 | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 53.5 | 82.2 |
| BERT-Small | 66 | 5.66 | 81.8 | 89.8 | 90.6 | 67.9 | 91.2 | 84.9 | 53.5 | 80.0 |
| TruncatedBERT | 66 | 5.66 | 81.2 | 87.9 | 90.4 | 65.5 | 90.8 | 82.7 | 41.4 | 77.1 |
| DistilBERT | 66 | 5.66 | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 51.3 | 78.6 |
| TinyBERT | 66 | 5.66 | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 42.8 | 79.9 |
| MiniLM | 66 | 5.66 | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 49.2 | 81.0 |
| AutoTinyBERT-KD-S1 | 30.0 | 1.69 | 82.3 | 89.7 | 89.9 | 71.1 | 91.4 | 88.5 | 47.3 | 80.0 |
| DynaBERT | 37.7 | 1.81 | 82.3 | 88.5 | 90.4 | 63.2 | 92.0 | 81.4 | 43.7 | 76.4 |
| NAS-BERT10 | 10.0 | 2.30 | 76.4 | 86.3 | 88.5 | 66.6 | 88.6 | 79.1 | 34.0 | 74.2 |
| AutoTinyBERT-KD-S4 | 66 | 5.66 | 76.0 | 85.5 | 86.9 | 64.9 | 86.8 | 81.4 | 20.4 | 71.7 |
| NAS-BERT5 | 66 | 5.66 | 74.4 | 84.9 | 85.8 | 66.6 | 87.3 | 79.6 | 19.8 | 71.2 |
| AutoDisProxyT | 6.88 | 0.27 | 79.0 | 86.4 | 89.1 | 64.3 | 85.9 | 78.5 | 24.8 | 72.6 |
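
Since this particular checkpoint is tagged for QNLI (question-sentence entailment), here is a hedged sketch of scoring a question-sentence pair with a classification head. The repository id and label mapping are assumptions, and in practice the distilled encoder would be fine-tuned on QNLI before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder Hub repository id; substitute the actual path of this checkpoint.
model_id = "microsoft/AutoDisProxyT-QNLI"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# QNLI is binary; a fresh classification head is initialized here if the
# checkpoint does not already ship one.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

question = "What does AutoDisProxyT distill?"
sentence = "AutoDisProxyT is distilled from a large pretrained transformer."

inputs = tokenizer(question, sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumed label order: 0 = entailment, 1 = not_entailment.
print(logits.argmax(dim=-1).item())
```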

Tested with torch 1.6.0

If you use this checkpoint in your work, please cite:

```bibtex
@article{xu2022autodistil,
  title={AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models},
  author={Xu, Dongkuan and Mukherjee, Subhabrata and Liu, Xiaodong and Dey, Debadeepta and Wang, Wenhui and Zhang, Xiang and Awadallah, Ahmed Hassan and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2201.12507},
  year={2022}
}
```