
AutoDisProxyT-QNLI for Distilling Massive Neural Networks

AutoDisProxyT is a distilled, task-agnostic transformer that leverages task transfer to learn a small universal model applicable to arbitrary tasks and languages, as described in the paper AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models.

This AutoDisProxyT checkpoint has 7 layers, a hidden size of 160, and 10 attention heads, amounting to 6.88 million parameters and 0.27G FLOPs.
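A minimal sketch of loading this checkpoint with the Hugging Face transformers library is shown below. The repository id is a placeholder (not confirmed by this card), and the sketch assumes the checkpoint loads as a standard BERT-style encoder.

```python
from transformers import AutoTokenizer, AutoModel

# Placeholder Hub repository id; substitute the actual path of this checkpoint.
model_id = "microsoft/AutoDisProxyT-QNLI"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a sentence and inspect the hidden states produced by the distilled encoder.
inputs = tokenizer("Task-agnostic distillation yields a compact universal encoder.",
                   return_tensors="pt")
outputs = model(**inputs)

# Hidden width is 160, matching the architecture described above.
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 160])
```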

The following table shows results on the GLUE dev sets.

| Model | #Params (M) | #FLOPs (G) | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 11.2 | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 53.5 | 82.2 |
| BERT-Small | 66 | 5.66 | 81.8 | 89.8 | 90.6 | 67.9 | 91.2 | 84.9 | 53.5 | 80.0 |
| TruncatedBERT | 66 | 5.66 | 81.2 | 87.9 | 90.4 | 65.5 | 90.8 | 82.7 | 41.4 | 77.1 |
| DistilBERT | 66 | 5.66 | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 51.3 | 78.6 |
| TinyBERT | 66 | 5.66 | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 42.8 | 79.9 |
| MiniLM | 66 | 5.66 | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 49.2 | 81.0 |
| AutoTinyBERT-KD-S1 | 30.0 | 1.69 | 82.3 | 89.7 | 89.9 | 71.1 | 91.4 | 88.5 | 47.3 | 80.0 |
| DynaBERT | 37.7 | 1.81 | 82.3 | 88.5 | 90.4 | 63.2 | 92.0 | 81.4 | 43.7 | 76.4 |
| NAS-BERT10 | 10.0 | 2.30 | 76.4 | 86.3 | 88.5 | 66.6 | 88.6 | 79.1 | 34.0 | 74.2 |
| AutoTinyBERT-KD-S4 | 66 | 5.66 | 76.0 | 85.5 | 86.9 | 64.9 | 86.8 | 81.4 | 20.4 | 71.7 |
| NAS-BERT5 | 66 | 5.66 | 74.4 | 84.9 | 85.8 | 66.6 | 87.3 | 79.6 | 19.8 | 71.2 |
| AutoDisProxyT | 6.88 | 0.27 | 79.0 | 86.4 | 89.1 | 64.3 | 85.9 | 78.5 | 24.8 | 72.6 |
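
Since this particular checkpoint is tagged for QNLI (question-sentence entailment), here is a hedged sketch of scoring a question-sentence pair with a classification head. The repository id and label mapping are assumptions, and in practice the distilled encoder would be fine-tuned on QNLI before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder Hub repository id; substitute the actual path of this checkpoint.
model_id = "microsoft/AutoDisProxyT-QNLI"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# QNLI is binary; a fresh classification head is initialized here if the
# checkpoint does not already ship one.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

question = "What does AutoDisProxyT distill?"
sentence = "AutoDisProxyT is distilled from a large pretrained transformer."

inputs = tokenizer(question, sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumed label order: 0 = entailment, 1 = not_entailment.
print(logits.argmax(dim=-1).item())
```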

Tested with torch 1.6.0

If you use this checkpoint in your work, please cite:

```bibtex
@article{xu2022autodistil,
  title={AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models},
  author={Xu, Dongkuan and Mukherjee, Subhabrata and Liu, Xiaodong and Dey, Debadeepta and Wang, Wenhui and Zhang, Xiang and Awadallah, Ahmed Hassan and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2201.12507},
  year={2022}
}
```