---
widget:
- context: While deep and large pre-trained models are the state-of-the-art for various
natural language processing tasks, their huge size poses significant challenges
for practical uses in resource constrained settings. Recent works in knowledge
distillation propose task-agnostic as well as task-specific methods to compress
these models, with task-specific ones often yielding higher compression rate.
In this work, we develop a new task-agnostic distillation framework XtremeDistilTransformers
that leverages the advantage of task-specific methods for learning a small universal
model that can be applied to arbitrary tasks and languages. To this end, we study
the transferability of several source tasks, augmentation resources and model
architecture for distillation. We evaluate our model performance on multiple tasks,
including the General Language Understanding Evaluation (GLUE) benchmark, SQuAD
question answering dataset and a massive multi-lingual NER dataset with 41 languages.
example_title: xtremedistil q1
text: What is XtremeDistil?
- context: While deep and large pre-trained models are the state-of-the-art for various
natural language processing tasks, their huge size poses significant challenges
for practical uses in resource constrained settings. Recent works in knowledge
distillation propose task-agnostic as well as task-specific methods to compress
these models, with task-specific ones often yielding higher compression rate.
In this work, we develop a new task-agnostic distillation framework XtremeDistilTransformers
that leverages the advantage of task-specific methods for learning a small universal
model that can be applied to arbitrary tasks and languages. To this end, we study
the transferability of several source tasks, augmentation resources and model
architecture for distillation. We evaluate our model performance on multiple tasks,
including the General Language Understanding Evaluation (GLUE) benchmark, SQuAD
question answering dataset and a massive multi-lingual NER dataset with 41 languages.
example_title: xtremedistil q2
text: On what is the model validated?
datasets:
- squad_v2
metrics:
- f1
- exact
tags:
- question-answering
model-index:
- name: nbroad/xdistil-l12-h384-squad2
results:
- task:
type: question-answering
name: Question Answering
dataset:
name: squad_v2
type: squad_v2
config: squad_v2
split: validation
metrics:
- name: Exact Match
type: exact_match
value: 75.4591
verified: true
- name: F1
type: f1
value: 79.3321
verified: true
- task:
type: question-answering
name: Question Answering
dataset:
name: squad
type: squad
config: plain_text
split: validation
metrics:
- name: Exact Match
type: exact_match
value: 81.8604
verified: true
- name: F1
type: f1
value: 89.6654
verified: true
---
# xtremedistil-l12-h384 trained on SQuAD 2.0

Evaluation results on the SQuAD 2.0 validation set:

- exact match (`eval_exact`): 75.45691906005221
- F1 (`eval_f1`): 79.32502968532793