	---
	language: en
	license: cc-by-4.0
	tags:
	- question-answering
	datasets:
	- squad_v2
	metrics:
	- f1
	- exact
	widget:
	- context: DeBERTa improves the BERT and RoBERTa models using disentangled attention
	    and enhanced mask decoder. With those two improvements, DeBERTa out perform RoBERTa
	    on a majority of NLU tasks with 80GB training data. In DeBERTa V3, we further
	    improved the efficiency of DeBERTa using ELECTRA-Style pre-training with Gradient
	    Disentangled Embedding Sharing. Compared to DeBERTa, our V3 version significantly
	    improves the model performance on downstream tasks. You can find more technique
	    details about the new model from our paper. Please check the official repository
	    for more implementation details and updates.
	  example_title: DeBERTa v3 Q1
	  text: How is DeBERTa version 3 different than previous ones?
	- context: DeBERTa improves the BERT and RoBERTa models using disentangled attention
	    and enhanced mask decoder. With those two improvements, DeBERTa out perform RoBERTa
	    on a majority of NLU tasks with 80GB training data. In DeBERTa V3, we further
	    improved the efficiency of DeBERTa using ELECTRA-Style pre-training with Gradient
	    Disentangled Embedding Sharing. Compared to DeBERTa, our V3 version significantly
	    improves the model performance on downstream tasks. You can find more technique
	    details about the new model from our paper. Please check the official repository
	    for more implementation details and updates.
	  example_title: DeBERTa v3 Q2
	  text: Where do I go to see new info about DeBERTa?
	model-index:
	- name: DeBERTa v3 xsmall squad2
	  results:
	  - task:
	      type: question-answering
	      name: Question Answering
	    dataset:
	      name: SQuAD2.0
	      type: question-answering
	    metrics:
	    - type: f1
	      value: 81.5
	      name: f1
	    - type: exact
	      value: 78.3
	      name: exact
	  - task:
	      type: question-answering
	      name: Question Answering
	    dataset:
	      name: squad_v2
	      type: squad_v2
	      config: squad_v2
	      split: validation
	    metrics:
	    - type: exact_match
	      value: 78.5341
	      name: Exact Match
	      verified: true
	      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZTk0ZGQ1YjU1YmQ5NTc2M2RmNjg2OGViYjcyODZkOTc1MDBkNmI5MDc0MzEyMzZmNDg3Yzc4ZTA3ZjAwM2M5ZiIsInZlcnNpb24iOjF9.ewKF-UetUoxKDeXgnM6vqy8nBC9c3qh7dLZhdQlgSxPut3LjAhpCh2fJGir-OVcfzWzxsPhcZQEpdnxR8oZnAA
	    - type: f1
	      value: 81.6408
	      name: F1
	      verified: true
	      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOTQwZDdjY2ZlOGVhM2E5NGM3OGNkNTk2NWFkYTg1Y2Q0YWFlYWJmMGIyZWM5ZjMyYTYyODUzMDA0NWU0ZGVkZCIsInZlcnNpb24iOjF9.BHJNhS1YisUIkjcpIMdwXurTewak9dkkpGXC2vHvUB4qUEuk_p3V-orhmeFyTxzLaWRwrZVGVz-NSfqFr4n1Ag
	    - type: total
	      value: 11870
	      name: total
	      verified: true
	      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzNiZDQ3MDAyNzljMDI4NTRlYzZiZjE4ODJhZDhmZWE2ZjcwNjg2ZWJmNjUyMTUzZDk4ODNjNDExYTk1YWNlOCIsInZlcnNpb24iOjF9.3BlfmMvbV86Ua39ToqnMmgpGS0ZTew0UFFYWGyTkS3u7jaAXCfYkFkNJXw806f2uFFkKr1hqlzzKfivV0wUjCg
	  - task:
	      type: question-answering
	      name: Question Answering
	    dataset:
	      name: squad
	      type: squad
	      config: plain_text
	      split: validation
	    metrics:
	    - type: exact_match
	      value: 84.1741
	      name: Exact Match
	      verified: true
	      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYTA0MDVlYWI5NzdiNjllM2NmZTYwYmQ5YzE0ODgwOTA3MWZjZDkxNDFmZDM1OTQzMzgwNWI4NDc5NThhM2VhZSIsInZlcnNpb24iOjF9.lc2nUBxSu2_0_a5lyVsV51UAmkE8WHDTwGHvt3n9zvCbcJ1ylOg2xovF0_j0hZS16lv1DEw5XV8EW_ZS7mfvBg
	    - type: f1
	      value: 91.0771
	      name: F1
	      verified: true
	      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODQxMjkxOWJlZTc2MmE5YzVmMjNhOTkwNDdiMDBhNWUwMDU3MDI1MmJiNDY4MjczYjIwM2U1NDhlYmZlZWQwMSIsInZlcnNpb24iOjF9.x_axHiBX5d3UIi1UbJT3kVbdX4kX9XFLQSg-l16-AAK9tiyutT-yaYJOi8LSb2lR4677tJpf3itu4eriJRU2Cg
	---



	# DeBERTa v3 xsmall SQuAD 2.0

	[Microsoft reports that this model can reach 84.8/82.0](https://huggingface.co/microsoft/deberta-v3-xsmall#fine-tuning-on-nlu-tasks) F1/EM on the SQuAD 2.0 dev set.

	I got 81.5/78.3 (F1/EM), but I did only one run and did not use the official SQuAD 2.0 evaluation script. I will do more runs and report results from the official script soon.
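
For readers unfamiliar with the two numbers quoted above, here is a minimal sketch of how SQuAD-style exact match (EM) and token-level F1 are usually computed for a single prediction. It is not the official SQuAD 2.0 evaluation script and not the code used to produce the scores on this card. The function names (`normalize_answer`, `exact_match`, `f1_score`, `score_example`) are illustrative, and treating an unanswerable question as an empty-string reference is an assumption of this sketch.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Assumption: an unanswerable question is represented by an empty string,
    # so the score is 1.0 only when both sides are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_example(prediction: str, references: list[str]) -> dict[str, float]:
    """Take the best score over all reference answers for one question."""
    return {
        "exact": max(exact_match(prediction, ref) for ref in references),
        "f1": max(f1_score(prediction, ref) for ref in references),
    }
```

Corpus-level figures such as 81.5/78.3 would then be the mean of these per-question scores multiplied by 100. For example, `score_example("in Paris France", ["Paris"])` gives an exact match of 0.0 and an F1 of 0.5, because one of the three predicted tokens matches the single reference token.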