|
--- |
|
pipeline_tag: zero-shot-classification |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- zero-shot |
|
- text-classification |
|
- science |
|
- mag |
|
widget: |
|
- text: Leo Messi is the best player ever |
|
candidate_labels: politics, science, sports, environment |
|
multi_class: true |
|
--- |
|
|
|
# SCIroShot |
|
|
|
## Overview |
|
|
|
<details> |
|
<summary>Click to expand</summary> |
|
|
|
- **Model type:** Language Model |
|
- **Architecture:** RoBERTa-large |
|
- **Language:** English |
|
- **License:** Apache 2.0 |
|
- **Task:** Zero-Shot Text Classification |
|
- **Data:** Microsoft Academic Graph |
|
- **Additional Resources:** |
|
  - [Paper]() (work in progress; soon to be published in EACL 2023)
|
- [GitHub](https://github.com/TeMU-BSC/sciroshot) |
|
</details> |
|
|
|
## Model description |
|
|
|
SCIroShot is an entailment-based Zero-Shot Text Classification model that has been fine-tuned on a custom-built dataset composed of scientific articles from [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/) (MAG). The resulting model achieves state-of-the-art (SOTA) performance in the scientific domain and very competitive results in other areas.
|
|
|
## Intended Usage |
|
|
|
This model is intended to be used for zero-shot text classification in English. |
|
|
|
## How to use |
|
|
|
```python |
|
from transformers import pipeline |
|
|
|
# Load the zero-shot classification pipeline with the SCIroShot checkpoint
zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
template = "This example is {}"

# The template is filled with each candidate label to form the hypotheses;
# with multi_label=False the scores are normalised across the candidate labels.
output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False)
|
|
|
print(output) |
|
print(f'Predicted class: {output["labels"][0]}') |
|
``` |
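
The pipeline returns a dictionary with the input sequence, the candidate labels sorted from most to least likely, and their corresponding scores. The structure of `output` looks roughly as follows (the scores shown are illustrative placeholders, not actual model outputs):

```python
{
    "sequence": "Leo Messi is the best player ever.",
    "labels": ["sports", "environment", "science", "politics"],
    "scores": [0.95, 0.03, 0.01, 0.01],
}
```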
|
|
|
## Limitations and bias |
|
|
|
No measures have been taken to estimate the bias and toxicity embedded in the model. |
|
|
|
Even though the fine-tuning data (which is of a scientific nature) may seem harmless, it is important to note that the corpus used to pre-train the vanilla model is very likely to contain a lot of unfiltered content from the internet, as stated in the [RoBERTa-large model card](https://huggingface.co/roberta-large#limitations-and-bias). |
|
|
|
## Training |
|
|
|
### Training data |
|
|
|
Our data builds on top of scientific-domain annotated data from Microsoft Academic Graph (MAG). This database consists of a heterogeneous graph with billions of records from both scientific publications and patents, together with metadata such as authors, institutions, journals, conferences and their citation relationships. The documents are organized in a six-level hierarchical structure of scientific concepts, where the two topmost levels are manually curated to guarantee a high level of accuracy.
|
|
|
To create the training corpus, a random sample of scientific articles published between 2000 and 2021 was retrieved from MAG, together with their titles and abstracts in English. This results in over 2M documents, each with its corresponding Field of Study obtained from the 1-level MAG taxonomy (292 possible classes, such as "Computational biology" or "Transport Engineering").
|
|
|
The fine-tuning dataset was constructed in a weakly supervised manner by converting text classification data into the entailment format. Premise-hypothesis pairs for the entailment label are generated from the relationship between scientific texts and their matching concepts in the 1-level MAG taxonomy. Conversely, pairs for the neutral label are generated by discarding the actual relationship and pairing each text with scientific concepts to which it is not matched.
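
To make the conversion concrete, the sketch below builds entailment and neutral premise-hypothesis pairs from (text, Field of Study) records. The toy records, the helper name and the hypothesis template are assumptions for illustration, not the exact code used to build the dataset:

```python
import random

# Hypothetical (title + abstract, Field of Study) records from the 1-level MAG taxonomy.
documents = [
    ("Deep learning methods for protein structure prediction ...", "Computational biology"),
    ("Optimising freight routing in dense urban road networks ...", "Transport engineering"),
]
all_labels = sorted({label for _, label in documents})

def to_entailment_pairs(documents, all_labels, template="This example is {}"):
    """Convert (text, label) records into premise-hypothesis pairs for entailment fine-tuning."""
    pairs = []
    for text, label in documents:
        # Entailment: hypothesis built from the text's true Field of Study.
        pairs.append({"premise": text, "hypothesis": template.format(label), "label": "entailment"})
        # Neutral: hypothesis built from a Field of Study the text is NOT matched to.
        wrong_label = random.choice([l for l in all_labels if l != label])
        pairs.append({"premise": text, "hypothesis": template.format(wrong_label), "label": "neutral"})
    return pairs

print(to_entailment_pairs(documents, all_labels))
```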
|
|
|
### Training procedure |
|
|
|
The newly-created scientific dataset described in the previous section was used to fine-tune a 355M-parameter RoBERTa model on the entailment task. To do so, the model computes an entailment score between every text that is fed to it and each candidate label. The final prediction is the highest-scoring class in a single-label classification setup, or every class above a certain threshold in a multi-label scenario.
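
Both setups can be reproduced with the `transformers` pipeline shown earlier. The sketch below contrasts a single-label prediction (argmax over the candidate labels) with a multi-label prediction that keeps every class above a threshold; the example text, labels and the 0.5 threshold are arbitrary choices for illustration:

```python
from transformers import pipeline

zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

text = "The study analyses the impact of microplastics on marine ecosystems."
labels = ["environment", "economics", "sports", "medicine"]

# Single-label setup: scores are normalised across labels and the argmax is taken.
single = zstc(text, labels, hypothesis_template="This example is {}", multi_label=False)
print("Single-label prediction:", single["labels"][0])

# Multi-label setup: each label is scored independently,
# so every class above a chosen threshold is predicted.
multi = zstc(text, labels, hypothesis_template="This example is {}", multi_label=True)
print("Multi-label prediction:", [l for l, s in zip(multi["labels"], multi["scores"]) if s > 0.5])
```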
|
|
|
A subset of 52 labels from the training data was kept apart so that it could be used as a development set of fully-unseen classes. As a novelty, validation was not performed on the entailment task (which is used as a proxy) but directly on the target text classification task. This allows training to be stopped at the right time via early stopping, which prevents the model from "overfitting" to the training task. The approach counteracts an effect observed empirically during experimentation: after a certain point, the model can start to worsen on the target task (ZSTC) while still improving on the training task (RTE). Simply shortening the training time led to a boost in performance.
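
A minimal sketch of this validation strategy, under the assumption of a `train_one_epoch` routine for the entailment (RTE) objective and an `evaluate_zstc` routine that scores the model on the held-out classes with the target-task metric (both hypothetical names), could look as follows:

```python
import copy

def fit_with_target_task_early_stopping(model, train_one_epoch, evaluate_zstc,
                                        max_epochs=20, patience=3):
    """Train on the RTE proxy task but early-stop on the target ZSTC metric."""
    best_score, best_state, bad_epochs = float("-inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)        # training task: RTE on the MAG-derived pairs
        score = evaluate_zstc(model)  # validation: ZSTC on the fully-unseen dev classes
        if score > best_score:
            best_score, best_state, bad_epochs = score, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # stop before the model "overfits" to the proxy task
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return best_score
```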
|
|
|
Read the paper for more details on the methodology and the analysis of RTE/ZSTC correlation. |
|
|
|
## Evaluation |
|
|
|
### Evaluation data |
|
|
|
The model's performance was evaluated on a collection of disciplinary-labeled textual datasets, both from the scientific domain (closer to training data) and the general domain (to assess generalizability). |
|
|
|
The following table provides an overview of the number of examples and labels for each dataset: |
|
| Dataset | Labels | Size | |
|
|------------------|--------|--------| |
|
| arXiv | 11 | 3,838 | |
|
| SciDocs-MeSH | 11 | 16,433 | |
|
| SciDocs-MAG | 19 | 17,501 | |
|
| Konstanz | 24 | 10,000 | |
|
| Elsevier | 26 | 14,738 | |
|
| PubMed | 109 | 5,000 | |
|
| Topic Categorization (Yahoo! Answers) | 10 | 60,000 | |
|
| Emotion Detection (UnifyEmotion) | 10 | 15,689 | |
|
| Situation Frame Detection (Situation Typing) | 12 | 3,311 | |
|
|
|
Please refer to the paper for further details on each particular dataset. |
|
|
|
### Evaluation results |
|
|
|
These are the official results reported in the paper: |
|
|
|
#### Scientific domain benchmark |
|
| Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed |
|-------|-------|--------------|-------------|----------|----------|--------|
| [fb/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) | 33.28 | **66.18** | 51.77 | 54.62 | 28.41 | **31.59** |
| SCIroShot | **42.22** | 59.34 | **69.86** | **66.07** | **54.42** | 27.93 |
|
|
|
#### General domain benchmark |
|
| Model | Topic | Emotion | Situation | |
|
|-------|-------|---------|-----------| |
|
| RTE [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 43.8 | 12.6 | **37.2** |
|
| FEVER [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 40.1 | 24.7 | 21.0 | |
|
| MNLI [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 37.9 | 22.3 | 15.4 | |
|
| NSP [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 50.6 | 16.5 | 25.8 | |
|
| NSP-Reverse [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 53.1 | 16.1 | 19.9 | |
|
| SCIroShot | **59.08** | **24.94** | 27.42 |
|
|
|
All the numbers reported above represent **label-wise weighted F1** except for the Topic classification dataset, which is evaluated in terms of **accuracy** following the notation from [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf). |
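
For reference, the metrics above can be reproduced with scikit-learn as sketched below; the toy ground-truth and predicted labels are illustrative only:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth and predictions, for illustration only.
y_true = ["sports", "science", "science", "politics"]
y_pred = ["sports", "science", "politics", "politics"]

print("Label-wise weighted F1:", f1_score(y_true, y_pred, average="weighted"))  # used for most datasets
print("Accuracy:", accuracy_score(y_true, y_pred))  # used for Topic Categorization (Yahoo! Answers)
```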
|
|
|
## Additional information |
|
|
|
### Authors |
|
|
|
- SIRIS Lab, Research Division of SIRIS Academic. |
|
- Language Technologies Unit, Barcelona Supercomputing Center. |
|
|
|
### Contact |
|
|
|
For further information, send an email to either <langtech@bsc.es> or <info@sirisacademic.com>. |
|
|
|
### License |
|
|
|
This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
|
|
### Funding |
|
|
|
This work was partially funded by two projects under the EU's H2020 Research and Innovation Programme:
|
- INODE (grant agreement No 863410). |
|
- IntelComp (grant agreement No 101004870). |
|
|
|
### Citation |
|
|
|
```bibtex |
|
@inproceedings{pamies2023weakly, |
|
title={A weakly supervised textual entailment approach to zero-shot text classification}, |
|
author={P{\`a}mies, Marc and Llop, Joan and Multari, Francesco and Duran-Silva, Nicolau and Parra-Rojas, C{\'e}sar and Gonz{\'a}lez-Agirre, Aitor and Massucci, Francesco Alessandro and Villegas, Marta}, |
|
booktitle={Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics}, |
|
pages={286--296}, |
|
year={2023} |
|
} |
|
``` |
|
|
|
### Disclaimer |
|
|
|
<details> |
|
<summary>Click to expand</summary> |
|
|
|
The model published in this repository is intended for general-purpose use and is made available to third parties under an Apache v2.0 license.
|
|
|
Please keep in mind that the model may have bias and/or any other undesirable distortions. |
|
When third parties deploy or provide systems and/or services to other parties using this model (or a system based on it), or become users of the model itself, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
|
|
|
In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties. |
|
</details> |
|
|