---
pipeline_tag: zero-shot-classification
license: apache-2.0
language:
- en
tags:
- zero-shot
- text-classification
- science
- mag
widget:
- text: Leo Messi is the best player ever
  candidate_labels: politics, science, sports, environment
  multi_class: true
---
# SCIroShot
## Overview
<details>
<summary>Click to expand</summary>
- **Model type:** Language Model
- **Architecture:** RoBERTa-large
- **Language:** English
- **License:** Apache 2.0
- **Task:** Zero-Shot Text Classification
- **Data:** Microsoft Academic Graph
- **Additional Resources:**
  - [Paper](https://aclanthology.org/2023.eacl-main.22/)
  - [GitHub](https://github.com/bsc-langtech/sciroshot)
</details>
## Model description
SCIroShot is an entailment-based Zero-Shot Text Classification model that
has been fine-tuned using a self-made dataset composed of scientific articles
from [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)
(MAG). The resulting model achieves state-of-the-art
performance in the scientific domain and very competitive results in other areas.
## Intended Usage
This model is intended to be used for zero-shot text classification in English.
## How to use
```python
from transformers import pipeline
zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")
sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
template = "This example is {}"
output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False)
print(output)
print(f'Predicted class: {output["labels"][0]}')
```
## Limitations and bias
No measures have been taken to estimate the bias and toxicity embedded in the model.
Even though the fine-tuning data (which is of a scientific nature) may seem harmless, it is important to note that the corpus used to pre-train the vanilla model is very likely to contain a lot of unfiltered content from the internet, as stated in the [RoBERTa-large model card](https://huggingface.co/roberta-large#limitations-and-bias).
## Training
### Training data
Our data builds on top of scientific-domain
annotated data from Microsoft Academic Graph (MAG).
This database consists of a heterogeneous
graph with billions of records from both scientific
publications and patents, in addition to metadata
information such as the authors, institutions, journals,
conferences and their citation relationships.
The documents are organized in a six-level hierarchical
structure of scientific concepts, where the two
top-most levels are manually curated in order to
guarantee a high level of accuracy.
To create the training corpus, a random sample of
scientific articles with a publication year between
2000 and 2021 was retrieved from MAG with their respective
titles and abstracts in English. This results in over 2M documents
with their corresponding Field Of Study, which was obtained from
the 1-level MAG taxonomy (292 possible classes, such as "Computational biology"
or "Transport Engineering").
The fine-tuning dataset was constructed in a weakly supervised
manner by converting text classification data to the entailment format.
Using the relationship between scientific texts
and their matching concepts in the 1-level MAG
taxonomy, we are able to generate the premise-hypothesis
pairs corresponding to the entailment
label. Conversely, we generate the pairs for the
neutral label by removing the actual relationship
between the texts and their scientific concepts and
creating a virtual relationship with those to which
they are not matched.
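The conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the function name `build_entailment_pairs`, the `negatives_per_doc` parameter, the uniform random sampling of non-matching concepts, the reuse of the `"This example is {}"` template from the usage example, the toy documents and the `"Astrophysics"` label are all assumptions made for the example.

```python
import random

def build_entailment_pairs(documents, all_labels, template="This example is {}",
                           negatives_per_doc=1, seed=0):
    """Convert (text, label) classification data into premise-hypothesis pairs."""
    rng = random.Random(seed)
    pairs = []
    for text, gold_label in documents:
        # Positive pair: the text with its matched concept -> "entailment".
        pairs.append({"premise": text,
                      "hypothesis": template.format(gold_label),
                      "label": "entailment"})
        # Negative pairs: the text with concepts it is NOT matched to -> "neutral".
        # (Assumption: negatives are drawn uniformly at random.)
        wrong_labels = [label for label in all_labels if label != gold_label]
        k = min(negatives_per_doc, len(wrong_labels))
        for label in rng.sample(wrong_labels, k=k):
            pairs.append({"premise": text,
                          "hypothesis": template.format(label),
                          "label": "neutral"})
    return pairs

# Toy data for illustration only.
docs = [
    ("We present a new algorithm for protein folding simulation.", "Computational biology"),
    ("This study models traffic flow on urban road networks.", "Transport Engineering"),
]
labels = ["Computational biology", "Transport Engineering", "Astrophysics"]
for pair in build_entailment_pairs(docs, labels):
    print(pair)
```

Each document yields one entailment pair and `negatives_per_doc` neutral pairs; the ratio actually used for SCIroShot is described in the paper.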
### Training procedure
The newly-created scientific dataset described in the previous section
was used to fine-tune a 355M-parameter RoBERTa model on the entailment task.
At inference time, the model computes the entailment score between every text that
is fed to it and each of the candidate labels. The final prediction is the
highest-scoring class in a single-label classification setup, or the N classes
above a certain threshold in a multi-label scenario.
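The decision rule in the previous paragraph can be sketched as follows. This is an illustration under stated assumptions: the helper `predict_labels`, the default threshold of `0.5` and the example scores are hypothetical and are not taken from the paper or the released code.

```python
def predict_labels(scores, multi_label=False, threshold=0.5):
    """Turn per-label entailment scores into a prediction.

    scores: dict mapping each candidate label to its entailment score.
    """
    if not scores:
        raise ValueError("scores must contain at least one candidate label")
    if multi_label:
        # Multi-label: keep every class whose score is above the threshold,
        # ordered from highest to lowest score.
        kept = [label for label, score in scores.items() if score > threshold]
        return sorted(kept, key=scores.get, reverse=True)
    # Single-label: keep only the highest-scoring class.
    return [max(scores, key=scores.get)]

# Made-up scores for illustration only.
scores = {"politics": 0.03, "science": 0.08, "sports": 0.97, "environment": 0.55}
print(predict_labels(scores))                    # single-label setup
print(predict_labels(scores, multi_label=True))  # multi-label setup
```

In the multi-label case the result may be empty when no label clears the threshold, so callers should decide how to treat a "none of the above" outcome.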
A subset of 52 labels from the training data was held out so that it
could be used as a development set of fully unseen classes.
As a novelty, validation was not performed on the entailment task (which is used as a proxy)
but directly on the target text classification task. This allows us to stop training at the right
time via early stopping, which prevents the model from "overfitting" to the training task. We adopted this method
to counteract an effect observed empirically during experimentation:
after a certain point, the model can start to worsen on the target task (ZSTC) while still
improving on the training task (RTE). Simply shortening the training time led to a boost in performance.
Read the paper for more details on the methodology and the analysis of the RTE/ZSTC correlation.
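The idea of validating on the target task can be sketched as a plain early-stopping loop. This is a schematic under assumptions, not the training script used for SCIroShot: `train_one_epoch` and `evaluate_zstc` are hypothetical callables, and the `patience` value and the scores in the toy run are invented for illustration.

```python
def train_with_target_task_early_stopping(train_one_epoch, evaluate_zstc,
                                          max_epochs=10, patience=2):
    """Stop training once the target-task (ZSTC) score stops improving.

    train_one_epoch: callable that runs one epoch on the entailment (RTE) data.
    evaluate_zstc:   callable that returns the dev-set score on unseen classes.
    """
    best_score, best_epoch, epochs_without_gain = float("-inf"), None, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        score = evaluate_zstc()
        if score > best_score:
            # A real run would also save a checkpoint here.
            best_score, best_epoch, epochs_without_gain = score, epoch, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return best_epoch, best_score

# Toy run with made-up ZSTC scores that peak early and then degrade,
# mimicking the RTE/ZSTC divergence described above.
fake_zstc_scores = iter([0.41, 0.47, 0.45, 0.44, 0.43])
best_epoch, best_score = train_with_target_task_early_stopping(
    train_one_epoch=lambda: None,
    evaluate_zstc=lambda: next(fake_zstc_scores),
)
print(best_epoch, best_score)
```

The key design choice is that the stopping signal comes from classification quality on the held-out classes rather than from the entailment loss.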
## Evaluation
### Evaluation data
The model's performance was evaluated on a collection of disciplinary-labeled textual datasets, both from the scientific domain (closer to training data) and the general domain (to assess generalizability).
The following table provides an overview of the number of examples and labels for each dataset:
| Dataset | Labels | Size |
|------------------|--------|--------|
| arXiv | 11 | 3,838 |
| SciDocs-MeSH | 11 | 16,433 |
| SciDocs-MAG | 19 | 17,501 |
| Konstanz | 24 | 10,000 |
| Elsevier | 26 | 14,738 |
| PubMed | 109 | 5,000 |
| Topic Categorization (Yahoo! Answers) | 10 | 60,000 |
| Emotion Detection (UnifyEmotion) | 10 | 15,689 |
| Situation Frame Detection (Situation Typing) | 12 | 3,311 |
Please refer to the paper for further details on each particular dataset.
### Evaluation results
These are the official results reported in the paper:
#### Scientific domain benchmark
| Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed |
|-------|-------|--------------|-------------|----------|----------|--------|
| [fb/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) | 33.28 | **66.18**🔥 | 51.77 | 54.62 | 28.41 | **31.59**🔥 |
| SCIroShot | **42.22**🔥 | 59.34 | **69.86**🔥 | **66.07**🔥 | **54.42**🔥 | 27.93 |
#### General domain benchmark
| Model | Topic | Emotion | Situation |
|-------|-------|---------|-----------|
| RTE [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 43.8 | 12.6 | **37.2**🔥 |
| FEVER [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 40.1 | 24.7 | 21.0 |
| MNLI [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 37.9 | 22.3 | 15.4 |
| NSP [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 50.6 | 16.5 | 25.8 |
| NSP-Reverse [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 53.1 | 16.1 | 19.9 |
| SCIroShot | **59.08**🔥 | **24.94**🔥 | 27.42 |
All the numbers reported above represent **label-wise weighted F1** except for the Topic classification dataset, which is evaluated in terms of **accuracy** following the notation from [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf).
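For readers unfamiliar with the metric, the following sketch shows how a support-weighted, label-wise F1 and plain accuracy can be computed. It assumes a single-label setup and uses invented toy predictions; the official evaluation scripts may differ in details such as the handling of multi-label examples. For single-label data the result should match `f1_score(..., average="weighted")` from scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Label-wise F1 averaged with weights equal to each label's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight each label's F1 by the number of true examples of that label.
        total += f1 * n_true
    return total / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels for illustration only.
y_true = ["sports", "sports", "science", "politics"]
y_pred = ["sports", "science", "science", "politics"]
print(f"weighted F1: {100 * weighted_f1(y_true, y_pred):.2f}")
print(f"accuracy:    {100 * accuracy(y_true, y_pred):.2f}")
```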
## Additional information
### Authors
- SIRIS Lab, Research Division of SIRIS Academic.
- Language Technologies Unit, Barcelona Supercomputing Center.
### Contact
For further information, send an email to either <langtech@bsc.es> or <info@sirisacademic.com>.
### License
This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
### Funding
This work was partially funded by two projects under the EU's H2020 Research and Innovation Programme:
- INODE (grant agreement No 863410).
- IntelComp (grant agreement No 101004870).
### Citation
```bibtex
@inproceedings{pamies2023weakly,
title={A weakly supervised textual entailment approach to zero-shot text classification},
author={P{\`a}mies, Marc and Llop, Joan and Multari, Francesco and Duran-Silva, Nicolau and Parra-Rojas, C{\'e}sar and Gonz{\'a}lez-Agirre, Aitor and Massucci, Francesco Alessandro and Villegas, Marta},
booktitle={Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics},
pages={286--296},
year={2023}
}
```
### Disclaimer
<details>
<summary>Click to expand</summary>
The model published in this repository is intended for a generalist purpose
and is made available to third parties under an Apache v2.0 License.
Please keep in mind that the model may have bias and/or any other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model
(or a system based on it) or become users of the model itself, they should note that it is under
their responsibility to mitigate the risks arising from its use and, in any event, to comply with
applicable regulations, including regulations regarding the use of Artificial Intelligence.
In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
</details>