|
--- |
|
pipeline_tag: zero-shot-classification |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- zero-shot |
|
- text-classification |
|
- science |
|
- mag |
|
widget: |
|
- text: Leo Messi is the best player ever |
|
candidate_labels: politics, science, sports, environment |
|
multi_class: true |
|
--- |
|
|
|
# SCIroShot |
|
|
|
## Overview |
|
|
|
<details> |
|
<summary>Click to expand</summary> |
|
|
|
- **Model type:** Language Model |
|
- **Architecture:** RoBERTa-large |
|
- **Language:** English |
|
- **License:** Apache 2.0 |
|
- **Task:** Zero-Shot Text Classification |
|
- **Data:** Microsoft Academic Graph |
|
- **Additional Resources:** |
|
  - [Paper]() (work in progress; soon to be published in EACL 2023)
|
- [GitHub](https://github.com/TeMU-BSC/sciroshot) |
|
</details> |
|
|
|
## Model description |
|
|
|
SCIroShot is an entailment-based Zero-Shot Text Classification model that has been fine-tuned on a custom-built dataset composed of scientific articles from [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/) (MAG). The resulting model achieves state-of-the-art (SOTA) performance in the scientific domain and very competitive results in other areas.
|
|
|
## Intended Usage |
|
|
|
This model is intended to be used for zero-shot text classification in English. |
|
|
|
## How to use |
|
|
|
```python |
|
from transformers import pipeline |
|
|
|
# Load the zero-shot classification pipeline with the SCIroShot checkpoint
zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
template = "This example is {}"

# The template is filled with each candidate label to form the hypotheses;
# with multi_label=False the scores are normalised across the candidate labels.
output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False)
|
|
|
print(output) |
|
print(f'Predicted class: {output["labels"][0]}') |
|
``` |
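
The pipeline returns a dictionary with the input sequence, the candidate labels sorted from most to least likely, and their corresponding scores. The structure of `output` looks roughly as follows (the scores shown are illustrative placeholders, not actual model outputs):

```python
{
    "sequence": "Leo Messi is the best player ever.",
    "labels": ["sports", "environment", "science", "politics"],
    "scores": [0.95, 0.03, 0.01, 0.01],
}
```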
|
|
|
## Limitations and bias |
|
|
|
No measures have been taken to estimate the bias and toxicity embedded in the model. |
|
|
|
Even though the fine-tuning data (which is of a scientific nature) may seem harmless, it is important to note that the corpus used to pre-train the vanilla model is very likely to contain a lot of unfiltered content from the internet, as stated in the [RoBERTa-large model card](https://huggingface.co/roberta-large#limitations-and-bias). |
|
|
|
## Training |
|
|
|
### Training data |
|
|
|
Our data builds on top of scientific-domain annotated data from Microsoft Academic Graph (MAG). This database consists of a heterogeneous graph with billions of records from both scientific publications and patents, together with metadata such as authors, institutions, journals, conferences and their citation relationships. The documents are organized in a six-level hierarchical structure of scientific concepts, where the two topmost levels are manually curated to guarantee a high level of accuracy.
|
|
|
To create the training corpus, a random sample of scientific articles published between 2000 and 2021 was retrieved from MAG, together with their titles and abstracts in English. This results in over 2M documents, each with its corresponding Field of Study obtained from the 1-level MAG taxonomy (292 possible classes, such as "Computational biology" or "Transport Engineering").
|
|
|
The fine-tuning dataset was constructed in a weakly supervised manner by converting text classification data into the entailment format. Premise-hypothesis pairs for the entailment label are generated from the relationship between scientific texts and their matching concepts in the 1-level MAG taxonomy. Conversely, pairs for the neutral label are generated by discarding the actual relationship and pairing each text with scientific concepts to which it is not matched.
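
To make the conversion concrete, the sketch below builds entailment and neutral premise-hypothesis pairs from (text, Field of Study) records. The toy records, the helper name and the hypothesis template are assumptions for illustration, not the exact code used to build the dataset:

```python
import random

# Hypothetical (title + abstract, Field of Study) records from the 1-level MAG taxonomy.
documents = [
    ("Deep learning methods for protein structure prediction ...", "Computational biology"),
    ("Optimising freight routing in dense urban road networks ...", "Transport engineering"),
]
all_labels = sorted({label for _, label in documents})

def to_entailment_pairs(documents, all_labels, template="This example is {}"):
    """Convert (text, label) records into premise-hypothesis pairs for entailment fine-tuning."""
    pairs = []
    for text, label in documents:
        # Entailment: hypothesis built from the text's true Field of Study.
        pairs.append({"premise": text, "hypothesis": template.format(label), "label": "entailment"})
        # Neutral: hypothesis built from a Field of Study the text is NOT matched to.
        wrong_label = random.choice([l for l in all_labels if l != label])
        pairs.append({"premise": text, "hypothesis": template.format(wrong_label), "label": "neutral"})
    return pairs

print(to_entailment_pairs(documents, all_labels))
```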
|
|
|
### Training procedure |
|
|
|
The newly-created scientific dataset described in the previous section was used to fine-tune a 355M-parameter RoBERTa model on the entailment task. To do so, the model computes an entailment score between every text that is fed to it and each candidate label. The final prediction is the highest-scoring class in a single-label classification setup, or every class above a certain threshold in a multi-label scenario.
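
Both setups can be reproduced with the `transformers` pipeline shown earlier. The sketch below contrasts a single-label prediction (argmax over the candidate labels) with a multi-label prediction that keeps every class above a threshold; the example text, labels and the 0.5 threshold are arbitrary choices for illustration:

```python
from transformers import pipeline

zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

text = "The study analyses the impact of microplastics on marine ecosystems."
labels = ["environment", "economics", "sports", "medicine"]

# Single-label setup: scores are normalised across labels and the argmax is taken.
single = zstc(text, labels, hypothesis_template="This example is {}", multi_label=False)
print("Single-label prediction:", single["labels"][0])

# Multi-label setup: each label is scored independently,
# so every class above a chosen threshold is predicted.
multi = zstc(text, labels, hypothesis_template="This example is {}", multi_label=True)
print("Multi-label prediction:", [l for l, s in zip(multi["labels"], multi["scores"]) if s > 0.5])
```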
|
|
|
A subset of 52 labels from the training data was kept apart so that it could be used as a development set of fully-unseen classes. As a novelty, validation was not performed on the entailment task (which is used as a proxy) but directly on the target text classification task. This allows training to be stopped at the right time via early stopping, which prevents the model from "overfitting" to the training task. The approach counteracts an effect observed empirically during experimentation: after a certain point, the model can start to worsen on the target task (ZSTC) while still improving on the training task (RTE). Simply shortening the training time led to a boost in performance.
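
A minimal sketch of this validation strategy, under the assumption of a `train_one_epoch` routine for the entailment (RTE) objective and an `evaluate_zstc` routine that scores the model on the held-out classes with the target-task metric (both hypothetical names), could look as follows:

```python
import copy

def fit_with_target_task_early_stopping(model, train_one_epoch, evaluate_zstc,
                                        max_epochs=20, patience=3):
    """Train on the RTE proxy task but early-stop on the target ZSTC metric."""
    best_score, best_state, bad_epochs = float("-inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)        # training task: RTE on the MAG-derived pairs
        score = evaluate_zstc(model)  # validation: ZSTC on the fully-unseen dev classes
        if score > best_score:
            best_score, best_state, bad_epochs = score, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # stop before the model "overfits" to the proxy task
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return best_score
```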
|
|
|
Read the paper for more details on the methodology and the analysis of RTE/ZSTC correlation. |
|
|
|
## Evaluation |
|
|
|
### Evaluation data |
|
|
|
The model's performance was evaluated on a collection of disciplinary-labeled textual datasets, both from the scientific domain (closer to training data) and the general domain (to assess generalizability). |
|
|
|
The following table provides an overview of the number of examples and labels for each dataset: |
|
| Dataset | Labels | Size | |
|
|------------------|--------|--------| |
|
| arXiv | 11 | 3,838 | |
|
| SciDocs-MeSH | 11 | 16,433 | |
|
| SciDocs-MAG | 19 | 17,501 | |
|
| Konstanz | 24 | 10,000 | |
|
| Elsevier | 26 | 14,738 | |
|
| PubMed | 109 | 5,000 | |
|
| Topic Categorization (Yahoo! Answers) | 10 | 60,000 | |
|
| Emotion Detection (UnifyEmotion) | 10 | 15,689 | |
|
| Situation Frame Detection (Situation Typing) | 12 | 3,311 | |
|
|
|
Please refer to the paper for further details on each particular dataset. |
|
|
|
### Evaluation results |
|
|
|
These are the official results reported in the paper: |
|
|
|
#### Scientific domain benchmark |
|
| Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed |
|-------|-------|--------------|-------------|----------|----------|--------|
| [fb/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) | 33.28 | **66.18** | 51.77 | 54.62 | 28.41 | **31.59** |
| SCIroShot | **42.22** | 59.34 | **69.86** | **66.07** | **54.42** | 27.93 |
|
|
|
#### General domain benchmark |
|
| Model | Topic | Emotion | Situation | |
|
|-------|-------|---------|-----------| |
|
| RTE [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 43.8 | 12.6 | **37.2** |
|
| FEVER [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 40.1 | 24.7 | 21.0 | |
|
| MNLI [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 37.9 | 22.3 | 15.4 | |
|
| NSP [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 50.6 | 16.5 | 25.8 | |
|
| NSP-Reverse [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 53.1 | 16.1 | 19.9 | |
|
| SCIroShot | **59.08** | **24.94** | 27.42 |
|
|
|
All the numbers reported above represent **label-wise weighted F1** except for the Topic classification dataset, which is evaluated in terms of **accuracy** following the notation from [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf). |
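
For reference, the metrics above can be reproduced with scikit-learn as sketched below; the toy ground-truth and predicted labels are illustrative only:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth and predictions, for illustration only.
y_true = ["sports", "science", "science", "politics"]
y_pred = ["sports", "science", "politics", "politics"]

print("Label-wise weighted F1:", f1_score(y_true, y_pred, average="weighted"))  # used for most datasets
print("Accuracy:", accuracy_score(y_true, y_pred))  # used for Topic Categorization (Yahoo! Answers)
```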
|
|
|
## Additional information |
|
|
|
### Authors |
|
|
|
- SIRIS Lab, Research Division of SIRIS Academic. |
|
- Language Technologies Unit, Barcelona Supercomputing Center. |
|
|
|
### Contact |
|
|
|
For further information, send an email to either <langtech@bsc.es> or <info@sirisacademic.com>. |
|
|
|
### License |
|
|
|
This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
|
|
### Funding |
|
|
|
This work was partially funded by two projects under the EU's H2020 Research and Innovation Programme:
|
- INODE (grant agreement No 863410). |
|
- IntelComp (grant agreement No 101004870). |
|
|
|
### Citation |
|
|
|
```bibtex |
|
@inproceedings{pamies2023weakly, |
|
title={A weakly supervised textual entailment approach to zero-shot text classification}, |
|
author={P{\`a}mies, Marc and Llop, Joan and Multari, Francesco and Duran-Silva, Nicolau and Parra-Rojas, C{\'e}sar and Gonz{\'a}lez-Agirre, Aitor and Massucci, Francesco Alessandro and Villegas, Marta}, |
|
booktitle={Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics}, |
|
pages={286--296}, |
|
year={2023} |
|
} |
|
``` |
|
|
|
### Disclaimer |
|
|
|
<details> |
|
<summary>Click to expand</summary> |
|
|
|
The model published in this repository is intended for general-purpose use and is made available to third parties under an Apache v2.0 license.
|
|
|
Please keep in mind that the model may have bias and/or any other undesirable distortions. |
|
When third parties deploy or provide systems and/or services to other parties using this model (or a system based on it), or become users of the model itself, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
|
|
|
In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties. |
|
</details> |
|
|