---
pipeline_tag: zero-shot-classification
license: apache-2.0
language:
- en
tags:
- zero-shot
- text-classification
- science
- mag
widget:
  - text: Leo Messi is the best player ever
    candidate_labels: politics, science, sports, environment
    multi_class: true
---

# SCIroShot

## Overview

<details>
<summary>Click to expand</summary>
  
- **Model type:** Language Model
- **Architecture:** RoBERTa-large
- **Language:** English
- **License:** Apache 2.0
- **Task:** Zero-Shot Text Classification
- **Data:** Microsoft Academic Graph
- **Additional Resources:**
  - Paper: work in progress (soon to be published in EACL 2023)
  - [GitHub](https://github.com/TeMU-BSC/sciroshot)
</details>

## Model description

SCIroShot is an entailment-based Zero-Shot Text Classification model that 
has been fine-tuned on a custom dataset composed of scientific articles
from [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/) 
(MAG). The resulting model achieves state-of-the-art
performance in the scientific domain and very competitive results in other areas.

## Intended Usage

This model is intended to be used for zero-shot text classification in English.

## How to use

```python
from transformers import pipeline

zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
template = "This example is {}"

output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

print(output)
print(f'Predicted class: {output["labels"][0]}')
```

## Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

Even though the fine-tuning data (which is of a scientific nature) may seem harmless, it is important to note that the corpus used to pre-train the vanilla model is very likely to contain a lot of unfiltered content from the internet, as stated in the [RoBERTa-large model card](https://huggingface.co/roberta-large#limitations-and-bias).

## Training

### Training data

Our data builds on top of scientific-domain 
annotated data from Microsoft Academic Graph (MAG).
This database consists of a heterogeneous
graph with billions of records from both scientific
publications and patents, in addition to metadata 
information such as the authors, institutions, journals,
conferences and their citation relationships.
The documents are organized in a six-level hierarchical 
structure of scientific concepts, where the two
top-most levels are manually curated in order to 
guarantee a high level of accuracy.

To create the training corpus, a random sample of
scientific articles with a publication year between
2000 and 2021 was retrieved from MAG with their respective
titles and abstracts in English. This resulted in over 2M documents
with their corresponding Field Of Study, which was obtained from
the level-1 MAG taxonomy (292 possible classes, such as "Computational biology"
or "Transport Engineering"). 

The fine-tuning dataset was constructed in a weakly supervised 
manner by converting text classification data to the entailment format.
Using the relationship between scientific texts
and their matching concepts in the level-1 MAG
taxonomy, we are able to generate the
premise-hypothesis pairs corresponding to the entailment
label. Conversely, we generate the pairs for the
neutral label by removing the actual relationship
between the texts and their scientific concepts and
creating a virtual relationship with those to which
they are not matched.
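
The pair construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the function name `build_pairs`, the number of negatives per text, the reuse of the `"This example is {}"` template at training time, and the label "Organic chemistry" are all assumptions made for the example.

```python
import random

# Assumption: the hypothesis template from the usage example is reused here.
TEMPLATE = "This example is {}"


def build_pairs(text, true_label, all_labels, n_negatives=1, rng=None):
    """Turn one (text, label) example into premise-hypothesis pairs.

    The matching label yields an "entailment" pair; labels the text is
    NOT matched to yield "neutral" pairs.
    """
    rng = rng or random.Random(0)
    pairs = [
        {
            "premise": text,
            "hypothesis": TEMPLATE.format(true_label),
            "label": "entailment",
        }
    ]
    # Sample labels that do not belong to this text (hypothetical strategy).
    negatives = [label for label in all_labels if label != true_label]
    for label in rng.sample(negatives, k=min(n_negatives, len(negatives))):
        pairs.append(
            {
                "premise": text,
                "hypothesis": TEMPLATE.format(label),
                "label": "neutral",
            }
        )
    return pairs


# Illustrative label set; "Organic chemistry" is a made-up third class.
labels = ["Computational biology", "Transport Engineering", "Organic chemistry"]
pairs = build_pairs(
    "We model traffic flow on urban road networks.",
    "Transport Engineering",
    labels,
    n_negatives=2,
)
for pair in pairs:
    print(pair["label"], "|", pair["hypothesis"])
```

How many neutral pairs to generate per text, and how to sample them, is a design choice that this sketch does not take from the source.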

### Training procedure

The newly-created scientific dataset described in the previous section 
was used to fine-tune a 355M-parameter RoBERTa model on the entailment task.
To do so, the model has to compute the entailment score between every text that 
is fed to it and all candidate labels. The final prediction is the 
highest-scoring class in a single-label classification setup, or the N classes 
above a certain threshold in a multi-label scenario.
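
As a rough sketch of that decision rule, assuming the entailment scores per candidate label are already available as a dictionary (the helper `select_labels`, the example scores, and the default threshold of 0.5 are illustrative and not taken from the paper):

```python
def select_labels(scores, multi_label=False, threshold=0.5):
    """Pick the predicted label(s) from per-label entailment scores.

    scores: dict mapping each candidate label to its entailment score.
    """
    # Rank candidate labels from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if multi_label:
        # Multi-label: keep every class scoring above the threshold.
        return [label for label, score in ranked if score > threshold]
    # Single-label: keep only the highest-scoring class.
    return [ranked[0][0]]


# Made-up scores for illustration only.
scores = {"politics": 0.1, "science": 0.6, "sports": 0.9}
single = select_labels(scores)
multi = select_labels(scores, multi_label=True, threshold=0.5)
print(single)
print(multi)
```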

A subset of 52 labels from the training data was held out so that it
could be used as a development set of fully unseen classes.
As a novelty, validation was not performed on the entailment task (which is used as a proxy)
but directly on the target text classification task. This allows us to stop training at the right
time via early stopping, which prevents the model from "overfitting" to the training task. This method
counteracts an effect observed empirically during experimentation: after a certain point, the model
can start to get worse at the target task (ZSTC) while still improving at the training task (RTE).
Simply shortening the training time led to a boost in performance.
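
A minimal sketch of early stopping driven by the target-task metric rather than the training-task loss is shown below. The callbacks `train_one_epoch` and `evaluate_zstc`, the epoch granularity, and the `patience` value are hypothetical placeholders; the source does not specify these details.

```python
def train_with_early_stopping(train_one_epoch, evaluate_zstc, max_epochs=10, patience=2):
    """Stop training once the ZSTC dev score stops improving.

    train_one_epoch(epoch): runs one pass of entailment (RTE) fine-tuning.
    evaluate_zstc(epoch): returns the zero-shot classification score on the
        held-out development labels (higher is better).
    """
    best_score, best_epoch, stale = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        score = evaluate_zstc(epoch)
        if score > best_score:
            # New best on the target task: remember it and reset the counter.
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # the target task has stopped improving
    return best_epoch, best_score


# Toy dev scores that peak and then decline, mimicking the effect described above.
dev_scores = [0.41, 0.48, 0.52, 0.50, 0.47, 0.45]
best_epoch, best_score = train_with_early_stopping(
    train_one_epoch=lambda epoch: None,          # placeholder for real training
    evaluate_zstc=lambda epoch: dev_scores[epoch],
    max_epochs=len(dev_scores),
)
print(best_epoch, best_score)
```

In a real setup, the checkpoint from `best_epoch` would be the one kept.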

Read the paper for more details on the methodology and the analysis of RTE/ZSTC correlation.

## Evaluation

### Evaluation data

The model's performance was evaluated on a collection of disciplinary-labeled textual datasets, both from the scientific domain (closer to training data) and the general domain (to assess generalizability).

The following table provides an overview of the number of examples and labels for each dataset:

| Dataset          | Labels | Size   |
|------------------|--------|--------|
| arXiv            | 11     | 3,838  |
| SciDocs-MeSH     | 11     | 16,433 |
| SciDocs-MAG      | 19     | 17,501 |
| Konstanz         | 24     | 10,000 |
| Elsevier         | 26     | 14,738 |
| PubMed           | 109    | 5,000  |
| Topic Categorization (Yahoo! Answers) | 10     | 60,000 |
| Emotion Detection (UnifyEmotion) | 10     | 15,689 |
| Situation Frame Detection (Situation Typing) | 12     | 3,311  |

Please refer to the paper for further details on each particular dataset. 

### Evaluation results

These are the official results reported in the paper:

#### Scientific domain benchmark
| Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed |
|-------|-------|--------------|-------------|----------|----------|--------|
| [fb/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) | 33.28 | **66.18**🔥 | 51.77 | 54.62 | 28.41 | **31.59**🔥 |
| SCIroShot | **42.22**🔥 | 59.34 | **69.86**🔥 | **66.07**🔥 | **54.42**🔥 | 27.93 |

#### General domain benchmark
| Model | Topic | Emotion | Situation |
|-------|-------|---------|-----------|
| RTE [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 43.8 | 12.6 | **37.2**🔥 |
| FEVER [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 40.1 | 24.7 | 21.0 |
| MNLI [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 37.9 | 22.3 | 15.4 |
| NSP [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 50.6 | 16.5 | 25.8 |
| NSP-Reverse [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 53.1 | 16.1 | 19.9 |
| SCIroShot | **59.08**🔥 | **24.94**🔥 | 27.42 |

All the numbers reported above represent **label-wise weighted F1** except for the Topic classification dataset, which is evaluated in terms of **accuracy**, following the convention of [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf).
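
For reference, the two metrics can be computed along these lines. This is a plain-Python sketch that assumes single-label predictions and the usual support-weighted average of per-label F1; it is not the evaluation script used for the paper, and the toy labels are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-label F1, averaged with weights proportional to label support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n / total) * f1
    return score


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy gold labels and predictions for illustration only.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(round(weighted_f1(y_true, y_pred), 4))
print(accuracy(y_true, y_pred))
```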

## Additional information

### Authors 

- SIRIS Lab, Research Division of SIRIS Academic.
- Language Technologies Unit, Barcelona Supercomputing Center.

### Contact

For further information, send an email to either <langtech@bsc.es> or <info@sirisacademic.com>.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work was partially funded by two projects under the EU’s H2020 Research and Innovation Programme:
- INODE (grant agreement No 863410).
- IntelComp (grant agreement No 101004870).

### Citation

```bibtex
Soon to be published in EACL 2023.
```

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for general-purpose use 
and is made available to third parties under an Apache v2.0 License.

Please keep in mind that the model may have bias and/or any other undesirable distortions. 
When third parties deploy or provide systems and/or services to other parties using this model 
(or a system based on it) or become users of the model itself, they should note that it is 
their responsibility to mitigate the risks arising from its use and, in any event, to comply with 
applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
</details>