Commit c9a8b71 by mapama247 (1 parent: a19b1de)
init commit
- README.md +204 -0
- config.json +36 -0
- merges.txt +0 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- trainer_state.json +84 -0
- training_args.bin +3 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -1,3 +1,207 @@
---
pipeline_tag: zero-shot-classification
license: apache-2.0
language:
- en
tags:
- zero-shot
- text-classification
- science
- mag
---

# SCIroShot

## Overview

<details>
<summary>Click to expand</summary>

- **Model type:** Language Model
- **Architecture:** RoBERTa-large
- **Language:** English
- **License:** Apache 2.0
- **Task:** Zero-Shot Text Classification
- **Data:** Microsoft Academic Graph
- **Additional Resources:**
    - Paper: work in progress (soon to be published in EACL 2023)
    - [GitHub](https://github.com/TeMU-BSC/sciroshot)
</details>

## Model description

SCIroShot is an entailment-based zero-shot text classification model that
has been fine-tuned on a purpose-built dataset of scientific articles
from [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)
(MAG). The resulting model achieves state-of-the-art
performance in the scientific domain and very competitive results in other areas.

## Intended Usage

This model is intended to be used for zero-shot text classification in English.

## How to use

```python
from transformers import pipeline

# Load the zero-shot classification pipeline with the SCIroShot checkpoint.
zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
# Each candidate label replaces "{}" to form the entailment hypothesis.
template = "This example is {}"

# multi_label=False normalizes the scores across labels so that they sum to 1.
output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

# "labels" is sorted by descending score, so the first entry is the prediction.
print(output)
print(f'Predicted class: {output["labels"][0]}')
```

## Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

Even though the fine-tuning data (which is scientific in nature) may seem harmless, the corpus used to pre-train the base model is very likely to contain a lot of unfiltered content from the internet, as stated in the [RoBERTa-large model card](https://huggingface.co/roberta-large#limitations-and-bias).

## Training

### Training data

Our data builds on scientific-domain
annotated data from Microsoft Academic Graph (MAG).
This database consists of a heterogeneous
graph with billions of records from both scientific
publications and patents, in addition to metadata
such as authors, institutions, journals,
conferences and their citation relationships.
The documents are organized in a six-level hierarchical
structure of scientific concepts, where the two
top-most levels are manually curated in order to
guarantee a high level of accuracy.

To create the training corpus, a random sample of
scientific articles with a publication year between
2000 and 2021 was retrieved from MAG with their respective
titles and abstracts in English. This results in over 2M documents
with their corresponding Field of Study, which was obtained from
level 1 of the MAG taxonomy (292 possible classes, such as "Computational biology"
or "Transport Engineering").

The fine-tuning dataset was constructed in a weakly supervised
manner by converting text classification data to the entailment format.
Using the relationship between scientific texts
and their matching concepts in level 1 of the MAG
taxonomy, we generate the premise-hypothesis pairs corresponding to the entailment
label. Conversely, we generate the pairs for the
neutral label by discarding the actual relationship
between the texts and their scientific concepts and
pairing each text with concepts to which
it is not matched.
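
The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_entailment_pairs`, the hypothesis template, the number of negatives per document and the uniform random sampling of unmatched labels are all assumptions made for the example. Only the two label names (`ENTAILMENT`, `NEUTRAL`) come from the model's `config.json`.

```python
import random


def build_entailment_pairs(docs, all_labels, template="This example is {}",
                           negatives_per_doc=1, seed=0):
    """Turn (text, gold_label) classification data into premise-hypothesis pairs.

    Hypothetical helper: the template and the negative sampling strategy are
    illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    pairs = []
    for text, gold in docs:
        # The true text-concept relationship yields an ENTAILMENT pair.
        pairs.append({"premise": text,
                      "hypothesis": template.format(gold),
                      "label": "ENTAILMENT"})
        # Pairing the text with concepts it is NOT matched to yields NEUTRAL pairs.
        unmatched = [label for label in all_labels if label != gold]
        k = min(negatives_per_doc, len(unmatched))
        for wrong in rng.sample(unmatched, k):
            pairs.append({"premise": text,
                          "hypothesis": template.format(wrong),
                          "label": "NEUTRAL"})
    return pairs


# Toy documents; the third label name is made up for illustration.
example_docs = [
    ("Graph neural networks for protein structure prediction.", "Computational biology"),
    ("Optimizing bus timetables in dense urban areas.", "Transport Engineering"),
]
example_labels = ["Computational biology", "Transport Engineering", "Machine learning"]
example_pairs = build_entailment_pairs(example_docs, example_labels, negatives_per_doc=2)
```

Each document contributes one positive pair and `negatives_per_doc` negative pairs, so the ratio between the two classes is controlled by a single parameter in this sketch.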

### Training procedure

The newly created scientific dataset described in the previous section
was used to fine-tune a 355M-parameter RoBERTa model on the entailment task.
To classify a text, the model computes the entailment score between that text
and each candidate label. The final prediction is the
highest-scoring class in a single-label classification setup, or the N classes
above a certain threshold in a multi-label scenario.
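
The two decision rules can be illustrated with a small sketch. It assumes that, for each candidate label, the model returns a pair of raw logits ordered as in `config.json` (index 0 = `ENTAILMENT`, index 1 = `NEUTRAL`). The function names, the example logits and the 0.5 threshold are hypothetical, and the exact normalization used in practice may differ.

```python
import math

ENTAILMENT, NEUTRAL = 0, 1  # label ids as listed in config.json


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_single_label(logits_per_label):
    """Pick the label whose entailment logit is highest.

    Scores are the entailment logits normalized across all candidate labels.
    """
    labels = list(logits_per_label)
    scores = softmax([logits_per_label[label][ENTAILMENT] for label in labels])
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], dict(zip(labels, scores))


def predict_multi_label(logits_per_label, threshold=0.5):
    """Keep every label whose own entailment probability exceeds the threshold."""
    selected = {}
    for label, logits in logits_per_label.items():
        # Each label is scored independently: entailment vs. neutral.
        probability = softmax(list(logits))[ENTAILMENT]
        if probability > threshold:
            selected[label] = probability
    return selected


# Made-up logits (entailment, neutral) for three candidate labels.
example_logits = {
    "politics": (-2.0, 2.5),
    "science": (0.5, 0.1),
    "sports": (3.0, -1.0),
}
top_label, single_scores = predict_single_label(example_logits)
multi_selection = predict_multi_label(example_logits, threshold=0.5)
```

In the single-label rule the scores compete with each other and sum to 1, whereas in the multi-label rule each label is judged on its own, so any number of labels (including none) can pass the threshold.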

A subset of 52 labels from the training data was held out so that it
could be used as a development set of fully unseen classes.
As a novelty, validation was not performed on the entailment task (which is used as a proxy)
but directly on the target text classification task. This allows us to stop training at the right
time via early stopping, which prevents the model from "overfitting" to the training task. This method
was our way to counteract an effect discovered empirically during experimentation:
after a certain point, the model can start to worsen on the target task (ZSTC) while still
improving on the training task (RTE). Simply shortening the training time led to a boost in performance.
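
A minimal sketch of this idea, assuming each checkpoint is evaluated on both the proxy task (RTE) and the target task (ZSTC): the function name, the `patience` parameter and all numbers in `toy_history` are invented for illustration and are not results from the paper.

```python
def early_stopping_on_target_task(checkpoints, patience=2):
    """Select the checkpoint with the best target-task (ZSTC) score.

    checkpoints: iterable of (step, rte_accuracy, zstc_score) in training order.
    The proxy-task accuracy is deliberately ignored when deciding to stop.
    """
    best_step, best_score, evals_without_gain = None, float("-inf"), 0
    for step, _rte_accuracy, zstc_score in checkpoints:
        if zstc_score > best_score:
            best_step, best_score, evals_without_gain = step, zstc_score, 0
        else:
            evals_without_gain += 1
            if evals_without_gain >= patience:
                break  # the target task stopped improving: stop training here
    return best_step, best_score


# Made-up numbers: RTE accuracy keeps rising while the ZSTC score peaks at step 4.
toy_history = [
    (1, 0.90, 0.40),
    (2, 0.92, 0.45),
    (3, 0.94, 0.50),
    (4, 0.95, 0.55),
    (5, 0.96, 0.53),
    (6, 0.97, 0.50),
]
chosen_step, chosen_score = early_stopping_on_target_task(toy_history)
```

Had the stopping criterion been the proxy accuracy, the toy run would have continued past step 6 even though the target score was already declining.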

Read the paper for more details on the methodology and the analysis of RTE/ZSTC correlation.

## Evaluation

### Evaluation data

The model's performance was evaluated on a collection of discipline-labeled text datasets, both from the scientific domain (closer to the training data) and the general domain (to assess generalizability).

The following table provides an overview of the number of examples and labels for each dataset:

| Dataset          | Labels | Size   |
|------------------|--------|--------|
| arXiv            | 11     | 3,838  |
| SciDocs-MeSH     | 11     | 16,433 |
| SciDocs-MAG      | 19     | 17,501 |
| Konstanz         | 24     | 10,000 |
| Elsevier         | 26     | 14,738 |
| PubMed           | 109    | 5,000  |
| Topic Categorization (Yahoo! Answers) | 10 | 60,000 |
| Emotion Detection (UnifyEmotion) | 10 | 15,689 |
| Situation Frame Detection (Situation Typing) | 12 | 3,311 |

Please refer to the paper for further details on each particular dataset.

### Evaluation results

These are the official results reported in the paper:

#### Scientific domain benchmark

| Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed |
|-------|-------|--------------|-------------|----------|----------|--------|
| [fb/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) | 33.28 | **66.18**🔥 | 51.77 | 54.62 | 28.41 | **31.59**🔥 |
| SCIroShot | **42.22**🔥 | 59.34 | **69.86**🔥 | **66.07**🔥 | **54.42**🔥 | 27.93 |

#### General domain benchmark

| Model | Topic | Emotion | Situation |
|-------|-------|---------|-----------|
| RTE [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 43.8 | 12.6 | **37.2**🔥 |
| FEVER [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 40.1 | 24.7 | 21.0 |
| MNLI [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 37.9 | 22.3 | 15.4 |
| NSP [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 50.6 | 16.5 | 25.8 |
| NSP-Reverse [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 53.1 | 16.1 | 19.9 |
| SCIroShot | **59.08**🔥 | **24.94**🔥 | 27.42 |

All the numbers reported above represent **label-wise weighted F1** except for the Topic classification dataset, which is evaluated in terms of **accuracy**, following the convention of [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf).
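
For readers unfamiliar with the metric, the sketch below shows one common way to compute it. It assumes that "label-wise weighted F1" means the per-label F1 averaged with weights proportional to each label's share of the gold annotations; the exact definition used in the paper may differ in details. The toy labels are made up.

```python
from collections import Counter


def label_wise_weighted_f1(y_true, y_pred):
    """Average the per-label F1 scores, weighting each label by its support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_gold in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_gold - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        score += f1 * n_gold / total
    return score


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example with two labels and one mistake.
gold_labels = ["a", "a", "a", "b"]
predicted_labels = ["a", "a", "b", "b"]
toy_f1 = label_wise_weighted_f1(gold_labels, predicted_labels)
toy_accuracy = accuracy(gold_labels, predicted_labels)
```

In the toy example, label "a" has F1 = 0.8 with weight 3/4 and label "b" has F1 = 2/3 with weight 1/4, which gives a weighted F1 of about 0.767 against an accuracy of 0.75.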

## Additional information

### Authors

- SIRIS Lab, Research Division of SIRIS Academic.
- Language Technologies Unit, Barcelona Supercomputing Center.

### Contact

For further information, send an email to either <langtech@bsc.es> or <info@sirisacademic.com>.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work was partially funded by two projects under the EU’s H2020 Research and Innovation Programme:
- INODE (grant agreement No 863410).
- IntelComp (grant agreement No 101004870).

### Citation

```bibtex
Soon to be published in EACL 2023.
```

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for general-purpose use
and is made available to third parties under an Apache v2.0 License.

Please keep in mind that the model may have bias and/or any other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model
(or a system based on it) or become users of the model itself, they should note that it is
their responsibility to mitigate the risks arising from its use and, in any event, to comply with
applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
</details>
config.json
ADDED
@@ -0,0 +1,36 @@
{
  "_name_or_path": "./models/roberta-large",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mag",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "ENTAILMENT",
    "1": "NEUTRAL"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "ENTAILMENT": 0,
    "NEUTRAL": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
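
As a sanity check, the "355M parameters" figure quoted in the README can be approximated from the values in this config. The sketch below assumes the standard RoBERTa layout (biased linear layers, one LayerNorm after the embeddings, after each attention block and after each feed-forward block) and a two-layer classification head (dense plus output projection). The helper names are hypothetical, and `num_labels` is inferred from the two entries in `id2label`.

```python
# Values copied from config.json; num_labels is inferred from id2label.
config = {
    "vocab_size": 50265,
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "num_hidden_layers": 24,
    "max_position_embeddings": 514,
    "type_vocab_size": 1,
    "num_labels": 2,
}


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def count_parameters(cfg):
    """Estimate the parameter count under the assumed RoBERTa layout."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    # Word, position and token-type embeddings, followed by one LayerNorm.
    rows = cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    embeddings = rows * h + layer_norm(h)
    # Query, key, value and output projections, followed by one LayerNorm.
    attention = 4 * linear(h, h) + layer_norm(h)
    # Two-layer feed-forward block, followed by one LayerNorm.
    feed_forward = linear(h, ff) + linear(ff, h) + layer_norm(h)
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    # Assumed classification head: dense layer plus projection to the labels.
    head = linear(h, h) + linear(h, cfg["num_labels"])
    return embeddings + encoder + head


total_parameters = count_parameters(config)
approx_float32_bytes = 4 * total_parameters
```

Under these assumptions the estimate is 355,361,794 parameters, or roughly 1.42 GB at four bytes per float32 weight, which is in line with the 1,421,611,309-byte `pytorch_model.bin` listed in this commit (the small remainder would be serialization overhead and non-trainable buffers).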
merges.txt
ADDED
The diff for this file is too large to render.
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8ba738451ec038c55bb862197a5abae763a6a340694c49b20dbdad523c5f577e
size 1421611309
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "add_prefix_space": false, "errors": "replace", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "special_tokens_map_file": null, "name_or_path": "/gpfs/projects/bsc88/huggingface/models/roberta-large", "tokenizer_class": "RobertaTokenizer"}
trainer_state.json
ADDED
@@ -0,0 +1,84 @@
{
  "best_metric": 0.9606727878443162,
  "best_model_checkpoint": "./output//roberta_0_100_0.000008_8_0.01_0.000008_01-26-22_23-25/checkpoint-8000",
  "epoch": 0.6081800212863008,
  "global_step": 8000,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 0.15,
      "learning_rate": 2.0271126314455844e-06,
      "loss": 0.3799,
      "step": 2000
    },
    {
      "epoch": 0.15,
      "eval_accuracy": 0.9360740357434579,
      "eval_combined_score": 116916.96803701787,
      "eval_loss": 0.16771526634693146,
      "eval_number_of_examples": 233833,
      "eval_runtime": 565.9226,
      "eval_samples_per_second": 413.189,
      "eval_steps_per_second": 2.583,
      "step": 2000
    },
    {
      "epoch": 0.3,
      "learning_rate": 4.054225262891169e-06,
      "loss": 0.162,
      "step": 4000
    },
    {
      "epoch": 0.3,
      "eval_accuracy": 0.9526029260198517,
      "eval_combined_score": 116916.97630146302,
      "eval_loss": 0.12519623339176178,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.1477,
      "eval_samples_per_second": 413.025,
      "eval_steps_per_second": 2.582,
      "step": 4000
    },
    {
      "epoch": 0.46,
      "learning_rate": 6.081337894336754e-06,
      "loss": 0.127,
      "step": 6000
    },
    {
      "epoch": 0.46,
      "eval_accuracy": 0.958962165306009,
      "eval_combined_score": 116916.97948108266,
      "eval_loss": 0.11369086056947708,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.6337,
      "eval_samples_per_second": 412.67,
      "eval_steps_per_second": 2.58,
      "step": 6000
    },
    {
      "epoch": 0.61,
      "learning_rate": 7.993077066164161e-06,
      "loss": 0.1168,
      "step": 8000
    },
    {
      "epoch": 0.61,
      "eval_accuracy": 0.9606727878443162,
      "eval_combined_score": 116916.98033639393,
      "eval_loss": 0.10579771548509598,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.6673,
      "eval_samples_per_second": 412.646,
      "eval_steps_per_second": 2.58,
      "step": 8000
    }
  ],
  "max_steps": 131540,
  "num_train_epochs": 10,
  "total_flos": 1.1928721221631345e+18,
  "trial_name": null,
  "trial_params": null
}
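
The figures in this trainer state can be cross-checked with a few lines of arithmetic. The sketch below copies the relevant values from the JSON above; the variable names are illustrative. The reading of `eval_combined_score` as the plain mean of the other two evaluation metrics is an observation from the numbers, not something stated in the source.

```python
# Values copied from trainer_state.json.
max_steps = 131540
num_train_epochs = 10
global_step = 8000
eval_number_of_examples = 233833

evaluations = [
    {"step": 2000, "eval_accuracy": 0.9360740357434579, "eval_loss": 0.16771526634693146},
    {"step": 4000, "eval_accuracy": 0.9526029260198517, "eval_loss": 0.12519623339176178},
    {"step": 6000, "eval_accuracy": 0.958962165306009, "eval_loss": 0.11369086056947708},
    {"step": 8000, "eval_accuracy": 0.9606727878443162, "eval_loss": 0.10579771548509598},
]

# 131540 scheduled steps over 10 epochs means 13154 optimizer steps per epoch.
steps_per_epoch = max_steps / num_train_epochs
# The saved checkpoint therefore sits well inside the first epoch.
epoch_at_checkpoint = global_step / steps_per_epoch

# The best evaluation by accuracy is the last one logged.
best_evaluation = max(evaluations, key=lambda entry: entry["eval_accuracy"])

# Assumption: the combined score is the mean of accuracy and example count,
# which would explain why it is dominated by the number of examples.
combined_score = (best_evaluation["eval_accuracy"] + eval_number_of_examples) / 2
```

The checkpoint at step 8000 corresponds to about 0.61 of one epoch out of the 10 scheduled, which matches the README's remark that shortening training improved results on the target task.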
training_args.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e190f0e7b03ebe0b395f81f0f1d2b6b19070636ff90ffbba9d1c9effffdccc54
size 2735
vocab.json
ADDED
The diff for this file is too large to render.