mapama247 committed on
Commit c9a8b71
1 Parent(s): a19b1de

init commit
README.md CHANGED
---
pipeline_tag: zero-shot-classification
license: apache-2.0
language:
- en
tags:
- zero-shot
- text-classification
- science
- mag
---

# SCIroShot

## Overview

<details>
<summary>Click to expand</summary>

- **Model type:** Language Model
- **Architecture:** RoBERTa-large
- **Language:** English
- **License:** Apache 2.0
- **Task:** Zero-Shot Text Classification
- **Data:** Microsoft Academic Graph
- **Additional Resources:**
  - [Paper]() <-- WiP (soon to be published in EACL 2023)
  - [GitHub](https://github.com/TeMU-BSC/sciroshot)
</details>

## Model description

SCIroShot is an entailment-based zero-shot text classification model that
was fine-tuned on a purpose-built dataset of scientific articles from
[Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)
(MAG). The resulting model achieves state-of-the-art performance in the
scientific domain and very competitive results in other areas.

## Intended Usage

This model is intended to be used for zero-shot text classification in English.

## How to use

```python
from transformers import pipeline

zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
template = "This example is {}"

output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

print(output)
print(f'Predicted class: {output["labels"][0]}')
```
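Under the hood, the pipeline expands the template into one hypothesis per candidate label and scores each (premise, hypothesis) pair with the entailment head. A minimal sketch of that expansion step alone (no model call):

```python
sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
template = "This example is {}"

# One (premise, hypothesis) pair per candidate label; each pair is
# scored independently by the entailment head.
pairs = [(sentence, template.format(label)) for label in candidate_labels]

for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

With `multi_label=False` the per-label entailment scores are normalized against each other, so exactly one class wins; with `multi_label=True` each pair is scored on its own.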

## Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

Even though the fine-tuning data (which is of a scientific nature) may seem harmless, it is important to note that the corpus used to pre-train the vanilla model is very likely to contain a lot of unfiltered content from the internet, as stated in the [RoBERTa-large model card](https://huggingface.co/roberta-large#limitations-and-bias).

## Training

### Training data

Our data builds on top of scientific-domain annotated data from Microsoft Academic Graph (MAG).
This database consists of a heterogeneous graph with billions of records from both scientific
publications and patents, along with metadata such as authors, institutions, journals,
conferences and their citation relationships.
The documents are organized in a six-level hierarchical structure of scientific concepts,
where the two top-most levels are manually curated in order to guarantee a high level of accuracy.

To create the training corpus, a random sample of scientific articles published between
2000 and 2021 was retrieved from MAG, together with their titles and abstracts in English.
This resulted in over 2M documents, each labeled with its corresponding Field of Study from
the 1-level MAG taxonomy (292 possible classes, such as "Computational biology"
or "Transport Engineering").

The fine-tuning dataset was constructed in a weakly supervised manner by converting
text classification data to the entailment format. Using the relationship between
scientific texts and their matching concepts in the 1-level MAG taxonomy, we generate
the premise-hypothesis pairs for the entailment label. Conversely, we generate the pairs
for the neutral label by removing the actual relationship between the texts and their
scientific concepts and creating a virtual relationship with concepts to which they are
not matched.
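The conversion described above can be sketched as follows. The function name, template wording and the random negative-sampling scheme are illustrative assumptions, not the paper's exact procedure:

```python
import random

TEMPLATE = "This example is {}."

def to_entailment_pairs(docs, taxonomy, seed=0):
    """docs: iterable of (text, field_of_study); taxonomy: all candidate fields.

    The true text-concept relationship yields an ENTAILMENT pair; a randomly
    chosen non-matching concept yields a NEUTRAL pair.
    """
    rng = random.Random(seed)
    pairs = []
    for text, field in docs:
        # Actual relationship -> entailment example.
        pairs.append((text, TEMPLATE.format(field), "ENTAILMENT"))
        # Virtual relationship with a concept the text is NOT matched to -> neutral.
        negative = rng.choice([f for f in taxonomy if f != field])
        pairs.append((text, TEMPLATE.format(negative), "NEUTRAL"))
    return pairs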

### Training procedure

The newly-created scientific dataset described in the previous section was used to
fine-tune a 355M-parameter RoBERTa model on the entailment task. To do so, the model
computes the entailment score between every text that is fed to it and all candidate
labels. The final prediction is the highest-scoring class in a single-label
classification setup, or the N classes above a certain threshold in a multi-label scenario.
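The two prediction rules can be sketched as follows, assuming per-label entailment probabilities have already been computed (the function and threshold value are illustrative):

```python
def predict(entailment_scores, multi_label=False, threshold=0.5):
    """entailment_scores: dict mapping label -> entailment probability."""
    if multi_label:
        # Every label whose score clears the threshold.
        return sorted(l for l, s in entailment_scores.items() if s >= threshold)
    # Single-label: the highest-scoring class wins.
    return max(entailment_scores, key=entailment_scores.get)
```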

A subset of 52 labels from the training data was kept apart to serve as a development
set of fully unseen classes. As a novelty, validation was not performed on the entailment
task (which is used as a proxy) but directly on the target text classification task. This
allows us to stop training at the right time via early stopping, which prevents the model
from "overfitting" to the training task. This method was our way of counteracting an
effect discovered empirically during experimentation: after a certain point, the model
can start to worsen on the target task (ZSTC) while still improving on the training task
(RTE). The simple act of shortening the training time led to a boost in performance.
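The stopping rule amounts to standard patience-based early stopping, except that the monitored metric comes from the target ZSTC dev set rather than the RTE dev loss. A schematic sketch (the function and patience value are illustrative, not the paper's exact setup):

```python
def early_stop_on_target(zstc_scores, patience=2):
    """zstc_scores: target-task dev scores observed after each evaluation step.

    Returns the index of the checkpoint to keep: training stops once the
    target metric has not improved for `patience` consecutive evaluations,
    even if the RTE training metric is still improving.
    """
    best_idx, best = 0, float("-inf")
    for i, score in enumerate(zstc_scores):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            break  # ZSTC stopped improving: halt and keep the best checkpoint
    return best_idx
```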

Read the paper for more details on the methodology and the analysis of the RTE/ZSTC correlation.

## Evaluation

### Evaluation data

The model's performance was evaluated on a collection of disciplinary-labeled textual datasets, both from the scientific domain (closer to the training data) and the general domain (to assess generalizability).

The following table provides an overview of the number of examples and labels for each dataset:

| Dataset | Labels | Size |
|------------------|--------|--------|
| arXiv | 11 | 3,838 |
| SciDocs-MeSH | 11 | 16,433 |
| SciDocs-MAG | 19 | 17,501 |
| Konstanz | 24 | 10,000 |
| Elsevier | 26 | 14,738 |
| PubMed | 109 | 5,000 |
| Topic Categorization (Yahoo! Answers) | 10 | 60,000 |
| Emotion Detection (UnifyEmotion) | 10 | 15,689 |
| Situation Frame Detection (Situation Typing) | 12 | 3,311 |

Please refer to the paper for further details on each particular dataset.

### Evaluation results

These are the official results reported in the paper:

#### Scientific domain benchmark

| Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed |
|-------|-------|--------------|-------------|----------|----------|--------|
| [fb/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) | 33.28 | **66.18**🔥 | 51.77 | 54.62 | 28.41 | **31.59**🔥 |
| SCIroShot | **42.22**🔥 | 59.34 | **69.86**🔥 | **66.07**🔥 | **54.42**🔥 | 27.93 |

#### General domain benchmark

| Model | Topic | Emotion | Situation |
|-------|-------|---------|-----------|
| RTE [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 43.8 | 12.6 | **37.2**🔥 |
| FEVER [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 40.1 | 24.7 | 21.0 |
| MNLI [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 37.9 | 22.3 | 15.4 |
| NSP [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 50.6 | 16.5 | 25.8 |
| NSP-Reverse [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 53.1 | 16.1 | 19.9 |
| SCIroShot | **59.08**🔥 | **24.94**🔥 | 27.42 |

All the numbers reported above represent **label-wise weighted F1**, except for the Topic classification dataset, which is evaluated in terms of **accuracy** following the notation of [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf).
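Label-wise weighted F1 averages the per-class F1 scores weighted by class support. A dependency-free sketch of the metric (equivalent to scikit-learn's `f1_score` with `average="weighted"`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += count * f1
    return total / len(y_true)
```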

## Additional information

### Authors

- SIRIS Lab, Research Division of SIRIS Academic.
- Language Technologies Unit, Barcelona Supercomputing Center.

### Contact

For further information, send an email to either <langtech@bsc.es> or <info@sirisacademic.com>.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work was partially funded by two projects under the EU's H2020 Research and Innovation Programme:

- INODE (grant agreement No 863410).
- IntelComp (grant agreement No 101004870).

### Citation

```bibtex
Soon to be published in EACL 2023.
```

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose
and is made available to third parties under an Apache v2.0 License.

Please keep in mind that the model may have biases and/or other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this
model (or a system based on it), or become users of the model itself, they should note
that it is their responsibility to mitigate the risks arising from its use and, in any
event, to comply with applicable regulations, including regulations regarding the use
of Artificial Intelligence.

In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
</details>
config.json ADDED
{
  "_name_or_path": "./models/roberta-large",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mag",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "ENTAILMENT",
    "1": "NEUTRAL"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "ENTAILMENT": 0,
    "NEUTRAL": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
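Note that `id2label` defines only two classes (ENTAILMENT at index 0, NEUTRAL at index 1) rather than the usual three-way NLI head. Turning the raw logits of a premise-hypothesis pair into an entailment probability is then a two-way softmax; a minimal sketch (the function name is illustrative):

```python
import math

def entailment_probability(logits):
    """logits: [entailment_logit, neutral_logit] from the 2-class head."""
    exps = [math.exp(x) for x in logits]
    return exps[0] / sum(exps)  # index 0 = ENTAILMENT, per id2label
```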
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:8ba738451ec038c55bb862197a5abae763a6a340694c49b20dbdad523c5f577e
size 1421611309
special_tokens_map.json ADDED
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "add_prefix_space": false, "errors": "replace", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "special_tokens_map_file": null, "name_or_path": "/gpfs/projects/bsc88/huggingface/models/roberta-large", "tokenizer_class": "RobertaTokenizer"}
trainer_state.json ADDED
{
  "best_metric": 0.9606727878443162,
  "best_model_checkpoint": "./output//roberta_0_100_0.000008_8_0.01_0.000008_01-26-22_23-25/checkpoint-8000",
  "epoch": 0.6081800212863008,
  "global_step": 8000,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 0.15,
      "learning_rate": 2.0271126314455844e-06,
      "loss": 0.3799,
      "step": 2000
    },
    {
      "epoch": 0.15,
      "eval_accuracy": 0.9360740357434579,
      "eval_combined_score": 116916.96803701787,
      "eval_loss": 0.16771526634693146,
      "eval_number_of_examples": 233833,
      "eval_runtime": 565.9226,
      "eval_samples_per_second": 413.189,
      "eval_steps_per_second": 2.583,
      "step": 2000
    },
    {
      "epoch": 0.3,
      "learning_rate": 4.054225262891169e-06,
      "loss": 0.162,
      "step": 4000
    },
    {
      "epoch": 0.3,
      "eval_accuracy": 0.9526029260198517,
      "eval_combined_score": 116916.97630146302,
      "eval_loss": 0.12519623339176178,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.1477,
      "eval_samples_per_second": 413.025,
      "eval_steps_per_second": 2.582,
      "step": 4000
    },
    {
      "epoch": 0.46,
      "learning_rate": 6.081337894336754e-06,
      "loss": 0.127,
      "step": 6000
    },
    {
      "epoch": 0.46,
      "eval_accuracy": 0.958962165306009,
      "eval_combined_score": 116916.97948108266,
      "eval_loss": 0.11369086056947708,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.6337,
      "eval_samples_per_second": 412.67,
      "eval_steps_per_second": 2.58,
      "step": 6000
    },
    {
      "epoch": 0.61,
      "learning_rate": 7.993077066164161e-06,
      "loss": 0.1168,
      "step": 8000
    },
    {
      "epoch": 0.61,
      "eval_accuracy": 0.9606727878443162,
      "eval_combined_score": 116916.98033639393,
      "eval_loss": 0.10579771548509598,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.6673,
      "eval_samples_per_second": 412.646,
      "eval_steps_per_second": 2.58,
      "step": 8000
    }
  ],
  "max_steps": 131540,
  "num_train_epochs": 10,
  "total_flos": 1.1928721221631345e+18,
  "trial_name": null,
  "trial_params": null
}
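The `best_model_checkpoint` above corresponds to the step with the highest `eval_accuracy` in `log_history`. A small helper (illustrative, not part of the repo) to recover it from a loaded trainer state:

```python
def best_eval(state, metric="eval_accuracy"):
    """Pick the evaluation entry with the highest value of `metric`."""
    evals = [e for e in state["log_history"] if metric in e]
    best = max(evals, key=lambda e: e[metric])
    return best["step"], best[metric]
```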
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e190f0e7b03ebe0b395f81f0f1d2b6b19070636ff90ffbba9d1c9effffdccc54
size 2735
vocab.json ADDED
The diff for this file is too large to render. See raw diff