mapama247 committed on
Commit c9a8b71
1 Parent(s): a19b1de

init commit
README.md CHANGED
---
pipeline_tag: zero-shot-classification
license: apache-2.0
language:
- en
tags:
- zero-shot
- text-classification
- science
- mag
---

# SCIroShot

## Overview

<details>
<summary>Click to expand</summary>

- **Model type:** Language Model
- **Architecture:** RoBERTa-large
- **Language:** English
- **License:** Apache 2.0
- **Task:** Zero-Shot Text Classification
- **Data:** Microsoft Academic Graph
- **Additional Resources:**
  - [Paper]() <-- WiP (soon to be published in EACL 2023)
  - [GitHub](https://github.com/TeMU-BSC/sciroshot)
</details>

## Model description

SCIroShot is an entailment-based zero-shot text classification model that
was fine-tuned on a purpose-built dataset of scientific articles from
[Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)
(MAG). The resulting model achieves state-of-the-art performance in the
scientific domain and very competitive results in other areas.

## Intended Usage

This model is intended to be used for zero-shot text classification in English.

## How to use

```python
from transformers import pipeline

zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
template = "This example is {}"

output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

print(output)
print(f'Predicted class: {output["labels"][0]}')
```
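Under the hood, the pipeline expands the template into one hypothesis per candidate label and scores each (premise, hypothesis) pair with the entailment head. A minimal sketch of that expansion step alone (no model call):

```python
sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
template = "This example is {}"

# One (premise, hypothesis) pair per candidate label; each pair is
# scored independently by the entailment head.
pairs = [(sentence, template.format(label)) for label in candidate_labels]

for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

With `multi_label=False` the per-label entailment scores are normalized against each other, so exactly one class wins; with `multi_label=True` each pair is scored on its own.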

## Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

Even though the fine-tuning data (which is of a scientific nature) may seem harmless, it is important to note that the corpus used to pre-train the vanilla model is very likely to contain a lot of unfiltered content from the internet, as stated in the [RoBERTa-large model card](https://huggingface.co/roberta-large#limitations-and-bias).

## Training

### Training data

Our data builds on top of scientific-domain annotated data from Microsoft Academic Graph (MAG).
This database consists of a heterogeneous graph with billions of records from both scientific
publications and patents, along with metadata such as authors, institutions, journals,
conferences and their citation relationships.
The documents are organized in a six-level hierarchical structure of scientific concepts,
where the two top-most levels are manually curated in order to guarantee a high level of accuracy.

To create the training corpus, a random sample of scientific articles published between
2000 and 2021 was retrieved from MAG, together with their titles and abstracts in English.
This resulted in over 2M documents, each labeled with its corresponding Field of Study from
the 1-level MAG taxonomy (292 possible classes, such as "Computational biology"
or "Transport Engineering").

The fine-tuning dataset was constructed in a weakly supervised manner by converting
text classification data to the entailment format. Using the relationship between
scientific texts and their matching concepts in the 1-level MAG taxonomy, we generate
the premise-hypothesis pairs for the entailment label. Conversely, we generate the pairs
for the neutral label by removing the actual relationship between the texts and their
scientific concepts and creating a virtual relationship with concepts to which they are
not matched.
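The conversion described above can be sketched as follows. The function name, template wording and the random negative-sampling scheme are illustrative assumptions, not the paper's exact procedure:

```python
import random

TEMPLATE = "This example is {}."

def to_entailment_pairs(docs, taxonomy, seed=0):
    """docs: iterable of (text, field_of_study); taxonomy: all candidate fields.

    The true text-concept relationship yields an ENTAILMENT pair; a randomly
    chosen non-matching concept yields a NEUTRAL pair.
    """
    rng = random.Random(seed)
    pairs = []
    for text, field in docs:
        # Actual relationship -> entailment example.
        pairs.append((text, TEMPLATE.format(field), "ENTAILMENT"))
        # Virtual relationship with a concept the text is NOT matched to -> neutral.
        negative = rng.choice([f for f in taxonomy if f != field])
        pairs.append((text, TEMPLATE.format(negative), "NEUTRAL"))
    return pairs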

### Training procedure

The newly-created scientific dataset described in the previous section was used to
fine-tune a 355M-parameter RoBERTa model on the entailment task. To do so, the model
computes the entailment score between every text that is fed to it and all candidate
labels. The final prediction is the highest-scoring class in a single-label
classification setup, or the N classes above a certain threshold in a multi-label scenario.
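The two prediction rules can be sketched as follows, assuming per-label entailment probabilities have already been computed (the function and threshold value are illustrative):

```python
def predict(entailment_scores, multi_label=False, threshold=0.5):
    """entailment_scores: dict mapping label -> entailment probability."""
    if multi_label:
        # Every label whose score clears the threshold.
        return sorted(l for l, s in entailment_scores.items() if s >= threshold)
    # Single-label: the highest-scoring class wins.
    return max(entailment_scores, key=entailment_scores.get)
```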

A subset of 52 labels from the training data was kept apart to serve as a development
set of fully unseen classes. As a novelty, validation was not performed on the entailment
task (which is used as a proxy) but directly on the target text classification task. This
allows us to stop training at the right time via early stopping, which prevents the model
from "overfitting" to the training task. This method was our way of counteracting an
effect discovered empirically during experimentation: after a certain point, the model
can start to worsen on the target task (ZSTC) while still improving on the training task
(RTE). The simple act of shortening the training time led to a boost in performance.
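The stopping rule amounts to standard patience-based early stopping, except that the monitored metric comes from the target ZSTC dev set rather than the RTE dev loss. A schematic sketch (the function and patience value are illustrative, not the paper's exact setup):

```python
def early_stop_on_target(zstc_scores, patience=2):
    """zstc_scores: target-task dev scores observed after each evaluation step.

    Returns the index of the checkpoint to keep: training stops once the
    target metric has not improved for `patience` consecutive evaluations,
    even if the RTE training metric is still improving.
    """
    best_idx, best = 0, float("-inf")
    for i, score in enumerate(zstc_scores):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            break  # ZSTC stopped improving: halt and keep the best checkpoint
    return best_idx
```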

Read the paper for more details on the methodology and the analysis of the RTE/ZSTC correlation.

## Evaluation

### Evaluation data

The model's performance was evaluated on a collection of disciplinary-labeled textual datasets, both from the scientific domain (closer to the training data) and the general domain (to assess generalizability).

The following table provides an overview of the number of examples and labels for each dataset:

| Dataset | Labels | Size |
|------------------|--------|--------|
| arXiv | 11 | 3,838 |
| SciDocs-MeSH | 11 | 16,433 |
| SciDocs-MAG | 19 | 17,501 |
| Konstanz | 24 | 10,000 |
| Elsevier | 26 | 14,738 |
| PubMed | 109 | 5,000 |
| Topic Categorization (Yahoo! Answers) | 10 | 60,000 |
| Emotion Detection (UnifyEmotion) | 10 | 15,689 |
| Situation Frame Detection (Situation Typing) | 12 | 3,311 |

Please refer to the paper for further details on each particular dataset.

### Evaluation results

These are the official results reported in the paper:

#### Scientific domain benchmark

| Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed |
|-------|-------|--------------|-------------|----------|----------|--------|
| [fb/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) | 33.28 | **66.18**🔥 | 51.77 | 54.62 | 28.41 | **31.59**🔥 |
| SCIroShot | **42.22**🔥 | 59.34 | **69.86**🔥 | **66.07**🔥 | **54.42**🔥 | 27.93 |

#### General domain benchmark

| Model | Topic | Emotion | Situation |
|-------|-------|---------|-----------|
| RTE [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 43.8 | 12.6 | **37.2**🔥 |
| FEVER [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 40.1 | 24.7 | 21.0 |
| MNLI [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf) | 37.9 | 22.3 | 15.4 |
| NSP [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 50.6 | 16.5 | 25.8 |
| NSP-Reverse [(Ma et al., 2021)](https://aclanthology.org/2021.acl-short.99.pdf) | 53.1 | 16.1 | 19.9 |
| SCIroShot | **59.08**🔥 | **24.94**🔥 | 27.42 |

All the numbers reported above represent **label-wise weighted F1**, except for the Topic classification dataset, which is evaluated in terms of **accuracy** following the notation of [(Yin et al., 2019)](https://arxiv.org/pdf/1909.00161.pdf).
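Label-wise weighted F1 averages the per-class F1 scores weighted by class support. A dependency-free sketch of the metric (equivalent to scikit-learn's `f1_score` with `average="weighted"`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += count * f1
    return total / len(y_true)
```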

## Additional information

### Authors

- SIRIS Lab, Research Division of SIRIS Academic.
- Language Technologies Unit, Barcelona Supercomputing Center.

### Contact

For further information, send an email to either <langtech@bsc.es> or <info@sirisacademic.com>.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work was partially funded by two projects under the EU's H2020 Research and Innovation Programme:

- INODE (grant agreement No 863410).
- IntelComp (grant agreement No 101004870).

### Citation

```bibtex
Soon to be published in EACL 2023.
```

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose
and is made available to third parties under an Apache v2.0 License.

Please keep in mind that the model may have biases and/or other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this
model (or a system based on it), or become users of the model itself, they should note
that it is their responsibility to mitigate the risks arising from its use and, in any
event, to comply with applicable regulations, including regulations regarding the use
of Artificial Intelligence.

In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
</details>
config.json ADDED
{
  "_name_or_path": "./models/roberta-large",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mag",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "ENTAILMENT",
    "1": "NEUTRAL"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "ENTAILMENT": 0,
    "NEUTRAL": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
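Note that `id2label` defines only two classes (ENTAILMENT at index 0, NEUTRAL at index 1) rather than the usual three-way NLI head. Turning the raw logits of a premise-hypothesis pair into an entailment probability is then a two-way softmax; a minimal sketch (the function name is illustrative):

```python
import math

def entailment_probability(logits):
    """logits: [entailment_logit, neutral_logit] from the 2-class head."""
    exps = [math.exp(x) for x in logits]
    return exps[0] / sum(exps)  # index 0 = ENTAILMENT, per id2label
```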
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:8ba738451ec038c55bb862197a5abae763a6a340694c49b20dbdad523c5f577e
size 1421611309
special_tokens_map.json ADDED
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "add_prefix_space": false, "errors": "replace", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "special_tokens_map_file": null, "name_or_path": "/gpfs/projects/bsc88/huggingface/models/roberta-large", "tokenizer_class": "RobertaTokenizer"}
trainer_state.json ADDED
{
  "best_metric": 0.9606727878443162,
  "best_model_checkpoint": "./output//roberta_0_100_0.000008_8_0.01_0.000008_01-26-22_23-25/checkpoint-8000",
  "epoch": 0.6081800212863008,
  "global_step": 8000,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 0.15,
      "learning_rate": 2.0271126314455844e-06,
      "loss": 0.3799,
      "step": 2000
    },
    {
      "epoch": 0.15,
      "eval_accuracy": 0.9360740357434579,
      "eval_combined_score": 116916.96803701787,
      "eval_loss": 0.16771526634693146,
      "eval_number_of_examples": 233833,
      "eval_runtime": 565.9226,
      "eval_samples_per_second": 413.189,
      "eval_steps_per_second": 2.583,
      "step": 2000
    },
    {
      "epoch": 0.3,
      "learning_rate": 4.054225262891169e-06,
      "loss": 0.162,
      "step": 4000
    },
    {
      "epoch": 0.3,
      "eval_accuracy": 0.9526029260198517,
      "eval_combined_score": 116916.97630146302,
      "eval_loss": 0.12519623339176178,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.1477,
      "eval_samples_per_second": 413.025,
      "eval_steps_per_second": 2.582,
      "step": 4000
    },
    {
      "epoch": 0.46,
      "learning_rate": 6.081337894336754e-06,
      "loss": 0.127,
      "step": 6000
    },
    {
      "epoch": 0.46,
      "eval_accuracy": 0.958962165306009,
      "eval_combined_score": 116916.97948108266,
      "eval_loss": 0.11369086056947708,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.6337,
      "eval_samples_per_second": 412.67,
      "eval_steps_per_second": 2.58,
      "step": 6000
    },
    {
      "epoch": 0.61,
      "learning_rate": 7.993077066164161e-06,
      "loss": 0.1168,
      "step": 8000
    },
    {
      "epoch": 0.61,
      "eval_accuracy": 0.9606727878443162,
      "eval_combined_score": 116916.98033639393,
      "eval_loss": 0.10579771548509598,
      "eval_number_of_examples": 233833,
      "eval_runtime": 566.6673,
      "eval_samples_per_second": 412.646,
      "eval_steps_per_second": 2.58,
      "step": 8000
    }
  ],
  "max_steps": 131540,
  "num_train_epochs": 10,
  "total_flos": 1.1928721221631345e+18,
  "trial_name": null,
  "trial_params": null
}
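The `best_model_checkpoint` above corresponds to the step with the highest `eval_accuracy` in `log_history`. A small helper (illustrative, not part of the repo) to recover it from a loaded trainer state:

```python
def best_eval(state, metric="eval_accuracy"):
    """Pick the evaluation entry with the highest value of `metric`."""
    evals = [e for e in state["log_history"] if metric in e]
    best = max(evals, key=lambda e: e[metric])
    return best["step"], best[metric]
```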
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e190f0e7b03ebe0b395f81f0f1d2b6b19070636ff90ffbba9d1c9effffdccc54
size 2735
vocab.json ADDED
The diff for this file is too large to render. See raw diff