---
license: apache-2.0
datasets:
- allenai/scirepeval
language:
- en
---
# SPECTER 2.0

<!-- Provide a quick summary of what the model is/does. -->

SPECTER 2.0 is the successor to [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/spp).
Given the combined title and abstract of a scientific paper, or a short textual query, the model can be used to generate effective embeddings for downstream applications.

# Model Details

## Model Description

SPECTER 2.0 has been trained on over 6M triplets of scientific paper citations, which are available [here](https://huggingface.co/datasets/allenai/scirepeval/viewer/cite_prediction_new/evaluation).
It is then trained on all the [SciRepEval](https://huggingface.co/datasets/allenai/scirepeval) training tasks, with task-format-specific adapters.

Task Formats trained on:
- Classification
- Regression
- Proximity
- Adhoc Search

It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.

- **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
- **Shared by:** Allen AI
- **Model type:** bert-base-uncased + adapters
- **License:** Apache 2.0
- **Finetuned from model:** [allenai/scibert](https://huggingface.co/allenai/scibert_scivocab_uncased)

## Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [https://github.com/allenai/SPECTER2_0](https://github.com/allenai/SPECTER2_0)
- **Paper:** [https://api.semanticscholar.org/CorpusID:254018137](https://api.semanticscholar.org/CorpusID:254018137)
- **Demo:** [Usage](https://github.com/allenai/SPECTER2_0/blob/main/README.md)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Direct Use

|Model|Type|Name and HF link|
|--|--|--|
|Base|Transformer|[allenai/specter_plus_plus](https://huggingface.co/allenai/specter_plus_plus)|
|Classification|Adapter|[allenai/spp_classification](https://huggingface.co/allenai/spp_classification)|
|Regression|Adapter|[allenai/spp_regression](https://huggingface.co/allenai/spp_regression)|
|Retrieval|Adapter|[allenai/spp_proximity](https://huggingface.co/allenai/spp_proximity)|
|Adhoc Query|Adapter|[allenai/spp_adhoc_query](https://huggingface.co/allenai/spp_adhoc_query)|

```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter_plus_plus')

# load base model
model = AutoModel.from_pretrained('allenai/specter_plus_plus')

# load the adapter(s) for the required task, provide an identifier for the adapter in the load_as argument and activate it
model.load_adapter("allenai/spp_adhoc_query", source="hf", load_as="adhoc_query", set_active=True)

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the tokenizer's separator token
text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(text_batch, padding=True, truncation=True,
                   return_tensors="pt", return_token_type_ids=False, max_length=512)
output = model(**inputs)
# take the first token ([CLS]) of each sequence in the batch as the embedding
embeddings = output.last_hidden_state[:, 0, :]
```
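
The other adapters are loaded in the same way. Below is a minimal sketch of document-to-document retrieval with the proximity adapter, assuming the same `load_adapter` API as above; the `embed` helper, the `proximity` identifier, and the example papers are illustrative, not part of the official examples.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/specter_plus_plus')
model = AutoModel.from_pretrained('allenai/specter_plus_plus')
# load the proximity adapter for document-to-document retrieval (the identifier "proximity" is illustrative)
model.load_adapter("allenai/spp_proximity", source="hf", load_as="proximity", set_active=True)

def embed(papers):
    # hypothetical helper: encode "title [SEP] abstract" and take the [CLS] embedding
    text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
    inputs = tokenizer(text_batch, padding=True, truncation=True,
                       return_tensors="pt", return_token_type_ids=False, max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[:, 0, :]

query = [{'title': 'Citation-informed transformers for paper embeddings', 'abstract': 'Illustrative query abstract'}]
candidates = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
              {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# rank candidates by cosine similarity to the query embedding
scores = torch.nn.functional.cosine_similarity(embed(query), embed(candidates))
print(scores.argsort(descending=True))
```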

## Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

For evaluation and downstream usage, please refer to [https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md](https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md).

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The base model is trained on citation links between papers, and the adapters are trained on 8 large-scale tasks across the four formats.
All the data is a part of the SciRepEval benchmark and is available [here](https://huggingface.co/datasets/allenai/scirepeval).

The citation links are triplets of the form

```json
{"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}}
```

consisting of a query paper, a positive citation, and a negative, which can be a paper from the same or a different field of study as the query, or a citation of a citation.
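
As a quick way to inspect these triplets, they can be loaded from the dataset linked above; a minimal sketch, assuming the `cite_prediction_new` config and the split name shown in the dataset viewer link:

```python
from datasets import load_dataset

# stream the citation triplets; config and split names are assumed from the dataset viewer link above
triplets = load_dataset("allenai/scirepeval", "cite_prediction_new", split="evaluation", streaming=True)

# each record follows the {"query", "pos", "neg"} structure shown above
example = next(iter(triplets))
print(example["query"]["title"], "->", example["pos"]["title"])
```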

## Training Procedure

Please refer to the [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677).

### Training Hyperparameters

The model is trained in two stages using [SciRepEval](https://github.com/allenai/scirepeval/blob/main/training/TRAINING.md):
- Base Model: First a base model is trained on the above citation triplets.
  `batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2, warmup steps = 10%, fp16`
- Adapters: Thereafter, task-format-specific adapters are trained on the SciRepEval training tasks, where 600K triplets sampled from the above citation data are added to the training data as well.
  `batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6, warmup = 1000 steps, fp16`

# Evaluation

We evaluate the model on [SciRepEval](https://github.com/allenai/scirepeval), a large-scale evaluation benchmark for scientific embedding tasks, which has SciDocs as a subset.
We also evaluate and establish a new SoTA on [MDCR](https://github.com/zoranmedic/mdcr), a large-scale citation recommendation benchmark.

|Model|SciRepEval In-Train|SciRepEval Out-of-Train|SciRepEval Avg|MDCR (MAP, Recall@5)|
|--|--|--|--|--|
|[BM-25](https://api.semanticscholar.org/CorpusID:252199740)|n/a|n/a|n/a|(33.7, 28.5)|
|[SPECTER](https://huggingface.co/allenai/specter)|54.7|57.4|68.0|(30.6, 25.5)|
|[SciNCL](https://huggingface.co/malteos/scincl)|55.6|57.8|69.0|(32.6, 27.3)|
|[SciRepEval-Adapters](https://huggingface.co/models?search=scirepeval)|61.9|59.0|70.9|(35.3, 29.6)|
|[SPECTER 2.0-base](https://huggingface.co/allenai/specter_plus_plus)|56.3|58.0|69.2|(38.0, 32.4)|
|[SPECTER 2.0-Adapters](https://huggingface.co/models?search=allenai/spp)|**62.3**|**59.2**|**71.2**|**(38.4, 33.0)**|

Please cite the following works if you end up using SPECTER 2.0:

[SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677):

```bibtex
@inproceedings{specter2020cohan,
  title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
  author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
  booktitle={ACL},
  year={2020}
}
```

[SciRepEval paper](https://api.semanticscholar.org/CorpusID:254018137):

```bibtex
@article{Singh2022SciRepEvalAM,
  title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations},
  author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.13308}
}
```