aps6992 committed
Commit 0891808 · 1 Parent(s): 137bae0

Update README.md

Files changed (1)
  1. README.md +161 -6
README.md CHANGED
@@ -30,14 +30,169 @@ model = AutoAdapterModel.from_pretrained("allenai/specter2_aug2023refresh_base")
  adapter_name = model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", set_active=True)
  ```
- ## Architecture & Training
- <!-- Add some description here -->
- ## Evaluation results
- <!-- Add some description here -->
- ## Citation
- <!-- Add some description here -->
+ ---
+ license: apache-2.0
+ datasets:
+ - allenai/scirepeval
+ ---
+ **Update**
+
+ This update introduces a new set of SPECTER 2.0 models with the base transformer encoder pre-trained on an extended citation dataset containing more recent papers.
+ For benchmarking purposes, please use the existing SPECTER 2.0 models without the **aug2023refresh** suffix, viz. [allenai/specter2_base](https://huggingface.co/allenai/specter2_base).
+
+ # SPECTER 2.0 (Base)
+ SPECTER 2.0 is the successor to [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
+ This is the base model, to be used along with the adapters.
+ Given the title and abstract of a scientific paper, or a short textual query, the model generates effective embeddings for use in downstream applications.
+
+ **Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**
+
+ **To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**
+
+ # Model Details
+
+ ## Model Description
+
+ SPECTER 2.0 has been trained on over 6M triplets of scientific paper citations, which are available [here](https://huggingface.co/datasets/allenai/scirepeval/viewer/cite_prediction_new/evaluation).
+ It is then trained further, with task-format-specific adapter modules attached, on all of the [SciRepEval](https://huggingface.co/datasets/allenai/scirepeval) training tasks.
+
+ Task formats trained on:
+ - Classification
+ - Regression
+ - Proximity
+ - Adhoc Search
+
+ It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137), and we evaluate the trained model on this benchmark as well.
+
+ - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
+ - **Shared by:** Allen AI
+ - **Model type:** bert-base-uncased + adapters
+ - **License:** Apache 2.0
+ - **Finetuned from model:** [allenai/scibert](https://huggingface.co/allenai/scibert_scivocab_uncased)
+
+ ## Model Sources
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [https://github.com/allenai/SPECTER2_0](https://github.com/allenai/SPECTER2_0)
+ - **Paper:** [https://api.semanticscholar.org/CorpusID:254018137](https://api.semanticscholar.org/CorpusID:254018137)
+ - **Demo:** [Usage](https://github.com/allenai/SPECTER2_0/blob/main/README.md)
+
+ # Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ## Direct Use
+
+ |Model|Name and HF link|Description|
+ |--|--|--|
+ |Retrieval*|[allenai/specter2_aug2023refresh_proximity](https://huggingface.co/allenai/specter2_aug2023refresh)|Encode papers as queries and candidates, e.g. Link Prediction, Nearest Neighbor Search|
+ |Adhoc Query|[allenai/specter2_aug2023refresh_adhoc_query](https://huggingface.co/allenai/specter2_aug2023refresh_adhoc_query)|Encode short raw text queries for search tasks. (Candidate papers can be encoded with proximity)|
+ |Classification|[allenai/specter2_aug2023refresh_classification](https://huggingface.co/allenai/specter2_aug2023refresh_classification)|Encode papers to feed into linear classifiers as features|
+ |Regression|[allenai/specter2_aug2023refresh_regression](https://huggingface.co/allenai/specter2_aug2023refresh_regression)|Encode papers to feed into linear regressors as features|
+
+ *The Retrieval model should suffice for downstream task types not mentioned above.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ # load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_aug2023refresh_base')
+
+ # load the base model
+ model = AutoModel.from_pretrained('allenai/specter2_aug2023refresh_base')
+
+ # load the adapter(s) for the required task, provide an identifier for the adapter in the load_as argument and activate it
+ model.load_adapter("allenai/specter2_aug2023refresh_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
+
+ papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
+           {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
+
+ # concatenate title and abstract
+ text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
+ # preprocess the input
+ inputs = tokenizer(text_batch, padding=True, truncation=True,
+                    return_tensors="pt", return_token_type_ids=False, max_length=512)
+ output = model(**inputs)
+ # take the first token ([CLS]) of each sequence as the embedding
+ embeddings = output.last_hidden_state[:, 0, :]
+ ```
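+
+ As a minimal sketch of how these embeddings can be used for ad-hoc search, the candidate papers can be ranked by cosine similarity to a query embedding (cosine similarity is one common choice; the query string below is just an example, and for best results candidate papers would be encoded with the proximity adapter as noted in the table above):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # embed a short free-text query with the active adhoc_query adapter
+ # (reuses `model`, `tokenizer`, `papers` and `embeddings` from the snippet above)
+ query = "transformer models for scientific text"  # example query
+ query_inputs = tokenizer([query], padding=True, truncation=True,
+                          return_tensors="pt", return_token_type_ids=False, max_length=512)
+ with torch.no_grad():
+     query_embedding = model(**query_inputs).last_hidden_state[:, 0, :]
+     # rank candidate papers by cosine similarity to the query embedding
+     scores = F.cosine_similarity(query_embedding, embeddings)
+
+ ranked = sorted(zip([p['title'] for p in papers], scores.tolist()), key=lambda x: x[1], reverse=True)
+ print(ranked)
+ ```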
+
+ ## Downstream Use
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ For evaluation and downstream usage, please refer to [https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md](https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md).
+
+ # Training Details
+
+ ## Training Data
+
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ The base model is trained on citation links between papers, and the adapters are trained on 8 large-scale tasks across the four formats.
+ All the data is part of the SciRepEval benchmark and is available [here](https://huggingface.co/datasets/allenai/scirepeval).
+
+ The citation links are triplets of the form
+
+ ```json
+ {"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}}
+ ```
+
+ consisting of a query paper, a positive citation, and a negative, which can be a paper from the same or a different field of study as the query, or a citation of a citation.
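+
+ As a minimal sketch, these triplets can be loaded with the `datasets` library; the `cite_prediction_new` config name is taken from the dataset link above, while the split name and streaming behaviour are assumptions (check the dataset card for the available splits):
+
+ ```python
+ from datasets import load_dataset
+
+ # stream the citation-prediction triplets instead of downloading all ~6M examples
+ triplets = load_dataset("allenai/scirepeval", "cite_prediction_new", split="train", streaming=True)
+
+ example = next(iter(triplets))
+ print(example["query"]["title"], "|", example["pos"]["title"], "|", example["neg"]["title"])
+ ```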
+
+ ## Training Procedure
+
+ Please refer to the [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677).
+
+ ### Training Hyperparameters
+
+ The model is trained in two stages using [SciRepEval](https://github.com/allenai/scirepeval/blob/main/training/TRAINING.md):
+ - Base Model: First, a base model is trained on the above citation triplets.
+ `batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2, warmup steps = 10%, fp16`
+ - Adapters: Thereafter, task-format-specific adapters are trained on the SciRepEval training tasks, where 600K triplets sampled from the above are added to the training data as well (an illustrative sketch of this stage is shown below the list).
+ `batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6, warmup = 1000 steps, fp16`
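+
+ The adapters themselves were trained with the SciRepEval codebase linked above. As a rough illustration only (assuming the adapter-transformers package that the usage examples rely on, and a hypothetical adapter name), the adapter stage corresponds to freezing the base encoder and training only a newly added adapter:
+
+ ```python
+ import torch
+ from transformers import AutoAdapterModel  # requires the adapter-transformers package
+
+ model = AutoAdapterModel.from_pretrained("allenai/specter2_aug2023refresh_base")
+ model.add_adapter("my_task_adapter", config="pfeiffer")  # hypothetical adapter name
+ model.train_adapter("my_task_adapter")  # freezes the base encoder; only adapter weights are updated
+
+ # learning rate from the adapter stage listed above
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+ ```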
+
+ # Evaluation
+
+ We evaluate the model on [SciRepEval](https://github.com/allenai/scirepeval), a large-scale evaluation benchmark for scientific embedding tasks, which has SciDocs as a subset.
+ We also evaluate and establish a new SoTA on [MDCR](https://github.com/zoranmedic/mdcr), a large-scale citation recommendation benchmark.
+
+ |Model|SciRepEval In-Train|SciRepEval Out-of-Train|SciRepEval Avg|MDCR (MAP, Recall@5)|
+ |--|--|--|--|--|
+ |[BM-25](https://api.semanticscholar.org/CorpusID:252199740)|n/a|n/a|n/a|(33.7, 28.5)|
+ |[SPECTER](https://huggingface.co/allenai/specter)|54.7|57.4|68.0|(30.6, 25.5)|
+ |[SciNCL](https://huggingface.co/malteos/scincl)|55.6|57.8|69.0|(32.6, 27.3)|
+ |[SciRepEval-Adapters](https://huggingface.co/models?search=scirepeval)|61.9|59.0|70.9|(35.3, 29.6)|
+ |[SPECTER 2.0-Adapters](https://huggingface.co/models?search=allenai/specter-2)|**62.3**|**59.2**|**71.2**|**(38.4, 33.0)**|
+
+ Please cite the following works if you end up using SPECTER 2.0:
+
+ [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677):
+
+ ```bibtex
+ @inproceedings{specter2020cohan,
+   title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
+   author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
+   booktitle={ACL},
+   year={2020}
+ }
+ ```
+
+ [SciRepEval paper](https://api.semanticscholar.org/CorpusID:254018137):
+
+ ```bibtex
+ @article{Singh2022SciRepEvalAM,
+   title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations},
+   author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman},
+   journal={ArXiv},
+   year={2022},
+   volume={abs/2211.13308}
+ }
+ ```