---
license: apache-2.0
datasets:
- allenai/scirepeval
language:
- en
---
# SPECTER 2.0

SPECTER 2.0 is the successor to [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/spp).
Given the title and abstract of a scientific paper, or a short textual query, the model generates effective embeddings for use in downstream applications.

# Model Details

## Model Description

SPECTER 2.0 has been trained on over 6M triplets of scientific paper citations, which are available [here](https://huggingface.co/datasets/allenai/scirepeval/viewer/cite_prediction_new/evaluation).
It is then trained on all the [SciRepEval](https://huggingface.co/datasets/allenai/scirepeval) training tasks, with task-format-specific adapters.

Task formats trained on:
- Classification
- Regression
- Proximity
- Adhoc Search

It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137), and we evaluate the trained model on this benchmark as well.

- **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
- **Shared by:** Allen AI
- **Model type:** bert-base-uncased + adapters
- **License:** Apache 2.0
- **Finetuned from model:** [allenai/scibert](https://huggingface.co/allenai/scibert_scivocab_uncased)

## Model Sources

- **Repository:** [https://github.com/allenai/SPECTER2_0](https://github.com/allenai/SPECTER2_0)
- **Paper:** [https://api.semanticscholar.org/CorpusID:254018137](https://api.semanticscholar.org/CorpusID:254018137)
- **Demo:** [Usage](https://github.com/allenai/SPECTER2_0/blob/main/README.md)

# Uses

## Direct Use

|Model|Type|Name and HF link|
|--|--|--|
|Base|Transformer|[allenai/specter_plus_plus](https://huggingface.co/allenai/specter_plus_plus)|
|Classification|Adapter|[allenai/spp_classification](https://huggingface.co/allenai/spp_classification)|
|Regression|Adapter|[allenai/spp_regression](https://huggingface.co/allenai/spp_regression)|
|Retrieval|Adapter|[allenai/spp_proximity](https://huggingface.co/allenai/spp_proximity)|
|Adhoc Query|Adapter|[allenai/spp_adhoc_query](https://huggingface.co/allenai/spp_adhoc_query)|

```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer (note: loading adapters requires the
# adapter-transformers library rather than vanilla transformers)
tokenizer = AutoTokenizer.from_pretrained('allenai/specter_plus_plus')

# load base model
model = AutoModel.from_pretrained('allenai/specter_plus_plus')

# load the adapter(s) as per the required task, provide an identifier for the adapter in the load_as argument and activate it
model.load_adapter("allenai/spp_adhoc_query", source="hf", load_as="adhoc_query", set_active=True)

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract
text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(text_batch, padding=True, truncation=True,
                   return_tensors="pt", return_token_type_ids=False, max_length=512)
output = model(**inputs)
# take the [CLS] (first) token of each sequence as its embedding
embeddings = output.last_hidden_state[:, 0, :]
```
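
The resulting embeddings can then be scored against one another, for example to rank the candidate papers above against a short ad-hoc query. Below is a minimal sketch continuing from the snippet above; the query string and cosine-similarity scoring are illustrative choices, and in the full SciRepEval setup queries and candidate papers may be encoded with different adapters.

```python
import torch.nn.functional as F

# embed a short textual query the same way the papers were embedded
query_inputs = tokenizer(["transformer models for language understanding"],
                         padding=True, truncation=True, return_tensors="pt",
                         return_token_type_ids=False, max_length=512)
query_embedding = model(**query_inputs).last_hidden_state[:, 0, :]

# rank candidate papers by cosine similarity to the query
scores = F.cosine_similarity(query_embedding, embeddings)
ranking = scores.argsort(descending=True)
```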

## Downstream Use

For evaluation and downstream usage, please refer to [https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md](https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md).

# Training Details

## Training Data

The base model is trained on citation links between papers, and the adapters are trained on 8 large-scale tasks across the four formats.
All the data is part of the SciRepEval benchmark and is available [here](https://huggingface.co/datasets/allenai/scirepeval).

The citation links are triplets of the form

```json
{"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}}
```

consisting of a query paper, a positive citation, and a negative paper, which can be from the same or a different field of study as the query, or a citation of a citation.

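For intuition, the triplet objective over these examples can be sketched as below. This is a minimal illustration of a triplet margin loss over embeddings, not the exact training code (see the SciRepEval repository for the real implementation), and the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_loss(query_emb, pos_emb, neg_emb, margin=1.0):
    # push the query closer (in L2 distance) to the positive
    # (cited) paper than to the negative one, by at least `margin`
    pos_dist = F.pairwise_distance(query_emb, pos_emb, p=2)
    neg_dist = F.pairwise_distance(query_emb, neg_emb, p=2)
    return torch.clamp(pos_dist - neg_dist + margin, min=0).mean()
```
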
## Training Procedure

Please refer to the [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677).

### Training Hyperparameters

The model is trained in two stages using [SciRepEval](https://github.com/allenai/scirepeval/blob/main/training/TRAINING.md):
- Base Model: First, a base model is trained on the above citation triplets.
``` batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2, warmup steps = 10%, fp16```
- Adapters: Thereafter, task-format-specific adapters are trained on the SciRepEval training tasks, with 600K triplets sampled from the above added to the training data as well.
``` batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6, warmup = 1000 steps, fp16```

# Evaluation

We evaluate the model on [SciRepEval](https://github.com/allenai/scirepeval), a large-scale evaluation benchmark for scientific embedding tasks, which has [SciDocs](https://github.com/allenai/scidocs) as a subset.
We also evaluate and establish a new SoTA on [MDCR](https://github.com/zoranmedic/mdcr), a large-scale citation recommendation benchmark.

|Model|SciRepEval In-Train|SciRepEval Out-of-Train|SciRepEval Avg|MDCR (MAP, Recall@5)|
|--|--|--|--|--|
|[BM-25](https://api.semanticscholar.org/CorpusID:252199740)|n/a|n/a|n/a|(33.7, 28.5)|
|[SPECTER](https://huggingface.co/allenai/specter)|54.7|57.4|68.0|(30.6, 25.5)|
|[SciNCL](https://huggingface.co/malteos/scincl)|55.6|57.8|69.0|(32.6, 27.3)|
|[SciRepEval-Adapters](https://huggingface.co/models?search=scirepeval)|61.9|59.0|70.9|(35.3, 29.6)|
|[SPECTER 2.0-base](https://huggingface.co/allenai/specter_plus_plus)|56.3|58.0|69.2|(38.0, 32.4)|
|[SPECTER 2.0-Adapters](https://huggingface.co/models?search=allenai/spp)|**62.3**|**59.2**|**71.2**|**(38.4, 33.0)**|

138
+ Please cite the following works if you end up using SPECTER 2.0:
139
+
140
+ [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677):
141
+
142
+ ```bibtex
143
+ @inproceedings{specter2020cohan,
144
+ title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
145
+ author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
146
+ booktitle={ACL},
147
+ year={2020}
148
+ }
149
+ ```
150
+ [SciRepEval paper](https://api.semanticscholar.org/CorpusID:254018137)
151
+ ```bibtex
152
+ @article{Singh2022SciRepEvalAM,
153
+ title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations},
154
+ author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman},
155
+ journal={ArXiv},
156
+ year={2022},
157
+ volume={abs/2211.13308}
158
+ }
159
+ ```
160
+
161
+