Update README.md
README.md
CHANGED
@@ -5,12 +5,17 @@ tags:
 datasets:
 - allenai/scirepeval
 ---
+## SPECTER2
 
-
+<!-- Provide a quick summary of what the model is/does. -->
 
-
+SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
+Given the combination of title and abstract of a scientific paper or a short textual query, the model can be used to generate effective embeddings for downstream applications.
+
+**Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**
+
+**To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**
 
-This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.
 
 **Dec 2023 Update:**
 
@@ -28,19 +33,14 @@ Model usage updated to be compatible with latest versions of transformers and adapters
 However, for benchmarking purposes, please continue using the current version.**
 
 
-
-
-<!-- Provide a quick summary of what the model is/does. -->
+# Adapter `allenai/specter2_adhoc_query` for allenai/specter2_base
 
-
-This is the base model to be used along with the adapters.
-Given the combination of title and abstract of a scientific paper or a short textual query, the model can be used to generate effective embeddings for downstream applications.
+An [adapter](https://adapterhub.ml) for the [`allenai/specter2_base`](https://huggingface.co/allenai/specter2_base) model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.
 
-
+This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.
 
-**To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**
 
-## Usage
+## Adapter Usage
 
 First, install `adapters`:
 
@@ -79,7 +79,6 @@ Task Formats trained on:
 It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.
 
 
-
 - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
 - **Shared by:** Allen AI
 - **Model type:** bert-base-uncased + adapters
@@ -112,6 +111,16 @@ It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations]
 ```python
 from transformers import AutoTokenizer
 from adapters import AutoAdapterModel
+from sklearn.metrics.pairwise import euclidean_distances
+
+def embed_input(text_batch: list[str]):
+    # preprocess the input
+    inputs = tokenizer(text_batch, padding=True, truncation=True,
+                       return_tensors="pt", return_token_type_ids=False, max_length=512)
+    output = model(**inputs)
+    # take the first token in the batch as the embedding
+    embeddings = output.last_hidden_state[:, 0, :]
+    return embeddings
 
 # load model and tokenizer
 tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
@@ -119,20 +128,22 @@ tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
 #load base model
 model = AutoAdapterModel.from_pretrained('allenai/specter2_base')
 
-#load the adapter
+#load the query adapter, provide an identifier for the adapter in the load_as argument and activate it
 model.load_adapter("allenai/specter2_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
+query = ["Bidirectional transformers"]
+query_embedding = embed_input(query)
 
+#load the proximity adapter, provide an identifier for the adapter in the load_as argument and activate it
+model.load_adapter("allenai/specter2_proximity", source="hf", load_as="specter2", set_active=True)
 papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
           {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
-
 # concatenate title and abstract
-text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
-# preprocess the input
-inputs = self.tokenizer(text_batch, padding=True, truncation=True,
-                        return_tensors="pt", return_token_type_ids=False, max_length=512)
-output = model(**inputs)
-# take the first token in the batch as the embedding
-embeddings = output.last_hidden_state[:, 0, :]
+text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
+paper_embeddings = embed_input(text_papers_batch)
+
+#Calculate L2 distance between query and papers
+l2_distance = euclidean_distances(paper_embeddings.detach().numpy(), query_embedding.detach().numpy()).flatten()
+
 ```
 
 ## Downstream Use
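Because the final usage example is split across the hunks above, here is the same flow assembled into one consolidated sketch. It is not a verbatim copy of the model card: it assumes `adapters`, `torch`, and `scikit-learn` are installed from PyPI, and it adds a `torch.no_grad()` context plus a NumPy conversion so the embeddings can be passed to `euclidean_distances`; the final `print` is illustrative only.

```python
from transformers import AutoTokenizer
from adapters import AutoAdapterModel
from sklearn.metrics.pairwise import euclidean_distances
import torch

# load the base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
model = AutoAdapterModel.from_pretrained('allenai/specter2_base')

def embed_input(text_batch: list[str]) -> torch.Tensor:
    # preprocess the input batch
    inputs = tokenizer(text_batch, padding=True, truncation=True,
                       return_tensors="pt", return_token_type_ids=False, max_length=512)
    # inference only, so no gradients are needed
    with torch.no_grad():
        output = model(**inputs)
    # take the first token ([CLS]) of each sequence as its embedding
    return output.last_hidden_state[:, 0, :]

# embed the short query with the ad-hoc query adapter
model.load_adapter("allenai/specter2_adhoc_query", source="hf",
                   load_as="specter2_adhoc_query", set_active=True)
query_embedding = embed_input(["Bidirectional transformers"])

# switch to the proximity adapter before embedding the candidate papers
model.load_adapter("allenai/specter2_proximity", source="hf",
                   load_as="specter2", set_active=True)
papers = [{'title': 'BERT',
           'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need',
           'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
# concatenate each title and abstract, separated by the tokenizer's [SEP] token
text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
paper_embeddings = embed_input(text_papers_batch)

# L2 distance between the query and each paper: smaller distance = closer match
l2_distance = euclidean_distances(paper_embeddings.numpy(), query_embedding.numpy()).flatten()
print(papers[int(l2_distance.argmin())]['title'])
```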
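One behavioral detail worth noting in the sketch: `load_adapter(..., set_active=True)` swaps the active adapter on the shared base model, so the order of operations matters, and queries must be embedded before the proximity adapter is activated. If an already-loaded adapter is needed again, it can be re-activated by the identifier passed to `load_as` (for example `model.set_active_adapters("specter2_adhoc_query")`, assuming the current `adapters` API) without downloading it a second time.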