aps6992 committed on
Commit c9bef72
1 Parent(s): 3bf83fa

Update README.md

Files changed (1)
  1. README.md +33 -22
README.md CHANGED
@@ -5,12 +5,17 @@ tags:
 datasets:
 - allenai/scirepeval
 ---

- # Adapter `allenai/specter2_adhoc_query` for allenai/specter2_base

- An [adapter](https://adapterhub.ml) for the [`allenai/specter2_base`](https://huggingface.co/allenai/specter2_base) model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.

- This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.

 **Dec 2023 Update:**
@@ -28,19 +33,14 @@ Model usage updated to be compatible with latest versions of transformers and ad
 However, for benchmarking purposes, please continue using the current version.**


- ## SPECTER2
-
- <!-- Provide a quick summary of what the model is/does. -->

- SPECTER2 is the successor to [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
- This is the base model to be used along with the adapters.
- Given the combination of title and abstract of a scientific paper or a short texual query, the model can be used to generate effective embeddings to be used in downstream applications.

- **Note:For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**

- **To get the best performance on a downstream task type please load the associated adapter with the base model as in the example below.**

- ## Usage

 First, install `adapters`:

@@ -79,7 +79,6 @@ Task Formats trained on:
 It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.


-
 - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
 - **Shared by:** Allen AI
 - **Model type:** bert-base-uncased + adapters
@@ -112,6 +111,16 @@ It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientif
 ```python
 from transformers import AutoTokenizer
 from adapters import AutoAdapterModel

 # load model and tokenizer
 tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
@@ -119,20 +128,22 @@ tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
 #load base model
 model = AutoAdapterModel.from_pretrained('allenai/specter2_base')

- #load the adapter(s) as per the required task, provide an identifier for the adapter in load_as argument and activate it
 model.load_adapter("allenai/specter2_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)

 papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
           {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
-
 # concatenate title and abstract
- text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
- # preprocess the input
- inputs = self.tokenizer(text_batch, padding=True, truncation=True,
-                         return_tensors="pt", return_token_type_ids=False, max_length=512)
- output = model(**inputs)
- # take the first token in the batch as the embedding
- embeddings = output.last_hidden_state[:, 0, :]
 ```

 ## Downstream Use
 
 datasets:
 - allenai/scirepeval
 ---
+ ## SPECTER2

+ <!-- Provide a quick summary of what the model is/does. -->

+ SPECTER2 is a family of models that succeeds [SPECTER](https://huggingface.co/allenai/specter) and is capable of generating task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).
+ Given the combination of title and abstract of a scientific paper or a short textual query, the model can be used to generate effective embeddings to be used in downstream applications.
+
+ **Note: For general embedding purposes, please use [allenai/specter2](https://huggingface.co/allenai/specter2).**
+
+ **To get the best performance on a downstream task type, please load the associated adapter with the base model as in the example below.**

 **Dec 2023 Update:**
 
 
 However, for benchmarking purposes, please continue using the current version.**


+ # Adapter `allenai/specter2_adhoc_query` for allenai/specter2_base

+ An [adapter](https://adapterhub.ml) for the [`allenai/specter2_base`](https://huggingface.co/allenai/specter2_base) model that was trained on the [allenai/scirepeval](https://huggingface.co/datasets/allenai/scirepeval/) dataset.

+ This adapter was created for usage with the **[adapters](https://github.com/adapter-hub/adapters)** library.

+ ## Adapter Usage

 First, install `adapters`:
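
The install command itself falls outside this hunk; as a minimal sketch, assuming the library is the `adapters` package on PyPI (linked above):

```bash
pip install adapters
```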
 
 It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well.


 - **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
 - **Shared by:** Allen AI
 - **Model type:** bert-base-uncased + adapters

 ```python
 from transformers import AutoTokenizer
 from adapters import AutoAdapterModel
+ from typing import List
+ from sklearn.metrics.pairwise import euclidean_distances
+
+ def embed_input(text_batch: List[str]):
+     # preprocess the input
+     inputs = tokenizer(text_batch, padding=True, truncation=True,
+                        return_tensors="pt", return_token_type_ids=False, max_length=512)
+     output = model(**inputs)
+     # take the first token of each sequence ([CLS]) as the embedding
+     embeddings = output.last_hidden_state[:, 0, :]
+     return embeddings

 # load model and tokenizer
 tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')

 #load base model
 model = AutoAdapterModel.from_pretrained('allenai/specter2_base')

+ # load the query adapter, provide an identifier for the adapter in load_as argument and activate it
 model.load_adapter("allenai/specter2_adhoc_query", source="hf", load_as="specter2_adhoc_query", set_active=True)
+ query = ["Bidirectional transformers"]
+ query_embedding = embed_input(query)

+ # load the proximity adapter, provide an identifier for the adapter in load_as argument and activate it
+ model.load_adapter("allenai/specter2_proximity", source="hf", load_as="specter2", set_active=True)
 papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
           {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

 # concatenate title and abstract
+ text_papers_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
+ paper_embeddings = embed_input(text_papers_batch)
+
+ # calculate L2 distance between query and papers (detach to NumPy for scikit-learn)
+ l2_distance = euclidean_distances(paper_embeddings.detach().numpy(), query_embedding.detach().numpy()).flatten()
 ```
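
The snippet stops at the distance computation; as a short illustrative follow-up (not part of the original card, reusing the `papers` and `l2_distance` variables from the block above), the candidate papers can be ranked by ascending distance to the query:

```python
# rank papers by ascending L2 distance to the query (smaller distance = more relevant)
for idx in l2_distance.argsort():
    print(f"{l2_distance[idx]:.3f}  {papers[idx]['title']}")
```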

 ## Downstream Use