fl399 committed on
Commit 47b6bd0
1 Parent(s): ccb398f

Update README.md

Files changed (1)
  1. README.md +35 -0
README.md CHANGED
@@ -15,6 +15,41 @@ datasets:
  ### SapBERT-XLMR
  SapBERT [(Liu et al. 2020)](https://arxiv.org/pdf/2010.11784.pdf) trained with [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) 2020AB, using [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) as the base model. Please use [CLS] as the representation of the input.
 
+
+ #### Extracting embeddings from SapBERT
+
+ The following script converts a list of strings (entity names) into embeddings.
+ ```python
+ import numpy as np
+ import torch
+ from tqdm.auto import tqdm
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
+ model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()
+
+ # replace with your own list of entity names
+ all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]
+
+ bs = 128 # batch size during inference
+ all_embs = []
+ for i in tqdm(np.arange(0, len(all_names), bs)):
+     toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
+                                        padding="max_length",
+                                        max_length=25,
+                                        truncation=True,
+                                        return_tensors="pt")
+     toks_cuda = {}
+     for k,v in toks.items():
+         toks_cuda[k] = v.cuda()
+     cls_rep = model(**toks_cuda)[0][:,0,:] # use CLS representation as the embedding
+     all_embs.append(cls_rep.cpu().detach().numpy())
+
+ all_embs = np.concatenate(all_embs, axis=0)
+ ```
+
+ For more details about training and eval, see SapBERT [github repo](https://github.com/cambridgeltl/sapbert).
+
  ### Citation
 
  ```bibtex