---
license: apache-2.0
---
This is the finetuned model presented in **MIReAD: a simple method for learning high-quality representations from
scientific documents (ACL 2023)**.

We trained MIReAD on more than 500,000 PubMed and arXiv abstracts spanning over 2,000 journal classes. MIReAD was initialized with SciBERT weights and finetuned to predict the journal class from a paper's title and abstract. MIReAD uses SciBERT's tokenizer.

Overall, with MIReAD you can:
* extract a semantically meaningful representation from a paper's abstract
* predict the journal class from a paper's abstract

To load the MIReAD model:
```python
from transformers import BertForSequenceClassification, AutoTokenizer

mpath = 'arazd/miread'
model = BertForSequenceClassification.from_pretrained(mpath)
tokenizer = AutoTokenizer.from_pretrained(mpath)
```

To use MIReAD for feature extraction and classification:
```python
# sample abstract text
abstr = 'Learning semantically meaningful representations from scientific documents can ...'
source_len = 512
inputs = tokenizer(abstr,
                   max_length=source_len,
                   padding='max_length',  # replaces the deprecated pad_to_max_length=True
                   truncation=True,
                   return_tensors="pt")

# classification (getting logits over 2,734 journal classes)
out = model(**inputs)
logits = out.logits 

# feature extraction (getting 768-dimensional feature profiles)
out = model.bert(**inputs)
# IMPORTANT: use [CLS] token representation as document-level representation (hence, 0th idx)
feature = out.last_hidden_state[:, 0, :]

```
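
Once you have `logits` and `feature`, two common follow-up steps are ranking the most likely journal classes and comparing the representations of two documents. The sketch below is not part of the MIReAD release: `top_k_classes` and `cosine_similarity` are hypothetical helper names, and the snippet uses toy values in place of real model outputs so that it runs without downloading the model.

```python
import math

def top_k_classes(logits, k=5):
    """Return the k highest-scoring (class_index, probability) pairs.

    `logits` is a plain list of floats (hypothetical helper, not part of MIReAD).
    """
    # numerically stable softmax: subtract the max before exponentiating
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # sort class indices by probability, highest first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy values standing in for real model outputs
toy_logits = [0.1, 2.0, -1.0, 0.5]
print(top_k_classes(toy_logits, k=2))

toy_feature_a = [1.0, 0.0, 1.0]
toy_feature_b = [1.0, 1.0, 0.0]
print(cosine_similarity(toy_feature_a, toy_feature_b))
```

With the real model, you would pass `logits[0].detach().tolist()` to `top_k_classes`, and `feature[0].detach().tolist()` for each of two abstracts to `cosine_similarity`. Cosine similarity is one reasonable choice for comparing embeddings here, not necessarily the metric used in the paper. Whether the class indices map to journal names through `model.config.id2label` depends on how the checkpoint was saved, so check that mapping before relying on it.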