File size: 4,818 Bytes
cf53abd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
976214b
74c41d0
cf53abd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
---
tags:
- feature-extraction
language: en
datasets:
- SciDocs
- s2orc
metrics:
- F1
- accuracy
- map
- ndcg
license: mit
---

## SciNCL

SciNCL is a pre-trained BERT language model to generate document-level embeddings of research papers.
It uses the citation graph neighborhood to generate samples for contrastive learning.
Prior to the contrastive training, the model is initialized with weights from [scibert-scivocab-uncased](https://huggingface.co/allenai/scibert_scivocab_uncased).
The underlying citation embeddings are trained on the [S2ORC citation graph](https://github.com/allenai/s2orc).

Paper: [Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)](https://arxiv.org/abs/2202.06671).

Code: https://github.com/malteos/scincl

PubMedNCL: Working with biomedical papers? Try [PubMedNCL](https://huggingface.co/malteos/PubMedNCL).

## How to use the pretrained model

```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# inference
result = model(**inputs)

# take the first token ([CLS] token) in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]
```

## Triplet Mining Parameters

| **Setting**             | **Value**          |
|-------------------------|--------------------|
| seed                    | 4                  |
| triples_per_query       | 5                  |
| easy_positives_count    | 5                  |
| easy_positives_strategy | 5                  |
| easy_positives_k        | 20-25              |
| easy_negatives_count    | 3                  |
| easy_negatives_strategy | random_without_knn |
| hard_negatives_count    | 2                  |
| hard_negatives_strategy | knn                |
| hard_negatives_k        | 3998-4000          |

## SciDocs Results

These model weights are the ones that yielded the best results on SciDocs (`seed=4`).
In the paper we report the SciDocs results as mean over ten seeds.

| **model**         | **mag-f1** | **mesh-f1** | **co-view-map** | **co-view-ndcg** | **co-read-map** | **co-read-ndcg** | **cite-map** | **cite-ndcg** | **cocite-map** | **cocite-ndcg** | **recomm-ndcg** | **recomm-P@1** | **Avg** |
|-------------------|-----------:|------------:|----------------:|-----------------:|----------------:|-----------------:|-------------:|--------------:|---------------:|----------------:|----------------:|---------------:|--------:|
| Doc2Vec           |       66.2 |        69.2 |            67.8 |             82.9 |            64.9 |             81.6 |         65.3 |          82.2 |           67.1 |            83.4 |            51.7 |           16.9 |    66.6 |
| fasttext-sum      |       78.1 |        84.1 |            76.5 |             87.9 |            75.3 |             87.4 |         74.6 |          88.1 |           77.8 |            89.6 |            52.5 |             18 |    74.1 |
| SGC               |       76.8 |        82.7 |            77.2 |               88 |            75.7 |             87.5 |         91.6 |          96.2 |           84.1 |            92.5 |            52.7 |           18.2 |    76.9 |
| SciBERT           |       79.7 |        80.7 |            50.7 |             73.1 |            47.7 |             71.1 |         48.3 |          71.7 |           49.7 |            72.6 |            52.1 |           17.9 |    59.6 |
| SPECTER           |         82 |        86.4 |            83.6 |             91.5 |            84.5 |             92.4 |         88.3 |          94.9 |           88.1 |            94.8 |            53.9 |             20 |      80 |
| SciNCL (10 seeds) |       81.4 |        88.7 |            85.3 |             92.3 |            87.5 |             93.9 |         93.6 |          97.3 |           91.6 |            96.4 |            53.9 |           19.3 |    81.8 |
| **SciNCL (seed=4)**   |       81.2 |        89.0 |            85.3 |             92.2 |            87.7 |             94.0 |         93.6 |          97.4 |           91.7 |            96.5 |            54.3 |           19.6 |    81.9 |

Additional evaluations are available in the paper.

## License

MIT