# malteos /scincl

## SciNCL

SciNCL is a pre-trained BERT language model to generate document-level embeddings of research papers. It uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with weights from scibert-scivocab-uncased. The underlying citation embeddings are trained on the S2ORC citation graph.

## How to use the pretrained model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
{'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# inference
result = model(**inputs)

# take the first token ([CLS] token) in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]


## Triplet Mining Parameters

Setting Value
seed 4
triples_per_query 5
easy_positives_count 5
easy_positives_strategy 5
easy_positives_k 20-25
easy_negatives_count 3
easy_negatives_strategy random_without_knn
hard_negatives_count 2
hard_negatives_strategy knn
hard_negatives_k 3998-4000

## SciDocs Results

These model weights are the ones that yielded the best results on SciDocs (seed=4). In the paper we report the SciDocs results as mean over ten seeds.

model mag-f1 mesh-f1 co-view-map co-view-ndcg co-read-map co-read-ndcg cite-map cite-ndcg cocite-map cocite-ndcg recomm-ndcg recomm-P@1 Avg
Doc2Vec 66.2 69.2 67.8 82.9 64.9 81.6 65.3 82.2 67.1 83.4 51.7 16.9 66.6
fasttext-sum 78.1 84.1 76.5 87.9 75.3 87.4 74.6 88.1 77.8 89.6 52.5 18 74.1
SGC 76.8 82.7 77.2 88 75.7 87.5 91.6 96.2 84.1 92.5 52.7 18.2 76.9
SciBERT 79.7 80.7 50.7 73.1 47.7 71.1 48.3 71.7 49.7 72.6 52.1 17.9 59.6
SPECTER 82 86.4 83.6 91.5 84.5 92.4 88.3 94.9 88.1 94.8 53.9 20 80
SciNCL (10 seeds) 81.4 88.7 85.3 92.3 87.5 93.9 93.6 97.3 91.6 96.4 53.9 19.3 81.8
SciNCL (seed=4) 81.2 89.0 85.3 92.2 87.7 94.0 93.6 97.4 91.7 96.5 54.3 19.6 81.9

Additional evaluations are available in the paper.