malteos committed on
Commit cf53abd
1 Parent(s): 59e80e4

Update README.md

Files changed (1): README.md (+86 -86)

README.md CHANGED
---
tags:
- feature-extraction
language: en
datasets:
- SciDocs
- s2orc
metrics:
- F1
- accuracy
- map
- ndcg
license: mit
---

## SciNCL

SciNCL is a pre-trained BERT-based language model for generating document-level embeddings of research papers.
It uses the citation graph neighborhood to generate samples for contrastive learning.
Prior to the contrastive training, the model is initialized with weights from [scibert-scivocab-uncased](https://huggingface.co/allenai/scibert_scivocab_uncased).
The underlying citation embeddings are trained on the [S2ORC citation graph](https://github.com/allenai/s2orc).

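
To make this concrete: contrastive training of this kind typically optimizes a triplet margin objective over the document embeddings of a query paper, a positive drawn from its citation-graph neighborhood, and a negative drawn from outside it. The snippet below is only a generic sketch of such an objective, not the authors' training code; the margin value and the random tensors are placeholders, and the exact loss and sampling are described in the paper and implemented in the repository linked below.

```python
import torch
from torch import nn

# generic triplet margin objective over document embeddings (sketch only)
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)  # margin value is illustrative

# placeholder [CLS] embeddings for a batch of query papers, their
# citation-neighborhood positives, and their negatives: (batch, hidden_size)
query = torch.randn(8, 768)
positive = torch.randn(8, 768)
negative = torch.randn(8, 768)

# pulls each positive closer to its query than the negative, by at least the margin
loss = triplet_loss(query, positive, negative)
```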
Paper: [Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022)](https://arxiv.org/abs/2202.06671)

Code: https://github.com/malteos/scincl
## How to use the pretrained model

```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# inference
result = model(**inputs)

# take the first token of each sequence ([CLS] token) as the document embedding
embeddings = result.last_hidden_state[:, 0, :]
```
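
As a quick sanity check, you can compare the two resulting embeddings with cosine similarity. This is only a usage sketch that reuses the `embeddings` tensor from the example above; the printed score is illustrative, not a documented reference value.

```python
import torch.nn.functional as F

# cosine similarity between the two paper embeddings computed above
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```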
## Triplet Mining Parameters

| **Setting**             | **Value**          |
|-------------------------|--------------------|
| seed                    | 4                  |
| triples_per_query       | 5                  |
| easy_positives_count    | 5                  |
| easy_positives_strategy | 5                  |
| easy_positives_k        | 20-25              |
| easy_negatives_count    | 3                  |
| easy_negatives_strategy | random_without_knn |
| hard_negatives_count    | 2                  |
| hard_negatives_strategy | knn                |
| hard_negatives_k        | 3998-4000          |

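
As a rough illustration of how these settings could map onto sampling, here is a hypothetical sketch for a single query paper. It is not the authors' mining code (see the repository linked above for that): `neighbors` is assumed to be a NumPy array with the query's kNN ranking over the citation embeddings (nearest first), the k ranges are read as rank windows in that ranking, and `num_papers` stands for the corpus size.

```python
import numpy as np

rng = np.random.default_rng(4)  # seed

def mine_triplets(neighbors, num_papers, query_idx):
    """Sample 5 (query, positive, negative) triples for one query paper."""
    # easy positives: nearest neighbors around rank 20-25 of the kNN ranking
    positives = rng.choice(neighbors[20:25], size=5)

    # easy negatives: random papers that do not appear in the kNN ranking
    knn_set = set(neighbors.tolist()) | {query_idx}
    easy_negatives = []
    while len(easy_negatives) < 3:
        candidate = int(rng.integers(num_papers))
        if candidate not in knn_set:
            easy_negatives.append(candidate)

    # hard negatives: the far tail of the kNN ranking, around rank 3998-4000
    hard_negatives = rng.choice(neighbors[3998:4000], size=2).tolist()

    # pair positives with negatives -> triples_per_query = 5
    negatives = easy_negatives + hard_negatives
    return [(query_idx, int(p), int(n)) for p, n in zip(positives, negatives)]
```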
## SciDocs Results

These model weights are from the run that yielded the best results on SciDocs (`seed=4`).
In the paper, we report the SciDocs results as the mean over ten seeds.

| **model**           | **mag-f1** | **mesh-f1** | **co-view-map** | **co-view-ndcg** | **co-read-map** | **co-read-ndcg** | **cite-map** | **cite-ndcg** | **cocite-map** | **cocite-ndcg** | **recomm-ndcg** | **recomm-P@1** | **Avg** |
|---------------------|-----------:|------------:|----------------:|-----------------:|----------------:|-----------------:|-------------:|--------------:|---------------:|----------------:|----------------:|---------------:|--------:|
| Doc2Vec             | 66.2       | 69.2        | 67.8            | 82.9             | 64.9            | 81.6             | 65.3         | 82.2          | 67.1           | 83.4            | 51.7            | 16.9           | 66.6    |
| fasttext-sum        | 78.1       | 84.1        | 76.5            | 87.9             | 75.3            | 87.4             | 74.6         | 88.1          | 77.8           | 89.6            | 52.5            | 18.0           | 74.1    |
| SGC                 | 76.8       | 82.7        | 77.2            | 88.0             | 75.7            | 87.5             | 91.6         | 96.2          | 84.1           | 92.5            | 52.7            | 18.2           | 76.9    |
| SciBERT             | 79.7       | 80.7        | 50.7            | 73.1             | 47.7            | 71.1             | 48.3         | 71.7          | 49.7           | 72.6            | 52.1            | 17.9           | 59.6    |
| SPECTER             | 82.0       | 86.4        | 83.6            | 91.5             | 84.5            | 92.4             | 88.3         | 94.9          | 88.1           | 94.8            | 53.9            | 20.0           | 80.0    |
| SciNCL (10 seeds)   | 81.4       | 88.7        | 85.3            | 92.3             | 87.5            | 93.9             | 93.6         | 97.3          | 91.6           | 96.4            | 53.9            | 19.3           | 81.8    |
| **SciNCL (seed=4)** | 81.2       | 89.0        | 85.3            | 92.2             | 87.7            | 94.0             | 93.6         | 97.4          | 91.7           | 96.5            | 54.3            | 19.6           | 81.9    |

Additional evaluations are available in the paper.

## License

MIT