sebastian-hofstaetter committed on
Commit
0feb43c
1 Parent(s): 24c6247

Add model, tokenizer, & initial model card

Files changed (6)
  1. README.md +154 -0
  2. config.json +10 -0
  3. pytorch_model.bin +3 -0
  4. special_tokens_map.json +1 -0
  5. tokenizer_config.json +1 -0
  6. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,154 @@
+ ---
+ language: "en"
+ tags:
+ - dpr
+ - dense-passage-retrieval
+ - knowledge-distillation
+ datasets:
+ - ms_marco
+ ---
+
+ # Margin-MSE Trained ColBERT
+
+ We provide a retrieval-trained, DistilBERT-based ColBERT model (https://arxiv.org/pdf/2004.12832.pdf). Our model is trained with Margin-MSE, using a 3-teacher BERT_Cat (concatenated BERT scoring) ensemble on MSMARCO-Passage.
+
+ This instance can be used to **re-rank a candidate set** or **directly for vector-index-based dense retrieval**. The architecture is a 6-layer DistilBERT with an additional single linear layer at the end.
+
+ If you want to know more about our simple yet effective knowledge distillation method for efficient information retrieval models, which works for a variety of student architectures and was used to train this model instance, check out our paper: https://arxiv.org/abs/2010.02666 🎉
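+
+ For intuition, here is a minimal, hedged sketch of the Margin-MSE objective described in the paper (not the original training code; the function name is illustrative): the student model is trained to reproduce the *margin* between a relevant and a non-relevant passage that the teacher ensemble assigns, rather than the absolute teacher scores.
+
+ ```python
+ import torch
+
+ mse = torch.nn.MSELoss()
+
+ def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
+     # all arguments are relevance-score tensors of shape (batch,)
+     # the student matches the teacher margin: s(q, p+) - s(q, p-)
+     return mse(student_pos - student_neg, teacher_pos - teacher_neg)
+ ```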
+
+ For more information, training data, source code, and a minimal usage example, please visit: https://github.com/sebastian-hofstaetter/neural-ranking-kd
+
+ ## Configuration
+
+ - Trained in fp16, so fp16 inference shouldn't be a problem
+ - We use no compression: 768-dim output vectors (better suited for re-ranking, or for storage of smaller collections; MSMARCO comes to ~1 TB of vector storage with fp16 ... oops)
+ - Query [MASK] augmentation = 8x, regardless of batch size (needs to be added before the model; see the usage example in the GitHub repo, or the short sketch below, for more)
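+
+ A minimal sketch of the query [MASK] augmentation (an illustration under the assumption that the [MASK] tokens are simply appended to the raw query text; the exact preprocessing used in training lives in the GitHub repo):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
+
+ query = "how many people live in vienna"
+ # append 8 [MASK] tokens to the query text before tokenization
+ augmented_query = query + " " + " ".join(["[MASK]"] * 8)
+
+ query_input = tokenizer(augmented_query, return_tensors="pt")
+ ```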
+
+ ## Model Code
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel, PreTrainedModel, PretrainedConfig
+ from typing import Dict
+ import torch
+
+ class ColBERTConfig(PretrainedConfig):
+     model_type = "ColBERT"
+     bert_model: str
+     compression_dim: int = 768
+     dropout: float = 0.0
+     return_vecs: bool = False
+     trainable: bool = True
+
+ class ColBERT(PreTrainedModel):
+     """
+     ColBERT model from: https://arxiv.org/pdf/2004.12832.pdf
+     We use a dot-product instead of cosine per term (slightly better)
+     """
+     config_class = ColBERTConfig
+     base_model_prefix = "bert_model"
+
+     def __init__(self, cfg) -> None:
+         super().__init__(cfg)
+
+         self.bert_model = AutoModel.from_pretrained(cfg.bert_model)
+
+         for p in self.bert_model.parameters():
+             p.requires_grad = cfg.trainable
+
+         self.compressor = torch.nn.Linear(self.bert_model.config.hidden_size, cfg.compression_dim)
+
+     def forward(self,
+                 query: Dict[str, torch.LongTensor],
+                 document: Dict[str, torch.LongTensor]):
+
+         query_vecs = self.forward_representation(query)
+         document_vecs = self.forward_representation(document)
+
+         score = self.forward_aggregation(query_vecs, document_vecs, query["attention_mask"], document["attention_mask"])
+         return score
+
+     def forward_representation(self,
+                                tokens,
+                                sequence_type=None) -> torch.Tensor:
+
+         vecs = self.bert_model(**tokens)[0]  # assuming a distilbert model here
+         vecs = self.compressor(vecs)
+
+         # if encoding only, zero-out the padded positions so we can compress storage
+         if sequence_type == "doc_encode" or sequence_type == "query_encode":
+             vecs = vecs * tokens["attention_mask"].unsqueeze(-1)
+
+         return vecs
+
+     def forward_aggregation(self, query_vecs, document_vecs, query_mask, document_mask):
+
+         # create initial term-x-term scores (dot-product)
+         score = torch.bmm(query_vecs, document_vecs.transpose(2, 1))
+
+         # mask out padding on the doc dimension (mask with -10000, because the max should not select padded positions; setting them to 0 might still select them)
+         exp_mask = document_mask.bool().unsqueeze(1).expand(-1, score.shape[1], -1)
+         score[~exp_mask] = -10000
+
+         # max pooling over the document dimension
+         score = score.max(-1).values
+
+         # mask out padded query values
+         score[~(query_mask.bool())] = 0
+
+         # sum over the query dimension
+         score = score.sum(-1)
+
+         return score
+
+ tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # honestly not sure if that is the best way to go, but it works :)
+ model = ColBERT.from_pretrained("sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco")
+ ```
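+
+ As a hedged usage sketch (not part of the original card; the query and passage texts are made up), the snippet below scores one query/passage pair with the classes defined above. For encoding-only workflows (e.g., building a vector index), `forward_representation` can instead be called with `sequence_type="doc_encode"` or `"query_encode"`.
+
+ ```python
+ query = "how many people live in vienna"
+ passage = "Vienna is the capital of Austria and has roughly 1.9 million inhabitants."
+
+ # query [MASK] augmentation (8x, see Configuration above)
+ query_input = tokenizer(query + " " + " ".join(["[MASK]"] * 8), return_tensors="pt")
+ passage_input = tokenizer(passage, return_tensors="pt")
+
+ with torch.no_grad():
+     score = model(query_input, passage_input)
+
+ print(score)  # a single relevance score per pair; higher means more relevant
+ ```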
+
+ ## Effectiveness on MSMARCO Passage & TREC Deep Learning '19
+
+ We trained our model on the standard MSMARCO ("small", 400K-query) training triples with knowledge distillation, using a batch size of 32 on a single consumer-grade GPU (11 GB memory).
+
+ For re-ranking we used the top-1000 BM25 results.
+
+ ### MSMARCO-DEV
+
+ Here we use the larger 49K-query DEV set (results are in the same range as on the smaller 7K-query DEV set; only minimal differences are to be expected).
+
+ |                                     | MRR@10 | NDCG@10 |
+ |-------------------------------------|--------|---------|
+ | BM25                                | .194   | .241    |
+ | **Margin-MSE ColBERT** (Re-ranking) | .375   | .436    |
+
+ ### TREC-DL'19
+
+ For MRR we use the recommended binarization point of 2 on the graded relevance labels (a small sketch follows after the table). This may skew the results when compared to evaluations that use a different binarization point.
+
+ |                                     | MRR@10 | NDCG@10 |
+ |-------------------------------------|--------|---------|
+ | BM25                                | .689   | .501    |
+ | **Margin-MSE ColBERT** (Re-ranking) | .878   | .744    |
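+
+ A small, hedged sketch of that binarization and of MRR@10 (for clarity only; the function name is illustrative and this is not the official trec_eval tooling): a document counts as relevant only if its graded label is at least 2.
+
+ ```python
+ def mrr_at_10(ranked_doc_ids, graded_labels, binarization_point=2):
+     # ranked_doc_ids: doc ids in ranked order; graded_labels: dict doc_id -> graded relevance
+     for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
+         if graded_labels.get(doc_id, 0) >= binarization_point:
+             return 1.0 / rank
+     return 0.0
+ ```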
+
+ For more metrics, baselines, info, and analysis, please see the paper: https://arxiv.org/abs/2010.02666
+
+ ## Limitations & Bias
+
+ - The model inherits social biases from both DistilBERT and MSMARCO.
+
+ - The model is only trained on relatively short passages of MSMARCO (avg. 60 words length), so it might struggle with longer text.
+
+ ## Citation
+
+ If you use our model checkpoint, please cite our work as:
+
+ ```
+ @misc{hofstaetter2020_crossarchitecture_kd,
+       title={Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation},
+       author={Sebastian Hofst{\"a}tter and Sophia Althammer and Michael Schr{\"o}der and Mete Sertkan and Allan Hanbury},
+       year={2020},
+       eprint={2010.02666},
+       archivePrefix={arXiv},
+       primaryClass={cs.IR}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "architectures": [
+     "ColBERT"
+   ],
+   "bert_model": "distilbert-base-uncased",
+   "compression_dim": 768,
+   "model_type": "ColBERT",
+   "return_vecs": true,
+   "trainable": true
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eb2a93cee563cc0ee7b8b5709835f57781338bf47fd2819fcf6265f29f598b26
+ size 267837019
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "name_or_path": "distilbert-base-uncased"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff