add model & infos

Files changed (6)

README.md +71 -0
config.json +22 -0
pytorch_model.bin +3 -0
special_tokens_map.json +1 -0
tokenizer_config.json +1 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,71 @@

+---
+language: "en"
+tags:
+- dpr
+- dense-passage-retrieval
+- knowledge-distillation
+datasets:
+- ms_marco
+---
+# DistilBert for Dense Passage Retrieval trained with Balanced Topic Aware Sampling (TAS-B)
+We provide a retrieval-trained DistilBERT-based model (we call the *dual-encoder then dot-product scoring* architecture BERT_Dot), trained with Balanced Topic Aware Sampling on MSMARCO-Passage.
+This instance was trained with a batch size of 256 and can be used to **re-rank a candidate set** or **directly for vector-index-based dense retrieval**. The architecture is a 6-layer DistilBERT without architecture additions or modifications (we only change the weights during training). To obtain a query or passage representation, we pool the CLS vector. We use the same BERT layers for both query and passage encoding, which yields better results and lowers memory requirements.
+If you want to know more about our efficient batch composition procedure (training can be done on a single consumer GPU in 48 hours) and dual supervision for dense retrieval training, check out our paper: https://arxiv.org/abs/2104.06967 🎉
+For more information and a minimal usage example please visit: https://github.com/sebastian-hofstaetter/tas-balanced-dense-retrieval
+## Effectiveness on MSMARCO Passage & TREC-DL'19
+We trained our model on the MSMARCO standard ("small"-400K query) training triples re-sampled with our TAS-B method. As teacher models we used the BERT_CAT pairwise scores as well as the ColBERT model for in-batch-negative signals published here: https://github.com/sebastian-hofstaetter/neural-ranking-kd
+### MSMARCO-DEV (7K)
+|                                | MRR@10 | NDCG@10 | Recall@1K |
+|--------------------------------|--------|---------|-----------|
+| BM25                           | .194   | .241    | .857      |
+| **TAS-B BERT_Dot** (Retrieval) | .347   | .410    | .978      |
+### TREC-DL'19
+For MRR and Recall we binarize the graded relevance judgments at the recommended threshold of 2 (grades of 2 or higher count as relevant). Results computed with a different binarization point are not directly comparable.
+|                                | MRR@10 | NDCG@10 | Recall@1K |
+|--------------------------------|--------|---------|-----------|
+| BM25                           | .689   | .501    | .739      |
+| **TAS-B BERT_Dot** (Retrieval) | .883   | .717    | .843      |
+### TREC-DL'20
+For MRR and Recall we binarize the graded relevance judgments at the recommended threshold of 2 (grades of 2 or higher count as relevant). Results computed with a different binarization point are not directly comparable.
+|                                | MRR@10 | NDCG@10 | Recall@1K |
+|--------------------------------|--------|---------|-----------|
+| BM25                           | .649   | .475    | .806      |
+| **TAS-B BERT_Dot** (Retrieval) | .843   | .686    | .875      |
+For more baselines, info and analysis, please see the paper: https://arxiv.org/abs/2104.06967
+## Limitations & Bias
+- The model inherits social biases from both DistilBERT and MSMARCO.
+- The model is trained only on the relatively short passages of MSMARCO (about 60 words on average), so it might struggle with longer text.
+## Citation
+If you use our model checkpoint, please cite our work as:
+```
+@inproceedings{Hofstaetter2021_tasb_dense_retrieval,
+ author = {Sebastian Hofst{\"a}tter and Sheng-Chieh Lin and Jheng-Hong Yang and Jimmy Lin and Allan Hanbury},
+ title = {{Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling}},
+ booktitle = {Proc. of SIGIR},
+ year = {2021},
+}
+```
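
The README's TREC-DL tables note that MRR and Recall depend on where the graded relevance labels are binarized. The following is a minimal sketch of that effect, not the authors' evaluation code: `mrr_at_k`, the toy `rankings` and the toy `qrels` are hypothetical names and data made up for illustration.

```python
from typing import Dict, List


def mrr_at_k(
    rankings: Dict[str, List[str]],
    qrels: Dict[str, Dict[str, int]],
    k: int = 10,
    min_rel: int = 2,
) -> float:
    """Mean reciprocal rank at cutoff k.

    A passage counts as relevant when its graded label is >= min_rel
    (the binarization point). Unjudged passages are treated as grade 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        judged = qrels.get(qid, {})
        for rank, pid in enumerate(ranked[:k], start=1):
            if judged.get(pid, 0) >= min_rel:
                total += 1.0 / rank
                break  # only the first relevant hit contributes
    return total / len(rankings)


# Toy data (hypothetical): graded labels on a 0-3 scale.
rankings = {"q1": ["p3", "p1", "p2"], "q2": ["p9", "p8"]}
qrels = {"q1": {"p3": 1, "p1": 2}, "q2": {"p9": 3}}

strict = mrr_at_k(rankings, qrels, min_rel=2)   # q1 first hit at rank 2, q2 at rank 1
lenient = mrr_at_k(rankings, qrels, min_rel=1)  # q1 first hit moves up to rank 1
print(strict, lenient)
```

With the threshold at 2 the toy MRR@10 is 0.75, and with the threshold at 1 it rises to 1.0, which is why numbers computed at different binarization points should not be compared directly.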

config.json ADDED

	@@ -0,0 +1,22 @@

+{
+  "_name_or_path": "distilbert-base-uncased",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertModel"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "initializer_range": 0.02,
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "vocab_size": 30522
+}
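
The config declares a plain `DistilBertModel` with `dim` 768, so under the README's BERT_Dot description each query and passage becomes one 768-dimensional CLS vector, and relevance is their dot product. The sketch below illustrates only the scoring step. The 3-dimensional vectors stand in for real encoder outputs, and `dot` and `rank_passages` are hypothetical helper names, not part of any library.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x * y for x, y in zip(a, b))


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Dict[str, Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and return the top_k by score."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real CLS vectors from this model would have 768 dims.
query = [0.5, -1.0, 2.0]
passages = {
    "p1": [1.0, 0.0, 1.0],
    "p2": [0.0, 1.0, 0.0],
    "p3": [2.0, -1.0, 0.5],
}

ranked = rank_passages(query, passages, top_k=2)
print(ranked)
```

Because passage vectors do not depend on the query, they can be computed once and stored in a vector index. The same scoring function also covers re-ranking when `passage_vecs` holds only a candidate set.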

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6dc56a56d0f65ec2d057be4507775c47a79713f0b25f6c832a00eb6f59004a49
+size 265472230
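
As a sanity check, the LFS pointer size can be compared with the parameter count implied by config.json. The sketch assumes the standard DistilBERT layout (word and position embeddings plus one LayerNorm, then per layer four attention projections, a two-layer feed-forward block and two LayerNorms, with no pooler) and float32 storage. `distilbert_param_count` is a hypothetical helper written for this note.

```python
def distilbert_param_count(
    vocab_size: int,
    max_position_embeddings: int,
    dim: int,
    hidden_dim: int,
    n_layers: int,
) -> int:
    """Parameter count for the assumed DistilBERT encoder layout."""
    layer_norm = 2 * dim  # weight + bias
    embeddings = vocab_size * dim + max_position_embeddings * dim + layer_norm
    attention = 4 * (dim * dim + dim)  # q, k, v and output projections with biases
    ffn = (dim * hidden_dim + hidden_dim) + (hidden_dim * dim + dim)
    per_layer = attention + ffn + 2 * layer_norm
    return embeddings + n_layers * per_layer


# Values copied from config.json in this commit.
params = distilbert_param_count(
    vocab_size=30522,
    max_position_embeddings=512,
    dim=768,
    hidden_dim=3072,
    n_layers=6,
)
weight_bytes = params * 4   # assumption: float32, 4 bytes per parameter
lfs_size = 265472230        # "size" field of the LFS pointer
overhead = lfs_size - weight_bytes

print(params, weight_bytes, overhead)
```

Under these assumptions the config implies 66,362,880 parameters, or 265,451,520 bytes of weights, which leaves about 20 KB of the reported file size unaccounted for. That remainder is plausibly serialization metadata, though the diff itself does not confirm this.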

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tokenizer_config.json ADDED

	@@ -0,0 +1 @@


+{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "name_or_path": "distilbert-base-uncased"}
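
The tokenizer settings matter for the README's caveat about longer text: input is lowercased, and a sequence is limited to 512 positions including `[CLS]` and `[SEP]`. The sketch below mimics only that framing and truncation. Whitespace splitting is a deliberate stand-in for the real WordPiece tokenizer, and `build_input` is a hypothetical helper.

```python
from typing import List


def build_input(
    text: str,
    max_length: int = 512,
    do_lower_case: bool = True,
) -> List[str]:
    """Frame text as [CLS] tokens [SEP], truncated to max_length positions.

    Whitespace splitting replaces WordPiece here (assumption), so token
    counts will differ from what the real tokenizer produces.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    if do_lower_case:
        text = text.lower()
    tokens = text.split()
    budget = max_length - 2  # reserve two positions for the special tokens
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]


short = build_input("Hello World")
long_input = build_input("word " * 1000)
print(short, len(long_input))
```

Anything beyond the length budget is dropped, so only the start of a long document would contribute to its CLS vector.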

vocab.txt ADDED

The diff for this file is too large to render. See raw diff