voidful committed on
Commit
c7a3dc6
1 Parent(s): 453ecd3

Create README.md

Files changed (1)
  1. README.md +57 -0
README.md ADDED
@@ -0,0 +1,57 @@
---
language: multilingual
datasets:
- NQ
- Trivia
- SQuAD
- MLQA
- DRCD
---

# dpr-ctx_encoder-bert-base-multilingual

## Description

Multilingual DPR context encoder based on bert-base-multilingual-cased.
[DPR model](https://arxiv.org/abs/2004.04906)
[DPR repo](https://github.com/facebookresearch/DPR)

## Data
1. [NQ](https://github.com/facebookresearch/DPR/blob/master/data/download_data.py)
2. [Trivia](https://github.com/facebookresearch/DPR/blob/master/data/download_data.py)
3. [SQuAD](https://github.com/facebookresearch/DPR/blob/master/data/download_data.py)
4. [DRCD*](https://github.com/DRCKnowledgeTeam/DRCD)
5. [MLQA*](https://github.com/facebookresearch/MLQA)

`question pairs for train`: 644,217
`question pairs for dev`: 73,710

*DRCD and MLQA were converted to the DPR format using the haystack script [squad_to_dpr.py](https://github.com/deepset-ai/haystack/blob/master/haystack/retriever/squad_to_dpr.py)

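For orientation, converting a SQuAD-style file into DPR training records roughly amounts to the sketch below. It is a simplified illustration, not the haystack `squad_to_dpr.py` script itself: the function name `squad_to_dpr_records` and the sample data are hypothetical, and the negative-context lists, which a full conversion may also fill, are left empty here.

```python
import json
from typing import Any, Dict, List


def squad_to_dpr_records(squad: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a SQuAD-style dict into a list of DPR-style training records.

    Hypothetical helper for illustration only: each answerable question gets
    its source paragraph as the single positive context.
    """
    records = []
    for article in squad["data"]:
        title = article.get("title", "")
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa.get("answers", [])]
                if not answers:
                    continue  # skip questions without an answer
                records.append({
                    "question": qa["question"],
                    "answers": answers,
                    "positive_ctxs": [{"title": title, "text": context}],
                    "negative_ctxs": [],       # left empty in this sketch
                    "hard_negative_ctxs": [],  # left empty in this sketch
                })
    return records


# Made-up sample in the SQuAD JSON layout
sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Taipei is the capital of Taiwan.",
            "qas": [{
                "id": "q1",
                "question": "What is the capital of Taiwan?",
                "answers": [{"text": "Taipei", "answer_start": 0}],
            }],
        }],
    }]
}

records = squad_to_dpr_records(sample)
print(json.dumps(records, ensure_ascii=False, indent=2))
```
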
## Training Script
I used the script from the [haystack DPR training tutorial](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial9_DPR_training.ipynb)

## Usage

```python
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# Load the tokenizer and the context (passage) encoder from the Hugging Face Hub
tokenizer = DPRContextEncoderTokenizer.from_pretrained('voidful/dpr-ctx_encoder-bert-base-multilingual')
model = DPRContextEncoder.from_pretrained('voidful/dpr-ctx_encoder-bert-base-multilingual')

# Tokenize one passage and take the pooled output as its dense embedding
input_ids = tokenizer("Hello, is my dog cute ?", return_tensors='pt')["input_ids"]
embeddings = model(input_ids).pooler_output
```

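DPR scores a passage against a question by the dot product of their embeddings, so the `pooler_output` vectors can be ranked directly. The sketch below shows only that ranking step, on toy vectors in plain Python. The helper names `dot` and `rank_passages` are illustrative and not part of `transformers`; real vectors would come from this context encoder and the matching question encoder `voidful/dpr-question_encoder-bert-base-multilingual`.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_passages(question_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs, highest dot-product score first."""
    scores = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embedding rows
question = [1.0, 0.0, 2.0]
passages = [[0.5, 0.5, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

ranking = rank_passages(question, passages)
print(ranking)  # [(1, 3.0), (0, 0.5), (2, 0.0)]
```

With real embeddings, each row of `pooler_output` can be converted with `.tolist()` before being passed in.
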
To use the model with `haystack`, follow the tutorial:
[Better Retrievers via "Dense Passage Retrieval"](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb)
```python
from haystack.retriever.dense import DensePassageRetriever

# `document_store` must already exist (created as in the tutorial linked above)
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="voidful/dpr-question_encoder-bert-base-multilingual",
                                  passage_embedding_model="voidful/dpr-ctx_encoder-bert-base-multilingual",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)
```