---
tags:
- generated_from_keras_callback
- dpr
license: apache-2.0
model-index:
- name: dpr-question_encoder_bert_uncased_L-2_H-128_A-2
results: []
---
# dpr-question_encoder_bert_uncased_L-2_H-128_A-2
This model (based on google/bert_uncased_L-2_H-128_A-2) was trained on the data.retriever.nq-adv-hn-train dataset from facebookresearch/DPR. It achieves the results reported in the evaluation section below.
## Evaluation data
Evaluation dataset: facebook-dpr-dev-dataset from the official DPR GitHub repository.
|Model|Dataset|No. of queries|No. of passages|R@10|R@20|R@50|R@100|R@1000|
|---|---|---|---|---|---|---|---|---|
|nlpconnect/dpr-ctx_encoder_bert_uncased_L-2_H-128_A-2 (ours)|nq-dev|6445|199795|60.53%|68.28%|76.07%|80.98%|91.45%|
|nlpconnect/dpr-ctx_encoder_bert_uncased_L-12_H-128_A-2 (ours)|nq-dev|6445|199795|65.43%|71.99%|79.03%|83.24%|92.11%|
|facebook/dpr-ctx_encoder-single-nq-base (hf/fb)*|nq-dev|6445|199795|40.94%|49.27%|59.05%|66.00%|82.00%|
Evaluation dataset: UKPLab/BEIR test data (we used only the first ~200,000 passages).
|Model|Dataset|No. of queries|No. of passages|R@10|R@20|R@50|R@100|R@1000|
|---|---|---|---|---|---|---|---|---|
|nlpconnect/dpr-ctx_encoder_bert_uncased_L-2_H-128_A-2 (ours)|nq-test|3452|200001|49.68%|59.06%|69.40%|75.75%|89.28%|
|nlpconnect/dpr-ctx_encoder_bert_uncased_L-12_H-128_A-2 (ours)|nq-test|3452|200001|51.62%|61.09%|70.10%|76.07%|88.70%|
|facebook/dpr-ctx_encoder-single-nq-base (hf/fb)*|nq-test|3452|200001|32.93%|43.74%|56.95%|66.30%|83.92%|
Note: * indicates that we evaluated this model ourselves on the same evaluation dataset, for a direct comparison.
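For reference, here is a minimal sketch of how a Recall@k number like those above can be computed for dense retrieval. The function and argument names are illustrative, not taken from our evaluation code:

```python
import numpy as np

def recall_at_k(query_embeddings, passage_embeddings, positive_passage_ids, k):
    """Fraction of queries whose gold passage appears among the top-k retrieved passages.

    query_embeddings:     (num_queries, dim) array
    passage_embeddings:   (num_passages, dim) array
    positive_passage_ids: for each query, the index of its gold passage
    """
    # DPR scores query/passage pairs by inner product.
    scores = query_embeddings @ passage_embeddings.T  # (num_queries, num_passages)
    # Indices of the k highest-scoring passages per query (order within top-k is arbitrary).
    top_k = np.argpartition(-scores, k, axis=1)[:, :k]
    hits = [gold in row for gold, row in zip(positive_passage_ids, top_k)]
    return float(np.mean(hits))
```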
### Usage (HuggingFace Transformers)
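In the example below, `passage_dicts` (a list of `{'title': ..., 'text': ...}` dicts), `queries` (a list of strings), and `model_config` (an object supplying the maximum sequence lengths) are assumed to be provided by you; they are not defined in this card.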
```python
import numpy as np
from transformers import AutoTokenizer, TFAutoModel

passage_encoder = TFAutoModel.from_pretrained("nlpconnect/dpr-ctx_encoder_bert_uncased_L-12_H-128_A-2")
query_encoder = TFAutoModel.from_pretrained("nlpconnect/dpr-question_encoder_bert_uncased_L-12_H-128_A-2")

p_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/dpr-ctx_encoder_bert_uncased_L-12_H-128_A-2")
q_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/dpr-question_encoder_bert_uncased_L-12_H-128_A-2")

def get_title_text_combined(passage_dicts):
    # DPR encodes each passage as a (title, text) pair.
    res = []
    for p in passage_dicts:
        res.append((p['title'], p['text']))
    return res

processed_passages = get_title_text_combined(passage_dicts)

def extracted_passage_embeddings(processed_passages, model_config):
    passage_inputs = p_tokenizer.batch_encode_plus(
        processed_passages,
        add_special_tokens=True,
        truncation=True,
        padding="max_length",
        max_length=model_config.passage_max_seq_len,
        return_token_type_ids=True
    )
    passage_embeddings = passage_encoder.predict(
        [np.array(passage_inputs['input_ids']),
         np.array(passage_inputs['attention_mask']),
         np.array(passage_inputs['token_type_ids'])],
        batch_size=512,
        verbose=1)
    return passage_embeddings

passage_embeddings = extracted_passage_embeddings(processed_passages, model_config)

def extracted_query_embeddings(queries, model_config):
    query_inputs = q_tokenizer.batch_encode_plus(
        queries,
        add_special_tokens=True,
        truncation=True,
        padding="max_length",
        max_length=model_config.query_max_seq_len,
        return_token_type_ids=True
    )
    query_embeddings = query_encoder.predict(
        [np.array(query_inputs['input_ids']),
         np.array(query_inputs['attention_mask']),
         np.array(query_inputs['token_type_ids'])],
        batch_size=512,
        verbose=1)
    return query_embeddings

query_embeddings = extracted_query_embeddings(queries, model_config)
```
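To actually retrieve passages for a query, DPR ranks passages by the inner product between query and passage embeddings. A minimal sketch follows; it assumes the `predict` calls above return an object exposing the pooled vectors via `.pooler_output` (depending on your Transformers version, `predict` may instead return a tuple, in which case use index `[1]`):

```python
# Hypothetical retrieval step: rank passages by inner product, as in DPR.
passage_vecs = np.asarray(passage_embeddings.pooler_output)  # (num_passages, 128)
query_vecs = np.asarray(query_embeddings.pooler_output)      # (num_queries, 128)

scores = query_vecs @ passage_vecs.T           # (num_queries, num_passages)
top_ids = np.argsort(-scores, axis=1)[:, :10]  # top-10 passage indices per query

for query, ids in zip(queries, top_ids):
    # Print the titles of the 3 best-scoring passages for each query.
    print(query, "->", [processed_passages[i][0] for i in ids[:3]])
```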
### Training hyperparameters
The following hyperparameters were used during training:
- optimizer: None
- training_precision: float32
### Framework versions
- Transformers 4.15.0
- TensorFlow 2.7.0
- Tokenizers 0.10.3