---
language:
- es
tags:
- sentence similarity
- passage retrieval
datasets:
- squad_es
- PlanTL-GOB-ES/SQAC
- IIC/bioasq22_es

metrics:
- eval_loss: 0.010779764448327261
- eval_accuracy: 0.9982682224158297
- eval_f1: 0.9446059155411182
- average_rank: 0.11728500598392888

model-index:
- name: dpr-spanish-passage_encoder-allqa-base
  results:
  - task:
      type: text similarity
      name: text similarity
    dataset:
      type: squad_es
      name: squad_es
      args: es
    metrics:
      - type: loss
        value: 0.010779764448327261
        name: eval_loss
      - type: accuracy
        value: 0.9982682224158297
        name: accuracy
      - type: f1
        value: 0.9446059155411182
        name: f1
      - type: avgrank
        value: 0.11728500598392888
        name: avgrank
---

[Dense Passage Retrieval](https://arxiv.org/abs/2004.04906) (DPR) is a set of tools for state-of-the-art open-domain question answering. It was initially developed by Facebook, and there is an [official repository](https://github.com/facebookresearch/DPR). DPR retrieves the documents relevant to answering a given question and is composed of two models: one for encoding passages and another for encoding questions. This model is the one used for encoding passages.

With this model and the [question encoder model](https://huggingface.co/avacaondata/dpr-spanish-question_encoder-allqa-base) we introduce what are, to the best of our knowledge, the best passage retrievers in Spanish to date. They improve over the [previous model we developed](https://huggingface.co/IIC/dpr-spanish-question_encoder-squades-base) by training for longer and with more data.

Regarding its use, this model should be used to vectorize the documents (passages) stored in the database of a question answering system. An incoming question is encoded with [the question encoder](https://huggingface.co/avacaondata/dpr-spanish-question_encoder-allqa-base), and that encoding is compared with the passage encodings to find the most similar documents, which should then be used either to extract the answer or to generate it.
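
The comparison step can be sketched as a dot-product ranking, which is the similarity used in the DPR paper. This is a minimal illustration only: the helper `rank_passages` and the toy 3-dimensional vectors are hypothetical stand-ins for the `pooler_output` embeddings produced by the two encoders, and a real system would typically use a vector index rather than a Python loop.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(question_vec, passage_vecs, top_k=2):
    """Return the indices of the top_k passages by dot-product score."""
    scored = [(dot(question_vec, p), i) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scored[:top_k]]


# Toy vectors (hypothetical); real embeddings come from the encoders.
passage_vecs = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.4, 0.4, 0.2],
]
question_vec = [0.9, 0.0, 0.1]

top = rank_passages(question_vec, passage_vecs)
print(top)
```

The passages at the returned indices are the ones that would be passed on to the answer extraction or generation step.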

To train the model, we used a collection of question answering datasets in Spanish:
- the Spanish version of SQuAD, [SQUAD-ES](https://huggingface.co/datasets/squad_es)
- [SQAC, the Spanish Question Answering Corpus](https://huggingface.co/datasets/PlanTL-GOB-ES/SQAC)
- [BioAsq22-ES](https://huggingface.co/datasets/IIC/bioasq22_es), which we translated with automatic translation using Transformers.

With this complete dataset we created positive and negative examples for the model (see [the paper](https://arxiv.org/abs/2004.04906) for details on the DPR training process). We trained for 25 epochs with the same configuration as the paper. The [previous DPR model](https://huggingface.co/IIC/dpr-spanish-passage_encoder-squades-base) was trained for only 3 epochs with about 60% of the data.
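
As a rough illustration of how positive and negative examples enter training, the DPR paper optimizes the negative log-likelihood of the positive passage against the negatives, using dot-product similarity scores. The sketch below computes that loss for a single question from precomputed scores. The function name `dpr_nll_loss` and the toy scores are hypothetical, and this is not the training code used for this model.

```python
import math


def dpr_nll_loss(pos_score, neg_scores):
    """Negative log-likelihood of the positive passage.

    Computes -log(exp(pos) / (exp(pos) + sum(exp(neg)))) with the
    max-subtraction trick for numerical stability.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score


# Toy similarity scores (hypothetical): the loss shrinks as the positive
# passage scores higher than the negatives.
loss_confused = dpr_nll_loss(1.0, [1.0, 1.0])
loss_confident = dpr_nll_loss(10.0, [1.0, 1.0])
print(loss_confused, loss_confident)
```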

Example of use:

```python
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# Load the passage (context) encoder and its tokenizer from the Hub.
model_str = "IIC/dpr-spanish-passage_encoder-allqa-base"
tokenizer = DPRContextEncoderTokenizer.from_pretrained(model_str)
model = DPRContextEncoder.from_pretrained(model_str)

# Tokenize one passage and take the pooled output as its dense embedding.
input_ids = tokenizer("Usain Bolt ganó varias medallas de oro en las Olimpiadas del año 2012", return_tensors="pt")["input_ids"]
embeddings = model(input_ids).pooler_output
```

The full metrics of this model on the evaluation split of SQUAD-ES are:

```
eval_loss: 0.010779764448327261
eval_acc: 0.9982682224158297
eval_f1: 0.9446059155411182
eval_acc_and_f1: 0.9714370689784739
eval_average_rank: 0.11728500598392888
```
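
The `eval_acc_and_f1` value is consistent with being the plain average of `eval_acc` and `eval_f1`. That reading is inferred from the numbers above rather than stated by the training script, and the short check below only reproduces the arithmetic.

```python
# Values copied from the metrics listed above.
eval_acc = 0.9982682224158297
eval_f1 = 0.9446059155411182

# Assumption: acc_and_f1 is the unweighted mean of the two metrics.
acc_and_f1 = (eval_acc + eval_f1) / 2
print(acc_and_f1)
```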

And the classification report:

```
                precision    recall  f1-score   support

hard_negative     0.9991    0.9991    0.9991   1104999
     positive     0.9446    0.9446    0.9446     17547

     accuracy                         0.9983   1122546
    macro avg     0.9719    0.9719    0.9719   1122546
 weighted avg     0.9983    0.9983    0.9983   1122546

```
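
To relate the averaged rows to the per-class rows, the sketch below recomputes the macro and support-weighted F1 from the rounded values in the report. Because the inputs are rounded to four decimals, the results match the report only approximately. It also shows how small the positive class is, which is why accuracy sits well above the positive-class F1.

```python
# Per-class F1 and support copied from the classification report above.
report = {
    "hard_negative": {"f1": 0.9991, "support": 1104999},
    "positive": {"f1": 0.9446, "support": 17547},
}

total = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total

# Fraction of evaluation pairs that are positives.
positive_share = report["positive"]["support"] / total

print(total, macro_f1, weighted_f1, positive_share)
```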

### Contributions
Thanks to [@avacaondata](https://huggingface.co/avacaondata), [@alborotis](https://huggingface.co/alborotis), [@albarji](https://huggingface.co/albarji), [@Dabs](https://huggingface.co/Dabs), [@GuillemGSubies](https://huggingface.co/GuillemGSubies) for adding this model.