---
language:
- es
tags:
- sentence similarity
- passage retrieval
datasets:
- squad_es
- IIC/bioasq22_es
metrics:
- eval_loss: 0.010779764448327261
- eval_accuracy: 0.9982682224158297
- eval_f1: 0.9446059155411182
- average_rank: 0.11728500598392888
model-index:
- name: dpr-spanish-passage_encoder-allqa-base
  results:
  - task:
      type: text similarity
      name: text similarity
    dataset:
      type: squad_es
      name: squad_es
      args: es
    metrics:
    - type: loss
      value: 0.010779764448327261
      name: eval_loss
    - type: accuracy
      value: 0.9982682224158297
      name: accuracy
    - type: f1
      value: 0.9446059155411182
      name: f1
    - type: avgrank
      value: 0.11728500598392888
      name: avgrank
---

Alejandro Vaca Serrano
[Dense Passage Retrieval]( (DPR) is a set of tools for performing state-of-the-art open-domain question answering. It was initially developed by Facebook, and there is an [official repository](. DPR is intended to retrieve the documents relevant to answering a given question, and is composed of two models: one for encoding passages and one for encoding questions. This particular model is the one used for encoding passages.
With this model and the [question encoder model](, we introduce the best passage retrievers in Spanish to date (to the best of our knowledge), improving over the [previous model we developed]( by training for longer and with more data.
Regarding its use, this model should be used to vectorize the passages in the document database of a Question Answering system. Incoming questions are encoded with [the question encoder](, and each question encoding is compared against the passage encodings to find the most similar documents, which should then be used for either extracting the answer or generating it.
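The retrieval step described above is a maximum inner product search: passages are pre-encoded once with this model, and each question embedding is scored against all of them by dot product. A minimal sketch with NumPy, where random vectors stand in for actual encoder outputs:

```python
import numpy as np

# Stand-ins for real DPR outputs: 5 passages and 1 question,
# each encoded as a 768-dimensional dense vector.
rng = np.random.default_rng(0)
passage_embeddings = rng.normal(size=(5, 768))  # from the passage encoder
question_embedding = rng.normal(size=(768,))    # from the question encoder

# DPR scores each passage by its dot product with the question embedding.
scores = passage_embeddings @ question_embedding

# Rank passages from most to least similar to the question.
ranking = np.argsort(-scores)
top_passage = ranking[0]
```

In a real system the passage embeddings would be precomputed and indexed (e.g. with FAISS) so this search scales to millions of documents.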
For training the model, we used a collection of Question Answering datasets in Spanish:
- the Spanish version of SQUAD, [SQUAD-ES](
- [SQAC- Spanish Question Answering Corpus](
- [BioAsq22-ES]( - we translated this last one by using automatic translation with Transformers.
With this combined dataset we created positive and negative examples for the model (for more details on the DPR training process, see [the paper](). We trained for 25 epochs with the same configuration as the paper. The [previous DPR model]( was trained for only 3 epochs with about 60% of the data.
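As an illustration of this data preparation (a simplified sketch, not the exact scripts used for this model), each question can be paired with the passage that answers it as the positive example, while passages belonging to other questions serve as negatives:

```python
# Simplified sketch of building DPR-style training examples:
# the answering passage is the positive, and passages from
# other questions are used as negatives.
def build_examples(qa_pairs, num_negatives=2):
    examples = []
    for i, (question, positive_passage) in enumerate(qa_pairs):
        negatives = [p for j, (_, p) in enumerate(qa_pairs) if j != i]
        examples.append({
            "question": question,
            "positive": positive_passage,
            "negatives": negatives[:num_negatives],
        })
    return examples

qa_pairs = [
    ("¿Quién ganó varias medallas de oro en 2012?",
     "Usain Bolt ganó varias medallas de oro en las Olimpiadas del año 2012."),
    ("¿Dónde nació Cervantes?",
     "Miguel de Cervantes nació en Alcalá de Henares."),
    ("¿Cuál es la capital de Perú?",
     "Lima es la capital de Perú."),
]
examples = build_examples(qa_pairs)
```

The original DPR recipe also mines "hard negatives" (high-scoring but wrong passages from a retriever such as BM25), which this sketch omits for brevity.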
Example of use:

```python
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

model_str = "IIC/dpr-spanish-passage_encoder-allqa-base"
tokenizer = DPRContextEncoderTokenizer.from_pretrained(model_str)
model = DPRContextEncoder.from_pretrained(model_str)

# Tokenize a passage and obtain its dense embedding.
input_ids = tokenizer(
    "Usain Bolt ganó varias medallas de oro en las Olimpiadas del año 2012",
    return_tensors="pt",
)["input_ids"]
embeddings = model(input_ids).pooler_output
```
The full metrics of this model on the evaluation split of SQUAD-ES are:

```
eval_loss: 0.010779764448327261
eval_acc: 0.9982682224158297
eval_f1: 0.9446059155411182
eval_acc_and_f1: 0.9714370689784739
eval_average_rank: 0.11728500598392888
```
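The average rank above measures, for each question, the 0-based position of the true (positive) passage after sorting candidates by score, averaged over questions; 0 would mean the positive is always ranked first. A hedged sketch of how such a metric can be computed (the exact averaging in the evaluation scripts may differ):

```python
import numpy as np

def average_rank(scores, positive_idx):
    """Mean 0-based rank of the positive passage across queries.

    scores: (num_queries, num_candidates) similarity matrix
    positive_idx: index of the true passage for each query
    """
    order = np.argsort(-scores, axis=1)  # best-scoring candidate first
    # Position of the positive passage within each sorted candidate list.
    ranks = np.argmax(order == np.array(positive_idx)[:, None], axis=1)
    return float(ranks.mean())

# Toy example: 2 queries, 3 candidate passages each.
scores = np.array([
    [0.9, 0.1, 0.2],  # positive (index 0) ranked first  -> rank 0
    [0.3, 0.8, 0.5],  # positive (index 2) ranked second -> rank 1
])
print(average_rank(scores, [0, 2]))  # -> 0.5
```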
And the classification report:

```
               precision    recall  f1-score   support

hard_negative     0.9991    0.9991    0.9991   1104999
     positive     0.9446    0.9446    0.9446     17547

     accuracy                         0.9983   1122546
    macro avg     0.9719    0.9719    0.9719   1122546
 weighted avg     0.9983    0.9983    0.9983   1122546
```
### Contributions
Thanks to [@avacaondata](, [@alborotis](, [@albarji](, [@Dabs](, [@GuillemGSubies]( for adding this model.