Dense Passage Retrieval (DPR) is a set of tools for state-of-the-art open-domain question answering. It was initially developed by Facebook, and there is an official repository. DPR retrieves the documents relevant to a given question and is composed of two models: one for encoding passages and the other for encoding questions. This model is the one used for encoding questions.
In a question answering system, this model vectorizes the incoming question. That encoding is then compared with the encodings of the documents in the database (produced with the passage encoder) to find the most similar documents, which are then used to either extract or generate the answer.
To train the model, we used SQuAD-ES, the Spanish version of SQuAD, from which we created positive and negative examples.
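The comparison step described above can be sketched as follows. This is a minimal illustration, not the authors' retrieval code: it assumes similarity is scored with a dot product (the usual choice for DPR), and the helper names `dot` and `rank_passages` and the toy 3-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(question_vec, passage_vecs, top_k=2):
    """Return (index, score) pairs for the top_k highest-scoring passages."""
    scores = [dot(question_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy embeddings (assumption): in practice these would be the pooled outputs
# of the question encoder and the passage encoder.
question_vec = [1.0, 0.0, 2.0]
passage_vecs = [
    [0.5, 0.1, 1.5],   # passage 0
    [-1.0, 2.0, 0.0],  # passage 1
    [1.0, 0.0, 1.0],   # passage 2
]

top = rank_passages(question_vec, passage_vecs)
print(top)
```

For a large database, the same ranking is usually delegated to a vector index rather than computed with a Python loop.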
Example of use:
    from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

    # Question-encoder checkpoint described on this card
    model_str = "IIC/dpr-spanish-question_encoder-squades-base"
    tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(model_str)
    model = DPRQuestionEncoder.from_pretrained(model_str)

    # Tokenize the question and use the pooled output as its embedding
    input_ids = tokenizer("¿Qué medallas ganó Usain Bolt en 2012?", return_tensors="pt")["input_ids"]
    embeddings = model(input_ids).pooler_output
The full metrics of this model on the evaluation split of SQuAD-ES are:

    eval_loss: 0.08608942725107592
    acc: 0.9925325215819639
    f1: 0.8805402320715237
    acc_and_f1: 0.9365363768267438
    average_rank: 0.27430093209054596
And the classification report:
                   precision    recall  f1-score   support

    hard_negative     0.9961    0.9961    0.9961    325878
         positive     0.8805    0.8805    0.8805     10514

         accuracy                         0.9925    336392
        macro avg     0.9383    0.9383    0.9383    336392
     weighted avg     0.9925    0.9925    0.9925    336392
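The aggregate figures above can be reproduced from the per-class rows. The sketch below assumes that `acc_and_f1` is the arithmetic mean of `acc` and `f1`, that the macro average is the unweighted mean over the two classes, and that the weighted average weights each class by its support. These are the usual definitions, and the reported numbers are consistent with them, but the card does not state them explicitly.

```python
# Values copied from the metrics and the classification report above.
acc = 0.9925325215819639
f1 = 0.8805402320715237

# Per-class F1 and support, as printed in the report (rounded to 4 decimals).
f1_by_class = {"hard_negative": 0.9961, "positive": 0.8805}
support = {"hard_negative": 325878, "positive": 10514}

# Assumed definition: plain mean of accuracy and F1.
acc_and_f1 = (acc + f1) / 2

# Macro average: every class counts equally.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: every class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / total

print(f"acc_and_f1:  {acc_and_f1:.4f}")
print(f"macro f1:    {macro_f1:.4f}")
print(f"weighted f1: {weighted_f1:.4f}")
```

Because hard negatives outnumber positives by roughly 31 to 1, the accuracy and the weighted average are dominated by the negative class. The F1 for the positive class (0.8805) is the more informative measure of retrieval quality.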
Thanks to @avacaondata, @alborotis, @albarji, @Dabs, @GuillemGSubies for adding this model.