Cannot reproduce the results in paper

#3
by HugoZ - opened

I used this checkpoint and ran the script at https://github.com/huggingface/transformers/blob/v4.30-release/examples/research_projects/rag/eval_rag.py; however, the EM score on the NQ dev set is only 11.56, far below the 44.1 reported in the paper.

I used the test set (not the validation set) and reached an EM score of 40+.

HugoZ changed discussion status to closed

I used the test set (not the validation set) and reached an EM score of 40+.

Is there a test set in Natural Questions? I only see train and validation splits at https://huggingface.co/datasets/google-research-datasets/natural_questions
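
In case it helps, here is a quick way to double-check which splits the Hub copy actually exposes (a minimal sketch using `datasets.get_dataset_split_names`, which only reads the dataset metadata and does not download the data):

```python
from datasets import get_dataset_split_names

# List the splits published for the Hub copy of Natural Questions.
# Expected output: ['train', 'validation'] -- no test split.
print(get_dataset_split_names("google-research-datasets/natural_questions"))
```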

You could try following the paper's data setup rather than the Hub dataset directly. It's in either the DPR or the RAG paper; I can't remember which one points to the right repo.

Thank you so much for the hints. I finally obtained F1 = 47.83 and EM = 39.43 on the validation set. Although there is a significant difference between the Hugging Face google-research-datasets/natural_questions dataset and the Facebook version, the improvement mainly comes from the different implementations of the F1-score function. The Facebook version normalizes the answers and computes a token-wise F1 score, while the original Google version uses an answer-wise F1 score (based on whether the predicted answer is null). The answer-wise F1 score is stricter: a redundant space or comma turns an otherwise correct prediction into a wrong one.
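
To make the difference concrete, here is a rough sketch of the two scoring styles (my own simplified illustration with made-up example strings, not the exact code from either repo):

```python
import re
import string
from collections import Counter


def normalize(text):
    """SQuAD/DPR-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, gold):
    """Token-wise F1 over normalized tokens (the Facebook/DPR-style metric)."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_match(prediction, gold):
    """Answer-wise scoring (simplified): the whole answer string must match exactly."""
    return float(prediction == gold)


pred, gold = "July 4,  1776", "july 4 1776"
print(token_f1(pred, gold))      # 1.0 -- punctuation and spacing are normalized away
print(answer_match(pred, gold))  # 0.0 -- the same stray comma makes it a complete miss
```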

Could you share your full evaluation code?
