Cannot reproduce the results in paper
I used this checkpoint and ran the script at https://github.com/huggingface/transformers/blob/v4.30-release/examples/research_projects/rag/eval_rag.py; however, the EM score on NQ dev is only 11.56, far below the 44.1 reported in the paper.
I used test set (not validation set) and reached 40+ EM score.
Is there a test set in Natural Questions? I only saw train and validation split in https://huggingface.co/datasets/google-research-datasets/natural_questions
You could try following the paper (rather than the Hugging Face dataset directly). It was either DPR or RAG; I can't remember which paper points to the right repo.
Thank you so much for the hints. I finally obtained F1 = 47.83 and EM = 39.43 on the validation set. While there is a significant difference between the Hugging Face google-research-datasets/natural_questions dataset and the Facebook version, the improvement mainly comes from a different implementation of the F1-score function. The Facebook version normalizes the answers and computes a token-wise F1 score, while the original Google version uses an answer-wise F1 score (based on whether the answer is null). The answer-wise F1 score is much stricter: a redundant space or comma makes the whole prediction count as wrong.
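For anyone hitting the same gap, here is a minimal sketch of the SQuAD-style, token-wise F1 that the DPR/RAG evaluation scripts use (not the official implementation, just an illustration of why normalization matters):

```python
# Sketch of token-wise F1 with SQuAD-style answer normalization.
# Lowercasing, punctuation/article removal, and whitespace collapsing
# mean that a stray comma or "the" no longer zeroes out the score.
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the Eiffel Tower,", "Eiffel Tower")` is 1.0 after normalization, whereas an answer-wise exact comparison would count the same prediction as completely wrong because of the comma and the article.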
Could you share your code?