Evaluation scores differ from the Google paper

by zhongwei - opened

I just evaluated the model using run_summarization.py on the Hugging Face dataset ccdv/arxiv-summarization and got a ROUGE-1 score of 41.68.
The ROUGE-1 score reported in the Google paper ( https://arxiv.org/pdf/2208.04347.pdf ) for PEGASUS-X Base on arXiv is 49.4.
What explains this large difference, and how can we reproduce the paper's score with the Hugging Face setup?
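
To make clear which number I am comparing: by ROUGE-1 I mean the F1 of unigram overlap between the generated summary and the reference abstract, reported on a 0-100 scale. Below is a minimal sketch of that definition, assuming lowercasing and whitespace tokenization. The function name `rouge1_f1` and the sample sentences are made up for illustration, and I have not verified that this matches the exact scorer used by the script or by the paper, which may tokenize or stem differently.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Assumption: lowercasing plus whitespace tokenization, no stemming.
    Real ROUGE implementations may preprocess text differently.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both texts (multiset intersection).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the arXiv dataset.
    reference = "the cat sat on the mat"
    candidate = "the cat lay on a mat"
    print(f"ROUGE-1 F1 (0-100 scale): {100 * rouge1_f1(reference, candidate):.2f}")
```

Since preprocessing choices change the unigram counts, it would help to know which tokenization, stemming, and input/output length settings the paper's numbers assume.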
