Better results than those reported, and some questions about the model


Hi there,

My colleague @dennlinger and I are from the Institute of Computer Science at Heidelberg University, where we are currently investigating the performance of German abstractive summarizers. We are very interested in your model and have tested it (among others) on the full MLSUM test set (all samples). In case you are interested, the results are listed in the table below; they are even better than those reported in the model card.

| Parameters | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 |
|---|---|---|---|
| MLSUM test set (`max_length=354`, `min_length=13`, `do_sample=False`, `truncation=True`) | 0.4265 | 0.3321 | 0.3978 |
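
In case it helps with reproducing these numbers, here is a minimal sketch of our evaluation setup. The Hub model ID is a placeholder, and the use of the `rouge_score` package and the averaging of per-sample F1 scores are our own choices; only the generation parameters are the ones from the table above.

```python
from datasets import load_dataset
from transformers import pipeline
from rouge_score import rouge_scorer

model_id = "<this model's Hub ID>"  # placeholder: fill in the ID of this model
summarizer = pipeline("summarization", model=model_id)
dataset = load_dataset("mlsum", "de", split="test")

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
all_scores = {"rouge1": [], "rouge2": [], "rougeL": []}

for sample in dataset:
    # Generate with the parameters listed in the table above.
    prediction = summarizer(
        sample["text"],
        max_length=354,
        min_length=13,
        do_sample=False,
        truncation=True,
    )[0]["summary_text"]
    # Score the generated summary against the reference summary.
    result = scorer.score(sample["summary"], prediction)
    for key in all_scores:
        all_scores[key].append(result[key].fmeasure)

for key, values in all_scores.items():
    print(f"{key} F1: {sum(values) / len(values):.4f}")
```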

Aside from this, we have some further questions regarding the model and evaluation choices:

  1. Why did you use the validation set instead of the test set for the evaluation?
  2. The fine-tuned model was evaluated on 2000 random articles from the validation set. Did you use a seed in this process? Alternatively, could you share the subset of the 2000 articles that you used for the evaluation (IDs, etc.)?
  3. Could you specify the hyperparameter choices you used for training and evaluation? We were unable to load the training_args.bin file, which may be related to this issue.
  4. We checked the first five articles in the test set and found that the summaries primarily (4/5 articles) consist of copies of the leading sentences of the reference articles (see the sketch after this list). Are you aware of this problem, or did you perform any additional filtering?
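
For context on point 4, this is the kind of rough check we ran. It is only a sketch: the NLTK sentence splitting, the three-sentence lead window, and the verbatim-containment test are our own choices, not anything taken from the model card or the dataset.

```python
import nltk
from datasets import load_dataset

nltk.download("punkt", quiet=True)
dataset = load_dataset("mlsum", "de", split="test")

for sample in dataset.select(range(5)):
    # Take the first few sentences of the article as its "lead".
    lead = " ".join(nltk.sent_tokenize(sample["text"], language="german")[:3])
    summary = sample["summary"].strip()
    # Crude check: does the lead already contain the reference summary verbatim?
    print(sample["title"][:60], "-> copied from lead:", summary in lead)
```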

Thank you in advance for your response and input!

Best wishes,

Dennis and Jing
