metadata
language:
  - de
  - en
license: cc-by-nc-sa-3.0
tags:
  - summarization
datasets:
  - cnn_dailymail
  - xsum
  - wiki_lingua
  - mlsum
  - swiss_text_2019

mT5-small-sum-de-en-v1

This is a bilingual summarization model for English and German. It is based on the multilingual T5 model google/mt5-small.
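A minimal usage sketch follows, assuming the Hugging Face `transformers` library with PyTorch installed. The repository ID `deutsche-telekom/mt5-small-sum-de-en-v1` is an assumption, not stated in this card, and should be replaced with the actual location of the model. The prefix and length limits are taken from the hyperparameters listed under Training; `build_input` and `summarize` are hypothetical helper names.

```python
def build_input(text, prefix="summarize: "):
    """Prepend the source prefix used during training to the raw text."""
    return prefix + text.strip()


def summarize(text, model_name="deutsche-telekom/mt5-small-sum-de-en-v1", num_beams=5):
    """Generate a summary. model_name is an assumed repository ID."""
    # Imported lazily so build_input works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate the source to the max_source_length used in training.
    inputs = tokenizer(
        build_input(text), max_length=800, truncation=True, return_tensors="pt"
    )
    # Cap the output at the max_target_length used in training.
    output_ids = model.generate(**inputs, max_length=96, num_beams=num_beams)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the model on first use):
# print(summarize("Ein langer deutscher oder englischer Text ..."))
```

The same function works for German and English input, since the prefix is identical for both languages.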

This model is provided by the One Conversation team of Deutsche Telekom AG.

Training

The training was conducted with the following hyperparameters:

  • base model: google/mt5-small
  • source_prefix: "summarize: "
  • batch size: 3
  • max_source_length: 800
  • max_target_length: 96
  • warmup_ratio: 0.3
  • number of train epochs: 10
  • gradient accumulation steps: 2

Datasets and Preprocessing

The datasets were preprocessed as follows:

Each summary was tokenized with the google/mt5-small tokenizer, and only records whose summary has no more than 94 tokens were kept.
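The length filter can be sketched as follows. This is an illustration, not the original preprocessing script: the record layout (a dict with a `summary` key) and the helper name `filter_by_summary_length` are assumptions, and whether special tokens were counted toward the limit is not stated here. The tokenizer is passed in as a callable so the function can be tried without downloading anything.

```python
def filter_by_summary_length(records, tokenize, max_tokens=94):
    """Keep only records whose summary has at most max_tokens tokens.

    records:  iterable of dicts with a "summary" key (assumed layout)
    tokenize: callable mapping a string to a list of tokens
    """
    return [r for r in records if len(tokenize(r["summary"])) <= max_tokens]


# With the real tokenizer (requires transformers and a download):
# from transformers import AutoTokenizer
# tokenize = AutoTokenizer.from_pretrained("google/mt5-small").tokenize

# Stand-in whitespace tokenizer for a quick check:
sample = [
    {"text": "long article", "summary": "short summary"},
    {"text": "another article", "summary": "word " * 200},
]
kept = filter_by_summary_length(sample, str.split)
print(len(kept))
```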

The MLSUM dataset has a special characteristic: the summary often appears verbatim in the article text as one or more sentences. These sentences were removed from the texts, because we do not want to train a model that merely extracts sentences as a summary.
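One way to remove such sentences is sketched below. The exact procedure used for this model is not described, so this is only an approximation: it assumes a naive sentence split on `.`, `!` or `?` followed by whitespace and removes exact string matches only. `remove_summary_sentences` is a hypothetical helper name.

```python
import re


def remove_summary_sentences(text, summary):
    """Remove every summary sentence that occurs verbatim in the text."""
    # Naive sentence split (assumption): break after ., ! or ? plus whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    for sentence in sentences:
        text = text.replace(sentence, "")
    # Collapse the whitespace left behind by removed sentences.
    return re.sub(r"\s+", " ", text).strip()


article = "Berlin ist die Hauptstadt. Die Stadt hat viele Museen. Es regnet oft."
cleaned = remove_summary_sentences(article, "Die Stadt hat viele Museen.")
print(cleaned)
```

Sentences that were paraphrased or differ in punctuation are left untouched by this exact-match approach.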

This model is trained on the following datasets:

| Name | Language | Size | License |
|---|---|---|---|
| CNN Daily - Train | en | 218,223 | The license is unclear. The data comes from CNN and Daily Mail. We assume that it may only be used for research purposes and not commercially. |
| Extreme Summarization (XSum) - Train | en | 204,005 | The license is unclear. The data comes from BBC. We assume that it may only be used for research purposes and not commercially. |
| wiki_lingua English | en | 130,331 | Creative Commons CC BY-NC-SA 3.0 License |
| wiki_lingua German | de | 48,390 | Creative Commons CC BY-NC-SA 3.0 License |
| MLSUM German - Train | de | 218,043 | Usage of the dataset is restricted to non-commercial research purposes only. Copyright belongs to the original copyright holders. |
| SwissText 2019 - Train | de | 84,564 | The license is unclear. The data was published in the German Text Summarization Challenge. We assume that it may only be used for research purposes and not commercially. |

| Language | Size |
|---|---|
| German | 350,997 |
| English | 552,559 |
| Total | 903,556 |

Evaluation on MLSUM German Test Set (no beams)

| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| mT5-small-sum-de-en-01 (this) | 21.7336 | 7.2614 | 17.1323 | 19.3977 |
| ml6team/mt5-small-german-finetune-mlsum | 18.3607 | 5.3604 | 14.5456 | 16.1946 |

Evaluation on MLSUM German Test Set (5 beams)

| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| mT5-small-sum-de-en-01 (this) | 22.6018 | 7.8047 | 17.1363 | 19.719 |
| ml6team/mt5-small-german-finetune-mlsum | 19.6166 | 5.8818 | 14.74 | 16.889 |

Evaluation on CNN Daily English Test Set (no beams)

| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| mT5-small-sum-de-en-01 (this) | 37.6339 | 16.5317 | 27.1418 | 34.9951 |
| mrm8488/t5-base-finetuned-summarize-news | 37.576 | 14.7389 | 24.0254 | 34.4634 |
| facebook/bart-large-xsum | 28.5374 | 9.8565 | 19.4829 | 24.7364 |
| sshleifer/distilbart-xsum-12-6 | 26.7664 | 8.8243 | 18.3703 | 23.2614 |
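For readers unfamiliar with the metric, the sketch below shows the idea behind rouge1 as an F1 score over overlapping unigrams. It is a simplified illustration with lowercasing and whitespace tokenization (both assumptions), not the implementation used to produce the tables above, which is not named in this card. The reported values appear to be scaled to the range 0 to 100. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat")
print(round(100 * score, 4))
```

In this example, precision is 1 (both candidate words occur in the reference) and recall is 2/6, which gives an F1 of 0.5.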

License

Copyright (c) 2021 Philip May, Deutsche Telekom AG

This work is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.