---
language:
- de
- en
- multilingual
license: cc-by-nc-sa-4.0
tags:
- summarization
datasets:
- cnn_dailymail
- xsum
- wiki_lingua
- mlsum
- swiss_text_2019
---

# mT5-small-sum-de-en-v1

This is a bilingual summarization model for English and German. It is based on the multilingual T5 model [google/mt5-small](https://huggingface.co/google/mt5-small).

[![One Conversation](https://raw.githubusercontent.com/telekom/HPOflow/main/docs/source/imgs/1c-logo.png)](https://www.welove.ai/)
This model is provided by the [One Conversation](https://www.welove.ai/)
team of [Deutsche Telekom AG](https://www.telekom.com/).

## Training

The training was conducted with the following hyperparameters:

- base model: [google/mt5-small](https://huggingface.co/google/mt5-small)
- source_prefix: `"summarize: "`
- batch size: 3
- max_source_length: 800
- max_target_length: 96
- warmup_ratio: 0.3
- number of train epochs: 10
- gradient accumulation steps: 2
- learning rate: 5e-5
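
The prefix and length limits above also matter at inference time. The following is a minimal sketch of how they could be applied with the `transformers` library, not an official usage recipe. The model ID is an assumption derived from the title of this card, and `num_beams=1` is our reading of "no beams" in the evaluation sections. It requires `transformers`, `torch` and `sentencepiece` to be installed.

```python
SOURCE_PREFIX = "summarize: "   # source_prefix from the list above
MAX_SOURCE_LENGTH = 800         # max_source_length from the list above
MAX_TARGET_LENGTH = 96          # max_target_length from the list above

# Assumption: Hub ID derived from the card title; adjust if it differs.
MODEL_ID = "deutsche-telekom/mt5-small-sum-de-en-v1"


def build_source(text: str) -> str:
    """Prepend the prefix the model was trained with."""
    return SOURCE_PREFIX + text.strip()


def summarize(text: str, model_id: str = MODEL_ID) -> str:
    """Summarize one English or German text with greedy decoding."""
    # Imported lazily so build_source() works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Truncate the prefixed input to the source length used in training.
    inputs = tokenizer(
        build_source(text),
        max_length=MAX_SOURCE_LENGTH,
        truncation=True,
        return_tensors="pt",
    )
    # num_beams=1 means greedy decoding (assumed meaning of "no beams").
    output_ids = model.generate(
        **inputs, max_length=MAX_TARGET_LENGTH, num_beams=1
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `summarize("...")` downloads the model on first use. Beam search (for example `num_beams=4`) may give different summaries than the greedy setting sketched here.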

## Datasets and Preprocessing

The datasets were preprocessed as follows:

Each summary was tokenized with the [google/mt5-small](https://huggingface.co/google/mt5-small) tokenizer, and only records whose summary had no more than 94 tokens were kept.

The MLSUM dataset has a peculiarity: the summary often appears verbatim in the article text as one or more sentences. These sentences were removed from the texts so that the model does not learn to produce summaries merely by extracting sentences.
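
The two preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the original preprocessing code: the regex sentence splitter and the whitespace `tokenize` default are stand-ins (the card uses the google/mt5-small tokenizer), and all function names are hypothetical.

```python
import re
from typing import Callable, Iterable, List, Tuple

MAX_SUMMARY_TOKENS = 94  # limit stated above


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def strip_summary_sentences(text: str, summary: str) -> str:
    """Drop sentences from the text that appear verbatim in the summary."""
    summary_sentences = set(split_sentences(summary))
    kept = [s for s in split_sentences(text) if s not in summary_sentences]
    return " ".join(kept)


def preprocess(
    records: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], list] = str.split,  # stand-in tokenizer
    strip_summary: bool = False,  # enable for MLSUM-style data
) -> List[Tuple[str, str]]:
    """Keep (text, summary) pairs whose summary fits the token limit."""
    result = []
    for text, summary in records:
        if len(tokenize(summary)) > MAX_SUMMARY_TOKENS:
            continue  # summary too long, skip the record
        if strip_summary:
            text = strip_summary_sentences(text, summary)
        result.append((text, summary))
    return result
```

To approximate the setup described above, one could pass the real tokenizer as `tokenize`, for example `AutoTokenizer.from_pretrained("google/mt5-small").tokenize`, and set `strip_summary=True` only for the MLSUM records.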

This model is trained on the following datasets:

| Name | Language | Size | License
|------|----------|------|--------
| [CNN Daily - Train](https://github.com/abisee/cnn-dailymail) | en | 218,223 | The license is unclear. The data comes from CNN and Daily Mail. We assume that it may only be used for research purposes and not commercially.
| [Extreme Summarization (XSum) - Train](https://github.com/EdinburghNLP/XSum) | en | 204,005 | The license is unclear. The data comes from BBC. We assume that it may only be used for research purposes and not commercially.
| [wiki_lingua English](https://github.com/esdurmus/Wikilingua) | en | 130,331 | [Creative Commons CC BY-NC-SA 3.0 License](https://www.wikihow.com/wikiHow:Terms-of-Use)
| [wiki_lingua German](https://github.com/esdurmus/Wikilingua) | de | 48,390 | [Creative Commons CC BY-NC-SA 3.0 License](https://www.wikihow.com/wikiHow:Terms-of-Use)
| [MLSUM German - Train](https://github.com/ThomasScialom/MLSUM) | de | 218,043 | Usage of dataset is restricted to non-commercial research purposes only. Copyright belongs to the original copyright holders (see [here](https://github.com/ThomasScialom/MLSUM#mlsum)).
| [SwissText 2019 - Train](https://www.swisstext.org/2019/shared-task/german-text-summarization-challenge.html) | de | 84,564 | The license is unclear. The data was published in the [German Text Summarization Challenge](https://www.swisstext.org/2019/shared-task/german-text-summarization-challenge.html). We assume that it may only be used for research purposes and not commercially.

| Language | Size
|------|------
| German | 350,997
| English | 552,559
| Total | 903,556

## Evaluation on MLSUM German Test Set (no beams)

| Model | rouge1 | rouge2 | rougeL | rougeLsum
|-------|--------|--------|--------|----------
| [ml6team/mt5-small-german-finetune-mlsum](https://huggingface.co/ml6team/mt5-small-german-finetune-mlsum) | 18.3607 | 5.3604 | 14.5456 | 16.1946
| **deutsche-telekom/mT5-small-sum-de-en-01 (this)** | **21.7336** | **7.2614** | **17.1323** | **19.3977**

## Evaluation on CNN Daily English Test Set (no beams)

| Model | rouge1 | rouge2 | rougeL | rougeLsum
|-------|--------|--------|--------|----------
| [sshleifer/distilbart-xsum-12-6](https://huggingface.co/sshleifer/distilbart-xsum-12-6) | 26.7664 | 8.8243 | 18.3703 | 23.2614
| [facebook/bart-large-xsum](https://huggingface.co/facebook/bart-large-xsum) | 28.5374 | 9.8565 | 19.4829 | 24.7364
| [mrm8488/t5-base-finetuned-summarize-news](https://huggingface.co/mrm8488/t5-base-finetuned-summarize-news) | 37.576 | 14.7389 | 24.0254 | 34.4634
| **deutsche-telekom/mT5-small-sum-de-en-01 (this)** | **37.6339** | **16.5317** | **27.1418** | **34.9951**


## Evaluation on Extreme Summarization (XSum) English Test Set (no beams)

| Model | rouge1 | rouge2 | rougeL | rougeLsum
|-------|--------|--------|--------|----------
| [mrm8488/t5-base-finetuned-summarize-news](https://huggingface.co/mrm8488/t5-base-finetuned-summarize-news) | 18.6204 | 3.535 | 12.3997 | 15.2111
| [facebook/bart-large-xsum](https://huggingface.co/facebook/bart-large-xsum) | 28.5374 | 9.8565 | 19.4829 | 24.7364
| deutsche-telekom/mT5-small-sum-de-en-01 (this) | 32.3416 | 10.6191 | 25.3799 | 25.3908
| [sshleifer/distilbart-xsum-12-6](https://huggingface.co/sshleifer/distilbart-xsum-12-6) | 44.2553 ♣ | 21.4289 ♣ | 36.2639 ♣ | 36.2696 ♣

♣: These values seem unusually high. The test set may have been included in that model's training data.
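
For readers unfamiliar with the metric columns in the tables above, here is a simplified sketch of ROUGE-N and ROUGE-L as F-measures scaled to 0-100. It is not the scoring code used for this card: it assumes lowercased whitespace tokens and no stemming, so its numbers will not match the reported values exactly. `rougeLsum` (the summary-level variant) is omitted.

```python
from collections import Counter
from typing import List


def _f_measure(overlap: int, ref_total: int, pred_total: int) -> float:
    """Harmonic mean of precision and recall, scaled to 0-100."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 100 * 2 * precision * recall / (precision + recall)


def _ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, prediction: str, n: int = 1) -> float:
    """N-gram overlap F-measure (n=1 for rouge1, n=2 for rouge2)."""
    ref = _ngrams(reference.lower().split(), n)
    pred = _ngrams(prediction.lower().split(), n)
    overlap = sum((ref & pred).values())  # clipped n-gram matches
    return _f_measure(overlap, sum(ref.values()), sum(pred.values()))


def rouge_l(reference: str, prediction: str) -> float:
    """F-measure based on the longest common subsequence of tokens."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    # Classic dynamic-programming table for the LCS length.
    table = [[0] * (len(pred) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, p in enumerate(pred, 1):
            if r == p:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f_measure(table[-1][-1], len(ref), len(pred))
```

For comparable numbers, an established implementation such as the `rouge_score` package is the safer choice, since tokenization and stemming details noticeably affect the scores.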

## License

Copyright (c) 2021 Philip May, Deutsche Telekom AG

This work is licensed under the [Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)](https://creativecommons.org/licenses/by-nc-sa/3.0/) license.