PhilipMay committed on
Commit
0832ead
1 Parent(s): 2c2efd8

initial upload

README.md ADDED
@@ -0,0 +1,75 @@
+ # mT5-small-sum-de-en-v1
+
+ ---
+ languages:
+ - de-DE
+ - en-EN
+
+ license: CC BY-NC-SA 3.0
+
+ tags:
+ - MT5
+ - summarization
+
+ datasets:
+ - cnn_dailymail
+ - xsum
+ - wiki_lingua
+ - mlsum
+ - swiss_text_2019
+ ---
+
+ This is a bilingual summarization model for English and German. It is based on the multilingual T5 model [google/mt5-small](https://huggingface.co/google/mt5-small).
+
+ ## Training
+
+ The training was conducted with the following hyperparameters:
+
+ - base model: [google/mt5-small](https://huggingface.co/google/mt5-small)
+ - source_prefix: `"summarize: "`
+ - batch size: 3
+ - max_source_length: 800
+ - max_target_length: 96
+ - warmup_ratio: 0.3
+ - number of train epochs: 10
+ - gradient accumulation steps: 2
+
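The hyperparameters above imply an effective batch size and a warmup length that the list does not spell out. The following is a rough sketch of that arithmetic, not the author's training script. It assumes a single device, that `batch size` means the per-device batch size, and that warmup is counted in optimizer steps. The function name `schedule_summary` is illustrative, the record count in the usage line is the total from the dataset size table, and a real training loop may round the step counts differently.

```python
import math


def schedule_summary(num_records, batch_size=3, grad_accum_steps=2,
                     epochs=10, warmup_ratio=0.3, num_devices=1):
    """Approximate optimizer-step counts implied by the listed hyperparameters."""
    # Number of examples consumed per optimizer update.
    effective_batch = batch_size * grad_accum_steps * num_devices
    # Assumption: a trailing partial batch still counts as one step.
    steps_per_epoch = math.ceil(num_records / effective_batch)
    total_steps = steps_per_epoch * epochs
    # Warmup covers the first `warmup_ratio` share of all optimizer steps.
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    # 903,556 is the total number of training records listed in the README.
    print(schedule_summary(num_records=903_556))
```

With a batch size of 3 and 2 gradient accumulation steps, each optimizer update sees 6 examples under these assumptions.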
+ ## Datasets and Preprocessing
+
+ The datasets were preprocessed as follows:
+
+ Each summary was tokenized with the [google/mt5-small](https://huggingface.co/google/mt5-small) tokenizer, and only records whose summary has no more than 94 tokens were kept.
+
+ The MLSUM dataset has a special characteristic: the summary is often contained verbatim in the text as one or more sentences. These sentences have been removed from the texts, because we do not want to train a model that merely extracts sentences as its summary.
+
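The two preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's preprocessing code: the record layout (`"text"` and `"summary"` keys), the function names, and the naive regex sentence splitter are all hypothetical. The `tokenize` argument stands in for the google/mt5-small tokenizer; the usage lines pass `str.split` only so that the example runs without extra dependencies.

```python
import re

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
# Assumption for illustration; abbreviations such as "z. B." would be split wrongly.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def keep_short_summaries(records, tokenize, max_tokens=94):
    """Keep only records whose summary has at most `max_tokens` tokens."""
    return [r for r in records if len(tokenize(r["summary"])) <= max_tokens]


def strip_summary_sentences(text, summary):
    """Remove sentences from `text` that also appear verbatim in `summary`."""
    summary_sentences = {
        s.strip() for s in _SENTENCE_BOUNDARY.split(summary) if s.strip()
    }
    kept = [
        s.strip()
        for s in _SENTENCE_BOUNDARY.split(text)
        if s.strip() and s.strip() not in summary_sentences
    ]
    return " ".join(kept)


if __name__ == "__main__":
    records = [
        {"text": "Erster Satz. Zweiter Satz. Dritter Satz.", "summary": "Zweiter Satz."},
    ]
    # str.split is a whitespace stand-in for the real subword tokenizer.
    for record in keep_short_summaries(records, tokenize=str.split):
        print(strip_summary_sentences(record["text"], record["summary"]))
```

With the real tokenizer, `tokenize` would be a callable that returns the subword tokens of a string, so the 94-token limit is measured in the same units the model sees.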
+ The model was trained on the following datasets:
+
+ | Name | Language | Size | License
+ |------|----------|------|--------
+ | [CNN / Daily Mail - Train](https://github.com/abisee/cnn-dailymail) | en | 218,223 | The license is unclear. The data comes from CNN and Daily Mail. We assume that it may only be used for research purposes and not commercially.
+ | [Extreme Summarization (XSum) - Train](https://github.com/EdinburghNLP/XSum) | en | 204,005 | The license is unclear. The data comes from BBC. We assume that it may only be used for research purposes and not commercially.
+ | [wiki_lingua English](https://github.com/esdurmus/Wikilingua) | en | 130,331 | [Creative Commons CC BY-NC-SA 3.0 License](https://www.wikihow.com/wikiHow:Terms-of-Use)
+ | [wiki_lingua German](https://github.com/esdurmus/Wikilingua) | de | 48,390 | [Creative Commons CC BY-NC-SA 3.0 License](https://www.wikihow.com/wikiHow:Terms-of-Use)
+ | [MLSUM German - Train](https://github.com/ThomasScialom/MLSUM) | de | 218,043 | Usage of dataset is restricted to non-commercial research purposes only. Copyright belongs to the original copyright holders (see [here](https://github.com/ThomasScialom/MLSUM#mlsum)).
+ | [SwissText 2019 - Train](https://www.swisstext.org/2019/shared-task/german-text-summarization-challenge.html) | de | 84,564 | The license is unclear. The data was published in the [German Text Summarization Challenge](https://www.swisstext.org/2019/shared-task/german-text-summarization-challenge.html). We assume that it may only be used for research purposes and not commercially.
+
+ | Language | Size
+ |------|------
+ | German | 350,997
+ | English | 552,559
+ | Total | 903,556
+
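The per-language totals follow directly from the per-dataset sizes. As a quick consistency check, the sums can be reproduced with a few lines of Python; the short dataset labels and the function name `totals_by_language` are illustrative, while the numbers are copied from the table above.

```python
# (dataset label, language, number of training records) as listed in the README.
DATASET_SIZES = [
    ("cnn_dailymail", "en", 218_223),
    ("xsum", "en", 204_005),
    ("wiki_lingua (en)", "en", 130_331),
    ("wiki_lingua (de)", "de", 48_390),
    ("mlsum (de)", "de", 218_043),
    ("swiss_text_2019", "de", 84_564),
]


def totals_by_language(rows):
    """Sum the record counts per language code."""
    totals = {}
    for _name, language, size in rows:
        totals[language] = totals.get(language, 0) + size
    return totals


if __name__ == "__main__":
    totals = totals_by_language(DATASET_SIZES)
    print(totals, sum(totals.values()))
```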
+ ## Evaluation on MLSUM German Test Set
+
+ | Model | Params | rouge1 | rouge2 | rougeL | rougeLsum
+ |-------|--------|--------|--------|--------|----------
+ | mT5-small-sum-de-en-01 (this) | no beam | 21.7336 | 7.2614 | 17.1323 | 19.3977
+ | mT5-small-sum-de-en-01 (this) | num_beams: 5 | 22.6018 | 7.8047 | 17.1363 | 19.719
+ | [ml6team/mt5-small-german-finetune-mlsum](https://huggingface.co/ml6team/mt5-small-german-finetune-mlsum) | no beam | 18.3607 | 5.3604 | 14.5456 | 16.1946
+ | [ml6team/mt5-small-german-finetune-mlsum](https://huggingface.co/ml6team/mt5-small-german-finetune-mlsum) | num_beams: 5 | xxx | xxx | xxx | xxx
+
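The two decoding settings in the table ("no beam" and "num_beams: 5") correspond to the `num_beams` argument of `generate` in the `transformers` library. The sketch below shows how a summary might be produced with either setting. It is an assumption-laden illustration, not the evaluation script: `model_path` is a placeholder for a local directory or hub ID that holds the files of this commit, the function names are hypothetical, and reusing the training values for `max_source_length` and `max_target_length` at inference time is an assumption.

```python
def build_model_input(text, prefix="summarize: "):
    """Prepend the source prefix listed in the training hyperparameters."""
    return prefix + text


def summarize(text, model_path, num_beams=1,
              max_source_length=800, max_target_length=96):
    """Generate a summary; num_beams=1 is greedy decoding, 5 matches the table."""
    # Imported lazily so build_model_input works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

    # Truncate long articles to the source length used during training.
    inputs = tokenizer(
        build_model_input(text),
        max_length=max_source_length,
        truncation=True,
        return_tensors="pt",
    )
    output_ids = model.generate(
        **inputs,
        max_length=max_target_length,
        num_beams=num_beams,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Hypothetical usage, with "path/to/model" standing in for the real location:
#   summarize(article_text, "path/to/model")               # no beam search
#   summarize(article_text, "path/to/model", num_beams=5)  # beam search
```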
+ ## License
+
+ Copyright (c) 2021 Philip May, Deutsche Telekom AG
+
+ This work is licensed under the [Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)](https://creativecommons.org/licenses/by-nc-sa/3.0/) license.
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "_name_or_path": "google/mt5-small",
+   "architectures": [
+     "MT5ForConditionalGeneration"
+   ],
+   "d_ff": 1024,
+   "d_kv": 64,
+   "d_model": 512,
+   "decoder_start_token_id": 0,
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "mt5",
+   "num_decoder_layers": 8,
+   "num_heads": 6,
+   "num_layers": 8,
+   "pad_token_id": 0,
+   "relative_attention_num_buckets": 32,
+   "tie_word_embeddings": false,
+   "tokenizer_class": "T5Tokenizer",
+   "transformers_version": "4.7.0.dev0",
+   "use_cache": true,
+   "vocab_size": 250100
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b3084bb9cd29b0f9e2fc99de6555c2313e045f725303683bdf4146fe95b751d4
+ size 1200726749
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "additional_special_tokens": []}
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
+ size 4309802
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 0}