---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- mt5-base
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnc/dm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# mt5-base-multilingual-summarization-multilarge-cs
This model is a fine-tuned checkpoint of [google/mt5-base](https://huggingface.co/google/mt5-base), trained on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.
## Task
The model produces multi-sentence summaries in eight languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve summarization in Czech. Supported languages: ```'cs': '<extra_id_0>', 'en': '<extra_id_1>', 'de': '<extra_id_2>', 'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>', 'zh': '<extra_id_7>'```
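The mapping above can be kept as a plain dictionary. The following is a minimal sketch that only looks up the token for a language code. The helper name `language_token` is hypothetical, and this card does not specify how the token is attached to the input or to the decoder, so that step is deliberately left out.

```python
# Language code -> sentinel token, copied from the mapping in this card.
LANG_TOKENS = {
    "cs": "<extra_id_0>",
    "en": "<extra_id_1>",
    "de": "<extra_id_2>",
    "es": "<extra_id_3>",
    "fr": "<extra_id_4>",
    "ru": "<extra_id_5>",
    "tu": "<extra_id_6>",
    "zh": "<extra_id_7>",
}


def language_token(lang: str) -> str:
    """Return the sentinel token for a language code (hypothetical helper)."""
    try:
        return LANG_TOKENS[lang]
    except KeyError:
        supported = ", ".join(sorted(LANG_TOKENS))
        raise ValueError(
            f"Unsupported language {lang!r}; expected one of: {supported}"
        ) from None


print(language_token("cs"))
```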
## Usage
```python
from collections import OrderedDict

# NOTE: MultiSummarizer is not defined in this snippet. It has to be provided
# by the authors' summarization code and imported before running this example.


## Configuration of summarization pipeline
#
def summ_config():
    cfg = OrderedDict([
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"),

        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"),

        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),

        # texts to summarize, values = (list of strings, string, dataset)
        ("texts",
            [
                "english text1 to summarize",
                "english text2 to summarize",
            ]
        ),

        # OPTIONAL: target summaries, values = (list of strings, string, None)
        ("golds",
            [
                "target english text1",
                "target english text2",
            ]),
        # ("golds", None),
    ])
    return cfg


cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
summaries, scores = mSummarize(**cfg)
```
## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.
```
Train set: 3 464 563 docs
Validation set: 121 260 docs
```
| Stats | fragment | | | avg document length | | avg summary length | | Documents |
|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
| __dataset__ |__compression__ | __density__ | __coverage__ | __nsent__ | __nwords__ | __nsent__ | __nwords__ | __count__ |
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K|
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K|
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K|
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
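The fragment columns in the table above look like the extractive-fragment statistics commonly used to describe summarization datasets. The sketch below assumes those standard definitions (greedy longest shared token runs, coverage as the share of summary tokens inside fragments, density as the mean squared fragment length, compression as the document-to-summary length ratio) and plain whitespace tokenization. The card does not say which implementation or tokenizer produced the reported numbers, and the function names are illustrative.

```python
from typing import List, Tuple


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token runs shared by summary and article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        # Find the longest article match for the summary suffix starting at i.
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best
        else:
            i += 1  # token does not occur in the article
    return fragments


def fragment_stats(article: List[str], summary: List[str]) -> Tuple[float, float, float]:
    """Return (compression, density, coverage) for one document/summary pair."""
    if not summary:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(article, summary)
    n = len(summary)
    coverage = sum(len(f) for f in frags) / n
    density = sum(len(f) ** 2 for f in frags) / n
    compression = len(article) / n
    return compression, density, coverage


article = "the cat sat on the mat near the door".split()
summary = "the cat sat quietly".split()
print(fragment_stats(article, summary))
```

The matching here is quadratic in the text length, which is fine for illustration but slow for a corpus with millions of documents.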
#### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
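As an illustration of what fixed-length truncation and padding do to a token-id sequence, here is a minimal pure-Python sketch. The helper `fit_to_length` is hypothetical, the pad id of 0 is an assumption based on the default mT5 tokenizer, and a real tokenizer would also take care of the end-of-sequence token when truncating, which this sketch ignores.

```python
from typing import List

MAX_SOURCE_LEN = 512  # encoder limit stated in this card
MAX_TARGET_LEN = 128  # decoder limit stated in this card
PAD_ID = 0            # assumption: default mT5 pad token id


def fit_to_length(ids: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Cut a sequence to max_len and right-pad it with pad_id (hypothetical helper)."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


print(fit_to_length([5, 6], 4))  # [5, 6, 0, 0]
print(len(fit_to_length(list(range(1, 601)), MAX_SOURCE_LEN)))  # 512
```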
## Training
The model was trained with cross-entropy loss.
```
Time: 3 days 20 hours
Epochs: 1080K steps = 10 (from 10)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.462 - 1.797
tloss: 17.322 - 1.578
```
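For readers unfamiliar with the objective, the sketch below computes token-level cross-entropy from raw logits in plain Python. It is only an illustration of the loss, not the training code. Skipping positions labeled `-100` follows a common convention for masking padding and is an assumption here, since the card does not describe how padded positions were handled.

```python
import math
from typing import Sequence


def token_cross_entropy(logits: Sequence[Sequence[float]],
                        targets: Sequence[int],
                        ignore_id: int = -100) -> float:
    """Mean negative log-likelihood of the target tokens, skipping ignore_id."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_id:
            continue  # masked position, e.g. padding (assumed convention)
        # Numerically stable log-sum-exp over the vocabulary dimension.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    if count == 0:
        raise ValueError("no target tokens to score")
    return total / count


# Uniform logits over 4 tokens give a loss of log(4).
print(token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))
```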
### ROUGE results per individual dataset test set:
| ROUGE | ROUGE-1 | | | ROUGE-2 | | | ROUGE-L | | |
|-----------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
| |Precision | Recall | Fscore | Precision | Recall | Fscore | Precision | Recall | Fscore |
| cnc | 30.62 | 19.83 | 23.44 | 9.94 | 6.52 | 7.67 | 22.92 | 14.92 | 17.6 |
| sumeczech | 27.57 | 17.6 | 20.85 | 8.12 | 5.23 | 6.17 | 20.84 | 13.38 | 15.81 |
| cnndm | 43.83 | 37.73 | 39.34 | 20.81 | 17.82 | 18.6 | 31.8 | 27.42 | 28.55 |
| xsum | 41.63 | 30.54 | 34.56 | 16.13 | 11.76 | 13.33 | 33.65 | 24.74 | 27.97 |
| mlsum-tu | 54.4 | 43.29 | 46.2 | 38.78 | 31.31 | 33.23 | 48.18 | 38.44 | 41 |
| mlsum-de | 47.94 | 44.14 | 45.11 | 36.42 | 35.24 | 35.42 | 44.43 | 41.42 | 42.16 |
| mlsum-fr | 35.26 | 25.96 | 28.98 | 16.72 | 12.35 | 13.75 | 28.06 | 20.75 | 23.12 |
| mlsum-es | 33.37 | 24.84 | 27.52 | 13.29 | 10.05 | 11.05 | 27.63 | 20.69 | 22.87 |
| mlsum-ru | 0.79 | 0.66 | 0.66 | 0.26 | 0.2 | 0.22 | 0.79 | 0.66 | 0.65 |
| cnewsum | 24.49 | 24.38 | 23.23 | 6.48 | 6.7 | 6.24 | 24.18 | 24.04 | 22.91 |
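To make the precision, recall and F-score columns concrete, the following sketch computes ROUGE-N from n-gram overlap on whitespace tokens. It is a simplified illustration: the evaluation toolkit, tokenization and any stemming used for the table above are not specified in this card, and the function name `rouge_n` is illustrative. The scores in the table appear to be scaled by 100, while the sketch returns values between 0 and 1.

```python
from collections import Counter
from typing import List, Tuple


def rouge_n(candidate: List[str], reference: List[str], n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, f-score) based on clipped n-gram overlap."""
    def ngrams(tokens: List[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


candidate = "the cat sat".split()
reference = "the cat sat on the mat".split()
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```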
# USAGE
```
soon
```