---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- mt5-base
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnn/dm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# mt5-base-multilingual-summarization-multilarge-cs
This model is a checkpoint of [google/mt5-base](https://huggingface.co/google/mt5-base) fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.
## Task
The model generates multi-sentence summaries in eight languages. By combining a considerable number of Czech documents with documents in other languages, we aimed to improve summarization quality in Czech. Supported languages and their language tokens: ```'cs': '<extra_id_0>', 'en': '<extra_id_1>', 'de': '<extra_id_2>', 'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>', 'zh': '<extra_id_7>'```
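
The mapping above can be kept as a small lookup table. The sketch below only resolves a language code to its token. The helper name `language_token` is hypothetical, and how the token is passed to the model (for example, as an input prefix) is not specified in this card.

```python
# Language code -> language token, copied from the mapping above.
LANG_TOKENS = {
    "cs": "<extra_id_0>",
    "en": "<extra_id_1>",
    "de": "<extra_id_2>",
    "es": "<extra_id_3>",
    "fr": "<extra_id_4>",
    "ru": "<extra_id_5>",
    "tu": "<extra_id_6>",
    "zh": "<extra_id_7>",
}


def language_token(lang: str) -> str:
    """Return the token for a supported language code (hypothetical helper)."""
    try:
        return LANG_TOKENS[lang]
    except KeyError:
        supported = ", ".join(sorted(LANG_TOKENS))
        raise ValueError(
            f"Unsupported language {lang!r}; expected one of: {supported}"
        ) from None


print(language_token("cs"))
```

Failing early on an unknown code avoids silently summarizing with the wrong language token.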
## Usage
```python
from collections import OrderedDict

# NOTE: MultiSummarizer is the authors' helper class. It is not defined in this
# model card and must be made available before running this snippet.


## Configuration of summarization pipeline
#
def summ_config():
    cfg = OrderedDict([
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"),
        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"),
        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        # texts to summarize; values = (list of strings, string, dataset)
        ("texts",
            [
                "english text1 to summarize",
                "english text2 to summarize",
            ]
        ),
        # OPTIONAL: target summaries; values = (list of strings, string, None)
        ("golds",
            [
                "target english text1",
                "target english text2",
            ]
        ),
        # ("golds", None),
    ])
    return cfg


cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
summaries, scores = mSummarize(**cfg)
```
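
The `inference_cfg` keys above match the names of standard generation arguments in the `transformers` library. If the settings are passed to a model directly rather than through `MultiSummarizer`, one option is to drop unset (`None`) entries first. This is a minimal sketch under that assumption; `build_generation_kwargs` is a hypothetical helper and not part of the authors' code.

```python
from collections import OrderedDict


def build_generation_kwargs(inference_cfg):
    """Return a plain dict of generation settings with None values removed."""
    return {name: value for name, value in inference_cfg.items() if value is not None}


# Same values as in the configuration above.
inference_cfg = OrderedDict([
    ("num_beams", 4),
    ("top_k", 40),
    ("top_p", 0.92),
    ("do_sample", True),
    ("temperature", 0.95),
    ("repetition_penalty", 1.23),
    ("no_repeat_ngram_size", None),
    ("early_stopping", True),
    ("max_length", 128),
    ("min_length", 10),
])

gen_kwargs = build_generation_kwargs(inference_cfg)
print(sorted(gen_kwargs))

# With a loaded seq2seq model and tokenized inputs (not shown here), these
# settings could be passed as: model.generate(**inputs, **gen_kwargs)
```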
## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly news articles (including CNN/Daily Mail). The entire training set and 72% of the validation set were used for training.
```
Train set: 3 464 563 docs
Validation set: 121 260 docs
```
| Stats | fragment | | | avg document length | | avg summary length | | Documents |
|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
| __dataset__ |__compression__ | __density__ | __coverage__ | __nsent__ | __nwords__ | __nsent__ | __nwords__ | __count__ |
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K|
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K|
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K|
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
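
To illustrate what fixed-length truncation and padding mean here, the sketch below works on plain lists of token ids, independent of any tokenizer. The helper `pad_or_truncate` and the pad id of 0 are assumptions for illustration. In practice the tokenizer does this work and may also preserve an end-of-sequence token when truncating, which this sketch does not model.

```python
MAX_SOURCE_LEN = 512  # encoder limit stated above
MAX_TARGET_LEN = 128  # decoder limit stated above


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut token_ids to max_len, then right-pad with pad_id up to max_len."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


long_doc = list(range(1, 601))   # 600 fake token ids: gets truncated
short_summary = [5, 6, 7]        # 3 fake token ids: gets padded

source = pad_or_truncate(long_doc, MAX_SOURCE_LEN)
target = pad_or_truncate(short_summary, MAX_TARGET_LEN)
print(len(source), len(target))
```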
## Training
The model was trained with cross-entropy loss.
```
Time: 3 days 20 hours
Epochs: 1080K steps = 10 (from 10)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.462 - 1.797
tloss: 17.322 - 1.578
```
### ROUGE results per individual dataset test set:
| ROUGE | ROUGE-1 | | | ROUGE-2 | | | ROUGE-L | | |
|-----------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
| |Precision | Recall | Fscore | Precision | Recall | Fscore | Precision | Recall | Fscore |
| cnc | 30.62 | 19.83 | 23.44 | 9.94 | 6.52 | 7.67 | 22.92 | 14.92 | 17.6 |
| sumeczech | 27.57 | 17.6 | 20.85 | 8.12 | 5.23 | 6.17 | 20.84 | 13.38 | 15.81 |
| cnndm | 43.83 | 37.73 | 39.34 | 20.81 | 17.82 | 18.6 | 31.8 | 27.42 | 28.55 |
| xsum | 41.63 | 30.54 | 34.56 | 16.13 | 11.76 | 13.33 | 33.65 | 24.74 | 27.97 |
| mlsum-tu  | 54.4 | 43.29 | 46.2 | 38.78 | 31.31 | 33.23 | 48.18 | 38.44 | 41 |
| mlsum-de | 47.94 | 44.14 | 45.11 | 36.42 | 35.24 | 35.42 | 44.43 | 41.42 | 42.16 |
| mlsum-fr | 35.26 | 25.96 | 28.98 | 16.72 | 12.35 | 13.75 | 28.06 | 20.75 | 23.12 |
| mlsum-es | 33.37 | 24.84 | 27.52 | 13.29 | 10.05 | 11.05 | 27.63 | 20.69 | 22.87 |
| mlsum-ru | 0.79 | 0.66 | 0.66 | 0.26 | 0.2 | 0.22 | 0.79 | 0.66 | 0.65 |
| cnewsum | 24.49 | 24.38 | 23.23 | 6.48 | 6.7 | 6.24 | 24.18 | 24.04 | 22.91 |
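
For readers unfamiliar with the metric, the sketch below computes a simplified ROUGE-1 (unigram overlap) precision, recall, and F-score for a single summary pair. It assumes lowercased whitespace tokenization and is not the scorer used to produce the table. Note also that when scores are averaged per document, the averaged F-score generally differs from the harmonic mean of the averaged precision and recall, which would explain why the Fscore columns do not follow directly from the other two.

```python
from collections import Counter


def rouge1(candidate: str, reference: str):
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


p, r, f = rouge1("the cat sat on the mat", "the cat lay on a mat")
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```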