---
language:
- cs
- en
- de
- fr
- tu
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- mt5-base
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnc/dm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# mt5-base-multilingual-summarization-multilarge-cs
This model is a checkpoint of [google/mt5-base](https://huggingface.co/google/mt5-base) fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.
## Task
The model generates multi-sentence summaries in eight languages. By combining a considerable amount of Czech documents with documents in other languages, we aimed to improve summarization in Czech. Supported languages and their associated tokens: ```'cs': '<extra_id_0>', 'en': '<extra_id_1>','de': '<extra_id_2>',  'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>',  'zh': '<extra_id_7>'```
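
The mapping above can be kept as a small lookup table. The sketch below is illustrative only: `LANG_TOKENS` mirrors the mapping listed in this card, while `language_token` and `tag_text` are hypothetical helpers. Prefixing the token to the input text is an assumption, since this card does not state how the token is passed to the model.

```python
# Language codes and their tokens, copied from the mapping in this card.
LANG_TOKENS = {
    "cs": "<extra_id_0>",
    "en": "<extra_id_1>",
    "de": "<extra_id_2>",
    "es": "<extra_id_3>",
    "fr": "<extra_id_4>",
    "ru": "<extra_id_5>",
    "tu": "<extra_id_6>",
    "zh": "<extra_id_7>",
}


def language_token(lang: str) -> str:
    """Return the token listed for a language code (hypothetical helper)."""
    try:
        return LANG_TOKENS[lang]
    except KeyError:
        supported = ", ".join(sorted(LANG_TOKENS))
        raise ValueError(f"Unsupported language {lang!r}; expected one of: {supported}") from None


def tag_text(text: str, lang: str) -> str:
    """Prefix the text with its language token.

    ASSUMPTION: the card does not say where the token goes; a prefix is used
    here purely for illustration.
    """
    return f"{language_token(lang)} {text}"
```

For example, `tag_text("Hello", "en")` yields the string `<extra_id_1> Hello` under the prefixing assumption.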

## Usage

```python

## Configuration of summarization pipeline
#
def summ_config():
    cfg = OrderedDict([
        
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"),
        
        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"), 
        
        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        # texts to summarize; accepted values: list of strings, a single string, or a dataset
        ("texts",
            [
               "english text1 to summarize",
               "english text2 to summarize",
            ]
        ),
        # OPTIONAL: target summaries; accepted values: list of strings, a single string, or None
        ("golds",
            [
               "target english text1",
               "target english text2",
            ]
        ),
        # ("golds", None),
    ])
    return cfg

cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
summaries, scores = mSummarize(**cfg)

```
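
To make the `top_k` and `top_p` entries of `inference_cfg` more concrete, here is a simplified sketch of what these two sampling filters do to a next-token distribution. It is an illustration on a toy vocabulary, not the implementation used by any particular library; `filter_top_k_top_p` is a hypothetical name, and the real generation code works on logits over the full vocabulary.

```python
from typing import Dict


def filter_top_k_top_p(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest nucleus reaching top_p."""
    # Step 1: top-k filtering, keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total_k = sum(p for _, p in ranked)
    ranked = [(token, p / total_k) for token, p in ranked]

    # Step 2: nucleus (top-p) filtering on the renormalized distribution:
    # keep tokens until their cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Step 3: renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = filter_top_k_top_p(toy, top_k=3, top_p=0.7)  # keeps "a" and "b"
```

With `do_sample` enabled, the next token is drawn from the filtered distribution, so larger `top_k` and `top_p` values allow more varied summaries.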



## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and Daily Mail articles. The entire training set and 72% of the validation set were used for training.
```
Train set:        3 464 563 docs
Validation set:     121 260 docs
```
| Dataset | Fragment compression | Fragment density | Fragment coverage | Avg document length (nsent) | Avg document length (nwords) | Avg summary length (nsent) | Avg summary length (nwords) | Documents (count) |
|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
| cnc      | 7.388    | 0.303               | 0.088              | 16.121 | 316.912 | 3.272  | 46.805 | 750K |
| sumeczech   | 11.769   | 0.471               | 0.115              | 27.857 | 415.711 | 2.765  | 38.644 | 1M |
| cnndm       | 13.688   | 2.983               | 0.538              | 32.783 | 676.026 | 4.134  | 54.036 | 300K |
| xsum        | 18.378   | 0.479               | 0.194              | 18.607 | 369.134 | 1.000  | 21.127 | 225K|
| mlsum/tu    | 8.666    | 5.418               | 0.461              | 14.271 | 214.496 | 1.793  | 25.675 | 274K |
| mlsum/de    | 24.741   | 8.235               | 0.469              | 32.544 | 539.653 | 1.951  | 23.077 | 243K|
| mlsum/fr    | 24.388   | 2.688               | 0.424              | 24.533 | 612.080 | 1.320  | 26.93  | 425K |
| mlsum/es    | 36.185   | 3.705               | 0.510              | 31.914 | 746.927 | 1.142  | 21.671 | 291K |
| mlsum/ru    | 78.909   | 1.194               | 0.246              | 62.141 | 948.079 | 1.012  | 11.976 | 27K|
| cnewsum     | 20.183   | 0.000               | 0.000              | 16.834 | 438.271 | 1.109  | 21.926 | 304K |
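
The card does not define the fragment statistics in the table above. The sketch below assumes they follow the commonly used extractive-fragment definitions (compression as the document-to-summary word ratio, coverage as the share of summary words that lie in fragments copied from the document, density as the mean squared fragment length per summary word). The function names and the whitespace tokenization are illustrative assumptions, not the authors' code.

```python
from typing import Dict, List


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token runs shared by summary and article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        i += max(best, 1)
    return fragments


def fragment_stats(article_text: str, summary_text: str) -> Dict[str, float]:
    # ASSUMPTION: lowercased whitespace tokens; the original tokenization is unknown.
    article = article_text.lower().split()
    summary = summary_text.lower().split()
    frags = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "compression": len(article) / n,
        "coverage": sum(len(f) for f in frags) / n,
        "density": sum(len(f) ** 2 for f in frags) / n,
    }


stats = fragment_stats(
    "the cat sat on the mat near the door",
    "the cat sleeps on the mat",
)
```

In this toy example the summary reuses the fragments "the cat" and "on the mat", so five of its six words are covered by the document.
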
### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary). 
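
The effect of these limits can be sketched on plain lists of token ids. This is a minimal illustration, not the preprocessing code used for training: `pad_or_truncate` is a hypothetical helper, the padding id of 0 is an assumption, and special tokens are ignored.

```python
from typing import List

ENCODER_MAX_LEN = 512  # input text limit stated in this card
DECODER_MAX_LEN = 128  # summary limit stated in this card
PAD_ID = 0             # ASSUMPTION: padding token id used for illustration


def pad_or_truncate(token_ids: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Cut sequences longer than max_len and right-pad shorter ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


long_input = list(range(1, 601))   # 600 token ids, longer than the encoder limit
short_summary = [11, 12, 13]       # 3 token ids, shorter than the decoder limit

encoder_ids = pad_or_truncate(long_input, ENCODER_MAX_LEN)
decoder_ids = pad_or_truncate(short_summary, DECODER_MAX_LEN)
```

With Hugging Face tokenizers, the same behavior is typically requested through `truncation=True`, `padding="max_length"` and `max_length=512` (or `128` for the summaries).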
## Training
The model was trained with a cross-entropy loss.
```
Time: 3 days 20 hours
Epochs: 10 of 10 (1080K steps)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.462 - 1.797
tloss: 17.322 - 1.578
```
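
As a reminder of what the cross-entropy loss measures, the sketch below computes it for a single summary from the probabilities a model assigns to the reference tokens. It is a toy calculation with made-up probabilities, assuming a mean over tokens and natural logarithms; the actual training code is not shown in this card.

```python
import math
from typing import Sequence


def cross_entropy(target_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference tokens (in nats)."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


# Hypothetical probabilities assigned to the four reference tokens of a summary.
loss = cross_entropy([0.9, 0.6, 0.3, 0.8])

# A model that is certain about every reference token has zero loss.
perfect = cross_entropy([1.0, 1.0, 1.0])
```

If the reported losses are mean per-token values in nats (an assumption), a loss of 1.578 corresponds to a geometric-mean probability of about `exp(-1.578) ≈ 0.21` for the reference tokens.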
### ROUGE results on each dataset's test set

| Dataset | ROUGE-1 Precision | ROUGE-1 Recall | ROUGE-1 Fscore | ROUGE-2 Precision | ROUGE-2 Recall | ROUGE-2 Fscore | ROUGE-L Precision | ROUGE-L Recall | ROUGE-L Fscore |
|-----------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
| cnc  | 30.62   | 19.83   | 23.44     | 9.94   | 6.52   | 7.67      | 22.92  | 14.92  | 17.6    |
| sumeczech | 27.57   | 17.6    | 20.85     | 8.12   | 5.23   | 6.17      | 20.84  | 13.38  | 15.81   |
| cnndm    | 43.83   | 37.73   | 39.34     | 20.81  | 17.82  | 18.6      | 31.8   | 27.42  | 28.55   |
| xsum     | 41.63   | 30.54   | 34.56     | 16.13  | 11.76  | 13.33     | 33.65  | 24.74  | 27.97   |
| mlsum-tu  | 54.4    | 43.29   | 46.2      | 38.78  | 31.31  | 33.23     | 48.18  | 38.44  | 41      |
| mlsum-de  | 47.94   | 44.14   | 45.11     | 36.42  | 35.24  | 35.42     | 44.43  | 41.42  | 42.16   |
| mlsum-fr  | 35.26   | 25.96   | 28.98     | 16.72  | 12.35  | 13.75     | 28.06  | 20.75  | 23.12   |
| mlsum-es  | 33.37   | 24.84   | 27.52     | 13.29  | 10.05  | 11.05     | 27.63  | 20.69  | 22.87   |
| mlsum-ru | 0.79    | 0.66    | 0.66      | 0.26   | 0.2    | 0.22      | 0.79   | 0.66   | 0.65    |
| cnewsum  | 24.49   | 24.38   | 23.23     | 6.48   | 6.7    | 6.24      | 24.18  | 24.04  | 22.91   |
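
To show how the precision, recall and F-score columns relate to each other, here is a simplified ROUGE-N sketch based on n-gram overlap. It assumes lowercased whitespace tokens and is not the scoring code behind the table (the card lists `rouge` and `rougeraw` as metrics); `rouge_n` is a hypothetical helper.

```python
from collections import Counter
from typing import Tuple


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) of n-gram overlap, each in [0, 1]."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


p, r, f = rouge_n("the cat sat", "the cat sat on the mat", n=1)
```

Here every candidate word appears in the reference, so precision is 1.0, while only half of the reference words are recovered, so recall is 0.5. The table values appear to be on a 0 to 100 scale.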
