---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- mbart-large-cc25
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnndm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---

# mbart25-multilingual-summarization-multilarge-cs
This model is a checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) fine-tuned on the Multilingual large summarization dataset (multilarge), which is focused on Czech texts, to produce multilingual summaries. 

## Task
The model produces multi-sentence summaries in eight languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve the model's summarization quality in Czech. Supported languages (mBART language code : ISO code): 'cs_CZ': 'cs', 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr', 'zh_CN': 'zh'.

# USAGE
The following assumes that you are using the provided MultilingualSummarizer.ipynb notebook and the accompanying files from the git repository.

```python
from collections import OrderedDict  # needed for the config below

## Configuration of the summarization pipeline
#  (MultiSummarizer is defined in the files included in the repository)
def summ_config():
    cfg = OrderedDict([
        
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"),
        
        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"), 
        
        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        #texts to summarize; accepted values: list of strings, a single string, or a dataset
        ("texts",
            [
               "english text1 to summarize",
               "english text2 to summarize",
            ]
        ),
        #OPTIONAL: target (gold) summaries; accepted values: list of strings, a single string, or None
        ('golds',
         [
               "target english text1",
               "target english text2",
         ]),
        #('golds', None),
    ])
    return cfg

cfg = summ_config()
msummarizer = MultiSummarizer(**cfg)
ret = msummarizer(**cfg)
```
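
The notebook wrapper is not strictly required. Below is a minimal sketch of loading the checkpoint directly with Hugging Face Transformers, assuming the repository ships the standard mBART tokenizer files; the generation parameters mirror `inference_cfg` above, and the `decoder_start_token_id` handling is an assumption about what `MultiSummarizer` does internally.

```python
from transformers import AutoTokenizer, MBartForConditionalGeneration

model_name = "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"
# src_lang selects the mBART language code of the input text ("en_XX" here)
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

text = "english text1 to summarize"
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")

summary_ids = model.generate(
    **inputs,
    # generation parameters mirroring inference_cfg above
    num_beams=4,
    do_sample=True,
    top_k=40,
    top_p=0.92,
    temperature=0.95,
    repetition_penalty=1.23,
    early_stopping=True,
    max_length=128,
    min_length=10,
    # assumption: start decoding with the target-language code token
    decoder_start_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```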

## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily-mail articles. Training used the entire training set and 72% of the validation set. In the table below, compression, density, and coverage are extractive-fragment statistics, and nsent/nwords are the average number of sentences and words per document and per summary; a sketch of how the fragment statistics are computed follows the table.
```
Train set:        3 464 563 docs
Validation set:     121 260 docs
```
| dataset     | compression | density | coverage | avg doc nsent | avg doc nwords | avg summary nsent | avg summary nwords | documents |
|-------------|-------------|---------|----------|---------------|----------------|-------------------|--------------------|-----------|
| cnc      | 7.388    | 0.303               | 0.088              | 16.121 | 316.912 | 3.272  | 46.805 | 750K |
| sumeczech   | 11.769   | 0.471               | 0.115              | 27.857 | 415.711 | 2.765  | 38.644 | 1M |
| cnndm       | 13.688   | 2.983               | 0.538              | 32.783 | 676.026 | 4.134  | 54.036 | 300K |
| xsum        | 18.378   | 0.479               | 0.194              | 18.607 | 369.134 | 1.000  | 21.127 | 225K|
| mlsum/tu    | 8.666    | 5.418               | 0.461              | 14.271 | 214.496 | 1.793  | 25.675 | 274K |
| mlsum/de    | 24.741   | 8.235               | 0.469              | 32.544 | 539.653 | 1.951  | 23.077 | 243K|
| mlsum/fr    | 24.388   | 2.688               | 0.424              | 24.533 | 612.080 | 1.320  | 26.93  | 425K |
| mlsum/es    | 36.185   | 3.705               | 0.510              | 31.914 | 746.927 | 1.142  | 21.671 | 291K |
| mlsum/ru    | 78.909   | 1.194               | 0.246              | 62.141 | 948.079 | 1.012  | 11.976 | 27K|
| cnewsum     | 20.183   | 0.000               | 0.000              | 16.834 | 438.271 | 1.109  | 21.926 | 304K |
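
The fragment statistics above follow the extractive-fragment definitions of Grusky et al. (2018, Newsroom). A rough sketch of how such numbers can be computed is shown below; it uses naive whitespace tokenization, so the authors' exact tooling and preprocessing may differ.

```python
def extractive_fragments(article, summary):
    """Greedily match the longest shared token spans of the summary in the article."""
    fragments, i = [], 0
    while i < len(summary):
        best, j = [], 0
        while j < len(article):
            if summary[i] == article[j]:
                ii, jj = i, j
                while ii < len(summary) and jj < len(article) and summary[ii] == article[jj]:
                    ii, jj = ii + 1, jj + 1
                if jj - j > len(best):
                    best = summary[i:ii]
                j = jj
            else:
                j += 1
        if best:
            fragments.append(best)
        i += max(len(best), 1)
    return fragments

def fragment_stats(article_text, summary_text):
    # naive whitespace tokenization for illustration only
    article, summary = article_text.split(), summary_text.split()
    frags = extractive_fragments(article, summary)
    return {
        "compression": len(article) / len(summary),
        "coverage": sum(len(f) for f in frags) / len(summary),
        "density": sum(len(f) ** 2 for f in frags) / len(summary),
    }

print(fragment_stats("the cat sat on the mat near the door", "the cat sat near the door"))
```
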
#### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary). 
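
A minimal sketch of this setup with the standard Transformers tokenizer call (assumed API; the training scripts in the repository may prepare inputs differently):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ctu-aic/mbart25-multilingual-summarization-multilarge-cs",
    src_lang="cs_CZ", tgt_lang="cs_CZ",  # Czech example; pick the codes for your language
)

# encoder side: input document, truncated/padded to 512 tokens
model_inputs = tokenizer(
    "document text ...", max_length=512, truncation=True,
    padding="max_length", return_tensors="pt",
)
# decoder side: reference summary, truncated/padded to 128 tokens
# (on older transformers versions, use `with tokenizer.as_target_tokenizer():` instead)
labels = tokenizer(
    text_target="summary text ...", max_length=128, truncation=True,
    padding="max_length", return_tensors="pt",
)
model_inputs["labels"] = labels["input_ids"]
```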

## Training
The model was trained with the standard cross-entropy loss.
```
Time:   3 days 8 hours
Steps:  860K (approximately 8 of the 10 planned epochs)
GPUs:   4x NVIDIA A100-SXM4-40GB
eloss (validation loss): 2.214 -> 1.762
tloss (training loss):   3.365 -> 1.445
```
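
The reported losses are the token-level cross-entropy that the model returns when labels are supplied. A minimal sketch of that objective (the actual multi-GPU training scripts are not part of this card; names and texts below are placeholders):

```python
from transformers import AutoTokenizer, MBartForConditionalGeneration

name = "facebook/mbart-large-cc25"  # the starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="cs_CZ", tgt_lang="cs_CZ")
model = MBartForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("document text ...", max_length=512, truncation=True, return_tensors="pt")
labels = tokenizer(text_target="summary text ...", max_length=128, truncation=True, return_tensors="pt")

outputs = model(**inputs, labels=labels["input_ids"])
print(outputs.loss)      # token-level cross-entropy (the tloss/eloss values above)
outputs.loss.backward()  # optimizer steps, scheduling, etc. omitted
```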

### ROUGE results per individual dataset test set
ROUGE-1, ROUGE-2, and ROUGE-L precision (P), recall (R), and F-score (F) on the test set of each sub-dataset; a computation sketch follows the table.

| dataset   | ROUGE-1 P | ROUGE-1 R | ROUGE-1 F | ROUGE-2 P | ROUGE-2 R | ROUGE-2 F | ROUGE-L P | ROUGE-L R | ROUGE-L F |
|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| cnc       | 27.45   | 24.8    | 25.24     | 9.35   | 8.54   | 8.67      | 20.14  | 18.19  | 18.54   |
| sumeczech | 25.38   | 21.61   | 22.66     | 7.71   | 6.67   | 6.96      | 18.76  | 16.02  | 16.78   |
| cnndm     | 41.97   | 42.61   | 41.05     | 19.64  | 19.88  | 19.16     | 29.38  | 29.85  | 28.73   |
| xsum      | 39.18   | 39.8    | 38.83     | 16.59  | 16.98  | 16.5      | 31.25  | 31.74  | 30.96   |
| mlsum-tu  | 51.02   | 47.95   | 47.72     | 36.15  | 34.07  | 33.9      | 44.59  | 41.9   | 41.74   |
| mlsum-de  | 46.96   | 46.16   | 46.02     | 35.95  | 35.87  | 35.66     | 43.26  | 42.7   | 42.53   |
| mlsum-fr  | 34.51   | 31.4    | 32.03     | 16.56  | 15.07  | 15.37     | 26.73  | 24.41  | 24.86   |
| mlsum-es  | 32.62   | 29.66   | 30.21     | 13.3   | 12.2   | 12.39     | 26.24  | 24.02  | 24.4    |
| mlsum-ru  | 1.25    | 1.54    | 1.31      | 0.46   | 0.46   | 0.44      | 1.25   | 1.54   | 1.31    |
| cnewsum   | 26.43   | 29.44   | 26.38     | 7.38   | 8.52   | 7.46      | 25.99  | 28.94  | 25.92   |
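
For reference, a minimal sketch of producing precision/recall/F-score triples like those above with the `rouge-score` package (an assumption; the authors' evaluation tooling, including the RougeRAW metric listed in the metadata, is not reproduced here):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(
    target="target english text1",        # reference summary
    prediction="generated summary text",  # model output
)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F={s.fmeasure:.2f}")
```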