---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
- multilingual
license: cc-by-sa-4.0
tags:
- Summarization
- abstractive summarization
- mt5-base
- Czech
- text2text generation
- text generation
datasets:
- Multilingual_large_dataset_(multilarge)
- cnn/dm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# mt5-base-multilingual-summarization-multilarge-cs
This model is a fine-tuned checkpoint of [google/mt5-base](https://huggingface.co/google/mt5-base) on the Multilingual large summarization dataset focused on Czech texts to produce multilingual summaries. 
## Task
The model generates multi-sentence summaries in eight languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve the model's summarization of Czech texts. Supported languages and their tokens: ```'cs': '<extra_id_0>', 'en': '<extra_id_1>', 'de': '<extra_id_2>', 'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>', 'zh': '<extra_id_7>'```

## Usage

```python
from collections import OrderedDict

## Configuration of the summarization pipeline
#  (MultiSummarizer is the authors' wrapper class from the accompanying code)
def summ_config():
    cfg = OrderedDict([

        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"),

        ## language of the summarization task
        #   language : string : cs, en, de, fr, es, tu, ru, zh
        ("language", "en"),

        ## generation method parameters, as a dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),

        ## texts to summarize: a list of strings, a string, or a dataset
        ("texts",
            [
                "english text1 to summarize",
                "english text2 to summarize",
            ]
        ),

        ## OPTIONAL: target summaries: a list of strings, a string, or None
        ("golds",
            [
                "target english text1",
                "target english text2",
            ]),
        #("golds", None),
    ])
    return cfg

cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
summaries, scores = mSummarize(**cfg)
```
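
If the `MultiSummarizer` wrapper is not available, the checkpoint can also be driven directly with 🤗 Transformers. The sketch below assumes the language token from the Task section (here `<extra_id_1>` for English) is prepended to the source text:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Select the summary language by prepending its token ('en' -> <extra_id_1>).
text = "<extra_id_1>" + "An English article to summarize ..."
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")

summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    min_length=10,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```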



## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly built from news articles and daily mails. Training used the entire training set and 72% of the validation set.
```
Train set:        3 464 563 docs
Validation set:     121 260 docs
```
| dataset | compression | density | coverage | avg doc nsent | avg doc nwords | avg summary nsent | avg summary nwords | docs |
|---------|-------------|---------|----------|---------------|----------------|-------------------|--------------------|------|
| cnc      | 7.388    | 0.303               | 0.088              | 16.121 | 316.912 | 3.272  | 46.805 | 750K |
| sumeczech   | 11.769   | 0.471               | 0.115              | 27.857 | 415.711 | 2.765  | 38.644 | 1M |
| cnndm       | 13.688   | 2.983               | 0.538              | 32.783 | 676.026 | 4.134  | 54.036 | 300K |
| xsum        | 18.378   | 0.479               | 0.194              | 18.607 | 369.134 | 1.000  | 21.127 | 225K|
| mlsum/tu    | 8.666    | 5.418               | 0.461              | 14.271 | 214.496 | 1.793  | 25.675 | 274K |
| mlsum/de    | 24.741   | 8.235               | 0.469              | 32.544 | 539.653 | 1.951  | 23.077 | 243K|
| mlsum/fr    | 24.388   | 2.688               | 0.424              | 24.533 | 612.080 | 1.320  | 26.93  | 425K |
| mlsum/es    | 36.185   | 3.705               | 0.510              | 31.914 | 746.927 | 1.142  | 21.671 | 291K |
| mlsum/ru    | 78.909   | 1.194               | 0.246              | 62.141 | 948.079 | 1.012  | 11.976 | 27K|
| cnewsum     | 20.183   | 0.000               | 0.000              | 16.834 | 438.271 | 1.109  | 21.926 | 304K |
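
Compression, density, and coverage are extractive-fragment statistics in the style of Grusky et al. (2018). A minimal sketch of the two simpler quantities (the coverage shown here is a word-overlap simplification of the exact fragment-based definition):

```python
def compression(doc_words, summary_words):
    # word-count ratio of document to summary
    return len(doc_words) / len(summary_words)

def coverage_approx(doc_words, summary_words):
    # fraction of summary words that also appear in the document
    # (a simplification of extractive-fragment coverage)
    doc_vocab = set(doc_words)
    return sum(w in doc_vocab for w in summary_words) / len(summary_words)

doc = "the quick brown fox jumps over the lazy dog".split()
summ = "the fox jumps over the dog".split()
print(compression(doc, summ))      # 1.5
print(coverage_approx(doc, summ))  # 1.0
```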
### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary). 
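
A minimal sketch of the corresponding preprocessing with the model's tokenizer (variable names are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"
)

document = "source document ..."
summary = "reference summary ..."

# encoder input: truncated/padded to 512 tokens
model_inputs = tokenizer(document, max_length=512,
                         truncation=True, padding="max_length")
# decoder target: truncated/padded to 128 tokens
labels = tokenizer(summary, max_length=128,
                   truncation=True, padding="max_length")
model_inputs["labels"] = labels["input_ids"]
```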
## Training
Trained with cross-entropy loss.
```
Time:        3 days 20 hours
Steps:       1,080K (10 epochs out of 10)
GPUs:        4x NVIDIA A100-SXM4-40GB
Eval loss:   2.462 -> 1.797
Train loss:  17.322 -> 1.578
```
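
The training script is not part of this card; the sketch below only illustrates how a comparable cross-entropy fine-tuning run could be set up with 🤗 Transformers (every hyperparameter except the epoch count is an assumption):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

def finetune(train_dataset, eval_dataset):
    """Fine-tune mT5 with the default label cross-entropy loss.

    Both datasets are expected to be tokenized as shown in the
    Tokenization section (input_ids, attention_mask, labels).
    """
    model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")
    args = Seq2SeqTrainingArguments(
        output_dir="mt5-base-multilarge-cs",
        num_train_epochs=10,            # 10 epochs, as reported above
        per_device_train_batch_size=4,  # assumption; not stated in the card
        learning_rate=5e-5,             # assumption; not stated in the card
        predict_with_generate=True,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    return trainer
```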
### ROUGE results per individual dataset test set:

| dataset | R1 precision | R1 recall | R1 F-score | R2 precision | R2 recall | R2 F-score | RL precision | RL recall | RL F-score |
|---------|--------------|-----------|------------|--------------|-----------|------------|--------------|-----------|------------|
| cnc  | 30.62   | 19.83   | 23.44     | 9.94   | 6.52   | 7.67      | 22.92  | 14.92  | 17.6    |
| sumeczech | 27.57   | 17.6    | 20.85     | 8.12   | 5.23   | 6.17      | 20.84  | 13.38  | 15.81   |
| cnndm    | 43.83   | 37.73   | 39.34     | 20.81  | 17.82  | 18.6      | 31.8   | 27.42  | 28.55   |
| xsum     | 41.63   | 30.54   | 34.56     | 16.13  | 11.76  | 13.33     | 33.65  | 24.74  | 27.97   |
| mlsum-tu | 54.4    | 43.29   | 46.2      | 38.78  | 31.31  | 33.23     | 48.18  | 38.44  | 41      |
| mlsum-de  | 47.94   | 44.14   | 45.11     | 36.42  | 35.24  | 35.42     | 44.43  | 41.42  | 42.16   |
| mlsum-fr  | 35.26   | 25.96   | 28.98     | 16.72  | 12.35  | 13.75     | 28.06  | 20.75  | 23.12   |
| mlsum-es  | 33.37   | 24.84   | 27.52     | 13.29  | 10.05  | 11.05     | 27.63  | 20.69  | 22.87   |
| mlsum-ru | 0.79    | 0.66    | 0.66      | 0.26   | 0.2    | 0.22      | 0.79   | 0.66   | 0.65    |
| cnewsum  | 24.49   | 24.38   | 23.23     | 6.48   | 6.7    | 6.24      | 24.18  | 24.04  | 22.91   |
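
Scores in this precision/recall/F-score layout can be computed per example with Google's `rouge_score` package (a sketch; the card does not state which implementation produced the table):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"], use_stemmer=True
)
scores = scorer.score(
    "the reference summary text",   # target
    "the generated summary text",   # prediction
)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.4f} R={s.recall:.4f} F={s.fmeasure:.4f}")
```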
