---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- mbart-large-cc25
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnc/dm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---

# mbart25-multilingual-summarization-multilarge-cs
This model is a fine-tuned checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25), trained on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.

## Task
The model produces multi-sentence summaries in eight languages. By adding documents in other languages alongside a considerable amount of Czech documents, we aimed to improve the model's summarization in Czech. Supported languages (mBART language code : language): 'cs_CZ': 'cs', 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr', 'zh_CN': 'zh'.

## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets drawn mainly from news sources, including CNN/Daily Mail. Training used the entire training set and 72% of the validation set.
```
Train set:        3 464 563 docs
Validation set:     121 260 docs
```
| dataset | fragment compression | fragment density | fragment coverage | avg document length (nsent) | avg document length (nwords) | avg summary length (nsent) | avg summary length (nwords) | documents (count) |
|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
| cnc      | 7.388    | 0.303               | 0.088              | 16.121 | 316.912 | 3.272  | 46.805 | 750K |
| sumeczech   | 11.769   | 0.471               | 0.115              | 27.857 | 415.711 | 2.765  | 38.644 | 1M |
| cnndm       | 13.688   | 2.983               | 0.538              | 32.783 | 676.026 | 4.134  | 54.036 | 300K |
| xsum        | 18.378   | 0.479               | 0.194              | 18.607 | 369.134 | 1.000  | 21.127 | 225K|
| mlsum/tu    | 8.666    | 5.418               | 0.461              | 14.271 | 214.496 | 1.793  | 25.675 | 274K |
| mlsum/de    | 24.741   | 8.235               | 0.469              | 32.544 | 539.653 | 1.951  | 23.077 | 243K|
| mlsum/fr    | 24.388   | 2.688               | 0.424              | 24.533 | 612.080 | 1.320  | 26.93  | 425K |
| mlsum/es    | 36.185   | 3.705               | 0.510              | 31.914 | 746.927 | 1.142  | 21.671 | 291K |
| mlsum/ru    | 78.909   | 1.194               | 0.246              | 62.141 | 948.079 | 1.012  | 11.976 | 27K|
| cnewsum     | 20.183   | 0.000               | 0.000              | 16.834 | 438.271 | 1.109  | 21.926 | 304K |
#### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary). 
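
To make the two length limits concrete, here is a minimal sketch in plain Python of what truncating and padding to a fixed length does to a token sequence. The helper `pad_or_truncate` and the pad id `1` are illustrative assumptions, not the tokenizer's actual code; in practice the Hugging Face tokenizer does this through its `max_length`, `truncation`, and `padding` arguments and also reserves room for special tokens.

```python
ENCODER_MAX_LEN = 512  # from the model card: input text
DECODER_MAX_LEN = 128  # from the model card: summary
PAD_ID = 1  # assumed pad token id, for illustration only


def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Return token_ids cut or right-padded to exactly max_len items."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


# A 700-token "document" is cut to 512; a 3-token "summary" is padded to 128.
long_input = list(range(10, 710))
short_summary = [5, 6, 7]
encoder_ids = pad_or_truncate(long_input, ENCODER_MAX_LEN)
decoder_ids = pad_or_truncate(short_summary, DECODER_MAX_LEN)
print(len(encoder_ids), len(decoder_ids))
```

Any document longer than 512 tokens loses its tail, which matters for sub-datasets whose average document length is high.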

## Training
The model was trained with cross-entropy loss.
```
Time: 3 days 8 hours
Epochs: approx. 8 of 10 (860K steps)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.214 - 1.762
tloss: 3.365 - 1.445
```
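
As a rough illustration of the objective, the sketch below computes a token-level cross-entropy over decoder positions in plain Python. The function name, the toy logits, and the choice to skip padded positions (pad id `1`) are assumptions made for the example; the card does not state how padding or label smoothing were handled during training.

```python
import math


def token_cross_entropy(logits, target_ids, pad_id=1):
    """Mean negative log-likelihood of the target tokens, skipping padding."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits, target_ids):
        if target == pad_id:
            continue  # assumed: padded positions do not contribute to the loss
        peak = max(step_logits)
        # Numerically stable log-sum-exp over the vocabulary.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy vocabulary of 4 tokens and 3 decoder positions; the last target is padding.
logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
targets = [3, 2, 1]
loss = token_cross_entropy(logits, targets)
print(round(loss, 4))
```

The uniform first position contributes `log(4)`, while the confident second position contributes almost nothing, so the loss falls as the model assigns more probability to the reference tokens.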

### ROUGE results per individual dataset test set:
| dataset | ROUGE-1 Precision | ROUGE-1 Recall | ROUGE-1 Fscore | ROUGE-2 Precision | ROUGE-2 Recall | ROUGE-2 Fscore | ROUGE-L Precision | ROUGE-L Recall | ROUGE-L Fscore |
|-----------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
| cnc       | 27.45   | 24.8    | 25.24     | 9.35   | 8.54   | 8.67      | 20.14  | 18.19  | 18.54   |
| sumeczech | 25.38   | 21.61   | 22.66     | 7.71   | 6.67   | 6.96      | 18.76  | 16.02  | 16.78   |
| cnndm     | 41.97   | 42.61   | 41.05     | 19.64  | 19.88  | 19.16     | 29.38  | 29.85  | 28.73   |
| xsum      | 39.18   | 39.8    | 38.83     | 16.59  | 16.98  | 16.5      | 31.25  | 31.74  | 30.96   |
| mlsum-tu  | 51.02   | 47.95   | 47.72     | 36.15  | 34.07  | 33.9      | 44.59  | 41.9   | 41.74   |
| mlsum-de  | 46.96   | 46.16   | 46.02     | 35.95  | 35.87  | 35.66     | 43.26  | 42.7   | 42.53   |
| mlsum-fr  | 34.51   | 31.4    | 32.03     | 16.56  | 15.07  | 15.37     | 26.73  | 24.41  | 24.86   |
| mlsum-es  | 32.62   | 29.66   | 30.21     | 13.3   | 12.2   | 12.39     | 26.24  | 24.02  | 24.4    |
| mlsum-ru  | 1.25    | 1.54    | 1.31      | 0.46   | 0.46   | 0.44      | 1.25   | 1.54   | 1.31    |
| cnewsum   | 26.43   | 29.44   | 26.38     | 7.38   | 8.52   | 7.46      | 25.99  | 28.94  | 25.92   |
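
For readers unfamiliar with the columns, the following is a minimal sketch of how ROUGE-N precision, recall, and F-score can be computed from clipped n-gram overlap. It assumes lowercased whitespace tokenization and is not the evaluation script used for the table, whose tokenization and any stemming may differ; the table values appear to be these quantities multiplied by 100.

```python
from collections import Counter


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, fscore) from clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    fscore = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, fscore


reference = "the cat sat on the mat"
candidate = "the cat lay on a mat"
p, r, f = rouge_n(reference, candidate, n=1)
print(f"ROUGE-1 P={p:.3f} R={r:.3f} F={f:.3f}")
```

Precision divides the overlap by the candidate length and recall divides it by the reference length, which is why a long generated summary can score high recall but low precision.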

## Usage
```
soon
```
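
Until an official snippet is published, the following is a minimal, unofficial sketch of how an mBART-cc25 summarization checkpoint is typically loaded with the `transformers` library. The `CHECKPOINT` path is a hypothetical placeholder, the helper `summarize` and the beam size are assumptions, and starting the decoder with the language token is the usual mBART-cc25 convention rather than something this card confirms. The 512 and 128 limits come from the Tokenization section.

```python
# ISO language codes mapped to mBART-cc25 language tokens.
LANG_CODES = {
    "cs": "cs_CZ", "en": "en_XX", "de": "de_DE", "es": "es_XX",
    "fr": "fr_XX", "ru": "ru_RU", "tr": "tr_TR", "zh": "zh_CN",
}

# Hypothetical placeholder: replace with the real repository id or a local path.
CHECKPOINT = "path/to/mbart25-multilingual-summarization-multilarge-cs"


def summarize(text, lang="cs", checkpoint=CHECKPOINT, num_beams=4):
    """Summarize `text` in the same language (assumed usage, not official)."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    code = LANG_CODES[lang]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang=code, tgt_lang=code)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")
    summary_ids = model.generate(
        **inputs,
        # Assumed convention: the decoder starts from the target language token.
        decoder_start_token_id=tokenizer.convert_tokens_to_ids(code),
        max_length=128,
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
```

Calling `summarize(article_text, lang="cs")` would download or load the checkpoint and return a single summary string.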