ctu-aic
/

m2m100-418M-multilingual-summarization-multilarge-cs

 ---
+language:
+- cs
+- en
+- de
+- fr
+- tr
+- zh
+- es
+- ru
+tags:
+- Summarization
+- abstractive summarization
+- multilingual summarization
+- m2m100_418M
+- Czech
+- text2text generation
+- text generation
 license: cc-by-sa-4.0
+datasets:
+- Multilingual_large_dataset_(multilarge)
+- cnn/dm
+- xsum
+- mlsum
+- cnewsum
+- cnc
+- sumeczech
+metrics:
+- rouge
+- rougeraw
+- MemesCS
 ---
+# m2m100-418M-multilingual-summarization-multilarge-cs
+This model is a checkpoint of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.
+## Task
+The model generates multi-sentence summaries in eight languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve summarization in Czech. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tu', 'zh'
+## Dataset
+The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. The entire training set and 72% of the validation set were used for training.
+```
+Train set:        3 464 563 docs
+Validation set:     121 260 docs
+```
+| Stats       | fragment |  | | avg document length |   | avg summary length  |  | Documents |
+|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
+|  __dataset__   |__compression__ | __density__  | __coverage__            | __nsent__              | __nwords__ | __nsent__   | __nwords__ | __count__ |
+| cnc      | 7.388    | 0.303               | 0.088              | 16.121 | 316.912 | 3.272  | 46.805 | 750K |
+| sumeczech   | 11.769   | 0.471               | 0.115              | 27.857 | 415.711 | 2.765  | 38.644 | 1M |
+| cnndm       | 13.688   | 2.983               | 0.538              | 32.783 | 676.026 | 4.134  | 54.036 | 300K |
+| xsum        | 18.378   | 0.479               | 0.194              | 18.607 | 369.134 | 1.000  | 21.127 | 225K|
+| mlsum/tu    | 8.666    | 5.418               | 0.461              | 14.271 | 214.496 | 1.793  | 25.675 | 274K |
+| mlsum/de    | 24.741   | 8.235               | 0.469              | 32.544 | 539.653 | 1.951  | 23.077 | 243K|
+| mlsum/fr    | 24.388   | 2.688               | 0.424              | 24.533 | 612.080 | 1.320  | 26.93  | 425K |
+| mlsum/es    | 36.185   | 3.705               | 0.510              | 31.914 | 746.927 | 1.142  | 21.671 | 291K |
+| mlsum/ru    | 78.909   | 1.194               | 0.246              | 62.141 | 948.079 | 1.012  | 11.976 | 27K|
+| cnewsum     | 20.183   | 0.000               | 0.000              | 16.834 | 438.271 | 1.109  | 21.926 | 304K |
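
The compression, density, and coverage columns look like the extractive-fragment statistics commonly used to describe summarization datasets. The card does not say how they were computed, so the following is only a minimal sketch under that assumption: whitespace tokenization, greedy longest-match fragments, and the hypothetical helper names `extractive_fragments` and `fragment_stats` are all illustrative choices, not the authors' script.

```python
from typing import List, Tuple


def extractive_fragments(article: List[str], summary: List[str]) -> List[int]:
    """Greedily find the longest token runs shared with the article.

    Returns the length of each matched fragment, scanning the summary
    from left to right.
    """
    lengths: List[int] = []
    i = 0
    while i < len(summary):
        best = 0
        for j in range(len(article)):
            k = 0
            while (
                i + k < len(summary)
                and j + k < len(article)
                and summary[i + k] == article[j + k]
            ):
                k += 1
            best = max(best, k)
        if best > 0:
            lengths.append(best)
            i += best  # skip past the matched fragment
        else:
            i += 1  # this summary token does not occur in the article
    return lengths


def fragment_stats(article: str, summary: str) -> Tuple[float, float, float]:
    """Return (compression, density, coverage) for one document/summary pair."""
    a, s = article.split(), summary.split()  # assumption: whitespace tokens
    if not s:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(a, s)
    compression = len(a) / len(s)               # article words per summary word
    density = sum(f * f for f in frags) / len(s)  # mean squared fragment length
    coverage = sum(frags) / len(s)              # share of summary tokens copied
    return compression, density, coverage
```

Whitespace tokenization finds no word boundaries in unsegmented Chinese text, which may be one reason a dataset such as cnewsum shows zero density and coverage; that reading is a guess, not something the card states.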
+#### Tokenization
+Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
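
As a rough illustration of the limits above, the sketch below clips a sequence of token ids to a fixed length and right-pads shorter ones. The helper `truncate_and_pad` and the padding id are hypothetical; a real tokenizer would also take care of special tokens such as language and end-of-sequence markers.

```python
from typing import List

MAX_SOURCE_LEN = 512  # encoder limit stated in the card
MAX_TARGET_LEN = 128  # decoder limit stated in the card
PAD_ID = 1  # hypothetical padding id, chosen for illustration only


def truncate_and_pad(token_ids: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Cut a sequence to max_len tokens, then right-pad it to exactly max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# A long document is clipped to 512 ids, a short summary is padded to 128.
source_ids = truncate_and_pad(list(range(600)), MAX_SOURCE_LEN)
target_ids = truncate_and_pad([7, 8, 9], MAX_TARGET_LEN)
```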
+## Training
+The model was trained with a cross-entropy loss.
+```
+Time: 3 days 10 hours
+Epochs: 1072K steps = 10 (from 10)
+GPUs: 4x NVIDIA A100-SXM4-40GB
+eloss: 2.824 - 1.745
+tloss: 4.559 - 1.615
+```
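
For readers unfamiliar with the objective, here is a minimal sketch of token-level cross-entropy in plain Python. Skipping padded positions through an `ignore_index` of -100 is a common convention and an assumption here, not something the card states; the function name is hypothetical.

```python
import math
from typing import Sequence


def token_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    ignore_index: int = -100,  # assumption: marks padded target positions
) -> float:
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position, excluded from the average
        peak = max(row)
        # log-sum-exp with the maximum subtracted for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0
```

With uniform logits over a vocabulary of size V the loss equals ln V, which gives a quick sanity check for any implementation.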
+### ROUGE results per individual dataset test set:
+| ROUGE      | ROUGE-1 |  |    | ROUGE-2 |  |     | ROUGE-L |  |  |
+|------------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
+|   dataset  |  Precision  | Recall  | Fscore  | Precision | Recall | Fscore | Precision | Recall | Fscore |
+| cnc    | 30.13   | 22.56   | 25.21     | 10.53  | 8.01   | 8.9       | 22.47  | 16.92  | 18.86   |
+| sumeczech  | 26.6    | 19.66   | 22.01     | 8.17   | 6.12   | 6.82      | 19.93  | 14.81  | 16.54   |
+| cnndm      | 41.8    | 38.41   | 38.94     | 18.74  | 17.14  | 17.4      | 29.69  | 27.33  | 27.68   |
+| xsum       | 38.27   | 33.62   | 35.16     | 14.39  | 12.69  | 13.25     | 30.77  | 27.05  | 28.29   |
+| mlsum-tu   | 52.44   | 44.36   | 46.39     | 36.98  | 31.51  | 32.86     | 46.04  | 39.04  | 40.8    |
+| mlsum-de   | 42.19   | 40.5    | 40.7      | 28.8   | 28.51  | 28.37     | 38.95  | 37.7   | 37.79   |
+| mlsum-fr   | 34.57   | 27.74   | 29.95     | 16.27  | 13.04  | 14.08     | 27.18  | 21.89  | 23.6    |
+| mlsum-es   | 30.93   | 26.41   | 27.66     | 11.42  | 9.85   | 10.28     | 25.12  | 21.59  | 22.55   |
+| mlsum-ru   | 0.65    | 0.52    | 0.56      | 0.15   | 0.15   | 0.15      | 0.65   | 0.52   | 0.56    |
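
To make the Precision, Recall, and Fscore columns concrete, the following simplified sketch computes ROUGE-N from n-gram overlap for a single candidate/reference pair. It assumes lowercased whitespace tokens and no stemming, and `rouge_n` is a hypothetical helper rather than the evaluation script behind the table, whose values appear to be reported on a 0-100 scale.

```python
from collections import Counter
from typing import Tuple


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of n-gram overlap on a 0-1 scale."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


p, r, f = rouge_n("the cat sat", "the cat sat on the mat")
```

In this example every candidate unigram appears in the reference, so precision is 1.0, while only half of the reference unigrams are recovered, so recall is 0.5.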
+## Usage
+```
+soon
+```