krotima1 committed
Commit
040a555
1 Parent(s): 55d1682

feat: add readme.md

Files changed (1): README.md +84 -0
README.md CHANGED
@@ -1,3 +1,87 @@
---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- multilingual summarization
- m2m100_418M
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnndm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# mbart25-multilingual-summarization-multilarge-cs
This model is a fine-tuned checkpoint of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) on the Multilingual large summarization dataset (MultiLarge), which is focused on Czech texts, and it produces multilingual summaries.
## Task
The model produces multi-sentence abstractive summaries in eight languages. By adding documents in other languages to an already considerable amount of Czech documents, we aimed to improve the model's summarization of Czech. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tu' (Turkish), 'zh'.
## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly built from news and daily-mail articles. Training used the entire training set and 72% of the validation set.
```
Train set:      3 464 563 docs
Validation set:   121 260 docs
```
| dataset | compression | density | coverage | avg doc sentences | avg doc words | avg summary sentences | avg summary words | documents |
|-----------|--------|-------|-------|--------|---------|-------|--------|------|
| cnc       | 7.388  | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M   |
| cnndm     | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum      | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
| mlsum/tu  | 8.666  | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de  | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
| mlsum/fr  | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93  | 425K |
| mlsum/es  | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru  | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K  |
| cnewsum   | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
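The compression, density, and coverage columns appear to follow the extractive-fragment statistics popularized by the Newsroom dataset (Grusky et al., 2018). For reference, a minimal sketch of how such statistics can be computed for one document-summary pair; whitespace tokenization and lowercasing are assumptions here, not necessarily the preprocessing behind the table:

```python
# Sketch: extractive-fragment statistics (compression, density, coverage).
# Tokenization/lowercasing are assumptions, not the exact script used above.

def extractive_fragments(article, summary):
    """Greedily collect maximal token spans shared by summary and article."""
    fragments, i, j = [], 0, 0
    while i < len(summary):
        best = []
        while j < len(article):
            if summary[i] == article[j]:
                i2, j2 = i, j
                while i2 < len(summary) and j2 < len(article) and summary[i2] == article[j2]:
                    i2, j2 = i2 + 1, j2 + 1
                if len(best) < i2 - i:
                    best = summary[i:i2]
                j = j2
            else:
                j += 1
        i += max(len(best), 1)
        j = 0
        if best:
            fragments.append(best)
    return fragments

def fragment_stats(article_text, summary_text):
    article = article_text.lower().split()
    summary = summary_text.lower().split()
    frags = extractive_fragments(article, summary)
    return {
        "coverage": sum(len(f) for f in frags) / len(summary),
        "density": sum(len(f) ** 2 for f in frags) / len(summary),
        "compression": len(article) / len(summary),
    }

print(fragment_stats("the cat sat on the mat near the door", "the cat sat quietly"))
```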
#### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
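As an illustration, a sketch of this setup with the `transformers` M2M100 tokenizer. Only the 512/128 limits come from the card; the `text`/`summary` field names, `padding="max_length"`, the language codes, and the `text_target` call (transformers >= 4.21) are assumptions:

```python
from transformers import M2M100Tokenizer

# Tokenizer of the base checkpoint; a fine-tuned repo would ship the same vocabulary.
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang = "cs"  # language of the input document (assumption)
tokenizer.tgt_lang = "cs"  # language of the reference summary (assumption)

def preprocess(example):
    # Encoder side: truncate/pad the article to 512 tokens.
    model_inputs = tokenizer(
        example["text"],
        max_length=512,
        truncation=True,
        padding="max_length",
    )
    # Decoder side: truncate/pad the reference summary to 128 tokens.
    labels = tokenizer(
        text_target=example["summary"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```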
## Training
The model was trained with the standard cross-entropy loss.
```
Time:   3 days 10 hours
Epochs: 1072K steps = 10 epochs (of 10)
GPUs:   4x NVIDIA A100-SXM4-40GB
eval loss (eloss):  2.824 - 1.745
train loss (tloss): 4.559 - 1.615
```
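For orientation, a minimal sketch of what seq2seq fine-tuning with cross-entropy loss looks like using the Hugging Face `Seq2SeqTrainer`. Apart from the epoch count, every hyperparameter below is a placeholder, and `tokenized_train`/`tokenized_val` stand in for the pre-tokenized MultiLarge splits; this is not the exact training configuration:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# Cross-entropy over the summary tokens is the default loss the model
# computes when `labels` are provided, so no custom loss is needed.
args = Seq2SeqTrainingArguments(
    output_dir="m2m100-multilarge-cs",  # placeholder
    num_train_epochs=10,
    per_device_train_batch_size=4,      # placeholder
    learning_rate=5e-5,                 # placeholder
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,  # pre-tokenized datasets (placeholders)
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```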
### ROUGE results on each dataset's test set

| dataset | ROUGE-1 Precision | ROUGE-1 Recall | ROUGE-1 F-score | ROUGE-2 Precision | ROUGE-2 Recall | ROUGE-2 F-score | ROUGE-L Precision | ROUGE-L Recall | ROUGE-L F-score |
|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| cnc       | 30.13 | 22.56 | 25.21 | 10.53 | 8.01  | 8.9   | 22.47 | 16.92 | 18.86 |
| sumeczech | 26.6  | 19.66 | 22.01 | 8.17  | 6.12  | 6.82  | 19.93 | 14.81 | 16.54 |
| cnndm     | 41.8  | 38.41 | 38.94 | 18.74 | 17.14 | 17.4  | 29.69 | 27.33 | 27.68 |
| xsum      | 38.27 | 33.62 | 35.16 | 14.39 | 12.69 | 13.25 | 30.77 | 27.05 | 28.29 |
| mlsum-tu  | 52.44 | 44.36 | 46.39 | 36.98 | 31.51 | 32.86 | 46.04 | 39.04 | 40.8  |
| mlsum-de  | 42.19 | 40.5  | 40.7  | 28.8  | 28.51 | 28.37 | 38.95 | 37.7  | 37.79 |
| mlsum-fr  | 34.57 | 27.74 | 29.95 | 16.27 | 13.04 | 14.08 | 27.18 | 21.89 | 23.6  |
| mlsum-es  | 30.93 | 26.41 | 27.66 | 11.42 | 9.85  | 10.28 | 25.12 | 21.59 | 22.55 |
| mlsum-ru  | 0.65  | 0.52  | 0.56  | 0.15  | 0.15  | 0.15  | 0.65  | 0.52  | 0.56  |
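Precision/recall/F-score triples like those above can be produced with a standard ROUGE implementation; a sketch using the `rouge-score` package (the exact scorer settings used for the table, e.g. stemming or the Czech RougeRAW variant listed in the metadata, are not stated, so treat these as assumptions):

```python
from rouge_score import rouge_scorer

# Assumption: plain rouge-score with no stemming; the card's own evaluation
# settings (including rougeraw for Czech) may differ.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

reference = "A reference summary from one of the test sets."
generated = "A summary generated by the model."

scores = scorer.score(reference, generated)
for name, s in scores.items():
    # Each entry carries precision, recall, and F-measure, matching the
    # three columns reported per ROUGE variant in the table above.
    print(f"{name}: P={s.precision:.4f} R={s.recall:.4f} F={s.fmeasure:.4f}")
```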

## Usage
```
soon
```
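Until the official snippet above is filled in, here is a minimal sketch of how an m2m100-based summarizer is typically run with `transformers`. The repository id below is hypothetical (taken from this card's title), and the decoding parameters are assumptions:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Hypothetical repository id based on this card's title; replace it with the
# actual model id once the usage section is published.
model_id = "krotima1/mbart25-multilingual-summarization-multilarge-cs"

tokenizer = M2M100Tokenizer.from_pretrained(model_id)
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

article = "Text of the document to summarize ..."
tokenizer.src_lang = "cs"  # language code of the input document

inputs = tokenizer(article, max_length=512, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("cs"),  # language of the summary
    max_length=128,
    num_beams=4,  # assumption; the decoding settings are not specified in the card
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```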