krotima1 committed on
Commit
950a145
1 Parent(s): 910bd8b

feat: readme update

  1. README.md +59 -0
README.md CHANGED
@@ -33,6 +33,65 @@ metrics:
  This model is a fine-tuned checkpoint of [google/mt5-base](https://huggingface.co/google/mt5-base) on the Multilingual large summarization dataset focused on Czech texts to produce multilingual summaries.
  ## Task
  The model generates multi-sentence summaries in eight languages. By adding documents in other languages to a considerable number of Czech documents, we aimed to improve the model's summarization in Czech. Supported languages: ```'cs': '<extra_id_0>', 'en': '<extra_id_1>','de': '<extra_id_2>', 'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>', 'zh': '<extra_id_7>'```
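As a minimal sketch of the mapping above, the language codes and their sentinel tokens can be kept in a dictionary and looked up by code. The names `LANG_TOKENS` and `language_token` are hypothetical and not part of the model's code. How the token is passed to the model is not stated here, so the sketch only performs the lookup. The Turkish key is kept as `tu`, exactly as listed above, although the configuration comment in the usage example spells it `tr`.

```python
# Language code -> sentinel token, copied from the list of supported languages.
LANG_TOKENS = {
    "cs": "<extra_id_0>",
    "en": "<extra_id_1>",
    "de": "<extra_id_2>",
    "es": "<extra_id_3>",
    "fr": "<extra_id_4>",
    "ru": "<extra_id_5>",
    "tu": "<extra_id_6>",
    "zh": "<extra_id_7>",
}


def language_token(lang: str) -> str:
    """Return the sentinel token for a supported language code (hypothetical helper)."""
    try:
        return LANG_TOKENS[lang]
    except KeyError:
        supported = ", ".join(sorted(LANG_TOKENS))
        raise ValueError(f"Unsupported language {lang!r}; expected one of: {supported}") from None
```

Failing early on an unknown code avoids silently summarizing with the wrong language token.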
+
+ ## Usage
+
+ ```python
40
+
41
+ ## Configuration of summarization pipeline
42
+ #
43
+ def summ_config():
44
+ cfg = OrderedDict([
45
+
46
+ ## summarization model - checkpoint
47
+ # ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
48
+ # ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
49
+ # ctu-aic/mbart25-multilingual-summarization-multilarge-cs
50
+ ("model_name", "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"),
51
+
52
+ ## language of summarization task
53
+ # language : string : cs, en, de, fr, es, tr, ru, zh
54
+ ("language", "en"),
55
+
56
+ ## generation method parameters in dictionary
57
+ #
58
+ ("inference_cfg", OrderedDict([
59
+ ("num_beams", 4),
60
+ ("top_k", 40),
61
+ ("top_p", 0.92),
62
+ ("do_sample", True),
63
+ ("temperature", 0.95),
64
+ ("repetition_penalty", 1.23),
65
+ ("no_repeat_ngram_size", None),
66
+ ("early_stopping", True),
67
+ ("max_length", 128),
68
+ ("min_length", 10),
69
+ ])),
70
+ #texts to summarize values = (list of strings, string, dataset)
71
+ ("texts",
72
+ [
73
+ "english text1 to summarize",
74
+ "english text2 to summarize",
75
+ ]
76
+ ),
77
+ #OPTIONAL: Target summaries values = (list of strings, string, None)
78
+ ('golds',
79
+ [
80
+ "target english text1",
81
+ "target english text2",
82
+ ]),
83
+ #('golds', None),
84
+ ])
85
+ return cfg
86
+
87
+ cfg = summ_config()
88
+ mSummarize = MultiSummarizer(**cfg)
89
+ summaries,scores = mSummarize(**cfg)
90
+
91
+ ```
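The keys of `inference_cfg` match keyword arguments accepted by the `generate` method of Hugging Face `transformers` models. The README does not show how `MultiSummarizer` consumes them, so the following is only a sketch under the assumption that they are forwarded to `generate`. In that case, entries set to `None` (such as `no_repeat_ngram_size`) could be dropped first so that the library defaults apply. The helper name `build_generate_kwargs` is hypothetical.

```python
from collections import OrderedDict


def build_generate_kwargs(inference_cfg):
    """Drop entries whose value is None so library defaults apply (hypothetical helper)."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


# A shortened copy of the generation parameters from the usage example.
inference_cfg = OrderedDict([
    ("num_beams", 4),
    ("do_sample", True),
    ("no_repeat_ngram_size", None),
    ("max_length", 128),
])

generate_kwargs = build_generate_kwargs(inference_cfg)
# Assumption: the result could then be unpacked into a call such as
# model.generate(**inputs, **generate_kwargs).
```

With `do_sample` enabled together with `num_beams` greater than one, `generate` performs beam-search sampling, so outputs can differ between runs unless a seed is fixed.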
+
+
+
  ## Dataset
  The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.
  ```