krotima1 committed on
Commit
f90eb97
1 Parent(s): 713b9ba

feat: readme add usage

Files changed (1)
  1. README.md +54 -4
README.md CHANGED
@@ -36,6 +36,60 @@ This model is a fine-tuned checkpoint of [facebook/mbart-large-cc25](https://hug
  ## Task
  The model generates multi-sentence summaries in eight different languages. By adding documents in other foreign languages to an already considerable amount of Czech documents, we aimed to improve the model's summarization of Czech. Supported languages: 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr'; Czech ('cs') and Chinese ('zh') are supported as well.
 
  ## Dataset
  The multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mail articles. For training, we used the entire training set and 72% of the validation set.
  ```
@@ -83,7 +137,3 @@ tloss: 3.365 - 1.445
  | mlsum-ru | 1.25 | 1.54 | 1.31 | 0.46 | 0.46 | 0.44 | 1.25 | 1.54 | 1.31 |
  | cnewsum | 26.43 | 29.44 | 26.38 | 7.38 | 8.52 | 7.46 | 25.99 | 28.94 | 25.92 |
 
- # USAGE
- ```
- soon
- ```

  ## Task
  The model generates multi-sentence summaries in eight different languages. By adding documents in other foreign languages to an already considerable amount of Czech documents, we aimed to improve the model's summarization of Czech. Supported languages: 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr'; Czech ('cs') and Chinese ('zh') are supported as well.
 
+ # USAGE
+ This assumes you are working in the provided MultilingualSummarizer.ipynb notebook, with the accompanying files from the git repository.
+ ```
+ from collections import OrderedDict
+
+ ## Configuration of the summarization pipeline
+ def summ_config():
+     cfg = OrderedDict([
+
+         ## summarization model - checkpoint
+         # ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
+         # ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
+         # ctu-aic/mbart25-multilingual-summarization-multilarge-cs
+         ("model_name", "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"),
+
+         ## language of the summarization task
+         # language : string : cs, en, de, fr, es, tr, ru, zh
+         ("language", "en"),
+
+         ## generation-method parameters, as a dictionary
+         ("inference_cfg", OrderedDict([
+             ("num_beams", 4),
+             ("top_k", 40),
+             ("top_p", 0.92),
+             ("do_sample", True),
+             ("temperature", 0.95),
+             ("repetition_penalty", 1.23),
+             ("no_repeat_ngram_size", None),
+             ("early_stopping", True),
+             ("max_length", 128),
+             ("min_length", 10),
+         ])),
+
+         ## texts to summarize: a list of strings, a string, or a dataset
+         ("texts",
+             [
+                 "english text1 to summarize",
+                 "english text2 to summarize",
+             ]
+         ),
+
+         ## OPTIONAL: target summaries: a list of strings, a string, or None
+         ("golds",
+             [
+                 "target english text1",
+                 "target english text2",
+             ]),
+         # ("golds", None),
+     ])
+     return cfg
+
+ cfg = summ_config()
+ msummarizer = MultiSummarizer(**cfg)
+ ret = msummarizer(**cfg)
+ ```
+
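The `inference_cfg` entries above are standard beam-search/sampling knobs. A minimal sketch of one sensible way to handle them (the helper `generation_kwargs` is hypothetical, not part of the repository): drop entries left as `None` so library defaults apply before forwarding the rest to a generate call.

```python
from collections import OrderedDict

def generation_kwargs(inference_cfg):
    """Drop entries whose value is None so library defaults apply.

    Hypothetical helper, not part of the repository; values below are
    copied from the config in the usage example.
    """
    return {k: v for k, v in inference_cfg.items() if v is not None}

inference_cfg = OrderedDict([
    ("num_beams", 4),
    ("top_k", 40),
    ("top_p", 0.92),
    ("do_sample", True),
    ("temperature", 0.95),
    ("repetition_penalty", 1.23),
    ("no_repeat_ngram_size", None),  # unset: removed by the filter
    ("early_stopping", True),
    ("max_length", 128),
    ("min_length", 10),
])

kwargs = generation_kwargs(inference_cfg)
# no_repeat_ngram_size is filtered out; the other nine entries pass through
```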
  ## Dataset
  The multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mail articles. For training, we used the entire training set and 72% of the validation set.
  ```
 
  | mlsum-ru | 1.25 | 1.54 | 1.31 | 0.46 | 0.46 | 0.44 | 1.25 | 1.54 | 1.31 |
  | cnewsum | 26.43 | 29.44 | 26.38 | 7.38 | 8.52 | 7.46 | 25.99 | 28.94 | 25.92 |
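The Task section maps mBART-style language codes to the shorthand codes used in the usage config ('en_XX' to 'en', and so on). A small sketch of inverting that mapping to pick the mBART code for a given shorthand code (the helper `mbart_code` is hypothetical, not part of the repository; the table covers only the six pairs listed explicitly in the Task section):

```python
# Mapping from the Task section: mBART language code -> shorthand code
LANG_CODES = {
    "en_XX": "en", "de_DE": "de", "es_XX": "es",
    "fr_XX": "fr", "ru_RU": "ru", "tr_TR": "tr",
}

def mbart_code(short):
    """Return the mBART language code for a shorthand code, e.g. 'en' -> 'en_XX'.

    Hypothetical helper for illustration; raises KeyError for codes
    not listed in the Task section.
    """
    inverse = {v: k for k, v in LANG_CODES.items()}
    return inverse[short]
```

For example, `mbart_code("ru")` yields the code to select Russian, `'ru_RU'`.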
139