Lê Nguyễn Minh Huy committed on
Commit
b19fccc
1 Parent(s): 4bb1989
README.md ADDED
@@ -0,0 +1,228 @@
+ ---
+ tags:
+ - summarization
+ - mT5
+ datasets:
+ - csebuetnlp/xlsum
+ language:
+ - am
+ - ar
+ - az
+ - bn
+ - my
+ - zh
+ - en
+ - fr
+ - gu
+ - ha
+ - hi
+ - ig
+ - id
+ - ja
+ - rn
+ - ko
+ - ky
+ - mr
+ - ne
+ - om
+ - ps
+ - fa
+ - pcm
+ - pt
+ - pa
+ - ru
+ - gd
+ - sr
+ - si
+ - so
+ - es
+ - sw
+ - ta
+ - te
+ - th
+ - ti
+ - tr
+ - uk
+ - ur
+ - uz
+ - vi
+ - cy
+ - yo
+ licenses:
+ - cc-by-nc-sa-4.0
+ widget:
+ - text: Videos that say approved vaccines are dangerous and cause autism, cancer or
+     infertility are among those that will be taken down, the company said. The policy
+     includes the termination of accounts of anti-vaccine influencers. Tech giants
+     have been criticised for not doing more to counter false health information on
+     their sites. In July, US President Joe Biden said social media platforms were
+     largely responsible for people's scepticism in getting vaccinated by spreading
+     misinformation, and appealed for them to address the issue. YouTube, which is
+     owned by Google, said 130,000 videos were removed from its platform since last
+     year, when it implemented a ban on content spreading misinformation about Covid
+     vaccines. In a blog post, the company said it had seen false claims about Covid
+     jabs "spill over into misinformation about vaccines in general". The new policy
+     covers long-approved vaccines, such as those against measles or hepatitis B. "We're
+     expanding our medical misinformation policies on YouTube with new guidelines on
+     currently administered vaccines that are approved and confirmed to be safe and
+     effective by local health authorities and the WHO," the post said, referring to
+     the World Health Organization.
+ model-index:
+ - name: csebuetnlp/mT5_multilingual_XLSum
+   results:
+   - task:
+       type: summarization
+       name: Summarization
+     dataset:
+       name: xsum
+       type: xsum
+       config: default
+       split: test
+     metrics:
+     - name: ROUGE-1
+       type: rouge
+       value: 36.5002
+       verified: true
+     - name: ROUGE-2
+       type: rouge
+       value: 13.934
+       verified: true
+     - name: ROUGE-L
+       type: rouge
+       value: 28.9876
+       verified: true
+     - name: ROUGE-LSUM
+       type: rouge
+       value: 28.9958
+       verified: true
+     - name: loss
+       type: loss
+       value: 2.0674800872802734
+       verified: true
+     - name: gen_len
+       type: gen_len
+       value: 26.9733
+       verified: true
+ ---
+
+ # mT5-multilingual-XLSum
+
+ This repository contains the mT5 checkpoint finetuned on the 45 languages of the [XL-Sum](https://huggingface.co/datasets/csebuetnlp/xlsum) dataset. For finetuning details and scripts,
+ see the [paper](https://aclanthology.org/2021.findings-acl.413/) and the [official repository](https://github.com/csebuetnlp/xl-sum).
+
+
+ ## Using this model in `transformers` (tested on 4.11.0.dev0)
+
+ ```python
+ import re
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub('\n+', ' ', k.strip()))
+
+ article_text = """Videos that say approved vaccines are dangerous and cause autism, cancer or infertility are among those that will be taken down, the company said. The policy includes the termination of accounts of anti-vaccine influencers. Tech giants have been criticised for not doing more to counter false health information on their sites. In July, US President Joe Biden said social media platforms were largely responsible for people's scepticism in getting vaccinated by spreading misinformation, and appealed for them to address the issue. YouTube, which is owned by Google, said 130,000 videos were removed from its platform since last year, when it implemented a ban on content spreading misinformation about Covid vaccines. In a blog post, the company said it had seen false claims about Covid jabs "spill over into misinformation about vaccines in general". The new policy covers long-approved vaccines, such as those against measles or hepatitis B. "We're expanding our medical misinformation policies on YouTube with new guidelines on currently administered vaccines that are approved and confirmed to be safe and effective by local health authorities and the WHO," the post said, referring to the World Health Organization."""
+
+ model_name = "csebuetnlp/mT5_multilingual_XLSum"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+
+ input_ids = tokenizer(
+     [WHITESPACE_HANDLER(article_text)],
+     return_tensors="pt",
+     padding="max_length",
+     truncation=True,
+     max_length=512
+ )["input_ids"]
+
+ output_ids = model.generate(
+     input_ids=input_ids,
+     max_length=84,
+     no_repeat_ngram_size=2,
+     num_beams=4
+ )[0]
+
+ summary = tokenizer.decode(
+     output_ids,
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False
+ )
+
+ print(summary)
+ ```
+
+ ## Benchmarks
+
+ Scores on the XL-Sum test sets are as follows:
+
+ Language | ROUGE-1 / ROUGE-2 / ROUGE-L
+ ---------|----------------------------
+ Amharic | 20.0485 / 7.4111 / 18.0753
+ Arabic | 34.9107 / 14.7937 / 29.1623
+ Azerbaijani | 21.4227 / 9.5214 / 19.3331
+ Bengali | 29.5653 / 12.1095 / 25.1315
+ Burmese | 15.9626 / 5.1477 / 14.1819
+ Chinese (Simplified) | 39.4071 / 17.7913 / 33.406
+ Chinese (Traditional) | 37.1866 / 17.1432 / 31.6184
+ English | 37.601 / 15.1536 / 29.8817
+ French | 35.3398 / 16.1739 / 28.2041
+ Gujarati | 21.9619 / 7.7417 / 19.86
+ Hausa | 39.4375 / 17.6786 / 31.6667
+ Hindi | 38.5882 / 16.8802 / 32.0132
+ Igbo | 31.6148 / 10.1605 / 24.5309
+ Indonesian | 37.0049 / 17.0181 / 30.7561
+ Japanese | 48.1544 / 23.8482 / 37.3636
+ Kirundi | 31.9907 / 14.3685 / 25.8305
+ Korean | 23.6745 / 11.4478 / 22.3619
+ Kyrgyz | 18.3751 / 7.9608 / 16.5033
+ Marathi | 22.0141 / 9.5439 / 19.9208
+ Nepali | 26.6547 / 10.2479 / 24.2847
+ Oromo | 18.7025 / 6.1694 / 16.1862
+ Pashto | 38.4743 / 15.5475 / 31.9065
+ Persian | 36.9425 / 16.1934 / 30.0701
+ Pidgin | 37.9574 / 15.1234 / 29.872
+ Portuguese | 37.1676 / 15.9022 / 28.5586
+ Punjabi | 30.6973 / 12.2058 / 25.515
+ Russian | 32.2164 / 13.6386 / 26.1689
+ Scottish Gaelic | 29.0231 / 10.9893 / 22.8814
+ Serbian (Cyrillic) | 23.7841 / 7.9816 / 20.1379
+ Serbian (Latin) | 21.6443 / 6.6573 / 18.2336
+ Sinhala | 27.2901 / 13.3815 / 23.4699
+ Somali | 31.5563 / 11.5818 / 24.2232
+ Spanish | 31.5071 / 11.8767 / 24.0746
+ Swahili | 37.6673 / 17.8534 / 30.9146
+ Tamil | 24.3326 / 11.0553 / 22.0741
+ Telugu | 19.8571 / 7.0337 / 17.6101
+ Thai | 37.3951 / 17.275 / 28.8796
+ Tigrinya | 25.321 / 8.0157 / 21.1729
+ Turkish | 32.9304 / 15.5709 / 29.2622
+ Ukrainian | 23.9908 / 10.1431 / 20.9199
+ Urdu | 39.5579 / 18.3733 / 32.8442
+ Uzbek | 16.8281 / 6.3406 / 15.4055
+ Vietnamese | 32.8826 / 16.2247 / 26.0844
+ Welsh | 32.6599 / 11.596 / 26.1164
+ Yoruba | 31.6595 / 11.6599 / 25.0898
+
+
+
+ ## Citation
+
+ If you use this model, please cite the following paper:
+ ```
+ @inproceedings{hasan-etal-2021-xl,
+     title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
+     author = "Hasan, Tahmid and
+       Bhattacharjee, Abhik and
+       Islam, Md. Saiful and
+       Mubasshir, Kazi and
+       Li, Yuan-Fang and
+       Kang, Yong-Bin and
+       Rahman, M. Sohel and
+       Shahriyar, Rifat",
+     booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
+     month = aug,
+     year = "2021",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2021.findings-acl.413",
+     pages = "4693--4703",
+ }
+ ```
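The README's usage snippet collapses all whitespace in the article before tokenizing. A minimal standalone sketch of that normalization step is shown below, written as a named function instead of a lambda. The name `normalize_whitespace` and the sample string are illustrative assumptions and are not part of the repository.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # Same effect as the README's WHITESPACE_HANDLER: trim both ends, then
    # replace each whitespace run with a single space. Since \s already
    # matches newlines, one substitution is enough here.
    return re.sub(r"\s+", " ", text.strip())


# Hypothetical input with mixed newlines, tabs, and repeated spaces.
sample = "  YouTube said\n\n130,000 videos\twere   removed.  "
print(normalize_whitespace(sample))
```

Writing the pattern as a raw string (`r"\s+"`) avoids the invalid-escape warning that newer Python versions emit for `'\s+'`.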
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "_name_or_path": "google/mt5-base",
+   "architectures": [
+     "MT5ForConditionalGeneration"
+   ],
+   "d_ff": 2048,
+   "d_kv": 64,
+   "d_model": 768,
+   "decoder_start_token_id": 0,
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "layer_norm_epsilon": 1e-06,
+   "length_penalty": 0.6,
+   "max_length": 256,
+   "model_type": "mt5",
+   "no_repeat_ngram_size": 2,
+   "num_beams": 4,
+   "num_decoder_layers": 12,
+   "num_heads": 12,
+   "num_layers": 12,
+   "output_past": true,
+   "pad_token_id": 0,
+   "relative_attention_num_buckets": 32,
+   "tokenizer_class": "T5Tokenizer",
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "vocab_size": 250112
+ }
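A few relationships in this config can be checked by hand: the attention width is `num_heads * d_kv`, which here equals `d_model`, and the generation fields match the README's `generate()` call for `num_beams` and `no_repeat_ngram_size`, while the README passes `max_length=84` instead of the stored 256. The sketch below works on a trimmed copy of the fields using only the standard `json` module. The `config_text` literal and the helper name `attention_inner_dim` are assumptions made for illustration, and nothing here loads the real model.

```python
import json

# Trimmed copy of config.json from this commit (only the fields used below).
config_text = """
{
  "d_ff": 2048, "d_kv": 64, "d_model": 768,
  "num_heads": 12, "num_layers": 12, "num_decoder_layers": 12,
  "vocab_size": 250112,
  "length_penalty": 0.6, "max_length": 256,
  "no_repeat_ngram_size": 2, "num_beams": 4
}
"""


def attention_inner_dim(cfg: dict) -> int:
    """Total width of the attention projections: heads times per-head size."""
    return cfg["num_heads"] * cfg["d_kv"]


cfg = json.loads(config_text)
inner = attention_inner_dim(cfg)

# Number of weights in one vocab_size x d_model embedding table.
embedding_params = cfg["vocab_size"] * cfg["d_model"]

# Generation-related values stored alongside the architecture fields.
generation_defaults = {
    key: cfg[key]
    for key in ("num_beams", "max_length", "length_penalty", "no_repeat_ngram_size")
}

print(inner, inner == cfg["d_model"])
print(embedding_params)
print(generation_defaults)
```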
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1899a041aceedfd0c9c67e87f2597bc597ce6f4c1f21b5d35a6325322608a898
+ size 2329707353
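The three lines above are a Git LFS pointer, not the weights themselves: each line is a `key value` pair giving the spec version, the SHA-256 object id, and the size in bytes. A small sketch of reading such a pointer follows. `parse_lfs_pointer` is a hypothetical helper written for this note and is not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the pytorch_model.bin entry in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:1899a041aceedfd0c9c67e87f2597bc597ce6f4c1f21b5d35a6325322608a898
size 2329707353
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], len(info["digest"]))
print(f"{info['size_bytes'] / 2**30:.2f} GiB")
```

The size field puts the checkpoint at roughly 2.17 GiB, which is what a client would download when resolving the pointer.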
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
+ size 4309802
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 0, "additional_special_tokens": null, "special_tokens_map_file": "/home/patrick/.cache/torch/transformers/685ac0ca8568ec593a48b61b0a3c272beee9bc194a3c7241d15dcadb5f875e53.f76030f3ec1b96a8199b2593390c610e76ca8028ef3d24680000619ffb646276", "tokenizer_file": null, "name_or_path": "google/mt5-base"}