Ubuntu committed on
Commit
83377fc
1 Parent(s): 5d85593
.gitignore ADDED
@@ -0,0 +1 @@
1
+ *.ipynb_checkpoints
README.md ADDED
@@ -0,0 +1,69 @@
1
+ ---
2
+ language: ms
3
+ ---
4
+
5
+ # t5-super-super-tiny-bahasa-cased
6
+
7
+ Pretrained T5 super-super-tiny language model for Malay.
8
+
9
+ ## Pretraining Corpus
10
+
11
+ The `t5-super-super-tiny-bahasa-cased` model was pretrained on multiple tasks. Below is the list of tasks we trained on:
12
+
13
+ 1. Language masking task on bahasa news, bahasa Wikipedia, bahasa Academia.edu, bahasa parliament and translated The Pile.
14
+ 2. News title prediction on bahasa news.
15
+ 3. Next sentence prediction on bahasa news, bahasa Wikipedia, bahasa Academia.edu, bahasa parliament and translated The Pile.
16
+ 4. Question answering on translated Natural QA.
17
+ 5. Text Similarity task on translated SNLI and translated MNLI.
18
+ 6. EN-MS translation.
19
+ 7. MS-EN translation.
20
+ 8. Abstractive Summarization.
21
+ 9. Knowledge Graph triples generation.
22
+ 10. Paraphrase.
23
+
24
+ The preparation steps can be reproduced at https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare
25
+
26
+ ## Pretraining details
27
+
28
+ - This model was trained using the Google T5 repository https://github.com/google-research/text-to-text-transfer-transformer on a v3-8 TPU.
29
+ - All training steps can be reproduced from https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5
30
+
31
+ ## Load Pretrained Model
32
+
33
+ You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. You can then load it directly like this:
34
+
35
+ ```python
36
+ from transformers import T5Tokenizer, T5Model
37
+
38
+ model = T5Model.from_pretrained('malay-huggingface/t5-super-super-tiny-bahasa-cased')
39
+ tokenizer = T5Tokenizer.from_pretrained('malay-huggingface/t5-super-super-tiny-bahasa-cased')
40
+ ```
41
+
42
+ ## Example using T5ForConditionalGeneration
43
+
44
+ ```python
45
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
46
+
47
+ tokenizer = T5Tokenizer.from_pretrained('malay-huggingface/t5-super-super-tiny-bahasa-cased')
48
+ model = T5ForConditionalGeneration.from_pretrained('malay-huggingface/t5-super-super-tiny-bahasa-cased')
49
+ input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
50
+ outputs = model.generate(input_ids)
51
+ print(tokenizer.decode(outputs[0]))
52
+ ```
53
+
54
+ The output is:
55
+
56
+ ```
57
+ 'Mahathir Mohamad'
58
+ ```
59
+
60
+ ## Supported prefix
61
+
62
+ 1. `soalan: {string}`, for question answering, trained on translated Natural QA.
63
+ 2. `ringkasan: {string}`, for abstractive summarization.
64
+ 3. `tajuk: {string}`, for abstractive title generation.
65
+ 4. `parafrasa: {string}`, for abstractive paraphrase.
66
+ 5. `terjemah Inggeris ke Melayu: {string}`, for EN-MS translation.
67
+ 6. `terjemah Melayu ke Inggeris: {string}`, for MS-EN translation.
68
+ 7. `grafik pengetahuan: {string}`, for generating EN knowledge graph triples from MS text.
69
+ 8. `ayat1: {string1} ayat2: {string2}`, for semantic similarity.
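
As a quick illustration of these prefixes, here is a minimal sketch that reuses the `T5ForConditionalGeneration` setup from above; the input sentences are made-up samples, not outputs taken from the model.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('malay-huggingface/t5-super-super-tiny-bahasa-cased')
model = T5ForConditionalGeneration.from_pretrained('malay-huggingface/t5-super-super-tiny-bahasa-cased')

def generate(prefixed_text, max_length=64):
    # Encode the prefixed input and greedily decode a short output.
    input_ids = tokenizer.encode(prefixed_text, return_tensors='pt')
    outputs = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# EN-MS translation (prefix 5).
print(generate('terjemah Inggeris ke Melayu: I love reading books.'))

# Abstractive paraphrase (prefix 4).
print(generate('parafrasa: Perdana Menteri mengumumkan bantuan baharu untuk rakyat.'))
```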
config.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "_name_or_path": "./pytorch_model.bin",
3
+ "architectures": [
4
+ "T5Model"
5
+ ],
6
+ "d_ff": 512,
7
+ "d_kv": 64,
8
+ "d_model": 128,
9
+ "decoder_start_token_id": 0,
10
+ "dropout_rate": 0.1,
11
+ "eos_token_id": 1,
12
+ "feed_forward_proj": "relu",
13
+ "gradient_checkpointing": false,
14
+ "initializer_factor": 1.0,
15
+ "inputs_length": 512,
16
+ "is_encoder_decoder": true,
17
+ "layer_norm_epsilon": 1e-06,
18
+ "model_type": "t5",
19
+ "n_positions": 512,
20
+ "num_decoder_layers": 2,
21
+ "num_heads": 6,
22
+ "num_layers": 2,
23
+ "pad_token_id": 0,
24
+ "relative_attention_num_buckets": 32,
25
+ "torch_dtype": "float32",
26
+ "transformers_version": "4.10.0",
27
+ "use_cache": true,
28
+ "vocab_size": 32128
29
+ }
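
For reference, a minimal sketch of inspecting these hyperparameters through `transformers`, using the same model id as in the README; the printed values simply mirror the `config.json` above.

```python
from transformers import T5Config

# Load the config shipped with this repository and inspect the (very small) architecture.
config = T5Config.from_pretrained('malay-huggingface/t5-super-super-tiny-bahasa-cased')

print(config.num_layers, config.num_decoder_layers)   # 2 encoder blocks, 2 decoder blocks
print(config.d_model, config.d_kv, config.num_heads)  # 128, 64, 6
print(config.d_ff, config.vocab_size)                 # 512, 32128
```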
convert-from-malaya.ipynb ADDED
@@ -0,0 +1,510 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "metadata": {
7
+ "scrolled": true
8
+ },
9
+ "outputs": [
10
+ {
11
+ "data": {
12
+ "text/plain": [
13
+ "'4.10.0'"
14
+ ]
15
+ },
16
+ "execution_count": 1,
17
+ "metadata": {},
18
+ "output_type": "execute_result"
19
+ }
20
+ ],
21
+ "source": [
22
+ "import transformers\n",
23
+ "transformers.__version__"
24
+ ]
25
+ },
26
+ {
27
+ "cell_type": "code",
28
+ "execution_count": 2,
29
+ "metadata": {},
30
+ "outputs": [],
31
+ "source": [
32
+ "from transformers import T5Config, T5Model, load_tf_weights_in_t5"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": 5,
38
+ "metadata": {},
39
+ "outputs": [
40
+ {
41
+ "name": "stdout",
42
+ "output_type": "stream",
43
+ "text": [
44
+ "checkpoint\t\t\t\tmodel.ckpt-1000000.index\r\n",
45
+ "model.ckpt-1000000.data-00000-of-00002\tmodel.ckpt-1000000.meta\r\n",
46
+ "model.ckpt-1000000.data-00001-of-00002\toperative_config.gin\r\n"
47
+ ]
48
+ }
49
+ ],
50
+ "source": [
51
+ "# !wget https://f000.backblazeb2.com/file/malaya-model/pretrained/t5-super-super-tiny-2021-07-28.tar.gz\n",
52
+ "# !tar -zxf t5-super-super-tiny-2021-07-28.tar.gz\n",
53
+ "# !rm t5-super-super-tiny-2021-07-28.tar.gz\n",
54
+ "!ls t5-super-super-tiny-v2"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "code",
59
+ "execution_count": 6,
60
+ "metadata": {},
61
+ "outputs": [
62
+ {
63
+ "name": "stdout",
64
+ "output_type": "stream",
65
+ "text": [
66
+ "T5Config {\n",
67
+ " \"d_ff\": 512,\n",
68
+ " \"d_kv\": 64,\n",
69
+ " \"d_model\": 128,\n",
70
+ " \"decoder_start_token_id\": 0,\n",
71
+ " \"dropout_rate\": 0.1,\n",
72
+ " \"eos_token_id\": 1,\n",
73
+ " \"feed_forward_proj\": \"relu\",\n",
74
+ " \"gradient_checkpointing\": false,\n",
75
+ " \"initializer_factor\": 1.0,\n",
76
+ " \"inputs_length\": 512,\n",
77
+ " \"is_encoder_decoder\": true,\n",
78
+ " \"layer_norm_epsilon\": 1e-06,\n",
79
+ " \"model_type\": \"t5\",\n",
80
+ " \"n_positions\": 512,\n",
81
+ " \"num_decoder_layers\": 2,\n",
82
+ " \"num_heads\": 6,\n",
83
+ " \"num_layers\": 2,\n",
84
+ " \"pad_token_id\": 0,\n",
85
+ " \"relative_attention_num_buckets\": 32,\n",
86
+ " \"transformers_version\": \"4.10.0\",\n",
87
+ " \"use_cache\": true,\n",
88
+ " \"vocab_size\": 32128\n",
89
+ "}\n",
90
+ "\n"
91
+ ]
92
+ }
93
+ ],
94
+ "source": [
95
+ "config = T5Config(\n",
96
+ " vocab_size = 32128,\n",
97
+ " n_positions=512,\n",
98
+ " d_ff = 512,\n",
99
+ " d_kv = 64,\n",
100
+ " d_model = 128,\n",
101
+ " dropout_rate = 0.1,\n",
102
+ " inputs_length = 512,\n",
103
+ " num_heads = 6,\n",
104
+ " num_layers = 2,\n",
105
+ " decoder_start_token_id = 0,\n",
106
+ " eos_token_id = 1,\n",
107
+ " pad_token_id = 0)\n",
108
+ "print(config)\n",
109
+ "config.save_pretrained('./')"
110
+ ]
111
+ },
112
+ {
113
+ "cell_type": "code",
114
+ "execution_count": 7,
115
+ "metadata": {},
116
+ "outputs": [
117
+ {
118
+ "data": {
119
+ "text/plain": [
120
+ "T5Model(\n",
121
+ " (shared): Embedding(32128, 128)\n",
122
+ " (encoder): T5Stack(\n",
123
+ " (embed_tokens): Embedding(32128, 128)\n",
124
+ " (block): ModuleList(\n",
125
+ " (0): T5Block(\n",
126
+ " (layer): ModuleList(\n",
127
+ " (0): T5LayerSelfAttention(\n",
128
+ " (SelfAttention): T5Attention(\n",
129
+ " (q): Linear(in_features=128, out_features=384, bias=False)\n",
130
+ " (k): Linear(in_features=128, out_features=384, bias=False)\n",
131
+ " (v): Linear(in_features=128, out_features=384, bias=False)\n",
132
+ " (o): Linear(in_features=384, out_features=128, bias=False)\n",
133
+ " (relative_attention_bias): Embedding(32, 6)\n",
134
+ " )\n",
135
+ " (layer_norm): T5LayerNorm()\n",
136
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
137
+ " )\n",
138
+ " (1): T5LayerFF(\n",
139
+ " (DenseReluDense): T5DenseReluDense(\n",
140
+ " (wi): Linear(in_features=128, out_features=512, bias=False)\n",
141
+ " (wo): Linear(in_features=512, out_features=128, bias=False)\n",
142
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
143
+ " )\n",
144
+ " (layer_norm): T5LayerNorm()\n",
145
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
146
+ " )\n",
147
+ " )\n",
148
+ " )\n",
149
+ " (1): T5Block(\n",
150
+ " (layer): ModuleList(\n",
151
+ " (0): T5LayerSelfAttention(\n",
152
+ " (SelfAttention): T5Attention(\n",
153
+ " (q): Linear(in_features=128, out_features=384, bias=False)\n",
154
+ " (k): Linear(in_features=128, out_features=384, bias=False)\n",
155
+ " (v): Linear(in_features=128, out_features=384, bias=False)\n",
156
+ " (o): Linear(in_features=384, out_features=128, bias=False)\n",
157
+ " )\n",
158
+ " (layer_norm): T5LayerNorm()\n",
159
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
160
+ " )\n",
161
+ " (1): T5LayerFF(\n",
162
+ " (DenseReluDense): T5DenseReluDense(\n",
163
+ " (wi): Linear(in_features=128, out_features=512, bias=False)\n",
164
+ " (wo): Linear(in_features=512, out_features=128, bias=False)\n",
165
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
166
+ " )\n",
167
+ " (layer_norm): T5LayerNorm()\n",
168
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
169
+ " )\n",
170
+ " )\n",
171
+ " )\n",
172
+ " )\n",
173
+ " (final_layer_norm): T5LayerNorm()\n",
174
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
175
+ " )\n",
176
+ " (decoder): T5Stack(\n",
177
+ " (embed_tokens): Embedding(32128, 128)\n",
178
+ " (block): ModuleList(\n",
179
+ " (0): T5Block(\n",
180
+ " (layer): ModuleList(\n",
181
+ " (0): T5LayerSelfAttention(\n",
182
+ " (SelfAttention): T5Attention(\n",
183
+ " (q): Linear(in_features=128, out_features=384, bias=False)\n",
184
+ " (k): Linear(in_features=128, out_features=384, bias=False)\n",
185
+ " (v): Linear(in_features=128, out_features=384, bias=False)\n",
186
+ " (o): Linear(in_features=384, out_features=128, bias=False)\n",
187
+ " (relative_attention_bias): Embedding(32, 6)\n",
188
+ " )\n",
189
+ " (layer_norm): T5LayerNorm()\n",
190
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
191
+ " )\n",
192
+ " (1): T5LayerCrossAttention(\n",
193
+ " (EncDecAttention): T5Attention(\n",
194
+ " (q): Linear(in_features=128, out_features=384, bias=False)\n",
195
+ " (k): Linear(in_features=128, out_features=384, bias=False)\n",
196
+ " (v): Linear(in_features=128, out_features=384, bias=False)\n",
197
+ " (o): Linear(in_features=384, out_features=128, bias=False)\n",
198
+ " )\n",
199
+ " (layer_norm): T5LayerNorm()\n",
200
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
201
+ " )\n",
202
+ " (2): T5LayerFF(\n",
203
+ " (DenseReluDense): T5DenseReluDense(\n",
204
+ " (wi): Linear(in_features=128, out_features=512, bias=False)\n",
205
+ " (wo): Linear(in_features=512, out_features=128, bias=False)\n",
206
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
207
+ " )\n",
208
+ " (layer_norm): T5LayerNorm()\n",
209
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
210
+ " )\n",
211
+ " )\n",
212
+ " )\n",
213
+ " (1): T5Block(\n",
214
+ " (layer): ModuleList(\n",
215
+ " (0): T5LayerSelfAttention(\n",
216
+ " (SelfAttention): T5Attention(\n",
217
+ " (q): Linear(in_features=128, out_features=384, bias=False)\n",
218
+ " (k): Linear(in_features=128, out_features=384, bias=False)\n",
219
+ " (v): Linear(in_features=128, out_features=384, bias=False)\n",
220
+ " (o): Linear(in_features=384, out_features=128, bias=False)\n",
221
+ " )\n",
222
+ " (layer_norm): T5LayerNorm()\n",
223
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
224
+ " )\n",
225
+ " (1): T5LayerCrossAttention(\n",
226
+ " (EncDecAttention): T5Attention(\n",
227
+ " (q): Linear(in_features=128, out_features=384, bias=False)\n",
228
+ " (k): Linear(in_features=128, out_features=384, bias=False)\n",
229
+ " (v): Linear(in_features=128, out_features=384, bias=False)\n",
230
+ " (o): Linear(in_features=384, out_features=128, bias=False)\n",
231
+ " )\n",
232
+ " (layer_norm): T5LayerNorm()\n",
233
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
234
+ " )\n",
235
+ " (2): T5LayerFF(\n",
236
+ " (DenseReluDense): T5DenseReluDense(\n",
237
+ " (wi): Linear(in_features=128, out_features=512, bias=False)\n",
238
+ " (wo): Linear(in_features=512, out_features=128, bias=False)\n",
239
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
240
+ " )\n",
241
+ " (layer_norm): T5LayerNorm()\n",
242
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
243
+ " )\n",
244
+ " )\n",
245
+ " )\n",
246
+ " )\n",
247
+ " (final_layer_norm): T5LayerNorm()\n",
248
+ " (dropout): Dropout(p=0.1, inplace=False)\n",
249
+ " )\n",
250
+ ")"
251
+ ]
252
+ },
253
+ "execution_count": 7,
254
+ "metadata": {},
255
+ "output_type": "execute_result"
256
+ }
257
+ ],
258
+ "source": [
259
+ "model = T5Model(config)\n",
260
+ "load_tf_weights_in_t5(model, config, 't5-super-super-tiny-v2/model.ckpt-1000000')"
261
+ ]
262
+ },
263
+ {
264
+ "cell_type": "code",
265
+ "execution_count": 8,
266
+ "metadata": {},
267
+ "outputs": [
268
+ {
269
+ "data": {
270
+ "text/plain": [
271
+ "('config.json', 'pytorch_model.bin')"
272
+ ]
273
+ },
274
+ "execution_count": 8,
275
+ "metadata": {},
276
+ "output_type": "execute_result"
277
+ }
278
+ ],
279
+ "source": [
280
+ "from transformers import CONFIG_NAME, WEIGHTS_NAME\n",
281
+ "CONFIG_NAME, WEIGHTS_NAME"
282
+ ]
283
+ },
284
+ {
285
+ "cell_type": "code",
286
+ "execution_count": 9,
287
+ "metadata": {},
288
+ "outputs": [],
289
+ "source": [
290
+ "import torch\n",
291
+ "\n",
292
+ "torch.save(model.state_dict(), './' + WEIGHTS_NAME)"
293
+ ]
294
+ },
295
+ {
296
+ "cell_type": "code",
297
+ "execution_count": 10,
298
+ "metadata": {},
299
+ "outputs": [],
300
+ "source": [
301
+ "from transformers import T5Config, T5Model, T5Tokenizer"
302
+ ]
303
+ },
304
+ {
305
+ "cell_type": "code",
306
+ "execution_count": 12,
307
+ "metadata": {},
308
+ "outputs": [],
309
+ "source": [
310
+ "# !wget https://f000.backblazeb2.com/file/malaya-model/bpe/sp10m.cased.ms-en.model"
311
+ ]
312
+ },
313
+ {
314
+ "cell_type": "code",
315
+ "execution_count": 13,
316
+ "metadata": {},
317
+ "outputs": [
318
+ {
319
+ "data": {
320
+ "text/plain": [
321
+ "('./tokenizer_config.json',\n",
322
+ " './special_tokens_map.json',\n",
323
+ " './spiece.model',\n",
324
+ " './added_tokens.json')"
325
+ ]
326
+ },
327
+ "execution_count": 13,
328
+ "metadata": {},
329
+ "output_type": "execute_result"
330
+ }
331
+ ],
332
+ "source": [
333
+ "tokenizer = T5Tokenizer('sp10m.cased.ms-en.model')\n",
334
+ "tokenizer.save_pretrained('./')"
335
+ ]
336
+ },
337
+ {
338
+ "cell_type": "code",
339
+ "execution_count": 14,
340
+ "metadata": {},
341
+ "outputs": [],
342
+ "source": [
343
+ "tokenizer = T5Tokenizer.from_pretrained('./', lower = False)"
344
+ ]
345
+ },
346
+ {
347
+ "cell_type": "code",
348
+ "execution_count": 15,
349
+ "metadata": {},
350
+ "outputs": [],
351
+ "source": [
352
+ "config = T5Config.from_pretrained('./')"
353
+ ]
354
+ },
355
+ {
356
+ "cell_type": "code",
357
+ "execution_count": 16,
358
+ "metadata": {},
359
+ "outputs": [],
360
+ "source": [
361
+ "model = T5Model.from_pretrained('./pytorch_model.bin', config = config)"
362
+ ]
363
+ },
364
+ {
365
+ "cell_type": "code",
366
+ "execution_count": 17,
367
+ "metadata": {},
368
+ "outputs": [],
369
+ "source": [
370
+ "model.save_pretrained('./')"
371
+ ]
372
+ },
373
+ {
374
+ "cell_type": "code",
375
+ "execution_count": 18,
376
+ "metadata": {},
377
+ "outputs": [],
378
+ "source": [
379
+ "from transformers import T5Tokenizer, T5ForConditionalGeneration"
380
+ ]
381
+ },
382
+ {
383
+ "cell_type": "code",
384
+ "execution_count": 19,
385
+ "metadata": {},
386
+ "outputs": [],
387
+ "source": [
388
+ "model = T5ForConditionalGeneration.from_pretrained('./')"
389
+ ]
390
+ },
391
+ {
392
+ "cell_type": "code",
393
+ "execution_count": 20,
394
+ "metadata": {},
395
+ "outputs": [
396
+ {
397
+ "data": {
398
+ "text/plain": [
399
+ "'<pad> Narendra Modi</s>'"
400
+ ]
401
+ },
402
+ "execution_count": 20,
403
+ "metadata": {},
404
+ "output_type": "execute_result"
405
+ }
406
+ ],
407
+ "source": [
408
+ "input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')\n",
409
+ "outputs = model.generate(input_ids)\n",
410
+ "tokenizer.decode(outputs[0])"
411
+ ]
412
+ },
413
+ {
414
+ "cell_type": "code",
415
+ "execution_count": 21,
416
+ "metadata": {},
417
+ "outputs": [
418
+ {
419
+ "data": {
420
+ "text/plain": [
421
+ "'<pad> PETALING JAYA: Bekas perdana menteri Najib Razak mempersoalkan sama ada kerajaan tahu bagaimana menguruskan pandemik'"
422
+ ]
423
+ },
424
+ "execution_count": 21,
425
+ "metadata": {},
426
+ "output_type": "execute_result"
427
+ }
428
+ ],
429
+ "source": [
430
+ "input_ids = tokenizer.encode('terjemah Inggeris ke Melayu: PETALING JAYA: Former prime minister Najib Razak has questioned whether the government knows how to manage the Covid-19 pandemic, outlining several seemingly contradictory announcements it has made.', return_tensors = 'pt')\n",
431
+ "outputs = model.generate(input_ids)\n",
432
+ "tokenizer.decode(outputs[0])"
433
+ ]
434
+ },
435
+ {
436
+ "cell_type": "code",
437
+ "execution_count": 22,
438
+ "metadata": {},
439
+ "outputs": [
440
+ {
441
+ "data": {
442
+ "text/plain": [
443
+ "\"<pad> PETALING JAYA: Former Prime Minister Najib Tun Razak's meeting and Deputy Prime Minister Datuk Seri\""
444
+ ]
445
+ },
446
+ "execution_count": 22,
447
+ "metadata": {},
448
+ "output_type": "execute_result"
449
+ }
450
+ ],
451
+ "source": [
452
+ "input_ids = tokenizer.encode('terjemah Melayu ke Inggeris: PETALING JAYA: Pertemuan bekas Perdana Menteri, Datuk Seri Najib Tun Razak dan Timbalan Perdana Menteri, Datuk Seri Ismail Sabri Yaakob hari ini adalah bagi membincangkan isu berkaitan hala tuju dan dasar negara.', return_tensors = 'pt')\n",
453
+ "outputs = model.generate(input_ids)\n",
454
+ "tokenizer.decode(outputs[0])"
455
+ ]
456
+ },
457
+ {
458
+ "cell_type": "code",
459
+ "execution_count": 23,
460
+ "metadata": {},
461
+ "outputs": [
462
+ {
463
+ "data": {
464
+ "text/plain": [
465
+ "'<pad> Roman Catholic Archdiocese of Maracaibo shares border with Roman Catholic Diocese'"
466
+ ]
467
+ },
468
+ "execution_count": 23,
469
+ "metadata": {},
470
+ "output_type": "execute_result"
471
+ }
472
+ ],
473
+ "source": [
474
+ "input_ids = tokenizer.encode('grafik pengetahuan: Keuskupan Agung Katolik Rom Maracaibo terletak di barat daya Keuskupan Katolik Rom Machiques.', return_tensors = 'pt')\n",
475
+ "outputs = model.generate(input_ids)\n",
476
+ "tokenizer.decode(outputs[0])"
477
+ ]
478
+ },
479
+ {
480
+ "cell_type": "code",
481
+ "execution_count": 24,
482
+ "metadata": {},
483
+ "outputs": [],
484
+ "source": [
485
+ "!rm -rf t5-super-super-tiny-v2"
486
+ ]
487
+ }
488
+ ],
489
+ "metadata": {
490
+ "kernelspec": {
491
+ "display_name": "Python 3",
492
+ "language": "python",
493
+ "name": "python3"
494
+ },
495
+ "language_info": {
496
+ "codemirror_mode": {
497
+ "name": "ipython",
498
+ "version": 3
499
+ },
500
+ "file_extension": ".py",
501
+ "mimetype": "text/x-python",
502
+ "name": "python",
503
+ "nbconvert_exporter": "python",
504
+ "pygments_lexer": "ipython3",
505
+ "version": "3.6.9"
506
+ }
507
+ },
508
+ "nbformat": 4,
509
+ "nbformat_minor": 4
510
+ }
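
In short, the notebook above converts the Mesh TensorFlow checkpoint released by Malaya into a Hugging Face PyTorch checkpoint. A condensed sketch of the same flow is below; the checkpoint and SentencePiece paths follow the notebook and are only valid once those files have been downloaded.

```python
import torch
from transformers import T5Config, T5Model, T5Tokenizer, load_tf_weights_in_t5, WEIGHTS_NAME

# 1. Build a T5Config that matches the pretrained architecture.
config = T5Config(vocab_size=32128, d_ff=512, d_kv=64, d_model=128,
                  num_heads=6, num_layers=2, dropout_rate=0.1,
                  decoder_start_token_id=0, eos_token_id=1, pad_token_id=0)
config.save_pretrained('./')

# 2. Load the TF checkpoint weights into a freshly initialised PyTorch T5Model.
model = T5Model(config)
load_tf_weights_in_t5(model, config, 't5-super-super-tiny-v2/model.ckpt-1000000')
torch.save(model.state_dict(), './' + WEIGHTS_NAME)

# 3. Wrap the SentencePiece model as a T5Tokenizer and save everything alongside the weights.
tokenizer = T5Tokenizer('sp10m.cased.ms-en.model')
tokenizer.save_pretrained('./')
model.save_pretrained('./')
```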
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cbaba9bf7fe8937ea3a8d028ce85fba8a33ee4419085d1529abd8156f850161d
3
+ size 23292952
sp10m.cased.ms-en.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:26de51154cccc9db6e65e5d466bdb0b1fff9fab1d80f4689711de943448addd6
3
+ size 803030
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "additional_special_tokens": ["<extra_id_0>", "<extra_id_1>", "<extra_id_2>", "<extra_id_3>", "<extra_id_4>", "<extra_id_5>", "<extra_id_6>", "<extra_id_7>", "<extra_id_8>", "<extra_id_9>", "<extra_id_10>", "<extra_id_11>", "<extra_id_12>", "<extra_id_13>", "<extra_id_14>", "<extra_id_15>", "<extra_id_16>", "<extra_id_17>", "<extra_id_18>", "<extra_id_19>", "<extra_id_20>", "<extra_id_21>", "<extra_id_22>", "<extra_id_23>", "<extra_id_24>", "<extra_id_25>", "<extra_id_26>", "<extra_id_27>", "<extra_id_28>", "<extra_id_29>", "<extra_id_30>", "<extra_id_31>", "<extra_id_32>", "<extra_id_33>", "<extra_id_34>", "<extra_id_35>", "<extra_id_36>", "<extra_id_37>", "<extra_id_38>", "<extra_id_39>", "<extra_id_40>", "<extra_id_41>", "<extra_id_42>", "<extra_id_43>", "<extra_id_44>", "<extra_id_45>", "<extra_id_46>", "<extra_id_47>", "<extra_id_48>", "<extra_id_49>", "<extra_id_50>", "<extra_id_51>", "<extra_id_52>", "<extra_id_53>", "<extra_id_54>", "<extra_id_55>", "<extra_id_56>", "<extra_id_57>", "<extra_id_58>", "<extra_id_59>", "<extra_id_60>", "<extra_id_61>", "<extra_id_62>", "<extra_id_63>", "<extra_id_64>", "<extra_id_65>", "<extra_id_66>", "<extra_id_67>", "<extra_id_68>", "<extra_id_69>", "<extra_id_70>", "<extra_id_71>", "<extra_id_72>", "<extra_id_73>", "<extra_id_74>", "<extra_id_75>", "<extra_id_76>", "<extra_id_77>", "<extra_id_78>", "<extra_id_79>", "<extra_id_80>", "<extra_id_81>", "<extra_id_82>", "<extra_id_83>", "<extra_id_84>", "<extra_id_85>", "<extra_id_86>", "<extra_id_87>", "<extra_id_88>", "<extra_id_89>", "<extra_id_90>", "<extra_id_91>", "<extra_id_92>", "<extra_id_93>", "<extra_id_94>", "<extra_id_95>", "<extra_id_96>", "<extra_id_97>", "<extra_id_98>", "<extra_id_99>"]}
spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:26de51154cccc9db6e65e5d466bdb0b1fff9fab1d80f4689711de943448addd6
3
+ size 803030
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 100, "additional_special_tokens": ["<extra_id_0>", "<extra_id_1>", "<extra_id_2>", "<extra_id_3>", "<extra_id_4>", "<extra_id_5>", "<extra_id_6>", "<extra_id_7>", "<extra_id_8>", "<extra_id_9>", "<extra_id_10>", "<extra_id_11>", "<extra_id_12>", "<extra_id_13>", "<extra_id_14>", "<extra_id_15>", "<extra_id_16>", "<extra_id_17>", "<extra_id_18>", "<extra_id_19>", "<extra_id_20>", "<extra_id_21>", "<extra_id_22>", "<extra_id_23>", "<extra_id_24>", "<extra_id_25>", "<extra_id_26>", "<extra_id_27>", "<extra_id_28>", "<extra_id_29>", "<extra_id_30>", "<extra_id_31>", "<extra_id_32>", "<extra_id_33>", "<extra_id_34>", "<extra_id_35>", "<extra_id_36>", "<extra_id_37>", "<extra_id_38>", "<extra_id_39>", "<extra_id_40>", "<extra_id_41>", "<extra_id_42>", "<extra_id_43>", "<extra_id_44>", "<extra_id_45>", "<extra_id_46>", "<extra_id_47>", "<extra_id_48>", "<extra_id_49>", "<extra_id_50>", "<extra_id_51>", "<extra_id_52>", "<extra_id_53>", "<extra_id_54>", "<extra_id_55>", "<extra_id_56>", "<extra_id_57>", "<extra_id_58>", "<extra_id_59>", "<extra_id_60>", "<extra_id_61>", "<extra_id_62>", "<extra_id_63>", "<extra_id_64>", "<extra_id_65>", "<extra_id_66>", "<extra_id_67>", "<extra_id_68>", "<extra_id_69>", "<extra_id_70>", "<extra_id_71>", "<extra_id_72>", "<extra_id_73>", "<extra_id_74>", "<extra_id_75>", "<extra_id_76>", "<extra_id_77>", "<extra_id_78>", "<extra_id_79>", "<extra_id_80>", "<extra_id_81>", "<extra_id_82>", "<extra_id_83>", "<extra_id_84>", "<extra_id_85>", "<extra_id_86>", "<extra_id_87>", "<extra_id_88>", "<extra_id_89>", "<extra_id_90>", "<extra_id_91>", "<extra_id_92>", "<extra_id_93>", "<extra_id_94>", "<extra_id_95>", "<extra_id_96>", "<extra_id_97>", "<extra_id_98>", "<extra_id_99>"], "sp_model_kwargs": {}, "tokenizer_class": "T5Tokenizer"}
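
The `<extra_id_*>` entries above are the standard T5 sentinel tokens used by the span-masking ("language masking") pretraining objective. A minimal sketch of how the tokenizer exposes them; the Malay sentence is an invented example.

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('malay-huggingface/t5-super-super-tiny-bahasa-cased')

# The 100 sentinel tokens sit at the top of the 32128-token SentencePiece vocabulary.
print(tokenizer.additional_special_tokens[:3])          # e.g. ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>']
print(tokenizer.convert_tokens_to_ids('<extra_id_0>'))

# A span-corruption style input, as used during the language-masking pretraining task.
text = 'Kuala Lumpur ialah <extra_id_0> Malaysia.'
print(tokenizer.tokenize(text))
```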