---
library_name: transformers
tags: []
---

# NanoT5 Small Malaysian Translation

Finetuned from https://huggingface.co/mesolitica/nanot5-small-malaysian-cased with a 2048 context length on 7B tokens of translation data.

- This model is able to translate localized text into standard text.
- This model is able to reverse translate from standard to localized text, which is suitable for text augmentation.
- This model is able to translate code.
- This model natively handles code-switching.
- This model preserves `\n`, `\t`, and `\r` as they are.
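
The whitespace claim in the last bullet can be spot-checked on your own outputs. The sketch below is plain Python and is not part of the model or of `transformers`; `control_chars_preserved` is a hypothetical helper name, and it only compares how many times `\n`, `\t`, and `\r` occur, not where they occur.

```python
from collections import Counter

# The control characters the model card says are kept as-is.
CONTROL_CHARS = ('\n', '\t', '\r')


def control_chars_preserved(source: str, translated: str) -> bool:
    """Return True if both strings contain the same count of each control character."""
    src_counts = Counter(ch for ch in source if ch in CONTROL_CHARS)
    out_counts = Counter(ch for ch in translated if ch in CONTROL_CHARS)
    return src_counts == out_counts


# Hand-written strings for illustration, not real model output.
print(control_chars_preserved('a\n\tb', 'x\n\ty'))  # True
print(control_chars_preserved('a\nb', 'x y'))       # False
```

A count-based check is deliberately loose: it flags dropped or added line breaks and tabs without requiring the translation to have the same length as the input.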

**Still in training.** Wandb run: https://wandb.ai/huseinzol05/nanot5-small-malaysian-cased-translation-v4?nw=nwuserhuseinzol05

## Supported prefixes

1. `'terjemah ke Mandarin: '`
2. `'terjemah ke Tamil: '`
3. `'terjemah ke Jawa: '`
4. `'terjemah ke Melayu: '`
5. `'terjemah ke Inggeris: '`
6. `'terjemah ke johor: '`
7. `'terjemah ke kedah: '`
8. `'terjemah ke kelantan: '`
9. `'terjemah ke pasar Melayu: '`
10. `'terjemah ke melaka: '`
11. `'terjemah ke negeri sembilan: '`
12. `'terjemah ke pahang: '`
13. `'terjemah ke perak: '`
14. `'terjemah ke sabah: '`
15. `'terjemah ke sarawak: '`
16. `'terjemah ke terengganu: '`
17. `'terjemah ke Jawi: '`
18. `'terjemah ke Manglish: '`
19. `'terjemah ke Banjar: '`
20. `'terjemah ke pasar Mandarin: '`
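
Because the prefixes are case-sensitive strings with a trailing space, it can help to build them in one place rather than typing them by hand. The following is a minimal sketch, assuming the targets are exactly the ones listed above; `SUPPORTED_TARGETS` and `make_prompt` are hypothetical names and not part of the model or of `transformers`.

```python
# Target names copied from the prefix list above, with their original casing.
SUPPORTED_TARGETS = (
    'Mandarin', 'Tamil', 'Jawa', 'Melayu', 'Inggeris', 'johor', 'kedah',
    'kelantan', 'pasar Melayu', 'melaka', 'negeri sembilan', 'pahang',
    'perak', 'sabah', 'sarawak', 'terengganu', 'Jawi', 'Manglish',
    'Banjar', 'pasar Mandarin',
)


def make_prompt(text: str, target: str) -> str:
    """Prepend the 'terjemah ke <target>: ' prefix, rejecting unlisted targets."""
    if target not in SUPPORTED_TARGETS:
        raise ValueError(f'unsupported target: {target!r}')
    return f'terjemah ke {target}: {text}'


print(make_prompt('ak tak paham la', 'Melayu'))
# terjemah ke Melayu: ak tak paham la
```

The helper only builds the prefixed text; it does not append the tokenizer's EOS token, so that step still has to happen before encoding.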

## How to use

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('mesolitica/nanot5-small-malaysian-translation-v2')
model = T5ForConditionalGeneration.from_pretrained('mesolitica/nanot5-small-malaysian-translation-v2')

# Informal, code-switched Malaysian text to translate.
strings = [
    'ak tak paham la',
    'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:',
    "Memanglah. Ini tak payah expert, aku pun tau. It's a gesture, bodoh.",
    'jam 8 di pasar KK memang org ramai π, pandai dia pilih tmpt.',
    'Jadi haram jadahπππ€',
    'nak gi mana tuu',
    'Macam nak ambil half day',
    "Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
]

# Token ids that are filtered out of the generated sequences before decoding.
all_special_ids = [0, 1, 2]
prefix = 'terjemah ke Melayu: '

# Prepend the task prefix and append EOS explicitly, then encode each string on its own.
input_ids = [
    {'input_ids': tokenizer.encode(f'{prefix}{s}{tokenizer.eos_token}', return_tensors='pt')[0]}
    for s in strings
]
# Pad the batch to its longest sequence.
padded = tokenizer.pad(input_ids, padding='longest')
outputs = model.generate(**padded, max_length=100)
tokenizer.batch_decode([[i for i in o if i not in all_special_ids] for o in outputs])
```

Output:

```
[' Saya tidak faham',
 ' Hi guys! Saya perasan semalam dan hari ini ramai yang menerima cookies. Jadi hari ini saya ingin berkongsi beberapa post mortem batch pertama kami:',
 ' Memanglah. Tak perlu pakar, saya juga tahu. Ini adalah satu isyarat, bodoh.',
 ' Orang ramai di pasar KK pada jam 8 pagi, mereka sangat pandai memilih tempat.',
 ' Jadi haram jadah πππ€',
 ' Di mana kamu pergi?',
 ' Saya ingin mengambil separuh hari',
 ' Bayangkan PH dan menang PRU-14. Terdapat pelbagai pintu belakang. Akhirnya, Ismail Sabri naik. Itulah sebabnya saya tidak lagi bercakap tentang politik. Saya bersumpah sudah berputus asa.']
```

Input text can be in any language spoken in Malaysia; as long as you use the proper prefix, the model should be able to translate it into the target language.
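
Since the target is selected only by the prefix, the same input can be sent to several targets in turn, for example to produce dialect variants for augmentation. The sketch below shows that fan-out under stated assumptions: `fan_out` is a hypothetical helper, and `translate` stands for any callable that takes a list of prefixed strings and returns one translation per string (for instance, a wrapper around the tokenize, generate, and decode steps of the usage example). A stub is used here so the sketch runs without downloading weights.

```python
from typing import Callable, Dict, List, Sequence


def fan_out(
    texts: Sequence[str],
    prefixes: Sequence[str],
    translate: Callable[[List[str]], List[str]],
) -> Dict[str, List[str]]:
    """Translate every text once per prefix; results are keyed by prefix."""
    results: Dict[str, List[str]] = {}
    for prefix in prefixes:
        prompts = [f'{prefix}{text}' for text in texts]
        outputs = translate(prompts)
        if len(outputs) != len(prompts):
            raise ValueError('translate must return one output per prompt')
        results[prefix] = outputs
    return results


def echo_translate(prompts: List[str]) -> List[str]:
    """Stand-in for the real model: wraps each prompt in angle brackets."""
    return [f'<{p}>' for p in prompts]


variants = fan_out(
    ['Saya tidak faham'],
    ['terjemah ke kedah: ', 'terjemah ke Manglish: '],
    echo_translate,
)
print(variants)
```

Passing the model call in as a function keeps the batching logic independent of how generation is configured, so settings such as `max_length` stay in one place.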