---
language:
- multilingual
- af
- am
- ar
- ast
- az
- ba
- be
- bg
- bn
- br
- bs
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- ilo
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- lb
- lg
- ln
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- ns
- oc
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- th
- tl
- tn
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
license: mit
tags:
- small100
- translation
datasets:
- flores101
- gsarti/flores_101
- tico19
- gmnlp/tico19
- tatoeba
---

# SMALL-100 Model

SMaLL-100 is a compact and fast massively multilingual machine translation model covering more than 10K language pairs. It achieves results competitive with M2M-100 while being much smaller and faster. It was introduced in [this paper](https://arxiv.org/abs/2210.11621) and initially released in [this repository](https://github.com/alirezamshi/small100).

The model architecture and config are the same as the [M2M-100](https://huggingface.co/facebook/m2m100_418M/tree/main) implementation, but the tokenizer is modified to adjust language codes. For the moment, load the tokenizer locally from the `tokenization_small100.py` file.

```
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100")

# translate Hindi to French
tokenizer.tgt_lang = "fr"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.tgt_lang = "en"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a box of chocolate."
```

Please refer to the [original repository](https://github.com/alirezamshi/small100) for further details.
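
The usage example above translates one sentence at a time. To translate several sentences into the same target language, the same calls can be wrapped in a small helper. This is a minimal sketch and not part of the released code: `load_small100` and `translate_batch` are hypothetical names, and it assumes `tokenization_small100.py` is available locally and that the tokenizer accepts the standard `padding=True` argument for batched input.

```python
from typing import Any, Iterable, List, Tuple


def load_small100(name: str = "alirezamsh/small100") -> Tuple[Any, Any]:
    """Load the model and tokenizer (hypothetical helper).

    Imports are deferred so that translate_batch can be used and tested
    without transformers installed.
    """
    from transformers import M2M100ForConditionalGeneration
    from tokenization_small100 import SMALL100Tokenizer  # local file

    model = M2M100ForConditionalGeneration.from_pretrained(name)
    tokenizer = SMALL100Tokenizer.from_pretrained(name)
    return model, tokenizer


def translate_batch(
    texts: Iterable[str],
    tgt_lang: str,
    model: Any,
    tokenizer: Any,
    **generate_kwargs: Any,
) -> List[str]:
    """Translate every string in `texts` into `tgt_lang` (hypothetical helper)."""
    batch = list(texts)
    if not batch:
        return []
    # The target language is set on the tokenizer, as in the example above.
    tokenizer.tgt_lang = tgt_lang
    # Pad so that sentences of different lengths fit into one tensor.
    encoded = tokenizer(batch, return_tensors="pt", padding=True)
    # Extra keyword arguments (e.g. num_beams) are passed through to generate().
    generated = model.generate(**encoded, **generate_kwargs)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the model loaded via `model, tokenizer = load_small100()`, a call such as `translate_batch([hi_text, chinese_text], "en", model, tokenizer)` would return one English string per input. Passing the model and tokenizer in as arguments avoids reloading the weights on every call.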