SMALL-100 Model

SMaLL-100 is a compact and fast massively multilingual machine translation model covering more than 10K language pairs, that achieves competitive results with M2M-100 while being much smaller and faster. It is introduced in this paper(accepted to EMNLP2022), and initially released in this repository.

The model architecture and config are the same as M2M-100 implementation, but the tokenizer is modified to adjust language codes. So, you should load the tokenizer locally from tokenization_small100.py file for the moment.

Demo: https://huggingface.co/spaces/alirezamsh/small100

Note: SMALL100Tokenizer requires sentencepiece, so make sure to install it by:

pip install sentencepiece

  • Supervised Training

SMaLL-100 is a seq-to-seq model for the translation task. The input to the model is source:[tgt_lang_code] + src_tokens + [EOS] and target: tgt_tokens + [EOS].

small-100-th is the fine-tuned version of SMALL-100 for Thai

The dataset can be acquired from scb-mt-en-th-2020 and OPUS. It can also be directly download from Vistec.

small-100-th inference

from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer
from huggingface_hub import notebook_login

notebook_login()

checkpoint = "kimmchii/small-100-th"
model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = SMALL100Tokenizer.from_pretrained(checkpoint)

thai_text = "สวัสดี"

# translate Thai to English
tokenizer.tgt_lang = "en"
encoded_th = tokenizer(thai_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_th)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Hello"
Downloads last month
8
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Space using kimmchii/small100-th 1