How to use alphaedge-ai/mbart-large-50-sin-32768 with Transformers:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("alphaedge-ai/mbart-large-50-sin-32768")
model = AutoModelForSeq2SeqLM.from_pretrained("alphaedge-ai/mbart-large-50-sin-32768")

This model is a 36.42% smaller version of facebook/mbart-large-50, optimized for the Sinhala language by reducing the vocabulary size with the trimming method.
This trimmed model should perform similarly to the original model on Sinhala while keeping only 32,768 tokens and using a much smaller memory footprint. However, it may not perform well on other languages, because tokens not commonly used in Sinhala were removed from the vocabulary.
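One rough way to see the effect of the reduced vocabulary on a given language is to measure how often the tokenizer falls back to its unknown token. The helper below is a minimal sketch, not part of this model card: `unknown_token_rate` is a hypothetical name, and it assumes that text outside the trimmed vocabulary is encoded as `unk_token_id`, which may not hold for every tokenizer.

```python
def unknown_token_rate(tokenizer, texts):
    """Fraction of produced token ids that equal the tokenizer's unknown-token id.

    Hypothetical helper: assumes out-of-vocabulary text is mapped to
    `tokenizer.unk_token_id` rather than split into bytes or dropped.
    """
    unk_id = tokenizer.unk_token_id
    total = 0
    unknown = 0
    for text in texts:
        # Skip special tokens so only tokens derived from the text are counted.
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        total += len(ids)
        unknown += sum(1 for token_id in ids if token_id == unk_id)
    # Avoid dividing by zero when no text (or only empty text) is given.
    return unknown / total if total else 0.0
```

Running it on a few Sinhala sentences and on a few sentences in another language, with the tokenizer loaded as shown earlier, would give a quick, informal comparison. A high rate on the second set would be in line with the caveat above.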
| Metric | Original | Trimmed | Reduction |
|---|---|---|---|
| Vocabulary size | 250,054 tokens | 32,768 tokens | 86.90% |
| Model size | 610,879,488 params | 388,378,624 params | 36.42% |
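The percentages in the table can be reproduced from the raw counts. The sketch below also checks whether the removed parameters are consistent with dropping embedding rows only. It relies on two assumptions not stated in this card: a hidden size of 1024 for mbart-large-50, and 32-bit weights for the memory estimate.

```python
# Figures copied from the table above.
ORIGINAL_VOCAB = 250_054
TRIMMED_VOCAB = 32_768
ORIGINAL_PARAMS = 610_879_488
TRIMMED_PARAMS = 388_378_624

# Assumption (not stated in this card): hidden size of mbart-large-50.
HIDDEN_SIZE = 1024
# Assumption: weights stored as 32-bit floats.
BYTES_PER_PARAM = 4


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100 * (before - after) / before


vocab_reduction = percent_reduction(ORIGINAL_VOCAB, TRIMMED_VOCAB)
param_reduction = percent_reduction(ORIGINAL_PARAMS, TRIMMED_PARAMS)

# If only embedding rows were removed (one shared embedding matrix),
# the parameter difference should equal removed tokens * hidden size.
removed_tokens = ORIGINAL_VOCAB - TRIMMED_VOCAB
removed_params = ORIGINAL_PARAMS - TRIMMED_PARAMS
embedding_params_removed = removed_tokens * HIDDEN_SIZE

original_gib = ORIGINAL_PARAMS * BYTES_PER_PARAM / 2**30
trimmed_gib = TRIMMED_PARAMS * BYTES_PER_PARAM / 2**30

print(f"Vocabulary reduction: {vocab_reduction:.2f}%")
print(f"Parameter reduction:  {param_reduction:.2f}%")
print(f"Difference explained by embedding rows: {removed_params == embedding_params_removed}")
print(f"Approximate weight size: {original_gib:.2f} GiB -> {trimmed_gib:.2f} GiB")
```

Under these assumptions, the whole parameter reduction is accounted for by the removed vocabulary rows, which fits the description of trimming as a vocabulary-only change.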
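from transformers import AutoModel, AutoTokenizer
model_name = "alphaedge-ai/mbart-large-50-sin-32768"
# AutoModel loads the bare encoder-decoder, which returns hidden states only;
# use AutoModelForSeq2SeqLM, as shown above, when you need text generation.
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)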
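After loading, the vocabulary and parameter counts can be compared against the figures reported in this card. The function below is a small sketch under that goal: `summarize_model` is a hypothetical helper name, and it uses only `get_input_embeddings()`, `parameters()`, and `len(tokenizer)`.

```python
def summarize_model(model, tokenizer):
    """Collect vocabulary and parameter counts for a loaded model/tokenizer pair.

    Hypothetical helper for illustration; not part of the model repository.
    """
    # Number of rows in the input embedding matrix (one row per token id).
    embedding_rows = model.get_input_embeddings().weight.shape[0]
    # Total number of parameters reported by the model.
    total_params = sum(p.numel() for p in model.parameters())
    return {
        "tokenizer_size": len(tokenizer),
        "embedding_rows": embedding_rows,
        "total_params": total_params,
    }
```

Calling `summarize_model(model, tokenizer)` with the objects loaded above returns a dictionary that can be checked by hand against the 32,768-token vocabulary and the parameter count listed in the table. Small differences are possible depending on which model class is loaded.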
@misc{tang2020multilingualtranslationextensiblemultilingual,
title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
year={2020},
eprint={2008.00401},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2008.00401},
}
@misc{hf_blogpost_trimming,
title={Introduction to Trimming},
author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
year={2026},
url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
}