---
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
---

# XLM-V (Base-sized model)

XLM-V is a multilingual language model with a one million token vocabulary trained on 2.5TB of data from Common Crawl (the same data as XLM-R). It was introduced in the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer and Madian Khabsa.

**Disclaimer**: The team releasing XLM-V did not write a model card for this model, so this model card has been written by the Hugging Face team. [This repository](https://github.com/stefan-it/xlm-v-experiments) documents all necessary integration steps.

## Model description

From the abstract of the XLM-V paper:

> Large multilingual language models typically rely on a single vocabulary shared across 100+ languages.
> As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged.
> This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R.
> In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by
> de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity
> to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically
> more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V,
> a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we
> tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and
> named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='facebook/xlm-v-base')
>>> unmasker("Paris is the <mask> of France.")

[{'score': 0.9286897778511047,
  'token': 133852,
  'token_str': 'capital',
  'sequence': 'Paris is the capital of France.'},
 {'score': 0.018073994666337967,
  'token': 46562,
  'token_str': 'Capital',
  'sequence': 'Paris is the Capital of France.'},
 {'score': 0.013238662853837013,
  'token': 8696,
  'token_str': 'centre',
  'sequence': 'Paris is the centre of France.'},
 {'score': 0.010450296103954315,
  'token': 550136,
  'token_str': 'heart',
  'sequence': 'Paris is the heart of France.'},
 {'score': 0.005028395913541317,
  'token': 60041,
  'token_str': 'center',
  'sequence': 'Paris is the center of France.'}]
```

## Bias, Risks, and Limitations

Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base), because XLM-V has a similar architecture and has been trained on similar training data.
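### Feature extraction

Beyond the fill-mask pipeline shown above, the checkpoint can also be loaded with the generic Auto classes, for example to inspect the one-million-token vocabulary or to extract hidden states. The following is a minimal sketch, assuming the standard `AutoTokenizer`/`AutoModel` API from `transformers` (not taken from the original card):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-v-base")
model = AutoModel.from_pretrained("facebook/xlm-v-base")

# The vocabulary described in the paper is on the order of one million tokens.
print(tokenizer.vocab_size)

# Encode a sentence and extract the last hidden states.
inputs = tokenizer("Paris is the capital of France.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```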
### BibTeX entry and citation info

```bibtex
@ARTICLE{2023arXiv230110472L,
       author = {{Liang}, Davis and {Gonen}, Hila and {Mao}, Yuning and {Hou}, Rui and {Goyal}, Naman and {Ghazvininejad}, Marjan and {Zettlemoyer}, Luke and {Khabsa}, Madian},
        title = "{XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
         year = 2023,
        month = jan,
          eid = {arXiv:2301.10472},
        pages = {arXiv:2301.10472},
          doi = {10.48550/arXiv.2301.10472},
archivePrefix = {arXiv},
       eprint = {2301.10472},
 primaryClass = {cs.CL},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2023arXiv230110472L},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
```