# Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)
https://arxiv.org/pdf/1911.02116.pdf

# Larger-Scale Transformers for Multilingual Masked Language Modeling
https://arxiv.org/pdf/2105.00572.pdf
## What's New:
- June 2021: `XLMR-XL` and `XLMR-XXL` models released.
## Introduction
`XLM-R` (`XLM-RoBERTa`) is a generic cross-lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It is trained on 2.5TB of filtered CommonCrawl data covering 100 languages (listed below).
Language | Language | Language | Language | Language
---|---|---|---|---
Afrikaans | Albanian | Amharic | Arabic | Armenian
Assamese | Azerbaijani | Basque | Belarusian | Bengali
Bengali Romanized | Bosnian | Breton | Bulgarian | Burmese
Burmese (Zawgyi font) | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian
Czech | Danish | Dutch | English | Esperanto
Estonian | Filipino | Finnish | French | Galician
Georgian | German | Greek | Gujarati | Hausa
Hebrew | Hindi | Hindi Romanized | Hungarian | Icelandic
Indonesian | Irish | Italian | Japanese | Javanese
Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji)
Kyrgyz | Lao | Latin | Latvian | Lithuanian
Macedonian | Malagasy | Malay | Malayalam | Marathi
Mongolian | Nepali | Norwegian | Oriya | Oromo
Pashto | Persian | Polish | Portuguese | Punjabi
Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian
Sindhi | Sinhala | Slovak | Slovenian | Somali
Spanish | Sundanese | Swahili | Swedish | Tamil
Tamil Romanized | Telugu | Telugu Romanized | Thai | Turkish
Ukrainian | Urdu | Urdu Romanized | Uyghur | Uzbek
Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish
## Pre-trained models
Model | Description | #params | vocab size | Download
---|---|---|---|---
`xlmr.base` | XLM-R using the BERT-base architecture | 250M | 250k | [xlmr.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz)
`xlmr.large` | XLM-R using the BERT-large architecture | 560M | 250k | [xlmr.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz)
`xlmr.xl` | XLM-R (`layers=36, model_dim=2560`) | 3.5B | 250k | [xlmr.xl.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xlmr/xlmr.xl.tar.gz)
`xlmr.xxl` | XLM-R (`layers=48, model_dim=4096`) | 10.7B | 250k | [xlmr.xxl.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xlmr/xlmr.xxl.tar.gz)
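As a quick sanity check against the `#params` column, the parameters of a downloaded checkpoint can be counted directly. A minimal sketch using `xlmr.base` to keep the download small (the exact total may differ slightly depending on how tied embeddings and the LM head are counted):

```python
import torch

# Load the smallest released checkpoint and count its parameters.
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.base')
num_params = sum(p.numel() for p in xlmr.model.parameters())
print(f'{num_params / 1e6:.0f}M parameters')  # expected to be roughly 250M
```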
## Results
**[XNLI (Conneau et al., 2018)](https://arxiv.org/abs/1809.05053)** (accuracy)
Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
`roberta.large.mnli` _(TRANSLATE-TEST)_ | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8
`xlmr.large` _(TRANSLATE-TRAIN-ALL)_ | 83.6 | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | 83.7 | 81.6 | 78.0 | 78.1
`xlmr.xl` _(TRANSLATE-TRAIN-ALL)_ | 85.4 | 91.1 | 87.2 | 88.1 | 87.0 | 87.4 | 87.8 | 85.3 | 85.2 | 85.3 | 86.2 | 83.8 | 85.3 | 83.1 | 79.8 | 78.2
`xlmr.xxl` _(TRANSLATE-TRAIN-ALL)_ | 86.0 | 91.5 | 87.6 | 88.7 | 87.8 | 87.4 | 88.2 | 85.6 | 85.1 | 85.8 | 86.3 | 83.9 | 85.6 | 84.6 | 81.7 | 80.6
**[MLQA (Lewis et al., 2019)](https://arxiv.org/abs/1910.07475)** (F1 / EM)
Model | average | en | es | de | ar | hi | vi | zh
---|---|---|---|---|---|---|---|---
`BERT-large` | - | 80.2 / 67.4 | - | - | - | - | - | -
`mBERT` | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8 | 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3
`xlmr.large` | 70.7 / 52.7 | 80.6 / 67.8 | 74.1 / 56.0 | 68.5 / 53.6 | 63.1 / 43.5 | 69.2 / 51.6 | 71.3 / 50.9 | 68.0 / 45.4
`xlmr.xl` | 73.4 / 55.3 | 85.1 / 72.6 | 66.7 / 46.2 | 70.5 / 55.5 | 74.3 / 56.9 | 72.2 / 54.7 | 74.4 / 52.9 | 70.9 / 48.5
`xlmr.xxl` | 74.8 / 56.6 | 85.5 / 72.4 | 68.6 / 48.4 | 72.7 / 57.8 | 75.4 / 57.6 | 73.7 / 55.8 | 76.0 / 55.0 | 71.7 / 48.9
## Example usage
##### Load XLM-R from torch.hub (PyTorch >= 1.1):
```python
import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
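The returned hub interface behaves like an ordinary `torch.nn.Module`, so it can optionally be moved to GPU (and, as an assumption here, cast to half precision) before extracting features:

```python
# Optional: run feature extraction on GPU, in fp16 to reduce memory use.
if torch.cuda.is_available():
    xlmr.cuda()
    xlmr.half()
```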
##### Load XLM-R (for PyTorch 1.0 or custom models):
```bash
# Download the xlmr.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz
tar -xzvf xlmr.large.tar.gz
```
```python
# Load the model in fairseq
from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
##### Apply sentence-piece-model (SPM) encoding to input text:
```python
# encode() prepends the BOS symbol (id 0) and appends the EOS symbol (id 2).
en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378, 8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好,世界'

hi_tokens = xlmr.encode('नमस्ते दुनिया')
assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
xlmr.decode(hi_tokens)  # 'नमस्ते दुनिया'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens)  # 'مرحبا بالعالم'

fr_tokens = xlmr.encode('Bonjour le monde')
assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
xlmr.decode(fr_tokens)  # 'Bonjour le monde'
```
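`encode()` handles one sentence at a time. For batched feature extraction, one option is fairseq's generic `collate_tokens` helper, which pads the encoded sentences to a common length; a short sketch assuming the model's own pad symbol:

```python
from fairseq.data.data_utils import collate_tokens

sentences = ['Hello world!', 'Bonjour le monde', '你好,世界']
# Pad every encoded sentence to the length of the longest one in the batch.
pad_idx = xlmr.task.source_dictionary.pad()
batch = collate_tokens([xlmr.encode(s) for s in sentences], pad_idx=pad_idx)
assert batch.dim() == 2  # (batch_size, max_seq_len)
```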
##### Extract features from XLM-R:
```python
# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
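The per-token features can also be pooled into fixed-size sentence vectors for rough cross-lingual comparisons. Mean-pooling and cosine similarity are assumptions here (the pre-trained encoder is not fine-tuned for sentence similarity), so treat this only as an illustration of the API:

```python
import torch.nn.functional as F

def embed(sentence):
    # Mean-pool the last layer's features into a single vector per sentence.
    tokens = xlmr.encode(sentence)
    features = xlmr.extract_features(tokens)  # shape: (1, seq_len, hidden_dim)
    return features.mean(dim=1).squeeze(0)

# Compare an English sentence with its French translation and with an unrelated sentence.
sim_translation = F.cosine_similarity(embed('Hello world!'), embed('Bonjour le monde'), dim=0)
sim_unrelated = F.cosine_similarity(embed('Hello world!'), embed('The report is due on Friday.'), dim=0)
print(sim_translation.item(), sim_unrelated.item())
```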
## Citation
```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}
```
```bibtex
@article{goyal2021larger,
  title={Larger-Scale Transformers for Multilingual Masked Language Modeling},
  author={Goyal, Naman and Du, Jingfei and Ott, Myle and Anantharaman, Giri and Conneau, Alexis},
  journal={arXiv preprint arXiv:2105.00572},
  year={2021}
}
```