	---
	language: ha
	license: mit
	datasets: cc100
	---

	# xlm-roberta-base-focus-hausa

	XLM-R adapted to Hausa using "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models".

	Code: https://github.com/konstantinjdobler/focus

	Paper: https://arxiv.org/abs/2305.14481

	## Usage
	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("konstantindobler/xlm-roberta-base-focus-hausa")
	model = AutoModelForMaskedLM.from_pretrained("konstantindobler/xlm-roberta-base-focus-hausa")

	# Use model and tokenizer as usual
	```
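
As one possible way to try the model, the sketch below wraps the `fill-mask` pipeline from `transformers` in a small helper. The function names `fill_template` and `top_predictions`, the `{mask}` placeholder, and the example sentence are illustrative assumptions, not part of the released code. The mask token is read from the tokenizer rather than hard-coded.

```python
MODEL_ID = "konstantindobler/xlm-roberta-base-focus-hausa"


def fill_template(template: str, mask_token: str) -> str:
    """Replace the {mask} placeholder with the tokenizer's own mask token."""
    return template.replace("{mask}", mask_token)


def top_predictions(template: str, k: int = 5):
    """Return (token, score) pairs for the masked position in `template`."""
    # Imported lazily so fill_template can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    text = fill_template(template, fill_mask.tokenizer.mask_token)
    # Each prediction is a dict with "token_str" and "score" entries.
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=k)]
```

Calling `top_predictions("Ina son karanta {mask}.")` downloads the model weights on first use and returns the `k` highest-scoring candidates for the masked word.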

	## Details
	The model is based on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) and was adapted to Hausa.
	The original multilingual tokenizer was replaced by a language-specific Hausa tokenizer with a vocabulary of 50k tokens. The new embeddings were initialized with FOCUS.
	The model was then trained on data from CC100 for 390k optimizer steps. More details and hyperparameters can be found [in the paper](https://arxiv.org/abs/2305.14481).
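
To give a rough intuition for the initialization step, the paper describes FOCUS as representing new tokens as combinations of tokens that appear in both the old and the new vocabulary, selected by similarity in an auxiliary static embedding space. The toy sketch below follows that idea under simplifying assumptions: it uses small hand-made NumPy arrays, a fixed `top_k`, and a plain softmax over cosine similarities, which may differ from the weighting used in the paper. All names (`init_new_embeddings`, `aux_vectors`, and so on) are hypothetical; see the linked repository for the actual implementation.

```python
import numpy as np


def init_new_embeddings(source_emb, source_vocab, target_vocab, aux_vectors, top_k=2):
    """Toy embedding initialization for a new vocabulary.

    source_emb:   (len(source_vocab), dim) array of pretrained embeddings
    source_vocab: dict mapping token -> row index in source_emb
    target_vocab: list of tokens in the new tokenizer
    aux_vectors:  dict mapping every target token -> auxiliary static vector
    """
    overlap = [t for t in target_vocab if t in source_vocab]
    overlap_emb = np.stack([source_emb[source_vocab[t]] for t in overlap])
    overlap_aux = np.stack([aux_vectors[t] for t in overlap])
    overlap_aux = overlap_aux / np.linalg.norm(overlap_aux, axis=1, keepdims=True)

    target_emb = np.zeros((len(target_vocab), source_emb.shape[1]))
    for i, token in enumerate(target_vocab):
        if token in source_vocab:
            # Shared token: copy the pretrained embedding unchanged.
            target_emb[i] = source_emb[source_vocab[token]]
            continue
        # New token: cosine similarity to every shared token in the auxiliary space.
        query = aux_vectors[token] / np.linalg.norm(aux_vectors[token])
        sims = overlap_aux @ query
        nearest = np.argsort(sims)[-top_k:]
        # Softmax over the nearest neighbours gives convex combination weights.
        weights = np.exp(sims[nearest] - sims[nearest].max())
        weights /= weights.sum()
        target_emb[i] = weights @ overlap_emb[nearest]
    return target_emb


# Toy data: three shared tokens and one new token ("gida").
source_vocab = {"a": 0, "b": 1, "c": 2}
source_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
target_vocab = ["a", "b", "c", "gida"]
aux_vectors = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([-1.0, 0.0]),
    "gida": np.array([1.0, 1.0]),
}
target_emb = init_new_embeddings(source_emb, source_vocab, target_vocab, aux_vectors)
```

In this toy setup the new token is equally similar to `a` and `b`, so its embedding ends up halfway between theirs, while the shared tokens keep their pretrained rows.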

	## Disclaimer
	The web-scale dataset used for pretraining and tokenizer training (CC100) might contain personal and sensitive information.
	The model's handling of such information needs to be assessed carefully before any real-world deployment. Also, the tokenizer was trained with a sentencepiece `character_coverage` of 100%. As a result, the vocabulary contains characters that are not normally used in Hausa.
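
One way to look for such characters is to scan the vocabulary for letters outside a reference alphabet. The sketch below is only a starting point: the `HAUSA_LETTERS` set is an assumed list of lowercase Boko letters, the helper names are hypothetical, and the toy vocabulary stands in for the keys of `tokenizer.get_vocab()`. Flagged tokens are candidates for review, not necessarily errors, since loanwords and names can legitimately contain other letters.

```python
# Assumed lowercase Hausa (Boko) letters; adjust to the orthography you target.
HAUSA_LETTERS = set("abɓcdɗefghijkƙlmnorstuwyƴz'ʼ")
WORD_BOUNDARY = "\u2581"  # SentencePiece marks word starts with "▁"


def unexpected_chars(token: str) -> set:
    """Return the letters in `token` that are outside the assumed alphabet."""
    text = token.replace(WORD_BOUNDARY, "").lower()
    return {ch for ch in text if ch.isalpha() and ch not in HAUSA_LETTERS}


def flag_tokens(vocab):
    """Map each token containing unexpected letters to the set of those letters."""
    flagged = {}
    for token in vocab:
        extra = unexpected_chars(token)
        if extra:
            flagged[token] = extra
    return flagged


# Toy vocabulary; in practice pass the keys of `tokenizer.get_vocab()`.
toy_vocab = ["▁gida", "▁ƙasa", "▁привет", "▁xyz", "123"]
flagged = flag_tokens(toy_vocab)
```

Digits and punctuation are ignored here because only alphabetic characters are checked.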

	## Citation
	Please cite FOCUS as follows:

	```bibtex
	@misc{dobler-demelo-2023-focus,
	title={FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models},
	author={Konstantin Dobler and Gerard de Melo},
	year={2023},
	eprint={2305.14481},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```