konstantindobler committed
Commit 3bb6ff3
1 Parent(s): 7402d18

Upload README.md with huggingface_hub

Files changed (1): README.md +46 -0

README.md ADDED
---
language: ha
license: mit
datasets: cc100
---

# xlm-roberta-base-focus-hausa

XLM-R adapted to Hausa using "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models".

Code: https://github.com/konstantinjdobler/focus

Paper: https://arxiv.org/abs/2305.14481

## Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("konstantindobler/xlm-roberta-base-focus-hausa")
model = AutoModelForMaskedLM.from_pretrained("konstantindobler/xlm-roberta-base-focus-hausa")

# Use model and tokenizer as usual
```
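
As a quick sanity check, the model can also be queried through the `fill-mask` pipeline. A minimal sketch; the Hausa prompt is only an illustrative assumption:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="konstantindobler/xlm-roberta-base-focus-hausa")

# Illustrative Hausa prompt ("Ina son <mask>." ~ "I want <mask>.");
# the pipeline returns the top vocabulary candidates for the masked token.
predictions = unmasker("Ina son <mask>.")
print(predictions)
```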

## Details
The model is based on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) and was adapted to Hausa.
The original multilingual tokenizer was replaced by a language-specific Hausa tokenizer with a vocabulary of 50k tokens. The new embeddings were initialized with FOCUS.
The model was then trained on data from CC100 for 390k optimizer steps. More details and hyperparameters can be found [in the paper](https://arxiv.org/abs/2305.14481).
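
To make the adaptation recipe concrete, the sketch below shows the tokenizer swap and embedding re-initialization using plain `transformers` calls. It is a simplified stand-in, not the FOCUS method itself: shared tokens are copied over and novel tokens are left randomly initialized, whereas FOCUS initializes novel tokens from the embeddings of semantically similar overlapping tokens. The tokenizer path is hypothetical; see the repository for the actual implementation.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Start from the multilingual source model and its original tokenizer.
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
source_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical path: a language-specific Hausa tokenizer (50k vocabulary).
target_tokenizer = AutoTokenizer.from_pretrained("path/to/hausa-tokenizer")

# Keep a copy of the old embedding matrix, then resize to the new vocabulary.
old_embeddings = model.get_input_embeddings().weight.detach().clone()
model.resize_token_embeddings(len(target_tokenizer))
new_embeddings = model.get_input_embeddings().weight.data

# Tokens shared between both vocabularies keep their pretrained embedding.
source_vocab = source_tokenizer.get_vocab()
for token, new_id in target_tokenizer.get_vocab().items():
    if token in source_vocab:
        new_embeddings[new_id] = old_embeddings[source_vocab[token]]
# Novel tokens keep their random initialization in this simplified sketch;
# FOCUS instead initializes them from embeddings of similar overlapping
# tokens (see the paper and repository).
```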

## Disclaimer
The web-scale dataset used for pretraining and tokenizer training (CC100) might contain personal and sensitive information.
This risk needs to be assessed carefully before any real-world deployment of the model. Also, the tokenizer training was conducted with a sentencepiece `character_coverage` of 100%. As a result, the vocabulary contains characters that are not usually used in Hausa.
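
For context, `character_coverage` is the sentencepiece training parameter that controls what fraction of the corpus's characters must be representable by the vocabulary. A minimal sketch of such a training call with 100% coverage (file paths are hypothetical); a lower value would exclude the rarest characters:

```python
import sentencepiece as spm

# Hypothetical file path; the actual tokenizer was trained on Hausa CC100 data.
spm.SentencePieceTrainer.train(
    input="hausa_cc100.txt",
    model_prefix="hausa_sp",
    vocab_size=50_000,
    character_coverage=1.0,  # 100%: every character seen in the corpus is kept
)
```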

## Citation
Please cite FOCUS as follows:

```bibtex
@misc{dobler-demelo-2023-focus,
    title={FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models},
    author={Konstantin Dobler and Gerard de Melo},
    year={2023},
    eprint={2305.14481},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```