konstantindobler committed
Commit 3bb6ff3
1 Parent(s): 7402d18

Upload README.md with huggingface_hub

Files changed (1): README.md +46 -0

README.md ADDED
---
language: ha
license: mit
datasets: cc100
---

# xlm-roberta-base-focus-hausa

XLM-R adapted to Hausa using "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models".

Code: https://github.com/konstantinjdobler/focus

Paper: https://arxiv.org/abs/2305.14481

## Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("konstantindobler/xlm-roberta-base-focus-hausa")
model = AutoModelForMaskedLM.from_pretrained("konstantindobler/xlm-roberta-base-focus-hausa")

# Use model and tokenizer as usual
```
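
As a quick sanity check, the model can also be queried through the `fill-mask` pipeline. A minimal sketch; the Hausa prompt is only an illustrative assumption:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="konstantindobler/xlm-roberta-base-focus-hausa")

# Illustrative Hausa prompt ("Ina son <mask>." ~ "I want <mask>.");
# the pipeline returns the top vocabulary candidates for the masked token.
predictions = unmasker("Ina son <mask>.")
print(predictions)
```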

## Details
The model is based on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) and was adapted to Hausa.
The original multilingual tokenizer was replaced by a language-specific Hausa tokenizer with a vocabulary of 50k tokens. The new embeddings were initialized with FOCUS.
The model was then trained on data from CC100 for 390k optimizer steps. More details and hyperparameters can be found [in the paper](https://arxiv.org/abs/2305.14481).
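
To make the adaptation recipe concrete, the sketch below shows the tokenizer swap and embedding re-initialization using plain `transformers` calls. It is a simplified stand-in, not the FOCUS method itself: shared tokens are copied over and novel tokens are left randomly initialized, whereas FOCUS initializes novel tokens from the embeddings of semantically similar overlapping tokens. The tokenizer path is hypothetical; see the repository for the actual implementation.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Start from the multilingual source model and its original tokenizer.
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
source_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical path: a language-specific Hausa tokenizer (50k vocabulary).
target_tokenizer = AutoTokenizer.from_pretrained("path/to/hausa-tokenizer")

# Keep a copy of the old embedding matrix, then resize to the new vocabulary.
old_embeddings = model.get_input_embeddings().weight.detach().clone()
model.resize_token_embeddings(len(target_tokenizer))
new_embeddings = model.get_input_embeddings().weight.data

# Tokens shared between both vocabularies keep their pretrained embedding.
source_vocab = source_tokenizer.get_vocab()
for token, new_id in target_tokenizer.get_vocab().items():
    if token in source_vocab:
        new_embeddings[new_id] = old_embeddings[source_vocab[token]]
# Novel tokens keep their random initialization in this simplified sketch;
# FOCUS instead initializes them from embeddings of similar overlapping
# tokens (see the paper and repository).
```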

## Disclaimer
The web-scale dataset used for pretraining and tokenizer training (CC100) might contain personal and sensitive information.
This risk needs to be assessed carefully before any real-world deployment of the model. Also, the tokenizer training was conducted with a sentencepiece `character_coverage` of 100%. As a result, the vocabulary contains characters that are not usually used in Hausa.
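
For context, `character_coverage` is the sentencepiece training parameter that controls what fraction of the corpus's characters must be representable by the vocabulary. A minimal sketch of such a training call with 100% coverage (file paths are hypothetical); a lower value would exclude the rarest characters:

```python
import sentencepiece as spm

# Hypothetical file path; the actual tokenizer was trained on Hausa CC100 data.
spm.SentencePieceTrainer.train(
    input="hausa_cc100.txt",
    model_prefix="hausa_sp",
    vocab_size=50_000,
    character_coverage=1.0,  # 100%: every character seen in the corpus is kept
)
```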

## Citation
Please cite FOCUS as follows:

```bibtex
@misc{dobler-demelo-2023-focus,
    title={FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models},
    author={Konstantin Dobler and Gerard de Melo},
    year={2023},
    eprint={2305.14481},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```