kelechi committed
Commit 57bf48b
1 Parent(s): 77451c1

initial commit

README.md ADDED
@@ -0,0 +1,57 @@
+ Hugging Face's logo
+ ---
+ language:
+ - om
+ - am
+ - rw
+ - rn
+ - ha
+ - ig
+ - pcm
+ - so
+ - sw
+ - ti
+ - yo
+ - multilingual
+
+ ---
+ # afriberta_base
+ ## Model description
+ AfriBERTa base is a pretrained multilingual language model with around 111 million parameters.
+ The model has 8 layers, 6 attention heads, 768 hidden units and a feed-forward size of 3072.
+ The model was pretrained on 11 African languages, namely Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá.
+ The model has been shown to obtain competitive downstream performance on text classification and Named Entity Recognition on several African languages, including those it was not pretrained on.
+
+
+ ## Intended uses & limitations
+
+ #### How to use
+ You can use this model with Transformers for any downstream task.
+ For example, assuming we want to fine-tune this model on a token classification task, we do the following:
+
+ ```python
+ >>> from transformers import AutoTokenizer, AutoModelForTokenClassification
+ >>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_base")
+ >>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_base")
+ # set the model max length manually: the tokenizer is an imported SentencePiece model, which Hugging Face does not fully support yet
+ >>> tokenizer.model_max_length = 512
+ ```
+
+ #### Limitations and bias
+ - This model is possibly limited by its training data, which consists mostly of news articles from a specific span of time. Thus, it may not generalize well.
+ - This model was trained on very little data (less than 1 GB), so it may not have seen enough data to learn very complex linguistic relations.
+
+ ## Training data
+ The model was trained on an aggregation of datasets from the BBC news website and Common Crawl.
+
+ ## Training procedure
+ For information on training procedures, please refer to the AfriBERTa [paper]() or [repository](https://github.com/keleog/afriberta).
+
+ ### BibTeX entry and citation info
+ ```
+ Kelechi Ogueji, Yuxin Zhu, Jimmy Lin.
+ Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages.
+ Proceedings of the 1st Workshop on Multilingual Representation Learning at EMNLP 2021.
+ ```
+
+
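Beyond the token-classification example in the README above, the checkpoint's own masked-LM head (the `XLMRobertaForMaskedLM` architecture declared in config.json below) can be exercised directly with the fill-mask pipeline. This is a minimal sketch, not part of the committed files: the model id is the one quoted in the README, while the Swahili example sentence and any printed predictions are purely illustrative assumptions.

```python
from transformers import pipeline

# a minimal sketch: query the pretrained masked-LM head directly
# (model id taken from the README; the example sentence is illustrative)
fill_mask = pipeline(
    "fill-mask",
    model="castorini/afriberta_base",
    tokenizer="castorini/afriberta_base",
)

# Swahili: "Nairobi is the <mask> city in Kenya."
for prediction in fill_mask("Nairobi ni mji <mask> nchini Kenya."):
    print(prediction["token_str"], round(prediction["score"], 3))
```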
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+ "_name_or_path": "/Users/kelechogueji/Downloads/afriberta_base",
+ "architectures": [
+ "XLMRobertaForMaskedLM"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "bos_token_id": 0,
+ "eos_token_id": 2,
+ "gradient_checkpointing": false,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-05,
+ "max_length": 512,
+ "max_position_embeddings": 514,
+ "model_type": "xlm-roberta",
+ "num_attention_heads": 6,
+ "num_hidden_layers": 8,
+ "output_past": true,
+ "pad_token_id": 1,
+ "position_embedding_type": "absolute",
+ "transformers_version": "4.2.1",
+ "type_vocab_size": 1,
+ "use_cache": true,
+ "vocab_size": 70006
+ }
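The hyperparameters in this config line up with the figures quoted in the README (8 layers, 6 attention heads, 768 hidden units, feed-forward size 3072, a 70,006-token vocabulary). As a quick sanity check, a minimal sketch assuming the hosted checkpoint id from the README:

```python
from transformers import AutoConfig

# load the published config and compare it against the README's figures
config = AutoConfig.from_pretrained("castorini/afriberta_base")

assert config.model_type == "xlm-roberta"
assert config.num_hidden_layers == 8      # layers
assert config.num_attention_heads == 6    # attention heads
assert config.hidden_size == 768          # hidden units
assert config.intermediate_size == 3072   # feed-forward size
assert config.vocab_size == 70006         # SentencePiece vocabulary
print("config matches the model card")
```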
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1b5507320b2175f5aa01cfbb9cd853fdf92dee491dde2359eae8919eb583a8a
+ size 446168989
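The LFS pointer reports a checkpoint of about 446 MB, which is consistent with the README's figure of roughly 111 million parameters stored as 32-bit floats (111e6 × 4 bytes ≈ 446 MB). A minimal sketch to verify the count, assuming the hosted checkpoint id from the README (this downloads the full weights):

```python
from transformers import AutoModelForMaskedLM

# load the masked-LM checkpoint and count its parameters
model = AutoModelForMaskedLM.from_pretrained("castorini/afriberta_base")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # expected to be roughly 111M per the README
```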
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6419b3044bff45e94e0553cbb81425fd06046e9294b33555e23fdc69377dba6f
+ size 1554839
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": "<mask>"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "spm_models/spm_model_final_70k"}