---
license: cc-by-4.0
language:
- sw
---


BERT base (cased) model trained on a 125M-token subset of cc100-Swahili for our work [Scaling Laws for BERT in Low-Resource Settings](https://aclanthology.org/2023.findings-acl.492.pdf), published in the Findings of ACL 2023.

The model has 124M parameters (12 layers) and a vocabulary size of 50K.
It was trained for 500K steps with a sequence length of 512 tokens and a batch size of 256.
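
The checkpoint can be loaded from the Hugging Face Hub with the `transformers` library. Below is a minimal usage sketch that queries the pretrained masked-language-modeling head; the Swahili example sentence and the `[MASK]` token (the standard mask token for cased BERT tokenizers) are illustrative assumptions, not taken from the paper:

```python
from transformers import pipeline

# Load the pretrained checkpoint and its masked-LM head from the Hub.
fill_mask = pipeline("fill-mask", model="orai-nlp/bert-base-sw")

# Illustrative Swahili prompt: "Nairobi is the capital city of [MASK]."
for prediction in fill_mask("Nairobi ni mji mkuu wa [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```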

Results
-----------
|           | [bert-base-sw](https://huggingface.co/orai-nlp/bert-base-sw) | [bert-medium-sw](https://huggingface.co/orai-nlp/bert-medium-sw) | Flair | [mBERT](https://huggingface.co/bert-base-multilingual-cased) | [SwahBERT](https://github.com/gatimartin/SwahBERT#pre-trained-models) |
|-----------|--------------|----------------|-------|-------|---------------------------------|
| NERC      | **92.09**    | 91.63          | 92.04 | 91.17 | 88.60                           |
| Topic     | **93.07**    | 92.88          | 91.83 | 91.52 | 90.90                           |
| Sentiment | **79.04**    | 77.07          | 73.60 | 69.17 | 71.12                           |
| QNLI      | 63.34        | 63.87          | 52.82 | 63.48 | **64.72**                       |
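
The scores above come from fine-tuning the pretrained encoder on each downstream task. As a rough illustration (the paper's exact fine-tuning hyperparameters and task datasets are not reproduced here; the label count and example sentence are placeholders), a sequence-classification head can be attached like this:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Attach a randomly initialized sequence-classification head on top of the
# pretrained encoder; num_labels=2 is a placeholder (e.g., binary sentiment).
tokenizer = AutoTokenizer.from_pretrained("orai-nlp/bert-base-sw")
model = AutoModelForSequenceClassification.from_pretrained(
    "orai-nlp/bert-base-sw", num_labels=2
)

# Tokenize a placeholder Swahili sentence ("This movie is very good.") and
# run a forward pass; in practice the head is trained on task data first.
batch = tokenizer(
    ["Filamu hii ni nzuri sana."],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))
```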


Authors
-----------
Gorka Urbizu [1], Iñaki San Vicente [1], Xabier Saralegi [1],
Rodrigo Agerri [2] and Aitor Soroa [2]

Affiliations of the authors:

[1] Orai NLP Technologies

[2] HiTZ Center - Ixa, University of the Basque Country UPV/EHU



Licensing
-------------

The model is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

To view a copy of this license, visit [http://creativecommons.org/licenses/by/4.0/](https://creativecommons.org/licenses/by/4.0/).




Acknowledgements
-------------------
If you use this model, please cite the following paper:

- G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. Scaling Laws for BERT in Low-Resource Settings. Findings of the Association for Computational Linguistics: ACL 2023. July 2023, Toronto, Canada.



Contact information
-----------------------
Gorka Urbizu, Iñaki San Vicente: {g.urbizu,i.sanvicente}@orai.eus