---
language: zu
---

# zuBERTa
zuBERTa is a RoBERTa-style transformer language model trained on Zulu text.

## Intended uses & limitations
The model can be used to produce embeddings for downstream tasks such as question answering.
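As a minimal sketch of the embedding use case, the snippet below mean-pools the final hidden states into one vector per sentence. Mean pooling is only one common choice and is an assumption here, not a method documented for this model. The helper names `mean_pool` and `embed_sentences` are hypothetical, and the sketch assumes the tokenizer defines a padding token.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_sentences(sentences: List[str],
                    model_name: str = "MoseliMotsoehli/zuBERTa") -> List[List[float]]:
    """Return one mean-pooled vector per sentence (downloads the model on first call)."""
    # Imported lazily so mean_pool stays usable without torch/transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder only, no LM head
    model.eval()

    # Assumption: the tokenizer has a pad token, so sentences can be batched.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden_size)

    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling `embed_sentences(["Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo."])` would return a single vector that can be fed to a downstream classifier or retrieval index.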

#### How to use

```python
>>> from transformers import pipeline
>>> # AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> model = AutoModelForMaskedLM.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker("Abafika eNkandla bafika sebeholwa <mask> uMpongo kaZingelwayo.")

[
  {
    "sequence": "<s>Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo.</s>",
    "score": 0.050459690392017365,
    "token": 555,
    "token_str": "Ġkhona"
  },
  {
    "sequence": "<s>Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo.</s>",
    "score": 0.03668094798922539,
    "token": 2321,
    "token_str": "Ġinkosi"
  },
  {
    "sequence": "<s>Abafika eNkandla bafika sebeholwa ubukhosi uMpongo kaZingelwayo.</s>",
    "score": 0.028774697333574295,
    "token": 5101,
    "token_str": "Ġubukhosi"
  }
]
```
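In the output above, each `token_str` begins with `Ġ`, the marker that byte-level BPE tokenizers use for a leading space. The sketch below shows one way to turn such predictions into plain `(word, score)` pairs; `clean_predictions` is a hypothetical helper, and the sample values are copied from the output above.

```python
def clean_predictions(predictions):
    """Return (word, score) pairs, best first, with the BPE space marker removed."""
    pairs = [
        (p["token_str"].replace("Ġ", " ").strip(), p["score"])
        for p in predictions
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Sample predictions copied from the fill-mask output shown above.
sample = [
    {"token_str": "Ġinkosi", "score": 0.03668094798922539},
    {"token_str": "Ġkhona", "score": 0.050459690392017365},
    {"token_str": "Ġubukhosi", "score": 0.028774697333574295},
]

for word, score in clean_predictions(sample):
    print(f"{word}\t{score:.4f}")
```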

## Training data

1. 30k sentences from the 2018 Zulu corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), collected from news articles and creative writing.
2. ~7,500 articles of human-generated translations, scraped from the Zulu [Wikipedia](https://zu.wikipedia.org/wiki/Special:AllPages).

### BibTeX entry and citation info

```bibtex
@inproceedings{motsoehli2020towards,
  author = {Moseli Motsoehli},
  title = {Towards transformation of Southern African language models through transformers.},
  year = {2020}
}
```