---
language: zu
---

# zuBERTa

zuBERTa is a RoBERTa-style transformer language model trained on Zulu text.

## Intended uses & limitations

The model can be used to obtain embeddings for a downstream task such as question answering.

#### How to use

```python
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> model = AutoModelForMaskedLM.from_pretrained("MoseliMotsoehli/zuBERTa")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> # The input must contain the tokenizer's mask token at the position to predict.
>>> unmasker("Abafika eNkandla bafika sebeholwa <mask> uMpongo kaZingelwayo.")

[
  {
    "sequence": "Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo.",
    "score": 0.050459690392017365,
    "token": 555,
    "token_str": "Ġkhona"
  },
  {
    "sequence": "Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo.",
    "score": 0.03668094798922539,
    "token": 2321,
    "token_str": "Ġinkosi"
  },
  {
    "sequence": "Abafika eNkandla bafika sebeholwa ubukhosi uMpongo kaZingelwayo.",
    "score": 0.028774697333574295,
    "token": 5101,
    "token_str": "Ġubukhosi"
  }
]
```

## Training data

1. 30k sentences from the Zulu 2018 corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download). These were collected from news articles and creative writings.
2. ~7500 articles of human-generated translations, scraped from the Zulu [Wikipedia](https://zu.wikipedia.org/wiki/Special:AllPages).

### BibTeX entry and citation info

```bibtex
@inproceedings{motsoehli2020,
  author = {Moseli Motsoehli},
  title = {Towards transformation of Southern African language models through transformers.},
  year = {2020}
}
```
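
#### Extracting embeddings (sketch)

The intended use above mentions embeddings for downstream tasks, but the card does not say how they should be computed. The following is a minimal sketch of one common approach, assuming masked mean pooling over the final hidden states is acceptable for your task. The helper names `mean_pool` and `embed` are hypothetical and not part of this model or of `transformers`. The pooling is written in plain Python so the arithmetic is easy to follow; in practice you would likely do the same with tensor operations.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(sentences, model_name="MoseliMotsoehli/zuBERTa"):
    """Return one mean-pooled embedding (a list of floats) per sentence.

    Assumption: mean pooling is a reasonable sentence representation here;
    the model card itself does not prescribe a pooling strategy.
    """
    # Imported lazily so mean_pool can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder only, no LM head
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling `embed(["Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo."])` downloads the checkpoint on first use and returns a list with one vector per input sentence. Those vectors can then be fed to a downstream classifier or retrieval component.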