	---
	language: zu
	---

	# zuBERTa
	zuBERTa is a RoBERTa-style transformer language model trained on Zulu text.

	## Intended uses & limitations
	The model can be used to produce embeddings for a downstream task such as question answering.
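As a rough illustration of the embedding use case, the sketch below mean-pools the final hidden states into one vector per sentence. Mean pooling is an assumption chosen for simplicity, not a method stated by the model author, and `mean_pool` and `embed_sentences` are hypothetical helper names. Only the model id `MoseliMotsoehli/zuBERTa` comes from this card.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_sentences(sentences: List[str],
                    model_name: str = "MoseliMotsoehli/zuBERTa") -> List[List[float]]:
    """Return one mean-pooled embedding per sentence (assumed pooling strategy)."""
    # Imported lazily so mean_pool stays usable without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder only, no LM head
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden_size)

    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling `embed_sentences(["Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo."])` downloads the model on first use and returns a list containing one vector. Whether mean pooling, the first-token vector, or fine-tuning works best depends on the downstream task.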

	#### How to use

	```python
	>>> from transformers import pipeline
	>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

	>>> # Load the tokenizer and the model with its masked-language-modeling head
	>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/zuBERTa")
	>>> model = AutoModelForMaskedLM.from_pretrained("MoseliMotsoehli/zuBERTa")
	>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
	>>> unmasker("Abafika eNkandla bafika sebeholwa <mask> uMpongo kaZingelwayo.")

	[
	{
	"sequence": "<s>Abafika eNkandla bafika sebeholwa khona uMpongo kaZingelwayo.</s>",
	"score": 0.050459690392017365,
	"token": 555,
	"token_str": "Ġkhona"
	},
	{
	"sequence": "<s>Abafika eNkandla bafika sebeholwa inkosi uMpongo kaZingelwayo.</s>",
	"score": 0.03668094798922539,
	"token": 2321,
	"token_str": "Ġinkosi"
	},
	{
	"sequence": "<s>Abafika eNkandla bafika sebeholwa ubukhosi uMpongo kaZingelwayo.</s>",
	"score": 0.028774697333574295,
	"token": 5101,
	"token_str": "Ġubukhosi"
	}
	]
	```
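The `Ġ` at the start of each `token_str` is the byte-level BPE marker for a leading space, which RoBERTa-style tokenizers use. A minimal sketch for turning such results into plain `(word, score)` pairs follows. `clean_predictions` is a hypothetical helper, not part of `transformers`, and the sample entries are copied from the output above.

```python
from typing import Dict, List, Tuple


def clean_predictions(predictions: List[Dict], top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (word, score) pairs with the BPE space marker removed."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [
        (p["token_str"].replace("Ġ", " ").strip(), p["score"])
        for p in ranked[:top_k]
    ]


# Entries copied from the fill-mask output shown above.
sample = [
    {"score": 0.050459690392017365, "token": 555, "token_str": "Ġkhona"},
    {"score": 0.03668094798922539, "token": 2321, "token_str": "Ġinkosi"},
    {"score": 0.028774697333574295, "token": 5101, "token_str": "Ġubukhosi"},
]

for word, score in clean_predictions(sample):
    print(f"{word}\t{score:.4f}")
```

The same function can be applied directly to the list returned by `unmasker(...)`, since it reads only the `score` and `token_str` keys.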

	## Training data

	1. 30k sentences from the 2018 Zulu corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), collected from news articles and creative writing.
	2. ~7,500 articles of human-generated translations scraped from the Zulu [Wikipedia](https://zu.wikipedia.org/wiki/Special:AllPages).

	### BibTeX entry and citation info

	```bibtex
	@inproceedings{motsoehli2020,
	author = {Moseli Motsoehli},
	title = {Towards transformation of Southern African language models through transformers.},
	year = {2020}
	}
	```