TswanaBert

Pretrained model on the Tswana language using a masked language modeling (MLM) objective.

Model description

TswanaBERT is a transformer model pretrained on a corpus of Setswana in a self-supervised fashion: part of the input tokens are masked, and the model is trained to predict them using byte-level tokens.
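To make the objective concrete, here is a minimal sketch (not the actual pretraining script) of a single MLM step: one token is replaced by the mask token, and the loss is computed only at that position.

>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> model = AutoModelForMaskedLM.from_pretrained("MoseliMotsoehli/TswanaBert")

>>> inputs = tokenizer("Ntshopotse setse e godile.", return_tensors="pt")
>>> labels = torch.full_like(inputs["input_ids"], -100)  # positions labeled -100 are ignored by the loss
>>> labels[0, 2] = inputs["input_ids"][0, 2]             # supervise only the position we mask
>>> inputs["input_ids"][0, 2] = tokenizer.mask_token_id  # replace that token with <mask>
>>> outputs = model(**inputs, labels=labels)
>>> outputs.loss                                         # cross-entropy at the masked position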

Intended uses & limitations

The model can be used for either masked language modeling or next-word prediction. It can also be fine-tuned on a specific downstream NLP task.
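As an illustration of fine-tuning, the sketch below adapts TswanaBERT to a hypothetical binary text-classification task; the texts, labels, and output directory are placeholders, not a real dataset.

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
# The MLM head is discarded and a freshly initialized classification head is added.
model = AutoModelForSequenceClassification.from_pretrained(
    "MoseliMotsoehli/TswanaBert", num_labels=2
)

# Placeholder data; substitute a real labeled Setswana dataset.
texts = ["Ntshopotse setse e godile."] * 8
labels = [0, 1] * 4
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tswanabert-finetuned", num_train_epochs=1),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()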

How to use

>>> from transformers import pipeline
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> model = AutoModelForMaskedLM.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker("Ntshopotse <mask> e godile.")

[{'score': 0.32749542593955994,
  'sequence': '<s>Ntshopotse setse e godile.</s>',
  'token': 538,
  'token_str': 'Ġsetse'},
 {'score': 0.060260992497205734,
  'sequence': '<s>Ntshopotse le e godile.</s>',
  'token': 270,
  'token_str': 'Ġle'},
 {'score': 0.058460816740989685,
  'sequence': '<s>Ntshopotse bone e godile.</s>',
  'token': 364,
  'token_str': 'Ġbone'},
 {'score': 0.05694682151079178,
  'sequence': '<s>Ntshopotse ga e godile.</s>',
  'token': 298,
  'token_str': 'Ġga'},
 {'score': 0.0565204992890358,
  'sequence': '<s>Ntshopotse, e godile.</s>',
  'token': 16,
  'token_str': ','}]
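
The same top predictions can be recovered without the pipeline by reading the logits at the masked position directly (a minimal sketch, reusing the tokenizer and model loaded above):

>>> import torch
>>> inputs = tokenizer("Ntshopotse <mask> e godile.", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
>>> probs = logits[0, mask_pos].softmax(dim=-1)
>>> top = probs.topk(5)
>>> tokenizer.convert_ids_to_tokens(top.indices.tolist())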

Limitations and bias

The model is trained on a relatively small collection of Setswana text, mostly from news articles and creative writing, and so is not yet fully representative of the language.

Training data

  1. The largest portion of this dataset, 10k sentences of text, comes from the Leipzig Corpora Collection.

  2. I then added SABC news headlines collected by Vukosi Marivate and Tshephisho Sefara (2020), generously made available on Zenodo. This added 185 Setswana sentences to the corpus.

  3. I went on to add 300 more sentences by scraping news sites and blogs that mostly originate in Botswana. I actively continue to expand the dataset.

BibTeX entry and citation info

@inproceedings{motsoehli2020tswanabert,
  author = {Moseli Motsoehli},
  year   = {2020}
}