PuoBERTa-News: A Setswana Language Model Finetuned for News Categorisation

Give Feedback 📑: DSFSI Resource Feedback Form

A RoBERTa-based language model finetuned for news categorisation in Setswana.

Based on https://huggingface.co/dsfsi/PuoBERTa

Model Details

Model Description

This is a News Categorisation model for Setswana.

  • Developed by: Vukosi Marivate (@vukosi), Moseli Mots'Oehli (@MoseliMotsoehli), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
  • Model type: RoBERTa Model
  • Language(s) (NLP): Setswana
  • License: CC BY 4.0

News Categories

The categories follow the IPTC news codes: https://iptc.org/standards/newscodes/

  1. arts_culture_entertainment_and_media (Botsweretshi, setso, boitapoloso le bobegakgang)
  2. crime_law_and_justice (Bosenyi, molao le bosiamisi)
  3. disaster_accident_and_emergency_incident (Masetlapelo, kotsi le tiragalo ya maemo a tshoganyetso)
  4. economy_business_and_finance (Ikonomi, tsa kgwebo le tsa ditšhelete)
  5. education (Thuto)
  6. environment (Tikologo)
  7. health (Boitekanelo)
  8. politics (Dipolotiki)
  9. religion_and_belief (Bodumedi le tumelo)
  10. society (Setšhaba)
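
The category list above can be kept as a small lookup table, for example to show the Setswana name next to a predicted label. This is a minimal sketch: the keys and Setswana names are copied from the list, but the names `IPTC_CATEGORIES` and `setswana_name` are hypothetical, and the list order is not guaranteed to match the label ids stored in the checkpoint, so check the model configuration before relying on positions.

```python
# Category keys and Setswana names copied from the list above.
# Assumption: the order here is for display only; it may not match
# the id-to-label mapping stored in the model's configuration.
IPTC_CATEGORIES = [
    ("arts_culture_entertainment_and_media",
     "Botsweretshi, setso, boitapoloso le bobegakgang"),
    ("crime_law_and_justice", "Bosenyi, molao le bosiamisi"),
    ("disaster_accident_and_emergency_incident",
     "Masetlapelo, kotsi le tiragalo ya maemo a tshoganyetso"),
    ("economy_business_and_finance", "Ikonomi, tsa kgwebo le tsa ditšhelete"),
    ("education", "Thuto"),
    ("environment", "Tikologo"),
    ("health", "Boitekanelo"),
    ("politics", "Dipolotiki"),
    ("religion_and_belief", "Bodumedi le tumelo"),
    ("society", "Setšhaba"),
]

# Dictionary view for constant-time lookup by category key.
_SETSWANA_BY_KEY = dict(IPTC_CATEGORIES)


def setswana_name(label: str) -> str:
    """Return the Setswana name for a category key, or the key itself if unknown."""
    return _SETSWANA_BY_KEY.get(label, label)
```

Unknown labels are returned unchanged, so the helper stays usable if the checkpoint emits label strings in a different form.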

The training, dev and validation splits come from the Daily News Dikgang dataset: https://huggingface.co/datasets/dsfsi/daily-news-dikgang

Model Performance

Performance of models on Daily News Dikgang dataset

Model                         5-fold Cross Validation F1   Test F1
Logistic Regression + TFIDF   60.1                         56.2
NCHLT TSN RoBERTa             64.7                         60.3
PuoBERTa                      63.8                         62.9
PuoBERTaJW300                 66.2                         65.4

Usage

Use this model for news text classification in Setswana.
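
One way to try the model is through the Transformers text-classification pipeline. The sketch below assumes that `transformers` and `torch` are installed and that the checkpoint `dsfsi/PuoBERTa-News` loads with the standard pipeline. The helper names `build_classifier`, `top_category` and `demo` are hypothetical, and the exact label strings returned depend on the checkpoint's configuration.

```python
# Sketch only: assumes `transformers` and `torch` are installed and that the
# checkpoint loads with the standard text-classification pipeline.
MODEL_ID = "dsfsi/PuoBERTa-News"


def build_classifier(model_id: str = MODEL_ID):
    """Create a text-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily to keep module import cheap

    return pipeline("text-classification", model=model_id)


def top_category(predictions):
    """Return the (label, score) pair with the highest score.

    `predictions` is a list of {"label": ..., "score": ...} dicts, which is
    the shape the pipeline returns for a single input string.
    """
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


def demo():
    """Classify one placeholder text and print the best category."""
    classifier = build_classifier()
    # Placeholder input reusing a category name from this card;
    # replace it with a real Setswana news headline or article.
    text = "Bosenyi, molao le bosiamisi"
    label, score = top_category(classifier(text))
    print(f"{label}: {score:.3f}")
```

Calling `demo()` downloads the model and prints the predicted category with its score. `top_category` is kept separate so it can be reused on pipeline output that contains scores for several labels.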


Citation Information

BibTeX Reference

@inproceedings{marivate2023puoberta,
  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year    = {2023},
  booktitle= {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
  url= {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
  keywords = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}

Contributing

Your contributions are welcome! Feel free to improve the model.

Model Card Authors

Vukosi Marivate

Model Card Contact

For more details, reach out or check our website.

Email: vukosi.marivate@cs.up.ac.za

Enjoy exploring Setswana through AI!

Model size: 83.5M params (Safetensors), tensor type F32