Model Description

As part of the iTANONG project's work on a 10 billion-token Tagalog dataset, we introduce our initial pre-trained language models for Philippine languages. The model suite includes BERT-based, GPT-based, and Sentence Transformers models tailored for Tagalog, Taglish, and Cebuano.
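The following is a minimal sketch of how a checkpoint from this suite might be queried with the Hugging Face `transformers` library. It assumes the repository identifier `dost-asti/RoBERTa-ceb-cased`, assumes the checkpoint ships a masked-language-modeling head usable by the `fill-mask` pipeline, and uses an illustrative Cebuano sentence that is not taken from the training data. The helper names `make_prompt` and `predict_masked_word` are hypothetical.

```python
# Assumed Hub repository id for this model card.
MODEL_ID = "dost-asti/RoBERTa-ceb-cased"


def make_prompt(mask_token: str) -> str:
    """Build an illustrative Cebuano sentence with one masked word."""
    return f"Maayong {mask_token} sa tanan."


def predict_masked_word(model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    Assumes the checkpoint can be loaded by the fill-mask pipeline.
    Calling this function downloads the model weights from the Hub.
    """
    # Imported lazily so the helpers above can be used without transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Read the mask token from the tokenizer instead of hard-coding it.
    prompt = make_prompt(fill_mask.tokenizer.mask_token)
    predictions = fill_mask(prompt, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

Calling `predict_masked_word()` should return the highest-scoring candidate tokens for the masked word. Reading the mask token from the tokenizer avoids guessing which special token this checkpoint was trained with.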

Training Details

This model was trained on an NVIDIA V100 32GB GPU at the DOST-ASTI Computing and Archiving Research Environment (COARE): https://asti.dost.gov.ph/projects/coare/

Training Data

The training dataset was compiled from both formal and informal sources: 194,001 instances from formal sources and 1,816,735 instances from informal sources. More information on pre-processing and training parameters is available in our paper.
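To make the balance between the two source types concrete, the short calculation below derives the total size and the share of each source from the two counts stated above. The variable names are illustrative only.

```python
# Instance counts as stated in this model card.
formal_instances = 194_001
informal_instances = 1_816_735

total_instances = formal_instances + informal_instances
formal_share = formal_instances / total_instances
informal_share = informal_instances / total_instances

print(f"Total instances: {total_instances:,}")  # Total instances: 2,010,736
print(f"Formal share:    {formal_share:.2%}")   # Formal share:    9.65%
print(f"Informal share:  {informal_share:.2%}") # Informal share:  90.35%
```

Informal text therefore makes up roughly nine tenths of the training instances.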

Citation

Paper: iTANONG-DS: A Collection of Benchmark Datasets for Downstream Natural Language Processing Tasks on Select Philippine Languages

BibTeX:

@inproceedings{visperas-etal-2023-itanong,
    title = "i{TANONG}-{DS} : A Collection of Benchmark Datasets for Downstream Natural Language Processing Tasks on Select {P}hilippine Languages",
    author = "Visperas, Moses L.  and
      Borjal, Christalline Joie  and
      Adoptante, Aunhel John M  and
      Abacial, Danielle Shine R.  and
      Decano, Ma. Miciella  and
      Peramo, Elmer C",
    editor = "Abbas, Mourad  and
      Freihat, Abed Alhakim",
    booktitle = "Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023)",
    month = dec,
    year = "2023",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.icnlsp-1.34",
    pages = "316--323",
}