GPT-2 Tagalog

This is a prototype of the smallest GPT-2 variant, trained on a combination of the WikiText-TL-39 and NewsPH-Raw datasets. The provided checkpoint can be used for text generation as-is, but should be finetuned for more specific tasks or generation topics.

Usage

Weights are provided in both PyTorch and TensorFlow and can be used with ease via the HuggingFace Transformers library:

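from transformers import GPT2Tokenizer, GPT2Model

# Load the tokenizer and the base model (no language modeling head).
tokenizer = GPT2Tokenizer.from_pretrained('jcblaise/gpt2-tagalog')
model = GPT2Model.from_pretrained('jcblaise/gpt2-tagalog')

# "Replace this with your desired sentence."
s = "Palitan ito ng iyong nais na pangungusap."
s_in = tokenizer(s, return_tensors='pt')

# out.last_hidden_state has shape (batch, sequence_length, hidden_size).
out = model(**s_in)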
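
The snippet above loads GPT2Model, which returns hidden states rather than generated text. For text generation, a minimal sketch using GPT2LMHeadModel and its generate method might look like the following. The helper names load_generator and generate_text, as well as the sampling settings (top_k=50, top_p=0.95, max_new_tokens=50), are illustrative assumptions and not values recommended for this checkpoint.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def load_generator(name="jcblaise/gpt2-tagalog"):
    """Load the tokenizer and the model with a language modeling head.

    Hypothetical helper; downloads the checkpoint on first use.
    """
    tokenizer = GPT2Tokenizer.from_pretrained(name)
    model = GPT2LMHeadModel.from_pretrained(name)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    """Sample a continuation of `prompt` and return it as a string.

    The sampling parameters below are assumptions chosen for illustration.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,   # sample instead of greedy decoding
        top_k=50,         # assumed value
        top_p=0.95,       # assumed value
        # GPT-2 has no pad token, so reuse the end-of-text token.
        pad_token_id=tokenizer.eos_token_id,
    )
    # The returned sequence contains the prompt followed by the new tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, calling tokenizer, model = load_generator() and then generate_text(model, tokenizer, "Ang Pilipinas ay") would return the prompt (a hypothetical one, meaning "The Philippines is") followed by a sampled continuation. Because sampling is enabled, the output differs between runs.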

Limitations and Bias

The model was trained with two language modeling datasets for Tagalog:

  • WikiText-TL-39, which is sourced from a dump of Tagalog Wikipedia.
  • NewsPH, which is a dump of news articles from all available mainstream news outlets in the Philippines.

Due to the source of the training data, sentences generated out of the box may sound and read like actual news articles, sharing the tone and style common to such works. While they may look like news articles, they are not, and should not be read, understood, published, or shared as such. Language models do not inherently distinguish factual statements from non-factual ones, so we discourage use of the model in systems and use cases where the generated output is required to be true.

As this model is currently a prototype, bias has not been thoroughly studied. Models inherit the biases present in the data they are trained on. Things such as how frequently gender is associated with occupation can induce biases in the model that will remain undetected unless thoroughly tested. As with the original GPT-2 model, we recommend that this model not be deployed or used in systems that interact with humans unless a thorough study of potential biases is carried out.

We release this model with the intent that it may aid in the advancement of Filipino NLP, and that researchers and engineers who are interested in applying their work to the language may have a baseline model to use. For future work, in addition to studying inherent bias, we mainly plan to improve the quality of our models. As this is a prototype, a large-scale corpus was not used to train it. We plan to train larger GPT-2 models with larger corpora in the future.

Citations

This model is part of a much larger work in progress and, as such, does not have a citable paper at the moment. We will update this repository once a paper has been released.

For the datasets used to train the model, please cite the following papers:

@article{cruz2020investigating,
  title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
  author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
  journal={arXiv preprint arXiv:2010.11574},
  year={2020}
}

@article{cruz2019evaluating,
  title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
  author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
  journal={arXiv preprint arXiv:1907.00409},
  year={2019}
}

Data and Other Resources

The data used to train this model, as well as other benchmark datasets in Filipino, can be found on my website at https://blaisecruz.com

Contact

If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
