jcblaise committed on
Commit a93a67a
1 Parent(s): 40fc64e

Update README.md

Files changed (1)
  1. README.md +9 -31
README.md CHANGED
@@ -9,21 +9,7 @@ inference: false
 ---
 
 # GPT-2 Tagalog
- This is a prototype GPT-2 model of the smallest variant, trained using a combination of WikiText-TL-39 and the NewsPH-Raw datasets. The checkpoint provided can be used for text generation as-is, but should be finetuned for more specific tasks or generation topics.
-
- ## Usage
-
- Weights are provided in both PyTorch and TensorFlow and can be used with ease via the HuggingFace Transformers library:
-
- ```python
- from transformers import GPT2Tokenizer, GPT2Model
- tokenizer = GPT2Tokenizer.from_pretrained('jcblaise/gpt2-tagalog')
- model = GPT2Model.from_pretrained('jcblaise/gpt2-tagalog')
-
- s = "Palitan ito ng iyong nais na pangungusap."
- s_in = tokenizer(s, return_tensors='pt')
- out = model(**s_in)
- ```
+ This is the Tagalog GPT-2 model used to benchmark our fake news detection system in Cruz et al. (2020). We make available an improved version of our GPT-2 model trained with NewsPH in addition to WikiText-TL-39.
 
 ## Limitations and Bias
 The model was trained with two language modeling datasets for Tagalog:
@@ -37,23 +23,15 @@ As this model is currently a prototype, bias was not thoroughly studied. Models
 We release this model with the intent that it may aid in the advancement of Filipino NLP, and that researchers and engineers who are interested in applying their work to the language may have a baseline model to use. For future work, in addition to the study of inherent bias, we mainly look into improving the quality of our models. As this is a prototype, a large-scale corpus was not used to train it. We plan to train larger GPT-2 models with larger corpora in the future.
 
 ## Citations
- This model is part of a much larger work-in-progress, and as such, does not have a citeable paper at the moment. We will update this repository once a paper has been released.
-
- For the datasets used to train the model, please cite the following papers:
 
 ```bibtex
- @article{cruz2020investigating,
- title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
- author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
- journal={arXiv preprint arXiv:2010.11574},
- year={2020}
- }
-
- @article{cruz2019evaluating,
- title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
- author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
- journal={arXiv preprint arXiv:1907.00409},
- year={2019}
+ @inproceedings{localization2020cruz,
+ title={{Localization of Fake News Detection via Multitask Transfer Learning}},
+ author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
+ booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
+ pages={2589--2597},
+ year={2020},
+ url={https://www.aclweb.org/anthology/2020.lrec-1.315}
 }
 ```
 
@@ -61,4 +39,4 @@ For the datasets used to train the model, please cite the following papers:
 Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
 
 ## Contact
- If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
+ If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at me@blaisecruz.com
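
For reference, the Usage snippet removed by this commit can be adapted to the updated checkpoint. Below is a minimal sketch, not part of the commit itself, assuming the weights remain loadable through the standard GPT-2 classes in HuggingFace Transformers; it swaps the bare `GPT2Model` for `GPT2LMHeadModel` so that `generate()` can actually produce text:

```python
# Sketch only: assumes the 'jcblaise/gpt2-tagalog' checkpoint still loads with
# the standard GPT-2 classes, as in the snippet removed by this commit.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('jcblaise/gpt2-tagalog')
model = GPT2LMHeadModel.from_pretrained('jcblaise/gpt2-tagalog')

# "Palitan ito ng iyong nais na pangungusap." = "Replace this with your desired sentence."
prompt = "Palitan ito ng iyong nais na pangungusap."
inputs = tokenizer(prompt, return_tensors='pt')

# Sample a short continuation; GPT2LMHeadModel adds the language-modeling head
# that the bare GPT2Model (hidden states only) does not have.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As noted in the updated README, the checkpoint works for generation as-is but should be finetuned for more specific tasks or generation topics.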