jcblaise committed on
Commit a93a67a
1 Parent(s): 40fc64e

Update README.md

Files changed (1)
  1. README.md +9 -31
README.md CHANGED
@@ -9,21 +9,7 @@ inference: false
 ---
 
 # GPT-2 Tagalog
- This is a prototype GPT-2 model of the smallest variant, trained using a combination of WikiText-TL-39 and the NewsPH-Raw datasets. The checkpoint provided can be used for text generation as-is, but should be finetuned for more specific tasks or generation topics.
-
- ## Usage
-
- Weights are provided in both PyTorch and TensorFlow and can be used with ease via the HuggingFace Transformers library:
-
- ```python
- from transformers import GPT2Tokenizer, GPT2Model
- tokenizer = GPT2Tokenizer.from_pretrained('jcblaise/gpt2-tagalog')
- model = GPT2Model.from_pretrained('jcblaise/gpt2-tagalog')
-
- s = "Palitan ito ng iyong nais na pangungusap."
- s_in = tokenizer(s, return_tensors='pt')
- out = model(**s_in)
- ```
+ This is the Tagalog GPT-2 model used to benchmark our fake news detection system in Cruz et al. (2020). We make available an improved version of our GPT-2 model trained with NewsPH in addition to WikiText-TL-39.
 
 ## Limitations and Bias
 The model was trained with two language modeling datasets for Tagalog:
@@ -37,23 +23,15 @@ As this model is currently a prototype, bias was not thoroughly studied. Models
 We release this model with the intent that it may aid in the advancement of Filipino NLP, and that researchers and engineers who are interested in applying their work to the language may have a baseline model to use. For future work, in addition to the study of inherent bias, we mainly look into improving the quality of our models. As this is a prototype, a large-scale corpus was not used to train it. We plan to train larger GPT-2 models with larger corpora in the future.
 
 ## Citations
- This model is part of a much larger work-in-progress, and as such, does not have a citeable paper at the moment. We will update this repository once a paper has been released.
-
- For the datasets used to train the model, please cite the following papers:
 
 ```bibtex
- @article{cruz2020investigating,
- title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
- author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
- journal={arXiv preprint arXiv:2010.11574},
- year={2020}
- }
-
- @article{cruz2019evaluating,
- title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
- author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
- journal={arXiv preprint arXiv:1907.00409},
- year={2019}
+ @inproceedings{localization2020cruz,
+ title={{Localization of Fake News Detection via Multitask Transfer Learning}},
+ author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
+ booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
+ pages={2589--2597},
+ year={2020},
+ url={https://www.aclweb.org/anthology/2020.lrec-1.315}
 }
 ```
 
@@ -61,4 +39,4 @@ For the datasets used to train the model, please cite the following papers:
 Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
 
 ## Contact
- If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at jan_christian_cruz@dlsu.edu.ph
+ If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at me@blaisecruz.com
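
For reference, the Usage snippet removed by this commit can be adapted to the updated checkpoint. Below is a minimal sketch, not part of the commit itself, assuming the weights remain loadable through the standard GPT-2 classes in HuggingFace Transformers; it swaps the bare `GPT2Model` for `GPT2LMHeadModel` so that `generate()` can actually produce text:

```python
# Sketch only: assumes the 'jcblaise/gpt2-tagalog' checkpoint still loads with
# the standard GPT-2 classes, as in the snippet removed by this commit.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('jcblaise/gpt2-tagalog')
model = GPT2LMHeadModel.from_pretrained('jcblaise/gpt2-tagalog')

# "Palitan ito ng iyong nais na pangungusap." = "Replace this with your desired sentence."
prompt = "Palitan ito ng iyong nais na pangungusap."
inputs = tokenizer(prompt, return_tensors='pt')

# Sample a short continuation; GPT2LMHeadModel adds the language-modeling head
# that the bare GPT2Model (hidden states only) does not have.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As noted in the updated README, the checkpoint works for generation as-is but should be finetuned for more specific tasks or generation topics.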