Update README.md
README.md CHANGED
@@ -9,21 +9,7 @@ inference: false
 ---
 
 # GPT-2 Tagalog
-
-
-## Usage
-
-Weights are provided in both PyTorch and TensorFlow and can be used with ease via the HuggingFace Transformers library:
-
-```python
-from transformers import GPT2Tokenizer, GPT2Model
-tokenizer = GPT2Tokenizer.from_pretrained('jcblaise/gpt2-tagalog')
-model = GPT2Model.from_pretrained('jcblaise/gpt2-tagalog')
-
-s = "Palitan ito ng iyong nais na pangungusap."
-s_in = tokenizer(s, return_tensors='pt')
-out = model(**s_in)
-```
+The Tagalog GPT-2 model used to benchmark our fake news detection system in Cruz et al. (2020). We make available an improved version of our GPT-2 model trained with NewsPH in addition to WikiText-TL-39.
 
 ## Limitations and Bias
 The model was trained with two language modeling datasets for Tagalog:
@@ -37,23 +23,15 @@ As this model is currently a prototype, bias was not thoroughly studied. Models
 We release this model with the intent that it may aid in the advancement of Filipino NLP, and that researchers and engineers who are interested in applying their work to the language may have a baseline model to use. For future work, in addition to the study of inherent bias, we mainly look into improving the quality of our models. As this is a prototype, a large-scale corpus was not used to train it. We plan to train larger GPT-2 models with larger corpora in the future.
 
 ## Citations
-This model is part of a much larger work-in-progress, and as such, does not have a citeable paper at the moment. We will update this repository once a paper has been released.
-
-For the datasets used to train the model, please cite the following papers:
 
 ```bibtex
-@
-title={
-author={Jan Christian Blaise
-
-
-}
-
-@article{cruz2019evaluating,
-  title={Evaluating Language Model Finetuning Techniques for Low-resource Languages},
-  author={Cruz, Jan Christian Blaise and Cheng, Charibeth},
-  journal={arXiv preprint arXiv:1907.00409},
-  year={2019}
+@inproceedings{localization2020cruz,
+  title={{Localization of Fake News Detection via Multitask Transfer Learning}},
+  author={Cruz, Jan Christian Blaise and Tan, Julianne Agatha and Cheng, Charibeth},
+  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
+  pages={2589--2597},
+  year={2020},
+  url={https://www.aclweb.org/anthology/2020.lrec-1.315}
 }
 ```
 
@@ -61,4 +39,4 @@ For the datasets used to train the model, please cite the following papers:
 Data used to train this model as well as other benchmark datasets in Filipino can be found on my website at https://blaisecruz.com
 
 ## Contact
-If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at
+If you have questions, concerns, or if you just want to chat about NLP and low-resource languages in general, you may reach me through my work email at me@blaisecruz.com
|