---
language: id
widget:
- text: "Penelitian ini bertujuan untuk menentukan identitas invertebrata laut dari Perairan Papua dengan teknik DNA barcoding"
---

# Indonesian GPT-2 finetuned on Indonesian academic journals

This is the [Indonesian gpt2-small model](https://huggingface.co/flax-community/gpt2-small-indonesian) fine-tuned on abstracts of Indonesian academic journals. All training was done on a TPUv2-8 VM sponsored by [TPU Research Cloud](https://sites.research.google/trc/).

The demo can be found [here](https://huggingface.co/spaces/flax-community/gpt2-indonesian).

## How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='Galuh/id-journal-gpt2')
>>> set_seed(42)
>>> generator("Penelitian ini menggunakan teknik DNA barcoding untuk", max_length=30, num_return_sequences=5)

[{'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk mendeteksi perubahan genetik bakteri pada udang windu. Empat tahap telah dilakukan, meliputi preparasi media untuk larva,'},
{'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk identifikasi gen pengasil flavonoid. Data yang diperoleh dari hasil PCR diidentifikasi dengan teknik sekuensing'},
{'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk mengekstraksi fragmen DNA dari sampel kulit buaya dan tulang anjing, di mana proses ini melibatkan karakterisasi enzim yang'},
{'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk melakukan transformasi. Tahapan transformasi meliputi seleksi sel dengan urutan (2, 8, 16,..., 18) dan'},
{'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk amplifikasi genom DNA dengan menggunakan primer TG8226 dan TG806. Metode pol'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('Galuh/id-journal-gpt2')
model = GPT2Model.from_pretrained('Galuh/id-journal-gpt2')

text = "Ubah dengan teks apa saja."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
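
The `output` above holds one feature vector per input token in `output.last_hidden_state`, shaped `(batch, sequence_length, 768)` for the gpt2-small architecture. If you want a single sentence-level vector, one common approach (an assumption on our part, not something this model card prescribes) is to mean-pool over the token dimension:

```python
import torch

# Stand-in for output.last_hidden_state from the example above:
# (batch, seq_len, hidden), with hidden = 768 for gpt2-small.
last_hidden_state = torch.randn(1, 7, 768)

# Average the per-token features into a single sentence vector.
sentence_vector = last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```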

and in TensorFlow:

```python
from transformers import GPT2Tokenizer, TFGPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('Galuh/id-journal-gpt2')
model = TFGPT2Model.from_pretrained('Galuh/id-journal-gpt2')

text = "Ubah dengan teks apa saja."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

## Limitations and bias

This model is based on the [Indonesian gpt2-small model](https://huggingface.co/flax-community/gpt2-small-indonesian), so it is subject to the same [limitations and bias as the original model](https://huggingface.co/flax-community/gpt2-small-indonesian#limitations-and-bias). A more detailed bias analysis of this specific model is coming soon.

## Training data

The model was trained on a dataset of Indonesian journals, using only the abstracts. We extracted the abstracts with a script that finds any text located between the words "Abstrak" (abstract) and "Kata kunci" (keywords). The extraction script can be found [here](https://github.com/galuhsahid/id-journal-gpt2/). To separate the abstracts, we also add an end-of-text token (`<|endoftext|>`) between them.
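
The extraction step can be sketched roughly as follows; `extract_abstracts` and the regular expression are illustrative assumptions, not the actual script (see the linked repository for that):

```python
import re

def extract_abstracts(documents):
    """Collect the text between "Abstrak" and "Kata kunci" in each document
    and join the results with the GPT-2 end-of-text token."""
    abstracts = []
    for doc in documents:
        match = re.search(r"Abstrak(.*?)Kata kunci", doc, flags=re.DOTALL)
        if match:
            abstracts.append(match.group(1).strip())
    # The <|endoftext|> token marks the boundary between abstracts
    return "<|endoftext|>".join(abstracts)
```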

The distribution of the training and evaluation datasets is as follows:

| split      | count   | percentage |
| ---------- | ------- | ---------- |
| train      | 146,248 | 90%        |
| validation | 16,250  | 10%        |

## Training procedure

The model was trained on a TPUv2-8 VM provided by [TPU Research Cloud](https://sites.research.google/trc/). The training duration was `2h 30m 57s`.

### Evaluation results

The fine-tuned model achieves the following results:

| dataset                                     | train loss | eval loss | eval perplexity |
| ------------------------------------------- | ---------- | --------- | --------------- |
| Indonesian journals dataset (abstract only) | 2.913      | 2.855     | 17.37           |
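
The reported eval perplexity follows directly from the eval loss: for a causal language model, perplexity is the exponential of the mean cross-entropy loss.

```python
import math

# Perplexity of a causal LM is exp(mean cross-entropy loss)
eval_loss = 2.855
eval_perplexity = math.exp(eval_loss)
print(round(eval_perplexity, 2))  # → 17.37
```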

### Tracking

The training process was tracked in [TensorBoard](https://huggingface.co/Galuh/id-journal-gpt2/tensorboard).