Galuh committed
Commit 18f208e
1 Parent(s): b930817

Add model card

Files changed (1): README.md +72 -0
README.md ADDED

---
language: id
widget:
- text: "Penelitian ini bertujuan untuk menentukan identitas invertebrata laut dari Perairan Papua dengan teknik DNA barcoding"
---

# Indonesian GPT-2 fine-tuned on Indonesian academic journals
This is the [Indonesian gpt2-small model](https://huggingface.co/flax-community/gpt2-small-indonesian) fine-tuned on abstracts of Indonesian academic journals. All training was done on a TPUv2-8 VM sponsored by [TPU Research Cloud](https://sites.research.google/trc/).

The demo can be found [here](https://huggingface.co/spaces/flax-community/gpt2-indonesian).

## How to use
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='Galuh/id-journal-gpt2')
>>> set_seed(42)
>>> generator("Penelitian ini menggunakan teknik DNA barcoding untuk", max_length=30, num_return_sequences=5)

[{'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk mendeteksi perubahan genetik bakteri pada udang windu. Empat tahap telah dilakukan, meliputi preparasi media untuk larva,'},
 {'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk identifikasi gen pengasil flavonoid. Data yang diperoleh dari hasil PCR diidentifikasi dengan teknik sekuensing'},
 {'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk mengekstraksi fragmen DNA dari sampel kulit buaya dan tulang anjing, di mana proses ini melibatkan karakterisasi enzim yang'},
 {'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk melakukan transformasi. Tahapan transformasi meliputi seleksi sel dengan urutan (2, 8, 16,..., 18) dan'},
 {'generated_text': 'Penelitian ini menggunakan teknik DNA barcoding untuk amplifikasi genom DNA dengan menggunakan primer TG8226 dan TG806. Metode pol'}]
```

Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('Galuh/id-journal-gpt2')
model = GPT2Model.from_pretrained('Galuh/id-journal-gpt2')

text = "Ubah dengan teks apa saja."  # "Replace with any text."
encoded_input = tokenizer(text, return_tensors='pt')  # tokenize to PyTorch tensors
output = model(**encoded_input)  # output.last_hidden_state holds the features
```

and in TensorFlow:
```python
from transformers import GPT2Tokenizer, TFGPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('Galuh/id-journal-gpt2')
model = TFGPT2Model.from_pretrained('Galuh/id-journal-gpt2')

text = "Ubah dengan teks apa saja."  # "Replace with any text."
encoded_input = tokenizer(text, return_tensors='tf')  # tokenize to TensorFlow tensors
output = model(encoded_input)  # output.last_hidden_state holds the features
```

## Limitations and bias
This model is fine-tuned from the [Indonesian gpt2-small model](https://huggingface.co/flax-community/gpt2-small-indonesian), so it is subject to the same [limitations and bias as the original model](https://huggingface.co/flax-community/gpt2-small-indonesian#limitations-and-bias). A more detailed bias analysis of this specific model is coming soon.

## Training data
The model was trained on a dataset of Indonesian academic journals, using only the abstracts. We extracted the abstracts with a script that takes any text located between the word "Abstrak" (abstract) and the phrase "Kata kunci" (keywords); the extraction script can be found [here](https://github.com/galuhsahid/id-journal-gpt2/). We also inserted an end-of-text token (`<|endoftext|>`) between abstracts to separate them.
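
The extraction step can be sketched roughly as follows. This is a simplified illustration, not the actual script (the regex pattern and input format are assumptions; see the linked repository for the real implementation):
```python
import re

# Assumed pattern: everything between "Abstrak" and "Kata kunci"
ABSTRACT_PATTERN = re.compile(r"Abstrak(.*?)Kata kunci", re.DOTALL | re.IGNORECASE)

def build_training_text(documents):
    """Extract the abstract from each document and join the results
    with an end-of-text token so every abstract is a separate sample."""
    abstracts = []
    for doc in documents:
        match = ABSTRACT_PATTERN.search(doc)
        if match:
            abstracts.append(match.group(1).strip())
    return "<|endoftext|>".join(abstracts)
```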

The distribution of the training and evaluation splits is as follows:

| split      | count   | percentage |
| ---------- | ------- | ---------- |
| train      | 146,248 | 90%        |
| validation | 16,250  | 10%        |
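
A 90/10 split like this can be produced with, for example, the `datasets` library. This is only a sketch: the splitting procedure actually used is not documented here, and the file name is a placeholder:
```python
from datasets import load_dataset

# "abstracts.txt" is a hypothetical file holding the extracted abstracts
dataset = load_dataset("text", data_files={"train": "abstracts.txt"})["train"]
split = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = split["train"], split["test"]
```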

## Training procedure
The model was trained on a TPUv2-8 VM provided by [TPU Research Cloud](https://sites.research.google/trc/). The training duration was `2h 30m 57s`.

### Evaluation results
After fine-tuning, the model achieves the following results:

| dataset                                     | train loss | eval loss | eval perplexity |
| ------------------------------------------- | ---------- | --------- | --------------- |
| Indonesian journals dataset (abstract only) | 2.913      | 2.855     | 17.37           |
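
The reported perplexity is the exponential of the evaluation (cross-entropy) loss, which is easy to verify:
```python
import math

eval_loss = 2.855
perplexity = math.exp(eval_loss)  # ≈ 17.37, matching the table above
```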

### Tracking
The training process was tracked in [TensorBoard](https://huggingface.co/Galuh/id-journal-gpt2/tensorboard).