---
license: mit
tags:
- generated_from_trainer
model-index:
- name: IndoBERT SQuAD
  results: []
widget:
- text: Di daerah mana Ubud berada?
  context: Ubud adalah sebuah desa adat sekaligus menjadi destinasi wisata di daerah kabupaten Gianyar, pulau Bali, Indonesia. Ubud terutama terkenal diantara para wisatawan mancanegara karena terletak di antara sawah dan hutan yang berjurang-jurang yang membuat pemandangan alam sangat indah. Selain itu, Ubud dikenal karena seni dan budaya yang berkembang sangat pesat dan maju.

---


# indobert-squad-trained

This model is a fine-tuned version of [indolem/indobert-base-uncased](https://huggingface.co/indolem/indobert-base-uncased) on the SQuAD2.0 dataset.
It achieves the following results on the evaluation set:
- Loss: 1.8025

## IndoBERT

[IndoBERT](https://huggingface.co/indolem/indobert-base-uncased) is the Indonesian version of the BERT model. The model was trained on over 220M words, aggregated from three main sources:
- Indonesian Wikipedia (74M words)
- news articles from Kompas, Tempo (Tala et al., 2003), and Liputan6 (55M words in total)
- an Indonesian Web Corpus (Medved and Suchomel, 2017) (90M words).

The model was pre-trained for 2.4M steps (180 epochs), reaching a final perplexity of 3.97 on the development set (similar to English BERT-base).

This IndoBERT was used to examine IndoLEM, an Indonesian benchmark that comprises seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.


## Training and evaluation data

SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

| Dataset  | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k      |
| SQuAD2.0 | eval  | 12.3k     |


## Training procedure

The model was trained on a Tesla T4 GPU with 12GB of RAM.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
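
With `lr_scheduler_type: linear`, the learning rate decays linearly from 2e-05 to zero over the total number of training steps. A minimal sketch of that schedule (a warmup of 0 steps is an assumption; the card does not report a warmup value):

```py
def linear_lr(step, base_lr=2e-5, total_steps=24606, warmup_steps=0):
    """Linear schedule: optional linear warmup, then linear decay to zero.

    total_steps comes from the training results below (3 epochs x 8202 steps);
    warmup_steps=0 is an assumption, not a value from the card.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

print(linear_lr(0))       # 2e-05 at the first step
print(linear_lr(12303))   # 1e-05 halfway through
print(linear_lr(24606))   # 0.0 at the final step
```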

### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.4098        | 1.0   | 8202  | 1.3860          |
| 1.1716        | 2.0   | 16404 | 1.8555          |
| 1.2909        | 3.0   | 24606 | 1.8025          |
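
As a quick sanity check, the step counts in the table are consistent with the batch size and train-split size given above:

```py
train_batch_size = 16
steps_per_epoch = 8202   # from the table above
num_epochs = 3

# implied number of training samples (steps x batch size)
implied_train_samples = steps_per_epoch * train_batch_size
total_steps = steps_per_epoch * num_epochs

print(implied_train_samples)  # 131232, consistent with the ~130k train split
print(total_steps)            # 24606, the final step in the table
```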

| Metric | Value     |
| ------ | --------- |
| **EM** | **52.17** |
| **F1** | **69.22** |
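
EM and F1 here are the standard SQuAD answer-span metrics. A simplified sketch of how they are computed per prediction (the official SQuAD evaluation script adds further normalization, e.g. article removal, and special handling of unanswerable questions):

```py
import re
from collections import Counter

def normalize(text):
    # lowercase, replace punctuation with spaces, collapse whitespace
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(pred, gold):
    # 1 if the normalized strings are identical, else 0
    return int(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    # token-overlap F1 between prediction and gold answer
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1920-an", "1920-an"))     # 1
print(f1_score("tahun 1920-an", "1920-an"))  # 0.8
```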


## Pipeline

The model can be used directly with the Transformers `question-answering` pipeline:
```py
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="esakrissa/IndoBERT-SQuAD",
    tokenizer="esakrissa/IndoBERT-SQuAD"
)

qa_pipeline({
    'context': """Sudah sejak tahun 1920-an, Ubud terkenal di antara wisatawan barat. Kala itu pelukis Jerman; Walter Spies dan pelukis Belanda; Rudolf Bonnet menetap di sana. Mereka dibantu oleh Tjokorda Gde Agung Sukawati, dari Puri Agung Ubud. Sekarang karya mereka bisa dilihat di Museum Puri Lukisan, Ubud.""",
    'question': "Sejak kapan Ubud terkenal di antara wisatawan barat?"
})
```
*output:*
```py
{
    'answer': '1920-an',
    'start': 18,
    'end': 25,
    'score': 0.8675463795661926
}
```
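
The `start` and `end` fields in the output are character offsets into the context, so slicing the context string recovers the answer span. A minimal check:

```py
context = "Sudah sejak tahun 1920-an, Ubud terkenal di antara wisatawan barat."

# the pipeline reported start=18, end=25
answer = context[18:25]
print(answer)  # '1920-an'
```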

## GitHub
[Github](https://github.com/esakrissa/question-answering)


## Demo
[IndoBERT SQuAD Demo](https://huggingface.co/spaces/esakrissa/IndoBERT-SQuAD)


### References

<a id="1">[1]</a> Fajri Koto, Afshin Rahimi, Jey Han Lau, and Timothy Baldwin. 2020. IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. In Proceedings of the 28th International Conference on Computational Linguistics (COLING).

<a id="2">[2]</a> rifkybujana/IndoBERT-QA


### Framework versions

- Transformers 4.25.1
- Pytorch 1.13.0+cu116
- Datasets 2.7.1
- Tokenizers 0.13.2