---
language:
- id
tags:
- Twitter
license: apache-2.0
datasets:
- Twitter 2021
widget:
- text: "guweehh udh ga' paham lg sm [MASK]"
---

# IndoBERTweet 🐦 

## 1. Paper
Fajri Koto, Jey Han Lau, and Timothy Baldwin. [_IndoBERTweet: A Pretrained Language Model for Indonesian Twitter
with Effective Domain-Specific Vocabulary Initialization_](https://arxiv.org/pdf/2109.04607.pdf). 
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (**EMNLP 2021**), Dominican Republic (virtual).

## 2. About

[IndoBERTweet](https://github.com/indolem/IndoBERTweet) is the first large-scale pretrained model for Indonesian Twitter
that is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary.

In this paper, we show that initializing domain-specific vocabulary with average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections.
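
The average-pooling initialization can be illustrated with a toy sketch. This is not the authors' implementation: the helper `average_pool_init`, the two-dimensional embedding table, and the fixed tokenizer are all hypothetical stand-ins, used only to show that a new domain-specific token starts from the mean of the embeddings of the subwords the original tokenizer would split it into.

```python
from typing import Callable, Dict, List

Vector = List[float]


def average_pool_init(
    new_token: str,
    tokenize: Callable[[str], List[str]],
    embeddings: Dict[str, Vector],
) -> Vector:
    """Initialize a new token's embedding as the mean of its subword embeddings."""
    pieces = tokenize(new_token)                # subwords from the ORIGINAL vocabulary
    vectors = [embeddings[p] for p in pieces]   # their pretrained embeddings
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy stand-ins (hypothetical): a 2-dimensional embedding table and a fixed tokenizer.
toy_embeddings = {"ga": [1.0, 0.0], "##k": [0.0, 1.0]}
toy_tokenize = lambda word: {"gak": ["ga", "##k"]}[word]

# The new token "gak" starts from the average of "ga" and "##k".
new_vector = average_pool_init("gak", toy_tokenize, toy_embeddings)
```

In a real setting the embedding table and tokenizer would come from the pretrained Indonesian BERT model, and the resulting vectors would be appended to its embedding matrix before further pretraining.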

## 3. Pretraining Data

We crawl Indonesian tweets over a 1-year period using the official Twitter API, from December 2019 to December 2020, with 60 keywords covering 4 main topics: economy, health, education, and government. We obtain a total of **409M word tokens**, two times larger than the training data used to pretrain [IndoBERT](https://aclanthology.org/2020.coling-main.66.pdf). Due to Twitter policy, this pretraining data will not be released to the public.

## 4. How to use

Load the model and tokenizer (tested with transformers==3.5.1):
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")
```
**Preprocessing Steps:**
* lower-case all words
* convert user mentions and URLs into @USER and HTTPURL, respectively
* translate emoticons into text using the [emoji package](https://pypi.org/project/emoji/).
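
The steps above can be sketched as a single function. This is a minimal sketch rather than the authors' script: the function name `preprocess_tweet`, the two regular expressions, and the order of the steps are assumptions. Only `emoji.demojize` comes from the emoji package, and that step is skipped if the package is not installed.

```python
import re

try:
    import emoji  # third-party: pip install emoji
except ImportError:  # keep the sketch runnable without the package
    emoji = None

# Assumed patterns; the exact matching rules used for pretraining are not stated here.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
USER_RE = re.compile(r"@\w+")


def preprocess_tweet(text: str) -> str:
    """Lower-case a tweet, mask URLs and mentions, and spell out emoji."""
    text = text.lower()                  # lower-case first so the placeholders stay intact
    text = URL_RE.sub("HTTPURL", text)   # URLs -> HTTPURL
    text = USER_RE.sub("@USER", text)    # user mentions -> @USER
    if emoji is not None:
        text = emoji.demojize(text)      # emoji -> text names such as :bird:
    return text


cleaned = preprocess_tweet("Halo @Budi cek https://t.co/abc")
```

The cleaned string can then be passed to the tokenizer loaded above.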

## 5. Results over 7 Indonesian Twitter Datasets

<table>
  <col>
  <colgroup span="2"></colgroup>
  <colgroup span="2"></colgroup>
  <tr>
    <th rowspan="2">Models</th>
    <th colspan="2" scope="colgroup">Sentiment</th>
    <th colspan="1" scope="colgroup">Emotion</th>
    <th colspan="2" scope="colgroup">Hate Speech</th>
    <th colspan="2" scope="colgroup">NER</th>
    <th rowspan="2" scope="colgroup">Average</th>
  </tr>
  <tr>
    <th scope="col">IndoLEM</th>
    <th scope="col">SmSA</th>
    <th scope="col">EmoT</th>
    <th scope="col">HS1</th>
    <th scope="col">HS2</th>
    <th scope="col">Formal</th>
    <th scope="col">Informal</th>
  </tr>
  <tr>
    <td scope="row">mBERT</td>
    <td>76.6</td>
    <td>84.7</td>
    <td>67.5</td>
    <td>85.1</td>
    <td>75.1</td>
    <td>85.2</td>
    <td>83.2</td>
    <td>79.6</td>
  </tr>
  <tr>
    <td scope="row">malayBERT</td>
    <td>82.0</td>
    <td>84.1</td>
    <td>74.2</td>
    <td>85.0</td>
    <td>81.9</td>
    <td>81.9</td>
    <td>81.3</td>
    <td>81.5</td>
  </tr>
  <tr>
    <td scope="row">IndoBERT (Willie, et al., 2020)</td>
    <td>84.1</td>
    <td>88.7</td>
    <td>73.3</td>
    <td>86.8</td>
    <td>80.4</td>
    <td>86.3</td>
    <td>84.3</td>
    <td>83.4</td>
  </tr>
  <tr>
    <td scope="row">IndoBERT (Koto, et al., 2020)</td>
    <td>84.1</td>
    <td>87.9</td>
    <td>71.0</td>
    <td>86.4</td>
    <td>79.3</td>
    <td>88.0</td>
    <td><b>86.9</b></td>
    <td>83.4</td>
  </tr>
  <tr>
    <td scope="row">IndoBERTweet (1M steps from scratch)</td>
    <td>86.2</td>
    <td>90.4</td>
    <td>76.0</td>
    <td><b>88.8</b></td>
    <td><b>87.5</b></td>
    <td><b>88.1</b></td>
    <td>85.4</td>
    <td>86.1</td>
  </tr>
  <tr>
    <td scope="row">IndoBERT + Voc adaptation + 200k steps</td>
    <td><b>86.6</b></td>
    <td><b>92.7</b></td>
    <td><b>79.0</b></td>
    <td>88.4</td>
    <td>84.0</td>
    <td>87.7</td>
    <td><b>86.9</b></td>
    <td><b>86.5</b></td>
  </tr>
</table>

## 6. Citation
If you use our work, please cite:

```bibtex
@inproceedings{koto2021indobertweet,
  title={IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization},
  author={Fajri Koto and Jey Han Lau and Timothy Baldwin},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)},
  year={2021}
}
```