File size: 12,445 Bytes
2626c53 296a0a9 65a4e18 9b5ca3c 65a4e18 9b5ca3c 65a4e18 9b5ca3c 65a4e18 9b5ca3c 65a4e18 9b5ca3c 65a4e18 9b5ca3c 65a4e18 296a0a9 9b5ca3c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 |
---
language: th
---
# BERT-th
Adapted from https://github.com/ThAIKeras/bert for HuggingFace/Transformers library
## Pre-tokenization
You must run the original ThaiTokenizer to have your tokenization match that of the original model.
If you skip this step, you will not do much better than
mBERT or random chance!
[Refer to this CoLab notebook](https://colab.research.google.com/drive/1Ax9OsbTPwBBP1pJx1DkYwtgKILcj3Ur5?usp=sharing)
or follow these steps:
```bash
pip install pythainlp six sentencepiece python-crfsuite
git clone https://github.com/ThAIKeras/bert
# download .vocab and .model files from ThAIKeras/bert > Tokenization section
```
Or from [.vocab](https://raw.githubusercontent.com/jitkapat/thaipostagger/master/th.wiki.bpe.op25000.vocab)
and [.model](https://raw.githubusercontent.com/jitkapat/thaipostagger/master/th.wiki.bpe.op25000.model) links.
Then set up ThaiTokenizer class - this is modified slightly to
remove a TensorFlow dependency.
```python
import collections
import unicodedata
import six
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
vocab = collections.OrderedDict()
index = 0
with open(vocab_file, "r") as reader:
while True:
token = reader.readline()
if token.split(): token = token.split()[0] # to support SentencePiece vocab file
token = convert_to_unicode(token)
if not token:
break
token = token.strip()
vocab[token] = index
index += 1
return vocab
#####
from bert.bpe_helper import BPE
import sentencepiece as spm
def convert_by_vocab(vocab, items):
output = []
for item in items:
output.append(vocab[item])
return output
class ThaiTokenizer(object):
"""Tokenizes Thai texts."""
def __init__(self, vocab_file, spm_file):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.bpe = BPE(vocab_file)
self.s = spm.SentencePieceProcessor()
self.s.Load(spm_file)
def tokenize(self, text):
bpe_tokens = self.bpe.encode(text).split(' ')
spm_tokens = self.s.EncodeAsPieces(text)
tokens = bpe_tokens if len(bpe_tokens) < len(spm_tokens) else spm_tokens
split_tokens = []
for token in tokens:
new_token = token
if token.startswith('_') and not token in self.vocab:
split_tokens.append('_')
new_token = token[1:]
if not new_token in self.vocab:
split_tokens.append('<unk>')
else:
split_tokens.append(new_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
```
Then pre-tokenizing your own text:
```python
from pythainlp import sent_tokenize
tokenizer = ThaiTokenizer(vocab_file='th.wiki.bpe.op25000.vocab', spm_file='th.wiki.bpe.op25000.model')
txt = "กรุงเทพมหานครเป็นเขตปกครองพิเศษของประเทศไทย มิได้มีสถานะเป็นจังหวัด คำว่า \"กรุงเทพมหานคร\" นั้นยังใช้เรียกองค์กรปกครองส่วนท้องถิ่นของกรุงเทพมหานครอีกด้วย"
split_sentences = sent_tokenize(txt)
print(split_sentences)
"""
['กรุงเทพมหานครเป็นเขตปกครองพิเศษของประเทศไทย ',
'มิได้มีสถานะเป็นจังหวัด ',
'คำว่า "กรุงเทพมหานคร" นั้นยังใช้เรียกองค์กรปกครองส่วนท้องถิ่นของกรุงเทพมหานครอีกด้วย']
"""
split_words = ' '.join(tokenizer.tokenize(' '.join(split_sentences)))
print(split_words)
"""
'▁กรุงเทพมหานคร เป็นเขต ปกครอง พิเศษ ของประเทศไทย ▁มิ ได้มี สถานะเป็น จังหวัด ▁คําว่า ▁" กรุงเทพมหานคร " ▁นั้น...' # continues
"""
```
Original README follows:
---
Google's [**BERT**](https://github.com/google-research/bert) is currently the state-of-the-art method of pre-training text representations which additionally provides multilingual models. ~~Unfortunately, Thai is the only one in 103 languages that is excluded due to difficulties in word segmentation.~~
BERT-th presents the Thai-only pre-trained model based on the BERT-Base structure. It is now available to download.
* **[`BERT-Base, Thai`](https://drive.google.com/open?id=1J3uuXZr_Se_XIFHj7zlTJ-C9wzI9W_ot)**: BERT-Base architecture, Thai-only model
BERT-th also includes relevant codes and scripts along with the pre-trained model, all of which are the modified versions of those in the original BERT project.
## Preprocessing
### Data Source
Training data for BERT-th come from [the latest article dump of Thai Wikipedia](https://dumps.wikimedia.org/thwiki/latest/thwiki-latest-pages-articles.xml.bz2) on November 2, 2018. The raw texts are extracted by using [WikiExtractor](https://github.com/attardi/wikiextractor).
### Sentence Segmentation
Input data need to be segmented into separate sentences before further processing by BERT modules. Since Thai language has no explicit marker at the end of a sentence, it is quite problematic to pinpoint sentence boundaries. To the best of our knowledge, there is still no implementation of Thai sentence segmentation elsewhere. So, in this project, sentence segmentation is done by applying simple heuristics, considering spaces, sentence length and common conjunctions.
After preprocessing, the training corpus consists of approximately 2 million sentences and 40 million words (counting words after word segmentation by [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp)). The plain and segmented texts can be downloaded **[`here`](https://drive.google.com/file/d/1QZSOpikO6Qc02gRmyeb_UiRLtTmUwGz1/view?usp=sharing)**.
## Tokenization
BERT uses [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) as a tokenization mechanism. But it is Google internal, we cannot apply existing Thai word segmentation and then utilize WordPiece to learn the set of subword units. The best alternative is [SentencePiece](https://github.com/google/sentencepiece) which implements [BPE](https://arxiv.org/abs/1508.07909) and needs no word segmentation.
In this project, we adopt a pre-trained Thai SentencePiece model from [BPEmb](https://github.com/bheinzerling/bpemb). The model of 25000 vocabularies is chosen and the vocabulary file has to be augmented with BERT's special characters, including '[PAD]', '[CLS]', '[SEP]' and '[MASK]'. The model and vocabulary files can be downloaded **[`here`](https://drive.google.com/file/d/1F7pCgt3vPlarI9RxKtOZUrC_67KMNQ1W/view?usp=sharing)**.
`SentencePiece` and `bpe_helper.py` from BPEmb are both used to tokenize data. `ThaiTokenizer class` has been added to BERT's `tokenization.py` for tokenizing Thai texts.
## Pre-training
The data can be prepared before pre-training by using this script.
```shell
export BPE_DIR=/path/to/bpe
export TEXT_DIR=/path/to/text
export DATA_DIR=/path/to/data
python create_pretraining_data.py \
--input_file=$TEXT_DIR/thaiwikitext_sentseg \
--output_file=$DATA_DIR/tf_examples.tfrecord \
--vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5 \
--thai_text=True \
--spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
```
Then, the following script can be run to learn a model from scratch.
```shell
export DATA_DIR=/path/to/data
export BERT_BASE_DIR=/path/to/bert_base
python run_pretraining.py \
--input_file=$DATA_DIR/tf_examples.tfrecord \
--output_dir=$BERT_BASE_DIR \
--do_train=True \
--do_eval=True \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=1000000 \
--num_warmup_steps=100000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=200000
```
We have trained the model for 1 million steps. On Tesla K80 GPU, it took around 20 days to complete. Though, we provide a snapshot at 0.8 million steps because it yields better results for downstream classification tasks.
## Downstream Classification Tasks
### XNLI
[XNLI](http://www.nyu.edu/projects/bowman/xnli/) is a dataset for evaluating a cross-lingual inferential classification task. The development and test sets contain 15 languages which data are thoroughly edited. The machine-translated versions of training data are also provided.
The Thai-only pre-trained BERT model can be applied to the XNLI task by using training data which are translated to Thai. Spaces between words in the training data need to be removed to make them consistent with inputs in the pre-training step. The processed files of XNLI related to Thai language can be downloaded **[`here`](https://drive.google.com/file/d/1ZAk1JfR6a0TSCkeyQ-EkRtk1w_mQDWFG/view?usp=sharing)**.
Afterwards, the XNLI task can be learned by using this script.
```shell
export BPE_DIR=/path/to/bpe
export XNLI_DIR=/path/to/xnli
export OUTPUT_DIR=/path/to/output
export BERT_BASE_DIR=/path/to/bert_base
python run_classifier.py \
--task_name=XNLI \
--do_train=true \
--do_eval=true \
--data_dir=$XNLI_DIR \
--vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--output_dir=$OUTPUT_DIR \
--xnli_language=th \
--spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
```
This table compares the Thai-only model with XNLI baselines and the Multilingual Cased model which is also trained by using translated data.
<!-- Use html table because github markdown doesn't support colspan -->
<table>
<tr>
<td colspan="2" align="center"><b>XNLI Baseline</b></td>
<td colspan="2" align="center"><b>BERT</b></td>
</tr>
<tr>
<td align="center">Translate Train</td>
<td align="center">Translate Test</td>
<td align="center">Multilingual Model</td>
<td align="center">Thai-only Model</td>
</tr>
<td align="center">62.8</td>
<td align="center">64.4</td>
<td align="center">66.1</td>
<td align="center"><b>68.9</b></td>
</table>
### Wongnai Review Dataset
Wongnai Review Dataset collects restaurant reviews and ratings from [Wongnai](https://www.wongnai.com/) website. The task is to classify a review into one of five ratings (1 to 5 stars). The dataset can be downloaded **[`here`](https://github.com/wongnai/wongnai-corpus)** and the following script can be run to use the Thai-only model for this task.
```shell
export BPE_DIR=/path/to/bpe
export WONGNAI_DIR=/path/to/wongnai
export OUTPUT_DIR=/path/to/output
export BERT_BASE_DIR=/path/to/bert_base
python run_classifier.py \
--task_name=wongnai \
--do_train=true \
--do_predict=true \
--data_dir=$WONGNAI_DIR \
--vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--output_dir=$OUTPUT_DIR \
--spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
```
Without additional preprocessing and further fine-tuning, the Thai-only BERT model can achieve 0.56612 and 0.57057 for public and private test-set scores respectively. |