metadata
language: th
license: cc-by-sa-4.0
tags:
  - word segmentation
datasets:
  - best2010
  - lst20
  - tlc
  - vistec-tp-th-2021
  - wisesight_sentiment
pipeline_tag: token-classification

Multi-criteria BERT base Thai with Lattice for Word Segmentation

This is a variant of the pre-trained BERT model. Starting from bert-base-multilingual-cased, the model was pre-trained on Thai text and fine-tuned for word segmentation. This version processes input text at the character level and incorporates word-level information through a lattice structure.
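Word segmentation over character-level input is commonly framed as tagging each character with its position inside a word. The sketch below illustrates that framing with a B/I/E/S scheme (begin, inside, end, single). The tag set and the helper names `words_to_tags` and `tags_to_words` are assumptions made for illustration; this card does not state which label scheme the model actually uses.

```python
def words_to_tags(words):
    """Turn a segmented sentence (list of words) into per-character tags.

    Assumed scheme: B = first char of a word, I = inside, E = last char,
    S = single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # flush an unterminated trailing word
        words.append(current)
    return words


# The string "ตากลม" can be segmented in two ways.
print(words_to_tags(["ตา", "กลม"]))   # ['B', 'E', 'B', 'I', 'E']
print(words_to_tags(["ตาก", "ลม"]))   # ['B', 'I', 'E', 'B', 'E']
print(tags_to_words("ตากลม", ["B", "E", "B", "I", "E"]))  # ['ตา', 'กลม']
```

The two tag sequences for the same string show why the segmenter has to resolve ambiguity from context rather than from the characters alone.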

The scripts for the pre-training are available at tchayintr/latte-ptm-ws.

The LATTE scripts are available at tchayintr/latte-ws.

Model architecture

The model architecture is described in this paper.
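The paper remains the authoritative description. As a rough illustration of the general idea behind a character-word lattice, the sketch below collects every dictionary word that matches a span of the input, then lists the candidate words covering each character. The function names, the toy vocabulary, and the `max_len` limit are hypothetical, and the code is not taken from the LATTE scripts.

```python
def build_lattice(text, vocab, max_len=4):
    """Return (start, end, word) for every vocabulary word found in text.

    Spans use character offsets with an exclusive end, so text[start:end] == word.
    """
    spans = []
    for start in range(len(text)):
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            candidate = text[start:end]
            if candidate in vocab:
                spans.append((start, end, candidate))
    return spans


def words_covering(spans, n_chars):
    """For each character position, list the candidate words that cover it."""
    covering = [[] for _ in range(n_chars)]
    for start, end, word in spans:
        for i in range(start, end):
            covering[i].append(word)
    return covering


text = "ตากลม"
toy_vocab = {"ตา", "ตาก", "กลม", "ลม"}  # hypothetical dictionary

spans = build_lattice(text, toy_vocab)
print(spans)
# [(0, 2, 'ตา'), (0, 3, 'ตาก'), (2, 5, 'กลม'), (3, 5, 'ลม')]

for ch, words in zip(text, words_covering(spans, len(text))):
    print(ch, words)
```

In this toy example the third character is covered by both "ตาก" and "กลม", which is the kind of competing word-level evidence a lattice exposes to a character-level model.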

Training Data

The model was trained on multiple Thai word-segmented datasets: best2010, lst20, tlc (tnhc), vistec-tp-th-2021 (vistec2021), and wisesight_sentiment (ws160). The datasets can be accessed as follows:

Licenses

The pre-trained model is distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).

Acknowledgments

This model was trained on GPU servers provided by the Okumura-Funakoshi NLP Group.