yacht committed on
Commit
c281753
1 Parent(s): 283eea5

add model, tokenizer files, and model card

README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ language: th
+ license: cc-by-sa-4.0
+ tags:
+ - word segmentation
+ datasets:
+ - best2010
+ - lst20
+ - tlc
+ - vistec-tp-th-2021
+ - wisesight_sentiment
+ pipeline_tag: token-classification
+ ---
+
+ # Multi-criteria BERT base Thai with Lattice for Word Segmentation
+
+ This is a variant of the pre-trained [BERT](https://github.com/google-research/bert) model.
+ The model was pre-trained on Thai-language texts and fine-tuned for word segmentation, starting from [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased).
+ This version of the model processes input texts at the character level, with word-level information incorporated through a lattice structure.
+
+ The scripts for the pre-training are available at [tchayintr/latte-ptm-ws](https://github.com/tchayintr/latte-ptm-ws).
+
+ ## Model architecture
+
+ The model architecture is described in this [paper](https://www.jstage.jst.go.jp/article/jnlp/30/2/30_456/_article/-char/ja).
+
+ ## Training Data
+
+ The model is trained on multiple Thai word-segmented datasets: best2010, lst20, tlc (tnhc), vistec-tp-th-2021 (vistec2021), and wisesight_sentiment (ws160).
+ The datasets can be accessed as follows:
+ - [best2010](https://thailang.nectec.or.th)
+ - [lst20](https://huggingface.co/datasets/lst20)
+ - [tlc](https://huggingface.co/datasets/tlc)
+ - [vistec-tp-th-2021](https://github.com/mrpeerat/OSKut/tree/main/VISTEC-TP-TH-2021)
+ - [wisesight_sentiment](https://huggingface.co/datasets/wisesight_sentiment)
+
+ ## Licenses
+
+ The pre-trained model is distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/).
+
+ ## Acknowledgments
+
+ This model was trained on GPU servers provided by the [Okumura-Funakoshi NLP Group](https://lr-www.pi.titech.ac.jp).
added_tokens.json ADDED
The diff for this file is too large to render. See raw diff
 
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+ "_name_or_path": "data/ptm/bert-base-multilingual-cased",
+ "architectures": [
+ "BertModel"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "classifier_dropout": null,
+ "directionality": "bidi",
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "model_type": "bert",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "output_hidden_states": true,
+ "pad_token_id": 0,
+ "pooler_fc_size": 768,
+ "pooler_num_attention_heads": 12,
+ "pooler_num_fc_layers": 3,
+ "pooler_size_per_head": 128,
+ "pooler_type": "first_token_transform",
+ "position_embedding_type": "absolute",
+ "torch_dtype": "float32",
+ "transformers_version": "4.15.0",
+ "type_vocab_size": 2,
+ "use_cache": true,
+ "vocab_size": 277964
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cf4bf78eceb5bcf685db96e138942f82f7b5ee344ea12a5f09b345e73df2b0e1
+ size 1198149361
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "additional_special_tokens": ["[UNC]", "[LEN]", "[LPP]", "[BEST2010]", "[LST20]", "[TNHC]", "[VISTEC-TPTH2021]", "[WS160]"]}
tokenizer.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d7c818a62877090642e2ae669cab30a90a0ceaa2b0f42a3c113a128c5c17e917
+ size 15059005
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": "data/ptm/bert-base-multilingual-cased/tokenizer.json", "name_or_path": "data/ptm/bert-base-multilingual-cased", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff