lalital committed on
Commit abe46f3
1 Parent(s): 7d568ae

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -57,11 +57,11 @@ You can use the pretrained model for masked language modeling (i.e. predicting a
 
 - `thainer`
 
-Named-entity recognition tagging with 13 named-entities as descibed in this [page](https://huggingface.co/datasets/thainer).
+Named-entity recognition tagging with 13 named-entities as described in this [page](https://huggingface.co/datasets/thainer).
 
 - `lst20` : NER NER and POS tagging
 
-Named-entity recognition tagging with 10 named-entities and Part-of-Speech tagging with 16 tags as descibed in this [page](https://huggingface.co/datasets/lst20).
+Named-entity recognition tagging with 10 named-entities and Part-of-Speech tagging with 16 tags as described in this [page](https://huggingface.co/datasets/lst20).
 
 <br>
 
@@ -105,7 +105,7 @@ Regarding the masking procedure, for each sequence, we sampled 15% of the tokens
 
 **Train/Val/Test splits**
 
-After preprocessing and deduplication, we have a training set of 381,034,638 unique,mostly Thai sentences with sequence length of 5 to 300 words (78.5GB). The training set has a total of 16,957,775,412 words as tokenized by dictionary-based maximal matching [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU), 8,680,485,067 subwords astokenized by SentencePiece tokenizer, and 53,035,823,287 characters.
+After preprocessing and deduplication, we have a training set of 381,034,638 unique, mostly Thai sentences with sequence length of 5 to 300 words (78.5GB). The training set has a total of 16,957,775,412 words as tokenized by dictionary-based maximal matching [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU), 8,680,485,067 subwords as tokenized by SentencePiece tokenizer, and 53,035,823,287 characters.
 <br>
 
 **Pretraining**
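The first hunk's context line notes that the pretrained model can be used for masked language modeling. Below is a minimal sketch of that usage with the Hugging Face Transformers `fill-mask` pipeline; the model id is an assumption, since this excerpt does not name the repository, so substitute the checkpoint this README actually documents.

```python
# Minimal sketch: masked language modeling with the pretrained model.
# NOTE: the model id below is an assumption; replace it with the checkpoint
# this README actually describes.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="airesearch/wangchanberta-base-att-spm-uncased")

# Build an input containing the tokenizer's own mask token and print candidate fills.
text = f"ผมชอบกิน{fill_mask.tokenizer.mask_token}มาก"
for prediction in fill_mask(text):
    print(prediction["token_str"], prediction["score"])
```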
 
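The train/val/test split paragraph counts words with dictionary-based maximal matching [Phatthiyaphaibun et al., 2020] and subwords with a SentencePiece tokenizer. The sketch below shows how such per-sentence counts could be reproduced, assuming PyThaiNLP's `newmm` engine as the dictionary-based maximal-matching tokenizer; the file name `sentencepiece.bpe.model` is a placeholder, not a path taken from this README.

```python
# Sketch of the per-sentence statistics behind the corpus figures above.
# Assumptions: PyThaiNLP's "newmm" engine stands in for the dictionary-based
# maximal matching tokenizer, and "sentencepiece.bpe.model" is a placeholder
# for whatever SentencePiece model the tokenizer actually uses.
from pythainlp.tokenize import word_tokenize
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")  # placeholder path

def sentence_stats(sentence: str) -> dict:
    """Return word, subword, and character counts for a single sentence."""
    words = word_tokenize(sentence, engine="newmm")   # dictionary-based maximal matching
    subwords = sp.encode(sentence, out_type=str)      # SentencePiece subword pieces
    return {"words": len(words), "subwords": len(subwords), "chars": len(sentence)}

print(sentence_stats("ภาษาไทยเป็นภาษาที่มีเอกลักษณ์"))
```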