lalital committed on
Commit abe46f3
1 Parent(s): 7d568ae

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -57,11 +57,11 @@ You can use the pretrained model for masked language modeling (i.e. predicting a
 
 - `thainer`
 
-Named-entity recognition tagging with 13 named-entities as descibed in this [page](https://huggingface.co/datasets/thainer).
+Named-entity recognition tagging with 13 named-entities as described in this [page](https://huggingface.co/datasets/thainer).
 
 - `lst20` : NER NER and POS tagging
 
-Named-entity recognition tagging with 10 named-entities and Part-of-Speech tagging with 16 tags as descibed in this [page](https://huggingface.co/datasets/lst20).
+Named-entity recognition tagging with 10 named-entities and Part-of-Speech tagging with 16 tags as described in this [page](https://huggingface.co/datasets/lst20).
 
 <br>
 
@@ -105,7 +105,7 @@ Regarding the masking procedure, for each sequence, we sampled 15% of the tokens
 
 **Train/Val/Test splits**
 
-After preprocessing and deduplication, we have a training set of 381,034,638 unique,mostly Thai sentences with sequence length of 5 to 300 words (78.5GB). The training set has a total of 16,957,775,412 words as tokenized by dictionary-based maximal matching [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU), 8,680,485,067 subwords astokenized by SentencePiece tokenizer, and 53,035,823,287 characters.
+After preprocessing and deduplication, we have a training set of 381,034,638 unique, mostly Thai sentences with sequence length of 5 to 300 words (78.5GB). The training set has a total of 16,957,775,412 words as tokenized by dictionary-based maximal matching [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU), 8,680,485,067 subwords as tokenized by SentencePiece tokenizer, and 53,035,823,287 characters.
 <br>
 
 **Pretraining**
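The first hunk's context line notes that the pretrained model can be used for masked language modeling. Below is a minimal sketch of that usage with the Hugging Face Transformers `fill-mask` pipeline; the model id is an assumption, since this excerpt does not name the repository, so substitute the checkpoint this README actually documents.

```python
# Minimal sketch: masked language modeling with the pretrained model.
# NOTE: the model id below is an assumption; replace it with the checkpoint
# this README actually describes.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="airesearch/wangchanberta-base-att-spm-uncased")

# Build an input containing the tokenizer's own mask token and print candidate fills.
text = f"ผมชอบกิน{fill_mask.tokenizer.mask_token}มาก"
for prediction in fill_mask(text):
    print(prediction["token_str"], prediction["score"])
```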
 
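The train/val/test split paragraph counts words with dictionary-based maximal matching [Phatthiyaphaibun et al., 2020] and subwords with a SentencePiece tokenizer. The sketch below shows how such per-sentence counts could be reproduced, assuming PyThaiNLP's `newmm` engine as the dictionary-based maximal-matching tokenizer; the file name `sentencepiece.bpe.model` is a placeholder, not a path taken from this README.

```python
# Sketch of the per-sentence statistics behind the corpus figures above.
# Assumptions: PyThaiNLP's "newmm" engine stands in for the dictionary-based
# maximal matching tokenizer, and "sentencepiece.bpe.model" is a placeholder
# for whatever SentencePiece model the tokenizer actually uses.
from pythainlp.tokenize import word_tokenize
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")  # placeholder path

def sentence_stats(sentence: str) -> dict:
    """Return word, subword, and character counts for a single sentence."""
    words = word_tokenize(sentence, engine="newmm")   # dictionary-based maximal matching
    subwords = sp.encode(sentence, out_type=str)      # SentencePiece subword pieces
    return {"words": len(words), "subwords": len(subwords), "chars": len(sentence)}

print(sentence_stats("ภาษาไทยเป็นภาษาที่มีเอกลักษณ์"))
```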