airesearch
/

wangchanberta-base-wiki-spm

lalital committed on Jan 21, 2021

Commit

f012b3e

1 Parent(s): 0bdbe15

Add model card

Files changed (1)

README.md +95 -1

@@ -1 +1,95 @@
-## wangchanberta-base-wiki-spm

+# `wangchanberta-base-wiki-spm`
+A RoBERTa BASE model pretrained on the Thai Wikipedia corpus.
+<br>
+## Model description
+<br>
+The architecture of the pretrained model is based on RoBERTa [[Liu et al., 2019]](https://arxiv.org/abs/1907.11692).
+<br>
+## Intended uses & limitations
+<br>
+You can use the pretrained model for masked language modeling (i.e. predicting a mask token in the input text). We also provide finetuned models for multiclass/multilabel text classification and token classification tasks.
+<br>
+**Multiclass text classification**
+-  `wisesight_sentiment`
+   4-class text classification task (`positive`, `neutral`, `negative`, and `question`) based on social media posts and tweets.
+-  `wongnai_reviews`
+     Users' review rating classification task (ratings on a scale from 1 to 5).
+-  `generated_reviews_enth` : (`review_star` as label)
+     Generated users' review rating classification task (ratings on a scale from 1 to 5).
+**Multilabel text classification**
+-  `prachathai67k`
+    Thai topic classification with 12 labels based on a news article corpus from prachathai.com. Details are given on this [page](https://huggingface.co/datasets/prachathai67k).
+**Token classification**
+-  `thainer`
+    Named-entity recognition tagging with 13 named entities as described on this [page](https://huggingface.co/datasets/thainer).
+-  `lst20` : NER and POS tagging
+     Named-entity recognition tagging with 10 named entities and part-of-speech tagging with 16 tags as described on this [page](https://huggingface.co/datasets/lst20).
+<br>
+## How to use
+<br>
+<br>
+## Training data
+The `wangchanberta-base-wiki-spm` model was pretrained on Thai Wikipedia. Specifically, we use the articles from the Wikipedia dump of 20 August 2020 (dumps.wikimedia.org/thwiki/20200820/). We exclude lists and tables.
+### Preprocessing
+Texts are preprocessed with the following rules:
+- Replace non-breaking space, zero-width non-breaking space, and soft hyphen with spaces.
+- Remove empty parentheses that occur right after the title in the first paragraph.
+- Replace spaces with `<_>`.
+<br>
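The three rules above can be sketched in a few lines of Python. This is only an illustration and not the authors' preprocessing code: the function name `preprocess` is hypothetical, and the sketch assumes that an "empty parenthesis" means `(` and `)` with nothing but whitespace between them and that the first such pair in the text is the one following the title.

```python
import re

# Characters named in the rules: non-breaking space (U+00A0),
# zero-width non-breaking space (U+FEFF), and soft hyphen (U+00AD).
_TO_SPACE = str.maketrans({"\u00a0": " ", "\ufeff": " ", "\u00ad": " "})

# Assumption: an empty parenthesis is "(" and ")" separated only by whitespace.
_EMPTY_PARENS = re.compile(r"\s*\(\s*\)")


def preprocess(text: str) -> str:
    """Apply the three rules in order (hypothetical helper, not the original code)."""
    # Rule 1: map the special characters to plain spaces.
    text = text.translate(_TO_SPACE)
    # Rule 2: drop the first empty parenthesis pair, assumed to follow the title.
    text = _EMPTY_PARENS.sub("", text, count=1)
    # Rule 3: make spaces explicit with the <_> marker.
    return text.replace(" ", "<_>")


if __name__ == "__main__":
    print(preprocess("Bangkok\u00a0() is a city"))
```

Applying the space replacement last means that the spaces produced by the first rule are also turned into `<_>`; whether the original pipeline used the same order is not stated in the text.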
+For the vocabulary, we use subword tokens trained with the [SentencePiece](https://github.com/google/sentencepiece) library on the training set of the Thai Wikipedia corpus. The total number of subword tokens is 24,000.
+We sample sentences contiguously so that each sequence has a length of at most 512 tokens. When a sentence overlaps the 512-token boundary, we split that sentence, using an additional token as a document separator. This is the same approach as proposed by [[Liu et al., 2019]](https://arxiv.org/abs/1907.11692) (called "FULL-SENTENCES").
+The details of the masking procedure for each sentence are the following:
+**Train/Val/Test splits**
+We split the sentences sequentially into 944,782 sentences for the training set, 24,863 sentences for the validation set, and 24,862 sentences for the test set.
+<br>
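A sequential split of this kind takes the first block of sentences for training, the next block for validation, and the remainder for testing, without shuffling. The sketch below illustrates that idea with the sizes quoted above; the function name `sequential_split` is hypothetical, and a `range` stands in for the real list of sentences.

```python
def sequential_split(items, n_train, n_val, n_test):
    """Split a sequence into contiguous train/val/test parts, preserving order."""
    if n_train + n_val + n_test != len(items):
        raise ValueError("split sizes must add up to the number of items")
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Placeholder data: integer indices instead of actual sentences.
    sentences = range(944_782 + 24_863 + 24_862)
    train, val, test = sequential_split(sentences, 944_782, 24_863, 24_862)
    print(len(train), len(val), len(test))
```

Because the split follows corpus order, the validation and test sets come from the tail of the corpus rather than from a random sample.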