---
language: lo
tags:
- lao-roberta-base-pos-tagger
license: mit
widget:
- text: "ຮ້ອງ ມ່ວນ ແທ້ ສຽງດີ ອິຫຼີ"
---

## Lao RoBERTa Base POS Tagger

Lao RoBERTa Base POS Tagger is a part-of-speech token-classification model based on the [RoBERTa](https://arxiv.org/abs/1907.11692) architecture. It starts from the pre-trained [Lao RoBERTa Base](https://huggingface.co/w11wo/lao-roberta-base) model, which was then fine-tuned on the [`Yunshan Cup 2020`](https://github.com/GKLMIP/Yunshan-Cup-2020) dataset, a POS-tag-labelled Lao corpus.

After training, the model achieved an evaluation accuracy of 83.14%. On the benchmark test set, it achieved an accuracy of 83.30%.

Hugging Face's `Trainer` class from the [Transformers](https://huggingface.co/transformers) library was used to train the model. PyTorch was used as the backend framework during training, but the model remains compatible with other frameworks nonetheless.
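
For illustration only, a minimal sketch of this starting point is shown below: a fresh token-classification head is placed on top of the pre-trained base model. The label list here is a hypothetical placeholder, not the actual `Yunshan Cup 2020` tag set.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# hypothetical tag set for illustration; the real Yunshan Cup 2020 labels may differ
labels = ["N", "V", "ADJ", "ADV", "PRO", "PRE", "CON", "PUNCT"]

tokenizer = AutoTokenizer.from_pretrained("w11wo/lao-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "w11wo/lao-roberta-base",
    num_labels=len(labels),
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
)
```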

## Model

| Model                         | #params | Arch.        | Training/Validation data (text) |
| ----------------------------- | ------- | ------------ | ------------------------------- |
| `lao-roberta-base-pos-tagger` | 124M    | RoBERTa Base | `Yunshan Cup 2020`              |

## Evaluation Results

The model was trained for 15 epochs with a batch size of 8 and a learning rate of 5e-5, annealed to 0 on a cosine schedule. The best checkpoint was loaded at the end of training.
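
A sketch of how these hyperparameters might map onto `TrainingArguments` follows; it is not the original training script, and `train_dataset`, `eval_dataset`, and the output directory are placeholders for tokenized, label-aligned splits of the `Yunshan Cup 2020` data (with `model` and `tokenizer` as set up in the earlier sketch).

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="lao-roberta-base-pos-tagger",  # placeholder output path
    num_train_epochs=15,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",    # cosine annealing to 0
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,   # restore the best checkpoint at the end
)

trainer = Trainer(
    model=model,                   # token-classification model from the sketch above
    args=args,
    train_dataset=train_dataset,   # placeholder: tokenized, label-aligned training split
    eval_dataset=eval_dataset,     # placeholder: tokenized, label-aligned validation split
    tokenizer=tokenizer,
)
trainer.train()
```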

| Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ------------- | --------------- | -------- |
| 1     | 1.026100      | 0.733780        | 0.746021 |
| 2     | 0.646900      | 0.659625        | 0.775688 |
| 3     | 0.500400      | 0.576214        | 0.798523 |
| 4     | 0.385400      | 0.606503        | 0.805269 |
| 5     | 0.288000      | 0.652493        | 0.809092 |
| 6     | 0.204600      | 0.671678        | 0.815216 |
| 7     | 0.145200      | 0.704693        | 0.818209 |
| 8     | 0.098700      | 0.830561        | 0.816998 |
| 9     | 0.066100      | 0.883329        | 0.825232 |
| 10    | 0.043900      | 0.933347        | 0.825664 |
| 11    | 0.027200      | 0.992055        | 0.828449 |
| 12    | 0.017300      | 1.054874        | 0.830819 |
| 13    | 0.011500      | 1.081638        | 0.830940 |
| 14    | 0.008500      | 1.094252        | 0.831304 |
| 15    | 0.007400      | 1.097428        | 0.831442 |

## How to Use

### As Token Classifier

```python
from transformers import pipeline

pretrained_name = "w11wo/lao-roberta-base-pos-tagger"

nlp = pipeline(
    "token-classification",
    model=pretrained_name,
    tokenizer=pretrained_name
)

nlp("ຮ້ອງ ມ່ວນ ແທ້ ສຽງດີ ອິຫຼີ")
```
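
The pipeline returns a list of per-token predictions (tag, confidence score, and character offsets). For lower-level control, the model can also be loaded explicitly; the snippet below is a minimal sketch rather than an official example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

pretrained_name = "w11wo/lao-roberta-base-pos-tagger"

tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
model = AutoModelForTokenClassification.from_pretrained(pretrained_name)

text = "ຮ້ອງ ມ່ວນ ແທ້ ສຽງດີ ອິຫຼີ"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# map each token's highest-scoring class id back to its tag name
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```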

## Disclaimer

Do consider the biases that stem from both the pre-trained RoBERTa model and the `Yunshan Cup 2020` dataset, as they may carry over into this model's predictions.

## Author

Lao RoBERTa Base POS Tagger was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory using its free GPU access.