w11wo committed on
Commit 42e5235
1 Parent(s): ca0d6a6

End of training

README.md CHANGED
@@ -1,67 +1,99 @@
  ---
- language: id
- tags:
- - indonesian-roberta-base-posp-tagger
  license: mit
  datasets:
- - indonlu
- widget:
- - text: "Budi sedang pergi ke pasar."
  ---

- ## Indonesian RoBERTa Base POSP Tagger
-
- Indonesian RoBERTa Base POSP Tagger is a part-of-speech token-classification model based on the [RoBERTa](https://arxiv.org/abs/1907.11692) model. The model started from the pre-trained [Indonesian RoBERTa Base](https://hf.co/flax-community/indonesian-roberta-base) model, which was then fine-tuned on [`indonlu`](https://hf.co/datasets/indonlu)'s `POSP` dataset of tag-labelled news.
-
- After training, the model achieved an evaluation F1-macro of 95.34%. On the benchmark test set, it achieved an accuracy of 93.99% and an F1-macro of 88.93%.

- Hugging Face's `Trainer` class from the [Transformers](https://huggingface.co/transformers) library was used to train the model. PyTorch was used as the backend framework during training, though the model remains compatible with other frameworks.

- ## Model

- | Model                                 | #params | Arch.        | Training/Validation data (text) |
- | ------------------------------------- | ------- | ------------ | ------------------------------- |
- | `indonesian-roberta-base-posp-tagger` | 124M    | RoBERTa Base | `POSP`                          |

- ## Evaluation Results

- The model was trained for 10 epochs, and the best model was loaded at the end.

- | Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
- | ----- | ------------- | --------------- | --------- | -------- | -------- | -------- |
- | 1     | 0.898400      | 0.343731        | 0.894324  | 0.894324 | 0.894324 | 0.894324 |
- | 2     | 0.294700      | 0.236619        | 0.929620  | 0.929620 | 0.929620 | 0.929620 |
- | 3     | 0.214100      | 0.202723        | 0.938349  | 0.938349 | 0.938349 | 0.938349 |
- | 4     | 0.171100      | 0.183630        | 0.945264  | 0.945264 | 0.945264 | 0.945264 |
- | 5     | 0.143300      | 0.169744        | 0.948469  | 0.948469 | 0.948469 | 0.948469 |
- | 6     | 0.124700      | 0.174946        | 0.947963  | 0.947963 | 0.947963 | 0.947963 |
- | 7     | 0.109800      | 0.167450        | 0.951590  | 0.951590 | 0.951590 | 0.951590 |
- | 8     | 0.101300      | 0.163191        | 0.952475  | 0.952475 | 0.952475 | 0.952475 |
- | 9     | 0.093500      | 0.163255        | 0.953361  | 0.953361 | 0.953361 | 0.953361 |
- | 10    | 0.089000      | 0.164673        | 0.953445  | 0.953445 | 0.953445 | 0.953445 |

- ## How to Use

- ### As Token Classifier

- ```python
- from transformers import pipeline
-
- pretrained_name = "w11wo/indonesian-roberta-base-posp-tagger"
-
- nlp = pipeline(
-     "token-classification",
-     model=pretrained_name,
-     tokenizer=pretrained_name
- )
-
- nlp("Budi sedang pergi ke pasar.")
- ```

- ## Disclaimer

- Do consider the biases that come from both the pre-trained RoBERTa model and the `POSP` dataset, which may be carried over into the results of this model.

- ## Author

- Indonesian RoBERTa Base POSP Tagger was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory using their free GPU access.

  ---
  license: mit
+ base_model: flax-community/indonesian-roberta-base
+ tags:
+ - generated_from_trainer
  datasets:
+ - indonlu
+ metrics:
+ - precision
+ - recall
+ - f1
+ - accuracy
+ model-index:
+ - name: indonesian-roberta-base-posp-tagger
+   results:
+   - task:
+       name: Token Classification
+       type: token-classification
+     dataset:
+       name: indonlu
+       type: indonlu
+       config: posp
+       split: validation
+       args: posp
+     metrics:
+     - name: Precision
+       type: precision
+       value: 0.9625100240577386
+     - name: Recall
+       type: recall
+       value: 0.9625100240577386
+     - name: F1
+       type: f1
+       value: 0.9625100240577386
+     - name: Accuracy
+       type: accuracy
+       value: 0.9625100240577386
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # indonesian-roberta-base-posp-tagger

+ This model is a fine-tuned version of [flax-community/indonesian-roberta-base](https://huggingface.co/flax-community/indonesian-roberta-base) on the indonlu dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.1395
+ - Precision: 0.9625
+ - Recall: 0.9625
+ - F1: 0.9625
+ - Accuracy: 0.9625

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 10
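
These hyperparameters map one-to-one onto `TrainingArguments`. Below is a minimal sketch of an equivalent setup; it is an illustration, not the repo's actual training script, and the 26-label POSP tag set is an assumption:

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base = "flax-community/indonesian-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base)
# num_labels=26 assumes the POSP tag set; adjust to the dataset's label list.
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=26)

args = TrainingArguments(
    output_dir="indonesian-roberta-base-posp-tagger",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=10,
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the default optimizer,
    # matching the values listed above.
)

# Tokenized train/eval splits of indonlu's POSP subset would be passed here.
trainer = Trainer(model=model, args=args)
```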

+ ### Training results

+ | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
+ |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
+ | No log        | 1.0   | 420  | 0.2254          | 0.9313    | 0.9313 | 0.9313 | 0.9313   |
+ | 0.4398        | 2.0   | 840  | 0.1617          | 0.9499    | 0.9499 | 0.9499 | 0.9499   |
+ | 0.1566        | 3.0   | 1260 | 0.1431          | 0.9569    | 0.9569 | 0.9569 | 0.9569   |
+ | 0.103         | 4.0   | 1680 | 0.1412          | 0.9605    | 0.9605 | 0.9605 | 0.9605   |
+ | 0.0723        | 5.0   | 2100 | 0.1408          | 0.9635    | 0.9635 | 0.9635 | 0.9635   |
+ | 0.051         | 6.0   | 2520 | 0.1408          | 0.9642    | 0.9642 | 0.9642 | 0.9642   |
+ | 0.051         | 7.0   | 2940 | 0.1510          | 0.9635    | 0.9635 | 0.9635 | 0.9635   |
+ | 0.0368        | 8.0   | 3360 | 0.1653          | 0.9645    | 0.9645 | 0.9645 | 0.9645   |
+ | 0.0277        | 9.0   | 3780 | 0.1664          | 0.9644    | 0.9644 | 0.9644 | 0.9644   |
+ | 0.0231        | 10.0  | 4200 | 0.1668          | 0.9646    | 0.9646 | 0.9646 | 0.9646   |

+ ### Framework versions

+ - Transformers 4.37.2
+ - Pytorch 2.2.0+cu118
+ - Datasets 2.16.1
+ - Tokenizers 0.15.1
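
The regenerated card drops the old usage example. Inference still works as in the snippet removed above; a minimal sketch reusing it:

```python
from transformers import pipeline

pretrained_name = "w11wo/indonesian-roberta-base-posp-tagger"

# Token-classification pipeline, as in the previous README.
nlp = pipeline(
    "token-classification",
    model=pretrained_name,
    tokenizer=pretrained_name,
)

print(nlp("Budi sedang pergi ke pasar."))  # one dict per tagged token
```
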
config.json CHANGED
@@ -1,10 +1,11 @@
  {
- "_name_or_path": "indonesian-roberta-base-posp-tagger",
+ "_name_or_path": "flax-community/indonesian-roberta-base",
  "architectures": [
  "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
+ "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
@@ -76,7 +77,7 @@
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
- "transformers_version": "4.8.2",
+ "transformers_version": "4.37.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
merges.txt CHANGED
@@ -1,4 +1,4 @@
- #version: 0.2 - Trained by `huggingface/tokenizers`
+ #version: 0.2
  a n
  Ġ d
  e r

model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e7e84b30ea165a7c83e532b6b78184ba7bdc6c9d44d606a3d269726f44faac1a
- size 496328272
+ oid sha256:74a172fdb16b3bc119fdf9e657c16e6a31f11ba4e00ea149ae343456ba5da238
+ size 496324072

runs/Feb19_11-00-22_bookbot-h100/events.out.tfevents.1708340422.bookbot-h100.8811.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2a3ee78f3b0ce792d75fc103dabeec566e4a20b04c066bd52dc3f7e431c361b3
+ size 11805

runs/Feb19_11-00-22_bookbot-h100/events.out.tfevents.1708340574.bookbot-h100.8811.1 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:07f96ad90142a04884a683d82e5468514babcb3d9804c9c733708055997c9516
+ size 560

special_tokens_map.json CHANGED
@@ -1 +1,51 @@
- {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }

tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
tokenizer_config.json CHANGED
@@ -1 +1,57 @@
- {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "add_prefix_space": true, "errors": "replace", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "special_tokens_map_file": null, "name_or_path": "flax-community/indonesian-roberta-base", "tokenizer_class": "RobertaTokenizer"}
+ {
+   "add_prefix_space": true,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
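
To verify the special-token settings above after loading, assuming the Hub checkpoint matches this commit:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("w11wo/indonesian-roberta-base-posp-tagger")
print(tok.mask_token)                # "<mask>"; lstrip=True lets it absorb a preceding space
print(tok.cls_token, tok.sep_token)  # "<s>" "</s>", matching the map above
print(tok.tokenize("Budi sedang pergi ke pasar."))  # BPE pieces (add_prefix_space=True)
```
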
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3d0253fe1cd338bbbba23d8f23104c4b0a2197245ed788cb131216b57cd655e6
+ size 4728