Updated README.md with instructions to classify text and specified class labels in config.json

- README.md +32 -3
- config.json +6 -6
README.md
CHANGED
@@ -5,20 +5,49 @@ tags:
 - indobenchmark
 - indonlu
 license: mit
-inference:
+inference: true
 datasets:
 - Indo4B
-- IndoNLU (SmSA)
 ---
 
 # IndoBERT-Lite Large Model (phase2 - uncased) Finetuned on IndoNLU SmSA dataset
 
+Finetuned the IndoBERT-Lite Large model (phase2 - uncased) following the procedure described in the paper [IndoNLU: Benchmark and Resources for Evaluating Indonesian
+Natural Language Understanding](https://arxiv.org/pdf/2009.05387.pdf).
+
+Finetuning hyperparameters:
+- learning rate: 2e-5
+- batch size: 16
+- no. of epochs: 5
+- max sequence length: 512
+- random seed: 42
+
 ## How to use
 
 ### Load model and tokenizer
 
 ```python
-from transformers import BertTokenizer, AutoModel
+from transformers import BertTokenizer, AutoModelForSequenceClassification
+import torch
+import torch.nn.functional as F
+
 tokenizer = BertTokenizer.from_pretrained("tyqiangz/indobert-lite-large-p2-smsa")
-model = AutoModel.from_pretrained("tyqiangz/indobert-lite-large-p2-smsa")
+model = AutoModelForSequenceClassification.from_pretrained("tyqiangz/indobert-lite-large-p2-smsa")
+
+text = "Penyakit koronavirus 2019"
+
+index_to_word = {0: 'positive', 1: 'neutral', 2: 'negative'}
+
+subwords = tokenizer.encode(text, add_special_tokens=True)
+subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)
+
+logits = model(subwords)[0]
+label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()
+
+print(index_to_word[label])
+
+"""
+Output:
+'negative'
+"""
 ```
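The new snippet imports `torch.nn.functional as F` but never calls it; the `torch.topk` argmax does not need it. Presumably `F` is intended for turning the logits into class probabilities, so a natural continuation of the snippet (an assumption, not shown in the commit) would be:

```python
# Continuation of the README snippet: convert logits to probabilities.
# `logits` has shape (1, 3); softmax over the last dim gives one
# probability per class, indexed as in index_to_word above.
probs = F.softmax(logits, dim=-1).squeeze()
for idx, prob in enumerate(probs.tolist()):
    print(f"{index_to_word[idx]}: {prob:.4f}")
```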
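For readers who want to reproduce the finetuning, the hyperparameters listed in the new README section slot into a standard `transformers` `Trainer` setup. The sketch below is a minimal, untested illustration: the base checkpoint `indobenchmark/indobert-lite-large-p2` and the dataset id `indonlp/indonlu` (config `smsa`, with `text`/`label` columns) are assumptions, not taken from this commit.

```python
# Minimal finetuning sketch using the hyperparameters from the README.
# Assumed (not from this commit): base checkpoint and dataset id/columns.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    BertTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("indonlp/indonlu", "smsa")  # assumed dataset id
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-lite-large-p2")

def tokenize(batch):
    # max sequence length: 512
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "indobenchmark/indobert-lite-large-p2",
    num_labels=3,
    id2label={0: "positive", 1: "neutral", 2: "negative"},
)

args = TrainingArguments(
    output_dir="indobert-lite-large-p2-smsa",
    learning_rate=2e-5,              # learning rate: 2e-5
    per_device_train_batch_size=16,  # batch size: 16
    num_train_epochs=5,              # no. of epochs: 5
    seed=42,                         # random seed: 42
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```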
config.json
CHANGED
@@ -15,17 +15,17 @@
   "hidden_dropout_prob": 0,
   "hidden_size": 1024,
   "id2label": {
-    "0": "LABEL_0",
-    "1": "LABEL_1",
-    "2": "LABEL_2"
+    "0": "positive",
+    "1": "neutral",
+    "2": "negative"
   },
   "initializer_range": 0.02,
   "inner_group_num": 1,
   "intermediate_size": 4096,
   "label2id": {
-    "LABEL_0": 0,
-    "LABEL_1": 1,
-    "LABEL_2": 2
+    "positive": 0,
+    "neutral": 1,
+    "negative": 2
   },
   "layer_norm_eps": 1e-12,
   "max_position_embeddings": 512,
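Spelling the class names out in `id2label`/`label2id` is what lets generic tooling, including the hosted widget enabled by `inference: true`, report `positive`/`neutral`/`negative` instead of `LABEL_0`-style placeholders. With the updated config, the `pipeline` API needs no hand-rolled `index_to_word` mapping; a quick sketch, assuming the updated files are on the Hub:

```python
from transformers import pipeline

# The text-classification pipeline reads id2label from config.json,
# so results carry the class names directly.
classifier = pipeline("text-classification", model="tyqiangz/indobert-lite-large-p2-smsa")
print(classifier("Penyakit koronavirus 2019"))
# e.g. [{'label': 'negative', 'score': ...}]
```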