julien-c (HF staff) committed
Commit 6e83f77
1 Parent(s): 7960549

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/sarahlintang/IndoBERT/README.md

Files changed (1)
  1. README.md +43 -0
README.md ADDED
@@ -0,0 +1,43 @@
---
language: id
datasets:
- oscar
---
# IndoBERT (Indonesian BERT Model)

## Model description

IndoBERT is a pre-trained language model based on the BERT architecture for the Indonesian language.

This model is the base-uncased version and uses the bert-base configuration.
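To verify that configuration claim, here is a minimal sketch (not part of the original card) that loads the published config; the commented values are the standard bert-base defaults:

```python
from transformers import AutoConfig

# Minimal sketch: inspect the checkpoint's configuration.
# The expected values in the comments are the standard bert-base defaults.
config = AutoConfig.from_pretrained("sarahlintang/IndoBERT")
print(config.num_hidden_layers)    # 12 for bert-base
print(config.hidden_size)          # 768 for bert-base
print(config.num_attention_heads)  # 12 for bert-base
```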
## Intended uses & limitations

#### How to use

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")

# Encode an Indonesian sentence ("hi, I want to eat.") into token IDs.
tokenizer.encode("hai aku mau makan.")
# [2, 8078, 1785, 2318, 1946, 18, 4]
```
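Beyond token IDs, the model can produce contextual embeddings. The following is a hedged sketch (not from the original card) that reuses the `tokenizer` and `model` objects loaded above:

```python
import torch

# Tokenize the sentence and run a forward pass without gradient tracking.
inputs = tokenizer("hai aku mau makan.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; shape is (1, sequence_length, 768) for a bert-base config.
print(outputs.last_hidden_state.shape)
```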
## Training data

This model was pre-trained on 16 GB of raw text (~2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).

The model matches the bert-base configuration and uses a vocabulary of 32,000 tokens.
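As a quick check of the stated vocabulary size (a minimal sketch, assuming the Hub checkpoint ships the matching tokenizer):

```python
from transformers import AutoTokenizer

# Minimal sketch: the reported 32,000-token vocabulary should match the tokenizer's.
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
print(tokenizer.vocab_size)  # expected: 32000
```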
## Training procedure

The model was trained using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
## Eval results

We evaluated this model on three Indonesian NLP downstream tasks:
- extractive summarization
- sentiment analysis
- part-of-speech tagging

The model outperformed multilingual BERT on all three tasks.