ChristopherA08 committed
Commit
ccebcb7
1 Parent(s): 6f3e960

Create README.md

Files changed (1)
  1. README.md +30 -0
README.md ADDED
@@ -0,0 +1,30 @@
---
language: id
datasets:
- oscar
---
# IndoELECTRA (Indonesian ELECTRA Model)

## Model description
ELECTRA is a new method for self-supervised language representation learning. This repository contains the pre-trained ELECTRA Base model (TensorFlow 1.15.0), trained on a large Indonesian corpus (~16 GB of raw text, ~2B Indonesian words).
IndoELECTRA is a pre-trained language model based on the ELECTRA architecture for the Indonesian language.

This model is the base version and uses the electra-base configuration.

## Intended uses & limitations

#### How to use

```python
from transformers import AutoTokenizer, AutoModel

# Load the IndoELECTRA tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("ChristopherA08/IndoELECTRA")
model = AutoModel.from_pretrained("ChristopherA08/IndoELECTRA")

# Convert an Indonesian sentence into token IDs
tokenizer.encode("hai aku mau makan.")
# [2, 8078, 1785, 2318, 1946, 18, 4]
```
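
The snippet above only tokenizes the input. As a minimal sketch (not part of the original card), the encoded sentence can also be passed through the model to obtain contextual embeddings:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ChristopherA08/IndoELECTRA")
model = AutoModel.from_pretrained("ChristopherA08/IndoELECTRA")

# Encode the sentence (with attention mask) and run a forward pass
inputs = tokenizer("hai aku mau makan.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```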

## Training procedure

The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
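
The card does not list the exact launch command or hyperparameters. Purely as an illustrative sketch, pretraining with the official google-research/electra code is typically started via its `run_pretraining.py` script; the bucket name, model name, and hyperparameter values below are placeholders, not the settings used for IndoELECTRA.

```python
# Illustrative sketch only: launching ELECTRA-Base pretraining with the official
# google-research/electra code. The bucket, model name, and hparams below are
# placeholders, not the actual IndoELECTRA training settings.
import subprocess

subprocess.run(
    [
        "python3", "run_pretraining.py",
        "--data-dir", "gs://my-indoelectra-bucket",  # hypothetical GCS bucket
        "--model-name", "electra_base_id",           # hypothetical model name
        "--hparams", '{"model_size": "base", "use_tpu": true, "num_tpu_cores": 8}',
    ],
    check=True,
)
```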