File size: 1,058 Bytes
ccebcb7 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
---
language: id
datasets:
- oscar
---
# IndoBERT (Indonesian BERT Model)
## Model description
ELECTRA is a new method for self-supervised language representation learning. This repository contains the pre-trained Electra Base model (tensorflow 1.15.0) trained in a Large Indonesian corpus (~16GB of raw text | ~2B indonesian words).
IndoELECTRA is a pre-trained language model based on ELECTRA architecture for the Indonesian Language.
This model is base version which use electra-base config.
## Intended uses & limitations
#### How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ChristopherA08/IndoELECTRA")
model = AutoModel.from_pretrained("ChristopherA08/IndoELECTRA")
tokenizer.encode("hai aku mau makan.")
[2, 8078, 1785, 2318, 1946, 18, 4]
```
## Training procedure
The training of the model has been performed using Google's original Tensorflow code on eight core Google Cloud TPU v2.
We used a Google Cloud Storage bucket, for persistent storage of training data and models.
|