Commit 2f5fb80 by ilos-vigil
1 parent: 7f12618

Update README.md

Files changed (1)
  1. README.md +49 -0
README.md CHANGED
@@ -1,3 +1,52 @@
  ---
+ language: id
  license: mit
+ datasets:
+ - oscar
+ - wikipedia
+ - id_newspapers_2018
+ widget:
+ - text: "Saya [MASK] makan nasi goreng."
+ - text: "Kucing itu sedang bermain dengan [MASK]."
  ---
+
+ # Indonesian small BigBird model
+
+ **Disclaimer:** This is a work in progress. The current checkpoint was trained for ~1.0 epoch (6450 steps) and reaches a train loss of 2.565 and an eval loss of 2.466. Newer checkpoints and additional information will be added in the future.
+
+ ## Model Description
+
+ This model was pretrained **only** with the masked language modeling (MLM) objective. Its architecture is shown in the configuration snippet below. The tokenizer was trained on the whole **cased** dataset with a vocabulary size of **only** 30K.
+
+ ```py
+ from transformers import BigBirdConfig
+
+ # Small BigBird encoder with block-sparse attention and a 4096-token context
+ config = BigBirdConfig(
+     vocab_size=30_000,
+     hidden_size=512,
+     num_hidden_layers=4,
+     num_attention_heads=8,
+     intermediate_size=2048,
+     max_position_embeddings=4096,
+     is_encoder_decoder=False,
+     attention_type='block_sparse',
+ )
+ ```
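To give a feel for how small this configuration is, the sketch below estimates the encoder's parameter count from the values in the configuration. It is only a back-of-the-envelope calculation: it assumes a standard BERT-style layout (token, position, and segment embeddings; biased Q/K/V/output projections; a two-layer feed-forward block; two LayerNorms per layer) and two segment types, none of which is stated in this README. The helper name `estimate_encoder_params` is made up for illustration, and the MLM head and pooler are left out.

```python
def estimate_encoder_params(vocab_size, hidden_size, num_hidden_layers,
                            intermediate_size, max_position_embeddings,
                            type_vocab_size=2):
    """Rough parameter count for a BERT-style encoder (assumed layout).

    type_vocab_size=2 is an assumption, not a value given in the README.
    The number of attention heads does not change the count, so it is
    not an argument here.
    """
    layer_norm = 2 * hidden_size  # weight + bias

    embeddings = (
        vocab_size * hidden_size                 # token embeddings
        + max_position_embeddings * hidden_size  # position embeddings
        + type_vocab_size * hidden_size          # segment embeddings
        + layer_norm                             # embedding LayerNorm
    )

    # Q, K, V and output projections, each hidden x hidden plus bias
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two dense layers: hidden -> intermediate -> hidden, each with bias
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
    )
    per_layer = attention + feed_forward + 2 * layer_norm

    return {
        "embeddings": embeddings,
        "per_layer": per_layer,
        "total": embeddings + num_hidden_layers * per_layer,
    }


estimate = estimate_encoder_params(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    intermediate_size=2048,
    max_position_embeddings=4096,
)
for name, count in estimate.items():
    print(f"{name}: {count:,}")
```

Under these assumptions the estimate comes to about 30M parameters, with more than half of them in the embedding tables; the real count of the published checkpoint may differ.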
+
+ ## How to use
+
+ > TBD
+
+ ## Limitations and bias
+
+ > TBD
+
+ ## Training and evaluation data
+
+ This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump file from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar), and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing uses the function from the [language modeling task guide](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a block size of 4096. Each dataset is split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), with 5% allocated as evaluation data.
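The block-based preprocessing can be sketched in plain Python as follows. This is a minimal illustration in the spirit of the linked guide, not the exact function used for this model: the name `group_texts` and the batch layout (a dict of lists of token IDs) follow the guide's general pattern, while the choice to always drop the trailing remainder is an assumption made here for simplicity.

```python
from itertools import chain


def group_texts(examples, block_size=4096):
    """Concatenate tokenized texts and cut them into fixed-size blocks.

    `examples` maps column names (e.g. "input_ids") to lists of token
    lists. Tokens left over after the last full block are dropped
    (an assumption of this sketch).
    """
    # Flatten every column into one long token sequence
    concatenated = {key: list(chain.from_iterable(values))
                    for key, values in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Round down to a multiple of block_size
    total_length = (total_length // block_size) * block_size
    return {
        key: [tokens[i:i + block_size]
              for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }


# Toy batch with a tiny block size so the effect is easy to see
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
blocks = group_texts(batch, block_size=4)
print(blocks["input_ids"])
```

With the `datasets` library, the 5% hold-out described above corresponds to a call such as `dataset.train_test_split(test_size=0.05)`; the seed and shuffle settings used for this model are not stated here.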
+
+ ## Training Procedure
+
+ > TBD
+
+ ## Evaluation
+
+ > TBD