cahya/distilbert-base-indonesian

cahya committed on Feb 8, 2021

Commit 613b72c · 1 Parent(s): 1d1b1a1

updated the readme and tehe model

Files changed (2)

README.md +35 -17
pytorch_model.bin +1 -1

README.md CHANGED

@@ -24,23 +24,41 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> from transformers import pipeline
 >>> unmasker = pipeline('fill-mask', model='cahya/distilbert-base-indonesian')
->>> unmasker("Ibu ku sedang bekerja [MASK] supermarket")
-[{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
-  'score': 0.7983310222625732,
-  'token': 1495},
- {'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
-  'score': 0.090003103017807,
-  'token': 17},
- {'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
-  'score': 0.025469014421105385,
-  'token': 1600},
- {'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
-  'score': 0.017966199666261673,
-  'token': 1555},
- {'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
-  'score': 0.016971781849861145,
-  'token': 1572}]
+>>> unmasker("Ayahku sedang bekerja di sawah untuk [MASK] padi")
+[
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk menanam padi [SEP]",
+    "score": 0.6853187084197998,
+    "token": 12712,
+    "token_str": "menanam"
+  },
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk bertani padi [SEP]",
+    "score": 0.03739545866847038,
+    "token": 15484,
+    "token_str": "bertani"
+  },
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk memetik padi [SEP]",
+    "score": 0.02742469497025013,
+    "token": 30338,
+    "token_str": "memetik"
+  },
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk penggilingan padi [SEP]",
+    "score": 0.02214187942445278,
+    "token": 28252,
+    "token_str": "penggilingan"
+  },
+  {
+    "sequence": "[CLS] ayahku sedang bekerja di sawah untuk tanam padi [SEP]",
+    "score": 0.0185895636677742,
+    "token": 11308,
+    "token_str": "tanam"
+  }
+]
 ```
 Here is how to use this model to get the features of a given text in PyTorch:
 ```python
@@ -67,7 +85,7 @@ output = model(encoded_input)
 ## Training data
-This model was pre-trained with 522MB of indonesian Wikipedia and 1GB of
+This model was distiled with 522MB of indonesian Wikipedia and 1GB of
 [indonesian newspapers](https://huggingface.co/datasets/id_newspapers_2018).
 The texts are lowercased and tokenized using WordPiece and a vocabulary size of 32,000. The inputs of the model are
 then of the form:

pytorch_model.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:550663fcbd5a7473047e55ef8778939ecab3a04685fa6666d755159cd929b1a5
+oid sha256:39b114f8d3260960d4a3a28c2b1ba0543e4ec09a96342d88747f1bed1cd9ab0e
 size 272513919
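
The `pytorch_model.bin` diff only touches a Git LFS pointer: the `oid` changes while `size` stays at 272513919 bytes, so the new weights are the same length but have different contents. As a minimal sketch, a locally downloaded copy could be checked against the new pointer as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of this commit or of any library, and the local file path in the usage note is an assumption.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict) -> bool:
    """Return True if the file has the size and SHA-256 digest recorded in the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so a ~270 MB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents copied from the "+" side of the diff above.
new_pointer = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:39b114f8d3260960d4a3a28c2b1ba0543e4ec09a96342d88747f1bed1cd9ab0e
size 272513919
""")
```

With a copy of the weights saved locally, for example as `pytorch_model.bin` in the working directory, `matches_pointer("pytorch_model.bin", new_pointer)` should return `True` only for the file introduced by this commit.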