julien-c (HF staff) committed on
Commit
674a318
1 Parent(s): 03b499c

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/zanelim/singbert-large-sg/README.md

Files changed (1)
  1. README.md +215 -0
README.md ADDED
---
language: en
tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- bert-large-uncased
license: mit
datasets:
- reddit singapore, malaysia
- hardwarezone
widget:
- text: "kopi c siew [MASK]"
- text: "die [MASK] must try"
---

# SingBert Large

SingBert Large - Bert for Singlish (SG) and Manglish (MY).

## Model description

Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the large version, initialized from [BERT large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models) and further pre-trained on [Singlish](https://en.wikipedia.org/wiki/Singlish) and [Manglish](https://en.wikipedia.org/wiki/Manglish) data.

## Intended uses & limitations

#### How to use

```python
>>> from transformers import pipeline
>>> nlp = pipeline('fill-mask', model='zanelim/singbert-large-sg')
>>> nlp("kopi c siew [MASK]")

[{'sequence': '[CLS] kopi c siew dai [SEP]',
'score': 0.9003700017929077,
'token': 18765,
'token_str': 'dai'},
{'sequence': '[CLS] kopi c siew mai [SEP]',
'score': 0.0779474675655365,
'token': 14736,
'token_str': 'mai'},
{'sequence': '[CLS] kopi c siew. [SEP]',
'score': 0.0032227332703769207,
'token': 1012,
'token_str': '.'},
{'sequence': '[CLS] kopi c siew bao [SEP]',
'score': 0.0017727474914863706,
'token': 25945,
'token_str': 'bao'},
{'sequence': '[CLS] kopi c siew peng [SEP]',
'score': 0.0012526646023616195,
'token': 26473,
'token_str': 'peng'}]

>>> nlp("one teh c siew dai, and one kopi [MASK]")

[{'sequence': '[CLS] one teh c siew dai, and one kopi. [SEP]',
'score': 0.5249741077423096,
'token': 1012,
'token_str': '.'},
{'sequence': '[CLS] one teh c siew dai, and one kopi o [SEP]',
'score': 0.27349168062210083,
'token': 1051,
'token_str': 'o'},
{'sequence': '[CLS] one teh c siew dai, and one kopi peng [SEP]',
'score': 0.057190295308828354,
'token': 26473,
'token_str': 'peng'},
{'sequence': '[CLS] one teh c siew dai, and one kopi c [SEP]',
'score': 0.04022320732474327,
'token': 1039,
'token_str': 'c'},
{'sequence': '[CLS] one teh c siew dai, and one kopi? [SEP]',
'score': 0.01191170234233141,
'token': 1029,
'token_str': '?'}]

>>> nlp("die [MASK] must try")

[{'sequence': '[CLS] die die must try [SEP]',
'score': 0.9921030402183533,
'token': 3280,
'token_str': 'die'},
{'sequence': '[CLS] die also must try [SEP]',
'score': 0.004993876442313194,
'token': 2036,
'token_str': 'also'},
{'sequence': '[CLS] die liao must try [SEP]',
'score': 0.000317625846946612,
'token': 727,
'token_str': 'liao'},
{'sequence': '[CLS] die still must try [SEP]',
'score': 0.0002260878391098231,
'token': 2145,
'token_str': 'still'},
{'sequence': '[CLS] die i must try [SEP]',
'score': 0.00016935862367972732,
'token': 1045,
'token_str': 'i'}]

>>> nlp("dont play [MASK] leh")

[{'sequence': '[CLS] dont play play leh [SEP]',
'score': 0.9079819321632385,
'token': 2377,
'token_str': 'play'},
{'sequence': '[CLS] dont play punk leh [SEP]',
'score': 0.006846973206847906,
'token': 7196,
'token_str': 'punk'},
{'sequence': '[CLS] dont play games leh [SEP]',
'score': 0.004041737411171198,
'token': 2399,
'token_str': 'games'},
{'sequence': '[CLS] dont play politics leh [SEP]',
'score': 0.003728888463228941,
'token': 4331,
'token_str': 'politics'},
{'sequence': '[CLS] dont play cheat leh [SEP]',
'score': 0.0032805048394948244,
'token': 21910,
'token_str': 'cheat'}]

>>> nlp("confirm plus [MASK]")

[{'sequence': '[CLS] confirm plus chop [SEP]',
'score': 0.9749826192855835,
'token': 24494,
'token_str': 'chop'},
{'sequence': '[CLS] confirm plus chopped [SEP]',
'score': 0.017554156482219696,
'token': 24881,
'token_str': 'chopped'},
{'sequence': '[CLS] confirm plus minus [SEP]',
'score': 0.002725469646975398,
'token': 15718,
'token_str': 'minus'},
{'sequence': '[CLS] confirm plus guarantee [SEP]',
'score': 0.000900257145985961,
'token': 11302,
'token_str': 'guarantee'},
{'sequence': '[CLS] confirm plus one [SEP]',
'score': 0.0004384620988275856,
'token': 2028,
'token_str': 'one'}]

>>> nlp("catch no [MASK]")

[{'sequence': '[CLS] catch no ball [SEP]',
'score': 0.9381157159805298,
'token': 3608,
'token_str': 'ball'},
{'sequence': '[CLS] catch no balls [SEP]',
'score': 0.060842301696538925,
'token': 7395,
'token_str': 'balls'},
{'sequence': '[CLS] catch no fish [SEP]',
'score': 0.00030917322146706283,
'token': 3869,
'token_str': 'fish'},
{'sequence': '[CLS] catch no breath [SEP]',
'score': 7.552534952992573e-05,
'token': 3052,
'token_str': 'breath'},
{'sequence': '[CLS] catch no tail [SEP]',
'score': 4.208395694149658e-05,
'token': 5725,
'token_str': 'tail'}]

```

Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('zanelim/singbert-large-sg')
model = BertModel.from_pretrained("zanelim/singbert-large-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained("zanelim/singbert-large-sg")
model = TFBertModel.from_pretrained("zanelim/singbert-large-sg")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

#### Limitations and bias

This model was fine-tuned on a colloquial Singlish and Manglish corpus, so it is best applied to downstream tasks involving its main constituent languages: English, Mandarin, and Malay. Also, since the training data comes mainly from forums, be aware of the inherent biases that data carries.

## Training data

A corpus of colloquial Singlish and Manglish (both mixtures of English, Mandarin, Tamil, Malay, and other local dialects such as Hokkien, Cantonese, or Teochew), collected from the subreddits `r/singapore` and `r/malaysia` and from forums such as `hardwarezone`.

## Training procedure

Initialized with the [BERT large uncased (whole word masking)](https://github.com/google-research/bert#pre-trained-models) vocabulary and checkpoint (pre-trained weights). The top 1000 custom vocabulary tokens (not overlapping with the original BERT vocabulary) were then extracted from the training data and placed into the unused token slots of the original BERT vocabulary, as sketched below.
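
The following is a minimal sketch, not the author's actual script, of how custom tokens can be written into the `[unusedN]` slots of a BERT `vocab.txt`; the file names and the `custom_tokens` list are hypothetical placeholders.

```python
# Hypothetical illustration: overwrite [unusedN] entries of the original BERT
# vocab with custom tokens mined from the Singlish/Manglish corpus. Token ids
# and vocab size stay unchanged, so the pre-trained checkpoint still loads.
custom_tokens = ["kopi", "siew", "dai", "liao", "leh"]  # placeholder examples

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

# Find the [unusedN] slots and fill them in place.
unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
for slot, token in zip(unused_slots, custom_tokens):
    vocab[slot] = token

with open("vocab_custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```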

Pre-training was then continued (fine-tuned) on the training data with the following hyperparameters (a rough equivalent using the `transformers` Trainer is sketched after this list):
* train_batch_size: 512
* max_seq_length: 128
* num_train_steps: 300000
* num_warmup_steps: 5000
* learning_rate: 2e-5
* hardware: TPU v3-8
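
Below is a minimal sketch of a roughly equivalent continued masked-LM pre-training setup using the `transformers` Trainer with the hyperparameters above. This is not the author's actual setup (which used the original google-research/bert TPU scripts); `singlish_corpus.txt` is a hypothetical placeholder for the corpus, and the stock collator applies standard subword masking rather than whole word masking.

```python
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the BERT large uncased (whole word masking) checkpoint and vocab.
tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertForMaskedLM.from_pretrained("bert-large-uncased-whole-word-masking")

# Tokenize the raw corpus with max_seq_length = 128.
dataset = load_dataset("text", data_files={"train": "singlish_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard 15% random masking for the masked-LM objective (subword-level).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="singbert-large-sg",
    per_device_train_batch_size=512,  # train_batch_size
    max_steps=300_000,                # num_train_steps
    warmup_steps=5_000,               # num_warmup_steps
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```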