julien-c (HF staff) committed
Commit
74e501f
1 Parent(s): 3065c64

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/zanelim/singbert-lite-sg/README.md

Files changed (1):
README.md (added, +168 −0)
---
language: en
tags:
- singapore
- sg
- singlish
- malaysia
- ms
- manglish
- albert-base-v2
license: mit
datasets:
- reddit singapore, malaysia
- hardwarezone
widget:
- text: "dont play [MASK] leh"
- text: "die [MASK] must try"
---

# SingBert Lite

SingBert Lite - BERT for Singlish (SG) and Manglish (MY).

## Model description

Similar to [SingBert](https://huggingface.co/zanelim/singbert) but the lite version: it was initialized from [Albert base v2](https://github.com/google-research/albert#albert) and further pre-trained on
[singlish](https://en.wikipedia.org/wiki/Singlish) and [manglish](https://en.wikipedia.org/wiki/Manglish) data.

## Intended uses & limitations

#### How to use

```python
>>> from transformers import pipeline
>>> nlp = pipeline('fill-mask', model='zanelim/singbert-lite-sg')
>>> nlp("die [MASK] must try")

[{'sequence': '[CLS] die die must try[SEP]',
  'score': 0.7731555700302124,
  'token': 1327,
  'token_str': '▁die'},
 {'sequence': '[CLS] die also must try[SEP]',
  'score': 0.04763784259557724,
  'token': 67,
  'token_str': '▁also'},
 {'sequence': '[CLS] die still must try[SEP]',
  'score': 0.01859409362077713,
  'token': 174,
  'token_str': '▁still'},
 {'sequence': '[CLS] die u must try[SEP]',
  'score': 0.015824034810066223,
  'token': 287,
  'token_str': '▁u'},
 {'sequence': '[CLS] die is must try[SEP]',
  'score': 0.011271446943283081,
  'token': 25,
  'token_str': '▁is'}]

>>> nlp("dont play [MASK] leh")

[{'sequence': '[CLS] dont play play leh[SEP]',
  'score': 0.4365769624710083,
  'token': 418,
  'token_str': '▁play'},
 {'sequence': '[CLS] dont play punk leh[SEP]',
  'score': 0.06880936771631241,
  'token': 6769,
  'token_str': '▁punk'},
 {'sequence': '[CLS] dont play game leh[SEP]',
  'score': 0.051739856600761414,
  'token': 250,
  'token_str': '▁game'},
 {'sequence': '[CLS] dont play games leh[SEP]',
  'score': 0.045703962445259094,
  'token': 466,
  'token_str': '▁games'},
 {'sequence': '[CLS] dont play around leh[SEP]',
  'score': 0.013458190485835075,
  'token': 140,
  'token_str': '▁around'}]

>>> nlp("catch no [MASK]")

[{'sequence': '[CLS] catch no ball[SEP]',
  'score': 0.6197211146354675,
  'token': 1592,
  'token_str': '▁ball'},
 {'sequence': '[CLS] catch no balls[SEP]',
  'score': 0.08441998809576035,
  'token': 7152,
  'token_str': '▁balls'},
 {'sequence': '[CLS] catch no joke[SEP]',
  'score': 0.0676785409450531,
  'token': 8186,
  'token_str': '▁joke'},
 {'sequence': '[CLS] catch no?[SEP]',
  'score': 0.040638409554958344,
  'token': 60,
  'token_str': '?'},
 {'sequence': '[CLS] catch no one[SEP]',
  'score': 0.03546864539384842,
  'token': 53,
  'token_str': '▁one'}]

>>> nlp("confirm plus [MASK]")

[{'sequence': '[CLS] confirm plus chop[SEP]',
  'score': 0.9608421921730042,
  'token': 17144,
  'token_str': '▁chop'},
 {'sequence': '[CLS] confirm plus guarantee[SEP]',
  'score': 0.011784233152866364,
  'token': 9120,
  'token_str': '▁guarantee'},
 {'sequence': '[CLS] confirm plus confirm[SEP]',
  'score': 0.010571340098977089,
  'token': 10265,
  'token_str': '▁confirm'},
 {'sequence': '[CLS] confirm plus egg[SEP]',
  'score': 0.0033525123726576567,
  'token': 6387,
  'token_str': '▁egg'},
 {'sequence': '[CLS] confirm plus bet[SEP]',
  'score': 0.0008760977652855217,
  'token': 5676,
  'token_str': '▁bet'}]

```
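
For finer control than the `fill-mask` pipeline offers, the same top-k predictions can be read directly from the masked-LM head. This is a minimal sketch, not part of the original card; it assumes a recent `transformers` release in which model outputs expose `.logits`, and reuses the example sentence from the widget above.

```python
# Minimal sketch: top-5 fill-mask predictions without the pipeline.
import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM

tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
model = AlbertForMaskedLM.from_pretrained("zanelim/singbert-lite-sg")

inputs = tokenizer("dont play [MASK] leh", return_tensors="pt")
# Position of the [MASK] token in the input sequence.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits

top5 = torch.topk(logits[0, mask_pos], k=5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```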

Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('zanelim/singbert-lite-sg')
model = AlbertModel.from_pretrained("zanelim/singbert-lite-sg")
text = "Replace me with any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:
```python
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
model = TFAlbertModel.from_pretrained("zanelim/singbert-lite-sg")
text = "Replace me with any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
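
The snippets above return per-token hidden states. If a single sentence vector is needed, one common recipe (not something the card prescribes) is attention-mask-weighted mean pooling of the last hidden state; a minimal PyTorch sketch:

```python
# Minimal sketch: mean-pooled sentence embedding from the last hidden state.
import torch
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("zanelim/singbert-lite-sg")
model = AlbertModel.from_pretrained("zanelim/singbert-lite-sg")

encoded_input = tokenizer("dont play play leh", return_tensors="pt")
with torch.no_grad():
    output = model(**encoded_input)

# Average token embeddings, ignoring padding via the attention mask.
mask = encoded_input["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for an albert-base-v2-sized model
```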

#### Limitations and bias

This model was fine-tuned on a colloquial Singlish and Manglish corpus, so it is best applied to downstream tasks involving its main
constituent languages: English, Mandarin, and Malay. Also, since the training data comes mainly from forums, be aware of its inherent biases.

## Training data

Colloquial Singlish and Manglish corpus (both are mixtures of English, Mandarin, Tamil, Malay, and other local dialects such as Hokkien, Cantonese, or Teochew).
The corpus was collected from the subreddits `r/singapore` and `r/malaysia`, and from forums such as `hardwarezone`.
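
The card does not describe how the forum text was scraped. Purely as an illustration of one way to gather comparable raw text, here is a hypothetical sketch using the `praw` Reddit API client; the credentials are placeholders, and a forum such as `hardwarezone` would need a separate scraper.

```python
# Hypothetical collection sketch (not the authors' actual pipeline).
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",        # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="singlish-corpus-demo",
)

corpus = []
for name in ("singapore", "malaysia"):
    for submission in reddit.subreddit(name).hot(limit=500):
        corpus.append(submission.title)
        if submission.selftext:
            corpus.append(submission.selftext)

print(len(corpus), "raw text snippets collected")
```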

## Training procedure

Initialized with [albert base v2](https://github.com/google-research/albert#albert) vocab and checkpoints (pre-trained weights).

Pre-training was then continued on the training data above with the following hyperparameters (see the sketch after this list):
* train_batch_size: 4096
* max_seq_length: 128
* num_train_steps: 125000
* num_warmup_steps: 5000
* learning_rate: 0.00176
* hardware: TPU v3-8
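
The original run used the google-research/albert TensorFlow code on a TPU v3-8. As a rough, unofficial equivalent for readers who want to continue pre-training with `transformers`, the sketch below runs masked-LM training with the Trainer API. The corpus file path, the per-device batch size, and the plain MLM objective are assumptions on my part; they do not reproduce the TPU setup, the 4096 global batch size, or ALBERT's original sentence-order-prediction objective.

```python
# Unofficial sketch of continued MLM pre-training with Hugging Face transformers.
from transformers import (
    AlbertTokenizer, AlbertForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)
from datasets import load_dataset

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForMaskedLM.from_pretrained("albert-base-v2")

# "corpus.txt" is a placeholder: one colloquial sentence per line.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="singbert-lite-continued",
    per_device_train_batch_size=32,   # far below the 4096 global batch used on TPU
    max_steps=125_000,
    warmup_steps=5_000,
    learning_rate=1.76e-3,
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```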