igorgavi committed
Commit cf6b726
1 Parent(s): 8498227

Update README.md

Files changed (1):
  1. README.md +1 -116
README.md CHANGED
@@ -77,41 +77,7 @@ its implementation and the article from which it originated.
  | mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum)| [(Raffel et al., 2019)](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf?ref=https://githubhelp.com) |


- ## Model variations
-
- With the motivation of increasing the accuracy obtained with the baseline implementation, we implemented a transfer learning
- strategy, under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
- In this context, we considered two approaches:
-
- i) pre-training word embeddings using similar datasets for text classification (a minimal sketch follows below);
- ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
-
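As an illustration of approach (i), the sketch below pre-trains word2vec embeddings with gensim. The corpus file name and the hyperparameters are assumptions for illustration only, not the exact setup used for the MCTI data.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical corpus file: unlabeled MCTI texts plus similar public datasets,
# one document per line, gathered only for embedding pre-training.
with open("pretraining_corpus.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

# Pre-train the embeddings, then save them for later coupling to the classifiers.
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=2, workers=4)
w2v.wv.save("word2vec.kv")
```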
- XXXX was originally released in base and large variations, for cased and uncased input text. The uncased models
- also strip out accent markers. Chinese and multilingual uncased and cased versions followed shortly after.
- Modified preprocessing with whole word masking replaced subpiece masking in a follow-up work, with the release of
- two models.
-
- Another 24 smaller models were released afterward.
-
- The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on the Hugging Face Hub.
-
- | Model                        | #params | Language |
- |------------------------------|---------|----------|
- | [`mcti-base-uncased`]        | 110M    | English  |
- | [`mcti-large-uncased`]       | 340M    | English  |
- | [`mcti-base-cased`]          | 110M    | English  |
- | [`mcti-large-cased`]         | 110M    | Chinese  |
- | [`-base-multilingual-cased`] | 110M    | Multiple |
-
- | Dataset            | Compatibility to base* |
- |--------------------|------------------------|
- | Labeled MCTI       | 100%                   |
- | Full MCTI          | 100%                   |
- | BBC News Articles  | 56.77%                 |
- | New unlabeled MCTI | 75.26%                 |
-
-
- ## Intended uses
+ ## Intended uses & limitations

  You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
  be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
@@ -174,10 +140,6 @@ encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)
  ```

- ### Limitations and bias
-
-
-
  ## Training data

  The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
@@ -189,69 +151,8 @@ headers).

  ### Preprocessing

- The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
- then of the form:
-
- ```
- [CLS] Sentence A [SEP] Sentence B [SEP]
- ```
-
- With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
- the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
- consecutive span of text usually longer than a single sentence. The only constraint is that the two
- "sentences" have a combined length of less than 512 tokens.
-
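For illustration, the sentence-pair layout above can be reproduced with any WordPiece tokenizer; the snippet below uses `bert-base-uncased` purely as a stand-in checkpoint, not necessarily the tokenizer trained for this model.

```python
from transformers import BertTokenizer

# Encode a sentence pair and inspect the resulting [CLS] A [SEP] B [SEP] layout.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'sentence', 'a', '[SEP]', 'sentence', 'b', '[SEP]']
```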
- The details of the masking procedure for each sentence are the following (see the sketch after this list):
- - 15% of the tokens are masked.
- - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- - In the remaining 10% of the cases, the masked tokens are left as is.
-
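The following is a minimal, illustrative sketch of the 80/10/10 rule described above. It is not the original pretraining code, and the vocabulary handling is simplified (for brevity, the random replacement is not forced to differ from the original token).

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    """BERT-style masking: 15% of tokens are selected as prediction targets;
    of those, 80% become [MASK], 10% become a random token, 10% are kept."""
    masked, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)                  # not a prediction target
            continue
        labels.append(tok)                       # original token is the target
        r = random.random()
        if r < 0.8:
            masked.append(mask_token)            # 80% of cases: [MASK]
        elif r < 0.9:
            masked.append(random.choice(vocab))  # 10%: random vocabulary token
        else:
            masked.append(tok)                   # 10%: left unchanged
    return masked, labels
```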
- ### Pretraining
-
- The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
- of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
- used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
- learning rate warmup for 10,000 steps and linear decay of the learning rate after.
-
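As a rough PyTorch equivalent of the optimization setup described above (the original TPU training pipeline is not reproduced here), the optimizer and schedule could be configured as follows; `model` is a placeholder.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # placeholder; stands in for the actual model

# Hyperparameters taken from the description above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,       # linear warmup for the first 10k steps
    num_training_steps=1_000_000,  # then linear decay over the 1M total steps
)
```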
  ## Evaluation results

- ### Model training with Word2Vec embeddings
-
- We now have a pre-trained word2vec embedding model that has already learned meanings relevant to our classification problem.
- We can couple it to our classification models (Fig. 4), realizing transfer learning, and then train the model with the labeled
- data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. Table 1 shows the
- results obtained with the related metrics. With this implementation, we achieved new levels of accuracy, with 86% for the CNN
- architecture and 88% for the LSTM architecture (a minimal sketch of this coupling is shown after Table 1).
-
-
- Table 1: Results from Pre-trained WE + ML models.
-
- | ML Model | Accuracy | F1 Score | Precision | Recall |
- |:--------:|:--------:|:--------:|:---------:|:------:|
- | NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
- | DNN      | 0.7115   | 0.7794   | 0.7255    | 0.8485 |
- | CNN      | 0.8654   | 0.9083   | 0.8486    | 0.9773 |
- | LSTM     | 0.8846   | 0.9139   | 0.9056    | 0.9318 |
-
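A minimal sketch of the embedding-to-classifier coupling described above is given below, using Keras with an LSTM head. The toy `texts` and `labels`, the maximum length, and the `word2vec.kv` file name are illustrative assumptions, not the exact MCTI pipeline.

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow import keras

# Illustrative stand-ins for the labeled MCTI data and the pre-trained vectors.
texts = ["example funding opportunity text", "another unrelated example text"]
labels = np.array([1, 0])
wv = KeyedVectors.load("word2vec.kv")  # assumed file from the pre-training step

tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(texts), maxlen=300)

# Build the embedding matrix from the pre-trained word2vec vectors.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, wv.vector_size))
for word, idx in tokenizer.word_index.items():
    if word in wv:
        embedding_matrix[idx] = wv[word]

# Couple the frozen embeddings to an LSTM classification head (transfer learning).
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, wv.vector_size,
                           weights=[embedding_matrix], trainable=False),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(sequences, labels, epochs=3)
```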
- ### Transformer-based implementation
-
- Another way we used pre-trained vector representations was through a Longformer (Beltagy et al., 2020). We chose it because
- of a limitation of the first generation of transformers and BERT-based architectures: a maximum input size of 512 tokens.
- The reason behind that limitation is that the self-attention mechanism scales quadratically with the input sequence length,
- O(n²) (Beltagy et al., 2020). The Longformer allows the processing of sequences thousands of tokens long without facing the
- memory bottleneck of BERT-like architectures, and it achieved SOTA results on several benchmarks.
-
- For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences
- would have to be truncated and would probably lose some critical information. By comparison, with the Longformer and its maximum
- length of 4096, only eight sentences would have their information shortened.
-
- To apply the Longformer, we used the pre-trained base (available at the link), which had previously been trained on a combination
- of vast datasets, as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our classification
- models, we performed supervised training of the whole model. At this point, only transfer learning was applied, since more
- computational power would be needed to fine-tune the weights. The results with related metrics can be viewed in Table 2.
- This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
-
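A minimal sketch of this feature-extraction setup is shown below. It assumes the publicly available `allenai/longformer-base-4096` checkpoint, since the card does not name the exact pre-trained base used, and the example text is hypothetical.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = LongformerModel.from_pretrained("allenai/longformer-base-4096")
encoder.eval()  # weights kept frozen: transfer learning without fine-tuning

text = "Example research-funding opportunity text ..."
inputs = tokenizer(text, max_length=4096, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# Pooled document representation fed to the downstream NN/DNN/CNN/LSTM heads.
doc_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
```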
 
  Table 2: Results from Pre-trained Longformer + ML models.

@@ -270,22 +171,6 @@ Table 2: Results from Pre-trained Longformer + ML models.
  - >>>
  - >>> ...

-
- ## Config
-
- ## Tokenizer
-
- ## Training data
-
- ## Training procedure
-
- ## Preprocessing
-
- ## Pretraining
-
- ## Evaluation results
- ## Benchmarks
-
  ### BibTeX entry and citation info

  ```bibtex
 