Update README.md
README.md CHANGED
@@ -77,41 +77,7 @@ its implementation and the article from which it originated.
| mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum) | [(Raffel et al., 2019)](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf?ref=https://githubhelp.com) |


-##
-
-With the motivation to increase the accuracy obtained with the baseline implementation, we implemented a transfer-learning
-strategy under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
-In this context, we considered two approaches:
-
-i) pre-training word embeddings using similar datasets for text classification;
-ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
-
-XXXX was originally released in base and large variations, for cased and uncased input text. The uncased models
-also strip out accent markers. Chinese and multilingual uncased and cased versions followed shortly after.
-Modified preprocessing with whole-word masking replaced subpiece masking in a follow-up work, with the release of
-two models.
-
-Another 24 smaller models were released afterward.
-
-The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti).
-
-| Model                        | #params | Language |
-|------------------------------|---------|----------|
-| [`mcti-base-uncased`]        | 110M    | English  |
-| [`mcti-large-uncased`]       | 340M    | English  |
-| [`mcti-base-cased`]          | 110M    | English  |
-| [`mcti-large-cased`]         | 110M    | Chinese  |
-| [`-base-multilingual-cased`] | 110M    | Multiple |
-
-| Dataset            | Compatibility to base* |
-|--------------------|------------------------|
-| Labeled MCTI       | 100%                   |
-| Full MCTI          | 100%                   |
-| BBC News Articles  | 56.77%                 |
-| New unlabeled MCTI | 75.26%                 |
-
-## Intended uses
+## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
@@ -174,10 +140,6 @@ encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

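Beyond feature extraction as above, the raw checkpoint can also be exercised for the masked-language-modeling use described under "Intended uses & limitations"; a minimal sketch with the `fill-mask` pipeline (the checkpoint id below is a placeholder for whichever repository under https://huggingface.co/unb-lamfo-nlp-mcti is actually used):

```python
from transformers import pipeline

# Placeholder checkpoint id; substitute the repository actually published
# under https://huggingface.co/unb-lamfo-nlp-mcti.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Prints the highest-scoring completions for the [MASK] position.
for prediction in unmasker("Research grants are [MASK] to public universities."):
    print(prediction["token_str"], round(prediction["score"], 3))
```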
-### Limitations and bias
-
-
-
## Training data

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
@@ -189,69 +151,8 @@ headers).

### Preprocessing

-The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are
-then of the form:
-
-```
-[CLS] Sentence A [SEP] Sentence B [SEP]
-```
-
-With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
-the other cases, it's another random sentence from the corpus. Note that what is considered a sentence here is a
-consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
-"sentences" has a combined length of less than 512 tokens.
-
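In practice the tokenizer builds this pair format itself; a quick sketch (with a placeholder uncased checkpoint, since this card does not pin one):

```python
from transformers import BertTokenizer

# Placeholder uncased checkpoint standing in for the model's own vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

pair = tokenizer("The ministry opened a new call.", "Proposals are due in March.")
print(tokenizer.decode(pair["input_ids"]))
# -> [CLS] the ministry opened a new call. [SEP] proposals are due in march. [SEP]
```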
-The details of the masking procedure for each sentence are the following:
-- 15% of the tokens are masked.
-- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
-- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
-- In the remaining 10% of cases, the masked tokens are left as is.
-
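The 80/10/10 selection rule above can be written down directly; a small sketch over plain token-id lists (illustrative only, not the original pretraining code):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: 15% of tokens are selected; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    input_ids, labels = list(token_ids), []
    for i, token in enumerate(token_ids):
        if random.random() >= mlm_prob:
            labels.append(-100)          # not selected: ignored by the MLM loss
            continue
        labels.append(token)             # selected: the model must predict the original token
        roll = random.random()
        if roll < 0.8:
            input_ids[i] = mask_id                       # 80%: replace with [MASK]
        elif roll < 0.9:
            input_ids[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token as is
    return input_ids, labels
```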
-### Pretraining
-
-The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
-of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
-used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
-learning rate warmup for 10,000 steps and linear decay of the learning rate after.
-
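Stated as code, that optimization setup corresponds roughly to the following sketch (PyTorch-style, using AdamW as the weight-decay variant of Adam; the step counts come from the description above, while the model name is a placeholder):

```python
import torch
from transformers import BertForMaskedLM, get_linear_schedule_with_warmup

model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # placeholder checkpoint

# Adam with lr 1e-4, betas (0.9, 0.999), weight decay 0.01,
# 10,000 warmup steps, then linear decay over 1,000,000 total steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=10_000,
                                            num_training_steps=1_000_000)
```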
## Evaluation results

-### Model training with Word2Vec embeddings
-
-Now we have a pre-trained word2vec embedding model that has already learned meanings relevant to our classification problem.
-We can couple it to our classification models (Fig. 4), realizing transfer learning, and then train the model with the labeled
-data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training, and Table 1 below shows
-the obtained results with related metrics. With this implementation, we reached new levels of accuracy: 86% for the CNN
-architecture and 88% for the LSTM architecture.
-
-Table 1: Results from Pre-trained WE + ML models.
-
-| ML Model | Accuracy | F1 Score | Precision | Recall |
-|:--------:|:--------:|:--------:|:---------:|:------:|
-| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
-| DNN      | 0.7115   | 0.7794   | 0.7255    | 0.8485 |
-| CNN      | 0.8654   | 0.9083   | 0.8486    | 0.9773 |
-| LSTM     | 0.8846   | 0.9139   | 0.9056    | 0.9318 |
-
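A sketch of that coupling, assuming a gensim Word2Vec model feeding a frozen Keras embedding layer ahead of the LSTM head (the file name, dimensions and head sizes are illustrative, not the exact architecture of Figure 5):

```python
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec

# Hypothetical path to the word2vec model pre-trained on the similar datasets.
w2v = Word2Vec.load("word2vec_mcti.model")
vocab_size, embed_dim = len(w2v.wv), w2v.vector_size

# Copy the pre-trained vectors into an embedding matrix (index 0 reserved for padding).
embedding_matrix = np.zeros((vocab_size + 1, embed_dim))
for word, idx in w2v.wv.key_to_index.items():
    embedding_matrix[idx + 1] = w2v.wv[word]

# Frozen embedding layer + LSTM classifier: only the classifier weights are trained.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size + 1, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False, mask_zero=True),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```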
-### Transformer-based implementation
-
-Another way we used pre-trained vector representations was through a Longformer (Beltagy et al., 2020). We chose it because
-of a limitation of the first generation of transformers and BERT-based architectures involving input size: a maximum of
-512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the
-input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allows processing sequences thousands of tokens long
-without facing the memory bottleneck of BERT-like architectures, and achieved SOTA results in several benchmarks.
-
-For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences
-would have to be truncated and would probably lose some critical information. By comparison, with the Longformer's maximum
-length of 4096, only eight sentences would have their information shortened.
-
-To apply the Longformer, we used the pre-trained base (available on the link), previously trained on a combination
-of vast datasets, as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our classification
-models, we performed supervised training of the whole model. At this point, only transfer learning was applied, since fine-tuning
-the weights would have required more computational power. The results with related metrics can be viewed in Table 2 below.
-This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
-

Table 2: Results from Pre-trained Longformer + ML models.

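A minimal sketch of that setup, with `allenai/longformer-base-4096` kept frozen as a feature extractor and a small trainable head on top (illustrative of the transfer-learning stage only, not the full training script):

```python
import torch
from transformers import AutoTokenizer, LongformerModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Transfer learning only: the Longformer weights stay frozen, so just the head is trained.
for param in encoder.parameters():
    param.requires_grad = False

classifier = torch.nn.Sequential(
    torch.nn.Linear(encoder.config.hidden_size, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),  # binary classification head
)

inputs = tokenizer("Example call for research project proposals...",
                   return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state[:, 0]  # embedding of the <s> token
logit = classifier(hidden)
```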
@@ -270,22 +171,6 @@ Table 2: Results from Pre-trained Longformer + ML models.
- >>>
- >>> ...

-
-## Config
-
-## Tokenizer
-
-## Training data
-
-## Training procedure
-
-## Preprocessing
-
-## Pretraining
-
-## Evaluation results
-## Benchmarks
-
### BibTeX entry and citation info

```bibtex