|
--- |
|
language: en |
|
tags: |
|
- Classification
|
license: apache-2.0 |
|
datasets: |
|
- tensorflow |
|
- numpy |
|
- keras |
|
- pandas |
|
- openpyxl |
|
- gensim
|
- contractions |
|
- nltk |
|
- spacy |
|
thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png |
|
--- |
|
|
|
 |
|
|
|
|
|
# MCTI Text Classification Task (uncased) DRAFT |
|
|
|
Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project. |
|
|
|
The model [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi) is part of the project [Research Financing Product Portfolio (FPP)](https://huggingface.co/unb-lamfo-nlp-mcti) and focuses on the task of text classification, exploring different machine learning strategies to classify a small amount of long, unstructured, and uneven data and find a method with good performance. Pre-training and word-embedding solutions were used to learn word relationships from other datasets with considerable similarity and larger scale. Then, using the acquired resources and the dataset available at the MCTI, transfer learning plus deep learning models were applied to improve the understanding of each sentence.
|
|
|
## Abstract

Compared to the 81% baseline accuracy rate based on the available datasets and the 85% accuracy rate achieved using a Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 93%, according to ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318).
|
|
|
## Model description |
|
|
|
This model addresses the classification of research funding opportunities gathered by the MCTI. The available corpus is small, and the texts are long, unstructured, and uneven, which makes it difficult to learn good embeddings from the labeled data alone. The adopted strategy is therefore transfer learning:

- Word2Vec embeddings are pre-trained on larger, similar text-classification datasets and then coupled to shallow (NN), deep (DNN), convolutional (CNN), and recurrent (LSTM) classifiers;
- alternatively, a pre-trained Longformer is used to produce contextualized embeddings for the same set of classifiers, avoiding the 512-token limit of BERT-like models.

Both variants are then trained in a supervised manner on the labeled MCTI data. The results obtained with each combination are reported in the Evaluation results section.
|
|
|
 |
|
|
|
## Model variations |
|
|
|
With the motivation of increasing the accuracy obtained with the baseline implementation, we adopted a transfer learning strategy, under the assumption that the small amount of data available for training was insufficient for adequate embedding training. In this context, we considered two approaches:

i) pre-training word embeddings using similar datasets for text classification (sketched below);
ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
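As an illustration of approach (i), the sketch below pre-trains Word2Vec embeddings with gensim on an auxiliary corpus. The corpus, hyperparameters, and file names are hypothetical and only indicate the general procedure, not the project's exact training code.

```python
# Hypothetical sketch: pre-training Word2Vec embeddings on a larger, similar
# corpus before transferring them to the MCTI classifiers.
from gensim.models import Word2Vec

# `aux_corpus` stands in for a tokenized auxiliary dataset (e.g. news articles);
# each item is a list of tokens from one document.
aux_corpus = [
    ["funding", "opportunity", "for", "research", "projects"],
    ["call", "for", "proposals", "in", "applied", "science"],
]

w2v = Word2Vec(
    sentences=aux_corpus,
    vector_size=300,   # embedding dimension (assumed)
    window=5,
    min_count=1,
    workers=4,
    epochs=10,
)
w2v.save("word2vec_pretrained.model")  # reused later for transfer learning
```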
|
|
|
XXXX was originally released in base and large variations, for cased and uncased input text. The uncased models also strip accent markers. Chinese and multilingual uncased and cased versions followed shortly after. Modified preprocessing with whole word masking replaced subpiece masking in a follow-up work, with the release of two models.

Another 24 smaller models were released afterward.

The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on the Hugging Face Hub.
|
|
|
#### Table 1:

| Model                        | #params | Language |
|------------------------------|:-------:|:--------:|
| [`mcti-base-uncased`]        |  110M   | English  |
| [`mcti-large-uncased`]       |  340M   | English  |
| [`mcti-base-cased`]          |  110M   | English  |
| [`mcti-large-cased`]         |  110M   | Chinese  |
| [`-base-multilingual-cased`] |  110M   | Multiple |
|
|
|
#### Table 2:

| Dataset            | Compatibility to base* |
|--------------------|:----------------------:|
| Labeled MCTI       |          100%          |
| Full MCTI          |          100%          |
| BBC News Articles  |         56.77%         |
| New unlabeled MCTI |         75.26%         |

|
## Intended uses |
|
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation you should look at a model like XXX.
|
### How to use |
|
You can use this model directly with a pipeline for masked language modeling: |
|
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]
```
|
Here is how to use this model to get the features of a given text in PyTorch: |
|
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
|
and in TensorFlow: |
|
```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
|
### Limitations and bias |
|
This model is uncased: it does not make a difference between english and English.

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:
|
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("The man worked as a [MASK].")
[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
  'score': 0.09747550636529922,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the man worked as a salesman. [SEP]',
  'score': 0.037680890411138535,
  'token': 18968,
  'token_str': 'salesman'}]
>>> unmasker("The woman worked as a [MASK].")
[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
  'score': 0.21981462836265564,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the woman worked as a cook. [SEP]',
  'score': 0.03042375110089779,
  'token': 5660,
  'token_str': 'cook'}]
```
|
This bias will also affect all fine-tuned versions of this model. |
|
## Training data |
|
The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books, and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables, and headers).
|
## Training procedure |
|
### Preprocessing |
|
Pre-processing was used to standardize the texts to the English language, reduce the number of insignificant tokens, and optimize the training of the models.

The following assumptions were considered:

- The data entry base is obtained from the result of Goal 4;
- The labeling (Goal 4) is considered true for accuracy measurement purposes;
- Preprocessing experiments compare accuracy in a shallow neural network (SNN);
- Pre-processing was investigated for the classification goal.
|
From the database obtained in Goal 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com) to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.

Several Python packages were used to develop the preprocessing code:
|
#### Table 3: Python packages used

| Objective | Package |
|-----------|---------|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
| Natural language processing | [nltk](https://pypi.org/project/nltk) |
| Other data manipulations and calculations (alongside the Python 3.10 standard-library modules io, json, math, re, shutil, time, and unicodedata) | [numpy](https://pypi.org/project/numpy) |
| Data manipulation and analysis | [pandas](https://pypi.org/project/pandas) |
| HTTP library | [requests](https://pypi.org/project/requests) |
| Model training | [scikit-learn](https://pypi.org/project/scikit-learn) |
| Machine learning | [tensorflow](https://pypi.org/project/tensorflow) |
| Machine learning | [keras](https://keras.io/) |
| Translation from multiple languages to English | [translators](https://pypi.org/project/translators) |
|
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), the pre-processing code builds and evaluates 8 (eight) different bases, derived from the Goal 4 base, by applying the methods listed in Table 4.
|
#### Table 4: Preprocessing methods evaluated

| id   | Experiments                                                           |
|------|-----------------------------------------------------------------------|
| Base | Original Texts                                                        |
| xp1  | Expand Contractions                                                   |
| xp2  | Expand Contractions + Convert text to lowercase                       |
| xp3  | Expand Contractions + Remove Punctuation                              |
| xp4  | Expand Contractions + Remove Punctuation + Convert text to lowercase  |
| xp5  | xp4 + Stemming                                                        |
| xp6  | xp4 + Lemmatization                                                   |
| xp7  | xp4 + Stemming + Stopwords Removal                                    |
| xp8  | xp4 + Lemmatization + Stopwords Removal                               |
|
First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and evaluation of the first four bases (xp1, xp2, xp3, xp4).

Then, content simplification was evaluated, starting from the xp4 base and considering stemming (xp5), lemmatization (xp6), stemming + stopword removal (xp7), and lemmatization + stopword removal (xp8).
|
All eight bases were evaluated on the task of classifying the eligibility of each opportunity, by training a shallow neural network (SNN); a minimal sketch of such a classifier is given below. The metrics obtained for the eight bases are shown in Table 5.
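The sketch below shows a shallow classifier of the kind used for this comparison; the feature extraction (TF-IDF) and layer sizes are assumptions for illustration, not the project's exact configuration.

```python
# Minimal sketch of a shallow neural network (SNN) used to compare the bases.
# TF-IDF features and layer sizes are assumed; only the overall shape matters.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras

texts = ["example preprocessed opportunity text", "another preprocessed text"]  # placeholders
labels = np.array([1, 0])                                                       # eligibility labels

X = TfidfVectorizer(max_features=5000).fit_transform(texts).toarray()

snn = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(64, activation="relu"),    # single hidden layer -> "shallow"
    keras.layers.Dense(1, activation="sigmoid"),  # binary eligibility output
])
snn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
snn.fit(X, labels, epochs=5, batch_size=2)
```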
|
#### Table 5: Results obtained in Preprocessing

| id   | Experiment                                                            | accuracy | f1-score | recall | precision | Mean time (s) | N_tokens | max_length |
|------|-----------------------------------------------------------------------|----------|----------|--------|-----------|---------------|----------|------------|
| Base | Original Texts                                                        | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772       | 23788    | 5636       |
| xp1  | Expand Contractions                                                   | 88.71%   | 81.59%   | 71.54% | 97.33%    | 414.715       | 23768    | 5636       |
| xp2  | Expand Contractions + Convert text to lowercase                       | 90.32%   | 85.64%   | 77.19% | 97.44%    | 368.375       | 20322    | 5629       |
| xp3  | Expand Contractions + Remove Punctuation                              | 91.94%   | 87.73%   | 79.66% | 98.72%    | 386.650       | 22121    | 4950       |
| xp4  | Expand Contractions + Remove Punctuation + Convert text to lowercase  | 90.86%   | 86.61%   | 80.85% | 94.25%    | 326.830       | 18616    | 4950       |
| xp5  | xp4 + Stemming                                                        | 91.94%   | 87.68%   | 78.47% | 100.00%   | 257.960       | 14319    | 4950       |
| xp6  | xp4 + Lemmatization                                                   | 89.78%   | 85.06%   | 79.66% | 91.87%    | 282.645       | 16194    | 4950       |
| xp7  | xp4 + Stemming + Stopwords Removal                                    | 92.47%   | 88.46%   | 79.66% | 100.00%   | 210.320       | 14212    | 2817       |
| xp8  | xp4 + Lemmatization + Stopwords Removal                               | 92.47%   | 88.46%   | 79.66% | 100.00%   | 225.580       | 16081    | 2726       |
|
Between these two equally accurate options, a choice still had to be made: xp7 has a shorter training time and fewer unique tokens, while xp8 has a smaller maximum sequence length. The criterion used for the choice was the computational cost required to train the vector representation models (word embeddings, sentence embeddings, document embeddings); since the training times are very close, training time did not carry much weight in the analysis.

As a last step, a spreadsheet was generated for the chosen base (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text as sentences and as tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made available on the project's GitHub with the columns opo_pre (text) and opo_pre_tkn (tokenized).
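For reference, the sketch below reproduces an xp8-style pipeline (expand contractions, remove punctuation, convert to lowercase, lemmatize, and remove stopwords) using the contractions and nltk packages listed in Table 3. It is illustrative only, not the project's exact code.

```python
# Illustrative xp8-style preprocessing: expand contractions, strip punctuation,
# lowercase, lemmatize, and remove stopwords.
import re

import contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_xp8(text: str) -> list[str]:
    text = contractions.fix(text)                 # xp1: expand contractions
    text = re.sub(r"[^\w\s]", " ", text).lower()  # xp4: remove punctuation + lowercase
    tokens = word_tokenize(text)
    return [LEMMATIZER.lemmatize(tok) for tok in tokens if tok not in STOPWORDS]

print(preprocess_xp8("We're funding R&D projects, aren't we?"))
```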
|
### Pretraining |
|
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01, learning rate warmup for 10,000 steps, and linear decay of the learning rate after.
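For reference, the optimizer and schedule described above could be reproduced roughly as in the sketch below, using PyTorch and the transformers scheduler helper; this is an assumed equivalent configuration, not the original training script.

```python
# Sketch of the optimizer/schedule described above: Adam, lr=1e-4,
# betas=(0.9, 0.999), weight decay 0.01, 10k warmup steps, linear decay.
import torch
from transformers import BertModel, get_linear_schedule_with_warmup

model = BertModel.from_pretrained("bert-base-uncased")

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,
    num_training_steps=1_000_000,  # one million steps, as stated above
)
```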
|
## Evaluation results |
|
### Model training with Word2Vec embeddings |
|
Now we have a pre-trained model of word2vec embeddings that has already learned relevant meaningsfor our classification problem. |
|
We can couple it to our classification models (Fig. 4), realizing transferlearning and then training the model with the labeled |
|
data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. The Table 3 shows the |
|
obtained results with related metrics. With this implementation, we achieved new levels of accuracy with 86% for the CNN |
|
architecture and 88% for the LSTM architecture. |
|
#### Table 6: Results from Pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
| DNN      | 0.7115   | 0.7794   | 0.7255    | 0.8485 |
| CNN      | 0.8654   | 0.9083   | 0.8486    | 0.9773 |
| LSTM     | 0.8846   | 0.9139   | 0.9056    | 0.9318 |
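A minimal sketch of the coupling referenced above, assuming a gensim Word2Vec model such as the one saved earlier and a Keras LSTM head; the layer sizes and file name are illustrative, not the exact project architecture.

```python
# Sketch: initialize a frozen Keras Embedding layer from pre-trained Word2Vec
# vectors and couple it to an LSTM classifier (transfer learning).
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras

w2v = Word2Vec.load("word2vec_pretrained.model")  # hypothetical path from the earlier sketch
embedding_matrix = np.array(w2v.wv.vectors)
vocab_size, embed_dim = embedding_matrix.shape

model = keras.Sequential([
    keras.layers.Embedding(
        vocab_size,
        embed_dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,  # keep the transferred embeddings frozen
    ),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10)
```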
|
### Transformer-based implementation |
|
Another way we used pre-trained vector representations was through a Longformer (Beltagy et al., 2020). We chose it because of a limitation of the first generation of transformers and BERT-based architectures: a maximum input size of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allows processing sequences of thousands of tokens without facing the memory bottleneck of BERT-like architectures, and it achieved state-of-the-art results on several benchmarks.

For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences would have to be truncated and would probably lose some critical information. By comparison, with the Longformer and a maximum length of 4096, only eight sentences would have their information shortened.

To apply the Longformer, we used the pre-trained base (available on the link), previously trained on a combination of vast datasets, as input to the model, as shown in Figure 5 under "Longformer model training". After coupling it to our classification models, we performed supervised training of the whole model. At this point, only transfer learning was applied, since fine-tuning the pre-trained weights would have required more computational power. The results with the related metrics can be viewed in Table 7, and a minimal sketch of this setup is given after the table. This approach achieved adequate accuracy scores, above 82% in all implemented architectures.
|
#### Table 7: Results from Pre-trained Longformer + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |
| DNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
| CNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
| LSTM     | 0.8269   | 0.8801   | 0.8571    | 0.9091 |
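A hedged sketch of this setup: a pre-trained Longformer (for example, the allenai/longformer-base-4096 checkpoint, with a maximum length of 4096 tokens) used as a frozen feature extractor, coupled to a small classification head that is the only trained part. The checkpoint name, head architecture, and pooling choice are assumptions for illustration.

```python
# Sketch: frozen pre-trained Longformer as a feature extractor plus a small
# trainable classification head (transfer learning without fine-tuning).
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
longformer = LongformerModel.from_pretrained("allenai/longformer-base-4096")
longformer.eval()  # keep the Longformer weights frozen; only the head is trained

head = torch.nn.Sequential(
    torch.nn.Linear(longformer.config.hidden_size, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
    torch.nn.Sigmoid(),
)

text = "Long funding opportunity description ..."  # placeholder document
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    features = longformer(**inputs).last_hidden_state[:, 0, :]  # embedding of the <s> token
eligibility_score = head(features)
```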
|
## Checkpoints |
|
- Examples |
|
- Implementation Notes |
|
- Usage Example |
|
|
## Config |
|
## Tokenizer |
|
## Benchmarks |
|
### BibTeX entry and citation info |
|
```bibtex |
|
@conference{webist22,
  author       = {Carlos Rocha and Marcos Dib and Li Weigang and Andrea Nunes and Allan Faria and Daniel Cajueiro and Maísa {Kely de Melo} and Victor Celestino},
  title        = {Using Transfer Learning To Classify Long Unstructured Texts with Small Amounts of Labeled Data},
  booktitle    = {Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST},
  year         = {2022},
  pages        = {201-213},
  publisher    = {SciTePress},
  organization = {INSTICC},
  doi          = {10.5220/0011527700003318},
  isbn         = {978-989-758-613-2},
  issn         = {2184-3252},
}
|
``` |
|