---
language: en
tags:
- Classification
license: apache-2.0
datasets:
- Opinosis
- ...
- MCTI_data
thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png
---

![MCTIimg](https://antigo.mctic.gov.br/mctic/export/sites/institucional/institucional/entidadesVinculadas/conselhos/pag-old/RODAPE_MCTI.png)


# MCTI Text Classification Task (uncased) DRAFT

Disclaimer:

## According to the abstract

Text classification is a traditional problem in Natural Language Processing (NLP). Most of the state-of-the-art implementations
require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification
and other NLP tasks, but they can only take a limited amount of symbols as input. This is a real case study that explores
different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method
with good performance. The collected data includes texts of financing opportunities that international R&D funding organizations
provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by
the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship
of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the
available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence.
Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a
Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as
a successful case of artificial intelligence in a federal government application.

This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of the Union budget,
supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was introduced in ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318) and first released in
[this repository](https://huggingface.co/unb-lamfo-nlp-mcti). This model is uncased: it does not make a difference between english
and English.

## Model description

Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Maecenas viverra tempus risus non ornare. Donec in vehicula est. Pellentesque vulputate
bibendum cursus. Nunc volutpat vitae neque ut bibendum:

- Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
consectetur adipiscing elit.
- Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
consectetur adipiscing elit.

Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Maecenas viverra tempus risus non ornare. Donec in vehicula est. Pellentesque vulputate
bibendum cursus. Nunc volutpat vitae neque ut bibendum.

![architecture](https://github.com/marcosdib/S2Query/Classification_Architecture_model.png)
60
+
61
+ ## Model variations
62
+
63
+ With the motivation to increase accuracy obtained with baseline implementation, we implemented a transfer learning
64
+ strategy under the assumption that small data available for training was insufficient for adequate embedding training.
65
+ In this context, we considered two approaches:
66
+
67
+ i) pre-training wordembeddings using similar datasets for text classification;
68
+ ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
69
+
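As an illustration of approach (i), the sketch below pre-trains skip-gram Word2Vec embeddings on an auxiliary corpus with gensim. The corpus file name, vector size, and other hyperparameters are assumptions for the example, not necessarily the values used in this work.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical auxiliary corpus (e.g. Opinosis-style texts), one document per line.
with open("auxiliary_corpus.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

# Pre-train embeddings on the larger, similar dataset.
w2v = Word2Vec(
    sentences=sentences,
    vector_size=300,  # embedding dimension (assumed)
    window=5,
    min_count=2,
    sg=1,             # skip-gram
    epochs=10,
)

# Persist the word vectors so they can be transferred into a classifier later.
w2v.wv.save("pretrained_embeddings.kv")
```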
XXXX was originally released in base and large variations, for cased and uncased input text. The uncased models
also strip out accent markers. Chinese and multilingual uncased and cased versions followed shortly after.
Modified preprocessing with whole word masking replaced subpiece masking in a follow-up work, with the release of
two models.

Twenty-four smaller models were released afterward.

The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti).

| Model                        | #params | Language |
|------------------------------|---------|----------|
| [`mcti-base-uncased`]        | 110M    | English  |
| [`mcti-large-uncased`]       | 340M    | English  |
| [`mcti-base-cased`]          | 110M    | English  |
| [`mcti-large-cased`]         | 110M    | Chinese  |
| [`-base-multilingual-cased`] | 110M    | Multiple |

| Dataset            | Compatibility to base* |
|--------------------|------------------------|
| Labeled MCTI       | 100%                   |
| Full MCTI          | 100%                   |
| BBC News Articles  | 56.77%                 |
| New unlabeled MCTI | 75.26%                 |

## Intended uses

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like XXX.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("The man worked as a [MASK].")

[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
  'score': 0.09747550636529922,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the man worked as a waiter. [SEP]',
  'score': 0.0523831807076931,
  'token': 15610,
  'token_str': 'waiter'},
 {'sequence': '[CLS] the man worked as a barber. [SEP]',
  'score': 0.04962705448269844,
  'token': 13362,
  'token_str': 'barber'},
 {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
  'score': 0.03788609802722931,
  'token': 15893,
  'token_str': 'mechanic'},
 {'sequence': '[CLS] the man worked as a salesman. [SEP]',
  'score': 0.037680890411138535,
  'token': 18968,
  'token_str': 'salesman'}]

>>> unmasker("The woman worked as a [MASK].")

[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
  'score': 0.21981462836265564,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
  'score': 0.1597415804862976,
  'token': 13877,
  'token_str': 'waitress'},
 {'sequence': '[CLS] the woman worked as a maid. [SEP]',
  'score': 0.1154729500412941,
  'token': 10850,
  'token_str': 'maid'},
 {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
  'score': 0.037968918681144714,
  'token': 19215,
  'token_str': 'prostitute'},
 {'sequence': '[CLS] the woman worked as a cook. [SEP]',
  'score': 0.03042375110089779,
  'token': 5660,
  'token_str': 'cook'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books, and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
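
For concreteness, this layout can be reproduced with the tokenizer from the usage examples above (a minimal sketch, assuming the `bert-base-uncased` checkpoint):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Passing two texts encodes them as a single [CLS] A [SEP] B [SEP] pair.
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.decode(encoded["input_ids"]))
# [CLS] sentence a [SEP] sentence b [SEP]
```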

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
"sentences" has a combined length of less than 512 tokens.

The details of the masking procedure for each sentence are the following (a sketch of the scheme follows the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.

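A minimal sketch of that 80/10/10 corruption scheme, written against a plain token list and a generic vocabulary rather than the actual pretraining code:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Select ~15% of tokens; of those, 80% become [MASK], 10% become a
    random (different) vocabulary token, and 10% are left unchanged."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None means "not a prediction target"
    for i, token in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        labels[i] = token  # the model must recover the original token
        roll = random.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:
            corrupted[i] = random.choice([t for t in vocab if t != token])
        # else: keep the original token in place
    return corrupted, labels
```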

### Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
learning rate warmup for 10,000 steps and linear decay of the learning rate after.

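In code, that optimization setup corresponds roughly to the following sketch (the stand-in model is hypothetical; PyTorch's `AdamW` and the `transformers` scheduler helper are used here for illustration):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)  # stand-in for the actual pretraining model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),  # beta_1 and beta_2 from the description above
    weight_decay=0.01,
)
# Warm up for 10,000 steps, then decay linearly over the one-million-step run.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,
    num_training_steps=1_000_000,
)
```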

## Evaluation results

### Model training with Word2Vec embeddings

Now we have a pre-trained model of Word2Vec embeddings that has already learned relevant meanings for our classification
problem. We can couple it to our classification models (Fig. 4), realizing transfer learning, and then train the model with
the labeled data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training.
Table 1 shows the obtained results with related metrics. With this implementation, we achieved new levels of accuracy with
86% for the CNN architecture and 88% for the LSTM architecture.

Table 1: Results from Pre-trained WE + ML models.

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
| DNN      | 0.7115   | 0.7794   | 0.7255    | 0.8485 |
| CNN      | 0.8654   | 0.9083   | 0.8486    | 0.9773 |
| LSTM     | 0.8846   | 0.9139   | 0.9056    | 0.9318 |

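For illustration, the sketch below shows one way to couple pre-trained Word2Vec vectors to an LSTM classifier: the vectors are loaded into a frozen Keras embedding layer and only the layers above it are trained. The layer sizes, binary output, and vocabulary handling are assumptions, not the exact configuration used here.

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras import layers, models

# Load the vectors pre-trained earlier (see the Model variations sketch).
wv = KeyedVectors.load("pretrained_embeddings.kv")
embedding_matrix = np.vstack([wv[word] for word in wv.index_to_key])

model = models.Sequential([
    layers.Embedding(
        input_dim=len(wv.index_to_key),
        output_dim=wv.vector_size,
        weights=[embedding_matrix],
        trainable=False,  # transfer learning: keep the pre-trained vectors frozen
    ),
    layers.LSTM(128),                       # recurrent encoder (size assumed)
    layers.Dense(1, activation="sigmoid"),  # e.g. eligible vs. not eligible
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```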

### Transformer-based implementation

Another way we used pre-trained vector representations was by use of a Longformer (Beltagy et al., 2020). We chose it because
of the limitation of the first generation of transformers and BERT-based architectures involving the size of the sentences:
the maximum of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the
input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allowed the processing of sequences thousands of characters
long without facing the memory bottleneck of BERT-like architectures and achieved SOTA in several benchmarks.

For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences
would have to be truncated and would probably miss some critical information. By comparison, with the Longformer, with a maximum
length of 4096, only eight sentences would have their information shortened.

To apply the Longformer, we used the pre-trained base (available on the link) that was previously trained with a combination
of vast datasets as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our
classification models, we realized supervised training of the whole model. At this point, only transfer learning was applied,
since more computational power was needed to realize the fine-tuning of the weights. The results with related metrics can be
viewed in Table 2. This approach achieved adequate accuracy scores, above 82% in all implementation architectures (a sketch of
this coupling follows the table).

Table 2: Results from Pre-trained Longformer + ML models.

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |
| DNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
| CNN      | 0.8462   | 0.8776   | 0.8474    | 0.9123 |
| LSTM     | 0.8269   | 0.8801   | 0.8571    | 0.9091 |

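A minimal sketch of that setup, using the pre-trained `allenai/longformer-base-4096` checkpoint as a frozen feature extractor feeding a small classification head (the checkpoint name, pooling choice, and head size are assumptions for the example):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Transfer learning only: freeze the encoder and train just the head.
for param in encoder.parameters():
    param.requires_grad = False

head = torch.nn.Linear(encoder.config.hidden_size, 1)  # binary head (assumed)

text = "Call for proposals: an international R&D funding opportunity ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
logit = head(hidden[:, 0])  # pool the first-token representation (one common choice)
```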

## Checkpoints
- Examples
- Implementation Notes
- Usage Example
- >>>
- >>> ...


## Config

## Tokenizer

## Training data

## Training procedure

## Preprocessing

## Pretraining

## Evaluation results

## Benchmarks

### BibTeX entry and citation info

```bibtex
@conference{webist22,
  author       = {Carlos Rocha and Marcos Dib and Li Weigang and Andrea Nunes and Allan Faria and Daniel Cajueiro and Maísa {Kely de Melo} and Victor Celestino},
  title        = {Using Transfer Learning To Classify Long Unstructured Texts with Small Amounts of Labeled Data},
  booktitle    = {Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST},
  year         = {2022},
  pages        = {201-213},
  publisher    = {SciTePress},
  organization = {INSTICC},
  doi          = {10.5220/0011527700003318},
  isbn         = {978-989-758-613-2},
  issn         = {2184-3252},
}
```

<a href="https://huggingface.co/exbert/?model=bert-base-uncased">
    <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>