Update README.md
README.md CHANGED
@@ -77,41 +77,7 @@ its implementation and the article from which it originated.
| mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum) | [(Raffel et al., 2019)](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf?ref=https://githubhelp.com) |


-##
-
-With the motivation to increase the accuracy obtained with the baseline implementation, we implemented a transfer-learning
-strategy under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
-In this context, we considered two approaches:
-
-i) pre-training word embeddings using similar datasets for text classification;
-ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
-
-XXXX was originally released in base and large variations, for cased and uncased input text. The uncased models
-also strip out accent markers. Chinese and multilingual uncased and cased versions followed shortly after.
-Modified preprocessing with whole-word masking replaced subpiece masking in a follow-up work, with the release of
-two models.
-
-Another 24 smaller models were released afterward.
-
-The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti).
-
-| Model                        | #params | Language |
-|------------------------------|---------|----------|
-| [`mcti-base-uncased`]        | 110M    | English  |
-| [`mcti-large-uncased`]       | 340M    | English  |
-| [`mcti-base-cased`]          | 110M    | English  |
-| [`mcti-large-cased`]         | 110M    | Chinese  |
-| [`-base-multilingual-cased`] | 110M    | Multiple |
-
-| Dataset            | Compatibility to base* |
-|--------------------|------------------------|
-| Labeled MCTI       | 100%                   |
-| Full MCTI          | 100%                   |
-| BBC News Articles  | 56.77%                 |
-| New unlabeled MCTI | 75.26%                 |
-
-## Intended uses
+## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
@@ -174,10 +140,6 @@ encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

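Beyond feature extraction as above, the raw checkpoint can also be exercised for the masked-language-modeling use described under "Intended uses & limitations"; a minimal sketch with the `fill-mask` pipeline (the checkpoint id below is a placeholder for whichever repository under https://huggingface.co/unb-lamfo-nlp-mcti is actually used):

```python
from transformers import pipeline

# Placeholder checkpoint id; substitute the repository actually published
# under https://huggingface.co/unb-lamfo-nlp-mcti.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Prints the highest-scoring completions for the [MASK] position.
for prediction in unmasker("Research grants are [MASK] to public universities."):
    print(prediction["token_str"], round(prediction["score"], 3))
```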
-### Limitations and bias
-
-
-
## Training data

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
@@ -189,69 +151,8 @@ headers).

### Preprocessing

-The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are
-then of the form:
-
-```
-[CLS] Sentence A [SEP] Sentence B [SEP]
-```
-
-With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
-the other cases, it's another random sentence from the corpus. Note that what is considered a sentence here is a
-consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
-"sentences" has a combined length of less than 512 tokens.
-
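In practice the tokenizer builds this pair format itself; a quick sketch (with a placeholder uncased checkpoint, since this card does not pin one):

```python
from transformers import BertTokenizer

# Placeholder uncased checkpoint standing in for the model's own vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

pair = tokenizer("The ministry opened a new call.", "Proposals are due in March.")
print(tokenizer.decode(pair["input_ids"]))
# -> [CLS] the ministry opened a new call. [SEP] proposals are due in march. [SEP]
```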
-The details of the masking procedure for each sentence are the following:
-- 15% of the tokens are masked.
-- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
-- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
-- In the remaining 10% of cases, the masked tokens are left as is.
-
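The 80/10/10 selection rule above can be written down directly; a small sketch over plain token-id lists (illustrative only, not the original pretraining code):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: 15% of tokens are selected; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    input_ids, labels = list(token_ids), []
    for i, token in enumerate(token_ids):
        if random.random() >= mlm_prob:
            labels.append(-100)          # not selected: ignored by the MLM loss
            continue
        labels.append(token)             # selected: the model must predict the original token
        roll = random.random()
        if roll < 0.8:
            input_ids[i] = mask_id                       # 80%: replace with [MASK]
        elif roll < 0.9:
            input_ids[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token as is
    return input_ids, labels
```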
-### Pretraining
-
-The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
-of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
-used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
-learning rate warmup for 10,000 steps and linear decay of the learning rate after.
-
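Stated as code, that optimization setup corresponds roughly to the following sketch (PyTorch-style, using AdamW as the weight-decay variant of Adam; the step counts come from the description above, while the model name is a placeholder):

```python
import torch
from transformers import BertForMaskedLM, get_linear_schedule_with_warmup

model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # placeholder checkpoint

# Adam with lr 1e-4, betas (0.9, 0.999), weight decay 0.01,
# 10,000 warmup steps, then linear decay over 1,000,000 total steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=10_000,
                                            num_training_steps=1_000_000)
```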
## Evaluation results

-### Model training with Word2Vec embeddings
-
-Now we have a pre-trained word2vec embedding model that has already learned meanings relevant to our classification problem.
-We can couple it to our classification models (Fig. 4), realizing transfer learning, and then train the model with the labeled
-data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training, and Table 1 below shows
-the obtained results with related metrics. With this implementation, we reached new levels of accuracy: 86% for the CNN
-architecture and 88% for the LSTM architecture.
-
-Table 1: Results from Pre-trained WE + ML models.
-
-| ML Model | Accuracy | F1 Score | Precision | Recall |
-|:--------:|:--------:|:--------:|:---------:|:------:|
-| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
-| DNN      | 0.7115   | 0.7794   | 0.7255    | 0.8485 |
-| CNN      | 0.8654   | 0.9083   | 0.8486    | 0.9773 |
-| LSTM     | 0.8846   | 0.9139   | 0.9056    | 0.9318 |
-
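A sketch of that coupling, assuming a gensim Word2Vec model feeding a frozen Keras embedding layer ahead of the LSTM head (the file name, dimensions and head sizes are illustrative, not the exact architecture of Figure 5):

```python
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec

# Hypothetical path to the word2vec model pre-trained on the similar datasets.
w2v = Word2Vec.load("word2vec_mcti.model")
vocab_size, embed_dim = len(w2v.wv), w2v.vector_size

# Copy the pre-trained vectors into an embedding matrix (index 0 reserved for padding).
embedding_matrix = np.zeros((vocab_size + 1, embed_dim))
for word, idx in w2v.wv.key_to_index.items():
    embedding_matrix[idx + 1] = w2v.wv[word]

# Frozen embedding layer + LSTM classifier: only the classifier weights are trained.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size + 1, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False, mask_zero=True),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```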
-### Transformer-based implementation
-
-Another way we used pre-trained vector representations was through a Longformer (Beltagy et al., 2020). We chose it because
-of a limitation of the first generation of transformers and BERT-based architectures involving input size: a maximum of
-512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the
-input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allows processing sequences thousands of tokens long
-without facing the memory bottleneck of BERT-like architectures, and achieved SOTA results in several benchmarks.
-
-For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences
-would have to be truncated and would probably lose some critical information. By comparison, with the Longformer's maximum
-length of 4096, only eight sentences would have their information shortened.
-
-To apply the Longformer, we used the pre-trained base (available on the link), previously trained on a combination
-of vast datasets, as input to the model, as shown in Figure 5 under Longformer model training. After coupling it to our classification
-models, we performed supervised training of the whole model. At this point, only transfer learning was applied, since fine-tuning
-the weights would have required more computational power. The results with related metrics can be viewed in Table 2 below.
-This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
-

Table 2: Results from Pre-trained Longformer + ML models.

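A minimal sketch of that setup, with `allenai/longformer-base-4096` kept frozen as a feature extractor and a small trainable head on top (illustrative of the transfer-learning stage only, not the full training script):

```python
import torch
from transformers import AutoTokenizer, LongformerModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Transfer learning only: the Longformer weights stay frozen, so just the head is trained.
for param in encoder.parameters():
    param.requires_grad = False

classifier = torch.nn.Sequential(
    torch.nn.Linear(encoder.config.hidden_size, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),  # binary classification head
)

inputs = tokenizer("Example call for research project proposals...",
                   return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state[:, 0]  # embedding of the <s> token
logit = classifier(hidden)
```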
@@ -270,22 +171,6 @@ Table 2: Results from Pre-trained Longformer + ML models.
- >>>
- >>> ...

-
-## Config
-
-## Tokenizer
-
-## Training data
-
-## Training procedure
-
-## Preprocessing
-
-## Pretraining
-
-## Evaluation results
-## Benchmarks
-
### BibTeX entry and citation info

```bibtex