igorgavi committed on
Commit 3d341cc
1 Parent(s): cd926e4

Update README.md

Files changed (1)
  1. README.md +25 -85
README.md CHANGED
@@ -20,25 +20,18 @@ thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_mode

  Disclaimer:

- ## According to the abstract,
-
- Text classification is a traditional problem in Natural Language Processing (NLP). Most state-of-the-art implementations
- require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification
- and other NLP tasks, but they can only take a limited number of symbols as input. This is a real case study that explores
- different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method
- with good performance. The collected data includes texts of financing opportunities that international R&D funding organizations
- provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by
- the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship
- of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the
- available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence.
- Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a
- Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as
- a successful case of artificial intelligence in a federal government application.
-
- This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of the Union budget,
- supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was introduced in ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318) and first released in
- [this repository](https://huggingface.co/unb-lamfo-nlp-mcti). This model is uncased: it does not make a difference between english
- and English.

  ## Model description

@@ -65,19 +58,19 @@ methods used for text summarization will be described individually in the following

  ## Methods

- | Method | Kind of ATS | Description | Documentation |
- |:----------------------:|:-----------:|:-----------:|:-------------:|
- | SumyRandom | Extractive | | []() |
- | Sumy Luhn | Extractive | | []() |
- | SumyLsa | Extractive | | []() |
- | SumyLexRank | Extractive | | []() |
- | SumyTextRank | Extractive | | []() |
- | SumySumBasic | Extractive | | []() |
- | SumyKL | Extractive | | []() |
- | SumyReduction | Extractive | | []() |
- | BART-Large CNN | Abstractive | | [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) |
- | Pegasus-XSUM | Abstractive | | [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum) |
- | mT5 Multilingual XLSUM | Abstractive | | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum)|


  ## Model variations
@@ -179,60 +172,7 @@ output = model(encoded_input)

  ### Limitations and bias

- Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
- predictions:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("The man worked as a [MASK].")
-
- [{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
- 'score': 0.09747550636529922,
- 'token': 10533,
- 'token_str': 'carpenter'},
- {'sequence': '[CLS] the man worked as a waiter. [SEP]',
- 'score': 0.0523831807076931,
- 'token': 15610,
- 'token_str': 'waiter'},
- {'sequence': '[CLS] the man worked as a barber. [SEP]',
- 'score': 0.04962705448269844,
- 'token': 13362,
- 'token_str': 'barber'},
- {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
- 'score': 0.03788609802722931,
- 'token': 15893,
- 'token_str': 'mechanic'},
- {'sequence': '[CLS] the man worked as a salesman. [SEP]',
- 'score': 0.037680890411138535,
- 'token': 18968,
- 'token_str': 'salesman'}]
-
- >>> unmasker("The woman worked as a [MASK].")
-
- [{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
- 'score': 0.21981462836265564,
- 'token': 6821,
- 'token_str': 'nurse'},
- {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
- 'score': 0.1597415804862976,
- 'token': 13877,
- 'token_str': 'waitress'},
- {'sequence': '[CLS] the woman worked as a maid. [SEP]',
- 'score': 0.1154729500412941,
- 'token': 10850,
- 'token_str': 'maid'},
- {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
- 'score': 0.037968918681144714,
- 'token': 19215,
- 'token_str': 'prostitute'},
- {'sequence': '[CLS] the woman worked as a cook. [SEP]',
- 'score': 0.03042375110089779,
- 'token': 5660,
- 'token_str': 'cook'}]
- ```

- This bias will also affect all fine-tuned versions of this model.

  ## Training data

  Disclaimer:

+ ## According to the abstract of the literature review,
+
+ We provide a literature review about Automatic Text Summarization systems. We consider a citation-based approach. We start with some popular and well-known
+ papers that we have at hand about each topic we want to cover, and we track the "backward citations" (papers that are cited by the set of papers we
+ knew beforehand) and the "forward citations" (newer papers that cite the set of papers we knew beforehand). In order to organize the different methods, we
+ present the diverse approaches to ATS guided by the mechanisms they use to generate a summary. Besides presenting the methods, we also present an extensive
+ review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical
+ exploration of these methods using the CNN Corpus dataset, which provides golden summaries for extractive and abstractive methods.
+
+ This model was an end result of the above-mentioned literature review paper, from which the best solution was drawn to be applied to the problem of
+ summarizing texts extracted from the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI).
+ It was first released in [this repository](https://huggingface.co/unb-lamfo-nlp-mcti), along with the other models used to address the given problem.

  ## Model description

  ## Methods

+ | Method | Kind of ATS | Documentation | Source Article |
+ |:----------------------:|:-----------:|:-------------:|:--------------:|
+ | SumyRandom | Extractive | [Sumy GitHub](https://github.com/miso-belica/sumy/) | None (picks out random sentences from the source text) |
+ | SumyLuhn | Extractive | Ibid. | (Luhn, 1958) |
+ | SumyLsa | Extractive | Ibid. | [(Steinberger et al., 2004)](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf) |
+ | SumyLexRank | Extractive | Ibid. | (Erkan and Radev, 2004) |
+ | SumyTextRank | Extractive | Ibid. | (Mihalcea and Tarau, 2004) |
+ | SumySumBasic | Extractive | Ibid. | None (often used as a baseline model in the literature) |
+ | SumyKL | Extractive | Ibid. | (Haghighi and Vanderwende, 2009) |
+ | SumyReduction | Extractive | Ibid. | None |
+ | BART-Large CNN | Abstractive | [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) | (Lewis et al., 2019) |
+ | Pegasus-XSUM | Abstractive | [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum) | (Zhang et al., 2020) |
+ | mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum) | (Xue et al., 2020) |

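All of the sumy methods in the table are extractive: they score the sentences of the source text and copy the highest-scoring ones into the summary, differing only in how the score is computed. As a rough illustration of that shared recipe — a toy sketch of Luhn-style frequency scoring, not the actual sumy implementation (the function name and the tiny stopword list here are invented for the example) — the core idea fits in a few lines of Python:

```python
import re
from collections import Counter

def luhn_style_summary(text, num_sentences=2):
    """Toy extractive summarizer: score each sentence by the corpus
    frequency of its non-stopword terms, keep the top-scoring
    sentences, and emit them in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z]+", text.lower())
    # Minimal illustrative stopword list (real systems use much larger ones).
    stopwords = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "for", "on"}
    freq = Counter(w for w in words if w not in stopwords)

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens if t in freq)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Preserve document order among the selected sentences.
    return " ".join(s for s in sentences if s in ranked)
```

The real implementations replace the scoring step (graph centrality for LexRank and TextRank, SVD for LSA, KL divergence for SumyKL) but keep this select-and-copy structure; the abstractive models in the table (BART, Pegasus, mT5) instead generate new sentences that need not appear in the source.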
  ## Model variations
 
  ### Limitations and bias


  ## Training data