Update README.md
- The only language supported by the model is English
- For texts summarized by transformers, the size of the original text is limited by the maximum number of tokens supported by the transformer
- No specific training is done for the application of the model; only pretrained transformers are used (e.g. BART is trained with the CNN Corpus and Pegasus with XSum)
- Result quality depends on the sort of text being summarized: BART, for example, having been trained on a dataset in the news domain, will summarize news better than scientific articles
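The token-limit point above means that long texts must be truncated or split before they are fed to a transformer. A rough sketch of window-splitting (whitespace tokens stand in for the model's subword tokens, and 1024 is used as an illustrative limit, roughly BART's maximum; neither detail is taken from this repository):

```python
def split_into_windows(text, max_tokens=1024):
    """Split text into chunks of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

chunks = split_into_windows("word " * 2500)
print([len(c.split()) for c in chunks])  # [1024, 1024, 452]
```

Each chunk can then be summarized separately and the partial summaries joined; real tokenizers count subwords, so in practice the budget per window must be somewhat smaller than the model maximum.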

### How to use

[...]

### Preprocessing

The preprocessing of the given texts is done by the `_clean_text` method of the `Data` class. It removes (or normalizes) elements that would make the texts difficult to summarize, such as stray jump lines, double quotation marks, and apostrophes.
```python
import re       # standard-library regular expressions
import regex    # third-party module, needed for the \p{Pd} (dash punctuation) class


class Data:
    def _clean_text(self, content):
        if not isinstance(content, str):
            content = str(content)
        # strange jump lines: make sure a space follows every period
        content = re.sub(r"\.", ". ", content)
        # trouble characters: literal "\r\n" sequences left in the text
        content = re.sub(r"\\r\\n", " ", content)
        # clean jump lines (all Unicode line-break characters)
        content = re.sub(r"\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]", " ", content)
        # replace the different Unicode space characters with a plain space
        content = re.sub(r"[\u00A0\u1680\u180e\u2000-\u2009\u200a\u200b\u202f\u205f\u3000]", " ", content)
        # collapse multiple spaces
        content = re.sub(r" +", " ", content)
        # normalize hyphens (any dash punctuation becomes "-")
        content = regex.sub(r"\p{Pd}+", "-", content)
        # normalize single quotations
        content = re.sub(r"[\u02BB\u02BC\u066C\u2018-\u201A\u275B\u275C]", "'", content)
        # normalize double quotations
        content = re.sub(r"[\u201C-\u201E\u2033\u275D\u275E\u301D\u301E]", '"', content)
        # normalize apostrophes
        content = re.sub(r"[\u0027\u02B9\u02BB\u02BC\u02BE\u02C8\u02EE\u0301\u0313\u0315\u055A\u05F3\u07F4\u07F5\u1FBF\u2018\u2019\u2032\uA78C\uFF07]", "'", content)
        return content
```
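As a quick check of what the normalizations above do, here is a standalone snippet applying just the space- and quotation-handling rules (only the standard `re` module; the hyphen rule, which needs the third-party `regex` module, is left out):

```python
import re

s = "\u201CHello\u201D \u2018world\u2019\u00A0it\u2019s fine"
# odd Unicode spaces -> plain space
s = re.sub(r"[\u00A0\u1680\u180e\u2000-\u2009\u200a\u200b\u202f\u205f\u3000]", " ", s)
# curly single quotes -> apostrophe
s = re.sub(r"[\u02BB\u02BC\u066C\u2018-\u201A\u275B\u275C]", "'", s)
# curly double quotes -> straight double quote
s = re.sub(r"[\u201C-\u201E\u2033\u275D\u275E\u301D\u301E]", '"', s)
print(s)  # "Hello" 'world' it's fine
```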

There is also preprocessing done specifically for the dataset provided by MCTI, which adjusts the CSV file so that it can be read properly. Rows in which data is missing, for example, are removed.
```python
with alive_bar(len(texts), title="Processing data") as bar:
    # drop rows with missing fields
    texts = texts.dropna()
    # collapse repeated whitespace in the text and title columns
    texts[text_key] = texts[text_key].apply(lambda x: " ".join(x.split()))
    texts[title_key] = texts[title_key].apply(lambda x: " ".join(x.split()))
```
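A minimal reproduction of the snippet above on a toy DataFrame (the column names `text` and `title` stand in for whatever `text_key` and `title_key` hold, and the `alive_bar` progress wrapper is omitted):

```python
import pandas as pd

texts = pd.DataFrame({
    "text":  ["First  text\n\nwith   gaps", None, "Already clean"],
    "title": ["A   title", "row dropped: text is missing", " Spaced  out "],
})

texts = texts.dropna()                                              # remove rows with missing data
texts["text"] = texts["text"].apply(lambda x: " ".join(x.split()))  # collapse whitespace
texts["title"] = texts["title"].apply(lambda x: " ".join(x.split()))

print(texts["text"].tolist())   # ['First text with gaps', 'Already clean']
print(texts["title"].tolist())  # ['A title', 'Spaced out']
```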

## Datasets

[...]
Table 5: XSum results

![xsum1](https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/tabela-xsum-1.png)
![xsum2](https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/tabela-xsum-2.png)

### BibTeX entry and citation info

```bibtex