Commit ad39c52 by igorgavi
Parent: fd22e34

Update README.md

Files changed (1):
  1. README.md (+43 -10)

README.md CHANGED
@@ -78,8 +78,8 @@ its implementation and the article from which it originated.
 
 - The only language supported by the model is English
 - For texts summarized by transformers, the size of the original text is limited by the maximum number of tokens supported by the transformer
-
-[ASK ARTHUR]
 
 ### How to use
 
@@ -165,7 +165,47 @@ if __name__ == "__main__":
 ```
 
 ### Preprocessing
-[ASK ARTHUR]
 
 ## Datasets
 
@@ -232,13 +272,6 @@ Table 5: XSum results
 ![xsum1](https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/tabela-xsum-1.png)
 ![xsum2](https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/tabela-xsum-2.png)
 
-## Checkpoints
-- Examples
-- Implementation Notes
-- Usage Example
-- >>>
-- >>> ...
-
 ### BibTeX entry and citation info
 
 ```bibtex
 
@@ -78,8 +78,8 @@ its implementation and the article from which it originated.
 
 - The only language supported by the model is English
 - For texts summarized by transformers, the size of the original text is limited by the maximum number of tokens supported by the transformer
+- No specific training is done for this application; only pretrained transformers are used (e.g. BART is trained with the CNN Corpus and Pegasus with XSum)
+- Result quality varies with the sort of text being summarized: BART, for example, having been trained on a news-domain dataset, summarizes news better than it does scientific articles
 
 ### How to use
 
 
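The token-limit caveat above can be illustrated with a minimal, model-agnostic sketch. Note the assumptions: whitespace-separated words stand in for real subword tokens, and `split_for_model` is a hypothetical helper, not part of this repository.

```python
# Hypothetical illustration of the token-limit caveat: transformers accept a
# bounded number of tokens per input, so longer texts must be truncated or
# split into chunks before summarization. Whitespace-separated words stand in
# for real subword tokens here.
def split_for_model(text, max_tokens):
    words = text.split()
    # group words into consecutive chunks of at most max_tokens each
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

chunks = split_for_model("one two three four five six seven eight nine ten", max_tokens=8)
print(chunks)  # → ['one two three four five six seven eight', 'nine ten']
```

In practice each chunk would be summarized separately (or the text simply truncated), since exceeding the model's maximum length raises an error or silently drops input.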
@@ -165,7 +165,47 @@ if __name__ == "__main__":
 ```
 
 ### Preprocessing
+
+Texts are preprocessed by the `_clean_text` method of the `Data` class. It removes (or normalizes) elements that would hinder summarization, such as stray line breaks, curly quotation marks, and apostrophes.
+
+```python
+import re      # stdlib regular expressions
+import regex   # third-party package (pip install regex); needed for \p{Pd}
+
+class Data:
+    def _clean_text(self, content):
+        if not isinstance(content, str):
+            content = str(content)
+        # ensure a space after each period
+        content = re.sub(r"\.", ". ", content)
+        # literal "\r\n" sequences left over in the text
+        content = re.sub(r"\\r\\n", " ", content)
+        # line and paragraph breaks
+        content = re.sub(r"\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]", " ", content)
+        # replace the various Unicode space characters
+        content = re.sub(r"[\u00A0\u1680\u180e\u2000-\u200b\u202f\u205f\u3000]", " ", content)
+        # collapse multiple spaces
+        content = re.sub(r" +", " ", content)
+        # normalize hyphens and dashes
+        content = regex.sub(r"\p{Pd}+", "-", content)
+        # normalize single quotation marks
+        content = re.sub(r"[\u02BB\u02BC\u066C\u2018-\u201A\u275B\u275C]", "'", content)
+        # normalize double quotation marks
+        content = re.sub(r"[\u201C-\u201E\u2033\u275D\u275E\u301D\u301E]", '"', content)
+        # normalize apostrophes
+        content = re.sub(r"[\u0027\u02B9\u02BB\u02BC\u02BE\u02C8\u02EE\u0301\u0313\u0315\u055A\u05F3\u07F4\u07F5\u1FBF\u2018\u2019\u2032\uA78C\uFF07]", "'", content)
+        return content
+```
+
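To see the effect of these normalizations, here is a stdlib-only sketch covering a subset of them. The function name `clean_text_lite` is hypothetical (not part of the repository), and the dash normalization is omitted since it needs the third-party `regex` package.

```python
import re

def clean_text_lite(content):
    # stdlib-only subset of the normalizations applied by Data._clean_text
    content = re.sub(r"[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]", " ", content)  # line breaks
    content = re.sub(r"[\u00A0\u1680\u180e\u2000-\u200b\u202f\u205f\u3000]", " ", content)  # Unicode spaces
    content = re.sub(r"[\u2018\u2019\u2032]", "'", content)   # curly single quotes
    content = re.sub(r"[\u201C-\u201E\u2033]", '"', content)  # curly double quotes
    content = re.sub(r" +", " ", content)                     # collapse repeated spaces
    return content

print(clean_text_lite("\u201CSmart quotes\u201D\u00A0and\nline\u2028breaks"))
# → "Smart quotes" and line breaks
```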
+An additional preprocessing step is applied specifically to the dataset provided by MCTI, adjusting the csv file so that it can be read properly; lines with missing data, for example, are removed.
+
+```python
+with alive_bar(len(texts), title="Processing data") as bar:
+    texts = texts.dropna()
+    texts[text_key] = texts[text_key].apply(lambda x: " ".join(x.split()))
+    texts[title_key] = texts[title_key].apply(lambda x: " ".join(x.split()))
+```
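The snippet above assumes a pandas DataFrame `texts` wrapped in an alive_progress bar. The same two steps (dropping rows with missing data, collapsing runs of whitespace) can be sketched with the standard library alone; the column names below are hypothetical stand-ins for `title_key` and `text_key`.

```python
import csv
import io

# Tiny stand-in for the MCTI csv file; column names are hypothetical.
raw = 'title,text\nCall 1,"First  call\n\ntext"\nCall 2,\n'

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    if not row["title"] or not row["text"]:
        continue  # drop rows with missing data (the dropna() step)
    # collapse all whitespace runs (the " ".join(x.split()) step)
    rows.append({k: " ".join(v.split()) for k, v in row.items()})

print(rows)  # → [{'title': 'Call 1', 'text': 'First call text'}]
```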
 
 ## Datasets
 
 
@@ -232,13 +272,6 @@ Table 5: XSum results
 ![xsum1](https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/tabela-xsum-1.png)
 ![xsum2](https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/tabela-xsum-2.png)
 
 ### BibTeX entry and citation info
 
 ```bibtex