Update README.md
- The only language supported by the model is English
- For texts summarized by transformers, the size of the original text is limited by the maximum number of tokens supported by the transformer
- No specific training is done for the application of the model; only pretrained transformers are used (e.g. BART is trained with the CNN Corpus and Pegasus with XSum)
- Result quality depends on the sort of text being summarized: BART, for example, having been trained on a dataset in the news domain, will summarize news better than scientific articles
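The token-limit point above means that long texts must be truncated or split before they are fed to a transformer. A rough sketch of window-splitting (whitespace tokens stand in for the model's subword tokens, and 1024 is used as an illustrative limit, roughly BART's maximum; neither detail is taken from this repository):

```python
def split_into_windows(text, max_tokens=1024):
    """Split text into chunks of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

chunks = split_into_windows("word " * 2500)
print([len(c.split()) for c in chunks])  # [1024, 1024, 452]
```

Each chunk can then be summarized separately and the partial summaries joined; real tokenizers count subwords, so in practice the budget per window must be somewhat smaller than the model maximum.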

### How to use

[...]

### Preprocessing

The preprocessing of the given texts is done by the `_clean_text` method of the `Data` class. It removes (or normalizes) elements that would make the texts difficult to summarize, such as stray jump lines, double quotation marks, and apostrophes.
```python
import re       # standard-library regular expressions
import regex    # third-party module, needed for the \p{Pd} (dash punctuation) class


class Data:
    def _clean_text(self, content):
        if not isinstance(content, str):
            content = str(content)
        # strange jump lines: make sure a space follows every period
        content = re.sub(r"\.", ". ", content)
        # trouble characters: literal "\r\n" sequences left in the text
        content = re.sub(r"\\r\\n", " ", content)
        # clean jump lines (all Unicode line-break characters)
        content = re.sub(r"\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]", " ", content)
        # replace the different Unicode space characters with a plain space
        content = re.sub(r"[\u00A0\u1680\u180e\u2000-\u2009\u200a\u200b\u202f\u205f\u3000]", " ", content)
        # collapse multiple spaces
        content = re.sub(r" +", " ", content)
        # normalize hyphens (any dash punctuation becomes "-")
        content = regex.sub(r"\p{Pd}+", "-", content)
        # normalize single quotations
        content = re.sub(r"[\u02BB\u02BC\u066C\u2018-\u201A\u275B\u275C]", "'", content)
        # normalize double quotations
        content = re.sub(r"[\u201C-\u201E\u2033\u275D\u275E\u301D\u301E]", '"', content)
        # normalize apostrophes
        content = re.sub(r"[\u0027\u02B9\u02BB\u02BC\u02BE\u02C8\u02EE\u0301\u0313\u0315\u055A\u05F3\u07F4\u07F5\u1FBF\u2018\u2019\u2032\uA78C\uFF07]", "'", content)
        return content
```
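As a quick check of what the normalizations above do, here is a standalone snippet applying just the space- and quotation-handling rules (only the standard `re` module; the hyphen rule, which needs the third-party `regex` module, is left out):

```python
import re

s = "\u201CHello\u201D \u2018world\u2019\u00A0it\u2019s fine"
# odd Unicode spaces -> plain space
s = re.sub(r"[\u00A0\u1680\u180e\u2000-\u2009\u200a\u200b\u202f\u205f\u3000]", " ", s)
# curly single quotes -> apostrophe
s = re.sub(r"[\u02BB\u02BC\u066C\u2018-\u201A\u275B\u275C]", "'", s)
# curly double quotes -> straight double quote
s = re.sub(r"[\u201C-\u201E\u2033\u275D\u275E\u301D\u301E]", '"', s)
print(s)  # "Hello" 'world' it's fine
```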

There is also preprocessing done specifically for the dataset provided by MCTI, which adjusts the CSV file so that it can be read properly. Rows in which data is missing, for example, are removed.
```python
with alive_bar(len(texts), title="Processing data") as bar:
    # drop rows with missing fields
    texts = texts.dropna()
    # collapse repeated whitespace in the text and title columns
    texts[text_key] = texts[text_key].apply(lambda x: " ".join(x.split()))
    texts[title_key] = texts[title_key].apply(lambda x: " ".join(x.split()))
```
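A minimal reproduction of the snippet above on a toy DataFrame (the column names `text` and `title` stand in for whatever `text_key` and `title_key` hold, and the `alive_bar` progress wrapper is omitted):

```python
import pandas as pd

texts = pd.DataFrame({
    "text":  ["First  text\n\nwith   gaps", None, "Already clean"],
    "title": ["A   title", "row dropped: text is missing", " Spaced  out "],
})

texts = texts.dropna()                                              # remove rows with missing data
texts["text"] = texts["text"].apply(lambda x: " ".join(x.split()))  # collapse whitespace
texts["title"] = texts["title"].apply(lambda x: " ".join(x.split()))

print(texts["text"].tolist())   # ['First text with gaps', 'Already clean']
print(texts["title"].tolist())  # ['A title', 'Spaced out']
```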

## Datasets

[...]
Table 5: XSum results

![xsum1](https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/tabela-xsum-1.png)
![xsum2](https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/tabela-xsum-2.png)

### BibTeX entry and citation info

```bibtex