Update README.md
README.md
CHANGED
@@ -16,8 +16,7 @@ I undertook the following preprocessing steps:
 - Language identification with [langid](https://github.com/saffsd/langid.py)
 - Compute the ratio of Latin vocabulary in each sentence (against the digital-born vocab of the corpus)
 - Retain only sentences with a Latin vocabulary ratio of > 85%.
--
-- deduplication of the corpus
+- Exclude all lines containing '^' --> hints at the presence of OCR errors.
 
 The result is a corpus of ~390 million tokens.
 
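For readers who want to see how the two filtering rules in this hunk could combine, here is a minimal Python sketch. It is not the author's script. The function names, the letter-run tokenizer, and the choice to compute the ratio over tokens rather than types are all assumptions. The `latin_vocab` set stands in for the corpus vocabulary the README mentions. The langid language-identification step is left out.

```python
import re

# Assumption: a token is any run of Unicode letters; the README does not
# say how sentences were tokenized.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def latin_ratio(sentence, latin_vocab):
    """Share of tokens in `sentence` that occur in `latin_vocab` (hypothetical helper)."""
    tokens = [t.lower() for t in TOKEN_RE.findall(sentence)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in latin_vocab)
    return hits / len(tokens)


def keep_sentence(sentence, latin_vocab, threshold=0.85):
    """Apply the two README rules: drop lines containing '^', then require ratio > threshold."""
    if "^" in sentence:  # '^' is treated as a hint of OCR errors
        return False
    return latin_ratio(sentence, latin_vocab) > threshold
```

A call such as `keep_sentence(line, latin_vocab)` would then be applied to each line of the corpus. Checking for `'^'` first avoids computing the ratio for lines that will be dropped anyway.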