pstroe/roberta-base-latin-cased2


pstroe committed on Jul 29, 2022

Commit 503d71b • 1 Parent(s): 7729aa8

Update README.md

Files changed (1)

README.md +1 -2

@@ -16,8 +16,7 @@ I undertook the following preprocessing steps:
   - Language identification with [langid](https://github.com/saffsd/langid.py)
   - Compute the ratio of Latin vocabulary in each sentence (against the digital-born vocab of the corpus)
   - Retain only sentences with a Latin vocabulary ratio of > 85%.
-  - Retaining only lines containing letters of the Latin alphabet, numerals, and certain punctuation (--> `grep -P '^[A-z0-9ÄÖÜäöüÆæŒœᵫĀāūōŌ.,;:?!\- Ęę]+$' la.nolorem.tok.txt`
-  - deduplication of the corpus
+  - Exclude all lines containing '^' --> hints at the presence of OCR errors.
 The result is a corpus of ~390 million tokens.
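
The two filtering steps that remain after this commit (the Latin vocabulary ratio threshold and the exclusion of lines containing '^') could be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the whitespace tokenization, the lowercasing, the function names `latin_ratio` and `keep_sentence`, and the toy vocabulary are all assumptions, and the langid step is left out so the example depends only on the standard library.

```python
def latin_ratio(sentence, vocab):
    """Share of whitespace-separated tokens found in the reference vocabulary.

    Tokenization and lowercasing are assumptions made for this sketch.
    """
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


def keep_sentence(sentence, vocab, threshold=0.85):
    """Drop lines containing '^', then require a ratio strictly above threshold."""
    if "^" in sentence:  # treated as a hint of OCR errors
        return False
    return latin_ratio(sentence, vocab) > threshold


# Toy vocabulary, for illustration only (hypothetical, not the corpus vocab).
vocab = {"gallia", "est", "omnis", "divisa", "in", "partes", "tres"}

lines = [
    "Gallia est omnis divisa in partes tres",  # all tokens in vocab
    "Gallia est om^nis divisa",                # contains '^'
    "Gallia est very much divided",            # too few Latin tokens
]
kept = [line for line in lines if keep_sentence(line, vocab)]
```

With this toy input, only the first line passes both checks. In practice the vocabulary would come from the born-digital part of the corpus mentioned in the README, and the threshold comparison is kept strict to match "> 85%".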