license: mit
Speliuk
A more accurate spelling correction for the Ukrainian language.
Motivation
When using a spell checker in systems that perform an automatic spelling correction without human verification, the following questions arise:
- How to avoid false corrections, i.e. cases where a real word that is missing from the vocabulary gets "corrected"? This is especially relevant for fusional languages such as Ukrainian.
- How to find the single best correction for a misspelled word? Many spell checkers rely only on candidate frequency and edit distance, discarding the surrounding context.
To address these issues, we propose a system that is compatible with any spell checker but focuses on precision over recall.
We improve the accuracy of a spell checker by using these complementary models:
- KenLM. The model is used for fast perplexity calculation to find the best candidate for a misspelled word.
- Transformer-based NER pipeline to detect misspelled words.
- SymSpell. As of now, this is the only supported spell checker.
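The interplay of the three components can be sketched roughly as follows. This is a minimal, self-contained illustration and not Speliuk's actual implementation: `correct_text` and the three callables (`detect_errors`, `generate_candidates`, `score_sentence`) are hypothetical stand-ins for the NER model, SymSpell, and KenLM respectively, and the toy data is made up.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def correct_text(
    text: str,
    detect_errors: Callable[[str], List[Span]],
    generate_candidates: Callable[[str], List[str]],
    score_sentence: Callable[[str], float],
) -> str:
    """Replace each detected span with the candidate that yields the lowest score."""
    # Process spans right to left so earlier offsets stay valid after a replacement.
    for start, end in sorted(detect_errors(text), reverse=True):
        candidates = generate_candidates(text[start:end])
        if not candidates:
            continue  # precision over recall: leave the word untouched
        best = min(
            candidates,
            key=lambda cand: score_sentence(text[:start] + cand + text[end:]),
        )
        text = text[:start] + best + text[end:]
    return text


# Toy stand-ins (hypothetical), only here to make the sketch runnable.
def toy_detect(text: str) -> List[Span]:
    start = text.find("моее")
    return [(start, start + 4)] if start != -1 else []


def toy_candidates(word: str) -> List[str]:
    return {"моее": ["моє", "може", "море"]}.get(word, [])


def toy_score(sentence: str) -> float:
    # Lower is better, like perplexity; pretend the model prefers "він може".
    return 1.0 if "він може" in sentence else 5.0


result = correct_text("то він моее це", toy_detect, toy_candidates, toy_score)
print(result)
```

Only words flagged by the detector are ever sent to the spell checker, which is what keeps valid out-of-vocabulary words from being rewritten in this sketch.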
Installation
- For CPU-only inference, install the CPU version of PyTorch.
- Make sure you can compile Python extension modules (required for KenLM). On Debian or Ubuntu, you can install the required Python development headers like this:
sudo apt-get install python3-dev
- Install Speliuk:
pip install speliuk
Usage
By default, Speliuk will use pre-trained models stored on Hugging Face.
>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
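The `start` and `end` fields of each annotation are character offsets into the input text. The sketch below shows how such annotations map back onto the source string. It uses a local stand-in dataclass that only mirrors the field names printed above (it is not the library's own class), and `apply_annotations` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:  # stand-in mirroring the fields shown in the output above
    start: int
    end: int
    source_text: str
    suggestions: List[str]
    meta: dict = field(default_factory=dict)


def apply_annotations(text: str, annotations: List[Annotation]) -> str:
    """Rebuild the corrected text using the first suggestion of each annotation."""
    parts, cursor = [], 0
    for ann in sorted(annotations, key=lambda a: a.start):
        # The offsets index the original input, not the corrected text.
        assert text[ann.start:ann.end] == ann.source_text
        parts.append(text[cursor:ann.start])
        parts.append(ann.suggestions[0] if ann.suggestions else ann.source_text)
        cursor = ann.end
    parts.append(text[cursor:])
    return "".join(parts)


text = "то він моее це зраабити для меніе?"
annotations = [
    Annotation(7, 11, "моее", ["може"]),
    Annotation(15, 23, "зраабити", ["зробити"]),
    Annotation(28, 33, "меніе", ["мене"]),
]
corrected = apply_annotations(text, annotations)
print(corrected)
```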
Speliuk can also be used directly from a spaCy model:
>>> import spacy
>>> from speliuk.correct import CorrectionPipe
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]
Training Details
Spelling Error Detection
To detect spelling errors, a spaCy NER model is used.
It was trained on a combination of synthetic and golden data:
- For synthetic data generation, we used UberText as the source of base texts and nlpaug for error generation. In total, 10k samples from different categories were used.
- For golden data, we used spelling errors from the UA-GEC corpus.
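To train an NER-style detector on synthetic data, each corrupted text needs the character spans of the injected errors. The following sketch illustrates the idea using only the standard library. It is not the nlpaug-based procedure used for the released model: `corrupt_word`, `corrupt_text`, the set of edit operations, and the `error_rate` parameter are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit; retry until the word actually changes."""
    while True:
        pos = rng.randrange(len(word))
        op = rng.choice(["insert", "delete", "substitute", "swap"])
        if op == "insert":
            noisy = word[:pos] + rng.choice(ALPHABET) + word[pos:]
        elif op == "delete" and len(word) > 1:
            noisy = word[:pos] + word[pos + 1:]
        elif op == "swap" and pos + 1 < len(word):
            noisy = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
        else:  # substitute (also the fallback when delete or swap is not possible)
            noisy = word[:pos] + rng.choice(ALPHABET) + word[pos + 1:]
        if noisy != word:
            return noisy


def corrupt_text(
    text: str, error_rate: float, seed: int = 0
) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the noisy text and the (start, end) spans of corrupted words in it."""
    rng = random.Random(seed)
    out, spans, offset = [], [], 0
    for token in text.split(" "):
        if token.isalpha() and rng.random() < error_rate:
            token = corrupt_word(token, rng)
            spans.append((offset, offset + len(token)))
        out.append(token)
        offset += len(token) + 1  # +1 for the joining space
    return " ".join(out), spans


noisy, spans = corrupt_text("він може це зробити", error_rate=0.5, seed=42)
print(noisy, spans)
```

The returned spans refer to the noisy text, so they can be used directly as entity offsets when building training examples.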
Perplexity Calculation
We used KenLM for fast perplexity calculation, relying on the existing Yehor/kenlm-uk model trained on UberText.
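As a rough illustration of how perplexity can rank candidates, the sketch below computes perplexity from log10 token probabilities (the unit KenLM reports) and picks the candidate whose sentence scores lowest. The toy unigram table, the `UNKNOWN_LOG10` penalty, and the `perplexity` and `pick_best` helpers are hypothetical; a real n-gram model conditions each word on its preceding context.

```python
from typing import Dict, List

# Hypothetical log10 probabilities of a toy unigram model.
TOY_LOG10: Dict[str, float] = {"він": -1.0, "може": -1.5, "це": -1.0, "</s>": -0.5}
UNKNOWN_LOG10 = -6.0  # heavy penalty for out-of-vocabulary tokens


def perplexity(sentence: str) -> float:
    """10 ** (-total_log10_prob / token_count), counting the end-of-sentence token."""
    tokens: List[str] = sentence.split() + ["</s>"]
    total = sum(TOY_LOG10.get(tok, UNKNOWN_LOG10) for tok in tokens)
    return 10.0 ** (-total / len(tokens))


def pick_best(candidates: List[str], left: str, right: str) -> str:
    """Return the candidate whose full sentence has the lowest perplexity."""
    return min(candidates, key=lambda c: perplexity(f"{left} {c} {right}".strip()))


best = pick_best(["може", "моее"], left="він", right="це")
print(best)
```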
Spell Checker
We used SymSpell for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
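For intuition, the symmetric-delete idea behind SymSpell can be sketched in a few lines: deletions of dictionary words are indexed once, and a lookup only generates deletions of the query. This is a simplified toy version and not the SymSpell library itself. The frequency counts are made up, and plain Levenshtein distance is used for verification here, which is an assumption of this sketch.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set

MAX_DISTANCE = 2


def deletes(word: str, max_distance: int = MAX_DISTANCE) -> Set[str]:
    """All strings obtained by deleting up to max_distance characters from word."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for kept in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in kept))
    return variants


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(frequencies: Dict[str, int]) -> Dict[str, Set[str]]:
    """Map every delete variant to the dictionary words that produce it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word in frequencies:
        for variant in deletes(word):
            index[variant].add(word)
    return index


def lookup(word: str, frequencies: Dict[str, int], index: Dict[str, Set[str]]) -> List[str]:
    """Candidates within MAX_DISTANCE, sorted by distance, then by descending frequency."""
    found: Set[str] = set()
    for variant in deletes(word):
        found |= index.get(variant, set())
    scored = [(levenshtein(word, cand), -frequencies[cand], cand) for cand in found]
    return [cand for dist, _, cand in sorted(scored) if dist <= MAX_DISTANCE]


FREQUENCIES = {"може": 900, "моє": 400, "море": 150, "зробити": 300}  # made-up counts
INDEX = build_index(FREQUENCIES)
suggestions = lookup("моее", FREQUENCIES, INDEX)
print(suggestions)
```

Ranking by distance and frequency alone cannot tell which of several close candidates fits the sentence, which is why the final choice is left to the language model.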