metadata

license: mit

Speliuk

A more accurate spelling correction for the Ukrainian language.

Motivation

When using a spell checker in systems that perform an automatic spelling correction without human verification, the following questions arise:

How to avoid false corrections, i.e. cases where a valid word that is missing from the vocabulary gets "corrected"? This is especially relevant for fusional languages such as Ukrainian, where a single lemma has many inflected forms.
How to find the single best correction for a misspelled word? Many spell checkers rank candidates by frequency and edit distance alone, ignoring the surrounding context.

To address these issues, we propose a system that is compatible with any spell checker but focuses on precision over recall.
We improve the accuracy of a spell checker by combining these complementary models:

KenLM. The model is used for fast perplexity calculation to find the best candidate for a misspelled word.
A Transformer-based NER pipeline. It is used to detect misspelled words.
SymSpell. As of now, this is the only supported spell checker.
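
How these three components might fit together can be sketched as follows. This is a minimal illustration, not Speliuk's actual implementation: `detect`, `suggest`, and `score` are hypothetical callables standing in for the NER pipeline, SymSpell, and KenLM respectively, and the toy functions at the bottom (including their candidate lists and "known bigrams") exist only to make the sketch runnable without any models.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # character offsets [start, end)


def correct_text(
    text: str,
    detect: Callable[[str], List[Span]],
    suggest: Callable[[str], List[str]],
    score: Callable[[str], float],
) -> str:
    """Replace each detected span with the candidate whose sentence scores lowest.

    All three callables are hypothetical stand-ins:
      detect  -> spans flagged as misspelled (the NER step)
      suggest -> candidate corrections for one word (the spell checker step)
      score   -> perplexity-like cost of a full sentence, lower is better
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(detect(text), reverse=True):
        candidates = suggest(text[start:end])
        if not candidates:
            continue  # precision over recall: leave the word untouched
        best = min(candidates, key=lambda c: score(text[:start] + c + text[end:]))
        text = text[:start] + best + text[end:]
    return text


# Toy stand-ins (assumptions for illustration only).
KNOWN_BIGRAMS = {("він", "може"), ("може", "це")}


def toy_detect(text: str) -> List[Span]:
    return [(4, 8)]  # pretend the detector flagged "моее"


def toy_suggest(word: str) -> List[str]:
    return ["має", "може"]  # pretend the spell checker proposed these


def toy_score(sentence: str) -> float:
    words = sentence.split()
    # Count adjacent word pairs the toy "language model" has never seen.
    return sum((a, b) not in KNOWN_BIGRAMS for a, b in zip(words, words[1:]))


print(correct_text("він моее це", toy_detect, toy_suggest, toy_score))
```

Because only spans flagged by the detector are ever passed to the spell checker, valid words that are missing from the dictionary are left alone, which is the precision-oriented behaviour described above.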

Installation

For CPU-only inference, install the CPU version of PyTorch.
Make sure you can compile Python extension modules (required for KenLM). On Debian-based Linux distributions, the Python development headers can be installed like this (on recent releases the package is named python3-dev):

sudo apt-get install python-dev

Install Speliuk:

pip install speliuk

Usage

By default, Speliuk will use pre-trained models stored on Hugging Face.

>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
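
In the output above, the `start` and `end` values of each annotation appear to be character offsets into the input string. The sketch below illustrates this by rebuilding the corrected text from the annotations. `Annotation` here is a hypothetical stand-in dataclass that mirrors the printed fields, not the class exported by Speliuk, and `apply_annotations` is an illustrative helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Stand-in mirroring the fields printed above (not Speliuk's own class)."""
    start: int
    end: int
    source_text: str
    suggestions: List[str]
    meta: Dict = field(default_factory=dict)


def apply_annotations(text: str, annotations: List[Annotation]) -> str:
    """Rebuild the corrected text using the first suggestion of each annotation."""
    # Go from the last annotation to the first so earlier offsets are not shifted.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        # Offsets are half-open: text[start:end] is the misspelled word.
        assert text[ann.start:ann.end] == ann.source_text
        if ann.suggestions:
            text = text[:ann.start] + ann.suggestions[0] + text[ann.end:]
    return text


source = "то він моее це зраабити для меніе?"
annotations = [
    Annotation(7, 11, "моее", ["може"]),
    Annotation(15, 23, "зраабити", ["зробити"]),
    Annotation(28, 33, "меніе", ["мене"]),
]
print(apply_annotations(source, annotations))
```

The same offsets can be used to highlight errors in a user interface instead of replacing them automatically.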

Speliuk can also be used as a component in a spaCy pipeline:

>>> import spacy
>>> from speliuk.correct import CorrectionPipe
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]

Training Details

Spelling Error Detection

To detect spelling errors, a spaCy NER model is used.

It was trained on a combination of synthetic and golden data:

For synthetic data generation, we used UberText as the source of base texts and nlpaug for error generation. In total, 10k samples from different categories were used.
For golden data, we used spelling errors from the UA-GEC corpus.
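
To make the synthetic part more concrete, the sketch below corrupts a clean sentence and records the character spans of the corrupted words, which is the kind of label a NER model needs. It rests on assumptions: the README states that nlpaug was used, whereas the hypothetical `corrupt_word` and `make_sample` helpers here use plain Python with three simple edit operations, so the example has no dependencies and does not reproduce the actual augmentation settings.

```python
import random
from typing import List, Tuple

ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random character edit: delete, duplicate, or substitute."""
    i = rng.randrange(len(word))
    op = rng.choice(["delete", "duplicate", "substitute"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "duplicate":
        return word[:i] + word[i] + word[i:]
    # Substitute with a different letter so the word always changes.
    replacement = rng.choice([c for c in ALPHABET if c != word[i]])
    return word[:i] + replacement + word[i + 1:]


def make_sample(
    text: str, error_rate: float, seed: int = 0
) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the corrupted text and the [start, end) spans of corrupted words."""
    rng = random.Random(seed)
    out_words, spans, pos = [], [], 0
    for word in text.split():
        # Skip very short or non-alphabetic tokens to keep the edits simple.
        if len(word) >= 3 and word.isalpha() and rng.random() < error_rate:
            word = corrupt_word(word, rng)
            spans.append((pos, pos + len(word)))
        out_words.append(word)
        pos += len(word) + 1  # +1 for the joining space
    return " ".join(out_words), spans


noisy, spans = make_sample("він може це зробити для мене", error_rate=0.5)
print(noisy, [noisy[s:e] for s, e in spans])
```

A random edit can occasionally produce another valid word, so a real pipeline would likely need to filter such samples.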

Perplexity Calculation

We used KenLM for fast perplexity calculation, relying on the existing Yehor/kenlm-uk model, which was trained on UberText.
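
For reference, perplexity can be derived from the total log10 probability of a sentence and the number of scored tokens. The sketch below shows only this arithmetic. It assumes the language model reports base-10 log probabilities, as n-gram toolkits such as KenLM typically do; the helper name `perplexity_from_log10` and the numbers are illustrative and are not taken from the Yehor/kenlm-uk model.

```python
def perplexity_from_log10(total_log10_prob: float, n_tokens: int) -> float:
    """Perplexity = 10 ** (-log10 P(sentence) / N), where N counts scored tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 10.0 ** (-total_log10_prob / n_tokens)


# Illustrative numbers only: two candidate sentences with 4 scored tokens each.
likely = perplexity_from_log10(-8.0, 4)     # 10 ** 2
unlikely = perplexity_from_log10(-12.0, 4)  # 10 ** 3
print(likely < unlikely)  # the lower-perplexity candidate would be preferred
```

When ranking corrections for the same position, the candidate that yields the lowest sentence perplexity is the natural choice.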

Spell Checker

We used SymSpell for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
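
A frequency dictionary of this kind could be built roughly as follows. This is a sketch under assumptions: the hypothetical `build_frequency_dictionary` helper uses naive whitespace tokenization on a tiny in-memory corpus and emits `word count` lines, a plain-text layout commonly used for SymSpell dictionaries. The actual preprocessing applied to UberText is not described in this README.

```python
from collections import Counter
from typing import Iterable, List


def build_frequency_dictionary(lines: Iterable[str], top_n: int) -> List[str]:
    """Return 'word count' entries for the top_n most frequent words."""
    counts: Counter = Counter()
    for line in lines:
        # Naive tokenization: lowercase, split on whitespace, strip punctuation.
        for token in line.lower().split():
            token = token.strip(".,!?;:\"'()«»")
            if token.isalpha():  # drops numbers and words with apostrophes
                counts[token] += 1
    return [f"{word} {count}" for word, count in counts.most_common(top_n)]


corpus = ["Він може це зробити.", "Це може зробити він?", "Це для мене."]
for entry in build_frequency_dictionary(corpus, top_n=3):
    print(entry)
```

For the real dictionary, `top_n` would be 500,000 and the corpus would be streamed from disk rather than held in a list.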