Khmer + English Spell Checker

A bilingual spell checker for Khmer and English built on SymSpell, a dictionary-based algorithm that corrects words by edit distance. The Khmer word list comes from the Royal Academy of Cambodia Khmer Dictionary 2022.


Model Details

| Item | Details |
|---|---|
| Languages | Khmer (km), English (en) |
| Approach | Dictionary-based (SymSpell) |
| Khmer dictionary | seanghay/khmer-dictionary-44k (44,700 words) |
| English dictionary | SymSpell built-in (82,765 words + bigrams) |
| Max edit distance | 2 |
| Mixed text support | Yes (auto-detects and splits segments) |
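
To make the "max edit distance 2" setting concrete, here is a minimal pure-Python sketch of the optimal string alignment distance (a Damerau-Levenshtein variant). It is only an illustration: the function name `osa_distance` is hypothetical, and the assumption that symspellpy compares words with this kind of metric by default should be checked against its documentation. The distance is counted per Unicode code point, so a missing Khmer vowel sign counts as one edit.

```python
def osa_distance(a, b):
    """Count the insertions, deletions, substitutions and adjacent
    transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ie" <-> "ei"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance('speling', 'spelling'))  # 1 (one missing letter)
print(osa_distance('recieve', 'receive'))   # 1 (one transposition)
print(osa_distance('ភាស', 'ភាសា'))          # 1 (one missing vowel sign)
```

A candidate is only considered when this distance to the input is at most the configured maximum, here 2.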

Files in This Repository

| File | Description |
|---|---|
| symspell_english.pkl | Serialized English SymSpell model |
| symspell_khmer.pkl | Serialized Khmer SymSpell model |
| khmer_frequency_dict.txt | Khmer word frequency list (plain text, 44k words) |
| README.md | This file |

Installation

pip install symspellpy huggingface_hub

Load Models from HuggingFace

from huggingface_hub import hf_hub_download
import pickle

REPO_ID = 'phonsobon/khmer-english-spellchecker'

en_path = hf_hub_download(repo_id=REPO_ID, filename='symspell_english.pkl')
km_path = hf_hub_download(repo_id=REPO_ID, filename='symspell_khmer.pkl')

with open(en_path, 'rb') as f:
    sym_spell_en = pickle.load(f)

with open(km_path, 'rb') as f:
    sym_spell_km = pickle.load(f)

print(f'English words : {sym_spell_en.word_count:,}')
print(f'Khmer words   : {sym_spell_km.word_count:,}')

Usage

English Spell Check

from symspellpy import Verbosity

# Single word
suggestions = sym_spell_en.lookup('speling', Verbosity.CLOSEST, max_edit_distance=2)
print(suggestions[0].term)
# Output: spelling

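# Full sentence
# lookup_compound lowercases its result unless transfer_casing=True is passed
suggestions = sym_spell_en.lookup_compound('I havv a problm', max_edit_distance=2, transfer_casing=True)
print(suggestions[0].term)
# Output: I have a problem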

Khmer Spell Check

from symspellpy import Verbosity

# Single word
suggestions = sym_spell_km.lookup('ភាសា', Verbosity.CLOSEST, max_edit_distance=2)
print(suggestions[0].term)
# Output: ភាសា

# With distance info
for s in suggestions:
    print(f'  word: {s.term}  distance: {s.distance}  freq: {s.count}')
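
The examples above index `suggestions[0]` directly, which raises `IndexError` when no dictionary word lies within the edit distance. A small guard like the following sketch avoids that. The names `best_term` and `FakeSuggestion` are hypothetical; `FakeSuggestion` only stands in for the suggestion objects so the snippet runs without the models.

```python
from collections import namedtuple

def best_term(suggestions, fallback):
    """Return the top suggestion's term, or the fallback if the list is empty."""
    return suggestions[0].term if suggestions else fallback

# Stand-in for a suggestion object exposing term, distance and count.
FakeSuggestion = namedtuple('FakeSuggestion', ['term', 'distance', 'count'])

print(best_term([FakeSuggestion('spelling', 1, 1000)], 'speling'))  # spelling
print(best_term([], 'xqzvw'))                                       # xqzvw
```

With the real models this would be called as `best_term(sym_spell_km.lookup(word, Verbosity.CLOSEST, max_edit_distance=2), word)`. symspellpy's `lookup` also accepts `include_unknown=True`, which should return the input term itself when nothing is found.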

Quick Test

from symspellpy import Verbosity

def test_spell_checker(sym_spell_en, sym_spell_km):
    print('=== English Tests ===')
    english_tests = [
        ('speling',    'spelling'),
        ('beautifull', 'beautiful'),
        ('recieve',    'receive'),
        ('writting',   'writing'),
        ('tomorow',    'tomorrow'),
    ]
    en_pass = 0
    for wrong, expected in english_tests:
        result = sym_spell_en.lookup(wrong, Verbosity.CLOSEST, max_edit_distance=2)
        got = result[0].term if result else ''
        status = 'PASS' if got == expected else 'FAIL'
        en_pass += int(got == expected)
        print(f'  [{status}] "{wrong}" => "{got}" (expected: "{expected}")')

    print('\n=== Khmer Tests ===')
    khmer_tests = [
        ('ភាសា', 'ភាសា'),
        ('ការងារ', 'ការងារ'),
        ('ប្រទេស', 'ប្រទេស'),
        ('អរគុណ', 'អរគុណ'),
        ('ភ្នំពេញ', 'ភ្នំពេញ'),
    ]
    km_pass = 0
    for word, expected in khmer_tests:
        result = sym_spell_km.lookup(word, Verbosity.CLOSEST, max_edit_distance=2)
        got = result[0].term if result else ''
        status = 'PASS' if got == expected else 'FAIL'
        km_pass += int(got == expected)
        print(f'  [{status}] "{word}" => "{got}" (expected: "{expected}")')

    total = len(english_tests) + len(khmer_tests)
    passed = en_pass + km_pass
    print(f'\nResult: {passed}/{total} passed ({passed/total*100:.1f}%)')
    print(f'  English : {en_pass}/{len(english_tests)}')
    print(f'  Khmer   : {km_pass}/{len(khmer_tests)}')

test_spell_checker(sym_spell_en, sym_spell_km)

Language Detection

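The checker detects the language from the share of characters that fall in the Khmer Unicode block (U+1780 to U+17FF). A share above 0.6 is treated as Khmer, above 0.1 as mixed, and anything lower as English: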

import re

KHMER_PATTERN = re.compile(r'[\u1780-\u17FF]')

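def detect_language(text):
    # Count every code point in the Khmer block. This includes vowel signs and
    # the coeng, which str.isalpha() does not report as letters.
    khmer_chars = len(KHMER_PATTERN.findall(text))
    # Count letters outside the Khmer block (Latin and any other script).
    other_chars = sum(1 for c in text if c.isalpha() and not KHMER_PATTERN.match(c))
    total = khmer_chars + other_chars
    if total == 0:
        return 'unknown'
    # Dividing by the combined count keeps the ratio between 0 and 1.
    ratio = khmer_chars / total
    if ratio > 0.6:
        return 'km'
    elif ratio > 0.1:
        return 'mixed'
    else:
        return 'en'

print(detect_language('Hello world'))                      # en
print(detect_language('ខ្ញុំស្រឡាញ់ប្រទេសកម្ពុជា'))            # km
print(detect_language('I love Cambodia (ប្រទេសកម្ពុជា)'))    # mixed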
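
The details table mentions that mixed text is split into segments, but the splitting code is not shown here. The following is a minimal sketch of one way to do it, not necessarily how this checker does it. The names `SEGMENT_PATTERN` and `split_segments` are hypothetical. It groups consecutive Khmer-block characters into one segment and everything else into another, so each segment can be passed to the matching model.

```python
import re

# A run of Khmer-block characters, or a run of anything else.
SEGMENT_PATTERN = re.compile(
    r'(?P<km>[\u1780-\u17FF]+)|(?P<other>[^\u1780-\u17FF]+)'
)

def split_segments(text):
    """Split text into (language, segment) pairs in their original order."""
    segments = []
    for match in SEGMENT_PATTERN.finditer(text):
        lang = 'km' if match.group('km') else 'other'
        segments.append((lang, match.group(0)))
    return segments

print(split_segments('I love ប្រទេសកម្ពុជា'))
# [('other', 'I love '), ('km', 'ប្រទេសកម្ពុជា')]
```

Segments labeled `other` keep their spaces and punctuation, so they would still need to be split into words before a per-word `lookup`.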

Khmer Dictionary Source

The Khmer dictionary was built from seanghay/khmer-dictionary-44k, which was extracted from the Royal Academy of Cambodia Khmer Dictionary 2022. This model is intended for research purposes only.


License

MIT
