Khmer + English Spell Checker

A bilingual spell checker for Khmer and English built on SymSpell (dictionary-based edit-distance correction). The Khmer word list is derived from the Royal Academy of Cambodia Khmer Dictionary 2022.

Model Details

Item	Details
Languages	Khmer (km), English (en)
Approach	Dictionary-based (SymSpell)
Khmer dictionary	seanghay/khmer-dictionary-44k (44,700 words)
English dictionary	SymSpell built-in (82,765 words + bigrams)
Max edit distance	2
Mixed text support	Yes (auto-detects and splits segments)
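
To make the "Max edit distance 2" setting concrete, the sketch below computes the optimal string alignment (restricted Damerau-Levenshtein) distance, a variant of the kind SymSpell implementations commonly use. It is an illustrative reimplementation, not symspellpy's own code, and `osa_distance` is a hypothetical helper name. Distances are counted per Unicode code point, which matters for Khmer: dropping one subscript consonant removes two code points (the coeng sign plus the consonant) and so already uses the full budget of 2.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ie" <-> "ei"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance('speling', 'spelling'))  # 1 (one missing letter)
print(osa_distance('recieve', 'receive'))   # 1 (adjacent letters swapped)

word = 'ប្រទេស'
missing_subscript = word[0] + word[3:]  # drop the coeng sign and the subscript consonant
print(osa_distance(word, missing_subscript))  # 2
```

A misspelling is only correctable when some dictionary word lies within distance 2 of it under this kind of metric.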

Files in This Repository

File	Description
`symspell_english.pkl`	Serialized English SymSpell model
`symspell_khmer.pkl`	Serialized Khmer SymSpell model
`khmer_frequency_dict.txt`	Khmer word frequency list (plain text, 44k words)
`README.md`	This file

Installation

pip install symspellpy huggingface_hub

Load Models from HuggingFace

from huggingface_hub import hf_hub_download
import pickle

REPO_ID = 'phonsobon/khmer-english-spellchecker'

en_path = hf_hub_download(repo_id=REPO_ID, filename='symspell_english.pkl')
km_path = hf_hub_download(repo_id=REPO_ID, filename='symspell_khmer.pkl')

with open(en_path, 'rb') as f:
    sym_spell_en = pickle.load(f)

with open(km_path, 'rb') as f:
    sym_spell_km = pickle.load(f)

print(f'English words : {sym_spell_en.word_count:,}')
print(f'Khmer words   : {sym_spell_km.word_count:,}')

Usage

English Spell Check

from symspellpy import Verbosity

# Single word
suggestions = sym_spell_en.lookup('speling', Verbosity.CLOSEST, max_edit_distance=2)
print(suggestions[0].term)
# Output: spelling

# Full sentence (lookup_compound lowercases its input; pass transfer_casing=True to keep the original casing)
suggestions = sym_spell_en.lookup_compound('I havv a problm', max_edit_distance=2)
print(suggestions[0].term)
# Output: i have a problem

Khmer Spell Check

from symspellpy import Verbosity

# Single word (already spelled correctly, so it is returned unchanged)
suggestions = sym_spell_km.lookup('ភាសា', Verbosity.CLOSEST, max_edit_distance=2)
print(suggestions[0].term)
# Output: ភាសា

# With distance info
for s in suggestions:
    print(f'  word: {s.term}  distance: {s.distance}  freq: {s.count}')

Quick Test

from symspellpy import Verbosity

def test_spell_checker(sym_spell_en, sym_spell_km):
    print('=== English Tests ===')
    english_tests = [
        ('speling',    'spelling'),
        ('beautifull', 'beautiful'),
        ('recieve',    'receive'),
        ('writting',   'writing'),
        ('tomorow',    'tomorrow'),
    ]
    en_pass = 0
    for wrong, expected in english_tests:
        result = sym_spell_en.lookup(wrong, Verbosity.CLOSEST, max_edit_distance=2)
        got = result[0].term if result else ''
        status = 'PASS' if got == expected else 'FAIL'
        en_pass += int(got == expected)
        print(f'  [{status}] "{wrong}" => "{got}" (expected: "{expected}")')

    print('\n=== Khmer Tests ===')
    khmer_tests = [
        ('ភាសា',    'ភាសា'),
        ('ការងារ',   'ការងារ'),
        ('ប្រទេស',  'ប្រទេស'),
        ('អរគុណ',   'អរគុណ'),
        ('ភ្នំពេញ',  'ភ្នំពេញ'),
    ]
    km_pass = 0
    for word, expected in khmer_tests:
        result = sym_spell_km.lookup(word, Verbosity.CLOSEST, max_edit_distance=2)
        got = result[0].term if result else ''
        status = 'PASS' if got == expected else 'FAIL'
        km_pass += int(got == expected)
        print(f'  [{status}] "{word}" => "{got}" (expected: "{expected}")')

    total = len(english_tests) + len(khmer_tests)
    passed = en_pass + km_pass
    print(f'\nResult: {passed}/{total} passed ({passed/total*100:.1f}%)')
    print(f'  English : {en_pass}/{len(english_tests)}')
    print(f'  Khmer   : {km_pass}/{len(khmer_tests)}')

test_spell_checker(sym_spell_en, sym_spell_km)
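
The Khmer cases above use correctly spelled words as input, so they confirm dictionary membership rather than correction. One way to get misspelled inputs is to delete a single code point from a known word, as in the sketch below. `deletion_variants` is a hypothetical helper, not part of this repository or of symspellpy.

```python
from typing import List

def deletion_variants(word: str) -> List[str]:
    """Return each distinct non-empty string made by deleting one code point."""
    variants: List[str] = []
    for i in range(len(word)):
        variant = word[:i] + word[i + 1:]
        if variant and variant not in variants:
            variants.append(variant)
    return variants

# Each variant is one edit away from the original word.
for typo in deletion_variants('ភាសា'):
    print(repr(typo))
```

Each variant can then be passed to `sym_spell_km.lookup(...)`. Because another dictionary word may be just as close, it is safer to check that the original word appears among the suggestions than to require it as the first result.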

Language Detection

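The checker auto-detects language from the share of characters that fall in the Khmer Unicode block (U+1780 to U+17FF): a ratio above 0.6 is treated as Khmer, above 0.1 as mixed, and anything lower as English: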

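import re

KHMER_PATTERN = re.compile(r'[\u1780-\u17FF]')

def detect_language(text):
    # Count every code point in the Khmer block. Vowel signs and the coeng
    # are combining marks, so str.isalpha() is False for them.
    khmer_chars = len(KHMER_PATTERN.findall(text))
    # Count letters outside the Khmer block (e.g. Latin) separately so the
    # ratio stays between 0 and 1.
    other_chars = sum(1 for c in text if c.isalpha() and not KHMER_PATTERN.match(c))
    total = khmer_chars + other_chars
    if total == 0:
        return 'unknown'
    ratio = khmer_chars / total
    if ratio > 0.6:
        return 'km'
    elif ratio > 0.1:
        return 'mixed'
    else:
        return 'en'

print(detect_language('Hello world'))                     # en
print(detect_language('ខ្ញុំស្រឡាញ់ប្រទេសកម្ពុជា'))      # km
print(detect_language('I love ប្រទេសកម្ពុជា'))           # km (13 of 18 counted characters are Khmer)
print(detect_language('I love Cambodia ប្រទេសកម្ពុជា'))  # mixed (13 of 26)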
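
The model details mention that mixed text is split into segments, but the splitting logic itself is not shown here. The sketch below is one possible approach, not necessarily the one this repository uses: it cuts the text into alternating runs of Khmer-block and non-Khmer characters and hands each run to a per-language corrector. `split_segments` and `correct_mixed` are hypothetical names, and the corrector arguments are plain callables standing in for wrappers around `sym_spell_km` and `sym_spell_en`.

```python
import re
from typing import Callable, List, Tuple

# A run of Khmer-block code points, or a run of anything else.
SEGMENT_PATTERN = re.compile(r'[\u1780-\u17FF]+|[^\u1780-\u17FF]+')

def split_segments(text: str) -> List[Tuple[str, str]]:
    """Split text into ('km' | 'other', segment) runs, preserving order."""
    segments = []
    for match in SEGMENT_PATTERN.finditer(text):
        segment = match.group()
        lang = 'km' if '\u1780' <= segment[0] <= '\u17FF' else 'other'
        segments.append((lang, segment))
    return segments

def correct_mixed(
    text: str,
    correct_km: Callable[[str], str],
    correct_other: Callable[[str], str],
) -> str:
    """Apply the matching corrector to each run and stitch the text back together."""
    return ''.join(
        correct_km(segment) if lang == 'km' else correct_other(segment)
        for lang, segment in split_segments(text)
    )

print(split_segments('I love ប្រទេសកម្ពុជា'))
# [('other', 'I love '), ('km', 'ប្រទេសកម្ពុជា')]

# Placeholder correctors, just to show the routing.
print(correct_mixed('I love ប្រទេសកម្ពុជា', correct_km=lambda s: s, correct_other=str.upper))
# I LOVE ប្រទេសកម្ពុជា
```

Non-Khmer runs keep their surrounding whitespace and punctuation, so a real corrector would need to preserve those. Zero-width spaces (U+200B), which Khmer text often uses between words, fall outside the Khmer block and would end a Khmer run under this pattern.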

Khmer Dictionary Source

The Khmer dictionary was built from seanghay/khmer-dictionary-44k, extracted from the Royal Academy of Cambodia Khmer Dictionary 2022. This model is intended for research purposes only.

License

MIT
