Khmer + English Spell Checker
A bilingual spell checker for Khmer and English based on SymSpell (dictionary-based edit-distance correction). The Khmer word list is derived from the Royal Academy of Cambodia Khmer Dictionary 2022.
Model Details
| Item | Details |
|---|---|
| Languages | Khmer (km), English (en) |
| Approach | Dictionary-based (SymSpell) |
| Khmer dictionary | seanghay/khmer-dictionary-44k (44,700 words) |
| English dictionary | SymSpell built-in (82,765 words + bigrams) |
| Max edit distance | 2 |
| Mixed text support | Yes (auto-detects and splits segments) |
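To make the "Max edit distance" row concrete, here is a minimal sketch of the optimal string alignment (restricted Damerau-Levenshtein) distance, which counts insertions, deletions, substitutions, and adjacent transpositions. It assumes this is the variant the SymSpell models use, which this README does not state. The function name `osa_distance` is illustrative and is not part of symspellpy.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b (illustrative helper)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ie" <-> "ei"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(osa_distance('speling', 'spelling'))  # 1 (one insertion)
print(osa_distance('recieve', 'receive'))   # 1 (one transposition)
```

With a maximum edit distance of 2, a misspelling can only be matched to dictionary words that are at most two such operations away.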
Files in This Repository
| File | Description |
|---|---|
| symspell_english.pkl | Serialized English SymSpell model |
| symspell_khmer.pkl | Serialized Khmer SymSpell model |
| khmer_frequency_dict.txt | Khmer word frequency list (plain text, 44k words) |
| README.md | This file |
Installation
pip install symspellpy huggingface_hub
Load Models from HuggingFace
from huggingface_hub import hf_hub_download
import pickle
REPO_ID = 'phonsobon/khmer-english-spellchecker'
en_path = hf_hub_download(repo_id=REPO_ID, filename='symspell_english.pkl')
km_path = hf_hub_download(repo_id=REPO_ID, filename='symspell_khmer.pkl')
# Note: pickle.load can execute arbitrary code, so only load files from a repository you trust.
with open(en_path, 'rb') as f:
    sym_spell_en = pickle.load(f)
with open(km_path, 'rb') as f:
    sym_spell_km = pickle.load(f)
print(f'English words : {sym_spell_en.word_count:,}')
print(f'Khmer words : {sym_spell_km.word_count:,}')
Usage
English Spell Check
from symspellpy import Verbosity
# Single word
suggestions = sym_spell_en.lookup('speling', Verbosity.CLOSEST, max_edit_distance=2)
print(suggestions[0].term)
# Output: spelling
# Full sentence
suggestions = sym_spell_en.lookup_compound('I havv a problm', max_edit_distance=2)
print(suggestions[0].term)
# Output: I have a problem
Khmer Spell Check
from symspellpy import Verbosity
# Single word
suggestions = sym_spell_km.lookup('ααΆααΆ', Verbosity.CLOSEST, max_edit_distance=2)
print(suggestions[0].term)
# Output: ααΆααΆ
# With distance info
for s in suggestions:
    print(f' word: {s.term} distance: {s.distance} freq: {s.count}')
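The examples above index `suggestions[0]` directly, which raises `IndexError` when a lookup returns no suggestion within the edit distance. A minimal sketch of a guard follows, using the same `result[0].term if result else ...` pattern as the Quick Test. The helper `best_term` and the `FakeSuggestion` stand-in are hypothetical names used only so the snippet runs without the models loaded.

```python
from collections import namedtuple


def best_term(suggestions, fallback):
    """Return the top suggestion's term, or the fallback if there are none."""
    return suggestions[0].term if suggestions else fallback


# Stand-in with the same attributes the README reads (term, distance, count).
FakeSuggestion = namedtuple('FakeSuggestion', ['term', 'distance', 'count'])

print(best_term([FakeSuggestion('spelling', 1, 100)], 'speling'))  # spelling
print(best_term([], 'xqzv'))                                       # xqzv
```

With the real models this would be called as `best_term(sym_spell_en.lookup(word, Verbosity.CLOSEST, max_edit_distance=2), word)`, which leaves an unmatched word unchanged.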
Quick Test
from symspellpy import Verbosity
def test_spell_checker(sym_spell_en, sym_spell_km):
    print('=== English Tests ===')
    english_tests = [
        ('speling', 'spelling'),
        ('beautifull', 'beautiful'),
        ('recieve', 'receive'),
        ('writting', 'writing'),
        ('tomorow', 'tomorrow'),
    ]
    en_pass = 0
    for wrong, expected in english_tests:
        result = sym_spell_en.lookup(wrong, Verbosity.CLOSEST, max_edit_distance=2)
        got = result[0].term if result else ''
        status = 'PASS' if got == expected else 'FAIL'
        en_pass += int(got == expected)
        print(f' [{status}] "{wrong}" => "{got}" (expected: "{expected}")')

    print('\n=== Khmer Tests ===')
    khmer_tests = [
        ('ααΆααΆ', 'ααΆααΆ'),
        ('ααΆαααΆα', 'ααΆαααΆα'),
        ('αααααα', 'αααααα'),
        ('α’ααα»α', 'α’ααα»α'),
        ('ααααααα', 'ααααααα'),
    ]
    km_pass = 0
    for word, expected in khmer_tests:
        result = sym_spell_km.lookup(word, Verbosity.CLOSEST, max_edit_distance=2)
        got = result[0].term if result else ''
        status = 'PASS' if got == expected else 'FAIL'
        km_pass += int(got == expected)
        print(f' [{status}] "{word}" => "{got}" (expected: "{expected}")')

    total = len(english_tests) + len(khmer_tests)
    passed = en_pass + km_pass
    print(f'\nResult: {passed}/{total} passed ({passed/total*100:.1f}%)')
    print(f' English : {en_pass}/{len(english_tests)}')
    print(f' Khmer : {km_pass}/{len(khmer_tests)}')

test_spell_checker(sym_spell_en, sym_spell_km)
Language Detection
The checker auto-detects the language by measuring the share of characters that fall in the Khmer Unicode block (U+1780 to U+17FF):
import re
KHMER_PATTERN = re.compile(r'[\u1780-\u17FF]')
def detect_language(text):
    khmer_chars = len(KHMER_PATTERN.findall(text))
    alpha_chars = len([c for c in text if c.isalpha()])
    if alpha_chars == 0:
        return 'unknown'
    ratio = khmer_chars / alpha_chars
    if ratio > 0.6:
        return 'km'
    elif ratio > 0.1:
        return 'mixed'
    else:
        return 'en'
print(detect_language('Hello world')) # en
print(detect_language('αααα»ααααα‘αΆααααααααααααα»ααΆ')) # km
print(detect_language('I love ααααααααααα»ααΆ')) # mixed
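For text detected as mixed, the Model Details table says the checker splits the input into segments, but the README does not show how. One possible approach, sketched below, is to split on runs of characters inside and outside the Khmer block. The names `SEGMENT_PATTERN` and `split_segments` are hypothetical, and labeling every non-Khmer run as `'en'` is a simplifying assumption (digits and punctuation land there too).

```python
import re

# A run of Khmer-block characters, or a run of anything else.
SEGMENT_PATTERN = re.compile(r'[\u1780-\u17FF]+|[^\u1780-\u17FF]+')


def split_segments(text):
    """Split text into (language, segment) pairs by Khmer vs. non-Khmer runs."""
    segments = []
    for match in SEGMENT_PATTERN.finditer(text):
        chunk = match.group().strip()
        if not chunk:
            continue  # skip whitespace-only runs between segments
        lang = 'km' if '\u1780' <= chunk[0] <= '\u17FF' else 'en'
        segments.append((lang, chunk))
    return segments


# The Khmer word is written with escapes to keep the snippet encoding-safe.
for lang, segment in split_segments('I love \u1780\u1798\u17d2\u1796\u17bb\u1787\u17b6'):
    print(lang, segment)
```

Each segment could then be routed to `sym_spell_en` or `sym_spell_km`. Note that Khmer text often contains zero-width spaces (U+200B), which fall outside the Khmer block and would break a Khmer run into several segments under this sketch.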
Khmer Dictionary Source
The Khmer dictionary was built from seanghay/khmer-dictionary-44k, extracted from the Royal Academy of Cambodia Khmer Dictionary 2022. This model is intended for research purposes only.
License
MIT