KenLM (arpa) models for Dutch based on SONAR

This repository contains 5-gram KenLM models for Dutch, trained on the SONAR corpus after sentence segmentation (one sentence per line). Models are provided for tokens, part-of-speech tags, dependency labels, and lemmas, all produced with the spaCy pipeline nl_core_news_sm:

kenlm_sonar_token.arpa[.bin]: token
kenlm_sonar_pos.arpa[.bin]: part-of-speech tag
kenlm_sonar_dep.arpa[.bin]: dependency label
kenlm_sonar_lemma.arpa[.bin]: lemma

The noisier SONAR components (WRPEA, WRPED, WRUEA, WRUED, WRUEB) were excluded.

Both regular .arpa files and the more efficient KenLM binary files (.arpa.bin) are provided. You probably want to use the binary versions.

Usage from within Python

Make sure to install dependencies:

pip install huggingface_hub
pip install https://github.com/kpu/kenlm/archive/master.zip

# If you want to use spaCy preprocessing
pip install spacy
python -m spacy download nl_core_news_sm

We can then use the Hugging Face hub software to download and cache the model file that we want, and directly use it with KenLM.

import kenlm
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_sonar", filename="kenlm_sonar_token.arpa.bin")
model = kenlm.Model(model_file)

text = "Ik eet graag koekjes !"  # pre-tokenized
model.perplexity(text)
# 148.21996373689134
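
To make the perplexity value easier to interpret, here is a minimal sketch of how it relates to a sentence's total log10 probability. It assumes the convention used by the KenLM Python bindings, where the end-of-sentence marker counts as one extra position; the helper name perplexity_from_log10 is hypothetical and not part of KenLM.

```python
def perplexity_from_log10(total_log10_prob: float, n_tokens: int) -> float:
    """Convert a total log10 sentence probability into a perplexity.

    n_tokens is the number of whitespace-separated tokens in the sentence.
    One extra position is added for the end-of-sentence marker </s>
    (assumption: this mirrors what kenlm.Model.perplexity does).
    """
    return 10.0 ** (-total_log10_prob / (n_tokens + 1))


# A two-token sentence with total log10 probability -6.0:
# 10 ** (6 / 3) = 100
example = perplexity_from_log10(-6.0, 2)
```

Under that assumption, perplexity_from_log10(model.score(text), len(text.split())) should give the same value as model.perplexity(text).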

It is recommended to use spaCy as a preprocessor to automatically use the same tagsets and tokenization as were used when creating the LMs.

import kenlm
import spacy
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_sonar", filename="kenlm_sonar_pos.arpa.bin")  # pos file
model = kenlm.Model(model_file)

nlp = spacy.load("nl_core_news_sm")

text = "Ik eet graag koekjes!" 
pos_sequence = " ".join([token.pos_ for token in nlp(text)])
# 'PRON VERB ADV NOUN PUNCT'
model.perplexity(pos_sequence)
# 6.916279238079976
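
The same pattern should carry over to the dependency and lemma models. The sketch below is an assumption rather than the exact preprocessing used here: it supposes that the dep and lemma models were built from spaCy's token.dep_ and token.lemma_ attributes, and the helper to_sequence is hypothetical. Stand-in tokens are used so the snippet runs without downloading a spaCy model; their attribute values are illustrative only.

```python
from types import SimpleNamespace


def to_sequence(doc, attr):
    """Join one string attribute of every token into a space-separated line.

    doc can be a spaCy Doc or any iterable of objects that expose attr,
    e.g. "pos_", "dep_" or "lemma_".
    """
    return " ".join(getattr(token, attr) for token in doc)


# Stand-in tokens (illustrative values, not real spaCy output).
demo_doc = [
    SimpleNamespace(pos_="PRON", lemma_="ik"),
    SimpleNamespace(pos_="VERB", lemma_="eten"),
]
demo_pos = to_sequence(demo_doc, "pos_")
demo_lemma = to_sequence(demo_doc, "lemma_")
```

With a real pipeline, to_sequence(nlp(text), "dep_") would be scored with kenlm_sonar_dep.arpa.bin and to_sequence(nlp(text), "lemma_") with kenlm_sonar_lemma.arpa.bin.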

Reproduction

bin/lmplz -o 5 -S 75% -T ../data/tmp/ < ../data/processed_sonar_token_dedup.txt > ../data/kenlm_sonar_token.arpa

For the class-based LMs (POS and DEP), the --discount_fallback option was used and the parsed data was not deduplicated. For the token and lemma models, the data was deduplicated at the sentence level.
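
The deduplication procedure itself is not documented here. As a rough sketch, sentence-level deduplication of a one-sentence-per-line file could look like the following, assuming exact string matching after stripping whitespace and keeping the first occurrence of each sentence; the function name dedup_sentences is hypothetical.

```python
def dedup_sentences(lines):
    """Yield each distinct, non-empty sentence once, in first-seen order.

    Assumption: two sentences are duplicates only if they are identical
    after stripping leading and trailing whitespace.
    """
    seen = set()
    for line in lines:
        sentence = line.strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            yield sentence


# Small in-memory demonstration; a real run would iterate over a file object.
demo_lines = ["Ik eet graag koekjes !\n", "Hallo .\n", "Ik eet graag koekjes !\n", "\n"]
demo_unique = list(dedup_sentences(demo_lines))
```

Because the function is a generator, it can stream over an open file line by line, although the set of seen sentences still has to fit in memory.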