---
license: mit
datasets:
- mlengineer-ai/jomleh
language:
- fa
metrics:
- perplexity
tags:
- kneser-ney
- n-gram
- kenlm
---

# KenLM models for Farsi

This repository contains KenLM models trained on the Jomleh dataset for the Farsi (Persian)
language. Among the various use cases for KenLM language models, the models provided here are
particularly useful for automatic speech recognition (ASR) tasks. They can be used in conjunction
with CTC decoding to select the most likely sequence of tokens extracted from a spectrogram.
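
For instance, a CTC decoder such as pyctcdecode can load one of these binary models directly. The
snippet below is only a rough sketch: the label list and the logits are placeholders that must come
from your own acoustic model, and its output tokens have to match the tokens the language model was
trained on.

```
# Minimal sketch of CTC decoding with one of these KenLM binaries using pyctcdecode.
from pyctcdecode import build_ctcdecoder

# Placeholder vocabulary; use your acoustic model's output tokens in their exact order
# (index 0 is the CTC blank).
labels = ["", "▁من", "▁در", "▁را", "▁بستم"]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="files/jomleh-sp-32000-o5-prune01111.probing",
)

# `logits` should be a (time_steps, len(labels)) numpy array produced by the acoustic model:
# transcription = decoder.decode(logits)
```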
The models in this repository are KenLM arpa files that have been converted to binary format.
KenLM supports two binary formats: probing and trie. The models provided here are in the probing
format, which KenLM claims is faster but has a larger memory footprint.
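
Since the files are already in binary form, they can be opened directly with the `kenlm` Python
package; no conversion back to arpa is needed. A minimal sketch (the file name matches the example
used later in this document):

```
import kenlm

# Load a probing-format binary directly.
model = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")

# Scores are base-10 log probabilities over space-separated SentencePiece tokens:
# print(model.score(tokenized_sentence, bos=True, eos=True))
```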
There are a total of 36 different KenLM models available in this repository. Unless you are
conducting research, you will not need all of them. In that case, it is recommended that you
download only the models you require rather than the entire repository, since the total file size
is over half a terabyte.
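
Individual files can be fetched with the `huggingface_hub` library, for example (a sketch; the
`repo_id` below is a placeholder that should be replaced with this repository's actual id):

```
from huggingface_hub import hf_hub_download

# Download a single model file instead of cloning the whole repository.
path = hf_hub_download(
    repo_id="<namespace>/<this-repository>",  # placeholder
    filename="files/jomleh-sp-32000-o5-prune01111.probing",
)
print(path)
```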
# Sample code showing how to use the models

Unfortunately, I could not find an easy way to load these models directly through the Hugging Face
library. These are the steps you have to take to use any of the models provided here:

1. Install KenLM package:

```
pip install https://github.com/kpu/kenlm/archive/master.zip
```

2. Install SentencePiece for tokenization:
```
pip install sentencepiece
```

3. Download the model that you are interested in from this repository, along with the Python code
`model.py`. Keep the model in the `files` folder, next to `model.py` (just like the file
structure in the repository). Don't forget to download the SentencePiece files as well. For
instance, if you were interested in the tokenizer with a 32000-token vocabulary and the 5-gram
model with maximum pruning, these are the files you'll need:
```
model.py
files/jomleh-sp-32000.model
files/jomleh-sp-32000.vocab
files/jomleh-sp-32000-o5-prune01111.probing
```

4. In your script, instantiate a model and use it like this:

```
from model import KenlmModel

# Load the model
model = KenlmModel.from_pretrained("57218", "3", "011")

# Get perplexity
print(model.perplexity("من در را بستم"))
# Outputs: 72.5

# Get score
print(model.score("من در را بستم"))
# Outputs: -11.160577774047852
```
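
Under the hood, `model.py` essentially chains SentencePiece tokenization with KenLM scoring. The
following is only a rough sketch of that idea, not the exact implementation shipped in `model.py`
(the file names follow the naming scheme described in the next section):

```
import kenlm
import sentencepiece as spm

# Tokenize the sentence with SentencePiece, then score the space-joined tokens with KenLM.
sp = spm.SentencePieceProcessor(model_file="files/jomleh-sp-57218.model")
lm = kenlm.Model("files/jomleh-sp-57218-o3-prune011.probing")

tokens = sp.encode("من در را بستم", out_type=str)
print(lm.perplexity(" ".join(tokens)))
```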
# What are the different files you can find in this repository?

The files you can find in this repository are either SentencePiece tokenizer models or KenLM
binary models. For the tokenizers, this is the template their file names follow:

```
<dataset-name>-<tokenizer-type>-<vocabulary-size>.<model|vocab>
```

In this repository, all the models are based on the Jomleh dataset (`jomleh`), and the only
tokenizer used is SentencePiece (`sp`). The vocabulary sizes used are 2000, 4000, 8000, 16000,
32000, and 57218 tokens. Due to hardware limitations, I could only use 4 of Jomleh's 60 text
files to train the tokenizer, namely files 10, 11, 12, and 13. Also, 57218 was the largest
vocabulary size that SentencePiece allowed to be set.

Here's an example of the tokenizer files you can find in this repository:

```
jomleh-sp-32000.model
```
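
A tokenizer file can be loaded and inspected with the `sentencepiece` package, for example (a
minimal sketch):

```
import sentencepiece as spm

# Load one of the tokenizers and encode a sample sentence into subword tokens.
sp = spm.SentencePieceProcessor(model_file="jomleh-sp-32000.model")
print(sp.get_piece_size())                       # 32000
print(sp.encode("من در را بستم", out_type=str))  # list of subword tokens
```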
Moving on to the KenLM binary models, their file names follow this template:

```
<dataset-name>-<tokenizer-type>-<vocabulary-size>-o<n-gram>-prune<pruning>.probing
```

Just like with the tokenizers, the only available options for dataset and tokenizer type are
`jomleh` and `sp`, and the same vocabulary sizes apply. Two n-gram orders were trained: 3-gram
and 5-gram. Additionally, there are three different pruning options available for each
configuration. To interpret the pruning numbers, add a space between each pair of digits. For
example, `011` means `0 1 1` was set during training of the KenLM model.

Here is a complete example: to train the binary model named `jomleh-sp-32000-o5-prune01111.probing`,
the tokenizer `jomleh-sp-32000.model` was used to encode (tokenize) 95% of the Jomleh dataset,
resulting in a large text file holding space-separated tokens. Then, the file was fed into
the `lmplz` program with the following input arguments:

```
lmplz -o 5 -T /tmp --vocab_estimate 32000 -S 80% --discount_fallback --prune "0 1 1 1 1" < encoded.txt > jomleh-sp-32000-o5-prune01111.arpa
```

This command will produce the raw arpa file, which can then be converted into binary format
using the `build_binary` program, as shown below:

```
build_binary -T /tmp -S 80% probing jomleh-sp-32000-o5-prune01111.arpa jomleh-sp-32000-o5-prune01111.probing
```

# Which model to use?

Based on my personal evaluation, I recommend using `jomleh-sp-57218-o3-prune011.probing`. It
offers the best balance between file size (6GB) and accuracy (80%). But if you have no concern
about file size, then go for the largest model, `jomleh-sp-57218-o5-prune00011.probing`
(size: 36GB, accuracy: 82%).