---
license: mit
datasets:
- mlengineer-ai/jomleh
language:
- fa
metrics:
- perplexity
tags:
- kneser-ney
- n-gram
- kenlm
---
# KenLM models for Farsi
This repository contains KenLM models trained on the Jomleh dataset for the Farsi (Persian)
language. Among the various use cases for KenLM language models, the models provided here are
particularly useful for automatic speech recognition (ASR) tasks. They can be used in conjunction
with CTC to select the most likely sequence of tokens extracted from a spectrogram.
The models in this repository are KenLM ARPA files that have been converted to binary format.
KenLM supports two binary formats: probing and trie. The models provided here use the probing
format, which KenLM claims is faster but has a larger memory footprint.
There are a total of 36 different KenLM models available in this repository. Unless you are
conducting research, you will not need all of them. Since the total file size is over half a
terabyte, it is recommended that you download only the models you require rather than the
entire repository.
# Sample code showing how to use the models
Unfortunately, I could not find an easy way to load these models directly through the
Hugging Face library. Instead, these are the steps you have to take to use any of the
models provided here:
1. Install KenLM package:
```
pip install https://github.com/kpu/kenlm/archive/master.zip
```
2. Install SentencePiece for tokenization:
```
pip install sentencepiece
```
3. Download the model that you are interested in from this repository, along with the Python
code `model.py`. Keep the model in the `files` folder next to `model.py` (just like the file
structure in the repository). Don't forget to download the SentencePiece files as well. For
instance, if you are interested in the tokenizer with a vocabulary size of 32000 and the
5-gram model with maximum pruning, these are the files you'll need:
```
model.py
files/jomleh-sp-32000.model
files/jomleh-sp-32000.vocab
files/jomleh-sp-32000-o5-prune01111.probing
```
4. In your script, instantiate a model and use it like this:
```
from model import KenlmModel
# Load the model
model = KenlmModel.from_pretrained("57218", "3", "011")
# Get perplexity
print(model.perplexity("من در را بستم"))
# Outputs: 72.5
# Get score
print(model.score("من در را بستم"))
# Outputs: -11.160577774047852
```
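The perplexity reported above can be recovered from the raw score by hand. KenLM's `score()` returns the log10 probability of the whole sentence, and perplexity is `10 ** (-score / N)`, where `N` is the number of scored tokens. The token count of 6 used below (the SentencePiece tokens of the example sentence plus the end-of-sentence token) is an assumption inferred from the two numbers printed above, not something stated by the library:

```python
# Minimal sketch: convert a KenLM log10 sentence score into perplexity.
# N = 6 is an assumption (SentencePiece tokens of the sample sentence
# plus the end-of-sentence token), inferred from the example outputs.
def perplexity_from_score(log10_score: float, num_tokens: int) -> float:
    return 10.0 ** (-log10_score / num_tokens)

print(perplexity_from_score(-11.160577774047852, 6))  # ~72.5
```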
# What are the different files you can find in this repository?
The files you can find in this repository are either SentencePiece tokenizer models or KenLM
binary models. For the tokenizers, this is the template their file name follows:
```
<dataset-name>-<tokenizer-type>-<vocabulary-size>.<model|vocab>
```
In this repository, all the models are based on the Jomleh dataset (`jomleh`), and the only
tokenizer used is SentencePiece (`sp`). The vocabulary sizes used are 2000, 4000, 8000,
16000, 32000, and 57218 tokens. Due to hardware limitations, I could only use 4 of Jomleh's
60 text files to train the tokenizer, namely files 10, 11, 12, and 13. Also, 57218 was the
largest vocabulary size that SentencePiece allowed.
Here's an example of the tokenizer files you can find in this repository:
```
jomleh-sp-32000.model
```
Moving on to the KenLM binary models, their file names follow this template:
```
<dataset-name>-<tokenizer-type>-<vocabulary-size>-o<n-gram>-prune<pruning>.<model|vocab>
```
Just like with the tokenizers, the only available options for dataset and tokenizer type are
`jomleh` and `sp`, and the same vocabulary sizes apply. There are two n-gram orders:
3-grams and 5-grams. Additionally, there are three different pruning options available for
each configuration. To interpret the pruning digits, insert a space between each pair of
digits. For example, `011` means `0 1 1` was passed to `--prune` during training of the
KenLM model.
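As a quick illustration (this helper is not part of the repository), the naming template and the pruning rule can both be applied in a few lines of Python:

```python
import re

# Split a binary-model file name into the fields described by the template
# above, then expand the pruning digits back into the lmplz --prune argument.
PATTERN = re.compile(
    r"(?P<dataset>[a-z]+)-(?P<tokenizer>[a-z]+)-(?P<vocab>\d+)"
    r"-o(?P<order>\d+)-prune(?P<prune>\d+)\.(?P<fmt>\w+)"
)

fields = PATTERN.match("jomleh-sp-32000-o5-prune01111.probing").groupdict()
print(fields["dataset"], fields["vocab"], fields["order"])  # jomleh 32000 5

# One pruning threshold per n-gram order: "01111" -> "0 1 1 1 1"
prune_arg = " ".join(fields["prune"])
print(prune_arg)  # 0 1 1 1 1
```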
Here is a complete example: to train the binary model named `jomleh-sp-32000-o5-prune01111.probing`,
the tokenizer `jomleh-sp-32000.model` was used to encode (tokenize) 95% of the Jomleh dataset,
resulting in a large text file of space-separated tokens. Then, the file was fed into
the `lmplz` program with the following input arguments:
```
lmplz -o 5 -T /tmp --vocab_estimate 32000 -S 80% --discount_fallback --prune "0 1 1 1 1" < encoded.txt > jomleh-sp-32000-o5-prune01111.arpa
```
This command produces the raw ARPA file, which can then be converted into binary format
using the `build_binary` program, as shown below:
```
build_binary -T /tmp -S 80% probing jomleh-sp-32000-o5-prune01111.arpa jomleh-sp-32000-o5-prune01111.probing
```
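For reference, the two commands above can be reconstructed for any of the configurations in this repository. The sketch below is my own helper, not a script shipped with the repo; the flag values simply mirror the commands shown above:

```python
# Rebuild the lmplz / build_binary command lines for a given configuration.
# Flag values (-T /tmp, -S 80%, --discount_fallback, probing) mirror the
# commands shown above; the file-name stem follows the repository's template.
def lmplz_command(vocab: int, order: int, prune: str) -> str:
    stem = f"jomleh-sp-{vocab}-o{order}-prune{prune}"
    prune_arg = " ".join(prune)  # e.g. "01111" -> "0 1 1 1 1"
    return (
        f"lmplz -o {order} -T /tmp --vocab_estimate {vocab} -S 80% "
        f'--discount_fallback --prune "{prune_arg}" '
        f"< encoded.txt > {stem}.arpa"
    )

def build_binary_command(vocab: int, order: int, prune: str) -> str:
    stem = f"jomleh-sp-{vocab}-o{order}-prune{prune}"
    return f"build_binary -T /tmp -S 80% probing {stem}.arpa {stem}.probing"

print(lmplz_command(32000, 5, "01111"))
print(build_binary_command(32000, 5, "01111"))
```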
# Which model to use?
Based on my personal evaluation, I recommend using `jomleh-sp-57218-o3-prune011.probing`.
It strikes the best balance between file size (6 GB) and accuracy (80%). But if you have no concern
for file size, then go for the largest model, `jomleh-sp-57218-o5-prune00011.probing` (size: 36 GB, accuracy: 82%).