mehran committed on
Commit a409788
1 Parent(s): d8699f6

Update README.md

Files changed (1)
  1. README.md +61 -1
README.md CHANGED
@@ -10,4 +10,64 @@ tags:
  - kneser-ney
  - n-gram
  - kenlm
- ---
+ ---
+
+ # KenLM models for Farsi
+
+ This repository contains KenLM models for Farsi (Persian), trained on the Jomleh
+ dataset. Among the many use cases for language models like KenLM, the models provided here are
+ particularly useful for the ASR (automatic speech recognition) task. They can be used along with CTC
+ decoding to select the most likely sequence of tokens extracted from a spectrogram.
+
+ The models in this repository are KenLM ARPA files converted to binary. KenLM supports two
+ binary formats: probing and trie. The models provided here use the probing format, which KenLM
+ describes as faster but with a larger memory footprint.
+
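+ For reference, a probing-format binary is typically produced from an ARPA file with KenLM's
+ `build_binary` tool; the file names below are only illustrative:
+
+ ```
+ build_binary probing jomleh-sp-32000-o5-prune01111.arpa jomleh-sp-32000-o5-prune01111.probing
+ ```
+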
+ There are a total of 36 different KenLM models that you can find here. Unless you are doing
+ research, you won't need all of them. In that case, I suggest downloading only the ones you
+ need rather than the whole repository, since the total size of the files is larger than half a TB.
+
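+ As a minimal sketch, individual files can be fetched with the `huggingface_hub` package; the
+ repository id below is a placeholder that you should replace with this repository's actual id:
+
+ ```
+ from huggingface_hub import hf_hub_download
+
+ # Placeholder repository id; use the id shown on this model page.
+ repo_id = "<namespace>/<this-repository>"
+
+ # Download only the files needed for one model configuration.
+ for filename in [
+     "files/jomleh-sp-32000.model",
+     "files/jomleh-sp-32000.vocab",
+     "files/jomleh-sp-32000-o5-prune01111.probing",
+ ]:
+     print(hf_hub_download(repo_id=repo_id, filename=filename))
+ ```
+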
+ # Sample code showing how to use the models
+
+ Unfortunately, I could not find an easy way to load the models directly through the Hugging Face
+ library. These are the steps that you have to take if you want to use any of the models provided
+ here:
+
+ 1. Install the KenLM package:
+
+ ```
+ pip install https://github.com/kpu/kenlm/archive/master.zip
+ ```
+
+ 2. Install SentencePiece for tokenization:
+
+ ```
+ pip install sentencepiece
+ ```
+
+ 3. Download the model that you are interested in from this repository, along with the Python code
+ `model.py`. Keep the model in the `files` folder next to `model.py` (just like the file
+ structure in the repository). Don't forget to download the SentencePiece files as well. For
+ instance, if you were interested in the 32000-token vocabulary tokenizer and the 5-gram model with
+ maximum pruning, these are the files you'll need:
+ ```
+ model.py
+ files/jomleh-sp-32000.model
+ files/jomleh-sp-32000.vocab
+ files/jomleh-sp-32000-o5-prune01111.probing
+ ```
+
+ 4. Write your own code to instantiate a model and use it:
+
+ ```
+ ```
+
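+ If you would rather skip `model.py` and work with the `kenlm` and `sentencepiece` packages
+ directly, a minimal sketch could look like the following (file names taken from the example in
+ step 3; the exact interface of `model.py` may differ):
+
+ ```
+ import kenlm
+ import sentencepiece as spm
+
+ # Load the SentencePiece tokenizer and the probing-format KenLM model.
+ sp = spm.SentencePieceProcessor(model_file="files/jomleh-sp-32000.model")
+ lm = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")
+
+ # Tokenize a Farsi sentence into SentencePiece pieces and score it.
+ sentence = "این یک جمله آزمایشی است"
+ tokens = " ".join(sp.encode(sentence, out_type=str))
+ print(lm.score(tokens, bos=True, eos=True))  # log10 probability of the token sequence
+ ```
+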
+ # What are the different models provided here?
+
+ There are a total of 36 models in this repository and, while all of them are trained on the Jomleh
+ dataset, which is a Farsi dataset, there are differences among them. Namely:
+
+ 1. Different vocabulary sizes: For research purposes, I trained tokenizers with 6 different
+ vocabulary sizes. Of course, the vocabulary size is a hyperparameter of the tokenizer
+ (SentencePiece here), but each new tokenizer results in a new language model. The different
+ vocabulary sizes used here are: 2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use
+ cases, either the 32000 or 57218 token vocabulary size should be the best option.