- kneser-ney
- n-gram
- kenlm
---

# KenLM models for Farsi

This repository contains KenLM models for the Farsi (Persian) language, trained on the Jomleh dataset. Among the many use cases for language models like KenLM, the models provided here are particularly useful for the ASR (automatic speech recognition) task. They can be used along with CTC decoding to select the most likely sequence of tokens extracted from a spectrogram.

The models in this repository are KenLM arpa files converted to binary. KenLM supports two binary formats: probing and trie. The models provided here use the probing format, which KenLM claims is faster but has a larger memory footprint.
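
You don't need to perform this conversion yourself to use the models here, but for reference, an arpa file is typically turned into a binary file with KenLM's `build_binary` tool. The file names below are only illustrative:

```
# probing is build_binary's default data structure
build_binary model.arpa model.probing

# the trie format can be requested explicitly
build_binary trie model.arpa model.trie
```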

There are a total of 36 different KenLM models that you can find here. Unless you are doing research, you won't need all of them. In that case, I suggest downloading only the models you need rather than the whole repository, since the total size of the files is larger than half a TB.

# Sample code: how to use the models

Unfortunately, I could not find an easy way to integrate the Python code that loads the models with the Hugging Face library. These are the steps you have to take if you want to use any of the models provided here:

1. Install the KenLM package:

```
pip install https://github.com/kpu/kenlm/archive/master.zip
```

2. Install SentencePiece for tokenization:

```
pip install sentencepiece
```

3. Download the model that you are interested in from this repository, along with the Python code `model.py`. Keep the model in the `files` folder next to `model.py` (just like the file structure in this repository). Don't forget to download the SentencePiece files as well. For instance, if you are interested in the 32000-token vocabulary tokenizer and the 5-gram model with maximum pruning, these are the files you'll need:

```
model.py
files/jomleh-sp-32000.model
files/jomleh-sp-32000.vocab
files/jomleh-sp-32000-o5-prune01111.probing
```

4. Write your own code to instantiate a model and use it; a minimal sketch of what this could look like follows.
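
The exact contents of `model.py` are not shown in this section, so the following is only a rough sketch of how the downloaded files could be used directly with the `kenlm` and `sentencepiece` packages. The file names, the space-joined tokenization, and the scoring logic are assumptions based on the file list in step 3, not the repository's actual code:

```
import kenlm
import sentencepiece as spm

# Assumed file names, matching the example files listed in step 3.
tokenizer = spm.SentencePieceProcessor(model_file="files/jomleh-sp-32000.model")
model = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")

def score(sentence: str) -> float:
    # Tokenize with SentencePiece and join the pieces with spaces, assuming
    # the KenLM model was trained on space-separated SentencePiece tokens.
    tokens = " ".join(tokenizer.encode(sentence, out_type=str))
    # Full-sentence log10 probability, including <s> and </s> markers.
    return model.score(tokens, bos=True, eos=True)

# "This is a sentence in Farsi."
print(score("این یک جمله به زبان فارسی است"))
```

A higher (less negative) score means the model considers the sentence more likely, which is what a CTC-based decoder could use to rank candidate transcriptions.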
# What are the different models provided here

There are a total of 36 models in this repository, and while all of them are trained on the Jomleh dataset, which is a Farsi dataset, there are differences among them. Namely:

1. Different vocabulary sizes: For research purposes, I trained on 6 different vocabulary sizes. Of course, the vocabulary size is a hyperparameter of the tokenizer (SentencePiece here), but once you have a new tokenizer, it results in a new model. The vocabulary sizes used here are: 2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use cases, either the 32000 or the 57218 token vocabulary size should be the best option.
|