mehran committed on
Commit a409788
1 Parent(s): d8699f6

Update README.md

Files changed (1)
  1. README.md +61 -1
README.md CHANGED
@@ -10,4 +10,64 @@ tags:
  - kneser-ney
  - n-gram
  - kenlm
- ---
+ ---
+
+ # KenLM models for Farsi
+
+ This repository contains KenLM models for Farsi (Persian), trained on the Jomleh
+ dataset. Among the many use cases for language models like KenLM, the models provided here are
+ particularly useful for the ASR (automatic speech recognition) task. They can be used along with CTC
+ decoding to select the most likely sequence of tokens extracted from a spectrogram.
+
+ The models in this repository are KenLM ARPA files converted to binary. KenLM supports two
+ binary formats: probing and trie. The models provided here use the probing format, which KenLM
+ describes as faster but with a larger memory footprint.
+
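+ For reference, a probing-format binary is typically produced from an ARPA file with KenLM's
+ `build_binary` tool; the file names below are only illustrative:
+
+ ```
+ build_binary probing jomleh-sp-32000-o5-prune01111.arpa jomleh-sp-32000-o5-prune01111.probing
+ ```
+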
+ There are a total of 36 different KenLM models that you can find here. Unless you are doing
+ research, you won't need all of them. In that case, I suggest downloading only the ones you
+ need rather than the whole repository, since the total size of the files is larger than half a TB.
+
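+ As a minimal sketch, individual files can be fetched with the `huggingface_hub` package; the
+ repository id below is a placeholder that you should replace with this repository's actual id:
+
+ ```
+ from huggingface_hub import hf_hub_download
+
+ # Placeholder repository id; use the id shown on this model page.
+ repo_id = "<namespace>/<this-repository>"
+
+ # Download only the files needed for one model configuration.
+ for filename in [
+     "files/jomleh-sp-32000.model",
+     "files/jomleh-sp-32000.vocab",
+     "files/jomleh-sp-32000-o5-prune01111.probing",
+ ]:
+     print(hf_hub_download(repo_id=repo_id, filename=filename))
+ ```
+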
+ # Sample code showing how to use the models
+
+ Unfortunately, I could not find an easy way to load the models directly through the Hugging Face
+ library. These are the steps that you have to take if you want to use any of the models provided
+ here:
+
+ 1. Install the KenLM package:
+
+ ```
+ pip install https://github.com/kpu/kenlm/archive/master.zip
+ ```
+
+ 2. Install SentencePiece for tokenization:
+
+ ```
+ pip install sentencepiece
+ ```
+
+ 3. Download the model that you are interested in from this repository, along with the Python code
+ `model.py`. Keep the model in the `files` folder next to `model.py` (just like the file
+ structure in the repository). Don't forget to download the SentencePiece files as well. For
+ instance, if you were interested in the 32000-token vocabulary tokenizer and the 5-gram model with
+ maximum pruning, these are the files you'll need:
+ ```
+ model.py
+ files/jomleh-sp-32000.model
+ files/jomleh-sp-32000.vocab
+ files/jomleh-sp-32000-o5-prune01111.probing
+ ```
+
+ 4. Write your own code to instantiate a model and use it:
+
+ ```
+ ```
+
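+ If you would rather skip `model.py` and work with the `kenlm` and `sentencepiece` packages
+ directly, a minimal sketch could look like the following (file names taken from the example in
+ step 3; the exact interface of `model.py` may differ):
+
+ ```
+ import kenlm
+ import sentencepiece as spm
+
+ # Load the SentencePiece tokenizer and the probing-format KenLM model.
+ sp = spm.SentencePieceProcessor(model_file="files/jomleh-sp-32000.model")
+ lm = kenlm.Model("files/jomleh-sp-32000-o5-prune01111.probing")
+
+ # Tokenize a Farsi sentence into SentencePiece pieces and score it.
+ sentence = "این یک جمله آزمایشی است"
+ tokens = " ".join(sp.encode(sentence, out_type=str))
+ print(lm.score(tokens, bos=True, eos=True))  # log10 probability of the token sequence
+ ```
+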
+ # What are the different models provided here?
+
+ There are a total of 36 models in this repository and, while all of them are trained on the Jomleh
+ dataset, which is a Farsi dataset, there are differences among them. Namely:
+
+ 1. Different vocabulary sizes: For research purposes, I trained tokenizers with 6 different
+ vocabulary sizes. Of course, the vocabulary size is a hyperparameter of the tokenizer
+ (SentencePiece here), but each new tokenizer results in a new language model. The different
+ vocabulary sizes used here are: 2000, 4000, 8000, 16000, 32000, and 57218 tokens. For most use
+ cases, either the 32000 or 57218 token vocabulary size should be the best option.