edugp committed
Commit d75c55c
2 parents: c866ac0 f91a468

Merge branch 'main' of https://huggingface.co/edugp/kenlm

Files changed (1): README.md (+39 -1)
README.md CHANGED
@@ -1,3 +1,41 @@
+---
+language:
+- es
+- af
+- ar
+- arz
+- as
+- bn
+- fr
+- sw
+- eu
+- ca
+- zh
+- en
+- hi
+- ur
+- id
+- pt
+- vi
+- gu
+- kn
+- ml
+- mr
+- ta
+- te
+- yo
+tags:
+- kenlm
+- perplexity
+- n-gram
+- kneser-ney
+- bigscience
+license: "mit"
+datasets:
+- wikipedia
+- oscar
+---
+
 # KenLM models
 This repo contains several KenLM models trained on different tokenized datasets and languages.
 KenLM models are probabilistic n-gram language models. One use case for these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlikely to appear in Wikipedia (high perplexity), or very simple, non-informative sentences that could appear repeatedly (low perplexity).
@@ -11,7 +49,7 @@ The models have been trained using some of the preprocessing steps from [cc_net]
 
 # Dependencies
 * KenLM: `pip install https://github.com/kpu/kenlm/archive/master.zip`
-* SentencePiece: `pip install https://github.com/kpu/kenlm/archive/master.zip`
+* SentencePiece: `pip install sentencepiece`
 
 # Example:
 ```
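
The README's example block is cut off by the diff; below is a minimal sketch of the perplexity-filtering workflow it describes, using the `kenlm` and `sentencepiece` packages from the Dependencies section. The model and tokenizer file names are hypothetical placeholders, and the thresholds are illustrative, not values taken from this repo.

```python
# Sketch of KenLM perplexity filtering; file names and thresholds are
# hypothetical placeholders, not artifacts shipped with this repo.
import kenlm
import sentencepiece as spm

# Load an n-gram model, e.g. one trained on French Wikipedia.
model = kenlm.Model("wikipedia.fr.arpa.bin")

# Load the SentencePiece tokenizer matching the model's training tokenization.
sp = spm.SentencePieceProcessor()
sp.load("wikipedia.fr.sp.model")

def perplexity(text: str) -> float:
    """Tokenize with SentencePiece, then score the piece sequence with KenLM."""
    tokenized = " ".join(sp.encode_as_pieces(text))
    return model.perplexity(tokenized)

# Filter both tails: very high perplexity looks nothing like Wikipedia,
# very low perplexity tends to be short, repetitive boilerplate.
LOW, HIGH = 20.0, 1000.0  # illustrative thresholds
samples = [
    "Le Louvre est le plus grand musée d'art du monde.",
    "cliquez ici cliquez ici cliquez ici cliquez ici",
]
kept = [s for s in samples if LOW < perplexity(s) < HIGH]
print(kept)
```

In practice the thresholds would be calibrated per language and dataset, since perplexity scales differ from one model to another.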