philschmid (HF staff) committed
Commit d0d8d15
1 Parent(s): fb27ada

Update README.md

Files changed (1)
  1. README.md +5 -1
README.md CHANGED
@@ -24,6 +24,7 @@ language:
 - ta
 - te
 - yo
+- de
 tags:
 - kenlm
 - perplexity
@@ -37,11 +38,14 @@ datasets:
 duplicated_from: edugp/kenlm
 ---
 
+# Fork of `edugp/kenlm`
+
+* Adds the German Wikipedia model.
+
 # KenLM models
 This repo contains several KenLM models trained on different tokenized datasets and languages.
 KenLM models are probabilistic n-gram language models. One use case for these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlikely to appear on Wikipedia (high perplexity), or very simple, non-informative sentences that could appear repeatedly (low perplexity).
 
-At the root of this repo you will find different directories named after the datasets the models were trained on (e.g. `wikipedia`, `oscar`). Within each directory, you will find several models trained on different language subsets of the dataset (e.g. `en (English)`, `es (Spanish)`, `fr (French)`). For each language you will find three different files:
 * `{language}.arpa.bin`: The trained KenLM model binary
 * `{language}.sp.model`: The trained SentencePiece model used for tokenization
 * `{language}.sp.vocab`: The vocabulary file for the SentencePiece model
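
As a rough sketch of how the three files per language fit together, perplexity could be computed along these lines, assuming the `kenlm` and `sentencepiece` Python packages; the `wikipedia/de.*` paths are illustrative, following the `{dataset}/{language}.*` layout described in the README, and the repo may also ship its own wrapper code.

```python
# Rough sketch: perplexity scoring with one KenLM + SentencePiece model pair.
# Assumes the `kenlm` and `sentencepiece` Python packages; the `wikipedia/de.*`
# paths are illustrative and follow the {dataset}/{language}.* layout above.
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="wikipedia/de.sp.model")  # tokenizer
lm = kenlm.Model("wikipedia/de.arpa.bin")                            # n-gram LM

def perplexity(text: str) -> float:
    # The KenLM model was trained on SentencePiece-tokenized text, so tokenize
    # first and score the space-joined pieces.
    pieces = sp.encode(text, out_type=str)
    log10_prob = lm.score(" ".join(pieces), bos=True, eos=True)
    # Normalize the log10 probability by the token count (+1 for </s>).
    return 10.0 ** (-log10_prob / (len(pieces) + 1))

print(perplexity("Paris ist die Hauptstadt von Frankreich."))  # fluent -> lower perplexity
print(perplexity("asdf qwer zxcv uiop"))                       # gibberish -> higher perplexity
```

When filtering a large corpus, samples whose score falls outside a chosen perplexity range could then be dropped, matching the high/low-perplexity use case described above.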