philschmid (HF staff) committed
Commit d0d8d15
1 Parent(s): fb27ada

Update README.md

Files changed (1)
  1. README.md +5 -1
README.md CHANGED
@@ -24,6 +24,7 @@ language:
 - ta
 - te
 - yo
+- de
 tags:
 - kenlm
 - perplexity
@@ -37,11 +38,14 @@ datasets:
 duplicated_from: edugp/kenlm
 ---
 
+# Fork of `edugp/kenlm`
+
+* Adds the German Wikipedia model.
+
 # KenLM models
 This repo contains several KenLM models trained on different tokenized datasets and languages.
 KenLM models are probabilistic n-gram language models. One use case for these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlikely to appear on Wikipedia (high perplexity), or very simple, non-informative sentences that could appear repeatedly (low perplexity).
 
-At the root of this repo you will find different directories named after the datasets the models were trained on (e.g. `wikipedia`, `oscar`). Within each directory, you will find several models trained on different language subsets of the dataset (e.g. `en (English)`, `es (Spanish)`, `fr (French)`). For each language you will find three different files:
 * `{language}.arpa.bin`: The trained KenLM model binary
 * `{language}.sp.model`: The trained SentencePiece model used for tokenization
 * `{language}.sp.vocab`: The vocabulary file for the SentencePiece model
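
As a rough sketch of how the three files per language fit together, perplexity could be computed along these lines, assuming the `kenlm` and `sentencepiece` Python packages; the `wikipedia/de.*` paths are illustrative, following the `{dataset}/{language}.*` layout described in the README, and the repo may also ship its own wrapper code.

```python
# Rough sketch: perplexity scoring with one KenLM + SentencePiece model pair.
# Assumes the `kenlm` and `sentencepiece` Python packages; the `wikipedia/de.*`
# paths are illustrative and follow the {dataset}/{language}.* layout above.
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="wikipedia/de.sp.model")  # tokenizer
lm = kenlm.Model("wikipedia/de.arpa.bin")                            # n-gram LM

def perplexity(text: str) -> float:
    # The KenLM model was trained on SentencePiece-tokenized text, so tokenize
    # first and score the space-joined pieces.
    pieces = sp.encode(text, out_type=str)
    log10_prob = lm.score(" ".join(pieces), bos=True, eos=True)
    # Normalize the log10 probability by the token count (+1 for </s>).
    return 10.0 ** (-log10_prob / (len(pieces) + 1))

print(perplexity("Paris ist die Hauptstadt von Frankreich."))  # fluent -> lower perplexity
print(perplexity("asdf qwer zxcv uiop"))                       # gibberish -> higher perplexity
```

When filtering a large corpus, samples whose score falls outside a chosen perplexity range could then be dropped, matching the high/low-perplexity use case described above.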