malteos committed on
Commit
f43e254
1 Parent(s): c0c608e

Update README.md

Files changed (1)
  1. README.md +1 -0
README.md CHANGED
@@ -19,6 +19,7 @@ You can try out the model at [European Language Grid](https://live.european-lang
  - ca. 50B German tokens
  - Web-crawled content from the German subset [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult)
  - Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts)
+ - Both Web-crawled datasets are deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
  - German court decisions from [Open Legal Data](http://openlegaldata.io/)
 
  ## Code
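
As a note on the deduplication line added in this commit: the idea behind suffix-array deduplication can be illustrated with a small Python sketch. This is not the linked Google implementation, and it is not the pipeline used for this model. It is a naive toy version that assumes the whole corpus fits in memory as one byte string. The function names (`build_suffix_array`, `duplicate_spans`, `remove_spans`) and the `min_len` threshold are hypothetical and chosen for illustration only.

```python
def build_suffix_array(data: bytes) -> list[int]:
    # Naive construction: sort all suffix start positions lexicographically.
    # Quadratic or worse in practice; acceptable only for a small demo.
    return sorted(range(len(data)), key=lambda i: data[i:])


def lcp(data: bytes, i: int, j: int) -> int:
    # Length of the longest common prefix of the suffixes starting at i and j.
    n = 0
    while i + n < len(data) and j + n < len(data) and data[i + n] == data[j + n]:
        n += 1
    return n


def duplicate_spans(data: bytes, min_len: int) -> list[tuple[int, int]]:
    # Suffixes that share a long prefix end up next to each other in the
    # suffix array, so comparing neighbours is enough to find repeats.
    sa = build_suffix_array(data)
    spans = []
    for a, b in zip(sa, sa[1:]):
        shared = lcp(data, a, b)
        if shared >= min_len:
            # Mark the later occurrence; an identical copy exists earlier.
            start = max(a, b)
            spans.append((start, start + shared))

    # Merge overlapping or touching spans into disjoint [start, end) ranges.
    spans.sort()
    merged: list[tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def remove_spans(data: bytes, spans: list[tuple[int, int]]) -> bytes:
    # Drop the marked byte ranges and keep everything else in order.
    out = bytearray()
    pos = 0
    for start, end in spans:
        out += data[pos:start]
        pos = end
    out += data[pos:]
    return bytes(out)


if __name__ == "__main__":
    corpus = b"the cat sat on the mat. XYZ the cat sat on the mat. END"
    spans = duplicate_spans(corpus, min_len=10)
    print(spans)
    print(remove_spans(corpus, spans))
```

In this sketch a repeat is flagged only when at least `min_len` bytes match exactly, and the later copy is the one removed. Self-overlapping repeats (for example long runs of a single character) are handled crudely, which a production tool would need to treat more carefully.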