teven committed on
Commit
a505bc0
1 Parent(s): f0817b9

Create README.md

  1. README.md +21 -0
README.md ADDED
@@ -0,0 +1,21 @@
+ ---
+ language:
+ - en
+ ---
+
+ V1 of an English/code tokenizer. Byte-level BPE, 64k vocab, with digits split (the difference from v1). Equal mix of:
+ On the NL side:
+ - Books
+ - C4
+ - v1 of our CC (helen quality classifier)
+ - enwiki
+ - Gutenberg
+ - Reddit
+
+ On the code side:
+ - Jupyter notebooks (0.5 weight, since it was small)
+ - GitHub issues
+ - Stack Exchange
+ - The cleaned Python Stack
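As a rough check on the mix, the sketch below tallies the sources listed above. It assumes every source has weight 1 except Jupyter notebooks at 0.5, which is the only weight the README states explicitly; the dictionary keys and variable names are illustrative, not taken from the training setup.

```python
# Assumed sampling weights: 1.0 per source unless the README says otherwise.
nl_weights = {
    "books": 1.0,
    "c4": 1.0,
    "cc_v1": 1.0,
    "enwiki": 1.0,
    "gutenberg": 1.0,
    "reddit": 1.0,
}
code_weights = {
    "jupyter_notebooks": 0.5,  # stated in the README: down-weighted because it was small
    "gh_issues": 1.0,
    "stackexchange": 1.0,
    "python_stack_cleaned": 1.0,
}

# Total weight across both sides, and the fraction that comes from code sources.
total_weight = sum(nl_weights.values()) + sum(code_weights.values())
code_share = sum(code_weights.values()) / total_weight

print(f"total weight: {total_weight}")
print(f"code share:   {code_share:.3f}")
```

Under these assumed weights the code side contributes 3.5 of 9.5 units, a little over one third.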
+
+ This gives a total of 1/3 code data (although Stack Exchange and GitHub issues contain a lot of English).
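To make "byte-level" and "split digits" concrete, here is a minimal standard-library sketch of the pre-tokenization step only. It is not the author's training code, and the function names `split_digits` and `to_byte_pieces` are hypothetical. The idea is that every ASCII digit becomes its own piece before BPE merges are learned, and each piece is then viewed as UTF-8 bytes so the base alphabet has at most 256 symbols.

```python
import re


def split_digits(text: str) -> list[str]:
    """Split text so that each ASCII digit is its own piece."""
    # [0-9] matches one digit at a time; [^0-9]+ grabs each run of non-digits.
    return re.findall(r"[0-9]|[^0-9]+", text)


def to_byte_pieces(text: str) -> list[bytes]:
    """Encode each piece as UTF-8 bytes, the symbols a byte-level BPE starts from."""
    return [piece.encode("utf-8") for piece in split_digits(text)]


pieces = split_digits("x = 2023 + 7")
print(pieces)
print(to_byte_pieces("é1"))
```

Because digits are isolated before any merges are learned, a number such as 2023 can never collapse into a single multi-digit token. If the Hugging Face `tokenizers` library is used (the README does not say which library was), the corresponding building blocks would be `pre_tokenizers.Digits(individual_digits=True)` and `pre_tokenizers.ByteLevel`.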