---
language:
- en
---

V1 of an English/code tokenizer, trained on an equal mix of the sources below.

On the natural-language (NL) side:
- Books
- C4
- v1 of our Common Crawl (CC) data (helen quality classifier)
- enwiki
- Gutenberg
- Reddit

On the code side:
- Jupyter notebooks (0.5 weight, since the dataset was small)
- GitHub issues
- Stack Exchange
- The cleaned Python Stack

This gives roughly 1/3 code data in total, although the Stack Exchange and GitHub issues data also contain a lot of English.
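
As a rough check on that proportion, the sketch below sums per-source weights. It assumes that "equal mix" means a weight of 1.0 for every listed source except Jupyter notebooks at 0.5; the dictionary names and keys are illustrative and not part of the tokenizer's training code.

```python
# Assumed sampling weights: 1.0 per source, except Jupyter notebooks (0.5).
# These values are an interpretation of the list above, not a published config.
nl_weights = {
    "books": 1.0,
    "c4": 1.0,
    "cc_v1": 1.0,
    "enwiki": 1.0,
    "gutenberg": 1.0,
    "reddit": 1.0,
}
code_weights = {
    "jupyter_notebooks": 0.5,
    "gh_issues": 1.0,
    "stackexchange": 1.0,
    "python_stack_cleaned": 1.0,
}

nl_total = sum(nl_weights.values())
code_total = sum(code_weights.values())

# Share of the total mix that comes from the code-side sources.
code_fraction = code_total / (nl_total + code_total)
print(f"code fraction: {code_fraction:.3f}")
```

Under these assumed weights the code side comes to 3.5 out of 9.5, a little over one third.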