Update README.md
README.md
@@ -12,7 +12,8 @@ Compact sentencepiece tokenizer for sample-efficient English language modeling.
 
 This tokeniser is derived from the BabyLM 100M dataset of mixed domain data, consisting of the following sources:
 - CHILDES (child-directed speech)
-- Subtitles (speech)
+- Subtitles (speech)
+- BNC (speech)
 - TED talks (speech)
 - children's books (simple written language).
 