nilq committed on
Commit
33886d8
1 Parent(s): 3b07e52

Update README.md

Files changed (1)
  1. README.md +5 -2
README.md CHANGED
@@ -2,11 +2,14 @@
  license: mit
  language:
  - en
+ tags:
+ - babylm
+ - tokenizer
  ---

  ## Baby Tokenizer

- Compact sentencepiece tokenizer for sample-efficient English language modeling.
+ Compact sentencepiece tokenizer for sample-efficient English language modeling, simply tokenizing natural language.

  ### Data

@@ -21,4 +24,4 @@ This tokeniser is derived from the BabyLM 100M dataset of mixed domain data, con

  - Vocabulary size: 20k
  - Alphabet limit: 150
- - Minimum token frequency: 5
+ - Minimum token frequency: 100
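
The change from a minimum token frequency of 5 to 100 can be illustrated with a small sketch. It assumes the setting works like the merge threshold in BPE-style trainers, where a symbol pair is only eligible for merging if it occurs at least that many times; the README does not confirm which trainer was used. The function name `frequent_pairs` and the toy word counts are hypothetical and not taken from the BabyLM data.

```python
from collections import Counter


def frequent_pairs(word_counts, min_frequency):
    """Return adjacent symbol pairs whose corpus count reaches min_frequency.

    Simplified model of one BPE counting pass (an assumption, not the
    author's actual training code): each word contributes its adjacent
    character pairs, weighted by how often the word occurs.
    """
    pair_counts = Counter()
    for word, count in word_counts.items():
        symbols = list(word)
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += count
    # Keep only pairs frequent enough to be eligible for a merge.
    return {pair: n for pair, n in pair_counts.items() if n >= min_frequency}


# Hypothetical toy word counts for illustration only.
toy_counts = {"the": 120, "then": 40, "zebra": 6}

loose = frequent_pairs(toy_counts, min_frequency=5)     # old threshold
strict = frequent_pairs(toy_counts, min_frequency=100)  # new threshold

print(len(loose), "pairs eligible at min frequency 5")
print(len(strict), "pairs eligible at min frequency 100")
```

Under this assumption, the higher threshold keeps pairs from rare words such as "zebra" out of the merge candidates, so the vocabulary is spent on frequent patterns.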