---
license: mit
language:
- en
tags:
- babylm
- tokenizer
datasets:
- nilq/babylm-100M
---

## Baby Tokenizer

A compact SentencePiece tokenizer for sample-efficient English language modeling, tokenizing plain natural language.

### Usage

#### Transformers

```py
from transformers import AutoTokenizer

tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")
```

#### Tokenizers

```py
from tokenizers import Tokenizer

tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
```

### Data

This tokenizer is derived from the BabyLM 100M dataset of mixed-domain data, consisting of the following sources:

- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- children's books (simple written language)

### Specifications

- Vocabulary size: 20k
- Alphabet limit: 150
- Minimum token frequency: 100
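
These specifications map directly onto the training arguments of the `tokenizers` library's `SentencePieceBPETokenizer`. As a rough, hypothetical sketch (not the original training script; the input path is a placeholder assumption), a tokenizer with these settings could be trained like this:

```py
from tokenizers import SentencePieceBPETokenizer

# Hypothetical sketch: trains a SentencePiece-style BPE tokenizer with the
# specifications listed above. The input file is a placeholder, not the
# actual BabyLM training setup.
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["babylm_100M.txt"],  # placeholder: concatenated BabyLM 100M text
    vocab_size=20_000,          # vocabulary size: 20k
    limit_alphabet=150,         # alphabet limit: 150
    min_frequency=100,          # minimum token frequency: 100
)
tokenizer.save("baby-tokenizer.json")
```

The saved `baby-tokenizer.json` can then be loaded back with `Tokenizer.from_file("baby-tokenizer.json")`.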