ULM-32k SlimPajama-3M

A ULM (unigram language model) tokeniser with a vocabulary size of 32768, trained on the first 3 million examples of SlimPajama-627B.
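Assuming the repository ships a standard SentencePiece model file (the filename below is a hypothetical placeholder, not necessarily the repo's actual file), the tokeniser can be loaded with the SentencePiece runtime:

```python
# Hedged sketch: download and load the ULM model with the SentencePiece
# runtime. The actual filename in the repository may differ.
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download(
    repo_id="Bauwens/ULM-32k_SlimPajama-3M",
    filename="ulm32k.model",  # hypothetical filename
)
sp = spm.SentencePieceProcessor(model_file=model_path)
print(sp.encode("Tokenisers segment text into subwords.", out_type=str))
```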

Tokeniser details

ULM trainer implementation:
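The exact trainer invocation is not recorded here. As a minimal sketch, assuming the standard SentencePiece unigram trainer (ULM is SentencePiece's unigram model type), a 32k vocabulary would be trained along these lines:

```python
# Hedged sketch of a 32k ULM training run; the input filename and all
# options other than model_type and vocab_size are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="slimpajama_3M.txt",  # hypothetical pre-extracted plain-text corpus
    model_prefix="ulm32k",
    model_type="unigram",       # the ULM algorithm
    vocab_size=32768,
)
```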

Preprocessor (a simplified code sketch follows this list):

  • During training: TkTkT's SentencePiecePreprocessor
  • During inference: TkTkT's ModernEnglishPreprocessor
    1. NFKC normalisation
    2. Punctuation splitter, whitespace splitter, English contraction splitter
    3. GPT-2's pseudo-byte mapping
    4. Start-of-word marker Ġ
    5. Digit and hyphen isolation
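A minimal Python sketch of these inference-time steps, using a much-simplified stand-in for TkTkT's actual splitters (contraction splitting here falls out of the punctuation split, and the Ġ marker is applied uniformly rather than per word boundary):

```python
import re
import unicodedata

def bytes_to_unicode() -> dict[int, str]:
    # GPT-2's pseudo-byte mapping: every byte gets a printable character;
    # e.g. the space byte 0x20 maps to "Ġ".
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_MAP = bytes_to_unicode()

def preprocess(text: str) -> list[str]:
    text = unicodedata.normalize("NFKC", text)  # step 1
    # Steps 2 and 5, approximated with one regex: split on whitespace and
    # punctuation, isolating every digit and hyphen as its own pre-token.
    pretokens = re.findall(r"[^\W\d_]+|\d|-|[^\w\s-]+", text)
    # Steps 3 and 4: map each pre-token to pseudo-bytes and prepend the
    # start-of-word marker Ġ.
    return ["Ġ" + "".join(BYTE_MAP[b] for b in t.encode("utf-8"))
            for t in pretokens]

print(preprocess("Tokenisers aren't magic in 2024!"))
```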

Training details

Time: 3h40m

  • Preprocessing and counting the 3M corpus: 2h45m
  • ULM algorithm: 55m

Memory: 257 GiB peak usage (roughly 85 GiB of RAM per million examples considered).

Data sizes (a sketch of the length filter follows this list):

  • Examples considered: 3 000 000
  • Examples used: 2 609 893 (390 107 examples dropped for being > 8192 characters).
  • Characters counted: 6 685 212 190
  • Unique words after whitespace splitting: 9 254 839
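The kept and dropped counts add up to the 3 000 000 considered. A hedged sketch of the length filter, streaming SlimPajama with the datasets library (the "text" field name is assumed):

```python
# Stream the first 3M SlimPajama examples and drop any longer than
# 8192 characters, mirroring the counts reported above.
from datasets import load_dataset

stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

kept = dropped = chars = 0
for i, example in enumerate(stream):
    if i == 3_000_000:
        break
    text = example["text"]
    if len(text) > 8192:
        dropped += 1
    else:
        kept += 1
        chars += len(text)

print(kept, dropped, chars)  # card: 2 609 893 kept, 390 107 dropped
```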