finewebedu_32000

About

🇬🇧 An English tokenizer, trained on the FineWeb-Edu dataset.

Description

This is a (mainly) character-level English (en) tokenizer, trained on the CC-MAIN-2024-10 subset of FineWeb-Edu. Its vocabulary size of 32,000 is a multiple of 128, which keeps the embedding and output-layer dimensions of a model aligned for efficient matrix multiplication on typical accelerators.
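The alignment claim can be checked with a little arithmetic. The sketch below is illustrative only: the helper names `is_aligned` and `pad_to_multiple` are hypothetical and not part of this repository. The second call uses 50,257 (the GPT-2 vocabulary size) purely as a contrasting example of a size that would need padding.

```python
def is_aligned(vocab_size, multiple=128):
    """Return True if vocab_size is an exact multiple of `multiple`."""
    return vocab_size % multiple == 0


def pad_to_multiple(vocab_size, multiple=128):
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return -(-vocab_size // multiple) * multiple  # ceiling division


# 32,000 = 250 * 128, so no padding rows are needed in the embedding matrix.
print(is_aligned(32000))        # True
print(pad_to_multiple(32000))   # 32000

# A size that is not aligned would have to be padded up.
print(is_aligned(50257))        # False
print(pad_to_multiple(50257))   # 50304
```

Whether alignment yields a measurable speedup depends on the hardware and kernels in use, so treat the multiple of 128 as a convenient default, not a guarantee.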

Usage

import tokenizers

# Load the tokenizer from the Hugging Face Hub
tokenizer = tokenizers.Tokenizer.from_pretrained("gvlassis/finewebedu_32000")
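Once loaded, the tokenizer can be used to encode and decode text. The following is a minimal sketch, assuming the `tokenizers` library is installed and the Hub is reachable. The helpers `load_tokenizer`, `round_trip`, and `main` are hypothetical names introduced here for illustration, and the sample sentence is arbitrary. No expected output is shown, since it depends on the trained vocabulary.

```python
def load_tokenizer(repo_id="gvlassis/finewebedu_32000"):
    """Download the tokenizer from the Hugging Face Hub (or load it from the local cache)."""
    import tokenizers  # imported here so round_trip works with any compatible object
    return tokenizers.Tokenizer.from_pretrained(repo_id)


def round_trip(tokenizer, text):
    """Encode text, then decode the ids back; return (ids, tokens, decoded text)."""
    encoding = tokenizer.encode(text)
    decoded = tokenizer.decode(encoding.ids)
    return encoding.ids, encoding.tokens, decoded


def main():
    tokenizer = load_tokenizer()
    print("Vocabulary size:", tokenizer.get_vocab_size())

    ids, tokens, decoded = round_trip(tokenizer, "Photosynthesis converts light into chemical energy.")
    print("Token ids:", ids)
    print("Tokens:   ", tokens)
    print("Decoded:  ", decoded)
```

Calling `main()` downloads the tokenizer on first use and prints the vocabulary size along with the ids, tokens, and decoded text for the sample sentence.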