Custom Tokenizer

Examples

Example sentence: This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code 4 spaces. and a backslash!! Eléonore est un prénom français. __name__ isInstance

Encoded sentence: ['▁This', '▁is', '▁a', '▁test', '▁sent', 'ence.', '▁On', '▁va', '▁voir', '▁comment', '▁elle', '▁est', '▁g', 'érée', '▁....', '▁', '1', '2', '3', '▁+', '▁', '5', '6', '▁=', '▁', '2', '5', '6', '7', '.', "▁Let's", '▁go', '!', '▁Im', 'ag', 'ine', '▁I', '▁have', '▁code', '▁', '▁', '▁', '▁', '4', '▁spaces', '.\n', '▁and', '▁a', '▁', '▁', '▁', '▁', '▁', '▁back', 'sl', 'ash', '!!', '▁El', 'éon', 'ore', '▁est', '▁un', '▁prénom', '▁français.', '▁__name__', '▁is', 'Instance']

Decoded sentence: <s> This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code 4 spaces. and a backslash!! Eléonore est un prénom français. __name__ isInstance

Usage


from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained('<tok_name>')

tok.tokenize('This is a test sentence')
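The snippet below extends this into a full encode/decode round trip that reproduces the listing in the Examples section. It is a minimal sketch: '<tok_name>' is a placeholder for this repository's id on the Hub, and the example string is shortened.

from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained('<tok_name>')  # placeholder repo id

sentence = "This is a test sentence. On va voir comment elle est gérée ..."

pieces = tok.tokenize(sentence)  # subword pieces such as '▁This', '▁is', '▁a', ...
ids = tok.encode(sentence)       # token ids, with the <s> BOS token prepended by default
decoded = tok.decode(ids)        # '<s> This is a test sentence. On va voir ...'

print(pieces)
print(decoded)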

Dataset Stats

The tokenizer is trained on samples from the manu/tok-corpus-shuffled dataset.

The dataset consists of French, English, and code samples.

More information on the dataset can be found on its dataset card.

For speed, the tokenizer was trained on a subset of the dataset; only the first samples were selected (see the sampling sketch after the stats below).

Sample size: 5,000,000

Size of sampled data: 19.0 GB
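The sampling script itself is not included in this card. As a minimal sketch, the first 5,000,000 samples could be streamed from manu/tok-corpus-shuffled with the datasets library as below; the 'train' split and the 'text' column name are assumptions about the dataset layout.

from itertools import islice

from datasets import load_dataset

# Stream the corpus so the ~19 GB sample never has to be fully downloaded at once.
stream = load_dataset("manu/tok-corpus-shuffled", split="train", streaming=True)

def iter_texts(n_samples=5_000_000):
    # Yield the text of the first n_samples examples, mirroring the stats above.
    for example in islice(stream, n_samples):
        yield example["text"]  # assumed column name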

Tokenizer Configs

Built from scratch: True

Pretrained tokenizer: None

The tokenizer is trained with digit separation, whitespace preservation (for code), byte fallback, and related settings; a rough training sketch is shown below.
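The exact training script is not published with this card. As a rough, hedged sketch only, a configuration like the one above could be approximated with the tokenizers library as follows; the vocabulary size and special tokens are assumptions, and iter_texts is the sampling helper sketched in the Dataset Stats section.

from tokenizers import Tokenizer, decoders, pre_tokenizers, trainers
from tokenizers.models import BPE
from transformers import LlamaTokenizerFast

# BPE model with byte fallback, so characters missing from the vocab decompose into byte pieces.
tokenizer = Tokenizer(BPE(unk_token="<unk>", byte_fallback=True))

tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Metaspace(),                     # SentencePiece-style '▁' markers; each space keeps its own piece (code indentation)
    pre_tokenizers.Digits(individual_digits=True),  # digit separation: numbers are split into single digits
])
tokenizer.decoder = decoders.Metaspace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                              # assumed; the real value is not stated in this card
    special_tokens=["<unk>", "<s>", "</s>"],
)
tokenizer.train_from_iterator(iter_texts(), trainer=trainer)

# Byte fallback also needs the 256 <0x00>..<0xFF> byte pieces in the vocabulary
# (as in the LLaMA vocab); registering them as added tokens is one way to get them.
tokenizer.add_tokens([f"<0x{i:02X}>" for i in range(256)])

# Wrap and save so it can be loaded as in the Usage section.
LlamaTokenizerFast(tokenizer_object=tokenizer).save_pretrained("custom-tokenizer")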
