Model Card for azerbaijani-tinystories-tokenizer

This is a tokenizer for the Azerbaijani translation of the TinyStories dataset.

Model Details

Model Description

This tokenizer was trained on a version of the TinyStories dataset translated into Azerbaijani. I trained a byte-fallback BPE tokenizer on this dataset, using parameters similar to those of the Mistral tokenizer. As in SentencePiece, a "▁" marker (often written as "_") is placed at the beginning of each word, so sub-word pieces that start a word carry it as a prefix.
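To make the two ideas above concrete, here is a minimal pure-Python sketch of word-start marking and byte fallback. It is an illustration only, not the tokenizer's actual code: the helper names (`metaspace_split`, `byte_fallback`) and the toy character vocabulary are hypothetical, and no BPE merges are applied.

```python
MARKER = "\u2581"  # the "▁" word-start marker used by SentencePiece-style tokenizers


def metaspace_split(text, marker=MARKER):
    """Split on whitespace and prefix every word with the marker."""
    return [marker + word for word in text.split()]


def byte_fallback(piece, vocab):
    """Map each character to itself if known, else to one <0xNN> token per UTF-8 byte."""
    tokens = []
    for ch in piece:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


# Hypothetical toy vocabulary: single characters only, no merged pieces.
vocab = set(MARKER + "pişkbağ")

pieces = metaspace_split("pişik bağa Ж")
tokens = [tok for piece in pieces for tok in byte_fallback(piece, vocab)]
print(tokens)
```

The Cyrillic letter "Ж" is not in the toy vocabulary, so it is emitted as its two UTF-8 bytes rather than as an unknown token. This is the property that byte fallback is meant to provide.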

Developed by: Javidan Aslanli
Language(s) (NLP): Azerbaijani
License: Apache License 2.0

Training Details

Training Data

The TinyStories dataset, translated into Azerbaijani.

Training

This is a byte-fallback BPE tokenizer. The setup I used is:

The normalizers are the same as those of Mistral's tokenizer.
I used a Metaspace pre-tokenizer before training BPE.
For training I enabled byte fallback; the other parameters are the same as Mistral's.
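The setup described above could be sketched with the Hugging Face `tokenizers` library roughly as follows. This is a sketch under assumptions, not the exact training script: the two-sentence corpus, the vocabulary size, and the special tokens `<unk>`, `<s>`, `</s>` are placeholders, and the normalizers are omitted because this card does not list them. Registering the 256 `<0xNN>` byte tokens through `special_tokens` is one common way to make sure they are in the vocabulary; it may differ from what was actually done here.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Placeholder corpus; the real training data is the translated TinyStories set.
corpus = [
    "Bir gün balaca pişik bağa getdi.",
    "Pişik bağda qırmızı top tapdı.",
]

# BPE model with byte fallback: characters missing from the vocabulary
# are encoded as <0xNN> byte tokens instead of <unk>.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>", byte_fallback=True, fuse_unk=True))

# Metaspace pre-tokenizer: splits on whitespace and marks word starts with "▁".
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

# Byte fallback only works if all 256 byte tokens exist in the vocabulary.
byte_tokens = [f"<0x{i:02X}>" for i in range(256)]
trainer = trainers.BpeTrainer(
    vocab_size=1000,  # assumed value, for illustration only
    special_tokens=["<unk>", "<s>", "</s>"] + byte_tokens,
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("Pişik bağa getdi.")
print(encoding.tokens)

# "Ж" never appears in the corpus, so it should fall back to its UTF-8 bytes.
fallback = tokenizer.encode("Ж")
print(fallback.tokens)
```

Because the byte tokens are registered as special tokens in this sketch, decoding with the default `skip_special_tokens=True` would drop them; a real setup would need to account for that.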

khaosai/azerbaijani-tinystories-tokenizer