Model Card for khaosai/azerbaijani-tinystories-tokenizer

This is a tokenizer for a version of the TinyStories dataset translated into Azerbaijani.

Model Details

Model Description

This tokenizer was trained on a version of the TinyStories dataset translated into Azerbaijani. I trained a byte-fallback BPE tokenizer on this dataset, using parameters similar to those of Mistral's tokenizer. As in SentencePiece, the "▁" character (U+2581, often typed as "_") marks the beginning of a word in the sub-word pieces.
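
To make the word-start marker and byte fallback concrete, here is a minimal pure-Python sketch. It is not the trained tokenizer: it uses greedy longest-match against a tiny hypothetical vocabulary (`toy_vocab`) instead of learned BPE merges, and the function name `to_pieces` is made up for illustration. It only shows how a word-initial marker is attached and how a character missing from the vocabulary is emitted as UTF-8 byte tokens.

```python
def to_pieces(text, vocab):
    """Segment text into pieces: word-start marker plus byte fallback.

    Illustration only: greedy longest-match stands in for real BPE merges.
    """
    # Mark word starts the SentencePiece way: spaces become U+2581.
    marked = "\u2581" + text.replace(" ", "\u2581")
    pieces = []
    i = 0
    while i < len(marked):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab:
                pieces.append(marked[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary covers this character:
            # fall back to one token per UTF-8 byte.
            pieces.extend(f"<0x{b:02X}>" for b in marked[i].encode("utf-8"))
            i += 1
    return pieces


# Hypothetical vocabulary; a real one holds thousands of pieces.
toy_vocab = {"\u2581", "\u2581Bir", "\u2581gün"}
pieces = to_pieces("Bir gün 🙂", toy_vocab)
print(pieces)
```

The emoji is not in `toy_vocab`, so it comes out as four byte tokens rather than an unknown token, which is the point of byte fallback.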

  • Developed by: Javidan Aslanli
  • Language(s) (NLP): Azerbaijani
  • License: Apache License 2.0

Training Details

Training Data

The TinyStories dataset, translated into Azerbaijani.

Training:

This is a byte-fallback BPE tokenizer. I configured it as follows:

  • The normalizers are the same as those of Mistral's tokenizer.
  • I used a Metaspace pre-tokenizer before training BPE.
  • For training I used byte fallback; the other parameters are the same as Mistral's.
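
The setup listed above can be approximated with the Hugging Face `tokenizers` library. This is a sketch under stated assumptions, not the exact training script: the three-sentence corpus, the vocabulary size of 400, and the `<unk>`, `<s>`, `</s>` special tokens are placeholders chosen for illustration, and the normalizer is left out because this card does not spell it out.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical toy corpus; the real tokenizer was trained on translated TinyStories.
corpus = [
    "Bir gün balaca qız parka getdi.",
    "O, orada bir it gördü.",
    "Qız və it birlikdə oynadılar.",
]

# Byte fallback only works if the byte tokens <0x00>..<0xFF> are in the vocabulary,
# so they are registered up front together with the (assumed) special tokens.
byte_tokens = [f"<0x{i:02X}>" for i in range(256)]
special_tokens = ["<unk>", "<s>", "</s>"] + byte_tokens

# BPE model with byte fallback enabled.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>", byte_fallback=True, fuse_unk=True))

# Metaspace pre-tokenizer: splits on spaces and marks word starts with U+2581.
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

# vocab_size is an assumption sized for the toy corpus, not the real value.
trainer = trainers.BpeTrainer(
    vocab_size=400,
    special_tokens=special_tokens,
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# The emoji never appears in the corpus, so it is encoded as byte tokens.
tokens = tokenizer.encode("Bir gün 🙂").tokens
print(tokens)
```

Because the byte tokens are registered as special tokens in this sketch, decoding with the default `skip_special_tokens=True` would drop them; a full setup would also need a matching decoder, which is out of scope here.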

Dataset used to train khaosai/azerbaijani-tinystories-tokenizer