Kyrgyz T5 Tokenizer
This is a Unigram SentencePiece tokenizer. It follows the T5 tokenizer format and is designed for use with mT5, ByT5, or other T5 models fine-tuned for the Kyrgyz language.
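As background, a Unigram SentencePiece model assigns a probability to every vocabulary piece and segments text by picking the sequence of pieces with the highest total probability, typically via a Viterbi search. The sketch below illustrates this with a tiny made-up vocabulary and invented scores; the real tokenizer's vocabulary and probabilities are learned by SentencePiece from a Kyrgyz corpus.

```python
import math

# Toy vocabulary with made-up log-probabilities, for illustration only.
# "▁" is SentencePiece's word-boundary marker.
log_probs = {
    "▁кыргыз": math.log(0.05),
    "▁кыр": math.log(0.02),
    "гыз": math.log(0.01),
    "стан": math.log(0.04),
    "с": math.log(0.005),
    "т": math.log(0.005),
    "а": math.log(0.02),
    "н": math.log(0.02),
}

def viterbi_segment(text, log_probs):
    """Return the most probable segmentation of `text` under a Unigram model."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end of the string to recover the chosen pieces
    pieces, i = [], n
    while i > 0:
        start = best[i][1]
        pieces.append(text[start:i])
        i = start
    return pieces[::-1]

print(viterbi_segment("▁кыргызстан", log_probs))  # ['▁кыргыз', 'стан']
```

Because "▁кыргыз" + "стан" has a higher joint probability than any character-by-character split, the Unigram model keeps the word in two large pieces rather than many small ones.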
This tokenizer is useful for a range of Kyrgyz NLP tasks, particularly as a drop-in tokenizer for T5-family models.
Users should evaluate the tokenizer on their own datasets to ensure it works well for their use cases.
You can load and use this tokenizer with the following code:
from transformers import AutoTokenizer
# Load the tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("metinovadilet/kyrgyz_t5_tokenizer")
# Example Kyrgyz sentence
text = "Кыргызстандагы илимий долбоорлор өнүгүп жатат."
# Tokenize the text
tokens = tokenizer(text, return_tensors="pt")
# Print token IDs
print("Token IDs:", tokens["input_ids"][0].tolist())
# Print individual tokenized words
decoded_tokens = [tokenizer.decode([tid]) for tid in tokens["input_ids"][0].tolist()]
print("Tokenized Words:", decoded_tokens)
# Print segmentation
print("\nWord Segmentation Breakdown:")
for token_id in tokens["input_ids"][0].tolist():
    token = tokenizer.decode([token_id])
    print(f"'{token}' -> ID: {token_id}")
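When evaluating the tokenizer on your own datasets, one simple sanity check is fertility: the average number of subword pieces produced per whitespace-separated word. The helper below is a minimal sketch; `toy_tokenize` is a stand-in used here only so the example is self-contained. With the real tokenizer you would pass something like `lambda s: tokenizer.tokenize(s)` after loading it as shown above.

```python
def fertility(sentences, tokenize):
    """Average subword pieces per whitespace word.

    Values near 1 mean most words survive as single pieces;
    high values mean the tokenizer splits words heavily.
    """
    total_pieces = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_pieces / total_words

# Stand-in tokenizer for illustration: splits every word roughly in half.
def toy_tokenize(s):
    pieces = []
    for word in s.split():
        mid = max(1, len(word) // 2)
        rest = word[mid:]
        pieces.extend([word[:mid], rest] if rest else [word[:mid]])
    return pieces

print(fertility(["илим өнүгүп жатат"], toy_tokenize))  # 2.0
```

Comparing fertility between this tokenizer and a general-purpose multilingual one on your Kyrgyz text is a quick way to see whether the vocabulary fits your domain.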
If you use this tokenizer, please cite:
@misc{metinovadilet2025kyrgyzT5tokenizer,
title={Kyrgyz T5 Tokenizer},
author={Metinov Adilet},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/metinovadilet/kyrgyz_t5_tokenizer}
}
Contact: metinovadilet@gmail.com