Tokenizer for Uzbek Language
Introduction
This tokenizer is based on data from the Mozilla Common Voice dataset. It was built from 130,000 sentences in the train and validated splits.
Features
- Splits text into tokens.
- Supports less common pronunciations and accents.
Installation
Python and the required libraries:
pip install transformers datasets
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")
text = "O'zbekistonda turli xil NLP loyihalari qurilmoqda"
tokens = tokenizer.tokenize(text)
print(tokens)
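Beyond splitting text into tokens, it is often useful to map text to token IDs and back. The sketch below shows one way to do that with the standard `encode` and `decode` methods of a Transformers tokenizer. The helper names `roundtrip` and `load_and_roundtrip` are hypothetical and introduced only for illustration; they are not part of this repository.

```python
def roundtrip(tokenizer, text):
    """Encode text to token IDs, then decode the IDs back to a string."""
    # encode() adds the tokenizer's special tokens by default
    ids = tokenizer.encode(text)
    # skip_special_tokens=True drops those special tokens from the decoded text
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded


def load_and_roundtrip(text, repo_id="jamshidahmadov/uz_tokenizer"):
    """Load the tokenizer from the Hub (needs network access) and round-trip text."""
    # Imported here so roundtrip() can be reused with any tokenizer object
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return roundtrip(tokenizer, text)
```

Calling `load_and_roundtrip("O'zbekistonda turli xil NLP loyihalari qurilmoqda")` downloads the tokenizer and returns the ID list together with the decoded string. The decoded text may differ slightly from the input, depending on how the tokenizer normalizes whitespace and punctuation.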
Dataset Description
The Common Voice 17.0 dataset is multilingual and includes Uzbek among its supported languages.
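For readers curious how a tokenizer can be built from a sentence corpus like this one, here is a minimal sketch using the `tokenizers` library (installed alongside `transformers`). It is not the procedure used for this repository, which the card does not describe. The BPE model, the `WhitespaceSplit` pre-tokenizer, the special tokens, the vocabulary size, and the function names `batch_iterator` and `train_tokenizer` are all assumptions chosen for illustration.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import BpeTrainer


def batch_iterator(sentences, batch_size=1000):
    """Yield the corpus in batches so it is not passed to the trainer all at once."""
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


def train_tokenizer(sentences, vocab_size=1000):
    """Train a small BPE tokenizer on an in-memory list of sentences (illustrative only)."""
    tokenizer = Tokenizer(BPE(unk_token="<unk>"))
    # Split on whitespace only, so words such as "o'zbek" keep their apostrophe
    tokenizer.pre_tokenizer = WhitespaceSplit()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<unk>", "<pad>"],  # assumed special tokens
        show_progress=False,
    )
    tokenizer.train_from_iterator(
        batch_iterator(sentences), trainer=trainer, length=len(sentences)
    )
    return tokenizer
```

With a real corpus, `sentences` would hold the text column of the train and validated splits. A call such as `train_tokenizer(sentences, vocab_size=32000).encode("o'zbek tili").tokens` then shows the learned segmentation.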
Contact
Model tree for jamshidahmadov/uz_tokenizer
Base model
FacebookAI/xlm-roberta-base