Uzbek

Tokenizer for Uzbek Language

Introduction

This tokenizer is based on data from the Mozilla Common Voice dataset: about 130,000 sentences from the train and validated splits combined.

Features

  • Splits text into tokens.
  • Supports less common pronunciations and accents.

Installation

Install Python and the required libraries:

pip install transformers datasets

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")

# Uzbek sample sentence: "Various NLP projects are being built in Uzbekistan"
text = "O'zbekistonda turli xil NLP loyihalari qurilmoqda"
tokens = tokenizer.tokenize(text)
print(tokens)
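
Beyond listing tokens, it is often useful to check the token IDs and confirm that decoding recovers the input. The sketch below is one way to do that. It assumes the tokenizer loads as a standard `transformers` tokenizer exposing `encode`, `convert_ids_to_tokens`, and `decode`. The helper names `round_trip` and `main` are hypothetical and not part of this repository, and whether decoding reproduces the text exactly depends on the tokenizer's normalization settings, which are not documented here.

```python
from typing import Any, Dict


def round_trip(tokenizer: Any, text: str) -> Dict[str, Any]:
    """Encode text to IDs, map the IDs back to tokens, and decode them.

    Hypothetical helper: works with any object that offers the standard
    transformers tokenizer methods used below.
    """
    # Skip special tokens so the IDs correspond only to the input text.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return {
        "ids": ids,
        "tokens": tokenizer.convert_ids_to_tokens(ids),
        "decoded": tokenizer.decode(ids),
    }


def main() -> None:
    # Imported here so the helper above can be reused without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")
    result = round_trip(tokenizer, "O'zbekistonda turli xil NLP loyihalari qurilmoqda")
    print(result["tokens"])
    print(result["ids"])
    print(result["decoded"])
```

Calling `main()` downloads the tokenizer from the Hub and prints the tokens, their IDs, and the decoded string for the sample sentence.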

Dataset Description

The Common Voice 17.0 dataset is multilingual and includes Uzbek.

Contact

Jamshid Ahmadov
