---
license: mit
language:
- ky
tags:
- tokenization
- BPE
- kyrgyz
- tokenizer
---
A tokenizer for the Kyrgyz language, built with SentencePiece using Byte Pair Encoding (BPE). Its 50,000-subword vocabulary is intended to cover a broad range of Kyrgyz NLP tasks. The tokenizer was developed in collaboration with UlutSoft LLC to reflect authentic Kyrgyz language usage.
Features:

- Language: Kyrgyz
- Vocabulary Size: 50,000 subwords
- Method: SentencePiece (BPE)

Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots.
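
For readers unfamiliar with BPE, the following toy sketch illustrates the core idea of learning merges from pair frequencies. It is not the SentencePiece implementation and does not reproduce this tokenizer's vocabulary; the function names (`most_frequent_pair`, `merge_pair`, `learn_merges`) and the three-word corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from a whitespace-split toy corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Illustrative corpus only (forms of "тил", "language").
merges, words = learn_merges("тил тили тилде", num_merges=2)
print(merges)
print(list(words))
```

With this corpus, the shared stem is merged first, so related word forms end up sharing a subword. A production tokenizer applies the same principle to a far larger corpus and vocabulary.
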
Usage Example (Python with transformers):

```python
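from transformers import AutoTokenizer

# Replace the placeholder with the tokenizer's repository ID or a local directory.
tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")
text = "Кыргыз тили – бай жана кооз тил."

# Calling the tokenizer returns an encoding that includes `input_ids`.
encoding = tokenizer(text)
print(encoding)
# Map the IDs back to subword pieces to inspect the segmentation.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))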
```
Tip: Applying consistent text normalization (and, where the task allows, lemmatization) during preprocessing may further improve downstream results.
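
As one possible reading of the normalization tip, the sketch below applies Unicode NFC composition and whitespace cleanup before tokenization. The helper name `normalize_text` and the choice of steps are assumptions, not part of this tokenizer; whatever normalization you choose should be applied the same way at training and inference time.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical preprocessing helper: NFC composition plus whitespace cleanup."""
    # NFC composes a base letter and a combining mark into one code point,
    # e.g. "и" + U+0306 (combining breve) becomes "й".
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace (including non-breaking spaces) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(normalize_text("  Кыргыз\u00a0 тили  "))
```

The normalized string can then be passed to the tokenizer in place of the raw text.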

License and Attribution
This tokenizer is licensed under the MIT License and was developed in collaboration with UlutSoft LLC. Proper attribution is required when using this tokenizer or derived resources.

Feedback and Contributions
We welcome feedback, suggestions, and contributions! Please open an issue or a pull request in the repository to help us refine and enhance this resource.