metinovadilet/KyrgyzTokenizer-BPE-50k


metinovadilet committed on Dec 13, 2024

Commit c2b39a5 · verified · 1 Parent(s): dbf297d

Update README.md

Files changed (1)

README.md +7 -11


@@ -8,15 +8,14 @@ tags:
 - kyrgyz
 - tokenizer
 ---
-This tokenizer is designed for the Kyrgyz language and uses SentencePiece with Byte Pair Encoding (BPE). It includes a 50,000-subword vocabulary. Developed in cooperation with UlutSoft LLC, it reflects common Kyrgyz language usage and aims to produce precise tokenization for downstream NLP tasks.
 Features:
 Language: Kyrgyz
 Vocabulary Size: 50,000 subwords
 Method: SentencePiece (BPE)
-Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots, and morphological or syntactic analysis.
 Usage Example (Python with transformers):
 ```python
@@ -27,13 +26,10 @@ text = "Кыргыз тили – бай жана кооз тил."
 tokens = tokenizer(text)
 print(tokens)
 ```
-Employ this tokenizer with pretrained models or train new ones (e.g., BERT, GPT) for Kyrgyz NLP tasks.
-Consider applying normalization or lemmatization to refine results.
-License and Attribution:
-Developed in collaboration with UlutSoft LLC. When using this tokenizer or derived resources, please provide proper attribution.
-Feedback and Contributions:
-Issues, suggestions, and contributions are welcome. Please open an Issue or Pull Request in the repository to help refine this resource.

 - kyrgyz
 - tokenizer
 ---
+A tokenizer tailored for the Kyrgyz language, utilizing SentencePiece with Byte Pair Encoding (BPE) to offer efficient and precise tokenization. It features a 50,000-subword vocabulary, ensuring optimal performance for various Kyrgyz NLP tasks. This tokenizer was developed in collaboration with UlutSoft LLC to reflect authentic Kyrgyz language usage.
 Features:
 Language: Kyrgyz
 Vocabulary Size: 50,000 subwords
 Method: SentencePiece (BPE)
+Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots.
 Usage Example (Python with transformers):
 ```python
 tokens = tokenizer(text)
 print(tokens)
 ```
+Tip: Consider applying normalization or lemmatization during preprocessing to further enhance the results.
+License and Attribution
+This tokenizer is licensed under the MIT License and was developed in collaboration with UlutSoft LLC. Proper attribution is required when using this tokenizer or derived resources.
+Feedback and Contributions
+We welcome feedback, suggestions, and contributions! Please open an issue or a pull request in the repository to help us refine and enhance this resource.
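
The tip added in this commit recommends normalizing text during preprocessing but does not say how. The following is a minimal sketch of one possible normalization step, not the author's method. The function name `normalize_kyrgyz` is hypothetical, and the choice of Unicode NFC plus whitespace collapsing is an assumption about what suits this tokenizer. Lemmatization is left out because it needs a language-specific tool that the README does not name.

```python
import re
import unicodedata


def normalize_kyrgyz(text: str) -> str:
    """Hypothetical preprocessing helper; not part of the tokenizer repository."""
    # NFC composes a base letter and a combining mark into one code point,
    # so "и" + U+0306 becomes the single letter "й".
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace (including non-breaking spaces) to one space
    # and drop leading and trailing whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = "  Кыргыз   тили\u00a0бай жана кооз тил.  "
    print(normalize_kyrgyz(raw))
```

The normalized string can then be passed to the tokenizer in place of the raw text. Whether further steps such as lowercasing help depends on how the vocabulary was trained, which the README does not state.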