metinovadilet committed dbf297d (parent: 9fd57ba): Update README.md

README.md CHANGED
Language: Kyrgyz
Vocabulary Size: 50,000 subwords
Method: SentencePiece (BPE)

Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots, and morphological or syntactic analysis.

Usage Example (Python with transformers):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")

# "The Kyrgyz language is a rich and beautiful language."
text = "Кыргыз тили – бай жана кооз тил."
tokens = tokenizer(text)
print(tokens)
```

Use diverse and representative text data to ensure the tokenizer covers various language styles and topics.
Employ this tokenizer with pretrained models or train new ones (e.g., BERT, GPT) for Kyrgyz NLP tasks.
Consider applying normalization or lemmatization to refine results.

License and Attribution:
Developed in collaboration with UlutSoft LLC. When using this tokenizer or derived resources, please provide proper attribution.
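The normalization tip above can be sketched as a small preprocessing step. A minimal example using Unicode NFC composition plus whitespace cleanup; the function name and the choice of NFC are illustrative assumptions, not part of the released tokenizer:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Illustrative preprocessing: NFC-compose Unicode and collapse
    extra whitespace before passing Kyrgyz text to the tokenizer."""
    text = unicodedata.normalize("NFC", text)  # e.g. composes и + U+0306 into й
    return " ".join(text.split())              # collapse runs of spaces/newlines

# "й" written in decomposed form (и + combining breve), plus extra spaces
print(normalize_text("Кыргыз   тили баи\u0306 тил."))  # prints "Кыргыз тили бай тил."
```

Normalizing consistently at both training and inference time helps keep the subword vocabulary stable, since visually identical strings otherwise map to different byte sequences.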