metinovadilet committed on
Commit
dbf297d
1 Parent(s): 9fd57ba

Update README.md

Files changed (1):
  README.md +5 -3
README.md CHANGED
@@ -15,21 +15,23 @@ Features:
 Language: Kyrgyz
 Vocabulary Size: 50,000 subwords
 Method: SentencePiece (BPE)
+
 Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots, and morphological or syntactic analysis.
 Usage Example (Python with transformers):
 
-python
+```python
 from transformers import AutoTokenizer
 
 tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")
 text = "Кыргыз тили – бай жана кооз тил."
 tokens = tokenizer(text)
 print(tokens)
-Recommendations:
+```
+
 
-Use diverse and representative text data to ensure the tokenizer covers various language styles and topics.
 Employ this tokenizer with pretrained models or train new ones (e.g., BERT, GPT) for Kyrgyz NLP tasks.
 Consider applying normalization or lemmatization to refine results.
+
 License and Attribution:
 Developed in collaboration with UlutSoft LLC. When using this tokenizer or derived resources, please provide proper attribution.
 
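The README above states the tokenizer was built with SentencePiece in BPE mode over a 50,000-subword vocabulary. As a rough illustration of what the BPE merge loop does, here is a toy stdlib-only sketch (not SentencePiece itself, and not this tokenizer's actual training code — real SentencePiece training also handles whitespace markers, character coverage, and a far larger corpus; the function names and the tiny Kyrgyz corpus are made up for the example):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a whitespace-split corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

# Toy corpus: "тил" (language) and inflected forms, repeated to give frequencies.
merges = learn_bpe("тил тил тили тили тилдер", 3)
print(merges)  # → [('т', 'и'), ('ти', 'л'), ('тил', 'и')]
```

Frequent character sequences such as the stem "тил" get fused into single subword units first, which is why a BPE vocabulary trained on enough Kyrgyz text ends up covering common stems and affixes — useful for an agglutinative language.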