metinovadilet committed on
Commit c2b39a5
1 Parent(s): dbf297d

Update README.md

tags:
- kyrgyz
- tokenizer
---
A tokenizer tailored for the Kyrgyz language, using SentencePiece with Byte Pair Encoding (BPE) for efficient and precise tokenization. It features a 50,000-subword vocabulary, making it well suited to a wide range of Kyrgyz NLP tasks. The tokenizer was developed in collaboration with UlutSoft LLC to reflect authentic Kyrgyz language usage.
 
Features:

- Language: Kyrgyz
- Vocabulary Size: 50,000 subwords
- Method: SentencePiece (BPE)

Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots.
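To make the subword idea concrete, here is a toy sketch of greedy longest-match segmentation in the style of a BPE vocabulary. The tiny vocabulary and the `segment` helper are invented for illustration only; they are not this tokenizer's real 50,000-entry vocabulary or its actual merge algorithm:

```python
# Toy illustration of longest-match subword segmentation, as produced by
# BPE-style tokenizers. TOY_VOCAB is invented for demonstration and is NOT
# the tokenizer's real 50,000-entry vocabulary.
TOY_VOCAB = {"кыргыз", "тил", "и", "к", "ы", "р", "г", "з", "т", "л"}

def segment(word: str, vocab: set) -> list:
    """Greedily split `word` into the longest subwords found in `vocab`."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            # Unknown character: emit it as-is (real BPE tokenizers
            # typically handle this with a byte-level fallback).
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("кыргызтили", TOY_VOCAB))  # ['кыргыз', 'тил', 'и']
```

A large subword vocabulary like this tokenizer's means common Kyrgyz words and morphemes tend to survive as single pieces, while rare words decompose into a few reusable fragments.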
Usage Example (Python with transformers):

```python
from transformers import AutoTokenizer

# Replace the placeholder below with this tokenizer's repository id on the Hub.
tokenizer = AutoTokenizer.from_pretrained("<repository-id>")

text = "Кыргыз тили – бай жана кооз тил."
tokens = tokenizer(text)
print(tokens)
```
Tip: Consider applying normalization or lemmatization during preprocessing to further enhance the results.
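For the normalization part of that tip, a minimal preprocessing sketch using only the standard library; the `preprocess` helper is illustrative and not part of the tokenizer itself:

```python
import unicodedata

def preprocess(text: str) -> str:
    """Illustrative cleanup before tokenization: Unicode NFC
    normalization, whitespace collapsing, and lowercasing."""
    text = unicodedata.normalize("NFC", text)  # compose accents/letters canonically
    text = " ".join(text.split())              # collapse runs of whitespace
    return text.lower()

print(preprocess("Кыргыз   тили – бай жана кооз тил."))
# кыргыз тили – бай жана кооз тил.
```

Whether lowercasing is appropriate depends on the downstream task; for cased models you may want to keep only the Unicode normalization step.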
 
License and Attribution

This tokenizer is licensed under the MIT License and was developed in collaboration with UlutSoft LLC. Proper attribution is required when using this tokenizer or derived resources.
 
Feedback and Contributions

We welcome feedback, suggestions, and contributions! Please open an issue or a pull request in the repository to help us refine and enhance this resource.