metinovadilet committed dbf297d (parent: 9fd57ba): Update README.md

README.md CHANGED
Language: Kyrgyz
Vocabulary Size: 50,000 subwords
Method: SentencePiece (BPE)

Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots, and morphological or syntactic analysis.

Usage Example (Python with transformers):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")

# "The Kyrgyz language is a rich and beautiful language."
text = "Кыргыз тили – бай жана кооз тил."
tokens = tokenizer(text)
print(tokens)
```

Use diverse and representative text data to ensure the tokenizer covers various language styles and topics.
Employ this tokenizer with pretrained models or train new ones (e.g., BERT, GPT) for Kyrgyz NLP tasks.
Consider applying normalization or lemmatization to refine results.

License and Attribution:
Developed in collaboration with UlutSoft LLC. When using this tokenizer or derived resources, please provide proper attribution.
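The normalization tip above can be sketched as a small preprocessing step. A minimal example using Unicode NFC composition plus whitespace cleanup; the function name and the choice of NFC are illustrative assumptions, not part of the released tokenizer:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Illustrative preprocessing: NFC-compose Unicode and collapse
    extra whitespace before passing Kyrgyz text to the tokenizer."""
    text = unicodedata.normalize("NFC", text)  # e.g. composes и + U+0306 into й
    return " ".join(text.split())              # collapse runs of spaces/newlines

# "й" written in decomposed form (и + combining breve), plus extra spaces
print(normalize_text("Кыргыз   тили баи\u0306 тил."))  # prints "Кыргыз тили бай тил."
```

Normalizing consistently at both training and inference time helps keep the subword vocabulary stable, since visually identical strings otherwise map to different byte sequences.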