Model Card for Turkish Byte Pair Encoding Tokenizer

This model provides a tokenizer designed specifically for the Turkish language. Its vocabulary contains 24k Turkish word roots beginning with a capital letter, 24k word roots beginning with a lowercase letter, all Turkish suffixes, and approximately 12k additional tokens learned with Byte Pair Encoding (BPE). The tokenizer is intended to improve tokenization quality for NLP tasks involving Turkish text.

Model Details

Model Description

This tokenizer is developed to handle the complex morphology and agglutinative nature of the Turkish language. By leveraging a comprehensive set of word roots and suffixes combined with BPE, it ensures efficient tokenization, preserving linguistic structure and reducing the vocabulary size for downstream tasks.
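The listing below is a minimal, illustrative sketch of how a vocabulary of pre-defined roots and suffixes can be combined with BPE-learned merges using the Hugging Face tokenizers library. It is not the actual build script of this tokenizer; the root/suffix lists, the corpus, and the vocabulary size are placeholders.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical seed lists; the released tokenizer uses ~48k roots and all Turkish suffixes.
roots = ["arkadaş", "Arkadaş", "git", "gör"]
suffixes = ["lar", "ları", "mek", "meli", "sin"]

# Train a BPE model on a (placeholder) corpus to learn the additional merges.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=60_000, special_tokens=["[UNK]", "[PAD]"])
corpus = ["Arkadaşlarını görmek için gitmelisin."]  # placeholder training text
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Inject the pre-defined roots and suffixes as whole tokens.
tokenizer.add_tokens(roots + suffixes)
print(tokenizer.encode("Arkadaşlarını görmek için gitmelisin.").tokens)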

  • Developed by: Ahmet Semih Gümüş
  • Model type: Tokenizer (Byte Pair Encoding & Pre-Defined Turkish Words)
  • Language(s) (NLP): Turkish
  • License: Apache-2.0

Model Sources

  • Training data: wikimedia/wikipedia (configuration 20231101.tr, the Turkish Wikipedia dump)

Direct Use

This tokenizer can be directly used for tokenizing Turkish text in tasks like text classification, translation, or sentiment analysis. It efficiently handles the linguistic properties of Turkish, making it suitable for tasks requiring morphological analysis or text processing.
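As a hedged, self-contained example (the sentence below is made up, not from the model card), the tokenizer can be called directly to produce token IDs that a downstream Turkish classification or sentiment model would consume:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/tr_tokenizer")

text = "Bu film gerçekten çok güzeldi."   # example sentence
encoded = tokenizer(text)                 # returns input_ids, attention_mask, ...
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))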

Downstream Use

The tokenizer can be fine-tuned or integrated into NLP pipelines for Turkish language processing, including model training or inference tasks.
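A possible integration is sketched below, under the assumption that your data sits in a datasets Dataset with a "text" column; the sentences and max_length are placeholders, not part of the model card.

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/tr_tokenizer")

# Placeholder dataset; replace with your own Turkish corpus.
data = Dataset.from_dict({"text": [
    "Arkadaşlarını görmek için gitmelisin.",
    "Bugün hava çok güzel.",
]})

def preprocess(batch):
    # Pad/truncate so the encodings are ready for model training or inference.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

encoded = data.map(preprocess, batched=True)
print(encoded[0]["input_ids"][:10])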

Out-of-Scope Use

The tokenizer is not designed for non-Turkish text, nor for domains whose specialized vocabulary is not represented in its training data.

Bias, Risks, and Limitations

While this tokenizer is optimized for Turkish, biases may arise if the training data contains imbalances or stereotypes. It may also perform suboptimally on highly informal or domain-specific text.

Recommendations

Users should evaluate the tokenizer on their specific datasets and tasks to identify any biases or limitations. Supplementary preprocessing or token adjustments may be required for optimal results.
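One simple, hedged way to run such an evaluation is to measure fertility (average tokens per whitespace-separated word) on your own corpus; the sentences below are placeholders, and noticeably higher fertility on a domain suggests its vocabulary is poorly covered.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/tr_tokenizer")

corpus = [
    "Arkadaşlarını görmek için gitmelisin.",      # everyday Turkish
    "Elektrokardiyografi bulguları normaldi.",    # domain-specific (medical) Turkish
]

total_words = sum(len(s.split()) for s in corpus)
total_tokens = sum(len(tokenizer.tokenize(s)) for s in corpus)
print(f"fertility = {total_tokens / total_words:.2f} tokens per word")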

How to Get Started with the Model

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/tr_tokenizer")

# Example usage:
text = "Arkadaşlarını görmek için gitmelisin."
tokens = tokenizer.tokenize(text)
print(tokens)