Model Card for Turkish Byte Pair Encoding Tokenizer

This model provides a tokenizer specifically designed for the Turkish language. Its vocabulary combines 256,000 Turkish word roots and all Turkish suffixes (in both lowercase and uppercase forms) with approximately 207,000 additional tokens learned via Byte Pair Encoding (BPE). The tokenizer is intended to improve tokenization quality for NLP tasks involving Turkish text.

Model Details

Model Description

This tokenizer is developed to handle the complex morphology and agglutinative nature of the Turkish language. By leveraging a comprehensive set of word roots and suffixes combined with BPE, it ensures efficient tokenization, preserving linguistic structure and reducing the vocabulary size for downstream tasks.

  • Developed by: Ali Arda Fincan
  • Model type: Tokenizer (Byte Pair Encoding & Pre-Defined Turkish Words)
  • Language(s) (NLP): Turkish
  • License: Apache-2.0
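
As a quick illustration of the root-and-suffix design described above, the sketch below tokenizes a single agglutinative word form. The exact segmentation depends on the learned vocabulary, so the morphological breakdown in the comment is illustrative rather than a guaranteed output.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aliarda/turkish_tokenizer")

# "evlerimizden" = ev (house) + ler (plural) + imiz (our) + den (from).
# The printed split is vocabulary-dependent; this comment shows the
# morphological reference, not a promised tokenization.
print(tokenizer.tokenize("evlerimizden"))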

Model Sources

  • Repository: umarigan/turkish_corpus_small

Direct Use

This tokenizer can be directly used for tokenizing Turkish text in tasks like text classification, translation, or sentiment analysis. It efficiently handles the linguistic properties of Turkish, making it suitable for tasks requiring morphological analysis or text processing.
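
A minimal batch-encoding sketch for a classification-style workflow follows. The example sentences are placeholders, and padding assumes the tokenizer defines a pad token; adjust max_length to your model's context size.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aliarda/turkish_tokenizer")

# Placeholder sentences; substitute text from your own task.
texts = [
    "Bu ürün beklentilerimi fazlasıyla karşıladı.",
    "Kargo çok geç geldi.",
]

# padding=True assumes a pad token is defined for this tokenizer.
batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",  # requires PyTorch
)
print(batch["input_ids"].shape)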

Downstream Use

The tokenizer can be fine-tuned or integrated into NLP pipelines for Turkish language processing, including model training or inference tasks.
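
For example, the tokenizer can be applied as a preprocessing step over a corpus with the datasets library. The sketch below uses the corpus listed under Model Sources and assumes it exposes a "text" column; adjust the dataset and column name to your pipeline.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aliarda/turkish_tokenizer")

# The "text" column name is an assumption; match it to the dataset schema.
dataset = load_dataset("umarigan/turkish_corpus_small", split="train")

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True)
print(tokenized)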

Out-of-Scope Use

The tokenizer is not designed for non-Turkish languages, nor for domain-specific text whose vocabulary is not covered by its word roots, suffixes, or learned BPE tokens.

Bias, Risks, and Limitations

While this tokenizer is optimized for Turkish, biases may arise if the training data contains imbalances or stereotypes. It may also perform suboptimally on highly informal or domain-specific text.

Recommendations

Users should evaluate the tokenizer on their specific datasets and tasks to identify any biases or limitations. Supplementary preprocessing or token adjustments may be required for optimal results.
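
One simple check is tokenization fertility: the average number of tokens produced per whitespace-separated word on a representative sample. A minimal sketch with placeholder sentences follows; substitute text from your own domain.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aliarda/turkish_tokenizer")

# Placeholder sentences; replace with a sample of your own data.
samples = [
    "Türkçe metin işleme için bir örnek.",
    "Evlerimizden okula yürüyerek gidiyorduk.",
]

total_tokens = sum(len(tokenizer.tokenize(s)) for s in samples)
total_words = sum(len(s.split()) for s in samples)

# Lower fertility generally means shorter sequences for the same text.
print(f"Average tokens per word: {total_tokens / total_words:.2f}")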

How to Get Started with the Model

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("aliarda/turkish_tokenizer")

# Example usage: tokenize a Turkish sentence
text = "Türkçe metin işleme için bir örnek."
tokens = tokenizer.tokenize(text)
print(tokens)