---
datasets:
  - mshojaei77/PersianTelegramChannels
language:
  - fa
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
  - tokenizer
  - persian
  - bpe
---

# PersianBPETokenizer Model Card

## Model Details

### Model Description

The PersianBPETokenizer is a custom tokenizer designed specifically for the Persian (Farsi) language. It uses the Byte-Pair Encoding (BPE) algorithm to build a subword vocabulary that handles characteristics of Persian text such as the zero-width non-joiner (ZWNJ). The tokenizer is packaged for Hugging Face Transformers, so it can be paired with language models like BERT and RoBERTa for a range of Persian NLP tasks.

### Model Type

- Tokenization Algorithm: Byte-Pair Encoding (BPE)
- Normalization: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ removal)
- Pre-tokenization: Whitespace
- Post-processing: TemplateProcessing for special tokens
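
As a hedged illustration of what these components do, the normalization and pre-tokenization chain described above can be rebuilt and inspected with the `tokenizers` library (this reconstructs the described settings; it is not loaded from the released tokenizer files):

```python
from tokenizers import normalizers, pre_tokenizers
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Strip, Replace

# Rebuild the normalization chain described above; ZWNJ is U+200C.
# This card states ZWNJ characters are removed, hence the empty replacement.
normalizer = normalizers.Sequence(
    [NFD(), StripAccents(), Lowercase(), Strip(), Replace("\u200c", "")]
)
print(normalizer.normalize_str("می‌روم  "))  # ZWNJ dropped, edges stripped

# Whitespace pre-tokenization splits on whitespace and punctuation boundaries
pre_tokenizer = pre_tokenizers.Whitespace()
print(pre_tokenizer.pre_tokenize_str("سلام، چطور هستید؟"))
```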

### Model Version

- Version: 1.0
- Date: September 6, 2024

### License

- License: MIT

### Developers

- Mohammad Shojaei ([mshojaei77](https://huggingface.co/mshojaei77))

### Citation

If you use this tokenizer in your research, please cite it as:

> Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.

## Model Use

### Intended Use

- Primary Use: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, and machine translation.
- Secondary Use: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets (see the sketch below).
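
As a sketch of the second use case, one option is to pair the tokenizer with a freshly configured BERT-style model sized to its vocabulary; the model choice and configuration below are illustrative assumptions, not part of this repository. (A pre-trained checkpoint would instead need its embeddings resized with `model.resize_token_embeddings(len(tokenizer))`.)

```python
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

# Illustrative: size a fresh BERT config to this tokenizer's vocabulary
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    pad_token_id=tokenizer.pad_token_id,
)
model = BertForMaskedLM(config)
```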

### Out-of-Scope Use

- Non-Persian Text: This tokenizer is not designed for languages other than Persian.
- Non-NLP Tasks: It is not intended for non-NLP tasks such as image processing or audio analysis.

## Data

### Training Data

- Dataset: [mshojaei77/PersianTelegramChannels](https://huggingface.co/datasets/mshojaei77/PersianTelegramChannels)
- Description: A collection of Persian text extracted from various Telegram channels. It covers a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
- Size: 60,730 samples
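
For reference, the corpus can be pulled from the Hub with the `datasets` library (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Load the Persian Telegram corpus from the Hugging Face Hub
dataset = load_dataset("mshojaei77/PersianTelegramChannels", split="train")
print(len(dataset))  # 60,730 samples according to this card
```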

### Data Preprocessing

- Normalization: Applied NFD Unicode normalization, stripped accents, lowercased the text, stripped leading and trailing whitespace, and removed zero-width non-joiner (ZWNJ, U+200C) characters.
- Pre-tokenization: Split text on whitespace.

## Performance

### Evaluation Metrics

- Tokenization Quality: The tokenizer has been spot-checked on a variety of Persian sentences and tokenizes, encodes, and decodes them as expected; no formal benchmark score is reported.
- Compatibility: Fully compatible with Hugging Face Transformers, so it drops into standard pipelines without conversion.

### Known Limitations

- Vocabulary Size: The vocabulary was learned from the training corpus; very specialized domains may require retraining or extending the tokenizer on domain-specific data.
- Out-of-Vocabulary Words: Rare or domain-specific words may be tokenized as unknown tokens ([UNK]); see the check below.
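
A quick, hedged way to spot the second limitation is to look for the unknown token in a tokenization (the sample text below is purely illustrative and may or may not trigger `[UNK]`):

```python
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

# Illustrative check: does this text produce any unknown tokens?
tokens = persian_tokenizer.tokenize("متن نمونه با اصطلاحات تخصصی")
if persian_tokenizer.unk_token in tokens:
    print("Out-of-vocabulary content:", tokens)
```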

## Training Procedure

### Training Steps

1. Environment Setup: Installed the required libraries (`datasets`, `tokenizers`, `transformers`).
2. Data Preparation: Loaded the mshojaei77/PersianTelegramChannels dataset and created a batch iterator for memory-efficient training.
3. Tokenizer Model: Initialized a tokenizer with a BPE model and configured the normalization and pre-tokenization steps.
4. Training: Trained the tokenizer on the Persian text corpus with the BPE algorithm.
5. Post-processing: Set up TemplateProcessing to insert the special tokens.
6. Saving: Saved the tokenizer to disk for reuse.
7. Compatibility: Wrapped the tokenizer in a PreTrainedTokenizerFast object for use with Hugging Face Transformers (see the sketch after this list).
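
The sketch below strings steps 2-7 together with the `tokenizers` API, using the special tokens and batch size listed under Hyperparameters. The vocabulary size, the dataset's `text` column name, and the output file name are assumptions, not values from the original training run:

```python
from datasets import load_dataset
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers, processors
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

dataset = load_dataset("mshojaei77/PersianTelegramChannels", split="train")

def batch_iterator(batch_size=1000):
    # Yield batches of 1,000 raw texts; assumes the text column is named "text"
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Step 3: BPE model plus the normalization/pre-tokenization chain
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(), normalizers.StripAccents(), normalizers.Lowercase(),
    normalizers.Strip(), normalizers.Replace("\u200c", ""),
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Step 4: train with the special tokens from the Hyperparameters section
trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # assumption: the card does not state the vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

# Step 5: TemplateProcessing wraps sequences in [CLS] ... [SEP]
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", tokenizer.token_to_id("[CLS]")),
                    ("[SEP]", tokenizer.token_to_id("[SEP]"))],
)

# Steps 6-7: save, then wrap for Transformers compatibility
tokenizer.save("persian_bpe_tokenizer.json")
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="persian_bpe_tokenizer.json",
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)
```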

### Hyperparameters

- Special Tokens: `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- Batch Size: 1,000 samples per batch
- Normalization Steps: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ removal)

## How to Use

### Installation

To use the PersianBPETokenizer, first install the required libraries:

```bash
pip install -q --upgrade datasets tokenizers transformers
```

### Loading the Tokenizer

You can load the tokenizer using the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
```

### Tokenization Example

```python
test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"

tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)

encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))
```

## Acknowledgments

- Dataset: [mshojaei77/PersianTelegramChannels](https://huggingface.co/datasets/mshojaei77/PersianTelegramChannels)
- Libraries: Hugging Face `datasets`, `tokenizers`, and `transformers`
