---
datasets:
- mshojaei77/PersianTelegramChannels
language:
- fa
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- tokenizer
- persian
- bpe
---

# PersianBPETokenizer Model Card

## Model Details

### Model Description

The `PersianBPETokenizer` is a custom tokenizer designed specifically for the Persian (Farsi) language. It uses the Byte-Pair Encoding (BPE) algorithm to build a vocabulary that handles characteristic features of Persian text, such as the zero-width non-joiner (ZWNJ). The tokenizer is intended for use with language models like BERT and RoBERTa, making it a useful component for a wide range of Persian NLP tasks.

### Model Type

- **Tokenization algorithm**: Byte-Pair Encoding (BPE)
- **Normalization**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
- **Pre-tokenization**: Whitespace
- **Post-processing**: TemplateProcessing for special tokens

### Model Version

- **Version**: 1.0
- **Date**: September 6, 2024

### License

- **License**: MIT

### Developers

- **Developed by**: Mohammad Shojaei
- **Contact**: Shojaei.dev@gmail.com

### Citation

If you use this tokenizer in your research, please cite it as:

```
Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.
```

## Model Use

### Intended Use

- **Primary use**: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, and machine translation.
- **Secondary use**: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets.

### Out-of-Scope Use

- **Non-Persian text**: This tokenizer is not designed for languages other than Persian.
- **Non-NLP tasks**: It is not intended for non-NLP tasks such as image processing or audio analysis.

## Data

### Training Data

- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Description**: A collection of Persian text extracted from various Telegram channels. The dataset covers a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
- **Size**: 60,730 samples

### Data Preprocessing

- **Normalization**: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and removed ZWNJ characters.
- **Pre-tokenization**: Whitespace pre-tokenization.

## Performance

### Evaluation Metrics

- **Tokenization accuracy**: The tokenizer has been tested on a variety of Persian sentences and tokenizes, encodes, and round-trips (decodes) them accurately.
- **Compatibility**: Fully compatible with Hugging Face Transformers, ensuring seamless integration with downstream language models.

### Known Limitations

- **Vocabulary size**: The vocabulary is derived from the training data. Highly specialized domains may require retraining on domain-specific text.
- **Out-of-vocabulary words**: Rare or domain-specific words may be tokenized as unknown tokens (`[UNK]`).

## Training Procedure

### Training Steps

1. **Environment setup**: Installed the required libraries (`datasets`, `tokenizers`, `transformers`).
2. **Data preparation**: Loaded the `mshojaei77/PersianTelegramChannels` dataset and created a batch iterator for memory-efficient training.
3. **Tokenizer model**: Initialized the tokenizer with a BPE model and configured the normalization and pre-tokenization steps.
4. **Training**: Trained the tokenizer on the Persian text corpus using the BPE algorithm.
5. **Post-processing**: Set up post-processing to handle special tokens.
6. **Saving**: Saved the tokenizer to disk for future use.
7. **Compatibility**: Wrapped the tokenizer in a `PreTrainedTokenizerFast` object for compatibility with Hugging Face Transformers.

A code sketch of this pipeline is shown below.
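The sketch uses the Hugging Face `tokenizers` library and follows the normalization chain, whitespace pre-tokenization, special tokens, and batch size documented above. It is a minimal reconstruction, not the exact training script: the dataset column name (`"text"`), the choice to replace ZWNJ with a space, the BERT-style post-processing template, and the vocabulary size (library default) are assumptions.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, normalizers, pre_tokenizers, processors
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Step 2: load the corpus and build a batch iterator (batch size 1000, as documented).
# The "text" column name is an assumption about the dataset schema.
dataset = load_dataset("mshojaei77/PersianTelegramChannels", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Step 3: BPE model plus the documented normalization and pre-tokenization chain.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.StripAccents(),
    normalizers.Lowercase(),
    normalizers.Strip(),
    normalizers.Replace("\u200c", " "),  # ZWNJ; replacing it with a space is an assumption
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Step 4: train with the documented special tokens (default vocabulary size).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

# Step 5: TemplateProcessing so encodings carry [CLS]/[SEP], BERT-style.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Step 6: save the raw tokenizer to disk.
tokenizer.save("persian_bpe_tokenizer.json")

# Step 7: wrap for compatibility with Hugging Face Transformers.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)
```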
### Hyperparameters

- **Special tokens**: `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Batch size**: 1,000 samples per batch
- **Normalization steps**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)

## How to Use

### Installation

To use the `PersianBPETokenizer`, first install the required libraries:

```bash
pip install -q --upgrade datasets tokenizers transformers
```

### Loading the Tokenizer

You can load the tokenizer with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
```

### Tokenization Example

```python
test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"

# Split the sentence into subword tokens.
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)

# Encode to token IDs (special tokens are added by the post-processor)
# and decode back to text.
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))
```

## Acknowledgments

- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Libraries**: Hugging Face `datasets`, `tokenizers`, and `transformers`

## References

- [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/tokenizers/index)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
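## Batch Encoding Example

Since a stated secondary use is fine-tuning models such as BERT and RoBERTa on Persian datasets, a batch-encoding sketch may be helpful. The sentences and `max_length` value below are illustrative only, and `return_tensors="pt"` assumes PyTorch is installed:

```python
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

# Illustrative sentences; any Persian text works here.
sentences = [
    "سلام، چطور هستید؟",
    "امیدوارم روز خوبی داشته باشید",
]

# Pad/truncate to a fixed length and return PyTorch tensors,
# ready to feed to a model for fine-tuning.
batch = persian_tokenizer(
    sentences,
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 32])
```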