PersianBPETokenizer Model Card
Model Details
Model Description
The PersianBPETokenizer is a custom tokenizer designed for the Persian (Farsi) language. It uses the Byte-Pair Encoding (BPE) algorithm to build a subword vocabulary suited to the characteristics of Persian text. It is intended for use with language models such as BERT and RoBERTa across a range of Persian NLP tasks.
Model Type
- Tokenization Algorithm: Byte-Pair Encoding (BPE)
- Normalization: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
- Pre-tokenization: Whitespace
- Post-processing: TemplateProcessing for special tokens
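For readers unfamiliar with BPE, the following is a minimal pure-Python sketch of how merge rules are learned: count adjacent symbol pairs, fuse the most frequent pair, and repeat. It is an illustration only, not the implementation used by the tokenizers library; the function name learn_bpe_merges and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by pair order
        # (an arbitrary choice made here for determinism).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Example: learn three merges from a handful of related Persian words.
print(learn_bpe_merges(["کتاب", "کتابخانه", "کتابها"], 3))
```

Each learned merge becomes one vocabulary entry, so the number of merges controls how coarse or fine the resulting subword units are.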
Model Version
- Version: 1.0
- Date: September 6, 2024
License
- License: MIT
Developers
- Developed by: Mohammad Shojaei
- Contact: Shojaei.dev@gmail.com
Citation
If you use this tokenizer in your research, please cite it as:
Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.
Model Use
Intended Use
- Primary Use: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, machine translation, and more.
- Secondary Use: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets.
Out-of-Scope Use
- Non-Persian Text: This tokenizer is not designed for languages other than Persian.
- Non-NLP Tasks: It is not intended for use in non-NLP tasks such as image processing or audio analysis.
Data
Training Data
- Dataset: mshojaei77/PersianTelegramChannels
- Description: A rich collection of Persian text extracted from various Telegram channels. This dataset provides a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
- Size: 60,730 samples
Data Preprocessing
- Normalization: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and removed ZWNJ characters.
- Pre-tokenization: Used whitespace pre-tokenization.
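The effect of the normalization steps can be approximated with the Python standard library alone. The sketch below applies the same five steps in the same order; normalize_persian is a hypothetical helper written for illustration, and treating every Unicode mark category (Mn, Mc, Me) as an accent is an assumption about how StripAccents behaves.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, common inside Persian words


def normalize_persian(text: str) -> str:
    # 1. NFD: decompose characters into base letters plus combining marks.
    text = unicodedata.normalize("NFD", text)
    # 2. Strip accents: drop combining marks (Unicode categories Mn, Mc, Me).
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )
    # 3. Lowercase (only affects cased scripts such as Latin).
    text = text.lower()
    # 4. Strip leading and trailing whitespace.
    text = text.strip()
    # 5. Remove ZWNJ characters.
    return text.replace(ZWNJ, "")


print(normalize_persian("  Café می\u200cروم "))
```

One consequence worth knowing: because NFD splits آ into ا plus a combining maddah, and the accent-stripping step then drops the maddah, آ and ا become indistinguishable after normalization. Persian short-vowel diacritics are removed in the same way.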
Performance
Evaluation Metrics
- Tokenization Accuracy: The tokenizer has been tested on various Persian sentences and has shown high accuracy in tokenizing and encoding text.
- Compatibility: Loads through Hugging Face Transformers (for example via AutoTokenizer), so it plugs into that library's model and training workflows.
Known Limitations
- Vocabulary Size: The vocabulary was learned from the training data. For very specialized domains, retraining or extending the tokenizer on domain-specific data may be required.
- Out-of-Vocabulary Words: Rare or domain-specific words may be tokenized as unknown tokens ([UNK]).
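One way to gauge how much the out-of-vocabulary limitation matters for a particular domain is to measure the share of [UNK] tokens in tokenized sample text. The helper below is a hypothetical sketch (unk_rate is not part of any library); in practice the token list would come from a call such as persian_tokenizer.tokenize(text).

```python
def unk_rate(tokens, unk_token="[UNK]"):
    """Return the fraction of tokens equal to the unknown token."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t == unk_token) / len(tokens)


# Hand-made token list standing in for real tokenizer output.
sample_tokens = ["سلام", "[UNK]", "دنیا", "[UNK]"]
print(unk_rate(sample_tokens))
```

A rate that stays high on in-domain text would suggest retraining the tokenizer on data from that domain.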
Training Procedure
Training Steps
- Environment Setup: Installed necessary libraries (datasets, tokenizers, transformers).
- Data Preparation: Loaded the mshojaei77/PersianTelegramChannels dataset and created a batch iterator for efficient training.
- Tokenizer Model: Initialized the tokenizer with a BPE model and applied normalization and pre-tokenization steps.
- Training: Trained the tokenizer on the Persian text corpus using the BPE algorithm.
- Post-processing: Set up post-processing to handle special tokens.
- Saving: Saved the tokenizer to disk for future use.
- Compatibility: Converted the tokenizer to a PreTrainedTokenizerFast object for compatibility with Hugging Face Transformers.
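The steps above can be sketched with the tokenizers library roughly as follows. This is a reconstruction under stated assumptions, not the author's actual training script: the three-sentence corpus is a toy stand-in for the real dataset, vocab_size=200 is an arbitrary value (the card does not state one), and the BERT-style template for special tokens is assumed.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# Toy stand-in corpus (assumption); the card trains on
# mshojaei77/PersianTelegramChannels.
corpus = [
    "سلام دنیا",
    "سلام، چطور هستید؟",
    "امیدوارم روز خوبی داشته باشید",
]

# BPE model with an explicit unknown token.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalization steps in the order listed in the card.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.StripAccents(),
    normalizers.Lowercase(),
    normalizers.Strip(),
    normalizers.Replace("\u200c", ""),  # remove ZWNJ
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# vocab_size is an assumed value; the card does not state one.
trainer = BpeTrainer(
    vocab_size=200, special_tokens=SPECIAL_TOKENS, show_progress=False
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap sequences as [CLS] ... [SEP]; the template is an assumption.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

encoding = tokenizer.encode("سلام دنیا")
print(encoding.tokens)
```

From here, tokenizer.save("tokenizer.json") would write the result to disk, and passing the object to PreTrainedTokenizerFast(tokenizer_object=tokenizer) from transformers is one way to obtain the Transformers-compatible wrapper mentioned in the last step.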
Hyperparameters
- Special Tokens: [UNK], [CLS], [SEP], [PAD], [MASK]
- Batch Size: 1000 samples per batch
- Normalization Steps: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
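The batch size refers to how many samples are handed to the trainer at a time. A minimal sketch of such a batch iterator is shown below, assuming the texts are already available as a Python list of strings (the dataset's column name is not stated in this card, so loading is left out). With 60,730 samples and a batch size of 1000, this yields 61 batches, the last one holding 730 samples.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield successive lists of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Placeholder texts standing in for the real corpus.
texts = ["نمونه"] * 2500
print([len(batch) for batch in batch_iterator(texts)])
```

A generator like this can be passed directly to the trainer, which avoids building one giant in-memory copy of the corpus.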
How to Use
Installation
To use the PersianBPETokenizer, first install the required libraries:
pip install -q --upgrade datasets tokenizers transformers
Loading the Tokenizer
You can load the tokenizer using the Hugging Face Transformers library:
from transformers import AutoTokenizer
persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
Tokenization Example
test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))
Acknowledgments
- Dataset: mshojaei77/PersianTelegramChannels
- Libraries: Hugging Face datasets, tokenizers, and transformers