---
datasets:
- mshojaei77/PersianTelegramChannels
language:
- fa
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- tokenizer
- persian
- bpe
---

# PersianBPETokenizer Model Card

## Model Details

### Model Description

The `PersianBPETokenizer` is a custom tokenizer designed specifically for the Persian (Farsi) language. It uses the Byte-Pair Encoding (BPE) algorithm to build a vocabulary that handles the characteristics of Persian text. The tokenizer is intended for use with language models such as BERT and RoBERTa, making it a useful tool for a variety of Persian NLP tasks.

### Model Type

- **Tokenization Algorithm**: Byte-Pair Encoding (BPE)
- **Normalization**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
- **Pre-tokenization**: Whitespace
- **Post-processing**: `TemplateProcessing` for special tokens (see the sketch below)
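
As a concrete illustration, here is a minimal sketch of how this pipeline maps onto the Hugging Face `tokenizers` API. Replacing ZWNJ with a space is an assumption; the original may remove it outright:

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Strip, Replace

# BPE model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalization: NFD -> StripAccents -> Lowercase -> Strip -> Replace (ZWNJ)
tokenizer.normalizer = normalizers.Sequence([
    NFD(),
    StripAccents(),
    Lowercase(),
    Strip(),
    Replace("\u200c", " "),  # ZWNJ (U+200C); the replacement target is an assumption
])

# Whitespace pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```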

### Model Version

- **Version**: 1.0
- **Date**: September 6, 2024

### License

- **License**: MIT

### Developers

- **Developed by**: Mohammad Shojaei
- **Contact**: Shojaei.dev@gmail.com

### Citation

If you use this tokenizer in your research, please cite it as:

```
Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.
```

## Model Use

### Intended Use

- **Primary Use**: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, and machine translation.
- **Secondary Use**: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets.

### Out-of-Scope Use

- **Non-Persian Text**: This tokenizer is not designed for languages other than Persian.
- **Non-NLP Tasks**: It is not intended for non-NLP tasks such as image processing or audio analysis.

## Data

### Training Data

- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Description**: A collection of Persian text extracted from various Telegram channels. It covers a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
- **Size**: 60,730 samples

### Data Preprocessing

- **Normalization**: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and replaced ZWNJ characters (demonstrated below).
- **Pre-tokenization**: Used whitespace pre-tokenization.
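
Continuing from the pipeline sketch under Model Type, the normalization stage can be inspected on its own; the sample string here is purely illustrative:

```python
# Run only the normalization stage on a raw string
# (`tokenizer` is the object assembled in the Model Type sketch)
sample = "سلام\u200cعلیکم   "  # contains a ZWNJ and trailing whitespace
print(tokenizer.normalizer.normalize_str(sample))
```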

## Performance

### Evaluation Metrics

- **Tokenization Accuracy**: The tokenizer has been tested on a variety of Persian sentences and shows high accuracy in tokenizing and encoding text.
- **Compatibility**: Fully compatible with Hugging Face Transformers, ensuring seamless integration with language models.

### Known Limitations

- **Vocabulary Size**: The vocabulary is derived from the training data, so very specialized domains may require additional training on domain-specific text.
- **Out-of-Vocabulary Words**: Rare or domain-specific words may be tokenized as unknown tokens (`[UNK]`), as the check below illustrates.
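
A quick way to check an input for this, assuming the tokenizer has been loaded as `persian_tokenizer` (see How to Use below):

```python
# Flag inputs containing out-of-vocabulary tokens
tokens = persian_tokenizer.tokenize("یک اصطلاح بسیار تخصصی")
if persian_tokenizer.unk_token in tokens:
    print("Input contains out-of-vocabulary tokens:", tokens)
```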

## Training Procedure

### Training Steps

1. **Environment Setup**: Installed the necessary libraries (`datasets`, `tokenizers`, `transformers`).
2. **Data Preparation**: Loaded the `mshojaei77/PersianTelegramChannels` dataset and created a batch iterator for efficient training.
3. **Tokenizer Model**: Initialized the tokenizer with a BPE model and applied the normalization and pre-tokenization steps.
4. **Training**: Trained the tokenizer on the Persian text corpus using the BPE algorithm.
5. **Post-processing**: Set up post-processing to handle special tokens.
6. **Saving**: Saved the tokenizer to disk for future use.
7. **Compatibility**: Converted the tokenizer to a `PreTrainedTokenizerFast` object for compatibility with Hugging Face Transformers (a condensed sketch of these steps follows).
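
The sketch below condenses steps 2 and 4-7, continuing from the pipeline assembled under Model Type. The dataset column name (`"text"`) and the output paths are assumptions, not confirmed details of the original run:

```python
from datasets import load_dataset
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast

# Step 2: load the corpus and build a batch iterator (1000 samples per batch)
dataset = load_dataset("mshojaei77/PersianTelegramChannels", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]  # column name is an assumption

# Step 4: train the BPE model with the special tokens listed under Hyperparameters
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

# Step 5: wrap sequences with [CLS]/[SEP] via TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Step 6: save the trained tokenizer to disk (path is an assumption)
tokenizer.save("persian_bpe_tokenizer.json")

# Step 7: wrap it for use with Hugging Face Transformers
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)
fast_tokenizer.save_pretrained("persian_bpe_tokenizer")
```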

### Hyperparameters

- **Special Tokens**: `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Batch Size**: 1000 samples per batch
- **Normalization Steps**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)

## How to Use

### Installation

To use the `PersianBPETokenizer`, first install the required libraries:

```bash
pip install -q --upgrade datasets tokenizers transformers
```

### Loading the Tokenizer

You can load the tokenizer using the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
```

### Tokenization Example

```python
test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"

# Split the sentence into subword tokens
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)

# Encode to input IDs (the post-processor adds [CLS]/[SEP]) and round-trip back to text
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))
```
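
For batched model input, the `[PAD]` special token lets the tokenizer produce padded, tensor-ready batches. A minimal sketch, assuming PyTorch is installed:

```python
batch = persian_tokenizer(
    ["سلام دنیا", "امروز هوا خیلی خوب است"],
    padding=True,         # pad to the longest sequence using [PAD]
    return_tensors="pt",  # return PyTorch tensors
)
print(batch["input_ids"].shape)    # (2, longest_sequence_length)
print(batch["attention_mask"][0])  # 1 for real tokens, 0 for padding
```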

## Acknowledgments

- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Libraries**: Hugging Face `datasets`, `tokenizers`, and `transformers`

## References

- [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/tokenizers/index)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)