---
datasets:
- mshojaei77/PersianTelegramChannels
language:
- fa
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- tokenizer
- persian
- bpe
---
# PersianBPETokenizer Model Card
## Model Details
### Model Description
The `PersianBPETokenizer` is a custom tokenizer specifically designed for the Persian (Farsi) language. It leverages the Byte-Pair Encoding (BPE) algorithm to create a robust vocabulary that can effectively handle the unique characteristics of Persian text. This tokenizer is optimized for use with advanced language models like BERT and RoBERTa, making it a valuable tool for various Persian NLP tasks.
### Model Type
- **Tokenization Algorithm**: Byte-Pair Encoding (BPE)
- **Normalization**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
- **Pre-tokenization**: Whitespace
- **Post-processing**: TemplateProcessing for special tokens
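These components map directly onto building blocks in the Hugging Face `tokenizers` library. Below is a minimal sketch of how such a pipeline is assembled; the ZWNJ replacement target (a plain space here) is an assumption, since this card only states that ZWNJ characters are removed. The `TemplateProcessing` setup appears in the training sketch under Training Procedure.
```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE

# BPE model with an explicit unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalization chain in the order listed above.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.StripAccents(),
    normalizers.Lowercase(),
    normalizers.Strip(),
    normalizers.Replace("\u200c", " "),  # ZWNJ; replacement target assumed
])

# Split on whitespace before BPE merges are applied.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```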
### Model Version
- **Version**: 1.0
- **Date**: September 6, 2024
### License
- **License**: MIT
### Developers
- **Developed by**: Mohammad Shojaei
- **Contact**: Shojaei.dev@gmail.com
### Citation
If you use this tokenizer in your research, please cite it as:
```
Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.
```
## Model Use
### Intended Use
- **Primary Use**: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, machine translation, and more.
- **Secondary Use**: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets.
### Out-of-Scope Use
- **Non-Persian Text**: This tokenizer is not designed for languages other than Persian.
- **Non-NLP Tasks**: It is not intended for use in non-NLP tasks such as image processing or audio analysis.
## Data
### Training Data
- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Description**: A rich collection of Persian text extracted from various Telegram channels. This dataset provides a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
- **Size**: 60,730 samples
### Data Preprocessing
- **Normalization**: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and removed ZWNJ characters.
- **Pre-tokenization**: Used whitespace pre-tokenization.
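The effect of this chain can be previewed directly with the `tokenizers` normalizer API. A small sketch follows; the ZWNJ-to-space replacement is an assumption, since the card only says ZWNJ characters are removed.
```python
from tokenizers import normalizers

normalizer = normalizers.Sequence([
    normalizers.NFD(), normalizers.StripAccents(), normalizers.Lowercase(),
    normalizers.Strip(), normalizers.Replace("\u200c", " "),
])

# "می‌خواهم" ("I want") contains a ZWNJ between its two morphemes.
print(normalizer.normalize_str("می‌خواهم"))  # expected: "می خواهم"
```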
## Performance
### Evaluation Metrics
- **Tokenization Quality**: The tokenizer has been tested on a variety of Persian sentences and consistently produces correct tokenization and round-trip encoding/decoding.
- **Compatibility**: Fully compatible with Hugging Face Transformers, ensuring seamless integration with advanced language models.
### Known Limitations
- **Vocabulary Size**: The current vocabulary size is based on the training data. For very specialized domains, additional fine-tuning or training on domain-specific data may be required.
- **Out-of-Vocabulary Words**: Rare or domain-specific words may be tokenized as unknown tokens (`[UNK]`).
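For instance, a character that never appeared in the training corpus cannot be composed from learned merges and is expected to surface as `[UNK]`. A quick check is sketched below; the exact output is an expectation, not a verified result, and the CJK character is chosen as something unlikely to occur in Persian Telegram text.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

# A character outside the training vocabulary should fall back to [UNK].
print(tokenizer.tokenize("سلام 漢"))  # expected: ['سلام', '[UNK]']
```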
## Training Procedure
### Training Steps
1. **Environment Setup**: Installed necessary libraries (`datasets`, `tokenizers`, `transformers`).
2. **Data Preparation**: Loaded the `mshojaei77/PersianTelegramChannels` dataset and created a batch iterator for efficient training.
3. **Tokenizer Model**: Initialized the tokenizer with a BPE model and applied normalization and pre-tokenization steps.
4. **Training**: Trained the tokenizer on the Persian text corpus using the BPE algorithm.
5. **Post-processing**: Set up post-processing to handle special tokens.
6. **Saving**: Saved the tokenizer to disk for future use.
7. **Compatibility**: Converted the tokenizer to a `PreTrainedTokenizerFast` object for compatibility with Hugging Face Transformers.
### Hyperparameters
- **Special Tokens**: `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Batch Size**: 1000 samples per batch
- **Normalization Steps**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
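A condensed sketch of the steps and hyperparameters above is shown below. It assumes the dataset exposes a `text` column and that ZWNJ is replaced with a plain space (neither detail is stated in this card); the vocabulary size is left at the `BpeTrainer` default because no value is reported, and the output file name is illustrative.
```python
from datasets import load_dataset
from tokenizers import Tokenizer, normalizers, pre_tokenizers, processors
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Steps 1-2: load the corpus and iterate over it in batches of 1,000 samples.
dataset = load_dataset("mshojaei77/PersianTelegramChannels", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]  # column name "text" is an assumption

# Step 3: BPE model with the normalization and pre-tokenization described above.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(), normalizers.StripAccents(), normalizers.Lowercase(),
    normalizers.Strip(), normalizers.Replace("\u200c", " "),  # ZWNJ; target assumed
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Step 4: train with the five special tokens; default vocabulary size kept.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

# Step 5: wrap sequences in [CLS] ... [SEP] via TemplateProcessing.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Steps 6-7: save to disk and wrap for Hugging Face Transformers.
tokenizer.save("persian_bpe_tokenizer.json")
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)
```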
## How to Use
### Installation
To use the `PersianBPETokenizer`, first install the required libraries:
```bash
pip install -q --upgrade datasets tokenizers transformers
```
### Loading the Tokenizer
You can load the tokenizer using the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer
persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
```
### Tokenization Example
```python
# Split the sentence into subword tokens.
test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)

# Encode to input IDs (special tokens added automatically) and round-trip back.
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))
```
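### Batched Encoding for Fine-tuning
For fine-tuning with models such as BERT or RoBERTa, the tokenizer also supports the standard batched interface. A short sketch is shown below; the padding/truncation settings and `max_length=128` are illustrative choices, and `return_tensors="pt"` assumes PyTorch is installed.
```python
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

batch = persian_tokenizer(
    ["سلام، چطور هستید؟", "امیدوارم روز خوبی داشته باشید"],
    padding=True,         # pad the shorter sentence with [PAD]
    truncation=True,      # cut sequences longer than max_length
    max_length=128,       # illustrative limit, not a card-specified value
    return_tensors="pt",  # PyTorch tensors, ready for a model's forward pass
)
print(batch["input_ids"].shape)       # (2, padded_length)
print(batch["attention_mask"].shape)
```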
## Acknowledgments
- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Libraries**: Hugging Face `datasets`, `tokenizers`, and `transformers`
## References
- [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/tokenizers/index)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)