---
datasets:
- mshojaei77/PersianTelegramChannels
language:
- fa
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- tokenizer
- persian
- bpe
---
# PersianBPETokenizer Model Card
## Model Details
### Model Description
The `PersianBPETokenizer` is a custom tokenizer for the Persian (Farsi) language. It uses the Byte-Pair Encoding (BPE) algorithm to build a vocabulary that handles characteristics of Persian text such as the zero-width non-joiner (ZWNJ, the "half-space"). The tokenizer is designed to plug into Transformer language models such as BERT and RoBERTa, making it useful for a wide range of Persian NLP tasks.
### Model Type
- **Tokenization Algorithm**: Byte-Pair Encoding (BPE)
- **Normalization**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
- **Pre-tokenization**: Whitespace
- **Post-processing**: TemplateProcessing for special tokens
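For reference, a pipeline with these components can be reconstructed with the `tokenizers` library. The snippet below is a sketch, not the exact released configuration: the special-token IDs in the post-processor and the ZWNJ replacement target are assumptions.
```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, processors
from tokenizers.models import BPE

# BPE model with the unknown token used by this tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalization: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.StripAccents(),
    normalizers.Lowercase(),
    normalizers.Strip(),
    normalizers.Replace("\u200c", " "),  # ZWNJ; space vs. deletion is an assumption
])

# Pre-tokenization on whitespace
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Post-processing that wraps sequences in [CLS]/[SEP]
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # token IDs are placeholders
)
```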
### Model Version
- **Version**: 1.0
- **Date**: September 6, 2024
### License
- **License**: MIT
### Developers
- **Developed by**: Mohammad Shojaei
- **Contact**: Shojaei.dev@gmail.com
### Citation
If you use this tokenizer in your research, please cite it as:
```
Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.
```
## Model Use
### Intended Use
- **Primary Use**: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, machine translation, and more.
- **Secondary Use**: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets.
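As a sketch of that fine-tuning workflow, the snippet below tokenizes a Persian dataset for a downstream model. The `"text"` column name and the `max_length` value are assumptions, not properties of the released tokenizer.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
dataset = load_dataset("mshojaei77/PersianTelegramChannels", split="train")

def tokenize_batch(batch):
    # "text" is assumed to be the dataset's text column
    return persian_tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_batch, batched=True)
```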
### Out-of-Scope Use
- **Non-Persian Text**: This tokenizer is not designed for languages other than Persian.
- **Non-NLP Tasks**: It is not intended for use in non-NLP tasks such as image processing or audio analysis.
## Data
### Training Data
- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Description**: A rich collection of Persian text extracted from various Telegram channels. This dataset provides a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
- **Size**: 60,730 samples
### Data Preprocessing
- **Normalization**: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and removed ZWNJ characters.
- **Pre-tokenization**: Used whitespace pre-tokenization.
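You can observe the effect of this normalization on the released tokenizer through its `backend_tokenizer`. In the sketch below, the exact output depends on what the ZWNJ (`U+200C`) is replaced with, which is an implementation detail of the released files.
```python
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

# A Persian word written with a ZWNJ ("half-space")
raw = "می\u200cروم"
normalized = persian_tokenizer.backend_tokenizer.normalizer.normalize_str(raw)
print(normalized)
```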
## Performance
### Evaluation Metrics
- **Tokenization Accuracy**: The tokenizer has been spot-checked on a variety of Persian sentences; tokenization, encoding, and round-trip decoding behave as expected.
- **Compatibility**: Loads as a `PreTrainedTokenizerFast`, so it integrates with standard Hugging Face Transformers workflows.
### Known Limitations
- **Vocabulary Size**: The current vocabulary size is based on the training data. For very specialized domains, additional fine-tuning or training on domain-specific data may be required.
- **Out-of-Vocabulary Words**: Rare or domain-specific words may be tokenized as unknown tokens (`[UNK]`).
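A quick way to observe this behavior is to tokenize text containing characters unlikely to appear in the Telegram training corpus. Whether `[UNK]` actually shows up depends on the alphabet learned during training, so treat this as an illustrative check:
```python
from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

# Characters never seen during training fall back to the unknown token
tokens = persian_tokenizer.tokenize("سلام 🙂")
print(tokens)
print("Contains [UNK]:", persian_tokenizer.unk_token in tokens)
```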
## Training Procedure
### Training Steps
1. **Environment Setup**: Installed necessary libraries (`datasets`, `tokenizers`, `transformers`).
2. **Data Preparation**: Loaded the `mshojaei77/PersianTelegramChannels` dataset and created a batch iterator for efficient training.
3. **Tokenizer Model**: Initialized the tokenizer with a BPE model and applied normalization and pre-tokenization steps.
4. **Training**: Trained the tokenizer on the Persian text corpus using the BPE algorithm.
5. **Post-processing**: Set up post-processing to handle special tokens.
6. **Saving**: Saved the tokenizer to disk for future use.
7. **Compatibility**: Converted the tokenizer to a `PreTrainedTokenizerFast` object for compatibility with Hugging Face Transformers.
### Hyperparameters
- **Special Tokens**: `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Batch Size**: 1000 samples per batch
- **Normalization Steps**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
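The sketch below ties the steps and hyperparameters above together. The `"text"` column name, the trainer's default vocabulary size, and the omitted normalizer/pre-tokenizer setup (see the pipeline sketch earlier) are assumptions rather than values read from the released tokenizer.
```python
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

dataset = load_dataset("mshojaei77/PersianTelegramChannels", split="train")

def batch_iterator(batch_size=1000):
    # Yield the corpus in batches of 1000 samples, as described above
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]  # "text" column is an assumption

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# ... normalizer / pre-tokenizer / post-processor setup as sketched earlier ...

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

# Save to disk and wrap for Hugging Face Transformers compatibility
tokenizer.save("persian_bpe_tokenizer.json")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```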
## How to Use
### Installation
To use the `PersianBPETokenizer`, first install the required libraries:
```bash
pip install -q --upgrade datasets tokenizers transformers
```
### Loading the Tokenizer
You can load the tokenizer using the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer
persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
```
### Tokenization Example
```python
test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))
```
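Batch encoding works the same way. The example below assumes `[PAD]` is registered as the tokenizer's padding token:
```python
batch = persian_tokenizer(
    ["سلام دنیا", "امروز هوا خیلی خوب است"],
    padding=True,
    truncation=True,
)
print(batch["input_ids"])
```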
## Acknowledgments
- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Libraries**: Hugging Face `datasets`, `tokenizers`, and `transformers`
## References
- [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/tokenizers/index)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)