---
language:
- fa
- en
license: mit
library_name: transformers
tags:
- Tokenizer
---
# Improved LLaMA 2 Tokenizer with Persian Language Support
## Model Description
This tokenizer is an improved version of the LLaMA 2 tokenizer, specifically enhanced to provide better support for the Persian language. It combines the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, resulting in improved tokenization for Persian text while maintaining support for other languages.
### Key Features
- Enhanced support for Persian language tokenization
- Maintained multilingual capabilities of the original LLaMA 2 tokenizer
- Improved handling of Persian-specific characters and word structures
- Larger vocabulary size to accommodate Persian tokens
## Training Data
The tokenizer was created using the following steps:
1. A separate tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns.
2. This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.
## Training Procedure
1. Persian Wikipedia Tokenizer Training:
- Corpus: Persian Wikipedia dump
- Tokenization algorithm: BPE
- Vocabulary size: 5000
2. Merging with the LLaMA 2 Tokenizer (see the sketch below):
- Base tokenizer: LLaMA 2 tokenizer
- Final vocabulary size: 36954
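The exact merging script is not included here, but the two steps above can be reproduced approximately with SentencePiece and Transformers. The sketch below is a rough outline under stated assumptions, not the script used to build this tokenizer: the corpus path `fa_wiki.txt`, the base checkpoint id `meta-llama/Llama-2-7b-hf`, and the output file names are all placeholders, and the protobuf import may differ between sentencepiece versions.

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2  # import path may vary by version
from transformers import LlamaTokenizer

# 1) Train a 5,000-piece BPE tokenizer on a plain-text Persian Wikipedia dump.
#    "fa_wiki.txt" is a hypothetical path to the extracted corpus.
spm.SentencePieceTrainer.train(
    input="fa_wiki.txt",
    model_prefix="fa_bpe",
    vocab_size=5000,
    model_type="bpe",
    character_coverage=1.0,  # keep all Persian characters
)

# 2) Load the LLaMA 2 tokenizer and both SentencePiece model protos.
llama_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder id
llama_sp = sp_pb2.ModelProto()
llama_sp.ParseFromString(llama_tok.sp_model.serialized_model_proto())

fa_sp = sp_pb2.ModelProto()
with open("fa_bpe.model", "rb") as f:
    fa_sp.ParseFromString(f.read())

# 3) Append every Persian piece that the LLaMA 2 vocabulary does not already contain.
existing = {p.piece for p in llama_sp.pieces}
for p in fa_sp.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        llama_sp.pieces.append(new_piece)

# 4) Serialize the merged model and wrap it as a Hugging Face tokenizer.
with open("merged_llama2_fa.model", "wb") as f:
    f.write(llama_sp.SerializeToString())

merged = LlamaTokenizer(vocab_file="merged_llama2_fa.model")
merged.save_pretrained("llama2-persian-tokenizer")
print(len(merged))  # final merged vocabulary size
```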
## Usage
To use this tokenizer with the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")
# Example usage
text = "این یک مثال به زبان فارسی است."  # "This is an example in Persian."
tokens = tokenizer(text)
print(tokens)
```
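To see the effect of the added Persian vocabulary, you can compare token counts against the original LLaMA 2 tokenizer. This is an illustrative sketch; both repository ids are placeholders:

```python
from transformers import AutoTokenizer, LlamaTokenizer

# Placeholder repository ids; substitute the paths you actually use.
base = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
merged = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")

text = "این یک مثال به زبان فارسی است."  # "This is an example in Persian."
print(len(base.tokenize(text)))    # token count with the original LLaMA 2 tokenizer
print(len(merged.tokenize(text)))  # typically fewer tokens with the merged vocabulary
```

If you pair this tokenizer with a LLaMA 2 model, the enlarged vocabulary requires resizing the model's embedding matrix, for example with `model.resize_token_embeddings(len(tokenizer))`, before fine-tuning.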