---
language:
- fa
- en
license: mit
library_name: transformers
tags:
- Tokenizer
---

# Improved LLaMA 2 Tokenizer with Persian Language Support

## Model Description

This tokenizer is an improved version of the LLaMA 2 tokenizer, specifically enhanced to provide better support for the Persian language. It combines the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, resulting in improved tokenization for Persian text while maintaining support for other languages.

### Key Features

- Enhanced support for Persian language tokenization
- Maintained multilingual capabilities of the original LLaMA 2 tokenizer
- Improved handling of Persian-specific characters and word structures
- Larger vocabulary size to accommodate Persian tokens

## Training Data

The tokenizer was created using the following steps:

1. A separate tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns.
2. This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.
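
A minimal sketch of step 1, assuming the `sentencepiece` trainer was used (the LLaMA 2 tokenizer itself is a SentencePiece model). The corpus path, model prefix, and any option not stated in this card are illustrative assumptions:

```python
import sentencepiece as spm

# Illustrative only: the exact corpus file, preprocessing, and trainer options
# are not published. vocab_size follows the "5000" figure stated in this card.
spm.SentencePieceTrainer.train(
    input="fa_wiki.txt",    # plain-text dump of Persian Wikipedia (placeholder path)
    model_prefix="fa_sp",   # writes fa_sp.model and fa_sp.vocab
    model_type="bpe",
    vocab_size=5000,
)
```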

## Training Procedure

1. Persian Wikipedia Tokenizer Training:
   - Corpus: Persian Wikipedia dump
   - Tokenization algorithm: BPE
   - Vocabulary size: 5000

2. Merging with LLaMA 2 Tokenizer:
   - Base tokenizer: LLaMA 2 tokenizer
   - Final vocabulary size: 36954
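
A common way to carry out step 2 is to append the new Persian pieces to the LLaMA 2 SentencePiece model proto, skipping pieces that already exist. The sketch below illustrates that approach under the assumption that both tokenizers are SentencePiece models; the file names are placeholders, and this is not necessarily the exact script used for this repository:

```python
import sentencepiece.sentencepiece_model_pb2 as sp_pb2

# Load both SentencePiece model protos (placeholder file names).
llama_proto = sp_pb2.ModelProto()
with open("llama2_tokenizer.model", "rb") as f:
    llama_proto.ParseFromString(f.read())

persian_proto = sp_pb2.ModelProto()
with open("fa_sp.model", "rb") as f:
    persian_proto.ParseFromString(f.read())

# Append Persian pieces that the LLaMA 2 vocabulary does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in persian_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())

print("merged vocabulary size:", len(llama_proto.pieces))
```

The merged `.model` file can then be wrapped with `transformers.LlamaTokenizer(vocab_file="merged_tokenizer.model")` and saved with `save_pretrained` for sharing; again, this reflects a common community workflow rather than a documented step of this card.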

## Usage

To use this tokenizer with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")

# Example usage
text = "این یک مثال به زبان فارسی است."  # "This is an example in Persian."
print(tokenizer.tokenize(text))       # subword tokens from the merged vocabulary
print(tokenizer(text)["input_ids"])   # corresponding token IDs
```
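
To see the effect of the added Persian vocabulary, you can compare token counts against the base LLaMA 2 tokenizer. This check is illustrative: `meta-llama/Llama-2-7b-hf` is a gated repository that requires approved access, and `your-username/llama2-persian-tokenizer` is the placeholder name used above.

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated; requires access
improved = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")

text = "این یک مثال به زبان فارسی است."  # "This is an example in Persian."
print("base tokens:    ", len(base.tokenize(text)))
print("improved tokens:", len(improved.tokenize(text)))
# With the extra Persian vocabulary, the improved tokenizer should generally
# produce fewer tokens per Persian sentence.
```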