---
language:
- fa
- en
license: mit
library_name: transformers
tags:
- Tokenizer
---

# Improved LLaMA 2 Tokenizer with Persian Language Support

## Model Description

This tokenizer is an improved version of the LLaMA 2 tokenizer, specifically enhanced to provide better support for the Persian language. It combines the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, resulting in improved tokenization of Persian text while maintaining support for other languages.

### Key Features

- Enhanced support for Persian language tokenization
- Maintained multilingual capabilities of the original LLaMA 2 tokenizer
- Improved handling of Persian-specific characters and word structures
- Larger vocabulary size to accommodate Persian tokens

## Training Data

The tokenizer was created in two steps:

1. A separate tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns.
2. This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.

## Training Procedure

1. Persian Wikipedia tokenizer training:
   - Corpus: Persian Wikipedia dump
   - Tokenization algorithm: BPE
   - Vocabulary size: 5000
2. Merging with the LLaMA 2 tokenizer:
   - Base tokenizer: LLaMA 2 tokenizer
   - Final vocabulary size: 36954

An illustrative sketch of both steps is included at the end of this card.

## Usage

To use this tokenizer with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")

# Example usage
text = "این یک مثال به زبان فارسی است."  # "This is an example in the Persian language."
tokens = tokenizer(text)
print(tokens)
```
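To see the effect of the added Persian tokens, you can compare token counts against the base LLaMA 2 tokenizer. The snippet below is a minimal sketch: it assumes you have access to the gated `meta-llama/Llama-2-7b-hf` repository (any copy of the original LLaMA 2 tokenizer works in its place).

```python
from transformers import AutoTokenizer

persian_tok = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumes gated access

text = "این یک مثال به زبان فارسی است."  # "This is an example in the Persian language."

# The merged tokenizer should need noticeably fewer tokens for Persian text.
print("merged tokenizer:", len(persian_tok.tokenize(text)), "tokens")
print("base tokenizer:  ", len(base_tok.tokenize(text)), "tokens")
```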
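## Construction Sketch

The two steps in the Training Procedure can be reproduced with the `sentencepiece` library. The following is a minimal sketch under stated assumptions, not the exact script used to build this tokenizer; `fa_wiki.txt` is a hypothetical path standing in for an extracted Persian Wikipedia text dump.

```python
import sentencepiece as spm

# Step 1: train a small BPE tokenizer on Persian Wikipedia text.
# "fa_wiki.txt" is a placeholder for the extracted Persian Wikipedia dump.
spm.SentencePieceTrainer.train(
    input="fa_wiki.txt",
    model_prefix="fa_bpe",
    model_type="bpe",
    vocab_size=5000,
    character_coverage=1.0,  # keep all Persian characters in the vocabulary
)
# Produces fa_bpe.model and fa_bpe.vocab in the working directory.
```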
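Step 2 merges the new pieces into the LLaMA 2 SentencePiece model. Again a sketch, assuming the original LLaMA 2 `tokenizer.model` is available locally at a hypothetical path; the merge appends every Persian piece that the base vocabulary does not already contain.

```python
from sentencepiece import sentencepiece_model_pb2 as sp_model
from transformers import LlamaTokenizer

# Load the protobuf definitions of both SentencePiece models.
llama_proto = sp_model.ModelProto()
with open("llama2/tokenizer.model", "rb") as f:  # original LLaMA 2 tokenizer (assumed local path)
    llama_proto.ParseFromString(f.read())

fa_proto = sp_model.ModelProto()
with open("fa_bpe.model", "rb") as f:  # output of step 1
    fa_proto.ParseFromString(f.read())

# Append every Persian piece that the LLaMA 2 vocabulary does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in fa_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_model.ModelProto().SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())

# Wrap the merged model as a Hugging Face tokenizer and save it.
tokenizer = LlamaTokenizer(vocab_file="merged_tokenizer.model")
tokenizer.save_pretrained("llama2-persian-tokenizer")
print("final vocabulary size:", len(llama_proto.pieces))
```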