Tifinagh-Unigram-Tokenizer-32K
Model Description
Tifinagh-Unigram-Tokenizer-32K is a custom tokenizer trained with the Unigram algorithm from the SentencePiece library. It is trained to process the Tamazight language written in the Tifinagh script and has a vocabulary of 32,000 (32K) tokens.
It is intended as a building block for Natural Language Processing (NLP) tasks involving Tamazight, such as developing Large Language Models (LLMs) and Machine Translation systems.
Model Details
- Algorithm: SentencePiece (Unigram)
- Vocabulary Size: 32,000
- Supported Script: Tifinagh
- Languages: Tamazight, Arabic, French, English
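For readers unfamiliar with the Unigram algorithm listed above, the following is a minimal sketch of its core idea: every vocabulary piece carries a log-probability, and the tokenizer picks the segmentation with the highest total score using Viterbi search. The `TOY_VOCAB` dictionary, its scores, and the `viterbi_segment` helper are hypothetical and are not taken from this tokenizer's model file. SentencePiece itself also handles whitespace markers and unknown characters, which this sketch omits.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability.
# The values are made up for illustration only.
TOY_VOCAB = {
    "ⴰ": math.log(0.10),
    "ⵣ": math.log(0.05),
    "ⵓ": math.log(0.05),
    "ⵍ": math.log(0.05),
    "ⴰⵣ": math.log(0.02),
    "ⵓⵍ": math.log(0.02),
    "ⴰⵣⵓⵍ": math.log(0.01),
}


def viterbi_segment(text, vocab):
    """Return (pieces, score) for the highest-scoring segmentation of text."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score of any segmentation of text[:i]
    back = [0] * (n + 1)                # start index of the last piece ending at i
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best_score[start] > -math.inf:
                score = best_score[start] + vocab[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]


if __name__ == "__main__":
    print(viterbi_segment("ⴰⵣⵓⵍ", TOY_VOCAB))
    print(viterbi_segment("ⴰⵣⵓ", TOY_VOCAB))
```

With these made-up scores, the whole word "ⴰⵣⵓⵍ" is kept as a single piece because one piece with probability 0.01 outscores any product of smaller pieces, while "ⴰⵣⵓ" falls back to "ⴰⵣ" followed by "ⵓ".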
Usage
You can integrate this tokenizer into your Python projects to process Tamazight text. Install the sentencepiece and huggingface_hub libraries, both of which the example imports. The example downloads the tokenizer model file (tifinagh_unigram.model) from this repository.
Example Code (usage_example.py)
import sentencepiece as spm
from huggingface_hub import hf_hub_download
# Download the tokenizer model directly from Hugging Face
model_path = hf_hub_download(repo_id="Tifinagh/Tifinagh-Unigram-Tokenizer-32K", filename="tifinagh_unigram.model")
# Load the tokenizer model
tokenizer = spm.SentencePieceProcessor()
tokenizer.load(model_path)
# Example usage: encode to subword pieces, then round-trip the text through token IDs
text = "ⴰⵣⵓⵍ ⴰⵎⴰⴹⴰⵍ! ⵡⴰ ⵉⴳⴰ ⵉⵔⵉⵎ ⵏ ⵓⵙⵙⵎⵔⵙ ⵏ ⵓⵙⵏⵉⴳⵍ ⵏ ⵓⵙⴽⴽⵉⵍ ⵏ ⵜⴼⵉⵏⴰⵖ"
tokens = tokenizer.encode(text, out_type=str)
decoded = tokenizer.decode(tokenizer.encode(text))
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Decoded: {decoded}")
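A common sanity check after loading a tokenizer is its fertility, meaning the average number of tokens produced per whitespace-separated word. The helper below is a hypothetical sketch and not part of this repository. It accepts any callable that maps a string to a list of tokens, so it can be tried with a stand-in encoder before wiring in the real one.

```python
def tokens_per_word(encode, texts):
    """Average number of tokens per whitespace-separated word across texts.

    `encode` is any callable that takes a string and returns a list of tokens.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words found in texts")
    return total_tokens / total_words


if __name__ == "__main__":
    # Stand-in encoder for demonstration only: one token per word.
    print(tokens_per_word(lambda s: s.split(), ["ⴰⵣⵓⵍ ⴰⵎⴰⴹⴰⵍ"]))
```

With the tokenizer loaded as in the example above, the same helper could be called as `tokens_per_word(lambda s: tokenizer.encode(s, out_type=str), [text])`. Lower values generally indicate that words are split into fewer pieces.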