Tifinagh-Unigram-Tokenizer-32K

Model Description

Tifinagh-Unigram-Tokenizer-32K is a custom tokenizer built with the Unigram algorithm from the SentencePiece library. It is trained specifically for Tamazight text written in the Tifinagh script and has a vocabulary of 32,000 (32K) tokens.

This tokenizer is intended as a foundation for Natural Language Processing (NLP) tasks involving Tamazight, including the development of Large Language Models (LLMs) and Machine Translation systems.

Model Details

Algorithm: SentencePiece (Unigram)
Vocabulary Size: 32,000
Supported Script: Tifinagh
Languages: Tamazight, Arabic, French, English
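
To illustrate what the Unigram algorithm does at encoding time, here is a minimal sketch of the idea: each vocabulary piece has a probability, and the tokenizer picks the segmentation whose pieces have the highest combined probability. The vocabulary and probabilities below are hypothetical toy values, not taken from this model, and the function name `segment` is only illustrative. The real SentencePiece implementation also handles whitespace markers and unknown characters, which this sketch omits.

```python
import math

# Hypothetical toy vocabulary: piece -> probability (NOT the real model's values).
TOY_VOCAB = {
    "ⴰ": 0.10, "ⵣ": 0.05, "ⵓ": 0.05, "ⵍ": 0.05,
    "ⴰⵣ": 0.05, "ⵓⵍ": 0.10, "ⴰⵣⵓⵍ": 0.20,
}


def segment(text, vocab):
    """Return the most probable segmentation of `text` under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start] > -math.inf:
                score = best[start] + math.log(vocab[piece])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


print(segment("ⴰⵣⵓⵍ", TOY_VOCAB))
```

With these toy values the whole word is kept as a single piece, because its probability (0.20) is higher than the product of any split into smaller pieces.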

Usage

To use this tokenizer in a Python project, install the sentencepiece and huggingface_hub libraries, then download the tokenizer model file from this repository (the example below fetches tifinagh_unigram.model).

Example Code (`usage_example.py`)

import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the tokenizer model directly from Hugging Face
model_path = hf_hub_download(repo_id="Tifinagh/Tifinagh-Unigram-Tokenizer-32K", filename="tifinagh_unigram.model")

# Load the tokenizer model
tokenizer = spm.SentencePieceProcessor()
tokenizer.load(model_path)

# Example usage
text = "ⴰⵣⵓⵍ ⴰⵎⴰⴹⴰⵍ! ⵡⴰ ⵉⴳⴰ ⵉⵔⵉⵎ ⵏ ⵓⵙⵙⵎⵔⵙ ⵏ ⵓⵙⵏⵉⴳⵍ ⵏ ⵓⵙⴽⴽⵉⵍ ⵏ ⵜⴼⵉⵏⴰⵖ"
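# Encode to subword pieces (strings) instead of integer IDs
tokens = tokenizer.encode(text, out_type=str)
# Round-trip check: encode to integer IDs, then decode back to text
decoded = tokenizer.decode(tokenizer.encode(text))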

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Decoded: {decoded}")
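
Because the tokenizer is trained for Tifinagh-script text, it can be useful to check how much of an input is actually written in Tifinagh before encoding it. The following is a minimal sketch, assuming the Unicode Tifinagh block (U+2D30 to U+2D7F); the helper name `tifinagh_ratio` is hypothetical and not part of this repository or of SentencePiece.

```python
# Unicode Tifinagh block boundaries (U+2D30..U+2D7F).
TIFINAGH_START = 0x2D30
TIFINAGH_END = 0x2D7F


def tifinagh_ratio(text):
    """Return the fraction of non-whitespace characters in the Tifinagh block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for c in chars if TIFINAGH_START <= ord(c) <= TIFINAGH_END)
    return in_block / len(chars)


print(tifinagh_ratio("ⴰⵣⵓⵍ ⴰⵎⴰⴹⴰⵍ!"))
print(tifinagh_ratio("hello"))
```

A low ratio suggests the input is mostly in another script (for example Latin transliteration), in which case the tokenizer may split it into many small pieces.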
