# Kirundi Tokenizer

A SentencePiece tokenizer model trained for Kirundi (a Bantu language spoken primarily in Burundi). It segments Kirundi text into subword tokens for downstream NLP tasks.
## Model Details

- Model type: SentencePiece
- Vocabulary size: 32,000
- Training corpus: cleaned Kirundi text (see Training Data below)
## Training Data

The tokenizer was trained on a corpus of Kirundi text collected from multiple sources. Before training, the data was cleaned to remove unwanted characters and prepared for tokenization.
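The precise cleaning pipeline is not published with the model. As an illustration only, a minimal sketch of the kind of preprocessing described above might normalize Unicode, strip control characters, and collapse whitespace — the function name and the specific rules below are assumptions, not the actual pipeline:

```python
import re
import unicodedata

def clean_line(line: str) -> str:
    """Hypothetical cleaning step (not the model's actual pipeline):
    normalize Unicode, map control characters to spaces, and collapse
    runs of whitespace."""
    line = unicodedata.normalize('NFC', line)
    # Replace control characters (Unicode category Cc, e.g. tabs,
    # newlines) with plain spaces.
    line = ''.join(ch if unicodedata.category(ch) != 'Cc' else ' '
                   for ch in line)
    # Collapse consecutive whitespace into single spaces and trim.
    return re.sub(r'\s+', ' ', line).strip()

print(clean_line('  Ndakunda\tigihugu\n canje.  '))
```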
## How to Use

```python
import sentencepiece as spm

# Load the tokenizer model file shipped with this repository
sp = spm.SentencePieceProcessor(model_file='kirundi.model')

# Tokenize: encode a sentence into subword pieces
text = "Ndakunda igihugu canje."  # "I love my country."
tokens = sp.encode(text, out_type=str)
print(tokens)

# Detokenize: reassemble the original text from the pieces
decoded_text = sp.decode(tokens)
print(decoded_text)  # Ndakunda igihugu canje.
```