# Kirundi Tokenizer

A SentencePiece tokenizer model trained for Kirundi (a Bantu language spoken primarily in Burundi). It segments Kirundi text into subword tokens for downstream NLP tasks.
## Model Details

- Model type: SentencePiece
- Vocabulary size: 32,000
- Training corpus: cleaned Kirundi text (see Training Data below)
## Training Data

The tokenizer was trained on a corpus of Kirundi text collected from multiple sources. Before training, the data was cleaned to remove unwanted characters and prepared for tokenization.
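The precise cleaning pipeline is not published with the model. As an illustration only, a minimal sketch of the kind of preprocessing described above might normalize Unicode, strip control characters, and collapse whitespace — the function name and the specific rules below are assumptions, not the actual pipeline:

```python
import re
import unicodedata

def clean_line(line: str) -> str:
    """Hypothetical cleaning step (not the model's actual pipeline):
    normalize Unicode, map control characters to spaces, and collapse
    runs of whitespace."""
    line = unicodedata.normalize('NFC', line)
    # Replace control characters (Unicode category Cc, e.g. tabs,
    # newlines) with plain spaces.
    line = ''.join(ch if unicodedata.category(ch) != 'Cc' else ' '
                   for ch in line)
    # Collapse consecutive whitespace into single spaces and trim.
    return re.sub(r'\s+', ' ', line).strip()

print(clean_line('  Ndakunda\tigihugu\n canje.  '))
```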
## How to Use

```python
import sentencepiece as spm

# Load the tokenizer model file shipped with this repository
sp = spm.SentencePieceProcessor(model_file='kirundi.model')

# Tokenize: encode a sentence into subword pieces
text = "Ndakunda igihugu canje."  # "I love my country."
tokens = sp.encode(text, out_type=str)
print(tokens)

# Detokenize: reassemble the original text from the pieces
decoded_text = sp.decode(tokens)
print(decoded_text)  # Ndakunda igihugu canje.
```