Kirundi Tokenizer and LoRA Model
Model Description
This repository contains two main components:
- A BPE tokenizer trained specifically for the Kirundi language (ISO code: run)
- A LoRA adapter trained for Kirundi language processing
Tokenizer Details
- Type: BPE (Byte-Pair Encoding)
- Vocabulary Size: 30,000 tokens
- Special Tokens: [UNK], [CLS], [SEP], [PAD], [MASK]
- Pre-tokenization: Whitespace-based
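As a reference, the settings above correspond to the following construction in Hugging Face's Tokenizers library (a minimal sketch; the actual training script is not included in this repository):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# BPE model with [UNK] as the fallback token and whitespace pre-tokenization,
# matching the settings listed above.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()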
LoRA Adapter Details
- Base Model: [To be filled with your chosen base model]
- Rank: 8
- Alpha: 32
- Target Modules: Query and Value attention matrices
- Dropout: 0.05
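These hyperparameters map onto a PEFT LoraConfig roughly as follows (a sketch; the task type and the exact module names depend on the chosen base model and are assumptions here):

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # query/value projections; names vary by architecture
    lora_dropout=0.05,
    task_type="SEQ_CLS",                  # assumption, matching the usage example below
)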
Intended Uses & Limitations
Intended Uses
- Text processing for Kirundi language
- Machine translation tasks involving Kirundi
- Natural language understanding tasks for Kirundi content
- Foundation for developing Kirundi language applications
Limitations
- The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text
Training Data
The model components were trained on the Kirundi-English parallel corpus:
- Dataset: eligapris/kirundi-english
- Size: 21.4k sentence pairs
- Nature: Parallel corpus with Kirundi and English translations
- Domain: Mixed domain including religious, general, and conversational text
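The corpus can be inspected with the `datasets` library (a sketch; confirm the split and column names on the dataset page before relying on them):

from datasets import load_dataset

dataset = load_dataset("eligapris/kirundi-english")
print(dataset)              # available splits and columns
print(dataset["train"][0])  # first sentence pair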
Training Procedure
Tokenizer Training
- Trained using Hugging Face's Tokenizers library
- BPE algorithm with a vocabulary size of 30k
- Includes special tokens for task-specific usage
- Trained on the Kirundi portion of the parallel corpus
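A minimal sketch of this procedure is shown below; the Kirundi column name ("rn") is an assumption and should be adjusted to the dataset's actual schema:

from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = load_dataset("eligapris/kirundi-english", split="train")

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=30_000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

def kirundi_sentences():
    # Stream only the Kirundi side of the parallel corpus ("rn" is assumed).
    for row in corpus:
        yield row["rn"]

tokenizer.train_from_iterator(kirundi_sentences(), trainer=trainer)
tokenizer.save("kirundi-bpe-tokenizer.json")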
LoRA Training
[To be filled with your specific training details]
- Number of epochs:
- Batch size:
- Learning rate:
- Training hardware:
- Training time:
Evaluation Results
[To be filled with your evaluation metrics]
- Coverage statistics:
- Out-of-vocabulary rate:
- Task-specific metrics:
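As a starting point, the out-of-vocabulary rate can be estimated by counting how often the tokenizer falls back to [UNK] on held-out Kirundi text (a sketch; the path and evaluation sentences are placeholders):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
unk_id = tokenizer.convert_tokens_to_ids("[UNK]")

eval_sentences = ["..."]  # held-out Kirundi sentences
ids = [i for s in eval_sentences for i in tokenizer(s)["input_ids"]]
oov_rate = sum(i == unk_id for i in ids) / max(len(ids), 1)
print(f"OOV rate: {oov_rate:.2%}")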
Environmental Impact
[To be filled with training compute details]
- Estimated CO2 emissions:
- Hardware used:
- Training duration:
Technical Specifications
Model Architecture
- Tokenizer: BPE-based with custom vocabulary
- LoRA Configuration:
  - r=8 (rank)
  - α=32 (scaling)
  - Applied to the query and value attention projections
  - Dropout rate: 0.05
Software Requirements
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0",
    "peft": ">=0.4.0",
}
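A quick way to confirm that compatible versions are installed (each of these packages exposes a `__version__` attribute):

import peft
import tokenizers
import transformers

print(transformers.__version__, tokenizers.__version__, peft.__version__)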
How to Use
Loading the Tokenizer
# Load the Kirundi BPE tokenizer; replace "path_to_tokenizer" with the actual
# local path or Hub repository ID.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
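Once loaded, the tokenizer can be applied to Kirundi text; the word below ("Amahoro", "peace") is only an illustration:

encoding = tokenizer("Amahoro")
print(encoding["input_ids"])                                   # token IDs
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # corresponding subword tokens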
Loading the LoRA Model
# Load the base model referenced in the adapter configuration, then attach the
# LoRA weights; replace "path_to_lora_model" with the actual path or repo ID.
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

config = PeftConfig.from_pretrained("path_to_lora_model")
model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, "path_to_lora_model")
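A minimal inference sketch under the setup above; it assumes the base model's embeddings are compatible with (or have been resized to) this tokenizer's vocabulary, and the sample input is illustrative:

import torch

inputs = tokenizer("Amahoro", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted class index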
Contact
Eligapris
Updates and Versions
- v1.0.0 (Initial Release)
- Base tokenizer and LoRA model
- Trained on Kirundi-English parallel corpus
- Basic functionality and documentation
Acknowledgments
- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries
- PEFT library for LoRA implementation