---
license: mit
datasets:
  - eligapris/kirundi-english
language:
  - rn
library_name: transformers
tags:
  - kirundi
  - rn
---

# Kirundi Tokenizer and LoRA Model

## Model Description

This repository contains two main components:

  1. A BPE tokenizer trained specifically for the Kirundi language (ISO 639-3 code: run)
  2. A LoRA adapter trained for Kirundi language processing

### Tokenizer Details

- Type: BPE (Byte-Pair Encoding)
- Vocabulary Size: 30,000 tokens
- Special Tokens: `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- Pre-tokenization: Whitespace-based

### LoRA Adapter Details

- Base Model: [to be specified]
- Rank: 8
- Alpha: 32
- Target Modules: query and value attention matrices
- Dropout: 0.05
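For reference, the adapter settings above might be expressed with the PEFT library roughly as follows. This is a sketch, not the exact training configuration: the target module names (`q_proj`, `v_proj`) and the sequence-classification task type are assumptions, since the base model is left unspecified.

```python
from peft import LoraConfig, TaskType

# Sketch of the adapter settings listed above. The target module names
# depend on the (unspecified) base model's attention layer naming, so
# "q_proj"/"v_proj" are assumptions, as is the task type.
lora_config = LoraConfig(
    r=8,                                  # rank
    lora_alpha=32,                        # scaling (alpha)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # query and value projections
    task_type=TaskType.SEQ_CLS,           # matches the loading example below
)
```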

## Intended Uses & Limitations

### Intended Uses

- Text processing for the Kirundi language
- Machine translation tasks involving Kirundi
- Natural language understanding tasks for Kirundi content
- Foundation for developing Kirundi language applications

### Limitations

- The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text

## Training Data

The model components were trained on the Kirundi-English parallel corpus:

- Dataset: eligapris/kirundi-english (see the loading sketch below)
- Size: 21.4k sentence pairs
- Nature: parallel corpus with Kirundi and English translations
- Domain: mixed, including religious, general, and conversational text
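If you want to inspect or reproduce the training data, the corpus can be pulled directly from the Hugging Face Hub. The snippet below only prints the dataset structure; consult the dataset card for the exact splits and column names.

```python
from datasets import load_dataset

# Load the Kirundi-English parallel corpus from the Hub and inspect its
# structure; see the dataset card for the exact splits and column names.
dataset = load_dataset("eligapris/kirundi-english")
print(dataset)
```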

## Training Procedure

### Tokenizer Training

- Trained using Hugging Face's Tokenizers library
- BPE algorithm with a vocabulary size of 30k
- Includes special tokens for task-specific usage
- Trained on the Kirundi portion of the parallel corpus
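A minimal sketch of such a training setup with the Tokenizers library is shown below. The corpus file name is a placeholder for a plain-text file holding the Kirundi side of the parallel corpus, not an actual file in this repository.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer with whitespace pre-tokenization and train it on the
# Kirundi text. "kirundi_corpus.txt" is a placeholder path.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["kirundi_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```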

### LoRA Training

[Training details to be added]

- Number of epochs:
- Batch size:
- Learning rate:
- Training hardware:
- Training time:

## Evaluation Results

[Evaluation metrics to be added]

- Coverage statistics:
- Out-of-vocabulary rate:
- Task-specific metrics:

## Environmental Impact

[Training compute details to be added]

- Estimated CO2 emissions:
- Hardware used:
- Training duration:

## Technical Specifications

### Model Architecture

- Tokenizer: BPE-based with custom vocabulary
- LoRA Configuration:
  - r=8 (rank)
  - α=32 (scaling)
  - Applied to the query and value attention projections
  - Dropout rate: 0.05

### Software Requirements

```python
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0",
    "peft": ">=0.4.0"
}
```

## How to Use

### Loading the Tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
```
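Once loaded, it behaves like any Hugging Face fast tokenizer. The sentence below is purely illustrative:

```python
# Illustrative usage; the input sentence is an arbitrary example.
encoded = tokenizer("Amakuru yawe?")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```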

### Loading the LoRA Model

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

config = PeftConfig.from_pretrained("path_to_lora_model")
model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, "path_to_lora_model")
```
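A minimal inference sketch, assuming the sequence-classification setup from the snippet above and the tokenizer loaded earlier (whether the base model accepts all of the tokenizer's outputs depends on the base model chosen):

```python
import torch

# Run the adapted model on a tokenized input and read off the predicted class.
model.eval()
inputs = tokenizer("Amakuru yawe?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.argmax(dim=-1).item())
```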

## Contact

Eligapris


## Updates and Versions

- v1.0.0 (Initial Release)
  - Base tokenizer and LoRA model
  - Trained on Kirundi-English parallel corpus
  - Basic functionality and documentation

## Acknowledgments

- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries
- PEFT library for LoRA implementation