
Libraries

How to use theophilusowiti/Caracal_instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="theophilusowiti/Caracal_instruct", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("theophilusowiti/Caracal_instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("theophilusowiti/Caracal_instruct", trust_remote_code=True)


vLLM

How to use theophilusowiti/Caracal_instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "theophilusowiti/Caracal_instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "theophilusowiti/Caracal_instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
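
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_completion_request` and `complete` are hypothetical and not part of vLLM; the response is assumed to follow the OpenAI completions shape (`choices[0].text`).

```python
import json
import urllib.request

MODEL_ID = "theophilusowiti/Caracal_instruct"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed OpenAI-compatible response: the text lives in choices[0].text.
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` mirrors the curl command. For the SGLang server, passing `base_url="http://localhost:30000"` should work the same way, since it exposes the same endpoint path.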

Use Docker

docker model run hf.co/theophilusowiti/Caracal_instruct

SGLang

How to use theophilusowiti/Caracal_instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "theophilusowiti/Caracal_instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "theophilusowiti/Caracal_instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "theophilusowiti/Caracal_instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "theophilusowiti/Caracal_instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use theophilusowiti/Caracal_instruct with Docker Model Runner:
```
docker model run hf.co/theophilusowiti/Caracal_instruct
```

Caracal_instruct

Model Description

Caracal_instruct is an instruction-tuned model, produced as part of the AfriLLMQuant pilot project. It was trained via full Quantization-Aware Training (QAT).

The underlying small language model (SLM) backbone for this QAT process was InkubaLM-0.4B, continued-pretrained on African-language data and then instruction-tuned to produce Caracal_instruct. The model is recommended as a starting point for fine-tuning on a specific task.


Base model	theophilusowiti/Caracal_GPT
Training method	Full Quantization-Aware Training (QAT)
Quantization	INT4
Memory footprint (INT8)	~1.17 GB
QAT training time	~3 days on 1x NVIDIA A100 80GB
License	Apache 2.0
Training Dataset	Muri Dataset

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "theophilusowiti/Caracal_instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "<s><Input>\nWho is the president of Kenya?\n</Input>\n<Answer>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Expected output

The model expects the <Input>...</Input> / <Answer>...</Answer> instruction format used during SFT, e.g.:

<Input>
Who is the president of Kenya?
</Input>
<Answer>
 William Ruto

William Ruto</Answer>
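
Building the prompt and pulling the answer back out of the generated text can be wrapped in two small helpers. This is a minimal sketch, not part of the model's code: the names `build_prompt` and `extract_answer` are hypothetical, the leading `<s>` follows the prompt string in the usage snippet, and whether the `<Answer>` tags survive decoding depends on the tokenizer configuration, so the extractor falls back to returning the whole string.

```python
def build_prompt(question):
    """Wrap a question in the <Input>/<Answer> template, leaving <Answer> open."""
    return f"<s><Input>\n{question.strip()}\n</Input>\n<Answer>\n"


def extract_answer(generated):
    """Return the text between <Answer> and </Answer>.

    If </Answer> is missing, everything after <Answer> is returned.
    If <Answer> is missing too, the stripped input is returned unchanged.
    """
    _, opened, tail = generated.partition("<Answer>")
    if not opened:
        return generated.strip()
    answer, _, _ = tail.partition("</Answer>")
    return answer.strip()
```

For example, `build_prompt("Who is the president of Kenya?")` reproduces the prompt string used in the usage snippet, and `extract_answer` can be applied to the decoded output of `model.generate`.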

Top Performing Languages

Based on instruction-following behavior and perplexity (PPL), the model follows instructions most reliably in the languages below, especially when the prompt includes supporting context (e.g. in a RAG setup):

East Africa: Swahili (swa), Amharic (amh), Luganda (lug), Kinyarwanda (kin)

West Africa: Hausa (hau), Yoruba (yor), Igbo (ibo)

Central Africa: Lingala (lin)

Southern Africa: Xhosa (xho)

Training Procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.03
num_epochs: 3
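
The relationship between these values can be checked with a little arithmetic. The sketch below assumes a single device (consistent with 16 x 2 = 32), takes the total step count of 18,108 from the training results table, and models the schedule as linear warmup followed by cosine decay to zero. That is the usual shape of a cosine scheduler with warmup, but the exact implementation used for training is not stated here, and the function names are illustrative.

```python
import math

LEARNING_RATE = 5e-5
TRAIN_BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 2
WARMUP_RATIO = 0.03
TOTAL_STEPS = 18_108  # final step in the training results table


def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device * accum_steps * num_devices


def lr_at(step, total_steps=TOTAL_STEPS, peak=LEARNING_RATE,
          warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
    print("warmup steps:", math.ceil(TOTAL_STEPS * WARMUP_RATIO))
    for step in (0, 544, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, lr_at(step))
```

Under these assumptions the learning rate climbs to 5e-05 over roughly the first 3% of steps and then decays smoothly toward zero by the final step.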

Training results

Training Loss	Epoch	Step	Validation Loss
3.1297	0.9999	6,036	2.6244
2.7599	2.0	12,073	2.6072
2.7805	2.9998	18,108	2.6143
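
The validation losses can be converted to perplexity for easier comparison. This sketch assumes the reported loss is the mean per-token cross-entropy in natural-log units, which is the common convention but is not stated explicitly here; epochs are rounded to whole numbers for readability.

```python
import math

# Validation loss per (rounded) epoch, copied from the table above.
VAL_LOSS_BY_EPOCH = {1: 2.6244, 2: 2.6072, 3: 2.6143}


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


ppl_by_epoch = {epoch: perplexity(loss) for epoch, loss in VAL_LOSS_BY_EPOCH.items()}
best_epoch = min(VAL_LOSS_BY_EPOCH, key=VAL_LOSS_BY_EPOCH.get)

if __name__ == "__main__":
    for epoch, ppl in ppl_by_epoch.items():
        print(f"epoch {epoch}: perplexity {ppl:.2f}")
    print("lowest validation loss at epoch", best_epoch)
```

Under that assumption the lowest validation loss (2.6072, at epoch 2) corresponds to a perplexity of roughly 13.6, and the third epoch does not improve on it.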

Framework versions

Transformers 4.45.0
PyTorch 2.12.0+cu126
Datasets 4.8.5
Tokenizers 0.20.3


Model tree for theophilusowiti/Caracal_instruct

Base model: lelapa/InkubaLM-0.4B
Finetuned: theophilusowiti/Caracal_GPT
Finetuned: this model

Paper for theophilusowiti/Caracal_instruct

InkubaLM: A small language model for low-resource African languages

Paper • 2408.17024 • Published Aug 30, 2024

Evaluation results

Accuracy on IrokoBench AfriXNLI (self-reported): 33.320