Sunbird Tutor · Gemma 4 E2B

A Gemma 4 E2B fine-tune for spoken question-answering in English and Ugandan languages. Audio in, text answer out, no intermediate transcription step.

This is the multilingual SFT checkpoint described in the Sunbird Tutor writeup (Kaggle Gemma 4 Good Hackathon, 2026). It powers the Sunflower educational assistant, an Android app that runs the model fully offline: a child in a Ugandan classroom can ask a science question in Luganda and get an answer back in Luganda without an internet connection.

Training code: SunbirdAI/sunbird-tutor-modelling contains the multilingual training pipeline, data prep, and evaluation harness used to produce this model.

What it does

The model accepts 16 kHz mono PCM audio directly through Gemma 4's native audio tower, with no separate ASR step. Audio understanding happens in the same context window that produces the answer, so a single forward pass goes from a spoken question to a written reply. Depending on the prompt, the model can answer the question, transcribe the speech, translate it into another supported language, or explain what was said.
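Since the model expects a specific audio format, it can help to check a file before sending it to the processor. The following is a minimal sketch using only Python's standard `wave` module. The function name `check_wav_format` is hypothetical, and the 16-bit sample width is an assumption: the card states only "16 kHz mono PCM".

```python
import wave

EXPECTED_RATE_HZ = 16_000   # sample rate stated in this card
EXPECTED_CHANNELS = 1       # mono, as stated in this card
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample; 16-bit PCM is an assumption


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE_HZ}"
            )
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(
                f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit"
            )
    return problems
```

A caller might refuse or resample any file for which the returned list is not empty. The check only inspects the WAV header and does not convert anything.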

Vision was not exercised during fine-tuning. Text plus audio input only.

Languages

ISO 639-3	Language	Status
`eng`	English	strong
`lug`	Luganda	strongest non-English. chrF ~0.51 on the project eval.
`ach`	Acholi	second tier. chrF ~0.40, classroom-usable for shorter responses.
`nyn`	Runyankole	transcription and short translation reliable; QA degrades.
`xog`	Lusoga	transcription and short translation reliable; QA degrades.
`nyo`	Lunyoro	transcription and short translation reliable; QA degrades.
`teo`	Ateso	transcription and short translation reliable; QA degrades.

Quality tracks the volume of training data available per language, which is why Luganda is meaningfully ahead. The broader Sunbird Tutor project targets 12 Ugandan languages across multiple checkpoints; see the training repo for the larger picture.
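For readers unfamiliar with the chrF numbers in the table, here is a simplified sketch of a character n-gram F-score. It is not the project's evaluation harness, which lives in the training repo. The choices below (n-gram orders 1 to 6, beta = 2, whitespace removed, 0 to 1 scale) are assumptions based on common chrF defaults, and the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-1 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because the score works on characters rather than words, it gives partial credit for near-miss spellings, which is one reason it is often used for morphologically rich languages.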

How to use

Transformers (Python)

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "Sunbird/sunbirdtutor-gemma-4-e2b"

# Load the processor and the model weights in bfloat16.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        # Training-time system prompt; keep it verbatim.
        "role": "system",
        "content": (
            "You are an educational assistant that can give explanations, "
            "transcriptions and translations in Ugandan languages."
        ),
    },
    {
        # Answer mode: no text instruction, the audio carries the question.
        "role": "user",
        "content": [{"type": "audio", "audio": "path/to/16khz_mono.wav"}],
    },
]

# Apply the chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated answer is decoded.
answer = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(answer)

The exact auto-class for Gemma 4 audio depends on your transformers version. If AutoModelForImageTextToText does not pick up the audio tower, the loader scripts in sunbird-tutor-modelling have the working class names.

On-device

For mobile, use the Cactus INT4 quantization: Sunbird/sunflower-qa-cactus-int4. About 3.8 GB on disk, fully offline. The audio tower stays at FP16 so Luganda speech recognition survives quantization; the decoder is pushed to INT4.

Prompt format

Use this system prompt verbatim. The fine-tune saw it on every training example, and deviating from it measurably degrades quality:

You are an educational assistant that can give explanations, transcriptions and translations in Ugandan languages.

The user content varies by mode:

Mode	User content
Answer (default)	empty string. Audio carries the question.
Transcribe	`"Transcribe this audio."`
Translate	`"Translate this audio into {target language}."`
Explain	`"Explain what was said in this audio."`

The canonical runtime strings live in lib/model_settings_sheet.dart inside the Sunflower app.
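The mode table can be turned into a small message builder for the Transformers chat template. This is a sketch, not the app's implementation: `build_messages`, the lowercase mode keys, and the `target_language` parameter name are hypothetical, and placing the text instruction after the audio entry is an assumption. The canonical strings remain the ones in the Sunflower app.

```python
SYSTEM_PROMPT = (
    "You are an educational assistant that can give explanations, "
    "transcriptions and translations in Ugandan languages."
)

# User-side instruction per mode, copied from the mode table.
# "answer" sends no text: the audio carries the question.
MODE_INSTRUCTIONS = {
    "answer": "",
    "transcribe": "Transcribe this audio.",
    "translate": "Translate this audio into {target_language}.",
    "explain": "Explain what was said in this audio.",
}


def build_messages(audio_path, mode="answer", target_language=None):
    """Build a chat-template message list for one audio clip (hypothetical helper)."""
    if mode not in MODE_INSTRUCTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "translate" and not target_language:
        raise ValueError("translate mode needs a target_language")

    instruction = MODE_INSTRUCTIONS[mode].format(target_language=target_language)
    user_content = [{"type": "audio", "audio": audio_path}]
    if instruction:
        # Assumption: the text instruction follows the audio entry.
        user_content.append({"type": "text", "text": instruction})

    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
```

For example, `build_messages("question.wav", mode="translate", target_language="Luganda")` yields a user turn with the audio entry followed by the text "Translate this audio into Luganda.", and the result can be passed to `processor.apply_chat_template`.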

Training

Fine-tuned from google/gemma-4-e2b. The pipeline, documented in full in the training repository, has three stages:

1. Continued pretraining on around 600M characters of Ugandan-language text from web articles, books, translations, and synthetic instruction-following examples.
2. Multilingual SFT on transcription, using speech data from SALT, Google's Waxal, FLEURS, and the Makerere speech benchmark.
3. A short final fine-tune on speech QA. The QA data was built from the Ugandan primary school curriculum, machine-translated into Ugandan languages, and converted to speech with an Orpheus 3B-based TTS model the team had trained previously.

The audio tower was preserved throughout. The decoder adapts to Ugandan-language QA while keeping Gemma 4's native speech understanding intact.

For exact configs, training scripts, per-language eval numbers, and the writeup that frames the work, see the training repo.

Intended use

Primary school science Q&A in Ugandan classrooms. The demo curriculum covers six Primary 5 to Primary 7 topics: photosynthesis, the water cycle, the life cycle of an insect, malaria prevention, digestion, and the solar system. Beyond Q&A, the model handles speech transcription, translation between supported languages from spoken input, and short spoken explanations. It is also useful as a research artifact for adapting multimodal foundation models to low-resource languages.

Out of scope

High-stakes domains, including medical, legal, and financial advice. Languages outside the seven listed. Image input. Long-form generation past a few hundred tokens, which drifts from the single-turn QA distribution the fine-tune was optimised for.

Limitations

Only Luganda reaches the strongest tier of QA quality. Acholi is classroom-usable for shorter responses. The other four Ugandan languages are present in the model but full question-answering degrades outside Luganda and Acholi; transcription and short translation remain reliable. Background-noise robustness has not been formally benchmarked in classroom environments. Audio inference assumes 16 kHz mono PCM input.
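An application could encode these limitations as a simple lookup that decides which modes to offer per language. The sketch below is one possible reading of this card, not part of the released code: the function name and mode names are hypothetical, and grouping "explain" with full question-answering is an assumption.

```python
# Tiers as described in this card (ISO 639-3 codes).
QA_CAPABLE = {"eng", "lug", "ach"}
TRANSCRIBE_TRANSLATE_ONLY = {"nyn", "xog", "nyo", "teo"}


def allowed_modes(language_code):
    """Return the modes this card describes as reliable for a language."""
    code = language_code.lower()
    if code in QA_CAPABLE:
        return ["answer", "explain", "transcribe", "translate"]
    if code in TRANSCRIBE_TRANSLATE_ONLY:
        # QA degrades here; transcription and short translation stay reliable.
        return ["transcribe", "translate"]
    return []  # outside the seven listed languages: out of scope
```

An empty result signals a language the card lists as out of scope, so a caller can decline the request rather than return a low-quality answer.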

Related artifacts

SunbirdAI/sunbird-tutor-modelling: training code, data pipeline, evaluation harness.
SunbirdAI/sunflower-app: Android app that runs this model on-device.
ak3ra/sunflower-qa-cactus-int4: Cactus INT4 quantization, ~3.8 GB.

Acknowledgements

Built by the Sunbird AI team. Foundation model: Google's Gemma 4 E2B. Inference engine: Cactus.

Citation

@misc{sunbird-tutor-gemma-4-e2b-2026,
  author = {Sunbird AI},
  title  = {Sunbird Tutor: Gemma 4 E2B for spoken question-answering in Ugandan languages},
  year   = {2026},
  url    = {https://huggingface.co/Sunbird/sunbirdtutor-gemma-4-e2b}
}

Submitted to the Kaggle Gemma 4 Good Hackathon, 2026.

Safetensors · Model size: 5B params · Tensor type: BF16
