Helena llama.cpp / Ollama patches — Gemma 4 multimodal (vision + audio)

This repository contains patches that enable Gemma 4 (E4B) multimodal inference (vision + audio) in a fork of Ollama (Go runner) with vendored llama.cpp, validated end-to-end on a single RTX 5070 Laptop GPU (8 GB VRAM) running on Linux/WSL.

The patches were developed while integrating Gemma 4 E4B Q4_K_M into the Helena cognitive AI project (a local-first modular cognitive system). At the time of writing (2026-05-01), upstream Ollama did not yet support Gemma 4 multimodality.

⚠️ These patches are not official Ollama or llama.cpp contributions — they are local fixes that enabled a working pipeline. They may be partially redundant with upstream work that has landed since. Verify against the latest upstream before applying.

What these patches fix

1. Vision (3 files in `llama.cpp/tools/mtmd/`)

The clip.cpp / clip-impl.h / clip.h patches add PROJECTOR_TYPE_GEMMA4V as a structural alias of PROJECTOR_TYPE_GEMMA3 (same tensors, same forward pass), allowing the mtmd (multimodal) tooling to build and load a Gemma 4 vision projector that uses gemma4v as its clip.vision.projector_type value.

Without this, the build refuses to recognize the projector type and the model cannot be loaded with vision capability.

2. Audio (6 fixes in the Ollama Go runner, `model/models/gemma4/model_audio.go`)

The Go runner's gemma4 audio path had several mismatches that caused crashes during inference:

| # | Fix | Reason |
|---|---|---|
| 1 | `pre_encode.out` moved from BEFORE the conformer to AFTER | The macro name `TN_PRE_ENCODE_OUT` is misleading: for `PROJECTOR_TYPE_GEMMA4A` the GGUF tensor `a.pre_encode.out.*` is the encoder's output projection (1024 → 1536), applied after the conformer stack and before the multimodal embedder. This matches the upstream `gemma4a.cpp` graph. |
| 2 | GGUF tag `AttnPreNorm`: `ln1` → `attn_pre_norm` | The real tensor name in the GGUF is `a.blk.N.attn_pre_norm.weight`. |
| 3 | GGUF tag `AttnPostNorm`: `ln2` → `attn_post_norm` | The real tensor name is `a.blk.N.attn_post_norm.weight`. |
| 4 | GGUF tag `Norm` (block-end final): `layer_pre_norm` → `ln2` | The single block-level final RMSNorm in the gemma4 conformer is named `ln2` in the GGUF. |
| 5 | GGUF tag `LinearPos`: `linear_pos.weight` → `attn_k_rel.weight` | The relative position projection is named `attn_k_rel.weight` in the GGUF. |
| 6 | Nil-guard on `FC` in `AudioMultimodalProjector.Forward` | The legacy `mm.a.fc.*` tensors do not exist in the gemma4 mmproj GGUF; only `mm.a.input_projection.weight` is present. `Forward` must skip the FC step when `FC` is nil. |
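
The tag changes in rows 2 to 5 can be cross-checked against a GGUF tensor listing. The sketch below is illustrative only: `FIXED_TAGS`, `expected_tensor_names`, and `missing_tensors` are hypothetical helpers, not part of Ollama or llama.cpp. It assumes every fixed tag follows the `a.blk.N.<tag>.weight` pattern stated for rows 2 and 3. For rows 4 and 5 that full name is an extrapolation.

```python
# Go struct field -> (old GGUF tag, corrected GGUF tag), taken from rows 2-5.
# Keyed by field name because `ln2` is an old tag for one field and the
# corrected tag for another.
FIXED_TAGS = {
    "AttnPreNorm": ("ln1", "attn_pre_norm"),
    "AttnPostNorm": ("ln2", "attn_post_norm"),
    "Norm": ("layer_pre_norm", "ln2"),
    "LinearPos": ("linear_pos", "attn_k_rel"),
}


def expected_tensor_names(block_count):
    """Build the per-block tensor names the corrected tags should resolve to.

    Assumption: all four tensors use the `a.blk.N.<tag>.weight` pattern.
    """
    names = []
    for n in range(block_count):
        for _old_tag, new_tag in FIXED_TAGS.values():
            names.append(f"a.blk.{n}.{new_tag}.weight")
    return names


def missing_tensors(present_names, block_count):
    """Return the expected names that are absent from a tensor listing."""
    present = set(present_names)
    return [n for n in expected_tensor_names(block_count) if n not in present]
```

Calling `missing_tensors(listing, 12)` with the tensor names read from a merged GGUF should return an empty list if the names match this pattern. Any leftover entries point at a tag that still needs attention.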

After these fixes, audio is decoded and encoded end-to-end through the 12 conformer blocks, the post-encoder projection runs, and audio tokens are embedded into the text vocabulary space (e.g., a 5-second WAV produces ~125 audio tokens at the model's text embedding dim 2560).
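
The token count quoted above can be reproduced with simple arithmetic. This is a rough sketch under stated assumptions: 16 kHz input (as in the validation clip), a 10 ms mel hop (160 samples, inferred from the numbers rather than stated anywhere in this README), and 4× time subsampling in the convolutional front end. `audio_token_count` is a hypothetical helper.

```python
import math

SAMPLE_RATE_HZ = 16_000  # matches the 16 kHz mono validation clip
HOP_SAMPLES = 160        # ASSUMPTION: 10 ms hop between mel frames
SUBSAMPLING = 4          # two stride-2 convolutions: T -> T/4


def audio_token_count(duration_s):
    """Estimate the number of audio tokens for a clip of the given length."""
    mel_frames = int(duration_s * SAMPLE_RATE_HZ) // HOP_SAMPLES
    # Rounding up is an assumption; the real padding rule is not stated.
    return math.ceil(mel_frames / SUBSAMPLING)
```

Under these assumptions a 5-second clip yields 500 mel frames and 125 tokens, which agrees with the figure above.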

Architecture context

The architecture for Gemma 4 audio (verified against the Hugging Face Gemma3nAudioModel reference and the llama.cpp gemma4a.cpp graph builder):

Mel features [128 mel bins, T frames]
  ↓ SSCP (2× Conv2D stride 2)        → [1024, T/4]
  ↓ Conformer stack (12 blocks)      → [1024, T/4]
  ↓ pre_encode.out projection        → [1536, T/4]   ← post-conformer, name is misleading
  ↓ RMSNorm (no learned weight)
  ↓ mm.a.input_projection            → [2560, T/4]   ← text embedding dim
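
The diagram above can be restated as a small shape trace, which makes the stage order (conformer first, `pre_encode.out` second) easy to check. This is a sketch, not the runner's code: `trace_shapes` is a hypothetical helper, and it assumes each stride-2 convolution rounds the time axis up to ceil(T/2), since the kernel size and padding are not given here.

```python
def trace_shapes(mel_frames):
    """Return (stage, (channels, time)) pairs in the order of the diagram."""
    t = mel_frames
    stages = [("mel", (128, t))]
    for _ in range(2):  # SSCP: two stride-2 convolutions
        t = (t + 1) // 2  # ASSUMPTION: padding gives ceil(t / 2)
    stages.append(("sscp", (1024, t)))
    stages.append(("conformer_x12", (1024, t)))
    # Output projection applied AFTER the conformer, despite its name.
    stages.append(("pre_encode.out", (1536, t)))
    stages.append(("rms_norm", (1536, t)))  # no learned weight
    stages.append(("mm.a.input_projection", (2560, t)))  # text embedding dim
    return stages


for stage, shape in trace_shapes(500):
    print(f"{stage:24s} {shape}")
```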

How to apply

Option A — Apply the patch file

cd /path/to/your/ollama-fork
git apply /path/to/0001-feat-helena-engine-Gemma-4-multimodal-vision-audio-o.patch

The patch assumes a layout where llama.cpp is vendored under modules/helena-engine/llama/llama.cpp/ and the Go runner is under modules/helena-engine/model/models/gemma4/. Adapt the paths to your fork's layout.

Option B — Copy the patched files

The files/ directory contains the patched source files in their original subpath structure:

files/llama.cpp-tools-mtmd/{clip.h,clip-impl.h,clip.cpp} → copy to your llama.cpp/tools/mtmd/
files/ollama-go-runner-gemma4/model_audio.go → copy to your ollama-fork/model/models/gemma4/

Then rebuild your Ollama fork:

cd /path/to/your/ollama-fork
go build -o ollama .

Required: merged GGUF (text + mmproj)

The Ollama Go runner does not load split vision models (text GGUF + separate mmproj GGUF). You must merge the Gemma 4 text and mmproj GGUFs into a single file with proper KV alias keys (gemma4.vision.*, gemma4.audio.* instead of the original clip.vision.*, clip.audio.*). A reference Python merge script using gguf-py is described in the Helena project memory.

The reason for this: helenad/Ollama's KeyValue("vision.block_count") automatically prepends the model's general.architecture (e.g., gemma4) when looking up keys not starting with general. or tokenizer.. So clip.vision.block_count from the original mmproj is invisible — it must be aliased as gemma4.vision.block_count.
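
The lookup rule and the aliasing it requires can be modelled in a few lines. This sketch only mirrors the behaviour described above. It is not the Go implementation and not the reference merge script; `resolve_key` and `alias_clip_keys` are hypothetical names, and the metadata is a plain dict rather than a real GGUF.

```python
def resolve_key(architecture, key):
    """Mimic the described lookup: prefix the architecture unless the key
    already starts with `general.` or `tokenizer.`."""
    if key.startswith(("general.", "tokenizer.")):
        return key
    return f"{architecture}.{key}"


def alias_clip_keys(metadata, architecture):
    """Copy `clip.vision.*` / `clip.audio.*` entries under the architecture
    prefix. Keeping the originals is a choice made for this sketch."""
    merged = dict(metadata)
    for key, value in metadata.items():
        for modality in ("vision", "audio"):
            prefix = f"clip.{modality}."
            if key.startswith(prefix):
                suffix = key[len(prefix):]
                merged[f"{architecture}.{modality}.{suffix}"] = value
    return merged


# Placeholder values, not read from a real GGUF.
mmproj_meta = {"general.architecture": "gemma4", "clip.vision.block_count": 1}
wanted = resolve_key("gemma4", "vision.block_count")
print(wanted in mmproj_meta)                             # False: key is invisible
print(wanted in alias_clip_keys(mmproj_meta, "gemma4"))  # True after aliasing
```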

Validation

Vision: confirmed on test images (red square, yellow circle on blue background) — model correctly identifies shape and colors.
Audio: confirmed on birth_cry.wav (5s, 16 kHz mono PCM) — full pipeline traversed (decode → mel → SSCP → 12 conformer blocks → post-projection → multimodal embedding → LLM generation, ~750 output tokens).
Limitations:
- Cold inference is slow on 8 GB VRAM (~50 s for 30 s audio).
- For audio longer than 60 seconds, the compute graph capacity in helenad needs to be increased (maxGraphNodes heuristic at ml/backend/ggml/ggml.go:381). This is a separate fix and is not included in these patches.
- Speech transcription quality is limited at Q4_K_M quantization. Gemma 4 audio is better suited to scene classification (voice/music/noise) than to precise transcription. For accurate transcription, pair it with Whisper.

Provenance

Developed during the Helena project (a local-first, French-speaking modular cognitive AI system). Diagnosis and patches were written by Claude Opus 4.7 (1M context), assisting the developer Antharyus. Date: 2026-05-01 / 2026-05-02.

License

MIT — same as upstream llama.cpp and Ollama.
