Helena llama.cpp / Ollama patches: Gemma 4 multimodal (vision + audio)

This repository contains patches that enable Gemma 4 (E4B) multimodal inference (vision + audio) in a fork of Ollama (Go runner) with vendored llama.cpp, validated end-to-end on a single RTX 5070 Laptop GPU (8 GB VRAM) running on Linux/WSL.

The patches were developed during the integration of Gemma 4 E4B Q4_K_M into the Helena cognitive AI project (a local-first modular cognitive system), because upstream Ollama did not yet support Gemma 4 multimodality at the time of writing (2026-05-01).

⚠️ These patches are not official Ollama or llama.cpp contributions; they are local fixes that enabled a working pipeline. They may be partially redundant with upstream work that has landed since. Verify against the latest upstream before applying.

What these patches fix

1. Vision (3 files in llama.cpp/tools/mtmd/)

The clip.cpp / clip-impl.h / clip.h patches add PROJECTOR_TYPE_GEMMA4V as a structural alias of PROJECTOR_TYPE_GEMMA3 (same tensors, same forward pass), allowing the mtmd (multimodal) tooling to build and load a Gemma 4 vision projector that uses gemma4v as its clip.vision.projector_type value.

Without this, the build refuses to recognize the projector type and the model cannot be loaded with vision capability.

2. Audio (6 fixes in the Ollama Go runner, model/models/gemma4/model_audio.go)

The Go runner's gemma4 audio path had several mismatches that caused crashes during inference:

| # | Fix | Reason |
|---|-----|--------|
| 1 | pre_encode.out moved from BEFORE the conformer to AFTER | The macro name TN_PRE_ENCODE_OUT is misleading: for PROJECTOR_TYPE_GEMMA4A the GGUF tensor a.pre_encode.out.* is the encoder's output projection (1024 → 1536), applied after the conformer stack and before the multimodal embedder. This matches the upstream gemma4a.cpp graph. |
| 2 | GGUF tag AttnPreNorm: ln1 → attn_pre_norm | Real tensor name in the GGUF is a.blk.N.attn_pre_norm.weight. |
| 3 | GGUF tag AttnPostNorm: ln2 → attn_post_norm | Real tensor name is a.blk.N.attn_post_norm.weight. |
| 4 | GGUF tag Norm (block-end final): layer_pre_norm → ln2 | The single block-level final RMSNorm in the gemma4 conformer is named ln2 in the GGUF. |
| 5 | GGUF tag LinearPos: linear_pos.weight → attn_k_rel.weight | The relative position projection is named attn_k_rel.weight in the GGUF. |
| 6 | Nil-guard on FC in AudioMultimodalProjector.Forward | The legacy mm.a.fc.* tensors do not exist in the gemma4 mmproj GGUF, which contains only mm.a.input_projection.weight. Forward must skip the FC step when FC is nil. |
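
The renamed tags above can be turned into a quick sanity check on a GGUF's tensor list before loading it. The following is a minimal sketch, not part of the patches: the helper names are hypothetical, the input is assumed to be a plain list of tensor name strings obtained by whatever GGUF reader you use, and the a.blk.N. prefix on ln2 and attn_k_rel is assumed by analogy with rows 2 and 3.

```python
# Sketch only: tensor name suffixes taken from the fix table.
# Assumption: ln2 and attn_k_rel live under the same a.blk.N. prefix
# as attn_pre_norm and attn_post_norm.
PER_BLOCK_SUFFIXES = [
    "attn_pre_norm.weight",
    "attn_post_norm.weight",
    "ln2.weight",
    "attn_k_rel.weight",
]
REQUIRED_GLOBAL = ["mm.a.input_projection.weight"]


def missing_audio_tensors(tensor_names, n_blocks=12):
    """Return the expected audio tensor names that are absent from tensor_names."""
    present = set(tensor_names)
    expected = [
        f"a.blk.{i}.{suffix}"
        for i in range(n_blocks)
        for suffix in PER_BLOCK_SUFFIXES
    ]
    expected += REQUIRED_GLOBAL
    return [name for name in expected if name not in present]


def has_legacy_fc(tensor_names):
    """True if any legacy mm.a.fc.* tensor is present.

    When this is False, the projector has no FC weights to apply,
    which is the case the nil-guard in row 6 handles.
    """
    return any(name.startswith("mm.a.fc.") for name in tensor_names)
```

An empty result from missing_audio_tensors suggests the renamed tags will resolve; a non-empty one lists the names to look for in the file.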

After these fixes, audio is decoded and encoded end-to-end through the 12 conformer blocks, the post-encoder projection runs, and audio tokens are embedded into the text vocabulary space (e.g., a 5-second WAV produces ~125 audio tokens at the model's text embedding dim 2560).

Architecture context

The architecture for Gemma 4 audio (verified against HuggingFace Gemma3nAudioModel reference and llama.cpp gemma4a.cpp graph builder):

Mel features [128 mel bins, T frames]
  ↓ SSCP (2× Conv2D stride 2)        → [1024, T/4]
  ↓ Conformer stack (12 blocks)      → [1024, T/4]
  ↓ pre_encode.out projection        → [1536, T/4]   ← post-conformer, name is misleading
  ↓ RMSNorm (no learned weight)
  ↓ mm.a.input_projection            → [2560, T/4]   ← text embedding dim
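
The shapes in the diagram can be traced numerically. The sketch below is illustrative only: the function name is hypothetical, and the default of 100 mel frames per second is not stated in the patches. It is inferred from the figures above (about 125 audio tokens for a 5-second clip, combined with the T/4 reduction).

```python
def audio_shape_trace(duration_s, frames_per_second=100):
    """Return (stage, feature_dim, time_steps) tuples for the audio path.

    frames_per_second=100 is an inferred assumption, not a value read
    from the model or the patches.
    """
    t = int(duration_s * frames_per_second)  # number of mel frames, T
    t4 = t // 4                              # two stride-2 convolutions give T/4
    return [
        ("mel", 128, t),
        ("sscp", 1024, t4),
        ("conformer", 1024, t4),
        ("pre_encode.out", 1536, t4),        # applied after the conformer stack
        ("rmsnorm", 1536, t4),               # no learned weight, shape unchanged
        ("mm.a.input_projection", 2560, t4), # text embedding dim
    ]


if __name__ == "__main__":
    for stage, dim, steps in audio_shape_trace(5):
        print(f"{stage:24s} [{dim}, {steps}]")
```

Under that assumption, a 5-second clip ends with 125 time steps at dimension 2560, which agrees with the token count reported above.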

How to apply

Option A: Apply the patch file

cd /path/to/your/ollama-fork
git apply /path/to/0001-feat-helena-engine-Gemma-4-multimodal-vision-audio-o.patch

The patch assumes a layout where llama.cpp is vendored under modules/helena-engine/llama/llama.cpp/ and the Go runner is under modules/helena-engine/model/models/gemma4/. Adapt the paths to your fork's layout.

Option B: Copy the patched files

The files/ directory contains the patched source files in their original subpath structure:

  • files/llama.cpp-tools-mtmd/{clip.h,clip-impl.h,clip.cpp} → copy to your llama.cpp/tools/mtmd/
  • files/ollama-go-runner-gemma4/model_audio.go → copy to your ollama-fork/model/models/gemma4/

Then rebuild your Ollama fork:

cd /path/to/your/ollama-fork
go build -o ollama .

Required: merged GGUF (text + mmproj)

The Ollama Go runner does not load split vision models (text GGUF + separate mmproj GGUF). You must merge the Gemma 4 text and mmproj GGUFs into a single file with proper KV alias keys (gemma4.vision.*, gemma4.audio.* instead of the original clip.vision.*, clip.audio.*). A reference Python merge script using gguf-py is described in the Helena project memory.

The reason for this: helenad/Ollama's KeyValue("vision.block_count") automatically prepends the model's general.architecture (e.g., gemma4) when looking up keys not starting with general. or tokenizer.. So clip.vision.block_count from the original mmproj is invisible β€” it must be aliased as gemma4.vision.block_count.
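
The lookup rule and the resulting alias requirement can be illustrated with a few lines of Python. This is a minimal sketch of the rule as described above, not the runner's actual code or the reference merge script: both function names are hypothetical, and it operates on key strings only, without reading or writing any GGUF file.

```python
# Keys with these prefixes are looked up as-is, per the rule described above.
UNPREFIXED = ("general.", "tokenizer.")


def resolve_key(arch, key):
    """Mimic the described lookup: prepend the architecture unless the key is global."""
    if key.startswith(UNPREFIXED):
        return key
    return f"{arch}.{key}"


def alias_mmproj_key(key, arch="gemma4"):
    """Rewrite clip.vision.* / clip.audio.* to <arch>.vision.* / <arch>.audio.*."""
    for modality in ("vision", "audio"):
        prefix = f"clip.{modality}."
        if key.startswith(prefix):
            return f"{arch}.{modality}.{key[len(prefix):]}"
    return key  # any other key is left untouched in this sketch


if __name__ == "__main__":
    wanted = resolve_key("gemma4", "vision.block_count")
    stored = alias_mmproj_key("clip.vision.block_count")
    print(wanted, stored, wanted == stored)
```

The point of the sketch is that the key the runner asks for and the aliased key written by the merge step must be the same string; the original clip.* key never matches.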

Validation

  • Vision: confirmed on test images (red square, yellow circle on blue background); the model correctly identifies shape and colors.
  • Audio: confirmed on birth_cry.wav (5 s, 16 kHz mono PCM); the full pipeline was traversed (decode → mel → SSCP → 12 conformer blocks → post-projection → multimodal embedding → LLM generation, ~750 output tokens).
  • Limitations:
    • Cold inference is slow on 8 GB VRAM (~50 s for 30 s audio).
    • For audio longer than 60 seconds, the compute graph capacity in helenad needs to be increased (maxGraphNodes heuristic at ml/backend/ggml/ggml.go:381); this requires a separate fix.
    • Speech transcription quality is limited at Q4_K_M quantization; Gemma 4 audio is better suited to scene classification (voice/music/noise) than to precise transcription. For accurate transcription, pair with Whisper.

Provenance

Developed during the Helena project (a local-first modular cognitive AI system, French-speaking). Diagnosis and patches written by Claude Opus 4.7 (1M context) assisting the developer Antharyus. Date: 2026-05-01 / 2026-05-02.

License

MIT, same as upstream llama.cpp and Ollama.
