Gemma4 LLM

Run Google's Gemma 4 locally with vision. A fully offline AI chat app powered by Gemma 4 E2B — no cloud, no API keys, no accounts.

Downloads

Platform	File	Size
Windows	Gemma4-LLM.zip	5.05 GB
Android	gemma4-llm.apk	4.1 GB

Both bundles include the model, inference engine, and chat UI. Nothing else to install.

About

Gemma4 LLM bundles everything needed to run Google's Gemma 4 E2B model on your device — the model weights, inference engine, and chat UI are all included.

Gemma 4 E2B is natively multimodal — it understands both text and images out of the box. This isn't a bolt-on vision module; image understanding is built into the model architecture.

Natively multimodal: Text and image understanding built into the model
Fully offline: No internet connection required after install
GPU accelerated: Uses CUDA on Windows, CPU on Android (arm64)
Stock llama.cpp: Built on official ggml-org/llama.cpp release b8683

Model Details


Base model	google/gemma-4-e2b-it
Architecture	Mixture-of-Experts (5.1B total, 2.3B active)
Quantization	Q4_K_M (GGUF) via unsloth
Model file	gemma-4-E2B-it-Q4_K_M.gguf (2.96 GB)
Image encoder	mmproj-F16.gguf (940 MB) — part of Gemma 4's native vision architecture
License	Apache 2.0

Windows

Download and extract Gemma4-LLM.zip
Double-click Gemma4-LLM.exe — a native window opens with the chat UI
The model loads automatically on startup

The zip contains the EXE and a models/ folder side by side. Bundles CUDA runtime for GPU acceleration — falls back to CPU if no GPU is available.

Android

Download gemma4-llm.apk
Enable "Install from unknown sources" in Settings
Install and open — first launch extracts the model (~3 min)

Requirements:

Android 9+ with arm64 (64-bit) processor
8+ GB RAM recommended
Runs as a foreground service with status notification
Model is bundled inside the APK (split into chunks, reassembled on first launch)

Technical Details

Inference: llama.cpp server running locally on port 8080
Context length: 4096 tokens (mobile), 8192 tokens (desktop)
KV cache: q8_0 quantized for reduced memory usage
Android: 4 threads on big cores, WebView UI
Windows: PyWebView (Edge/Chromium), Flask backend

Credits

Google for the Gemma 4 model family
ggml-org/llama.cpp for the inference engine
unsloth for GGUF quantizations

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support