Gemma4 LLM
Run Google's Gemma 4 locally with vision. A fully offline AI chat app powered by Gemma 4 E2B β no cloud, no API keys, no accounts.
Downloads
| Platform | File | Size |
|---|---|---|
| Windows | Gemma4-LLM.zip | 5.05 GB |
| Android | gemma4-llm.apk | 4.1 GB |
Both bundles include the model, inference engine, and chat UI. Nothing else to install.
About
Gemma4 LLM bundles everything needed to run Google's Gemma 4 E2B model on your device β the model weights, inference engine, and chat UI are all included.
Gemma 4 E2B is natively multimodal β it understands both text and images out of the box. This isn't a bolt-on vision module; image understanding is built into the model architecture.
- Natively multimodal: Text and image understanding built into the model
- Fully offline: No internet connection required after install
- GPU accelerated: Uses CUDA on Windows, CPU on Android (arm64)
- Stock llama.cpp: Built on official ggml-org/llama.cpp release b8683
Model Details
| Base model | google/gemma-4-e2b-it |
| Architecture | Mixture-of-Experts (5.1B total, 2.3B active) |
| Quantization | Q4_K_M (GGUF) via unsloth |
| Model file | gemma-4-E2B-it-Q4_K_M.gguf (2.96 GB) |
| Image encoder | mmproj-F16.gguf (940 MB) β part of Gemma 4's native vision architecture |
| License | Apache 2.0 |
Windows
- Download and extract
Gemma4-LLM.zip - Double-click
Gemma4-LLM.exeβ a native window opens with the chat UI - The model loads automatically on startup
The zip contains the EXE and a models/ folder side by side. Bundles CUDA runtime for GPU acceleration β falls back to CPU if no GPU is available.
Android
- Download
gemma4-llm.apk - Enable "Install from unknown sources" in Settings
- Install and open β first launch extracts the model (~3 min)
Requirements:
- Android 9+ with arm64 (64-bit) processor
- 8+ GB RAM recommended
- Runs as a foreground service with status notification
- Model is bundled inside the APK (split into chunks, reassembled on first launch)
Technical Details
- Inference: llama.cpp server running locally on port 8080
- Context length: 4096 tokens (mobile), 8192 tokens (desktop)
- KV cache: q8_0 quantized for reduced memory usage
- Android: 4 threads on big cores, WebView UI
- Windows: PyWebView (Edge/Chromium), Flask backend
Credits
- Google for the Gemma 4 model family
- ggml-org/llama.cpp for the inference engine
- unsloth for GGUF quantizations
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support