Gemma4 LLM

Run Google's Gemma 4 locally with vision. A fully offline AI chat app powered by Gemma 4 E2B: no cloud, no API keys, no accounts.

Downloads

Platform   File             Size
Windows    Gemma4-LLM.zip   5.05 GB
Android    gemma4-llm.apk   4.1 GB

Both bundles include the model, inference engine, and chat UI. Nothing else to install.

About

Gemma4 LLM bundles everything needed to run Google's Gemma 4 E2B model on your device: the model weights, inference engine, and chat UI are all included.

Gemma 4 E2B is natively multimodal: it understands both text and images out of the box. This isn't a bolt-on vision module; image understanding is built into the model architecture.

  • Natively multimodal: Text and image understanding built into the model
  • Fully offline: No internet connection required after install
  • GPU accelerated on Windows: Uses CUDA on Windows; the Android build runs on the CPU (arm64)
  • Stock llama.cpp: Built on official ggml-org/llama.cpp release b8683

Model Details

Base model: google/gemma-4-e2b-it
Architecture: Mixture-of-Experts (5.1B total, 2.3B active)
Quantization: Q4_K_M (GGUF) via unsloth
Model file: gemma-4-E2B-it-Q4_K_M.gguf (2.96 GB)
Image encoder: mmproj-F16.gguf (940 MB), part of Gemma 4's native vision architecture
License: Apache 2.0
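
Both model files listed above are GGUF files, so a quick sanity check after download or extraction is to read the fixed-size GGUF header. The sketch below is not part of the app; it assumes the common little-endian layout of GGUF version 2 or later (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts), and the function name is made up for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes a little-endian GGUF v2+ header; raises ValueError otherwise.
    """
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ" = little-endian uint32 followed by two uint64 values (20 bytes).
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("models/gemma-4-E2B-it-Q4_K_M.gguf")` on an intact file should return a small version number and non-zero counts; a truncated or mislabeled download fails the magic check right away.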

Windows

  1. Download and extract Gemma4-LLM.zip
  2. Double-click Gemma4-LLM.exe; a native window opens with the chat UI
  3. The model loads automatically on startup

The zip contains the EXE and a models/ folder side by side. The bundle includes the CUDA runtime for GPU acceleration and falls back to the CPU if no GPU is available.
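
To make the layout concrete, here is a rough sketch of how a launcher could assemble a llama.cpp server command from a folder that holds the EXE and models/ side by side. It does not describe how Gemma4-LLM.exe actually starts its backend. The server binary name `llama-server.exe`, its location, the `use_gpu` switch, and the value `99` for "offload all layers" are assumptions; the flags themselves are standard llama.cpp server options, and the port, context size, and cache type are the desktop values listed under Technical Details.

```python
from pathlib import Path


def build_server_command(app_dir, server_exe="llama-server.exe", use_gpu=True):
    """Build a llama-server argument list for a bundle laid out as
    <app_dir>/<server_exe> and <app_dir>/models/*.gguf (hypothetical layout).
    """
    app_dir = Path(app_dir)
    models = app_dir / "models"
    return [
        str(app_dir / server_exe),
        "--model", str(models / "gemma-4-E2B-it-Q4_K_M.gguf"),
        "--mmproj", str(models / "mmproj-F16.gguf"),  # image encoder
        "--host", "127.0.0.1",
        "--port", "8080",
        "--ctx-size", "8192",        # desktop context length
        "--cache-type-k", "q8_0",    # quantized KV cache
        "--cache-type-v", "q8_0",
        # Offload layers to the GPU when one is usable; 0 keeps everything on CPU.
        "--n-gpu-layers", "99" if use_gpu else "0",
    ]
```

The returned list can be passed to `subprocess.Popen`. A launcher that wants the CPU fallback described above could first try with `use_gpu=True` and retry with `use_gpu=False` if the server fails to start.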

Android

  1. Download gemma4-llm.apk
  2. Enable "Install from unknown sources" in Settings
  3. Install and open; the first launch extracts the model (~3 min)

Requirements:

  • Android 9+ with arm64 (64-bit) processor
  • 8+ GB RAM recommended
  • Runs as a foreground service with status notification
  • Model is bundled inside the APK (split into chunks, reassembled on first launch)
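
The chunk reassembly mentioned in the last bullet can be sketched as a plain file concatenation. The Android app presumably does this in Java or Kotlin; the Python version below only illustrates the idea, and the chunk naming pattern (`model.gguf.part000`, `model.gguf.part001`, ...) is hypothetical.

```python
import shutil
from pathlib import Path


def reassemble_chunks(chunk_dir, output_path, pattern="model.gguf.part*"):
    """Concatenate chunk files into one model file; return the chunk count.

    Chunks are joined in sorted filename order, so the (assumed) names must
    use zero-padded indices for the order to be correct.
    """
    chunks = sorted(Path(chunk_dir).glob(pattern))
    if not chunks:
        raise FileNotFoundError(f"no chunks matching {pattern} in {chunk_dir}")
    tmp_path = Path(str(output_path) + ".tmp")
    with open(tmp_path, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as src:
                shutil.copyfileobj(src, out)  # streams without loading a whole chunk
    # Rename only after every chunk was written, so an interrupted run
    # never leaves a partial file under the final name.
    tmp_path.replace(output_path)
    return len(chunks)
```

Writing to a temporary name and renaming at the end is a design choice for this sketch: if the first launch is interrupted, the next launch can simply redo the extraction.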

Technical Details

  • Inference: llama.cpp server running locally on port 8080
  • Context length: 4096 tokens (mobile), 8192 tokens (desktop)
  • KV cache: q8_0 quantized for reduced memory usage
  • Android: 4 threads on big cores, WebView UI
  • Windows: PyWebView (Edge/Chromium), Flask backend

Credits
