Gemma 4 E2B-it · PMRA mixed-precision GGUF
A ~3.1 GB GGUF of Google DeepMind's Gemma 4 E2B-it at the Q3_K_S size budget. On Wikitext-2 validation it scores about 5.1 nats lower NLL than plain Q3_K_S at the same size, and slightly lower than the larger Q4_K_M (12.88 vs 13.55) while staying smaller. It is a standard GGUF for text generation in llama.cpp or Ollama.
The model
Gemma 4 E2B-it is the smallest, instruction-tuned member of Google DeepMind's Gemma 4 family. Gemma 4 models are multimodal (text + image, with audio on the small models), carry a context window of up to 256K tokens, and support 140+ languages; the family spans dense and Mixture-of-Experts designs in four sizes (E2B, E4B, 26B-A4B, 31B). The "E2B" variant is the phone-and-laptop-class entry point — small enough to run locally, which is exactly the regime where a better same-size quant matters most.
Scope of this artifact: this GGUF targets the text stack for text generation in llama.cpp; image/audio input is not exercised here. Calibrated and measured on English.
Why this build (PMRA)
A normal GGUF quant uses one format for nearly every tensor, paying the same bit-rate everywhere. Production Mixed-Rate Allocation (PMRA) measures each tensor group's contribution to quality and spends bits where they buy the most: from a low-bit Q2_K floor it promotes the groups that matter to stronger formats (Q3_K_M, Q3_K_L, IQ4_XS, Q4_K_M) under a fixed byte budget — producing one standard GGUF at the Q3_K_S size that is far more faithful to the original weights.
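To make the budgeted-promotion idea concrete, here is a minimal sketch of a multiple-choice knapsack over tensor groups. It is an illustration only, not the PMRA selector (see the linked repository for that): the `allocate` function, the option tuples, and every cost and gain number are hypothetical, and costs are small integers in abstract units rather than real byte counts.

```python
from typing import List, Optional, Tuple

# One promotion option: (format name, extra cost over the floor, estimated quality gain).
Option = Tuple[str, int, float]


def allocate(groups: List[List[Option]], budget: int) -> Tuple[float, List[Optional[str]]]:
    """Pick at most one promotion per group, maximizing total gain within budget.

    Returns (total gain, chosen format per group); None means the group
    stays at the low-bit floor format.
    """
    # best[b] = (gain, choices) achievable with at most b cost units.
    best: List[Tuple[float, List[Optional[str]]]] = [(0.0, []) for _ in range(budget + 1)]
    for options in groups:
        nxt = []
        for b in range(budget + 1):
            # Default: leave this group at the floor (no cost, no gain).
            cand = (best[b][0], best[b][1] + [None])
            for fmt, cost, gain in options:
                if cost <= b and best[b - cost][0] + gain > cand[0]:
                    cand = (best[b - cost][0] + gain, best[b - cost][1] + [fmt])
            nxt.append(cand)
        best = nxt
    return best[budget]


# Hypothetical toy data: two groups, two promotion options each.
toy_groups = [
    [("Q3_K_M", 2, 3.0), ("Q4_K_M", 4, 4.0)],
    [("Q3_K_M", 2, 1.0), ("Q4_K_M", 3, 2.5)],
]
total_gain, choices = allocate(toy_groups, budget=5)
print(total_gain, choices)
```

With a budget of 5 units, the sketch gives the first group a modest promotion and spends the remainder on the second group, rather than putting most of the budget into the single largest upgrade.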
Headline (Wikitext-2 validation, lower NLL is better):
| Build | NLL | Size |
|---|---|---|
| this PMRA build (knapsack) | 12.88 | 3.094 GB |
| plain Q3_K_S (same budget) | 17.99 | 3.094 GB |
| Q4_K_M (larger) | 13.55 | 3.412 GB |
→ −5.11 NLL vs the same-size Q3_K_S, and lower NLL than even the larger Q4_K_M.
Which file?
- gemma4_e2b_it_pmra_calib_knapsack.gguf: recommended (knapsack selector).
- gemma4_e2b_it_pmra_calib_greedy.gguf: the earlier greedy-selector build, kept for reference (the knapsack build is 0.40 NLL better).
Quick start
llama-cli -m gemma4_e2b_it_pmra_calib_knapsack.gguf \
-p "Write a short hello from PMRA." -n 80
Needs a recent llama.cpp build (or Ollama) with Gemma 4 support. ~3.1 GB on disk; runs on CPU.
Footprint
- file: gemma4_e2b_it_pmra_calib_knapsack.gguf
- size: 3,110,215,968 bytes (≈ 3.11 GB) · payload 3,094,397,068 bytes · tensor count 601
- file bpw: 5.354 · payload bpw: 5.327
- SHA-256: a5a80f2628e236a228f2016bcc3ac660a268f2c8757d21d901095c74b60e3d97
- tensor reload mismatches: 0
- local llama.cpp smoke (build a8fd165): 30.5 prompt tok/s · 10.6 decode tok/s
general.file_type is inherited from the metadata source (GGUF has no enum for this mixed allocation); use the embedded pmra.* metadata and artifact_report_knapsack.json for payload accounting.
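A downloaded copy can be checked against the size and SHA-256 listed in the footprint. The following is a minimal sketch using only the Python standard library; the helper names `file_digest` and `matches_footprint` and the local file path are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Tuple

# Values copied from the Footprint list.
EXPECTED_SIZE = 3_110_215_968
EXPECTED_SHA256 = "a5a80f2628e236a228f2016bcc3ac660a268f2c8757d21d901095c74b60e3d97"


def file_digest(path: Path, chunk_size: int = 1 << 20) -> Tuple[int, str]:
    """Return (size in bytes, SHA-256 hex digest), reading the file in chunks."""
    hasher = hashlib.sha256()
    size = 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size, hasher.hexdigest()


def matches_footprint(path: Path) -> bool:
    """True if the file has the listed byte size and SHA-256 digest."""
    size, digest = file_digest(path)
    return size == EXPECTED_SIZE and digest == EXPECTED_SHA256


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the file was downloaded.
    gguf = Path("gemma4_e2b_it_pmra_calib_knapsack.gguf")
    if gguf.exists():
        print("OK" if matches_footprint(gguf) else "MISMATCH")
    else:
        print(f"{gguf} not found")
```

Reading in 1 MiB chunks keeps memory use flat, which matters for a file of roughly 3 GB.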
Benchmarks
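Calibration: Wikitext-2-raw train. Evaluation: Wikitext-2-raw validation. Lower NLL is better. The mixed rows (same-budget random, greedy, knapsack) are held to the Q3_K_S payload budget; the single-format rows are shown at their own sizes.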
| Variant | NLL | Payload bpw | Payload bytes |
|---|---|---|---|
| fp16 reference | 14.381222 | 16.000000 | 9,294,899,782 |
| Q2_K (low source) | 20.376913 | 5.118105 | 2,973,267,084 |
| Q3_K_S (target / control) | 17.993582 | 5.326613 | 3,094,396,044 |
| Q3_K_M | 15.619944 | 5.483489 | 3,185,529,996 |
| Q3_K_L | 15.756687 | 5.622925 | 3,266,532,492 |
| IQ4_XS | 16.043206 | 5.670221 | 3,294,008,460 |
| Q4_K_M | 13.549753 | 5.873431 | 3,412,059,276 |
| same-budget random | 20.488594 | 5.326613 | 3,094,396,044 |
| PMRA c2_calib_greedy_mixed | 13.281400 | 5.326291 | 3,094,208,652 |
| PMRA c2_calib_knapsack_mixed | 12.878809 | 5.326613 | 3,094,396,044 |
- knapsack vs Q3_K_S target: −5.114774 NLL, matched payload
- knapsack vs same-budget random: −7.609785 NLL
- knapsack vs greedy: −0.402591 NLL
- selected tensor groups: 204
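The deltas and bits-per-weight figures can be re-derived from the benchmark table. This sketch assumes that payload bpw means payload bytes × 8 divided by the weight count, and that the weight count can be inferred from the fp16 row at exactly 2 bytes per weight; neither definition is stated explicitly in this card. Deltas recomputed from the rounded table values can differ from the listed ones in the last digit.

```python
# Values copied from the benchmark table.
NLL = {
    "Q3_K_S": 17.993582,
    "random": 20.488594,
    "greedy": 13.281400,
    "knapsack": 12.878809,
}
PAYLOAD_BYTES = {
    "fp16": 9_294_899_782,
    "Q3_K_S": 3_094_396_044,
    "greedy": 3_094_208_652,
    "Q4_K_M": 3_412_059_276,
}

# Assumption: fp16 stores exactly 2 bytes per weight, so the weight count
# follows from the fp16 payload size.
N_WEIGHTS = PAYLOAD_BYTES["fp16"] // 2


def bpw(payload_bytes: int) -> float:
    """Bits per weight under the assumed definition: bytes * 8 / weights."""
    return payload_bytes * 8 / N_WEIGHTS


# Negative delta means the knapsack build has lower (better) NLL.
deltas = {name: NLL["knapsack"] - NLL[name] for name in ("Q3_K_S", "random", "greedy")}

for name, delta in deltas.items():
    print(f"knapsack vs {name}: {delta:+.6f} NLL")
for name, size in PAYLOAD_BYTES.items():
    print(f"{name}: {bpw(size):.6f} bpw")
```

Under these assumptions the recomputed bpw values agree with the table to six decimal places, which suggests the table uses the same definition.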
How it was built
- base: google/gemma-4-E2B-it
- GGUF sources: mradermacher/gemma-4-E2B-it-GGUF
- tensor profile gemma4 · selector c2_calib_knapsack_mixed
- low source Q2_K → target/control Q3_K_S; promotion menu Q3_K_M, Q3_K_L, IQ4_XS, Q4_K_M
Source mix
| Source | Tensors | Payload bytes |
|---|---|---|
| Q2_K | 397 | 2,637,615,244 |
| Q3_K_M | 84 | 233,001,984 |
| Q4_K_M | 56 | 119,282,688 |
| IQ4_XS | 40 | 83,140,608 |
| Q3_K_L | 24 | 21,356,544 |
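As a quick consistency check, the source-mix rows can be summed and compared with the footprint figures. The sketch below only does arithmetic on numbers from this card; reading the non-Q2_K rows as the "selected tensor groups" is an interpretation, not something the card states.

```python
# (source format, tensor count, payload bytes), copied from the source-mix table.
SOURCE_MIX = [
    ("Q2_K", 397, 2_637_615_244),
    ("Q3_K_M", 84, 233_001_984),
    ("Q4_K_M", 56, 119_282_688),
    ("IQ4_XS", 40, 83_140_608),
    ("Q3_K_L", 24, 21_356_544),
]

total_tensors = sum(count for _, count, _ in SOURCE_MIX)
total_bytes = sum(size for _, _, size in SOURCE_MIX)
# Tensors promoted above the Q2_K floor.
promoted = sum(count for fmt, count, _ in SOURCE_MIX if fmt != "Q2_K")

print(f"tensors: {total_tensors}, payload bytes: {total_bytes:,}, promoted: {promoted}")
for fmt, count, size in SOURCE_MIX:
    print(f"{fmt:8s} {count:4d} tensors  {100 * size / total_bytes:5.2f}% of payload")
```

The totals come to 601 tensors and 3,094,397,068 payload bytes, matching the footprint, and the 204 promoted tensors equal the selected-group count listed under Benchmarks.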
Files
- gemma4_e2b_it_pmra_calib_knapsack.gguf (recommended), gemma4_e2b_it_pmra_calib_greedy.gguf
- artifact_report_knapsack.json/.md, selector_result_knapsack.json/.md
- llama_cli_smoke_knapsack.log, GEMMA4_E2B_IT_KNAPSACK_RELEASE.md, and the prior greedy-release reports
Attribution & license
Derived from google/gemma-4-E2B-it (Google DeepMind) and public GGUF quantizations from mradermacher/gemma-4-E2B-it-GGUF, via llama.cpp GGUF tooling. Released under the Gemma 4 / Apache-2.0 terms; preserve upstream model, license, and quantization attribution when redistributing.
Method + reproduction: https://github.com/asystemoffields/PMRA
Limitations
- Experimental, English-calibrated; broader multilingual and multimodal evaluation is future work.
- The selector is calibration-greedy/knapsack at tensor granularity; finer allocation may improve the frontier further.