Instructions to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with llama-cpp-python:

# !pip install llama-cpp-python huggingface-hub
# (Llama.from_pretrained downloads the file through huggingface-hub)

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Asystemoffields/gemma-4-E2B-it-PMRA-GGUF",
	filename="gemma4_e2b_it_pmra_calib_knapsack.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF
# Run inference directly in the terminal:
llama-cli -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF
# Run inference directly in the terminal:
llama-cli -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF
# Run inference directly in the terminal:
./llama-cli -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

Use Docker

docker model run hf.co/Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

vLLM

How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Asystemoffields/gemma-4-E2B-it-PMRA-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Asystemoffields/gemma-4-E2B-it-PMRA-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with Ollama:
```
ollama run hf.co/Asystemoffields/gemma-4-E2B-it-PMRA-GGUF
```

Unsloth Studio

How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Asystemoffields/gemma-4-E2B-it-PMRA-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Asystemoffields/gemma-4-E2B-it-PMRA-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Asystemoffields/gemma-4-E2B-it-PMRA-GGUF to start chatting

Pi

How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Asystemoffields/gemma-4-E2B-it-PMRA-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

Run Hermes

hermes

Docker Model Runner
How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with Docker Model Runner:
```
docker model run hf.co/Asystemoffields/gemma-4-E2B-it-PMRA-GGUF
```

Lemonade

How to use Asystemoffields/gemma-4-E2B-it-PMRA-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Asystemoffields/gemma-4-E2B-it-PMRA-GGUF

Run and chat with the model

# {{QUANT_TAG}} is an unfilled placeholder; substitute the quantization tag of the pulled file.
lemonade run user.gemma-4-E2B-it-PMRA-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

Gemma 4 E2B-it · PMRA mixed-precision GGUF

A ~3.1 GB GGUF of Google DeepMind's Gemma 4 E2B-it, built to the Q3_K_S size budget. On Wikitext-2 validation it scores about 5.1 nats lower NLL than plain Q3_K_S at the same size, and about 0.67 lower than the larger Q4_K_M (12.88 vs 13.55) while staying smaller. It is a standard GGUF for text generation in llama.cpp or Ollama.

The model

Gemma 4 E2B-it is the smallest, instruction-tuned member of Google DeepMind's Gemma 4 family. Gemma 4 models are multimodal (text + image, with audio on the small models), carry a context window of up to 256K tokens, and support 140+ languages; the family spans dense and Mixture-of-Experts designs in four sizes (E2B, E4B, 26B-A4B, 31B). The "E2B" variant is the phone-and-laptop-class entry point — small enough to run locally, which is exactly the regime where a better same-size quant matters most.

Scope of this artifact: this GGUF targets the text stack for text generation in llama.cpp; image/audio input is not exercised here. Calibrated and measured on English.

Why this build (PMRA)

A normal GGUF quant uses one format for nearly every tensor, paying the same bit-rate everywhere. Production Mixed-Rate Allocation (PMRA) measures each tensor group's contribution to quality and spends bits where they buy the most: from a low-bit Q2_K floor it promotes the groups that matter to stronger formats (Q3_K_M, Q3_K_L, IQ4_XS, Q4_K_M) under a fixed byte budget. The result is one standard GGUF at the Q3_K_S size with a much lower NLL than the uniform Q3_K_S quant.
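
The allocation idea can be illustrated with a toy multiple-choice knapsack. This is a minimal sketch under stated assumptions, not the PMRA selector itself: the group names, byte costs, and gain scores are invented for illustration, and the actual method may measure and combine gains differently.

```python
from typing import Dict, List, Tuple

# Hypothetical per-group menu: (format, extra bytes over the Q2_K floor,
# estimated NLL gain). Every number here is invented for illustration.
Options = Dict[str, List[Tuple[str, int, float]]]

def allocate(options: Options, budget: int) -> Tuple[float, Dict[str, str]]:
    """Pick at most one promotion per group, maximizing total gain
    while keeping the summed extra bytes within `budget`."""
    # best[b] = (total gain, chosen formats) using at most b extra bytes
    best = [(0.0, {}) for _ in range(budget + 1)]
    for group, menu in options.items():
        nxt = list(best)  # default: leave this group at the floor format
        for fmt, cost, gain in menu:
            for b in range(cost, budget + 1):
                prev_gain, prev_pick = best[b - cost]
                if prev_gain + gain > nxt[b][0]:
                    nxt[b] = (prev_gain + gain, {**prev_pick, group: fmt})
        best = nxt
    return best[budget]

toy = {
    "attn": [("Q3_K_M", 3, 0.9), ("Q4_K_M", 6, 1.4)],
    "ffn": [("Q3_K_M", 4, 0.5), ("Q4_K_M", 8, 0.8)],
    "embed": [("IQ4_XS", 5, 1.1)],
}
gain, picks = allocate(toy, budget=10)
print(gain, picks)
```

In this toy menu the best plan promotes two groups cheaply rather than one group expensively, which is the behaviour a per-byte allocator is meant to capture. Groups missing from `picks` stay at the floor format.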

Headline (Wikitext-2 validation, lower NLL is better):

Build	NLL	Size
this PMRA build (knapsack)	12.88	3.094 GB
plain `Q3_K_S` (same budget)	17.99	3.094 GB
`Q4_K_M` (larger)	13.55	3.412 GB

→ −5.11 NLL vs the same-size Q3_K_S, and lower NLL than even the larger Q4_K_M.

Which file?

gemma4_e2b_it_pmra_calib_knapsack.gguf — recommended (knapsack selector).
gemma4_e2b_it_pmra_calib_greedy.gguf — the earlier greedy-selector build, kept for reference (the knapsack build is 0.40 NLL better).

Quick start

llama-cli -m gemma4_e2b_it_pmra_calib_knapsack.gguf \
  -p "Write a short hello from PMRA." -n 80

Needs a recent llama.cpp build (or Ollama) with Gemma 4 support. ~3.1 GB on disk; runs on CPU.
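
For scripted use, the same file can be served with llama-server and queried over its OpenAI-compatible API. The following is a minimal sketch using only the Python standard library. It assumes a server was started separately (for example `llama-server -m gemma4_e2b_it_pmra_calib_knapsack.gguf`) and is listening on the default `http://localhost:8080`; the helper names `build_request` and `chat` are illustrative.

```python
import json
import urllib.request

# Assumed address: llama-server's default host and port.
BASE_URL = "http://localhost:8080/v1"

def build_request(prompt: str, max_tokens: int = 80) -> urllib.request.Request:
    """Build a chat-completions POST request for a local llama-server."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat("Write a short hello from PMRA.")` returns the generated text. Nothing is sent until `chat` is called.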

Footprint

file: gemma4_e2b_it_pmra_calib_knapsack.gguf
size: 3,110,215,968 bytes (≈ 3.11 GB) · payload 3,094,397,068 bytes · tensor count 601
file bpw: 5.354 · payload bpw: 5.327
SHA-256: a5a80f2628e236a228f2016bcc3ac660a268f2c8757d21d901095c74b60e3d97
tensor reload mismatches: 0
local llama.cpp smoke (build a8fd165): 30.5 prompt tok/s · 10.6 decode tok/s

general.file_type is inherited from the metadata source (GGUF has no enum for this mixed allocation); use the embedded pmra.* metadata and artifact_report_knapsack.json for payload accounting.
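
A downloaded file can be checked against the SHA-256 listed in the footprint above. This is a minimal sketch with the standard library; the local path passed to `verify` is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# Digest published in the Footprint section for the knapsack build.
EXPECTED_SHA256 = "a5a80f2628e236a228f2016bcc3ac660a268f2c8757d21d901095c74b60e3d97"

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a ~3 GB GGUF never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `verify(Path("gemma4_e2b_it_pmra_calib_knapsack.gguf"))` should return `True` for an intact download of the recommended file.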

Benchmarks

Calibration: Wikitext-2-raw train. Evaluation: Wikitext-2-raw validation. Lower NLL is better. The Q3_K_S, same-budget random, and PMRA rows sit at the same payload budget; the other quant rows are shown at their native sizes.

Variant	NLL	Payload bpw	Payload bytes
fp16 reference	`14.381222`	`16.000000`	`9,294,899,782`
`Q2_K` (low source)	`20.376913`	`5.118105`	`2,973,267,084`
`Q3_K_S` (target / control)	`17.993582`	`5.326613`	`3,094,396,044`
`Q3_K_M`	`15.619944`	`5.483489`	`3,185,529,996`
`Q3_K_L`	`15.756687`	`5.622925`	`3,266,532,492`
`IQ4_XS`	`16.043206`	`5.670221`	`3,294,008,460`
`Q4_K_M`	`13.549753`	`5.873431`	`3,412,059,276`
same-budget random	`20.488594`	`5.326613`	`3,094,396,044`
PMRA `c2_calib_greedy_mixed`	`13.281400`	`5.326291`	`3,094,208,652`
PMRA `c2_calib_knapsack_mixed`	`12.878809`	`5.326613`	`3,094,396,044`

knapsack vs Q3_K_S target: −5.114774 NLL, matched payload
knapsack vs same-budget random: −7.609785 NLL
knapsack vs greedy: −0.402591 NLL
selected tensor groups: 204
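
The deltas above can be recomputed from the table. This small sketch copies the six-decimal NLL values and subtracts them. Recomputed this way, the Q3_K_S delta comes out as -5.114773 rather than the reported -5.114774; the likely cause is that the reported figure was taken from unrounded values, though that is an assumption.

```python
# NLL values copied from the benchmark table above.
nll = {
    "Q3_K_S": 17.993582,
    "same-budget random": 20.488594,
    "PMRA greedy": 13.281400,
    "Q4_K_M": 13.549753,
    "PMRA knapsack": 12.878809,
}

def delta_vs(baseline: str, candidate: str = "PMRA knapsack") -> float:
    """NLL of the candidate minus NLL of the baseline (negative is better)."""
    return nll[candidate] - nll[baseline]

for name in nll:
    if name != "PMRA knapsack":
        print(f"knapsack vs {name}: {delta_vs(name):+.6f}")
```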

How it was built

base: google/gemma-4-E2B-it
GGUF sources: mradermacher/gemma-4-E2B-it-GGUF
tensor profile gemma4 · selector c2_calib_knapsack_mixed
low source Q2_K → target/control Q3_K_S; promotion menu Q3_K_M, Q3_K_L, IQ4_XS, Q4_K_M

Source mix

Source	Tensors	Payload bytes
`Q2_K`	`397`	`2,637,615,244`
`Q3_K_M`	`84`	`233,001,984`
`Q4_K_M`	`56`	`119,282,688`
`IQ4_XS`	`40`	`83,140,608`
`Q3_K_L`	`24`	`21,356,544`
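
As a consistency check, the rows above sum to the tensor count and payload bytes reported under Footprint. The sketch below also derives the payload bits per weight. It takes the weight count from the fp16 reference row (9,294,899,782 bytes at 16 bpw), which is an assumption about how the bpw denominator is defined.

```python
# (tensor count, payload bytes) per source format, from the Source mix table.
source_mix = {
    "Q2_K": (397, 2_637_615_244),
    "Q3_K_M": (84, 233_001_984),
    "Q4_K_M": (56, 119_282_688),
    "IQ4_XS": (40, 83_140_608),
    "Q3_K_L": (24, 21_356_544),
}

total_tensors = sum(n for n, _ in source_mix.values())
total_bytes = sum(b for _, b in source_mix.values())

# Assumption: the fp16 reference row fixes the number of weights.
n_weights = 9_294_899_782 * 8 / 16
payload_bpw = total_bytes * 8 / n_weights

print(total_tensors, total_bytes, round(payload_bpw, 3))
for fmt, (n, b) in source_mix.items():
    print(f"{fmt:7s} {n:4d} tensors  {100 * b / total_bytes:5.1f}% of payload")
```

The totals come to 601 tensors and 3,094,397,068 bytes, and most of the payload (roughly 85%) stays at the Q2_K floor.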

Files

gemma4_e2b_it_pmra_calib_knapsack.gguf (recommended), gemma4_e2b_it_pmra_calib_greedy.gguf
artifact_report_knapsack.json / .md, selector_result_knapsack.json / .md
llama_cli_smoke_knapsack.log, GEMMA4_E2B_IT_KNAPSACK_RELEASE.md, and the prior greedy-release reports

Attribution & license

Derived from google/gemma-4-E2B-it (Google DeepMind) and public GGUF quantizations from mradermacher/gemma-4-E2B-it-GGUF, via llama.cpp GGUF tooling. Released under the Gemma 4 / Apache-2.0 terms; preserve upstream model, license, and quantization attribution when redistributing.

Method + reproduction: https://github.com/asystemoffields/PMRA

Limitations

Experimental, English-calibrated; broader multilingual and multimodal evaluation is future work.
The selector is calibration-greedy/knapsack at tensor granularity; finer allocation may improve the frontier further.