Instructions to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="Orivael-SRD-Lab/gemma4_e2b_srd4_q4km", filename="gemma4_e2b_srd4_q4km.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km # Run inference directly in the terminal: llama cli -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km # Run inference directly in the terminal: llama cli -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km # Run inference directly in the terminal: ./llama-cli -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km # Run inference directly in the terminal: ./build/bin/llama-cli -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
Use Docker
docker model run hf.co/Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
- LM Studio
- Jan
- vLLM
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Orivael-SRD-Lab/gemma4_e2b_srd4_q4km" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Orivael-SRD-Lab/gemma4_e2b_srd4_q4km", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker
docker model run hf.co/Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
- Ollama
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with Ollama:
ollama run hf.co/Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
- Unsloth Studio
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Orivael-SRD-Lab/gemma4_e2b_srd4_q4km to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Orivael-SRD-Lab/gemma4_e2b_srd4_q4km to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for Orivael-SRD-Lab/gemma4_e2b_srd4_q4km to start chatting
- Pi
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "Orivael-SRD-Lab/gemma4_e2b_srd4_q4km" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with Docker Model Runner:
docker model run hf.co/Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
- Lemonade
How to use Orivael-SRD-Lab/gemma4_e2b_srd4_q4km with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull Orivael-SRD-Lab/gemma4_e2b_srd4_q4km
Run and chat with the model
lemonade run user.gemma4_e2b_srd4_q4km-{{QUANT_TAG}}List all available models
lemonade list
Gemma 4 E2B-it — SRD-4 Q4_K_M GGUF
Orivael SRD-4 quantization of google/gemma-4-E2B-it, packed into a
HMAC-signed .axm container and extracted to GGUF Q4_K_M for llama.cpp /
LM Studio / PocketPal.
Patent pending — Orivael Inc.
The SRD container format and signing protocol are the subject of pending patent claims (ORVL-023, ORVL-024).
What is SRD?
Stochastic Residual Dithering (SRD) is a post-training quantization technique developed by Orivael Inc. It applies calibrated noise dithering to residual quantization error, recovering perplexity lost in aggressive bit-width reduction — with no re-training or fine-tuning required.
Every weight shard is HMAC-SHA256 signed inside an .axm container,
providing cryptographic provenance: the fingerprint is a public commitment
to exactly these weights.
Architecture — Mixture of Experts
Gemma 4 E2B is a Mixture-of-Experts (MoE) model:
| Property | Value |
|---|---|
| Total parameters | ~5B |
| Active parameters per token | ~2B (25%) |
| Experts | 8 total · 2 active per token |
| Hidden size | 2560 |
| Layers | 34 |
| Vocabulary | 262,144 tokens |
| Max context | 32,768 tokens |
| Attention | GQA |
SRD uses a MoE-aware chunk split (20 / 20 / 37 / 23 % of depth) rather than the dense-model 40–77 % range, since expert routing layers are distributed across the full depth.
Compression
| Metric | Value |
|---|---|
| BF16 baseline | ~10.0 GB |
| SRD-4 container (.axm) | ~4.6 GB |
| Q4_K_M GGUF (this file) | 3.11 GB |
| Compression vs BF16 | 69 % |
| Bits per weight (avg) | ~4.85 bpw |
| HMAC governance | Yes — fingerprinted .axm |
| Re-training required | No |
Files
| File | Size | Description |
|---|---|---|
gemma4_e2b_srd4_q4km.gguf |
3.11 GB | GGUF Q4_K_M — llama.cpp / LM Studio / PocketPal |
gemma4_e2b_met_sidecar.json |
~5 KB | MoE-aware MET slot map + RAM hydration policy |
Usage
llama.cpp
./llama-cli \
-m gemma4_e2b_srd4_q4km.gguf \
-p "<start_of_turn>user\nHello, how are you?<end_of_turn>\n<start_of_turn>model\n" \
-n 200 --temp 0.7 -ngl 99
LM Studio
Download and open gemma4_e2b_srd4_q4km.gguf directly in LM Studio.
Use the Gemma Instruct chat template.
Python (llama-cpp-python)
from llama_cpp import Llama
llm = Llama(
model_path="gemma4_e2b_srd4_q4km.gguf",
n_gpu_layers=-1,
n_ctx=4096,
)
output = llm.create_chat_completion(messages=[
{"role": "user", "content": "Hello, how are you?"}
])
print(output["choices"][0]["message"]["content"])
Hardware Requirements
| Device | VRAM / RAM | Notes |
|---|---|---|
| RTX 3080 / 4070 (10 GB) | 10 GB VRAM | Full GPU offload (-ngl 99) |
| RTX 3070 (8 GB) | 8 GB VRAM | Full GPU offload |
| Apple M2/M3 (16 GB) | 16 GB unified | Full Metal offload |
| Jetson Orin Nano 8 GB | 8 GB | -ngl 99, fits comfortably |
| CPU only (16 GB RAM) | 16 GB | Slow but works |
Axiom MET Sidecar
The included gemma4_e2b_met_sidecar.json provides Memory-Efficient
Trajectory (MET) hydration budgets for on-device inference engines that
support Axiom's intent-aware layer loading:
{
"moe": {
"num_experts": 8,
"experts_per_token": 2,
"active_frac": 0.25,
"chunk_note": "20/20/37/23 split — architecture-agnostic for MoE"
},
"hydration_policy": {
"INFORM": ["early"],
"HARM": ["early", "factual", "reasoning", "governance"]
}
}
Benchmark
PPL benchmark in progress.
WikiText-2 perplexity (stride 512, context 2048, 100 chunks) will be posted here once validated with a current llama.cpp build. An earlier run produced an invalid result due to a stale llama.cpp version at conversion time — that number is not published here.
| Metric | SRD Q4_K_M | Community Q4_K_M |
|---|---|---|
| bpw | ~4.85 | ~4.85 |
| Size | 3.11 GB | pending |
| WikiText-2 PPL | pending re-run | pending |
| HMAC governance | Yes | No |
| MoE-aware sidecar | Yes | No |
License
Weights are derived from google/gemma-4-E2B-it and are subject to the
Gemma Terms of Use.
The SRD quantization method and .axm container format are proprietary to
Orivael Inc. (patent pending ORVL-023, ORVL-024).
Citation
@misc{orivael2026srd,
title = {Stochastic Residual Dithering (SRD): Post-Training Quantization
with Cryptographic Provenance},
author = {Orivael Inc.},
year = {2026},
note = {Patent pending ORVL-023, ORVL-024}
}
- Downloads last month
- 550
We're not able to determine the quantization variants.