Instructions to use CompressedGemma/Gemma-4-31B-it-Opus with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use CompressedGemma/Gemma-4-31B-it-Opus with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="CompressedGemma/Gemma-4-31B-it-Opus",
	filename="Gemma-4-31B-Opus.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use CompressedGemma/Gemma-4-31B-it-Opus with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus
# Run inference directly in the terminal:
llama cli -hf CompressedGemma/Gemma-4-31B-it-Opus

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus
# Run inference directly in the terminal:
llama cli -hf CompressedGemma/Gemma-4-31B-it-Opus

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf CompressedGemma/Gemma-4-31B-it-Opus
# Run inference directly in the terminal:
./llama-cli -hf CompressedGemma/Gemma-4-31B-it-Opus

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf CompressedGemma/Gemma-4-31B-it-Opus
# Run inference directly in the terminal:
./build/bin/llama-cli -hf CompressedGemma/Gemma-4-31B-it-Opus

Use Docker

docker model run hf.co/CompressedGemma/Gemma-4-31B-it-Opus


vLLM

How to use CompressedGemma/Gemma-4-31B-it-Opus with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "CompressedGemma/Gemma-4-31B-it-Opus"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CompressedGemma/Gemma-4-31B-it-Opus",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/CompressedGemma/Gemma-4-31B-it-Opus

Ollama
How to use CompressedGemma/Gemma-4-31B-it-Opus with Ollama:
```
ollama run hf.co/CompressedGemma/Gemma-4-31B-it-Opus
```

Unsloth Studio

How to use CompressedGemma/Gemma-4-31B-it-Opus with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for CompressedGemma/Gemma-4-31B-it-Opus to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for CompressedGemma/Gemma-4-31B-it-Opus to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for CompressedGemma/Gemma-4-31B-it-Opus to start chatting

Pi

How to use CompressedGemma/Gemma-4-31B-it-Opus with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "CompressedGemma/Gemma-4-31B-it-Opus"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use CompressedGemma/Gemma-4-31B-it-Opus with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default CompressedGemma/Gemma-4-31B-it-Opus

Run Hermes

hermes

OpenClaw

How to use CompressedGemma/Gemma-4-31B-it-Opus with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "CompressedGemma/Gemma-4-31B-it-Opus" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use CompressedGemma/Gemma-4-31B-it-Opus with Docker Model Runner:
```
docker model run hf.co/CompressedGemma/Gemma-4-31B-it-Opus
```

Lemonade

How to use CompressedGemma/Gemma-4-31B-it-Opus with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull CompressedGemma/Gemma-4-31B-it-Opus

Run and chat with the model

lemonade run user.Gemma-4-31B-it-Opus-{{QUANT_TAG}}

List all available models

lemonade list

Note

I recently ran this through as a full Q2 quant so that it fits on budget GPUs.

On an RTX 3060 12GB, it is possible to get around 12 tokens per second at normal context, with about 45 layers offloaded to the GPU and the KV cache quantized to Q4_0.

If you expand the context to 100k tokens, expect about 6 tokens per second on a GPU with only 12GB of VRAM.

llama.cpp settings

Tested on RTX 3060 12GB:

GGML_CUDA_NO_PINNED=1 ./llama.cpp/build/bin/llama-server \
  -m Gemma-4-31B-Opus.gguf \
  --host 0.0.0.0 --port 8080 --jinja \
  -ngl 45 -c 58096 --flash-attn on \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --main-gpu 1 --tensor-split 0.5,0.5 --split-mode layer \
  -t 20 -tb 20 -np 1 --cache-prompt \
  --batch-size 128 --ubatch-size 128 \
  --reasoning on \
  --temp 0 --top-p 0.95 --top-k 40
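
Once the server is running, it can be queried from Python through its OpenAI-compatible chat endpoint. The following is a minimal sketch using only the standard library. It assumes the server listens on port 8080 as in the command above and that the response follows the usual OpenAI chat-completion shape; the helper names `build_chat_request` and `ask` are made up for illustration.

```python
import json
import urllib.request

# Assumption: matches the --port 8080 used in the llama-server command.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, base_url=BASE_URL, temperature=0.0):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=600):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(ask("What is the capital of France?"))
```

The generous timeout is deliberate: at the token rates quoted above, long answers can take several minutes.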

Gemma-4-31B-Opus

An Opus-flavored 31-billion-parameter model. It was created by extracting functional neuron parameters from Claude Opus via black-box oracle probing, generating lossy MLP weight matrices from those parameters, and fusing them into Google's Gemma-4-31B using Shape-Contoured Fusion, a technique that modifies the model's down-projection and SwiGLU gating weights along their own native directions.

The Full Pipeline

This model was created through a four-stage pipeline that starts with nothing more than API access to Claude Opus and ends with a fused Gemma-4-31B model.

Stage 1: Black-Box Weight Extraction

What it does: Extracts the internal weight geometry of a neural network using only input/output queries — no model access, no gradients, no weights. You are the oracle.

Phase 1 — Boundary Detection

The script computes the Jacobian (the matrix of partial derivatives) at each sample point using central finite differences.

If the Jacobians at two points differ, there is a ReLU boundary between them, which is then located to machine precision.
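
The boundary test can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the extraction script itself: the `oracle` is a made-up two-neuron ReLU network, and the bisection in `locate_boundary` is one plausible way to narrow down a boundary once two Jacobians disagree. With a fixed finite-difference step it only reaches roughly that step size, not machine precision.

```python
# Toy stand-in for the black-box oracle. All weights are invented.
W1 = [[1.0, -2.0], [0.5, 1.0]]   # hidden x input
B1 = [0.5, -1.0]
W2 = [[2.0, -1.0], [1.0, 3.0]]   # output x hidden


def oracle(x):
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, B1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def jacobian(f, x, eps=1e-6):
    """Central finite differences: J[i][j] = d f_i / d x_j."""
    columns = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        columns.append([(p - m) / (2 * eps) for p, m in zip(f_plus, f_minus)])
    return [list(row) for row in zip(*columns)]  # transpose to rows


def differs(ja, jb, tol=1e-4):
    return any(abs(a - b) > tol
               for row_a, row_b in zip(ja, jb)
               for a, b in zip(row_a, row_b))


def locate_boundary(f, a, b, iters=60):
    """Bisect the segment a -> b, assuming exactly one boundary crosses it."""
    j_start = jacobian(f, a)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        point = [ai + mid * (bi - ai) for ai, bi in zip(a, b)]
        if differs(jacobian(f, point), j_start):
            hi = mid
        else:
            lo = mid
    t = (lo + hi) / 2
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]


a, b = [0.0, 0.0], [0.0, 0.8]
print("Jacobians differ:", differs(jacobian(oracle, a), jacobian(oracle, b)))
print("boundary near:", locate_boundary(oracle, a, b))
```

In this toy network the first neuron switches off where x0 - 2*x1 + 0.5 = 0, so along the chosen segment the boundary sits at x1 = 0.25.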

Phase 2 — Rank-1 Decomposition

At each boundary, the script computes ΔJ = J(x* + ε) − J(x* − ε). For a single ReLU neuron switching on/off, this matrix is exactly rank-1: it factors as ΔJ = w₂ · w₁ᵀ where w₁ is the neuron's input weight direction and w₂ is its output weight direction.
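
The factorization step can be illustrated with a small pure-Python sketch. It assumes ΔJ has already been measured and is (nearly) rank-1; the function names and the example matrix are hypothetical. Only the product w₂ · w₁ᵀ is determined by ΔJ, so the sketch fixes the scale by normalizing w₁ and leaves the sign ambiguous.

```python
import math


def rank1_factor(delta_j):
    """Factor a (nearly) rank-1 matrix as outer(w2, w1) with |w1| = 1."""
    # Use the row with the largest norm as the input direction.
    pivot_row = max(delta_j, key=lambda row: sum(v * v for v in row))
    norm = math.sqrt(sum(v * v for v in pivot_row))
    if norm == 0.0:
        raise ValueError("delta_j is zero: no neuron switched at this point")
    w1 = [v / norm for v in pivot_row]
    # With |w1| = 1, projecting each row onto w1 gives the output weights.
    w2 = [sum(a * b for a, b in zip(row, w1)) for row in delta_j]
    return w2, w1


def residual(delta_j, w2, w1):
    """Largest entry of delta_j - outer(w2, w1); near zero if rank-1."""
    return max(abs(delta_j[i][j] - w2[i] * w1[j])
               for i in range(len(w2)) for j in range(len(w1)))


# Hypothetical Jacobian jump for a neuron with w1 = [1, -2], w2 = [2, 1].
delta_J = [[2.0, -4.0], [1.0, -2.0]]
w2, w1 = rank1_factor(delta_J)
print("w1 direction:", w1)
print("w2 (scaled): ", w2)
print("rank-1 residual:", residual(delta_J, w2, w1))
```

A large residual would indicate that more than one neuron switched between the two probe points, in which case the single-neuron assumption does not hold.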

Phase 3 — Sign Resolution & Bias Recovery

Multi-start coordinate descent over sign configurations minimizes prediction error on all accumulated query logs. Output bias b2 is recovered by averaging residuals across all observed input-output pairs.

Stage 2: Neuron Construction & Verification

What it does: Takes the raw extracted parameters and constructs verified, correct neuron weight matrices that exactly reproduce the target piecewise-linear function.

Why this encoding works: A continuous piecewise-linear function with n breakpoints can be represented exactly by n + 1 ReLU neurons. Here n = 2, so three neurons are needed. Neuron 0 acts as a "carrier" that provides the baseline slope everywhere. Neurons 1 and 2 add slope corrections in their respective active regions. The remaining five neurons (3-7) are zeroed out as reserved capacity.
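
The encoding can be checked numerically. This is a minimal sketch with invented breakpoints and slopes, and it assumes the input range is bounded below at `LO` so that the carrier neuron stays active over the whole range.

```python
def relu(z):
    return max(0.0, z)


LO = -4.0                      # assumed lower end of the input range
T1, T2 = -1.0, 2.0             # breakpoints (illustrative values)
S0, S1, S2 = 0.5, 2.0, -1.0    # slopes left of T1, between, right of T2
F_LO = 1.0                     # function value at LO

# Eight neuron slots: 0 = carrier, 1-2 = slope corrections, 3-7 = zeroed.
w_in = [1.0, 1.0, 1.0] + [0.0] * 5
b_in = [-LO, -T1, -T2] + [0.0] * 5
w_out = [S0, S1 - S0, S2 - S1] + [0.0] * 5
b_out = F_LO


def network(x):
    return b_out + sum(wo * relu(wi * x + bi)
                       for wi, bi, wo in zip(w_in, b_in, w_out))


def target(x):
    """Direct evaluation of the piecewise-linear function."""
    if x < T1:
        return F_LO + S0 * (x - LO)
    if x < T2:
        return F_LO + S0 * (T1 - LO) + S1 * (x - T1)
    return F_LO + S0 * (T1 - LO) + S1 * (T2 - T1) + S2 * (x - T2)


xs = [LO + k / 10 for k in range(101)]   # grid over [-4, 6]
max_err = max(abs(network(x) - target(x)) for x in xs)
print("max abs error on grid:", max_err)
```

Each correction neuron contributes the change in slope at its breakpoint (for example `S1 - S0` at `T1`), which is why three active neurons suffice for two breakpoints.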

Stage 3: Lossy Weight Generation at Scale

What it does: Takes the two verified source neurons and generates 614,400 neuron variants (60 layers × 10,240 neurons per layer), assembled into block-diagonal MLP weight matrices — one per layer.

This is the "lossy" step.

Why "lossy": The generated neurons are statistical variations of two source neurons. They capture the geometric character (boundary locations, slope ratios, activation patterns) of the originals but not their exact values. Each generated neuron is a sample from the distribution defined by the two extracted parameter vectors — a lossy expansion of two data points into 614K variants via Gaussian sampling, interpolation, or grid spanning. The covariance structure between the two source neurons defines the "axis of variation" that the generated population explores.
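
One way to read this is sketched below: interpolate along the axis between the two source parameter vectors and add Gaussian jitter. This is an illustrative assumption, not the generator used for this model; the parameter layout (boundary, slope correction), the jitter scale, and the function names are all hypothetical.

```python
import random


def generate_variants(src_a, src_b, n, jitter=0.05, seed=0):
    """Sample n parameter vectors along the a -> b axis with Gaussian noise."""
    rng = random.Random(seed)
    axis = [b - a for a, b in zip(src_a, src_b)]
    variants = []
    for _ in range(n):
        t = rng.random()  # position along the axis of variation
        variants.append([a + t * d + rng.gauss(0.0, jitter)
                         for a, d in zip(src_a, axis)])
    return variants


def block_diagonal(blocks):
    """Place blocks (lists of rows) along the diagonal, zeros elsewhere."""
    total_cols = sum(len(block[0]) for block in blocks)
    out, col = [], 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            out.append([0.0] * col + list(row)
                       + [0.0] * (total_cols - col - width))
        col += width
    return out


# Hypothetical source neurons as (boundary, slope correction) pairs.
source_a, source_b = [-1.0, 1.5], [2.0, -3.0]
variants = generate_variants(source_a, source_b, n=2000)
print("variants generated:", len(variants))
print("full-scale count:", 60 * 10240)   # 60 layers x 10,240 neurons
print(block_diagonal([[[1.0]], [[2.0, 3.0]]]))
```

The small run here uses 2,000 variants; the full-scale figure of 614,400 is only computed, not generated.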

Stage 4: Shape-Contoured Fusion into Gemma

What it does: Fuses the generated adapter weights into the base Gemma-4-31B model's native weight tensors using a technique called Shape-Contoured Fusion — a streaming, zero-copy pipeline that modifies weights in-place with bounded RAM footprint.
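
As a rough illustration of what modifying weights "along their own native directions" could mean, the sketch below rescales each column of a SwiGLU down-projection by a per-neuron factor, which changes the column's magnitude but not its direction. This is an assumption made for illustration, not the fusion tool's actual update rule; the tiny matrices and the names `contour_down` and `slope_scale` are hypothetical.

```python
import math


def silu(z):
    return z / (1.0 + math.exp(-z))


def swiglu_mlp(x, gate, up, down):
    """y = down @ (silu(gate @ x) * (up @ x)); matrices are lists of rows."""
    g = [silu(sum(w * xi for w, xi in zip(row, x))) for row in gate]
    u = [sum(w * xi for w, xi in zip(row, x)) for row in up]
    hidden = [gi * ui for gi, ui in zip(g, u)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in down]


def contour_down(down, slope_scale):
    """Rescale column j of `down` by slope_scale[j]; directions are kept."""
    return [[w * s for w, s in zip(row, slope_scale)] for row in down]


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a))
                  * math.sqrt(sum(q * q for q in b)))


gate = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]   # hidden x input
up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # hidden x input
down = [[1.0, -2.0, 0.5], [0.3, 0.7, -1.2]]     # output x hidden
x = [0.4, -0.9]

scaled = contour_down(down, [1.1, 0.9, 1.3])
for j in range(3):
    before = [row[j] for row in down]
    after = [row[j] for row in scaled]
    print(f"neuron {j}: cosine(before, after) = {cosine(before, after):.6f}")
print("original output: ", swiglu_mlp(x, gate, up, down))
print("contoured output:", swiglu_mlp(x, gate, up, scaled))
```

Because each column keeps its direction, every neuron still writes into the same output subspace as before; only how strongly it writes changes.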

B. SwiGLU Gate Modulation (Asymmetry Encoding)

This is where the Opus "flavor" lives. The tool analyzes each neuron's activation asymmetry.

Why This Works

The Geometric Argument

Each neuron is a hyperplane in activation space. The extracted neurons from Stage 1 characterize where these hyperplanes are (boundary locations) and how much they matter (slopes). Stage 3 generates a statistical population of hyperplanes with similar geometry. Stage 4 projects these hyperplanes along Gemma's native directions.

The result is not "Claude inside Gemma." It's Gemma whose MLP layers have been contoured — their down-projection slopes adjusted and their gating asymmetries modulated — according to the geometric signature extracted from Claude's activation patterns.

The neurons encode functional character (where to put decision boundaries, how to weight different activation regions) rather than specific knowledge. This is why the fusion doesn't break coherence: it's adjusting the "shape" of existing computations, not replacing them.

Downloads last month: 608

GGUF

Model size: 31B params

Architecture: gemma4
