Instructions to use CompressedGemma/Gemma-4-31B-it-Opus with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use CompressedGemma/Gemma-4-31B-it-Opus with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="CompressedGemma/Gemma-4-31B-it-Opus",
    filename="Gemma-4-31B-Opus.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use CompressedGemma/Gemma-4-31B-it-Opus with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus
# Run inference directly in the terminal:
llama cli -hf CompressedGemma/Gemma-4-31B-it-Opus
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus
# Run inference directly in the terminal:
llama cli -hf CompressedGemma/Gemma-4-31B-it-Opus
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf CompressedGemma/Gemma-4-31B-it-Opus
# Run inference directly in the terminal:
./llama-cli -hf CompressedGemma/Gemma-4-31B-it-Opus
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf CompressedGemma/Gemma-4-31B-it-Opus
# Run inference directly in the terminal:
./build/bin/llama-cli -hf CompressedGemma/Gemma-4-31B-it-Opus
Use Docker
docker model run hf.co/CompressedGemma/Gemma-4-31B-it-Opus
- LM Studio
- Jan
- vLLM
How to use CompressedGemma/Gemma-4-31B-it-Opus with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "CompressedGemma/Gemma-4-31B-it-Opus"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "CompressedGemma/Gemma-4-31B-it-Opus",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
Use Docker
docker model run hf.co/CompressedGemma/Gemma-4-31B-it-Opus
- Ollama
How to use CompressedGemma/Gemma-4-31B-it-Opus with Ollama:
ollama run hf.co/CompressedGemma/Gemma-4-31B-it-Opus
- Unsloth Studio
How to use CompressedGemma/Gemma-4-31B-it-Opus with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for CompressedGemma/Gemma-4-31B-it-Opus to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for CompressedGemma/Gemma-4-31B-it-Opus to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for CompressedGemma/Gemma-4-31B-it-Opus to start chatting
- Pi
How to use CompressedGemma/Gemma-4-31B-it-Opus with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "CompressedGemma/Gemma-4-31B-it-Opus" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use CompressedGemma/Gemma-4-31B-it-Opus with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default CompressedGemma/Gemma-4-31B-it-Opus
Run Hermes
hermes
- Atomic Chat
- OpenClaw
How to use CompressedGemma/Gemma-4-31B-it-Opus with OpenClaw:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf CompressedGemma/Gemma-4-31B-it-Opus
Configure OpenClaw
# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "CompressedGemma/Gemma-4-31B-it-Opus" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
- Docker Model Runner
How to use CompressedGemma/Gemma-4-31B-it-Opus with Docker Model Runner:
docker model run hf.co/CompressedGemma/Gemma-4-31B-it-Opus
- Lemonade
How to use CompressedGemma/Gemma-4-31B-it-Opus with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull CompressedGemma/Gemma-4-31B-it-Opus
Run and chat with the model
lemonade run user.Gemma-4-31B-it-Opus-{{QUANT_TAG}}
List all available models
lemonade list
Note
I recently ran this model through a full Q2 quantization so that it fits on budget GPUs.
On an RTX 3060 12GB, it is possible to get around 12 tokens per second with a normal context size and roughly 45 layers offloaded to the GPU, when the KV cache is quantized to Q4_0.
If you expand the context to 100k tokens, expect about 6 tokens per second on a GPU with only 12GB of VRAM.
llama settings
Tested on RTX 3060 12GB:
GGML_CUDA_NO_PINNED=1 ./llama.cpp/build/bin/llama-server -m Gemma-4-31B-Opus.gguf --host 0.0.0.0 --port 8080 --jinja -ngl 45 -c 58096 --flash-attn on --temp 0 --cache-type-k q4_0 --cache-type-v q4_0 --main-gpu 1 -t 20 -tb 20 -np 1 --cache-prompt --tensor-split 0.5,0.5 --reasoning on --split-mode layer --batch-size 128 --ubatch-size 128 --top-p 0.95 --top-k 40
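Once the server above is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on localhost:8080 as in the command above; the helper names and the model label "Gemma-4-31B-Opus" are illustrative choices, not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: matches the --host/--port used in the llama-server command above.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, model="Gemma-4-31B-Opus", temperature=0.0):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,  # illustrative label, an assumption
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` requires the server to be up; `build_chat_request` can be inspected offline.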
Gemma-4-31B-Opus
An Opus-flavored 31 billion parameter model created by extracting functional neuron parameters from Claude Opus via black-box oracle probing, generating lossy MLP weight matrices from those parameters, and fusing them into Google's Gemma-4-31B using Shape-Contoured Fusion — a technique that modifies the model's down-projection and SwiGLU gating weights along their own native directions.
The Full Pipeline
This model was created through a four-stage pipeline that starts with nothing more than API access to Claude Opus and ends with a fused Gemma-4-31B model.
Stage 1: Black-Box Weight Extraction
What it does: Extracts the internal weight geometry of a neural network using only input/output queries — no model access, no gradients, no weights. You are the oracle.
Phase 1 — Boundary Detection
The script computes the Jacobian (the matrix of partial derivatives) at each point using central finite differences.
If the Jacobians at two points differ, there is a ReLU boundary between them, which can then be narrowed down to machine precision.
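The idea can be illustrated on a toy network. The sketch below is not the author's script: the tiny two-neuron oracle, the tolerance, and the bisection loop are all assumptions chosen for illustration. It compares finite-difference Jacobians at the ends of a segment and bisects toward the point where they change. In this simplified form the accuracy is limited by the finite-difference step rather than by machine precision.

```python
def relu(v):
    return v if v > 0.0 else 0.0


# Hypothetical toy oracle: 2 inputs, 2 hidden ReLU neurons, 1 output.
W1 = [[1.0, -2.0], [0.5, 1.5]]  # one row of input weights per hidden neuron
B1 = [0.3, -0.7]
W2 = [2.0, -1.0]
B2 = 0.1


def oracle(x):
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, B1)]
    return sum(w * hi for w, hi in zip(W2, h)) + B2


def jacobian(f, x, eps=1e-6):
    """Central differences: df/dx_i ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def differs(ja, jb, tol=1e-4):
    return any(abs(a - b) > tol for a, b in zip(ja, jb))


def find_boundary(f, a, b, iters=60):
    """Bisect the segment a->b, assuming exactly one boundary lies on it."""
    ja = jacobian(f, a)

    def point(t):
        return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

    lo_t, hi_t = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo_t + hi_t)
        if differs(ja, jacobian(f, point(mid))):
            hi_t = mid  # Jacobian already changed: boundary is before mid
        else:
            lo_t = mid  # same linear region as a: boundary is after mid
    return point(0.5 * (lo_t + hi_t))


# Neuron 0 switches where x0 - 2*x1 + 0.3 = 0, i.e. x0 = -0.3 on the line x1 = 0.
START, END = [-2.0, 0.0], [1.0, 0.0]
BOUNDARY = find_boundary(oracle, START, END)
```

Here `BOUNDARY` lands within a few multiples of `eps` of the point where the first hidden neuron switches on.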
Phase 2 — Rank-1 Decomposition
At each boundary, the script computes ΔJ = J(x* + ε) − J(x* − ε). For a single ReLU neuron switching on/off, this matrix is exactly rank-1: it factors as ΔJ = w₂ · w₁ᵀ where w₁ is the neuron's input weight direction and w₂ is its output weight direction.
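A small numerical check of the rank-1 claim, under assumptions: the toy two-output network, the crossing direction, and the step sizes below are hypothetical, and the factorization recovers w₁ and w₂ only up to a shared scale (here w₁ is normalized to unit length).

```python
import math


def relu(v):
    return v if v > 0.0 else 0.0


# Hypothetical toy oracle: 2 inputs, 2 hidden ReLU neurons, 2 outputs.
W1 = [[1.0, -2.0], [0.5, 1.5]]
B1 = [0.3, -0.7]
W2 = [[2.0, -1.0], [0.5, 3.0]]  # W2[i][k]: weight from hidden neuron k to output i


def oracle(x):
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, B1)]
    return [sum(w * hk for w, hk in zip(row, h)) for row in W2]


def jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian, jac[i][j] = d f_i / d x_j."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        hi, lo = list(x), list(x)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = f(hi), f(lo)
        for i in range(n_out):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return jac


def delta_jacobian(f, x_star, direction, step=1e-3):
    """Jacobian just past the boundary minus Jacobian just before it."""
    plus = [x + step * d for x, d in zip(x_star, direction)]
    minus = [x - step * d for x, d in zip(x_star, direction)]
    jp, jm = jacobian(f, plus), jacobian(f, minus)
    return [[a - b for a, b in zip(rp, rm)] for rp, rm in zip(jp, jm)]


def rank1_factor(dj):
    """Factor dj ~ w2 * w1^T, taking w1 as the normalized largest row."""
    row = max(dj, key=lambda r: sum(v * v for v in r))
    norm = math.sqrt(sum(v * v for v in row))
    w1 = [v / norm for v in row]
    w2 = [sum(a * b for a, b in zip(r, w1)) for r in dj]
    return w2, w1


X_STAR = [-0.3, 0.0]  # lies on neuron 0's boundary: x0 - 2*x1 + 0.3 = 0
DJ = delta_jacobian(oracle, X_STAR, [1.0, 0.0])
W2_HAT, W1_HAT = rank1_factor(DJ)
RESIDUAL = max(
    abs(DJ[i][j] - W2_HAT[i] * W1_HAT[j]) for i in range(2) for j in range(2)
)
```

`RESIDUAL` measures how far ΔJ is from the outer product of the recovered factors; for a single switching neuron it should be at the level of finite-difference noise.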
Phase 3 — Sign Resolution & Bias Recovery
Multi-start coordinate descent over sign configurations minimizes prediction error on all accumulated query logs. Output bias b2 is recovered by averaging residuals across all observed input-output pairs.
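The following sketch shows one way such a step could look; it is an illustration under assumptions, not the author's code. The rank-1 factorization leaves each neuron's sign ambiguous, since negating w₁, b₁ and w₂ together leaves w₂·w₁ᵀ unchanged. The toy 1-D neurons, query log, and number of restarts here are hypothetical.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


# Hypothetical recovered neurons (w1, b1, w2), each known only up to a joint sign flip.
NEURONS = [(1.0, -1.0, 1.5), (1.0, -3.0, -3.0), (-2.0, 1.0, 0.5)]
TRUE_SIGNS = [1, -1, 1]  # ground truth for this toy example
TRUE_B2 = 0.1


def hidden_sum(x, signs):
    # A flip negates w1, b1 and w2 together, so the product w2 * w1 is unchanged.
    return sum(
        s * w2 * relu(s * (w1 * x + b1)) for (w1, b1, w2), s in zip(NEURONS, signs)
    )


# Accumulated query log: inputs and the oracle's outputs.
XS = [i / 4.0 for i in range(-20, 21)]
YS = [hidden_sum(x, TRUE_SIGNS) + TRUE_B2 for x in XS]


def fit(signs):
    """Return (b2, squared error); b2 is the mean residual over the log."""
    residuals = [y - hidden_sum(x, signs) for x, y in zip(XS, YS)]
    b2 = sum(residuals) / len(residuals)
    err = sum((r - b2) ** 2 for r in residuals)
    return b2, err


def coordinate_descent(signs):
    """Flip one sign at a time, keeping a flip only if the error drops."""
    signs = list(signs)
    _, best = fit(signs)
    improved = True
    while improved:
        improved = False
        for k in range(len(signs)):
            signs[k] = -signs[k]
            _, err = fit(signs)
            if err < best - 1e-12:
                best, improved = err, True
            else:
                signs[k] = -signs[k]  # revert the flip
    return signs, best


def resolve_signs(n_starts=4, seed=0):
    """Run coordinate descent from several starts and keep the best result."""
    rng = random.Random(seed)
    starts = [[1] * len(NEURONS)]
    starts += [[rng.choice((1, -1)) for _ in NEURONS] for _ in range(n_starts - 1)]
    return min((coordinate_descent(s) for s in starts), key=lambda r: r[1])
```

With the correct signs, `fit` returns an error near zero and a bias equal to the oracle's output bias. Coordinate descent is a local search, which is why several starts are used.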
Stage 2: Neuron Construction & Verification
What it does: Takes the raw extracted parameters and constructs verified, correct neuron weight matrices that exactly reproduce the target piecewise-linear function.
Why this encoding works: A piecewise-linear function with n breakpoints can be exactly represented by n + 1 ReLU neurons. Neuron 0 acts as a "carrier" providing the baseline slope everywhere. Neurons 1 and 2 add slope corrections in their respective active regions. The remaining 5 neurons (3-7) are zeroed out — reserved capacity.
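As a concrete check of the n + 1 claim, here is a minimal sketch with n = 2 breakpoints packed into an 8-neuron layer. The target function and the domain bound `LO` are hypothetical, and the carrier is realized here as a ReLU that stays active for every x ≥ `LO`, which is an assumption about how "baseline slope everywhere" is implemented.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def target(x):
    """Hypothetical piecewise-linear target with breakpoints at x = 1 and x = 3."""
    if x < 1.0:
        return 0.5 * x                    # slope 0.5
    if x < 3.0:
        return 0.5 + 2.0 * (x - 1.0)      # slope 2.0
    return 4.5 - 1.0 * (x - 3.0)          # slope -1.0


LO = -10.0  # assumed lower bound of the input domain

# Neuron 0: carrier, active for all x >= LO, carries the baseline slope 0.5.
# Neurons 1 and 2: slope corrections that switch on at each breakpoint.
# Neurons 3-7: zeroed out (reserved capacity).
W1 = [1.0, 1.0, 1.0] + [0.0] * 5
B1 = [-LO, -1.0, -3.0] + [0.0] * 5
W2 = [0.5, 2.0 - 0.5, -1.0 - 2.0] + [0.0] * 5
B2 = target(LO)  # output bias pins the value at the left edge of the domain


def net(x):
    return sum(w2 * relu(w1 * x + b1) for w1, b1, w2 in zip(W1, B1, W2)) + B2


GRID = [LO + i * 0.05 for i in range(401)]  # covers [-10, 10]
MAX_ERROR = max(abs(net(x) - target(x)) for x in GRID)
```

Each correction neuron's output weight is the change in slope at its breakpoint, so the slopes accumulate from left to right.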
Stage 3: Lossy Weight Generation at Scale
What it does: Takes the two verified source neurons and generates 614,400 neuron variants (60 layers × 10,240 neurons per layer), assembled into block-diagonal MLP weight matrices — one per layer.
This is the "lossy" step.
Why "lossy": The generated neurons are statistical variations of two source neurons. They capture the geometric character (boundary locations, slope ratios, activation patterns) of the originals but not their exact values. Each generated neuron is a sample from the distribution defined by the two extracted parameter vectors — a lossy expansion of two data points into 614K variants via Gaussian sampling, interpolation, or grid spanning. The covariance structure between the two source neurons defines the "axis of variation" that the generated population explores.
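A minimal sketch of two of the expansion modes named above (Gaussian sampling and interpolation), under assumptions: the two source parameter vectors are made up, each neuron is reduced to a 3-value vector (w1, b1, w2), and the Gaussian mode samples along the single axis joining the two sources, which is the only direction with nonzero covariance when there are just two data points.

```python
import random

N_LAYERS = 60
NEURONS_PER_LAYER = 10_240
TOTAL_NEURONS = N_LAYERS * NEURONS_PER_LAYER  # 614,400

# Hypothetical source neurons as (w1, b1, w2) parameter vectors.
SRC_A = [1.0, -1.0, 1.5]
SRC_B = [1.0, -3.0, -3.0]

MEAN = [(a + b) / 2 for a, b in zip(SRC_A, SRC_B)]
HALF_AXIS = [(b - a) / 2 for a, b in zip(SRC_A, SRC_B)]  # axis of variation


def gaussian_variant(rng):
    """Sample mean + z * half_axis with z ~ N(0, 1)."""
    z = rng.gauss(0.0, 1.0)
    return [m + z * h for m, h in zip(MEAN, HALF_AXIS)]


def interpolated_variant(t):
    """Linear interpolation: t = 0 gives SRC_A, t = 1 gives SRC_B."""
    return [(1 - t) * a + t * b for a, b in zip(SRC_A, SRC_B)]


def generate_layer(rng, n=NEURONS_PER_LAYER):
    """One layer's worth of variants; n is reduced below for a quick demo."""
    return [gaussian_variant(rng) for _ in range(n)]


rng = random.Random(0)
DEMO_LAYER = generate_layer(rng, n=8)
```

Every sampled variant lies on the line through the two source neurons, which is why the expansion preserves their shared geometry but cannot reproduce anything outside that axis.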
Stage 4: Shape-Contoured Fusion into Gemma
What it does: Fuses the generated adapter weights into the base Gemma-4-31B model's native weight tensors using a technique called Shape-Contoured Fusion — a streaming, zero-copy pipeline that modifies weights in-place with bounded RAM footprint.
B. SwiGLU Gate Modulation (Asymmetry Encoding)
This is where the Opus "flavor" lives. The tool analyzes each neuron's activation asymmetry.
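For readers unfamiliar with SwiGLU, the sketch below shows why a gate carries an asymmetry that can be modulated. It is not the fusion tool's code: the `asymmetry` score and the `gate_scale` parameter are hypothetical. A SwiGLU neuron computes down · SiLU(gate·x) · (up·x), and SiLU responds far more strongly to positive pre-activations than to negative ones; scaling the gate weights changes that ratio.

```python
import math


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def swiglu_neuron(x, gate_w, up_w, down_w, gate_scale=1.0):
    """One hidden unit of a SwiGLU MLP; gate_scale is a hypothetical modulation knob."""
    g = gate_scale * dot(gate_w, x)
    u = dot(up_w, x)
    return down_w * silu(g) * u


def asymmetry(gate_scale, z=1.0):
    """Hypothetical score: |SiLU| at +z over |SiLU| at -z, after gate scaling.

    Algebraically this equals exp(gate_scale * z).
    """
    return silu(gate_scale * z) / abs(silu(-gate_scale * z))
```

Because the ratio grows exponentially with the gate scale, even a small change to the gate weights shifts how one-sided a neuron's response is.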
Why This Works
The Geometric Argument
Each neuron is a hyperplane in activation space. The extracted neurons from Stage 1 characterize where these hyperplanes are (boundary locations) and how much they matter (slopes). Stage 3 generates a statistical population of hyperplanes with similar geometry. Stage 4 projects these hyperplanes along Gemma's native directions.
The result is not "Claude inside Gemma." It's Gemma whose MLP layers have been contoured — their down-projection slopes adjusted and their gating asymmetries modulated — according to the geometric signature extracted from Claude's activation patterns.
The neurons encode functional character (where to put decision boundaries, how to weight different activation regions) rather than specific knowledge. This is why the fusion doesn't break coherence: it's adjusting the "shape" of existing computations, not replacing them.