Gemma3.5-48B-A4B-Q4_K_M
by XxACCOxX
Gemma3.5-48B-A4B-Q4_K_M is a community-built GGUF release that combines the instruction behavior of Gemma 3 27B with the Gemma 4 26B A4B MoE backbone through full-layer activation-distilled donor experts.
The model keeps the original Gemma 4 expert bank intact, appends a second donor bank derived from Gemma 3 across all 30 language layers, and preserves a Gemma 4-compatible inference path for local deployment.
Model Summary
- Formal name: Gemma3.5-48B-A4B-Q4_K_M
- Architecture: gemma4
- Quantization: Q4_K_M
- Total parameters: 48.1B
- Estimated active parameters: ~3.8B
- Context length: 262,144
- Total experts: 256
- Experts used per token: 8
- Expert layout:
  - slots 0..127: original Gemma 4 experts
  - slots 128..255: Gemma 3 activation-distilled donor experts
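The expert layout can be read as a simple index-to-bank mapping. The sketch below encodes the slot ranges from the summary and works out the rough share of parameters active per token. The helper name `bank_for_slot` and the bank labels are hypothetical and used only for illustration; they are not identifiers from the release.

```python
# Figures taken from the model summary.
TOTAL_EXPERTS = 256
ORIGINAL_SLOTS = range(0, 128)    # original Gemma 4 experts
DONOR_SLOTS = range(128, 256)     # Gemma 3 activation-distilled donor experts


def bank_for_slot(slot: int) -> str:
    """Return which expert bank a slot index belongs to (hypothetical helper)."""
    if slot in ORIGINAL_SLOTS:
        return "gemma4-original"
    if slot in DONOR_SLOTS:
        return "gemma3-donor"
    raise ValueError(f"slot {slot} is outside 0..{TOTAL_EXPERTS - 1}")


# Rough share of parameters touched per token, from the listed 48.1B total
# and ~3.8B estimated active parameters.
active_fraction = 3.8 / 48.1

print(bank_for_slot(0), bank_for_slot(127), bank_for_slot(128), bank_for_slot(255))
print(f"active fraction ~ {active_fraction:.1%}")
```

With the listed figures, roughly one twelfth of the total parameters are exercised for any single token.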
Construction
This model uses Gemma 4 26B A4B as the base MoE runtime and appends a full 30-layer donor expert bank derived from Gemma 3 27B.
The donor side was built through activation distillation rather than direct dense replacement. Non-expert backbone tensors remain aligned with the Gemma 4 runtime layout, while the appended donor experts extend the expert bank without overwriting the original Gemma 4 experts.
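The card does not describe the router itself, so the following is only a generic illustration of how a token might select 8 of 256 experts when the donor bank is appended after the original one. It assumes a standard softmax top-k gate and uses made-up random router scores; neither is confirmed as this model's actual mechanism.

```python
import math
import random

TOTAL_EXPERTS = 256   # from the model summary
TOP_K = 8             # experts used per token, from the model summary
DONOR_START = 128     # first donor slot, from the expert layout


def route_top_k(logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalise their weights.

    Generic top-k gating for illustration; not confirmed as this model's router.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(TOTAL_EXPERTS)]  # made-up scores
selected = route_top_k(router_logits)

donor_hits = sum(1 for slot, _ in selected if slot >= DONOR_START)
print(f"{len(selected)} experts selected, {donor_hits} from the donor bank")
```

Under this assumption, a single token can mix experts from both banks, since the appended donor slots compete in the same top-k selection as the original ones.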
Previous release from the same author:
MMLU-Pro No-Think Result
On a fixed stratified MMLU-Pro subset of 280 items (20 items from each of 14 categories), run with seed = 42, think = false, and num_predict = 2048, the results are:
- Gemma3.5-48B-A4B-Q4_K_M: 63.21% (177 / 280)
- Gemma 4 26B A4B base: 55.71% (156 / 280)
This is a +7.50 percentage-point improvement over the original Gemma 4 26B A4B base under the same no-think configuration.
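The reported percentages follow directly from the raw counts. A minimal check, using only the numbers stated in this section (the function name `accuracy_pct` is just for illustration):

```python
ITEMS = 14 * 20  # 14 categories x 20 items each


def accuracy_pct(correct: int, total: int = ITEMS) -> float:
    """Accuracy as a percentage of the evaluated items."""
    return 100.0 * correct / total


merged = accuracy_pct(177)  # Gemma3.5-48B-A4B-Q4_K_M
base = accuracy_pct(156)    # Gemma 4 26B A4B base

print(f"merged: {merged:.2f}%  base: {base:.2f}%  delta: {merged - base:+.2f} points")
```

The gap corresponds to 21 additional correct answers out of 280.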
Intended Use
Gemma3.5-48B-A4B-Q4_K_M is intended for local instruction following in direct-answer mode. It targets general chat, mathematics, technical prompts, and broad knowledge tasks, while keeping the active path small relative to its total parameter count.
Format
This release is provided as a GGUF model for local inference stacks such as llama.cpp and Ollama-based workflows.
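As one possible way to load the GGUF from Python, here is a minimal sketch using llama-cpp-python, which wraps llama.cpp. The repository id and file name are assumptions taken from the model's Hugging Face listing, and `build_messages` and `ask` are hypothetical helper names. It also assumes the installed llama.cpp build recognises the gemma4 architecture and that `huggingface_hub` is available for the download.

```python
REPO_ID = "XxACCOxX/gemma3.5-48b_a4b"   # assumed Hugging Face repository id
GGUF_FILE = "Gemma3.5-48b_a4b.gguf"     # assumed file name inside that repository


def build_messages(prompt: str) -> list[dict]:
    """Wrap a single user prompt in the chat-completion message format."""
    return [{"role": "user", "content": prompt}]


def ask(prompt: str) -> str:
    """Download the GGUF on first use and return one direct answer."""
    # Imported lazily so the helpers above work without llama-cpp-python installed.
    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama.from_pretrained(repo_id=REPO_ID, filename=GGUF_FILE)
    reply = llm.create_chat_completion(messages=build_messages(prompt))
    return reply["choices"][0]["message"]["content"]
```

Calling `ask("What is the capital of France?")` would trigger the download and a single chat completion. A Q4_K_M file for a 48.1B-parameter model is large, so check disk space and memory before running it.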
License
This release is derived from both Gemma 3 and Gemma 4 upstream checkpoints.
Gemma 4 model pages are published under Apache 2.0, while Gemma 3 model access remains tied to Google's usage license. Because this release combines both upstream lines, it should not be represented as a pure Apache-2.0 model artifact.
Use and redistribution should follow the applicable upstream terms for both source model families.