Instructions to use zeroxjason200/google_gemma-4-26B-A4B-it-assistant with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use zeroxjason200/google_gemma-4-26B-A4B-it-assistant with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="zeroxjason200/google_gemma-4-26B-A4B-it-assistant", filename="google_gemma-4-26B-A4B-it-assistant.Q8_0.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use zeroxjason200/google_gemma-4-26B-A4B-it-assistant with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0 # Run inference directly in the terminal: llama cli -hf zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0 # Run inference directly in the terminal: llama cli -hf zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0 # Run inference directly in the terminal: ./llama-cli -hf zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0 # Run inference directly in the terminal: ./build/bin/llama-cli -hf zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0
Use Docker
docker model run hf.co/zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0
- LM Studio
- Jan
- Ollama
How to use zeroxjason200/google_gemma-4-26B-A4B-it-assistant with Ollama:
ollama run hf.co/zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0
- Unsloth Studio
How to use zeroxjason200/google_gemma-4-26B-A4B-it-assistant with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for zeroxjason200/google_gemma-4-26B-A4B-it-assistant to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for zeroxjason200/google_gemma-4-26B-A4B-it-assistant to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for zeroxjason200/google_gemma-4-26B-A4B-it-assistant to start chatting
- Atomic Chat new
- Docker Model Runner
How to use zeroxjason200/google_gemma-4-26B-A4B-it-assistant with Docker Model Runner:
docker model run hf.co/zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0
- Lemonade
How to use zeroxjason200/google_gemma-4-26B-A4B-it-assistant with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull zeroxjason200/google_gemma-4-26B-A4B-it-assistant:Q8_0
Run and chat with the model
lemonade run user.google_gemma-4-26B-A4B-it-assistant-Q8_0
List all available models
lemonade list
Q8_0 Quant of the google assistant model (MTP) for the 26B-A4B. Works with latest llama.cpp (9551). All the other quants uploaded seem to have non matching architecture so failed to load for me [1].
Gives me 100ts/s up from about 85t/s without MTP (--spec-draft-max 2) on a fresh task.
.38.515.475 I slot print_timing: id 3 | task 0 | prompt eval time = 1147.24 ms / 2419 tokens ( 0.47 ms per token, 2108.55 tokens per second)
0.38.515.479 I slot print_timing: id 3 | task 0 | eval time = 14292.86 ms / 1432 tokens ( 9.98 ms per token, 100.19 tokens per second)
0.38.515.481 I slot print_timing: id 3 | task 0 | total time = 15440.10 ms / 3851 tokens
0.38.515.485 I slot print_timing: id 3 | task 0 | graphs reused = 668
0.38.515.486 I slot print_timing: id 3 | task 0 | draft acceptance = 0.56231 ( 758 accepted / 1348 generated)
0.38.515.506 I statistics draft-mtp: #calls(b,g,a) = 1 674 674, #gen drafts = 674, #acc drafts = 461, #gen tokens = 1348, #acc tokens = 758, dur(b,g,a) = 0.003, 2567.398, 1.100 ms
[1] e.g. stuff like this when you try to load them
0.01.002.728 I srv load_model: [mtmd] estimated worst-case memory usage of mmproj is 1279.96 MiB
0.01.314.746 E llama_model_load: error loading model: unknown model architecture: 'gemma4_mtp'
0.01.314.762 E llama_model_load_from_file_impl: failed to load model
0.01.314.822 W srv load_model: [spec] failed to measure draft model memory: failed to load model
- Downloads last month
- 392
8-bit
Model tree for zeroxjason200/google_gemma-4-26B-A4B-it-assistant
Base model
google/gemma-4-26B-A4B-it-assistant