Intics/gemma-4-26B-A4B-IT-Int8

An INT8 (W8A8) quantized version of google/gemma-4-26B-A4B-it, intended for efficient inference and serving with vLLM.

Model Details

Base Model: google/gemma-4-26B-A4B-it
Quantization: INT8 (W8A8)
Architecture: Mixture-of-Experts (MoE)
Modalities: Text + Image
Context Length: 256K
Active Parameters: ~4B
Total Parameters: ~26B

This model is intended for:

Efficient inference
vLLM serving
Multi-GPU deployments
Lower VRAM usage compared to BF16

Quantization

This model was quantized using:

llm-compressor
compressed-tensors

Quantization format:

Weights: INT8
Activations: INT8

The vision encoder and embedding layers were excluded from quantization for better stability and multimodal quality.
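
To make the W8A8 format concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization applied to both operands of a dot product. It is an illustration only: the function names are hypothetical, it uses one scale per vector, and it does not reproduce the calibration or the exact scaling granularity that llm-compressor applied to this checkpoint (the card does not state those details).

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: map floats to integers in [-127, 127].

    Returns the integer codes and the scale needed to dequantize them.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(weights, activations):
    """Dot product with INT8 weights and INT8 activations (W8A8)."""
    w_codes, w_scale = quantize_int8(weights)
    a_codes, a_scale = quantize_int8(activations)
    # Accumulate in integers, then rescale once at the end.
    accumulator = sum(w * a for w, a in zip(w_codes, a_codes))
    return accumulator * w_scale * a_scale


# Toy values chosen for illustration; not taken from the model.
weights = [0.12, -0.50, 0.33, 0.07]
activations = [1.5, -2.0, 0.25, 4.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f} abs_err={abs(exact - approx):.4f}")
```

The rounding error stays small relative to the result, which is the intuition behind quantizing the bulk of the linear layers while leaving more sensitive components in higher precision.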

Hardware Requirements

Recommended:

2× RTX 3090
A100 40GB+
H100

Approximate VRAM:

BF16: ~55GB
INT8: ~30GB
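
As a rough sanity check on these figures, the weight footprint can be estimated from the parameter count alone. The sketch below assumes exactly 26e9 parameters, 2 bytes per BF16 weight and 1 byte per INT8 weight, and counts weights only; the helper name is hypothetical.

```python
TOTAL_PARAMS = 26e9  # "~26B" total parameters, from Model Details


def weight_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


bf16_gb = weight_gb(TOTAL_PARAMS, 2)  # BF16 stores 2 bytes per weight
int8_gb = weight_gb(TOTAL_PARAMS, 1)  # INT8 stores 1 byte per weight

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"INT8 weights: ~{int8_gb:.0f} GB")
```

This gives about 52 GB and 26 GB. The gap to the ~55GB and ~30GB listed above is plausibly explained by components kept in higher precision (the vision encoder and embeddings), quantization scales, and runtime overhead, although the card does not break this down. KV cache memory comes on top and grows with context length.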

vLLM Serving

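# Run from the directory that contains the downloaded model files;
# it is mounted into the container at /model.
# --tensor-parallel-size 2 assumes two GPUs; set it to 1 on a single GPU.
docker run --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -v "$(pwd)":/model \
  vllm/vllm-openai:latest \
  --model /model \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --served-model-name gemma4-int8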
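
Once the container is running, the server can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal standard-library sketch, assuming the server is reachable at localhost:8000 (from `-p 8000:8000`) and the model is registered as `gemma4-int8` (from `--served-model-name`). The helper names `build_chat_request` and `ask` are hypothetical, and the optional image input uses the OpenAI-style `image_url` content part.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed host; port from -p 8000:8000
MODEL_NAME = "gemma4-int8"             # value passed to --served-model-name


def build_chat_request(prompt, image_url=None, max_tokens=256):
    """Build a POST request for the /chat/completions endpoint."""
    if image_url is None:
        content = prompt
    else:
        # Text + image message in the OpenAI-style multimodal format.
        content = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, image_url=None):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, image_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Explain Mixture-of-Experts models."))` should print the reply. Nothing is sent until `ask` is called.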

Transformers Usage

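# Requires the transformers, accelerate, and compressed-tensors packages.
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "Intics/gemma-4-26B-A4B-IT-Int8"

# The processor bundles the tokenizer and the chat template.
processor = AutoProcessor.from_pretrained(MODEL_ID)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",  # place layers across the available GPUs
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "Explain Mixture-of-Experts models."
    }
]

# Render the conversation into a prompt string using the chat template.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    text=text,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256
)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(processor.decode(outputs[0][prompt_length:], skip_special_tokens=True))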

Notes

Optimized primarily for inference workloads.
INT8 quantization reduces approximate VRAM usage from ~55GB (BF16) to ~30GB while preserving most model quality.
Best served using vLLM.

License

This model follows the same license as the original Gemma 4 release.

Please review: https://ai.google.dev/gemma/docs/gemma_4_license

Credits

Google DeepMind
vLLM
llm-compressor
compressed-tensors
