Intics/gemma-4-26B-A4B-IT-Int8

An INT8 (W8A8) quantized version of google/gemma-4-26B-A4B-it, optimized for efficient inference and serving with vLLM.

Model Details

  • Base Model: google/gemma-4-26B-A4B-it
  • Quantization: INT8 (W8A8)
  • Architecture: Mixture-of-Experts (MoE)
  • Modalities: Text + Image
  • Context Length: 256K tokens
  • Active Parameters: ~4B
  • Total Parameters: ~26B

This model is intended for:

  • Efficient inference
  • vLLM serving
  • Multi-GPU deployments
  • Lower VRAM usage compared to BF16

Quantization

This model was quantized using:

  • llm-compressor
  • compressed-tensors

Quantization format:

  • Weights: INT8
  • Activations: INT8

The vision encoder and embedding layers were excluded from quantization for better stability and multimodal quality.
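To make the W8A8 format more concrete, here is a minimal sketch of symmetric INT8 quantization applied to a single row of values. It illustrates the general idea only and is not the llm-compressor implementation; the function names `quantize_int8` and `dequantize_int8` and the sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127] using one scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero row, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the INT8 codes and their scale."""
    return [q * scale for q in quantized]


# Hypothetical sample row; real weight rows have thousands of entries.
weights = [0.50, -1.27, 0.03, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored as one byte plus a shared scale, which is where the memory saving over 2-byte BF16 comes from. Components left unquantized, such as the vision encoder and embedding layers here, keep their original precision.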


Hardware Requirements

Recommended:

  • 2× RTX 3090
  • A100 40GB+
  • H100

Approximate VRAM:

  • BF16: ~55GB
  • INT8: ~30GB
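The figures above can be sanity-checked with a rough weights-only estimate. This is a sketch under stated assumptions: it uses the "~26B" total parameter count from Model Details (all experts must be resident in memory even though only ~4B parameters are active per token), ignores the KV cache and runtime overhead, and treats every weight as quantized, which is not quite true here because the vision encoder and embeddings stay unquantized. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 26e9  # "~26B" total parameters from Model Details

bf16_weights_gb = weight_memory_gb(TOTAL_PARAMS, 2)  # BF16 stores 2 bytes per parameter
int8_weights_gb = weight_memory_gb(TOTAL_PARAMS, 1)  # INT8 stores 1 byte per parameter

# Compare the approximate VRAM figures quoted above with 2x RTX 3090 (24 GB each).
QUOTED_VRAM_GB = {"BF16": 55, "INT8": 30}
DUAL_3090_GB = 2 * 24
fits_on_dual_3090 = {name: need <= DUAL_3090_GB for name, need in QUOTED_VRAM_GB.items()}

print(f"BF16 weights: ~{bf16_weights_gb:.0f} GB, INT8 weights: ~{int8_weights_gb:.0f} GB")
print(fits_on_dual_3090)
```

The weights-only estimates come out a few GB below the quoted figures, which is consistent with the quoted numbers also covering unquantized components and overhead. On the quoted figures, only the INT8 variant fits within the 48 GB of a 2× RTX 3090 setup.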

vLLM Serving

docker run --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -v "$(pwd)":/model \
  vllm/vllm-openai:latest \
  --model /model \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --served-model-name gemma4-int8
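Once the container is running, the server can be queried through vLLM's OpenAI-compatible chat completions route. The following is a minimal client sketch using only the Python standard library. The port and served model name are taken from the command above; the `localhost` host and the helper names `build_chat_request` and `ask` are assumptions made for illustration.

```python
import json
import urllib.request

# Port 8000 and the model name come from the docker command above;
# "localhost" is an assumption for a local deployment.
BASE_URL = "http://localhost:8000"
SERVED_MODEL_NAME = "gemma4-int8"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": SERVED_MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, calling `ask("Explain Mixture-of-Experts models.")` sends a single-turn request and returns the generated text. The `model` field must match the value passed to `--served-model-name`.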

Transformers Usage

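from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "Intics/gemma-4-26B-A4B-IT-Int8"

# Load the processor (tokenizer and image preprocessing) and the quantized model.
processor = AutoProcessor.from_pretrained(MODEL_ID)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",  # place layers across the available GPUs
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "Explain Mixture-of-Experts models."
    }
]

# Render the conversation with the model's chat template.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    text=text,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256
)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(processor.decode(outputs[0][prompt_length:], skip_special_tokens=True))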

Notes

  • Optimized primarily for inference workloads.
  • INT8 quantization reduces approximate VRAM usage from ~55GB (BF16) to ~30GB while preserving most model quality.
  • Best served using vLLM.

License

This model follows the same license as the original Gemma 4 release.

Please review: https://ai.google.dev/gemma/docs/gemma_4_license


Credits

  • Google DeepMind
  • vLLM
  • llm-compressor
  • compressed-tensors