Dataset: mit-han-lab/pile-val-backup
INT8 quantized version of google/gemma-4-26B-A4B-it optimized for efficient inference and serving using vLLM.
Base model: google/gemma-4-26B-A4B-it
This model is intended for:
This model was quantized using llm-compressor and compressed-tensors.
Quantization format:
The vision encoder and embedding layers were excluded from quantization for better stability and multimodal quality.
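As a rough illustration of how such an exclusion can be expressed, the sketch below mimics the ignore-list convention used by llm-compressor recipes, where an entry prefixed with `re:` is treated as a regular expression over module names. The module names and patterns here are hypothetical placeholders, not the actual layer names or recipe used for this checkpoint.

```python
import re

def is_ignored(module_name, ignore):
    """Return True if module_name matches any entry of the ignore list.

    Entries starting with "re:" are treated as regular expressions;
    all other entries must match the module name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False

# Hypothetical module names, for illustration only.
MODULES = [
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.embed_tokens",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.experts.0.down_proj",
]

# Assumed patterns: skip the vision encoder and the embedding layers.
IGNORE = ["re:.*vision_tower.*", "re:.*embed_tokens.*"]

# Everything not ignored would be handed to the quantizer.
quantized = [name for name in MODULES if not is_ignored(name, IGNORE)]
for name in quantized:
    print(name)
```

With these placeholder names, only the two language-model projection layers remain as quantization targets.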
Recommended:
Approximate VRAM:
docker run --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -v "$(pwd)":/model \
  vllm/vllm-openai:latest \
  --model /model \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --served-model-name gemma4-int8
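Once the container is running, the server should be reachable through vLLM's OpenAI-compatible API. The following is a minimal client sketch using only the Python standard library. It assumes the server is on `localhost` at the port mapped above (`8000`) and uses the served model name `gemma4-int8`; the helper names `build_chat_request` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the -p 8000:8000 mapping above
MODEL_NAME = "gemma4-int8"             # matches --served-model-name

def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, max_tokens=256, timeout=120):
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example (requires the container above to be running):
# print(chat("Explain Mixture-of-Experts models."))
```

Any OpenAI-compatible client should work the same way by pointing its base URL at the server.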
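from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "Intics/gemma-4-26B-A4B-IT-Int8"

# Load the processor (tokenizer and chat template) and the quantized model.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",  # spread layers across the available devices
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "Explain Mixture-of-Experts models.",
    }
]

# Render the conversation as a prompt string that ends with the generation prompt.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(
    text=text,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(processor.decode(outputs[0][prompt_length:], skip_special_tokens=True))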
This model follows the same license as the original Gemma 4 release.
Please review: https://ai.google.dev/gemma/docs/gemma_4_license