Instructions to use google/gemma-4-26B-A4B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-4-26B-A4B-it with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-4-26B-A4B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("google/gemma-4-26B-A4B-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-26B-A4B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use google/gemma-4-26B-A4B-it with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "google/gemma-4-26B-A4B-it"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
Use Docker
docker model run hf.co/google/gemma-4-26B-A4B-it
- SGLang
How to use google/gemma-4-26B-A4B-it with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "google/gemma-4-26B-A4B-it" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "google/gemma-4-26B-A4B-it" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
- Docker Model Runner
How to use google/gemma-4-26B-A4B-it with Docker Model Runner:
docker model run hf.co/google/gemma-4-26B-A4B-it
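The curl calls to the vLLM and SGLang servers above can also be issued from Python. What follows is a minimal sketch using only the standard library, assuming a server is already running locally and answers in the usual OpenAI-style chat-completions shape. The helper names `build_chat_request` and `describe_image` are hypothetical, and the default base URL assumes vLLM's port 8000 from the snippet above (use port 30000 for the SGLang commands).

```python
import json
import urllib.request


def build_chat_request(model, prompt, image_url, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions.

    Hypothetical helper: it mirrors the JSON body of the curl examples.
    """
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe_image(model, prompt, image_url, base_url="http://localhost:8000"):
    """Send the request and return the first reply's text.

    Assumes the response follows the OpenAI chat-completions format,
    i.e. the text lives at choices[0].message.content.
    """
    request = build_chat_request(model, prompt, image_url, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, calling `describe_image("google/gemma-4-26B-A4B-it", "Describe this image in one sentence.", "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")` should return the model's one-sentence description. Pass `base_url="http://localhost:30000"` to target the SGLang server instead.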
Release the 124B parent weights... We know you have it.
Google’s latest Gemma 4 release came with an odd twist. Jeff Dean, Google’s top AI leader, casually referenced a “124 billion parameter, Mixture of Experts” model, and then that model seemed to disappear almost immediately. Google’s PR side quickly moved to frame the reference as a typo or mistake, saying the largest model actually released was the 26B version.

That explanation does not sit right. For one, it is hard to believe someone would accidentally type “124” when they meant “26.” On top of that, the released model’s own files, including the README.md and model.safetensors.index.json for gemma-4-26B-A4B-it, show an architecture with about 26.5 billion parameters spread across 128 experts.

That is where things get more interesting. Dividing only about 26 billion parameters across 128 experts leaves each expert with roughly 200 million parameters at most, which looks unusually small and inefficient. That makes little sense unless this model is a heavily trimmed or distilled version of a much larger parent system. Scale 128 experts up to more typical functional sizes and you land at roughly 124B parameters.

Add to that the fact that Hugging Face representatives are already steering developers who want larger models toward the paid Gemini API, and it starts to feel like there may be more to this story than a simple typo. So the real question is: where did the 124B model go?
Jeff Dean's X post: https://archive.li/5vxUY
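The per-expert arithmetic behind the post can be made explicit. The sketch below uses only the three figures quoted above (about 26.5B parameters, 128 experts, and the 124B figure) and deliberately does the same naive even split as the post. It assumes every parameter belongs to some expert, which overstates expert size because attention, embedding, and router weights are shared. The function name `naive_params_per_expert` is hypothetical.

```python
# Figures quoted in the post; nothing here is read from the model files.
RELEASED_PARAMS = 26.5e9  # ~26.5B total parameters in the released model
RUMORED_PARAMS = 124e9    # the "124 billion parameter" figure from the X post
NUM_EXPERTS = 128         # expert count cited from the released config


def naive_params_per_expert(total_params: float, num_experts: int) -> float:
    """Evenly split the total across experts.

    Assumption: ignores shared (non-expert) weights, so the result is an
    upper bound on the average expert size, not a measured value.
    """
    return total_params / num_experts


released_share = naive_params_per_expert(RELEASED_PARAMS, NUM_EXPERTS)
rumored_share = naive_params_per_expert(RUMORED_PARAMS, NUM_EXPERTS)
scale_factor = RUMORED_PARAMS / RELEASED_PARAMS

print(f"released model: {released_share / 1e6:.0f}M parameters per expert")
print(f"124B model:     {rumored_share / 1e6:.0f}M parameters per expert")
print(f"scale factor:   {scale_factor:.2f}x")
```

Under this even split the released model comes to about 207M parameters per expert, against about 969M for a 128-expert model totalling 124B, a factor of roughly 4.7. The calculation only restates the post's numbers. It says nothing on its own about whether a larger parent model exists.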