Zamba2-VL-2.7B
Zamba2-VL is a family of vision-language models built on Zyphra's Zamba2 LLM suite. It supports single- and multi-image understanding and grounding, achieving state-of-the-art performance among multimodal models of similar size while benefiting from the inference efficiency of the Zamba2 architecture. You can find all models in the Zamba2-VL family here.
Zamba2-VL uses the Mistral v0.1 tokenizer and was trained on 100B tokens of vision-text and pure-text data sourced from open web datasets.
Zamba2-VL-2.7B is based on the Zamba2-2.7B LLM and uses the Qwen2.5-VL vision encoder as its vision backbone.
Performance
Zamba2-VL-2.7B performs strongly against models of comparable size and inference FLOPs, and outperforms several larger models on some benchmarks, notably the counting tasks CountBenchQA and PixMoCount. Its small compute and memory footprint make it an ideal generalist model for on-device applications.
| Eval | Zamba2-VL-2.7B | InternVL3.5-2B | Qwen3-VL-2B | PerceptionLM-3B | Molmo2-4B | Qwen3-VL-4B | InternVL3.5-4B |
|---|---|---|---|---|---|---|---|
| AI2D (test) | 85.8 | 88.6 | 86.2 | 92.2 | 93.8 | 91.8 | 92.0 |
| ChartQA (test) | 79.6 | 81.6 | 78.7 | 85.1 | 86.1 | 81.8 | 86.4 |
| DocVQA (test) | 90.9 | 89.4 | 93.3 | 93.8 | 87.8 | 95.3 | 92.4 |
| InfoVQA (test) | 66.5 | 70.8 | 72.4 | 74.6 | 78.6 | 80.3 | 78.0 |
| TextVQA (val) | 77.4 | 76.5 | 79.9 | 80.0 | 83.1 | 81.5 | 77.6 |
| OCRBench | 73.6 | 83.4 | 84.1 | 80.1 | 62.0 | 84.1 | 82.0 |
| VQA v2.0 (val) | 79.6 | 73.6 | 78.8 | 76.9 | 85.3 | 80.7 | 76.4 |
| MathVista (mini) | 51.0 | 61.4 | 51.8 | 61.6 | 56.5 | 63.6 | 72.8 |
| MMMU (val) | 37.7 | 49.9 | 40.9 | 41.4 | 48.8 | 51.4 | 57.2 |
| SEED (image) | 73.0 | 75.2 | 74.8 | 78.3 | 78.0 | 77.3 | 76.3 |
| BLINK (val) | 42.3 | 51.3 | 53.2 | 49.8 | 63.5 | 63.2 | 58.2 |
| RealWorldQA | 61.7 | 61.6 | 66.0 | 73.1 | 73.8 | 71.0 | 67.8 |
| CountBenchQA | 87.5 | 70.0 | 87.9 | 88.1 | 91.2 | 87.3 | 82.5 |
| PixMoCount (test) | 82.5 | 32.8 | 55.7 | 41.6 | 87.0 | 89.2 | 47.3 |
| Point-Bench (avg) | 61.2 | -- | 53.5 | -- | 68.5 | 65.1 | -- |
All numbers were obtained with the Zyphra evaluation harness (based on VLMEvalKit). Other models are ordered by total parameter count. Bold indicates the best score in each row, while underlined values indicate the lowest score.
Quick start
Prerequisites
To use Zamba2-VL, install the zamba2-vl branch of our fork of the transformers library, which is based on transformers v4.57.1:
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zamba2-vl"
pip install qwen-vl-utils==0.0.2
pip install flash_attn
The command above assumes that the requirements for transformers v4.57.1 are already installed in your environment. If you are installing into a fresh Python environment, you may want to specify an extra, such as [dev-torch], to install all the dependencies:
pip install "transformers[dev-torch] @ git+https://github.com/Zyphra/transformers.git@zamba2-vl"
For the fastest setup, make sure your environment matches an existing flash_attn wheel; otherwise, the installation will build from source.
To install the dependencies needed to run the Mamba2 kernels, install mamba-ssm from source (due to compatibility issues with PyTorch), along with causal-conv1d:
pip install --no-build-isolation "causal-conv1d @ git+https://github.com/Zyphra/z-causal-conv1d.git@zamba2-vl"
pip install --no-build-isolation "mamba-ssm @ git+https://github.com/Zyphra/mamba.git@zamba2-vl"
You can run the model without using the optimized Mamba2 kernels, but it is not recommended as it will result in significantly higher latency and memory usage.
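Before loading the model, it can help to confirm which of the optional packages are visible to your Python environment. The following is a minimal sketch, not part of the official setup: it assumes the import names `flash_attn`, `mamba_ssm`, and `causal_conv1d` for the packages installed above, and `pick_attn_implementation` is a hypothetical helper. Falling back to `"sdpa"` assumes that this model accepts that attention implementation, which is not confirmed here.

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above.
OPTIONAL_MODULES = ["flash_attn", "mamba_ssm", "causal_conv1d"]


def check_modules(names):
    """Map each top-level module name to whether it can be found for import."""
    return {name: find_spec(name) is not None for name in names}


def pick_attn_implementation(has_flash_attn):
    """Hypothetical helper: fall back to "sdpa" when flash_attn is missing.

    Whether Zamba2-VL supports "sdpa" is an assumption, not a documented fact.
    """
    return "flash_attention_2" if has_flash_attn else "sdpa"


status = check_modules(OPTIONAL_MODULES)
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
print("attn_implementation:", pick_attn_implementation(status["flash_attn"]))
```

This only checks that the modules can be located, not that their compiled kernels match your PyTorch and CUDA versions. If `mamba_ssm` or `causal_conv1d` is reported missing, expect the slower non-optimized path described above.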
Inference
from transformers import Zamba2_VLForConditionalGeneration, Zamba2_VLProcessor
import torch
from PIL import Image
from qwen_vl_utils import process_vision_info
import requests

device = "cuda"

# Load the processor and the model (bfloat16 weights, FlashAttention-2).
processor = Zamba2_VLProcessor.from_pretrained("Zyphra/Zamba2-VL-2.7B", temporal_patch_size=1)
model = Zamba2_VLForConditionalGeneration.from_pretrained("Zyphra/Zamba2-VL-2.7B", device_map=device, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")

# Download an example image and define the question.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "What do you see in the image? Give us some detail."

# Pixel budget for the image, expressed as a number of 28x28-pixel blocks.
num_img_tokens = 3400
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": image, "max_pixels": num_img_tokens * 28 * 28, "min_pixels": 10 * 28 * 28},
        {"type": "text", "text": question},
    ]},
]

# Render the chat template, extract the images, and build the model inputs.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
images, _ = process_vision_info(conversation)
inputs = processor(text=prompt, images=images, add_special_tokens=True, return_tensors="pt")
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate, then decode only the tokens produced after the prompt.
outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
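The inference example sets `max_pixels` to `num_img_tokens * 28 * 28` and `min_pixels` to `10 * 28 * 28`, which suggests that each image token covers roughly a 28x28-pixel area. The sketch below shows how such a budget could translate into an approximate token count for a given image size. It is an illustration under assumptions, not the processor's actual logic: the helper names are hypothetical, and the resizing rule (snap both sides to multiples of 28, then scale into the pixel budget) is modeled on Qwen-VL-style preprocessing and may differ from what Zamba2-VL does.

```python
import math

PATCH = 28  # assumed side length, in pixels, of the area covered by one image token


def pixel_budget(num_tokens, patch=PATCH):
    """Convert a token budget into a pixel budget, as in the inference example."""
    return num_tokens * patch * patch


def estimate_image_tokens(width, height, max_tokens=3400, min_tokens=10, patch=PATCH):
    """Rough estimate of the image token count after fitting the pixel budget."""
    max_pixels = pixel_budget(max_tokens, patch)
    min_pixels = pixel_budget(min_tokens, patch)
    # Snap both sides to the nearest multiple of the patch size (at least one patch).
    h = max(patch, round(height / patch) * patch)
    w = max(patch, round(width / patch) * patch)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(patch, math.floor(height / scale / patch) * patch)
        w = max(patch, math.floor(width / scale / patch) * patch)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / patch) * patch
        w = math.ceil(width * scale / patch) * patch
    return (h // patch) * (w // patch)


print(pixel_budget(3400))               # pixel budget used as max_pixels above
print(estimate_image_tokens(640, 480))  # a 640x480 image, well inside the budget
print(estimate_image_tokens(4000, 3000))  # a large image that must be scaled down
```

Lowering `num_img_tokens` reduces the number of image tokens for large images, and with it latency and memory use, at the cost of resolution. Images that already fit the budget are unaffected under this approximation.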