Zamba2-VL-2.7B

Zamba2-VL is a family of vision-language models built on Zyphra's Zamba2 LLM suite. It supports single- and multi-image understanding and grounding, achieving state-of-the-art performance among multimodal models of similar size while benefiting from the inference efficiency of the Zamba2 architecture. You can find all models in the Zamba2-VL family here.

Zamba2-VL uses the Mistral v0.1 tokenizer and was trained on 100B tokens of vision-text and text-only data sourced from open web datasets.

Zamba2-VL-2.7B is based on the Zamba2-2.7B LLM and uses the Qwen2.5-VL vision encoder as its vision backbone.

Performance

Zamba2-VL-2.7B performs strongly against models of comparable size and inference FLOPs, outperforming several larger models as well. Its small compute and memory footprint make it an ideal generalist model for on-device applications.

Eval	Zamba2-VL-2.7B	InternVL3.5-2B	Qwen3-VL-2B	PerceptionLM-3B	Molmo2-4B	Qwen3-VL-4B	InternVL3.5-4B
AI2D (test)	85.8	88.6	86.2	92.2	93.8	91.8	92.0
ChartQA (test)	79.6	81.6	78.7	85.1	86.1	81.8	86.4
DocVQA (test)	90.9	89.4	93.3	93.8	87.8	95.3	92.4
InfoVQA (test)	66.5	70.8	72.4	74.6	78.6	80.3	78.0
TextVQA (val)	77.4	76.5	79.9	80.0	83.1	81.5	77.6
OCRBench	73.6	83.4	84.1	80.1	62.0	84.1	82.0
VQA v2.0 (val)	79.6	73.6	78.8	76.9	85.3	80.7	76.4
MathVista (mini)	51.0	61.4	51.8	61.6	56.5	63.6	72.8
MMMU (val)	37.7	49.9	40.9	41.4	48.8	51.4	57.2
SEED (image)	73.0	75.2	74.8	78.3	78.0	77.3	76.3
BLINK (val)	42.3	51.3	53.2	49.8	63.5	63.2	58.2
RealWorldQA	61.7	61.6	66.0	73.1	73.8	71.0	67.8
CountBenchQA	87.5	70.0	87.9	88.1	91.2	87.3	82.5
PixMoCount (test)	82.5	32.8	55.7	41.6	87.0	89.2	47.3
Point-Bench (avg)	61.2	--	53.5	--	68.5	65.1	--

All numbers were obtained with the Zyphra evaluation harness (based on VLMEvalKit). Other models are ordered by total parameter count. Bold indicates the best score in each row, while underlined values indicate the lowest score.

Quick start

Prerequisites

To use Zamba2-VL, install the zamba2-vl branch of our fork of the transformers library, which is based on transformers v4.57.1:

pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zamba2-vl"
pip install qwen-vl-utils==0.0.2
pip install flash_attn

The command above assumes that the dependencies of transformers v4.57.1 are already installed in your environment. If you are installing into a fresh Python environment, you may want to specify an extra, such as [dev-torch], to install all the dependencies:

pip install "transformers[dev-torch] @ git+https://github.com/Zyphra/transformers.git@zamba2-vl"

For the fastest setup, ensure your environment matches an existing flash_attn wheel; otherwise, the installation will build from source.

To run the optimized Mamba2 kernels, install mamba-ssm from source (because of compatibility issues with PyTorch), along with causal-conv1d:

pip install --no-build-isolation "causal-conv1d @ git+https://github.com/Zyphra/z-causal-conv1d.git@zamba2-vl"
pip install --no-build-isolation "mamba-ssm @ git+https://github.com/Zyphra/mamba.git@zamba2-vl"

You can run the model without using the optimized Mamba2 kernels, but it is not recommended as it will result in significantly higher latency and memory usage.
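
Before loading the model, it can help to check whether the optimized kernel packages are importable. The sketch below is a minimal illustration and not part of the Zamba2-VL tooling: the helper name `missing_kernel_modules` is hypothetical, and `mamba_ssm` and `causal_conv1d` are assumed to be the import names of the two packages installed above.

```python
from importlib.util import find_spec

# Assumed import names for the mamba-ssm and causal-conv1d packages.
KERNEL_MODULES = ("mamba_ssm", "causal_conv1d")


def missing_kernel_modules(modules=KERNEL_MODULES):
    """Return the names in `modules` that cannot be located in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_kernel_modules()
    if missing:
        # Per the note above, the model may still run, but more slowly.
        print("Missing optimized kernel packages:", ", ".join(missing))
    else:
        print("Optimized Mamba2 kernel packages found.")
```

A check like this only confirms that the packages are installed; it does not verify that the kernels were built against the PyTorch and CUDA versions in use.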

Inference

from transformers import Zamba2_VLForConditionalGeneration, Zamba2_VLProcessor
import torch
from PIL import Image
from qwen_vl_utils import process_vision_info
import requests

# Load the processor and the model in bfloat16 on the GPU.
device = "cuda"
processor = Zamba2_VLProcessor.from_pretrained("Zyphra/Zamba2-VL-2.7B", temporal_patch_size=1)
model = Zamba2_VLForConditionalGeneration.from_pretrained(
    "Zyphra/Zamba2-VL-2.7B",
    device_map=device,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Download an example image and define the question to ask about it.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "What do you see in the image? Give us some detail."
num_img_tokens = 3400

# Build a single-turn conversation; max_pixels and min_pixels bound the image resolution.
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": image, "max_pixels": num_img_tokens * 28 * 28, "min_pixels": 10 * 28 * 28},
        {"type": "text", "text": question},
      ]
    },
]

# Render the chat prompt, extract the images, and move the tensors to the device.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
images, _ = process_vision_info(conversation)
inputs = processor(text=prompt, images=images, add_special_tokens=True, return_tensors="pt")
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate, then decode only the newly generated tokens (everything after the prompt).
outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
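
The `max_pixels` and `min_pixels` values in the example are written as a token count multiplied by 28 * 28. The sketch below spells out that arithmetic, assuming (as the `num_img_tokens` variable name suggests) that one image token corresponds to roughly one 28x28 pixel block. The helper name `pixel_budget` is hypothetical and not part of the library.

```python
# Side length of the pixel block per image token, taken from the
# 28 * 28 factor in the inference example (an assumption, not a library constant).
PIXELS_PER_TOKEN_SIDE = 28


def pixel_budget(num_tokens: int, side: int = PIXELS_PER_TOKEN_SIDE) -> int:
    """Convert an image-token budget into a pixel budget."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_tokens * side * side


max_pixels = pixel_budget(3400)  # upper bound used in the example
min_pixels = pixel_budget(10)    # lower bound used in the example
print(max_pixels, min_pixels)    # prints: 2665600 7840
```

Under this assumption, choosing a smaller `num_img_tokens` caps images at a lower resolution, which should shorten the visual input sequence at some cost in fine detail.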
