Instructions for using vectionlabs/Salience-1.5-Flash with libraries, inference servers, and local apps.

Libraries

How to use vectionlabs/Salience-1.5-Flash with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="vectionlabs/Salience-1.5-Flash")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("vectionlabs/Salience-1.5-Flash")
model = AutoModelForMultimodalLM.from_pretrained("vectionlabs/Salience-1.5-Flash")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use vectionlabs/Salience-1.5-Flash with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "vectionlabs/Salience-1.5-Flash"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "vectionlabs/Salience-1.5-Flash",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
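
The same call can be made from Python. The following is a minimal sketch using only the standard library; the helper name `build_chat_request` is hypothetical, and it assumes the server started above is listening on `localhost:8000`. It builds the request without sending it, so the payload can be inspected first.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "vectionlabs/Salience-1.5-Flash",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# To send it against a running server, uncomment:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

For the SGLang server, only the base URL should need to change (port 30000 instead of 8000).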

Use Docker

docker model run hf.co/vectionlabs/Salience-1.5-Flash

SGLang

How to use vectionlabs/Salience-1.5-Flash with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "vectionlabs/Salience-1.5-Flash" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "vectionlabs/Salience-1.5-Flash",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "vectionlabs/Salience-1.5-Flash" \
        --host 0.0.0.0 \
        --port 30000


Salience 1.5 — Flash

A 30B-A3B Mixture-of-Experts multimodal agent — only 3.3B active params per token: the decode speed of a small model with the reach of a large one.

Vection Labs


Abstract

Salience 1.5 Flash is a sparse Mixture-of-Experts vision-language model: 30B total parameters, but only 3.3B active per token. It decodes at the speed of a ~3B dense model while reasoning with the capacity of a 30B one — built for hard, practical work: writing and debugging real code, driving tools and agents, designing production-grade interfaces, and visual understanding over images and video, inside a single model with a context window of up to 1M tokens.

It is the fast, multimodal tier of the Salience family — engineered for people who care less about chat pleasantries and more about whether the model can do the thing: ship the function, find the bug, call the right tool, design the screen, read the diagram.

Highlights

Bigger and faster at once. Sparse activation means ~3B of compute for 30B of knowledge — roughly 2× the decode speed of a dense 8B at far greater capacity.
Code & agentic first. Tuned to produce runnable code, repo-scale edits, and well-formed native tool calls.
Designs, not just describes. Defaults to modern stacks (React/Next, TypeScript, Tailwind, shadcn/ui), real design tokens, accessible (WCAG) semantics, and tasteful motion.
Reasoning that shows its work. Structured, inspectable chains of thought — with a one-token switch to turn them off when you want instant answers.
Genuinely multimodal. Images and video are first-class inputs, not bolted-on captioning.
Long context. 256K tokens natively with interleaved multimodal RoPE, extensible toward 1M with YaRN scaling: whole repos, long papers, or long videos in a single prompt.
Fast on modest hardware. Runs on 2× T4 with no GGUF (~17 GB in 4-bit NF4).
Open weights. Apache-2.0, transformers-native, single-file deployment.

Model overview


Parameters	30B total / 3.3B active (Mixture-of-Experts)
Modalities	text, image, video → text
Context window	256K tokens native (interleaved multimodal RoPE); up to ~1M with YaRN scaling
Precision	bfloat16 master weights
Architecture	Qwen3-VL MoE (30B-A3B) + native vision encoder
License	Apache-2.0
Library	🤗 `transformers` (`AutoModelForImageTextToText`)

Architecture & capabilities

Salience 1.5 Flash is a Qwen3-VL Mixture-of-Experts model: a 30B-parameter expert network that routes only 3.3B parameters per token, coupled to a native vision encoder. Interleaved multimodal RoPE gives a 256K-token native context window, which YaRN scaling can extend toward 1M tokens.

Its capability profile is built around four pillars:

Code & agentic execution — runnable code, repo-scale edits, and well-formed tool calls.
Frontier UI/UX design — modern stacks, design tokens, accessible semantics, tasteful motion.
Deep reasoning — structured, inspectable chains of thought for math and logic.
Multimodal perception — images and video as first-class inputs, not bolted-on captioning.

The vision pathway and long-context behavior are preserved end to end, so the same reasoning that solves a hard problem also reads a chart, a UI screenshot, or a short clip.

Agent persona & thinking control

With no system prompt, the model adopts an elite software-engineering + design agent persona automatically — pass your own system message to override it completely. Thinking is on by default; append /no_think (or pass enable_thinking=False) for instant direct answers.
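
As an illustration of the /no_think switch, here is a minimal sketch of a helper that appends it to the last user turn. The name `set_thinking` is hypothetical, and the exact placement (a space followed by /no_think at the end of the last text part) is an assumption; the card only says to append it. The alternative is to pass `enable_thinking=False` when applying the chat template.

```python
import copy


def set_thinking(messages, enabled):
    """Return a copy of `messages`; when thinking is disabled, append
    /no_think to the last user turn. The input list is left unchanged."""
    updated = copy.deepcopy(messages)
    if enabled:
        return updated
    for message in reversed(updated):
        if message["role"] != "user":
            continue
        content = message["content"]
        if isinstance(content, str):
            message["content"] = content + " /no_think"
        else:
            # Multimodal content: append the switch to the last text part.
            # If there is no text part, nothing is appended.
            for part in reversed(content):
                if part.get("type") == "text":
                    part["text"] += " /no_think"
                    break
        break
    return updated


messages = [{"role": "user", "content": [{"type": "text", "text": "Sum 2 and 3."}]}]
fast_messages = set_thinking(messages, enabled=False)
```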

Intended use

Salience 1.5 Flash targets technical assistance, coding & design agents, and research:

Code generation, explanation, debugging, review, and repo-scale tasks.
Frontend / UI generation and design-system work.
Agentic / tool-using workflows that emit structured calls.
Step-by-step math and quantitative reasoning.
Visual question answering and document/diagram/chart/UI understanding.
Video understanding over short clips, and long-document / long-context analysis.
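
For the tool-using workflows above, it usually pays to validate a structured call before executing it. The sketch below assumes the model was asked to reply with a bare JSON object of the form `{"name": ..., "arguments": {...}}`; that shape, the `parse_tool_call` helper, and the schema format (tool name mapped to a set of required argument names) are illustrative assumptions, not the model's documented tool-call format.

```python
import json


def parse_tool_call(raw, schema):
    """Parse a reply as a strict-JSON tool call and check required arguments.

    `schema` maps a tool name to the set of argument names it requires
    (a hypothetical format used only for this sketch).
    """
    call = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    arguments = call.get("arguments")
    if name not in schema:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object")
    missing = schema[name] - set(arguments)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return name, arguments


tools = {"get_weather": {"city"}}
name, arguments = parse_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}', tools
)
```

Rejecting malformed or unknown calls up front keeps a bad generation from reaching the tool itself.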

It is not intended for high-stakes decisions without human review, nor as a source of truth for medical, legal, or financial advice.

Benchmarks

All results use a single reproducible evaluation harness with greedy/CoT settings.

Reasoning, math & code

Benchmark	Setting	Salience-1.5-Flash
GSM8K	0-shot CoT, exact match	—
MATH-500	0-shot CoT, exact match	—
HumanEval	0-shot, pass@1	—
MBPP	3-shot, pass@1	—
MMLU	0-shot	—

Multimodal

Benchmark	Setting	Salience-1.5-Flash
MMMU (val)	0-shot	—
MathVista (testmini)	0-shot	—
DocVQA (val)	0-shot, ANLS	—

The evaluation protocol, prompts, and answer-extraction logic are fixed and reproducible end-to-end.
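
The harness itself is not included in this card. As a rough illustration of the metrics named in the tables, the sketch below shows one common way to score exact match on numeric answers and to estimate pass@k. The function names and the "last number in the text" extraction rule are assumptions for illustration only, not the card's actual answer-extraction logic.

```python
import re
from math import comb


def extract_final_number(text):
    """Return the last number found in the text as a string, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def exact_match(prediction, reference):
    """1 if the final numbers of prediction and reference agree, else 0."""
    predicted = extract_final_number(prediction)
    expected = extract_final_number(reference)
    if predicted is None or expected is None:
        return 0
    return int(float(predicted) == float(expected))


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


score = exact_match("So the total is 1,234.", "1234")
```

With greedy decoding there is one sample per problem, so pass@1 for a single problem is simply 1 or 0, and the benchmark score is the mean over problems.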

Quickstart

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_id = "vectionlabs/Salience-1.5-Flash"
proc = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram proves, step by step."},
    ],
}]

text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
imgs, vids = process_vision_info(messages)
inputs = proc(text=[text], images=imgs, videos=vids, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
print(proc.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

Text-only works the same way with a plain {"type": "text", ...} message. Append /no_think to any prompt for the fastest, direct answer.

Fast inference (2× T4, no GGUF)

T4 (Turing) has no bf16 and no FlashAttention2 — use fp16 + SDPA, or 4-bit NF4 to fit ~17 GB:

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

repo = "vectionlabs/Salience-1.5-Flash"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForImageTextToText.from_pretrained(
    repo, quantization_config=bnb, device_map="auto", attn_implementation="sdpa")
proc = AutoProcessor.from_pretrained(repo)

Because only 3.3B parameters are active per token, decode stays fast even at 30B scale.
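
The choice between the bf16 and fp16 paths can be made from the GPU's compute capability. This is a minimal sketch with a hypothetical helper, `pick_runtime`; it assumes the usual cut-off of compute capability 8.0 (Ampere) for bf16 and FlashAttention-2 support, with the T4 at 7.5, and it assumes the `flash-attn` package is installed whenever `flash_attention_2` is selected.

```python
def pick_runtime(compute_capability):
    """Map a CUDA compute capability (major, minor) to a dtype name and
    an attention backend for from_pretrained."""
    major, _minor = compute_capability
    if major >= 8:
        # Ampere and newer: bf16 is available and FlashAttention-2 can be used.
        return {"dtype": "bfloat16", "attn_implementation": "flash_attention_2"}
    # Turing (the T4 is 7.5) and older: fall back to fp16 with PyTorch SDPA.
    return {"dtype": "float16", "attn_implementation": "sdpa"}


t4_settings = pick_runtime((7, 5))
ampere_settings = pick_runtime((8, 0))
```

On a live machine the tuple can come from `torch.cuda.get_device_capability(0)`, and the dtype name can be resolved with `getattr(torch, settings["dtype"])`.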

Speed & efficiency

Sparse MoE. With ~3.3B active params per token, the weights read for each decoded token are roughly half those of a dense model of equal quality, which is the core "bigger AND faster" lever.
Adaptive thinking. Append /no_think for instant direct answers, or keep thinking on for deep step-by-step reasoning on hard math and multi-step agentic planning — you spend latency only when the task is worth it.
Speculative decoding with a small same-family draft (Qwen/Qwen3-0.6B) gives a lossless 1.5–2.5× speedup on code and structured text (assistant_model= in transformers; in vLLM, --speculative-config on recent releases or --speculative-model on older ones).
Production serving. On Ampere+ GPUs use vLLM (fp16 / AWQ) for high-throughput deployment.
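
The sparse-MoE claim can be checked with back-of-envelope arithmetic. The sketch below assumes single-stream decoding is bound by reading weights from memory, so per-token cost scales with active parameters; it ignores the KV cache, the router, and the vision encoder, and at larger batch sizes more experts are touched per step, so treat the ratios as rough upper bounds.

```python
def weight_bytes_per_token(active_params, bytes_per_param):
    """Approximate weight bytes read to decode one token at batch size 1."""
    return active_params * bytes_per_param


FP16_BYTES = 2

salience = weight_bytes_per_token(3.3e9, FP16_BYTES)   # 3.3B active params
dense_8b = weight_bytes_per_token(8e9, FP16_BYTES)     # dense 8B reference
dense_30b = weight_bytes_per_token(30e9, FP16_BYTES)   # dense 30B reference

ratio_vs_dense_8b = dense_8b / salience     # about 2.4
ratio_vs_dense_30b = dense_30b / salience   # about 9.1
```

The ratio of about 2.4 against a dense 8B model is in line with the "roughly 2× the decode speed of a dense 8B" figure in the highlights.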

Prompting tips

Code: specify language, constraints ("no external libraries"), and the exact I/O contract.
Design: name the stack and the look you want; it will return tokens, semantics, and structure.
Agentic / tools: give the tool schema and ask for the call as strict JSON.
Math/logic: ask it to reason step by step; it is tuned to externalize its work.
Vision: put the image/video before the question in the message content.
Sampling (Qwen3 family): thinking → temperature=0.6, top_p=0.95, top_k=20; direct answers → temperature=0.7, top_p=0.8, top_k=20.
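
The sampling settings above can be kept as presets and passed straight to `generate`. A minimal sketch; the names `SAMPLING_PRESETS` and `generation_kwargs` are hypothetical, and the `max_new_tokens` default simply mirrors the Quickstart.

```python
SAMPLING_PRESETS = {
    # Values taken from the Qwen3-family recommendation quoted above.
    "thinking": {"do_sample": True, "temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "direct": {"do_sample": True, "temperature": 0.7, "top_p": 0.8, "top_k": 20},
}


def generation_kwargs(thinking, max_new_tokens=1024):
    """Return keyword arguments for model.generate for the chosen mode."""
    preset = SAMPLING_PRESETS["thinking" if thinking else "direct"]
    return {**preset, "max_new_tokens": max_new_tokens}


thinking_kwargs = generation_kwargs(thinking=True)
direct_kwargs = generation_kwargs(thinking=False, max_new_tokens=256)
```

Usage would look like `model.generate(**inputs, **generation_kwargs(thinking=True))`.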

Long context (large codebases)

256K out of the box. To push toward ~1M, enable YaRN in config.json:

"rope_scaling": { "type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144 }

(Small short-context quality cost — enable only when you actually need >256K.)
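
The target length follows from the scaling factor: 4.0 × 262,144 = 1,048,576 tokens. Below is a minimal sketch of patching a config dictionary with that entry. The helper `enable_yarn` is hypothetical, and it writes `rope_scaling` at the top level as in the snippet above; whether the released config.json expects the key there or inside a nested text config is not stated here, so check the file before editing it.

```python
import json


def enable_yarn(config, factor=4.0, native_context=262144):
    """Return a copy of `config` with YaRN scaling enabled, plus the
    resulting maximum context length in tokens."""
    updated = dict(config)
    updated["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_context,
    }
    return updated, int(factor * native_context)


# Stand-in for the parsed contents of config.json.
config = {"model_type": "example"}
patched, extended_context = enable_yarn(config)
patched_json = json.dumps(patched, indent=2)
```

To apply it for real, load config.json with `json.load`, patch it, and write it back with `json.dump`.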

Deployment

Single / dual GPU: loads in bf16/fp16 with device_map="auto"; 4-bit NF4 fits ~17 GB across 2× T4.
Serving: integrates with standard transformers generation and vision-capable serving stacks such as vLLM (with optional speculative decoding) for high-throughput production use.
Quantized formats: GGUF and other community quantizations are supported.
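
The memory figures above can be sanity-checked from the parameter count alone. This is a rough estimate, assuming 30B total parameters and counting weights only (no activations or KV cache); the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Weight storage in gigabytes (10**9 bytes) at a given precision."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 30e9

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 60 GB of weights
nf4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # 15 GB of weights

# Two T4 cards provide 2 x 16 GB = 32 GB in total.
fits_on_two_t4 = nf4_gb < 2 * 16
```

The gap between the 15 GB estimate and the ~17 GB quoted above is presumably quantization constants and layers kept in higher precision. The full 16-bit weights, at about 60 GB, need a larger GPU or several cards.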

Limitations & responsible use

Salience 1.5 Flash can be confidently wrong. Verify mathematical and factual claims.
Generated code may be insecure or incorrect. Review it before running, and never execute untrusted output.
Long-context and long-video inputs increase latency and memory substantially.
It inherits the licenses, biases, and failure modes of its base model. Do not use it for surveillance, manipulation, or any use that violates applicable law or the Apache-2.0 terms.
No audio modality.

Citation

@misc{vectionlabs2026salience15flash,
  title  = {Salience 1.5 Flash: A Sparse-MoE Multimodal Agent},
  author = {Vection Labs},
  year   = {2026},
  url    = {https://huggingface.co/vectionlabs/Salience-1.5-Flash}
}
