Salience 1.5 — Flash


A 30B-A3B Mixture-of-Experts multimodal agent — only 3.3B active params per token: the decode speed of a small model with the reach of a large one.

Vection Labs



Abstract

Salience 1.5 Flash is a sparse Mixture-of-Experts vision-language model with 30B total parameters, of which only 3.3B are active per token. Per-token decode compute is therefore close to that of a ~3B dense model, while the full 30B parameters supply the capacity. It is built for hard, practical work: writing and debugging real code, driving tools and agents, designing production-grade interfaces, and understanding images and video, all inside a single model with a context window of up to 1M tokens (256K native).

It is the fast, multimodal tier of the Salience family — engineered for people who care less about chat pleasantries and more about whether the model can do the thing: ship the function, find the bug, call the right tool, design the screen, read the diagram.

Highlights

  • Bigger and faster at once. Sparse activation means ~3B of compute for 30B of knowledge — roughly 2× the decode speed of a dense 8B at far greater capacity.
  • Code & agentic first. Tuned to produce runnable code, repo-scale edits, and well-formed native tool calls.
  • Designs, not just describes. Defaults to modern stacks (React/Next, TypeScript, Tailwind, shadcn/ui), real design tokens, accessible (WCAG) semantics, and tasteful motion.
  • Reasoning that shows its work. Structured, inspectable chains of thought — with a one-token switch to turn them off when you want instant answers.
  • Genuinely multimodal. Images and video are first-class inputs, not bolted-on captioning.
  • Long context. 256K tokens natively with interleaved multimodal RoPE, extendable toward ~1M tokens with YaRN scaling: whole repos, long papers, or long videos in a single prompt.
  • Fast on modest hardware. Runs on 2× T4 with no GGUF (~17 GB in 4-bit NF4).
  • Open weights. Apache-2.0, transformers-native, single-file deployment.

Model overview

Field | Value
Parameters | 30B total / 3.3B active (Mixture-of-Experts)
Modalities | text, image, video → text
Context window | 256K tokens native (interleaved multimodal RoPE); up to ~1,000,000 tokens with YaRN scaling
Precision | bfloat16 master weights
Architecture | Qwen3-VL MoE (30B-A3B) + native vision encoder
License | Apache-2.0
Library | 🤗 transformers (AutoModelForImageTextToText)

Architecture & capabilities

Salience 1.5 Flash is a Qwen3-VL Mixture-of-Experts model: a 30B-parameter expert network that activates only 3.3B parameters per token, coupled to a native vision encoder. Interleaved multimodal RoPE supports a 256K-token native context window, which YaRN scaling can extend toward 1M tokens.

Its capability profile is built around four pillars:

  • Code & agentic execution — runnable code, repo-scale edits, and well-formed tool calls.
  • Frontier UI/UX design — modern stacks, design tokens, accessible semantics, tasteful motion.
  • Deep reasoning — structured, inspectable chains of thought for math and logic.
  • Multimodal perception — images and video as first-class inputs, not bolted-on captioning.

The vision pathway and long-context behavior are preserved end to end, so the same reasoning that solves a hard problem also reads a chart, a UI screenshot, or a short clip.

Agent persona & thinking control

With no system prompt, the model automatically adopts an elite software-engineering and design agent persona; pass your own system message to override it completely. Thinking is on by default. For instant direct answers, append /no_think to the prompt or pass enable_thinking=False when applying the chat template.
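As a minimal sketch of the two controls described above, the helper below builds a message list that optionally overrides the persona and optionally appends /no_think. The function name `build_messages` and its parameters are hypothetical; only the message layout (taken from the Quickstart) and the /no_think suffix come from this card.

```python
from typing import Optional


def build_messages(user_text: str, system: Optional[str] = None, think: bool = True) -> list:
    """Build a chat-message list in the layout the Quickstart passes to the processor.

    Hypothetical helper for illustration; not part of the model's API.
    """
    messages = []
    if system is not None:
        # A caller-supplied system message replaces the built-in agent persona.
        messages.append({"role": "system", "content": [{"type": "text", "text": system}]})
    # Thinking is on by default; the /no_think suffix requests a direct answer.
    prompt = user_text if think else f"{user_text} /no_think"
    messages.append({"role": "user", "content": [{"type": "text", "text": prompt}]})
    return messages


if __name__ == "__main__":
    for message in build_messages("Find the bug in this loop.", system="You are a terse reviewer.", think=False):
        print(message)
```

The returned list can be passed to `proc.apply_chat_template(...)` exactly as in the Quickstart.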

Intended use

Salience 1.5 Flash targets technical assistance, coding & design agents, and research:

  • Code generation, explanation, debugging, review, and repo-scale tasks.
  • Frontend / UI generation and design-system work.
  • Agentic / tool-using workflows that emit structured calls.
  • Step-by-step math and quantitative reasoning.
  • Visual question answering and document/diagram/chart/UI understanding.
  • Video understanding over short clips, and long-document / long-context analysis.
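For the tool-using workflows in the list above, it usually pays to validate a structured call before acting on it. The sketch below parses a reply as strict JSON and checks it against a small schema. The `read_file` tool, the schema layout, and the `name`/`arguments` keys are all assumptions made for illustration; the card does not specify the model's native tool-call format.

```python
import json

# Hypothetical tool schema, for illustration only.
TOOL_SCHEMA = {
    "name": "read_file",
    "parameters": {"path": str, "max_bytes": int},
    "required": ["path"],
}


def parse_tool_call(raw: str, schema: dict) -> dict:
    """Parse a model reply as strict JSON and check it against a simple schema."""
    call = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON output
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    if call.get("name") != schema["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be a JSON object")
    for key in schema["required"]:
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, value in args.items():
        expected = schema["parameters"].get(key)
        if expected is None:
            raise ValueError(f"unknown argument: {key}")
        if not isinstance(value, expected):
            raise ValueError(f"argument {key!r} should be {expected.__name__}")
    return call


if __name__ == "__main__":
    reply = '{"name": "read_file", "arguments": {"path": "src/app.py", "max_bytes": 4096}}'
    print(parse_tool_call(reply, TOOL_SCHEMA))
```

Because `bool` is a subclass of `int` in Python, a stricter validator would also reject `true` where an integer is expected.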

It is not intended for high-stakes decisions without human review, nor as a source of truth for medical, legal, or financial advice.

Benchmarks

All results use a single reproducible evaluation harness with greedy/CoT settings.

Reasoning, math & code

Benchmark | Setting | Salience-1.5-Flash
GSM8K | 0-shot CoT, exact match |
MATH-500 | 0-shot CoT, exact match |
HumanEval | 0-shot, pass@1 |
MBPP | 3-shot, pass@1 |
MMLU | 0-shot |

Multimodal

Benchmark | Setting | Salience-1.5-Flash
MMMU (val) | 0-shot |
MathVista (testmini) | 0-shot |
DocVQA (val) | 0-shot, ANLS |

The evaluation protocol, prompts, and answer-extraction logic are fixed and reproducible end-to-end.
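The harness itself is not included in this card. As a rough sketch of what the ANLS setting for DocVQA involves, the code below implements the standard ANLS definition (Average Normalized Levenshtein Similarity, threshold 0.5). The lowercasing and whitespace stripping are assumptions about normalization, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold: float = 0.5) -> float:
    """Mean over questions of the best per-reference similarity.

    Similarity is 1 - NL when the normalized distance NL is below the
    threshold, and 0 otherwise. `references` holds a list of accepted
    answers for each question.
    """
    scores = []
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # normalization is an assumption
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(anls(["Paris", "pariz", "london"], [["paris"], ["paris"], ["paris"]]))
```

An exact match scores 1, a one-character slip in a five-character answer scores 0.8, and an unrelated answer scores 0.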

Quickstart

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils
import torch

model_id = "vectionlabs/Salience-1.5-Flash"
proc = AutoProcessor.from_pretrained(model_id)
# bf16 needs Ampere or newer; older transformers releases spell this argument torch_dtype.
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# The image comes before the question in the message content.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram proves, step by step."},
    ],
}]

# Render the chat template to a prompt string, then collect the vision inputs.
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
imgs, vids = process_vision_info(messages)
inputs = proc(text=[text], images=imgs, videos=vids, return_tensors="pt").to(model.device)

# Generate, then decode only the new tokens (everything after the prompt).
out = model.generate(**inputs, max_new_tokens=1024)
print(proc.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

Text-only works the same way with a plain {"type": "text", ...} message. Append /no_think to any prompt for the fastest, direct answer.
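When thinking is left on, the decoded text may contain the reasoning followed by the final answer. The sketch below separates the two. It assumes the reasoning is delimited by `<think>` and `</think>` tags, as in the Qwen3 family; the card does not state the exact delimiter, so treat both the tag names and the helper `split_thinking` as assumptions.

```python
def split_thinking(decoded: str):
    """Return (reasoning, answer) from a decoded completion.

    Assumes Qwen3-style <think>...</think> delimiters. Some chat templates
    place the opening tag in the prompt, so only the closing tag is required.
    """
    head, sep, tail = decoded.rpartition("</think>")
    if not sep:
        # No closing tag found: treat the whole completion as the answer.
        return "", decoded.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    reasoning, answer = split_thinking("<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4.")
    print("reasoning:", reasoning)
    print("answer:", answer)
```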

Fast inference (2× T4, no GGUF)

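T4 (Turing) has no bf16 and no FlashAttention2, so use fp16 compute with SDPA attention. The fp16 weights alone (~60 GB for 30B parameters) exceed the 32 GB available on 2× T4, so load them in 4-bit NF4, which fits in ~17 GB: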

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

repo = "vectionlabs/Salience-1.5-Flash"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForImageTextToText.from_pretrained(
    repo, quantization_config=bnb, device_map="auto", attn_implementation="sdpa")
proc = AutoProcessor.from_pretrained(repo)

Because only 3.3B parameters are active per token, decode stays fast even at 30B scale.
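The arithmetic behind these figures can be sketched as follows. It is a weights-only estimate using the nominal 30B and 3.3B counts from this card; the 0.127 bits per parameter of double-quantization overhead is the figure reported for NF4 with 64-element blocks. The gap between the result and the ~17 GB quoted above is presumably taken up by unquantized modules and runtime buffers, which this sketch does not model.

```python
GB = 1e9  # decimal gigabytes


def weight_gb(params: float, bits_per_param: float) -> float:
    """Storage for `params` weights at the given precision, in GB."""
    return params * bits_per_param / 8 / GB


TOTAL_PARAMS = 30e9    # nominal total from the model card
ACTIVE_PARAMS = 3.3e9  # nominal active-per-token count from the model card
NF4_BITS = 4 + 0.127   # 4-bit values plus double-quantized block constants
T4_PAIR_GB = 2 * 16    # two 16 GB T4 cards

fp16_total = weight_gb(TOTAL_PARAMS, 16)
nf4_total = weight_gb(TOTAL_PARAMS, NF4_BITS)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"fp16 weights: {fp16_total:.1f} GB (fits on 2x T4: {fp16_total < T4_PAIR_GB})")
    print(f"NF4 weights:  {nf4_total:.1f} GB (fits on 2x T4: {nf4_total < T4_PAIR_GB})")
    print(f"share of weights active per token: {active_fraction:.0%}")
```

All 30B parameters must be resident in GPU memory, but each decoded token only touches the active share, which is why footprint and decode speed scale differently.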

Speed & efficiency

  • Sparse MoE. Only ~3.3B of the 30B parameters are read per token: about 2.4× fewer than a dense 8B model and about 9× fewer than a dense 30B model. This is the core "bigger AND faster" lever, although all 30B parameters must still fit in memory.
  • Adaptive thinking. Append /no_think for instant direct answers, or keep thinking on for deep step-by-step reasoning on hard math and multi-step agentic planning — you spend latency only when the task is worth it.
  • Speculative decoding with a small same-family draft (Qwen/Qwen3-0.6B) can give a lossless 1.5–2.5× speedup on code and structured text (assistant_model= in transformers, or --speculative-config in recent vLLM releases, which replaced the older --speculative-model flag).
  • Production serving. On Ampere+ GPUs use vLLM (fp16 / AWQ) for high-throughput deployment.
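To show why speculative decoding is lossless, here is a toy sketch of the greedy variant. The "models" are plain Python functions standing in for a large target and a small draft; they are hypothetical and unrelated to the real checkpoints. In practice you would use `assistant_model=` in transformers rather than writing this loop yourself.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy "model": context -> next token id


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, one model call per token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) The cheap draft proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target checks each position. A real implementation scores all
        #    positions in one batched forward pass; this calls it per position.
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if expected != tok:
                # First mismatch: keep the agreed prefix plus the target's token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            seq.extend(proposal)  # every proposed token was accepted
    return seq


def target_model(seq: List[int]) -> int:
    """Toy stand-in for the large model."""
    return (3 * seq[-1] + 1) % 17


def draft_model(seq: List[int]) -> int:
    """Toy stand-in for the small model: wrong whenever the last token is a multiple of 5."""
    guess = target_model(seq)
    return (guess + 1) % 17 if seq[-1] % 5 == 0 else guess


if __name__ == "__main__":
    reference = greedy_decode(target_model, [1], 20)
    fast = speculative_decode(target_model, draft_model, [1], 20)
    print(fast == reference)
```

Every emitted token is one the target would have chosen itself, so the output matches plain greedy decoding; the speedup depends on how often the draft agrees. With sampling instead of greedy decoding, production implementations use a rejection-sampling acceptance rule, which this sketch does not cover.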

Prompting tips

  • Code: specify language, constraints ("no external libraries"), and the exact I/O contract.
  • Design: name the stack and the look you want; it will return tokens, semantics, and structure.
  • Agentic / tools: give the tool schema and ask for the call as strict JSON.
  • Math/logic: ask it to reason step by step; it is tuned to externalize its work.
  • Vision: put the image/video before the question in the message content.
  • Sampling (Qwen3 family): thinking → temperature=0.6, top_p=0.95, top_k=20; direct answers → temperature=0.7, top_p=0.8, top_k=20.
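The sampling settings in the last tip can be kept as presets and unpacked into `generate`. A minimal sketch: the preset values are the ones listed above, while the dictionary and helper names are hypothetical. `do_sample=True` is added because transformers ignores temperature, top_p, and top_k under greedy decoding.

```python
SAMPLING_PRESETS = {
    # Values from the "Sampling (Qwen3 family)" tip.
    "thinking": {"do_sample": True, "temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "direct": {"do_sample": True, "temperature": 0.7, "top_p": 0.8, "top_k": 20},
}


def sampling_kwargs(think: bool = True, max_new_tokens: int = 1024) -> dict:
    """Return keyword arguments for model.generate() for the chosen mode."""
    preset = SAMPLING_PRESETS["thinking" if think else "direct"]
    return {**preset, "max_new_tokens": max_new_tokens}


if __name__ == "__main__":
    print(sampling_kwargs(think=True))
    print(sampling_kwargs(think=False, max_new_tokens=256))
```

With the Quickstart objects in scope, usage would look like `model.generate(**inputs, **sampling_kwargs(think=False))`.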

Long context (large codebases)

256K out of the box. To push toward ~1M, enable YaRN in config.json:

"rope_scaling": { "type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144 }

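(A factor of 4.0 scales the 262,144-token native window to 1,048,576 tokens. YaRN carries a small short-context quality cost, so enable it only when you actually need more than 256K.)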
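A small sketch of the same change applied programmatically to a config dictionary, with the resulting window computed as a check. The helper names are hypothetical, and where exactly `rope_scaling` lives in this model's config.json (top level or a nested text config) is not stated here, so the code works on whichever dictionary you pass in. It merges into any existing `rope_scaling` entry rather than replacing it.

```python
import copy
import json

NATIVE_CONTEXT = 262_144  # 256K tokens, from the snippet above


def with_yarn(config: dict, factor: float = 4.0) -> dict:
    """Return a copy of `config` with the YaRN settings merged into rope_scaling."""
    patched = copy.deepcopy(config)
    rope = dict(patched.get("rope_scaling") or {})  # keep any keys already present
    rope.update({
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": NATIVE_CONTEXT,
    })
    patched["rope_scaling"] = rope
    return patched


def extended_context(config: dict) -> int:
    """Context length implied by the YaRN factor."""
    rope = config["rope_scaling"]
    return int(rope["factor"] * rope["original_max_position_embeddings"])


if __name__ == "__main__":
    cfg = with_yarn({})
    print(json.dumps(cfg, indent=2))
    print(extended_context(cfg))
```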

Deployment

  • Single / dual GPU: loads in bf16/fp16 with device_map="auto"; 4-bit NF4 fits ~17 GB across 2× T4.
  • Serving: integrates with standard transformers generation and vision-capable serving stacks such as vLLM (with optional speculative decoding) for high-throughput production use.
  • Quantized formats: GGUF and other community quantizations are supported.

Limitations & responsible use

  • Salience 1.5 Flash can be confidently wrong. Verify mathematical and factual claims.
  • Generated code may be insecure or incorrect. Review it before running, and never execute untrusted output.
  • Long-context and long-video inputs increase latency and memory substantially.
  • It inherits the licenses, biases, and failure modes of its base model. Do not use it for surveillance, manipulation, or any use that violates applicable law or the Apache-2.0 terms.
  • No audio modality.

Citation

@misc{vectionlabs2026salience15flash,
  title  = {Salience 1.5 Flash: A Sparse-MoE Multimodal Agent},
  author = {Vection Labs},
  year   = {2026},
  url    = {https://huggingface.co/vectionlabs/Salience-1.5-Flash}
}

© 2026 Vection Labs · Apache-2.0