Salience 1.5 — Flash
A 30B-A3B Mixture-of-Experts multimodal agent — only 3.3B active params per token: the decode speed of a small model with the reach of a large one.
Vection Labs
Abstract
Salience 1.5 Flash is a sparse Mixture-of-Experts vision-language model: 30B total parameters, but only 3.3B active per token. It decodes at the speed of a ~3B dense model while reasoning with the capacity of a 30B one — built for hard, practical work: writing and debugging real code, driving tools and agents, designing production-grade interfaces, and visual understanding over images and video, inside a single model with a context window of up to 1M tokens.
It is the fast, multimodal tier of the Salience family — engineered for people who care less about chat pleasantries and more about whether the model can do the thing: ship the function, find the bug, call the right tool, design the screen, read the diagram.
Highlights
- Bigger and faster at once. Sparse activation means ~3B of compute for 30B of knowledge — roughly 2× the decode speed of a dense 8B at far greater capacity.
- Code & agentic first. Tuned to produce runnable code, repo-scale edits, and well-formed native tool calls.
- Designs, not just describes. Defaults to modern stacks (React/Next, TypeScript, Tailwind, shadcn/ui), real design tokens, accessible (WCAG) semantics, and tasteful motion.
- Reasoning that shows its work. Structured, inspectable chains of thought — with a one-token switch to turn them off when you want instant answers.
- Genuinely multimodal. Images and video are first-class inputs, not bolted-on captioning.
- Long context. 256K tokens natively with interleaved multimodal RoPE, extensible toward ~1M with YaRN scaling: whole repos, long papers, or long videos in a single prompt.
- Fast on modest hardware. Runs on 2× T4 with no GGUF (~17 GB in 4-bit NF4).
- Open weights. Apache-2.0, `transformers`-native, single-file deployment.
Model overview
| Property | Value |
|---|---|
| Parameters | 30B total / 3.3B active (Mixture-of-Experts) |
| Modalities | text, image, video → text |
| Context window | up to 1,000,000 tokens (256K native, interleaved multimodal RoPE) |
| Precision | bfloat16 master weights |
| Architecture | Qwen3-VL MoE (30B-A3B) + native vision encoder |
| License | Apache-2.0 |
| Library | 🤗 transformers (AutoModelForImageTextToText) |
Architecture & capabilities
Salience 1.5 Flash is a Qwen3-VL Mixture-of-Experts model: a 30B-parameter expert network that routes only 3.3B parameters per token, coupled to a native vision encoder. Interleaved multimodal RoPE gives it a 256K-token native context window, which can be extended toward 1M tokens.
Its capability profile is built around four pillars:
- Code & agentic execution — runnable code, repo-scale edits, and well-formed tool calls.
- Frontier UI/UX design — modern stacks, design tokens, accessible semantics, tasteful motion.
- Deep reasoning — structured, inspectable chains of thought for math and logic.
- Multimodal perception — images and video as first-class inputs, not bolted-on captioning.
The vision pathway and long-context behavior are preserved end to end, so the same reasoning that solves a hard problem also reads a chart, a UI screenshot, or a short clip.
Agent persona & thinking control
With no system prompt, the model automatically adopts an elite software-engineering and design agent persona; pass your own system message to override it completely. Thinking is on by default; append `/no_think` (or pass `enable_thinking=False`) for instant direct answers.
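A minimal sketch of the two controls described above, assuming the system-message override and the `/no_think` suffix behave as stated. `build_messages` is a hypothetical helper, not part of `transformers` or this repository; it only assembles the chat-format list that `apply_chat_template` consumes.

```python
from typing import Optional


def build_messages(prompt: str, system: Optional[str] = None, think: bool = True) -> list:
    """Assemble a chat message list (hypothetical helper for illustration)."""
    messages = []
    if system is not None:
        # A caller-supplied system message replaces the default agent persona.
        messages.append({"role": "system", "content": [{"type": "text", "text": system}]})
    # Appending /no_think asks for a direct answer without a reasoning trace.
    text = prompt if think else f"{prompt} /no_think"
    messages.append({"role": "user", "content": [{"type": "text", "text": text}]})
    return messages


fast = build_messages("Rename this variable consistently.", think=False)
print(fast[0]["content"][0]["text"])
```

Leaving `system=None` keeps the built-in persona, and `think=True` leaves the prompt untouched so the default reasoning mode applies.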
Intended use
Salience 1.5 Flash targets technical assistance, coding & design agents, and research:
- Code generation, explanation, debugging, review, and repo-scale tasks.
- Frontend / UI generation and design-system work.
- Agentic / tool-using workflows that emit structured calls.
- Step-by-step math and quantitative reasoning.
- Visual question answering and document/diagram/chart/UI understanding.
- Video understanding over short clips, and long-document / long-context analysis.
It is not intended for high-stakes decisions without human review, nor as a source of truth for medical, legal, or financial advice.
Benchmarks
All results use a single reproducible evaluation harness with greedy/CoT settings.
Reasoning, math & code
| Benchmark | Setting | Salience-1.5-Flash |
|---|---|---|
| GSM8K | 0-shot CoT, exact match | — |
| MATH-500 | 0-shot CoT, exact match | — |
| HumanEval | 0-shot, pass@1 | — |
| MBPP | 3-shot, pass@1 | — |
| MMLU | 0-shot | — |
Multimodal
| Benchmark | Setting | Salience-1.5-Flash |
|---|---|---|
| MMMU (val) | 0-shot | — |
| MathVista (testmini) | 0-shot | — |
| DocVQA (val) | 0-shot, ANLS | — |
The evaluation protocol, prompts, and answer-extraction logic are fixed and reproducible end-to-end.
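The harness itself is not reproduced in this card. Purely as an illustration of what "0-shot CoT, exact match" scoring typically involves, the sketch below takes the last number in a completion and compares it with the reference answer. The function names and the regular expression are assumptions and are not the evaluation code behind the tables above.

```python
import re

# Matches integers and simple decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(completion: str):
    """Return the last number in the text as a float, or None if there is none."""
    matches = _NUMBER.findall(completion.replace(",", ""))
    return float(matches[-1]) if matches else None


def exact_match(completions, references) -> float:
    """Fraction of completions whose final number equals the reference."""
    hits = sum(
        extract_final_number(c) == float(r)
        for c, r in zip(completions, references)
    )
    return hits / len(references)
```

Real harnesses usually add answer normalization (units, fractions, LaTeX) on top of this, which is why extraction logic has to be fixed for scores to be comparable.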
Quickstart
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_id = "vectionlabs/Salience-1.5-Flash"
proc = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# The image comes before the question in the message content.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram proves, step by step."},
    ],
}]

# Render the chat template to text, then collect the image/video inputs.
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
imgs, vids = process_vision_info(messages)
inputs = proc(text=[text], images=imgs, videos=vids, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens.
out = model.generate(**inputs, max_new_tokens=1024)
print(proc.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
Text-only works the same way with a plain `{"type": "text", ...}` message. Append `/no_think` to any prompt for the fastest, direct answer.
Fast inference (2× T4, no GGUF)
T4 (Turing) has no bf16 and no FlashAttention2 — use fp16 + SDPA, or 4-bit NF4 to fit ~17 GB:
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

repo = "vectionlabs/Salience-1.5-Flash"
# 4-bit NF4 with double quantization; compute in fp16 because T4 has no bf16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForImageTextToText.from_pretrained(
    repo, quantization_config=bnb, device_map="auto", attn_implementation="sdpa"
)
proc = AutoProcessor.from_pretrained(repo)
```
Because only 3.3B parameters are active per token, decode stays fast even at 30B scale.
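A back-of-envelope way to see why: single-stream decoding is usually limited by how many weight bytes must be read per token, so the active parameter count matters more than the total. The sketch below is a roofline-style upper bound, not a measurement. The 320 GB/s figure (the T4's nominal memory bandwidth) and the dense 8B comparison point are assumptions, and the estimate ignores KV-cache reads, expert routing overhead, and inter-GPU transfer.

```python
def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Weight bytes read to decode one token (weights only)."""
    return active_params * bits_per_weight / 8


def max_tokens_per_second(active_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode rate if weight reads were the only cost."""
    return bandwidth_gb_s * 1e9 / bytes_per_token(active_params, bits_per_weight)


moe = bytes_per_token(3.3e9, 16)   # 3.3B active parameters in fp16
dense = bytes_per_token(8e9, 16)   # hypothetical dense 8B model in fp16
print(f"weight traffic ratio (dense 8B / MoE): {dense / moe:.2f}")
print(f"fp16 bound at 320 GB/s: {max_tokens_per_second(3.3e9, 16, 320):.0f} tok/s")
```

The ratio comes out near 2.4, consistent with the "roughly 2×" decode-speed comparison against a dense 8B model. Real throughput will sit below the bound.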
Speed & efficiency
- Sparse MoE. With ~3.3B active parameters per token, the weights read per token are less than half those of a dense 8B model. This is the core "bigger AND faster" lever.
- Adaptive thinking. Append `/no_think` for instant direct answers, or keep thinking on for deep step-by-step reasoning on hard math and multi-step agentic planning, so you spend latency only when the task is worth it.
- Speculative decoding with a small same-family draft (`Qwen/Qwen3-0.6B`) gives a lossless 1.5–2.5× speedup on code and structured text (`assistant_model=` in `transformers`, or `--speculative-model` in vLLM; newer vLLM releases take `--speculative-config` instead).
- Production serving. On Ampere+ GPUs use vLLM (fp16 / AWQ) for high-throughput deployment.
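How a draft model turns into a speedup can be estimated with the standard expected-acceptance analysis of speculative decoding. The sketch assumes each drafted token is accepted independently with probability `alpha` and that one draft step costs a fraction `cost_ratio` of one target step. The values used below are illustrative, not measurements of this model pair.

```python
def expected_tokens_per_step(alpha: float, num_draft: int) -> float:
    """Expected tokens emitted per target-model pass, for acceptance rate alpha < 1."""
    return (1 - alpha ** (num_draft + 1)) / (1 - alpha)


def expected_speedup(alpha: float, num_draft: int, cost_ratio: float) -> float:
    """Speedup over plain decoding when a draft step costs cost_ratio target steps."""
    return expected_tokens_per_step(alpha, num_draft) / (num_draft * cost_ratio + 1)


for alpha in (0.6, 0.8):
    print(alpha, round(expected_speedup(alpha, num_draft=4, cost_ratio=0.1), 2))
```

With these assumed inputs the estimate lands at about 1.65× and 2.4×, inside the 1.5–2.5× range quoted above. Lower acceptance rates or a more expensive draft shrink the gain.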
Prompting tips
- Code: specify language, constraints ("no external libraries"), and the exact I/O contract.
- Design: name the stack and the look you want; it will return tokens, semantics, and structure.
- Agentic / tools: give the tool schema and ask for the call as strict JSON.
- Math/logic: ask it to reason step by step; it is tuned to externalize its work.
- Vision: put the image/video before the question in the message content.
- Sampling (Qwen3 family): thinking → `temperature=0.6, top_p=0.95, top_k=20`; direct answers → `temperature=0.7, top_p=0.8, top_k=20`.
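The sampling tip can be kept in one place with a small preset table. This is a sketch: `SAMPLING_PRESETS` and `generation_kwargs` are hypothetical names, the numeric values are the ones listed above, and the resulting dictionary is meant to be unpacked into `model.generate()`.

```python
SAMPLING_PRESETS = {
    # Values taken from the sampling tip above.
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "direct": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
}


def generation_kwargs(thinking: bool, max_new_tokens: int = 1024) -> dict:
    """Build keyword arguments for model.generate() (hypothetical helper)."""
    preset = SAMPLING_PRESETS["thinking" if thinking else "direct"]
    # do_sample=True is needed for temperature/top_p/top_k to take effect.
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **preset}
```

Usage would look like `model.generate(**inputs, **generation_kwargs(thinking=False))`, paired with a prompt that ends in `/no_think`.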
Long context (large codebases)
256K out of the box. To push toward ~1M, enable YaRN in `config.json`:

```json
"rope_scaling": { "type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144 }
```
(Small short-context quality cost — enable only when you actually need >256K.)
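The factor is simply the ratio of the target context to the native one, so intermediate targets can use a smaller value. A minimal sketch, assuming the key names shown above; `yarn_rope_scaling` is a hypothetical helper that only builds the dictionary, and any other RoPE-related keys already present in the model's `config.json` are outside its scope.

```python
import json


def yarn_rope_scaling(target_context: int, native_context: int = 262144) -> dict:
    """Build a rope_scaling entry that stretches native_context to target_context."""
    factor = target_context / native_context
    return {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_context,
    }


# 4 x 262,144 = 1,048,576 positions, i.e. the "~1M" target.
print(json.dumps({"rope_scaling": yarn_rope_scaling(1_048_576)}, indent=2))
```

A 512K target would give a factor of 2.0, which limits the short-context quality cost when the full 1M window is not needed.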
Deployment
- Single / dual GPU: loads in bf16/fp16 with `device_map="auto"`; 4-bit NF4 fits ~17 GB across 2× T4.
- Serving: integrates with standard `transformers` generation and vision-capable serving stacks such as vLLM (with optional speculative decoding) for high-throughput production use.
- Quantized formats: GGUF and other community quantizations are supported.
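The memory figures above follow from simple arithmetic on the total parameter count. The sketch below computes raw weight storage only; it is an estimate under the stated parameter count, not a measured footprint, and it excludes activations, the KV cache, and quantization metadata.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal gigabytes, excluding any overhead."""
    return total_params * bits_per_weight / 8 / 1e9


for label, bits in (("bf16/fp16", 16), ("4-bit NF4", 4)):
    print(f"{label}: {weight_memory_gb(30e9, bits):.0f} GB")
```

At 16 bits the 30B weights need about 60 GB, more than two 16 GB T4s provide, while 4 bits brings the raw figure to 15 GB. The gap to the ~17 GB quoted above is plausibly quantization constants plus layers kept in higher precision; this card does not break that overhead down.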
Limitations & responsible use
- Salience 1.5 Flash can be confidently wrong. Verify mathematical and factual claims.
- Generated code may be insecure or incorrect: review it before running, and never execute untrusted output.
- Long-context and long-video inputs increase latency and memory substantially.
- It inherits the licenses, biases, and failure modes of its base model. Do not use it for surveillance, manipulation, or any use that violates applicable law or the Apache-2.0 terms.
- No audio modality.
Citation
```bibtex
@misc{vectionlabs2026salience15flash,
  title  = {Salience 1.5 Flash: A Sparse-MoE Multimodal Agent},
  author = {Vection Labs},
  year   = {2026},
  url    = {https://huggingface.co/vectionlabs/Salience-1.5-Flash}
}
```