MOSAIC-4B

MOSAIC-4B is an efficient heterogeneous Vision-Language Model derived from Qwen3-VL-4B-Instruct via the MOSAIC (Multi-Objective Search for Adaptive Inter-layer Composition) method. MOSAIC automatically transforms homogeneous transformer architectures into optimized heterogeneous designs through hardware-aware neural architecture search.

Paper: MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models (arXiv preprint)

Authors: Yuncheng Yang*, Feiyang Ye*, Shixian Luo, Yinna Zhu, Lianlei Shan, Wangcai Zhao, Kuo Zhang, Yan Chen, Yong Wu†, Xie Yan (LiAuto Inc.)

Highlights

Metric	Value
Decoding speedup (TPOT)	2.54× vs. Qwen3-VL-4B-Instruct
Prefilling speedup (TTFT @ 96k tokens)	1.76× vs. Qwen3-VL-4B-Instruct
Performance gap (19 benchmarks avg)	−0.6% on image, −0.8% on video
Training cost	< 2% of original Qwen3-VL-4B-Instruct
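
The two speedup rows are ratios of latency metrics. The source does not describe its measurement harness, so the following is only a minimal sketch of the usual definitions: TTFT as the delay until the first generated token, and TPOT as the mean gap between subsequent tokens. The function names and timestamps are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def ttft(request_start: float, token_times: Sequence[float]) -> float:
    """Time to first token: delay between the request and the first generated token."""
    return token_times[0] - request_start


def tpot(token_times: Sequence[float]) -> float:
    """Time per output token: mean gap between consecutive generated tokens."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def speedup(baseline_latency: float, candidate_latency: float) -> float:
    """Speedup as a ratio: baseline latency divided by candidate latency."""
    return baseline_latency / candidate_latency


# Hypothetical token arrival timestamps in seconds (not measured values).
baseline_tokens = [0.50, 0.54, 0.58, 0.62]
candidate_tokens = [0.30, 0.32, 0.34, 0.36]

decode_speedup = speedup(tpot(baseline_tokens), tpot(candidate_tokens))
prefill_speedup = speedup(ttft(0.0, baseline_tokens), ttft(0.0, candidate_tokens))
```

With these made-up numbers the candidate halves the per-token gap, so the decode speedup is about 2×. The figures in the table come from the authors' own benchmarks, not from this sketch.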

Key Advantages

Hardware-aware automatic architecture search. MOSAIC formulates per-layer operator selection as a multi-objective Mixed Integer Programming (MIP) problem, maximizing downstream performance under strict hardware latency constraints — no manual trial-and-error needed.
Heterogeneous operator mixing. Each of the 36 transformer layers can independently use full attention (GQA), sliding window attention (SWA), linear attention (KDA / GDN), or low-rank attention (MLA). This per-layer flexibility reaches points on the performance-efficiency frontier that hand-designed, fixed-ratio patterns cannot.
Near-teacher performance at a fraction of the training cost. Averaged over 19 representative benchmarks, MOSAIC-4B stays within 1% of Qwen3-VL-4B-Instruct on both image understanding (avg Δ = −0.6%) and video understanding (avg Δ = −0.8%). It is trained on only ~32M publicly available samples, which corresponds to less than 2% of the original model's training compute.
Scalable inference acceleration. The speedup grows with sequence length: the TPOT speedup is 2.54× at 1k decode length, 2.68× at 16k, and 2.72× at 256k tokens, making MOSAIC-4B especially efficient for long-context and long-generation workloads.
Principled two-stage parameter recovery. Structural transitions are stabilized via (1) global off-policy distillation to align internal representations, followed by (2) dual-teacher on-policy distillation using a 235B oracle teacher for knowledge expansion alongside the original 4B teacher for distributional stability.
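
The per-layer operator selection described in the first item can be pictured as a multiple-choice knapsack: pick exactly one operator per layer, maximize a quality score, and stay within a latency budget. The sketch below solves a toy instance with dynamic programming rather than a MIP solver, and it is not the authors' formulation. The operator names come from the list above, while the latency and score values, the three-layer setup, and the function name `select_operators` are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical per-operator (latency in microseconds, quality score) values.
Candidates = Dict[str, Tuple[int, float]]


def select_operators(layers: List[Candidates], budget_us: int) -> Tuple[float, List[str]]:
    """Pick exactly one operator per layer, maximizing total score within the budget."""
    # best[t] = (score, choices) for the best assignment so far using exactly t microseconds.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (score, chosen) in best.items():
            for name, (latency, gain) in candidates.items():
                total = used + latency
                if total > budget_us:
                    continue  # this choice would violate the latency constraint
                cand = (score + gain, chosen + [name])
                if total not in nxt or cand[0] > nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("no assignment fits within the latency budget")
    return max(best.values(), key=lambda item: item[0])


# Three toy layers sharing the same made-up operator profile.
profile: Candidates = {"GQA": (100, 1.00), "SWA": (60, 0.90), "GDN": (40, 0.80)}
score, assignment = select_operators([profile] * 3, budget_us=220)
```

In this toy instance the budget admits one full-attention layer plus two sliding-window layers, which is the highest-scoring feasible mix. A real search would use measured per-layer latencies on the target hardware and a quality estimate for each candidate operator.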

Architecture

The figure below shows the per-layer operator assignment and relative runtime reduction for MOSAIC-4B (1.5× speedup target). Green bars indicate saved runtime compared to the original full-attention layer.

HuggingFace Transformers

Installation

pip install transformers torch
pip install flash-linear-attention  # required for the KDA, GDN, and MLA operators

Dependencies

Package	Version
transformers	≥ 4.57.0
torch	≥ 2.0
flash-linear-attention (fla)	latest

Usage

This model uses a custom architecture and requires trust_remote_code=True.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "LiAuto-DSR/MOSAIC-4B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

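# Tokenize the prompt and load the image in one step, so the image in `messages` is actually passed to the model.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)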

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
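
The `messages` structure above is a plain list of dictionaries, so it can be built programmatically. The helper below is a small sketch for a single user turn with several images followed by a text prompt. The name `build_messages` and the URLs are hypothetical, and multi-image input is assumed to work as it does for the base Qwen3-VL family; the source does not confirm this for MOSAIC-4B.

```python
from typing import Any, Dict, List


def build_messages(image_urls: List[str], prompt: str) -> List[Dict[str, Any]]:
    """Build a single-turn user message: zero or more images, then the text prompt."""
    content: List[Dict[str, str]] = [{"type": "image", "image": url} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Placeholder URLs; replace them with real image locations.
messages = build_messages(
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "What differs between these two images?",
)
```

The resulting list can be passed to `processor.apply_chat_template` in place of the hand-written `messages` in the example above.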

vLLM Acceleration

For significantly faster inference, MOSAIC-4B supports vLLM via an out-of-tree plugin that monkey-patches vLLM at import time.

Installation

# from inside the model directory:
pip install -e .
# or from anywhere else, pointing at the model directory:
pip install -e /path/to/MOSAIC-4B

Dependencies

Package	Version
vllm	≥ 0.17.0, < 0.18.0
flash-linear-attention (fla)	≥ 0.4.2
einops	latest
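
Because the plugin targets a narrow vLLM version range, it can help to check the installed versions before importing it. The following is a minimal sketch that uses only the standard library. The helper names are hypothetical, and the parser deliberately ignores pre-release and local-version suffixes, so it is an approximation and not a full version-specifier implementation.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '0.17.1' -> (0, 17, 1); suffixes are ignored."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so that '0.17' compares equal to '0.17.0'
    return tuple(parts)


def in_range(version: str, minimum: str, below: Optional[str] = None) -> bool:
    """True if minimum <= version, and version < below when an upper bound is given."""
    parsed = parse_version(version)
    if parsed < parse_version(minimum):
        return False
    return below is None or parsed < parse_version(below)


def check_installed(package: str, minimum: str, below: Optional[str] = None) -> Optional[bool]:
    """Return None if the package is not installed, else whether its version is in range."""
    try:
        return in_range(metadata.version(package), minimum, below)
    except metadata.PackageNotFoundError:
        return None


# Bounds copied from the dependency table above.
vllm_ok = check_installed("vllm", "0.17.0", "0.18.0")
fla_ok = check_installed("flash-linear-attention", "0.4.2")
```

A result of `None` means the package is missing, `False` means it is installed but outside the supported range.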

Usage

import nas_child_vl_vllm  # Register MOSAIC-4B with vLLM (must be imported before vLLM)

from vllm import LLM, SamplingParams

llm = LLM(
    model="LiAuto-DSR/MOSAIC-4B",
    trust_remote_code=True,
    enforce_eager=True,  # Required: FLA layers don't support CUDAGraph
    dtype="bfloat16",
)

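# generate() returns one RequestOutput per prompt; print only the generated text.
outputs = llm.generate("Hello, how are you?", SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)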

Note: enforce_eager=True is required because the FLA (Flash Linear Attention) GDN/KDA layers do not support CUDA graphs. trust_remote_code=True is needed to load the custom config and model classes.

What the plugin does when you import nas_child_vl_vllm:

Registers NasChildVLConfig with vLLM's config discovery system
Maps model_type="nas-child-vl" to the config class
Maps NasChildVLModelForCausalLM architecture to the vLLM model class (lazy-loaded)
Registers MambaModelConfig for proper hybrid state management (GDN/KDA layers)
Patches rmsnorm_fn to support sigmoid activation (needed by KDA layers)
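
The "lazy-loaded" mapping in the third item follows a common registry pattern: store only a string path at registration time and import the class the first time it is requested. The sketch below illustrates that pattern in isolation. It is not the plugin's or vLLM's actual code; `LazyRegistry` is a hypothetical class, and a standard-library class stands in for the real model class.

```python
import importlib
from typing import Dict, Type


class LazyRegistry:
    """Maps an architecture name to a 'module:Class' path and imports it on first use."""

    def __init__(self) -> None:
        self._paths: Dict[str, str] = {}
        self._loaded: Dict[str, Type] = {}

    def register(self, name: str, path: str) -> None:
        # Only the string is stored, so registering never imports the target module.
        self._paths[name] = path

    def resolve(self, name: str) -> Type:
        # Import and cache the class the first time it is requested.
        if name not in self._loaded:
            module_name, _, class_name = self._paths[name].partition(":")
            module = importlib.import_module(module_name)
            self._loaded[name] = getattr(module, class_name)
        return self._loaded[name]


registry = LazyRegistry()
# A standard-library class stands in for the real model class in this illustration.
registry.register("NasChildVLModelForCausalLM", "collections:OrderedDict")
model_cls = registry.resolve("NasChildVLModelForCausalLM")
```

Deferring the import keeps the registration step cheap and avoids loading heavy model code until a matching checkpoint is actually requested.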

Citation

@article{yang2026mosaic,
  title     = {MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models},
  author    = {Yang, Yuncheng and Ye, Feiyang and Luo, Shixian and Zhu, Yinna and Shan, Lianlei and Zhao, Wangcai and Zhang, Kuo and Chen, Yan and Wu, Yong and Yan, Xie},
  journal   = {arXiv preprint},
  year      = {2026}
}

License

This model is released under the Apache 2.0 license. The base model weights are derived from Qwen3-VL-4B-Instruct, which is licensed under the Qwen Research License.
