Marlin: a tiny VLM to extract structured information from videos

Marlin is a 2B video VLM tuned for the two questions developers actually ask of their videos: what is happening, and when? It produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to span-grounded (start, end) ranges in the video. At 2B params, it is the strongest open model in its weight class on dense captioning (DREAM-1K, CaReBench) and natural-language temporal grounding (TimeLens-Bench), and competitive with Gemini-2.5 at a fraction of the cost.

✨ Key features

📝 State-of-the-art dense captioning at 2B. Tops the CaReBench leaderboard and sits between Tarsier-34B and Gemini-1.5-Pro on DREAM-1K, two of the most rigorous fine-grained video-captioning benchmarks in the community.
⏱️ Best-in-class temporal grounding at 2B. On Tencent's TimeLens-Bench (Charades / ActivityNet / QVHighlights), Marlin beats Qwen2.5-VL-7B by +6.4 mIoU and matches Gemini-2.0-Flash.
🔥 Built to deploy. 2B params, vLLM- and swift-deploy-compatible, runs on a single consumer GPU. Same canonical training prompt at inference time, no special wrappers required.
🛠️ Developer-friendly. Standard HF transformers API, two convenience methods (.caption, .find) that return parsed dicts, raw .generate() access for custom prompts, Gradio demo ready out of the box.

Need Marlin tailored to your specific video processing needs? Our team can help with custom fine-tuning and integrations — contact us ✉️

🧠 Model & training

Architecture. Marlin is a fine-tune of Qwen3.5-2B with the video-capable visual tower kept intact. The model exposes two modes (caption and find) through custom modeling code in modeling_marlin.py, which wraps a single canonical training prompt per mode and parses the structured output into typed Python dicts.

Training data. We assembled a high-quality training corpus by combining sparse public annotations (ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, TimeLens) with dense re-annotations from Gemini-3-Flash in thinking mode, followed by targeted human review on the highest-impact splits. The teacher pipeline was tuned specifically to produce temporally grounded atomic events and actions, with explicit <start-end> boundaries per claim rather than free-form prose. The final mix is ~400K high-quality clip-level annotations for caption mode and a separate grounding-tuned split for find mode.

Training technique. Two-stage post-training on a single H100. Stage 1 is supervised fine-tuning (SFT) on the curated dataset above, with a fixed canonical prompt per mode and Tarsier-schema output formatting. Stage 2 is preference optimization via SimPO (Simple Preference Optimization) on a teacher-distilled preference set. For each clip, candidate completions from the SFT checkpoint are scored by a stronger Gemini-3-Flash judge using a detailed rubric (factual accuracy, completeness, temporal alignment), and the resulting win/lose pairs are used to align Marlin. SimPO needs no reference model, which makes it cheaper and more stable than DPO at this scale. ✏️ Recipe paper coming soon.
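
For readers unfamiliar with SimPO, the per-pair objective can be sketched as below. This is a minimal illustration of the published SimPO loss, not Marlin's training code: the function name and the `beta` / `gamma` values are assumptions chosen for the example, and the recipe's actual hyperparameters are not stated here.

```python
import math

def simpo_loss(logp_win, len_win, logp_lose, len_lose, beta=2.0, gamma=1.0):
    """SimPO loss for a single preference pair.

    logp_win / logp_lose: summed token log-probabilities of the preferred and
        rejected completions under the policy being trained.
    len_win / len_lose: completion lengths in tokens.
    beta, gamma: illustrative values only (assumed, not Marlin's settings).
    """
    # The length-normalised log-likelihood is the implicit reward,
    # so no reference model is involved.
    reward_win = beta * logp_win / len_win
    reward_lose = beta * logp_lose / len_lose
    margin = reward_win - reward_lose - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the preferred completion becomes more likely per token than the rejected one by more than the target margin `gamma`.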

🏆 Evaluation

Marlin is, to our knowledge, the strongest open video VLM in its weight class on both axes that matter for video analysis in production: fine-grained dense captioning and natural-language temporal grounding. The three-panel figure below summarises the trajectory from the Qwen3.5-2B base, through Marlin-SFT, to Marlin-SimPO (the release checkpoint) across:

CaReBench — CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval
DREAM-1K — Tarsier: Recipes for Training and Evaluating Large Video Description Models
TimeLens-Bench — TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Marlin 2B trajectory across CaReBench, DREAM-1K, and TimeLens-Charades

Same training pipeline on every panel; same evaluation harness across all rows. On captioning, Marlin closes the gap to its Gemini-2.5-Flash teacher to within 0.21 / 0.43 points out of 10. On temporal grounding, Marlin sits on the Pareto frontier in the 2B band and matches Gemini-2.5-Flash (non-thinking). Specialised 7B+ models on these benchmarks (TimeLens-7B/8B, MiMo-VL, Time-R1) still hold the upper frontier because they are trained on task-specific data; Marlin is the strongest general-purpose model on these tasks at 2B.
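
The temporal-grounding numbers above are reported as mIoU. As a rough sketch of what that metric measures, the snippet below computes the intersection-over-union of predicted and ground-truth (start, end) spans and averages it. The function names are illustrative, and the official TimeLens-Bench harness may differ in details, for example in how missing predictions are handled.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) spans, both in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Mean IoU over a dataset; a missing prediction (None) scores 0 (assumed convention)."""
    scores = [temporal_iou(p, g) if p is not None else 0.0 for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)
```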

Quickstart

The model ships with custom modeling code that adds two convenience methods (caption and find) directly to the model object. Loading with trust_remote_code=True returns a ready-to-use instance:

import torch
from transformers import AutoModelForCausalLM

marlin = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
marlin.compile()  # optional — wraps torch.compile, faster after first call

Caption mode — `marlin.caption()`

result = marlin.caption("video.mp4")

print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")

Optional kwargs:

max_new_tokens=2048 (default) — generation token cap.
prompt=None — override the canonical training prompt (almost always leave as None).
do_sample=False, temperature=1.0, top_p=1.0 — sampling controls.

The model was trained on dense captions of variable length and will produce as much detail as it sees fit within max_new_tokens.
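
Because each event carries `start`, `end`, and `description` keys, the list is easy to post-process. As one possible use, the sketch below turns an events list into SRT subtitle text. The helper names are hypothetical and not part of the Marlin API; only the dict keys come from the example above.

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def events_to_srt(events):
    """Render a list of {'start', 'end', 'description'} dicts as SRT text."""
    blocks = []
    # Sort by start time so subtitle indices follow the video timeline.
    for i, ev in enumerate(sorted(events, key=lambda e: e["start"]), start=1):
        timing = f"{to_srt_time(ev['start'])} --> {to_srt_time(ev['end'])}"
        blocks.append(f"{i}\n{timing}\n{ev['description']}\n")
    return "\n".join(blocks)
```

With a real result this would be called as `events_to_srt(result["events"])`.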

Find mode — `marlin.find()`

result = marlin.find("video.mp4", event="a person enters the room")

print(result["raw"])        # "From 14.3 to 18.2." raw model output
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format
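
Since `span` can be `None` on a parse failure, downstream code will usually want a small guard. The following is a sketch of one way to do that: it falls back to a lenient regex over `raw`, optionally clamps to the video length, and rejects empty spans. The helper name, the fallback pattern, and the `video_duration` parameter are assumptions for illustration, not part of the shipped modeling code.

```python
import re

# Lenient pattern for the "From X to Y." answer format shown above.
_SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)")

def span_from_result(result, video_duration=None):
    """Return a validated (start, end) tuple in seconds, or None."""
    span = result.get("span")
    if span is None:
        # Fall back to searching the raw text anywhere in the output.
        match = _SPAN_RE.search(result.get("raw") or "")
        if match is None:
            return None
        span = (float(match.group(1)), float(match.group(2)))
    start, end = span
    if video_duration is not None:
        # Clamp to the playable range of the video.
        start, end = max(0.0, start), min(end, video_duration)
    return (start, end) if end > start else None
```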

System requirements

transformers >= 5.7.0 (for native qwen3_5 architecture)
torch >= 2.11.0
torchcodec (video decoding)
qwen-vl-utils >= 0.0.14
av (torchcodec system dep)
pillow

Install:

pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow

Video preprocessing

The custom modeling code sets these env vars internally (matches the training-time setup). If you want to override them, set them in your shell before importing transformers:

| Env var | Default | What it does |
| --- | --- | --- |
| `FORCE_QWENVL_VIDEO_READER` | `torchcodec` | Video decoder backend |
| `VIDEO_MAX_PIXELS` | `200704` | Max pixels per frame (~448×448) |
| `FPS` | `2.0` | Frame sampling rate |
| `FPS_MAX_FRAMES` | `240` | Cap on total frames (covers ~2 min videos) |
| `FPS_MIN_FRAMES` | `4` | Floor for very short videos |
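
If exporting variables in the shell is inconvenient, the same overrides can be set from Python, provided this happens before `transformers` is imported. The sketch below assumes the modeling code respects values that are already present in the environment, as the shell instructions above imply. The `approx_frame_count` helper is a hypothetical back-of-the-envelope estimate, not the actual sampler, which may round frame counts differently.

```python
import os

# Set overrides before transformers (and the model's remote code) is imported.
os.environ["FPS"] = "1.0"             # sample one frame per second instead of two
os.environ["FPS_MAX_FRAMES"] = "120"  # lower the cap on total frames

def approx_frame_count(duration_s, fps=2.0, min_frames=4, max_frames=240):
    """Rough estimate of sampled frames: duration * fps, clamped to [min, max].

    With the defaults, the cap is reached at 240 / 2.0 = 120 seconds of video.
    """
    return int(min(max(duration_s * fps, min_frames), max_frames))
```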

Capabilities

Caption (Mode 1): produces Scene: <paragraph> + Events: <X.X - Y.Y> <description> format.
Find (Mode 2): given a natural-language event query, returns a span in the form "From X.X to Y.Y."
Multichunk reasoning (limited in this checkpoint): <think>-style chunked-video reasoning with explicit chunk-time → source-time arithmetic. Not directly exposed via .caption() / .find() — use a raw prompt if needed.
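
When calling `generate()` directly, the caption-mode text has to be parsed by hand. The snippet below is a minimal sketch of such a parser. It assumes the output looks like `Scene: ...` followed by `Events:` and a series of `<X.X - Y.Y> description` entries, as described above; the exact whitespace and line layout are assumptions, and the function is not the parser shipped in modeling_marlin.py. Any leading `<think>` prefix should be removed before parsing.

```python
import re

# One event: "<start - end> description", up to the next "<" or end of text.
_EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*([^<]*)")

def parse_caption(text):
    """Split caption-mode text into a scene paragraph and a list of event dicts."""
    scene_part, _, events_part = text.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = [
        {"start": float(start), "end": float(end), "description": desc.strip()}
        for start, end, desc in _EVENT_RE.findall(events_part)
    ]
    return {"scene": scene, "events": events}
```

Descriptions that themselves contain a `<` character would be cut short by this pattern, so treat it as a starting point.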

Training data

Caption mode: ANet, LSMDC, YC2, COIN, GOT-10k/LaSOT — Gemini-generated dense captions.
Find mode: HC-STVG, VidSTG, TimeLens — ground-truth spans + multichunk variants.

Advanced — raw inference

If you want to bypass the helper methods and call generate() directly (e.g., for custom prompts), the standard transformers pattern works:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
processor = AutoProcessor.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)

Notes on output

The model emits a <think> token at the start of every response (an artifact of training with add_non_thinking_prefix=True). The .caption() and .find() methods strip this automatically. If you're using generate() directly, strip <think>...</think> (with or without closing tag) from the start of the output.
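
One way to do this stripping is a single anchored regex, sketched below. The helper name is hypothetical. It removes a leading `<think>...</think>` block when the closing tag is present, and only the opening tag when it is not.

```python
import re

# Leading <think>, optionally followed by everything up to the first </think>.
_THINK_PREFIX = re.compile(r"^\s*<think>(?:.*?</think>)?\s*", re.DOTALL)

def strip_think(text):
    """Remove a leading <think> prefix (closed or unclosed) from model output."""
    return _THINK_PREFIX.sub("", text, count=1)
```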

Safetensors

Model size: 2B params
Tensor type: BF16

Model tree for NemoStation/Marlin-2B

Base model: Qwen/Qwen3.5-2B-Base
Finetuned: Qwen/Qwen3.5-2B