Marlin: a tiny VLM to extract structured information from videos
Marlin is a 2B video VLM tuned for the two questions developers actually ask of their videos: what is happening, and when? It produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to span-grounded (start, end) ranges in the video. At 2B params, it is the strongest open model in its weight class on dense captioning (DREAM-1K, CaReBench) and natural-language temporal grounding (TimeLens-Bench), and competitive with Gemini-2.5 at a fraction of the cost.
✨ Key features
- 📝 State-of-the-art dense captioning at 2B. Tops the CaReBench leaderboard and sits between Tarsier-34B and Gemini-1.5-Pro on DREAM-1K, two of the most rigorous fine-grained video-captioning benchmarks in the community.
- ⏱️ Best-in-class temporal grounding at 2B. On Tencent's TimeLens-Bench (Charades / ActivityNet / QVHighlights), Marlin beats Qwen2.5-VL-7B by +6.4 mIoU and matches Gemini-2.0-Flash.
- 🔥 Built to deploy. 2B params, vLLM- and swift-deploy-compatible, runs on a single consumer GPU. Same canonical training prompt at inference time, no special wrappers required.
- 🛠️ Developer-friendly. Standard HF transformers API, two convenience methods (.caption, .find) that return parsed dicts, raw .generate() access for custom prompts, Gradio demo ready out of the box.
Need Marlin tailored to your specific video processing needs? Our team can help with custom fine-tuning and integrations — contact us ✉️
🧠 Model & training
Architecture. Marlin is a fine-tune of Qwen3.5-2B with the video-capable visual tower kept intact. The model exposes two modes (caption and find) through custom modeling code in modeling_marlin.py, which wraps a single canonical training prompt per mode and parses the structured output into typed Python dicts.
Training data. We assembled a high-quality training corpus by combining sparse public annotations (ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, TimeLens) with dense re-annotations from Gemini-3-Flash in thinking mode, followed by targeted human review on the highest-impact splits. The teacher pipeline was tuned specifically to produce temporally grounded atomic events and actions, with explicit <start-end> boundaries per claim rather than free-form prose. The final mix is ~400K high-quality clip-level annotations for caption mode and a separate grounding-tuned split for find mode.
Training technique. Two-stage post-training on a single H100. Stage 1 is supervised fine-tuning (SFT) on the curated dataset above, with a fixed canonical prompt per mode and Tarsier-schema output formatting. Stage 2 is preference optimization via SimPO (Simple Preference Optimization) on a teacher-distilled preference set. For each clip, candidate completions from the SFT checkpoint are scored by a stronger Gemini-3-Flash judge using a rich rubric (factual accuracy, completeness, temporal alignment). The resulting win/lose pairs align Marlin without a reference model, which makes SimPO cheaper and more stable than DPO at this scale. ✏️ Recipe paper coming soon.
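To make the Stage 2 objective concrete, here is a minimal sketch of the SimPO loss for a single win/lose pair, written in plain Python. It is not Marlin's training code: the function name and the beta and gamma values are illustrative assumptions, since the hyperparameters used for this checkpoint are not stated here.

```python
import math


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    chosen_logps / rejected_logps: per-token log-probabilities of the
    preferred and dispreferred completions under the policy being trained.
    beta and gamma are illustrative values, not Marlin's hyperparameters.
    """
    # Length-normalised implicit reward: beta times the mean token log-prob.
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    # Bradley-Terry objective with a target margin gamma. No reference
    # model appears anywhere, which is the difference from DPO.
    margin = r_chosen - r_rejected - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because the reward is the average token log-probability, a longer completion does not score higher simply for being longer, and the policy is the only model that has to be held in memory.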
🏆 Evaluation
Marlin is, to our knowledge, the strongest open video VLM in its weight class on both axes that matter for video analysis in production: fine-grained dense captioning and natural-language temporal grounding. The three-panel figure below summarises the trajectory from the Qwen3.5-2B base, through Marlin-SFT, to Marlin-SimPO (the release checkpoint) across:
- CaReBench — CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval
- DREAM-1K — Tarsier: Recipes for Training and Evaluating Large Video Description Models
- TimeLens-Bench — TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Same training pipeline on every panel; same evaluation harness across all rows. On captioning, Marlin closes the gap to its Gemini-2.5-Flash teacher to within 0.21 / 0.43 points out of 10. On temporal grounding, Marlin sits on the Pareto frontier in the 2B band and matches Gemini-2.5-Flash (non-thinking). Specialised 7B+ models on these benchmarks (TimeLens-7B/8B, MiMo-VL, Time-R1) still carry the upper frontier because they are trained on task-specific data; Marlin is the strongest general-purpose model on these tasks at 2B.
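For readers unfamiliar with the grounding metric, the sketch below shows how temporal IoU and its mean over queries (mIoU) are conventionally computed for (start, end) spans. It is an illustration only: the helper names are hypothetical, and scoring a missing prediction as 0 is an assumption rather than a description of the TimeLens-Bench harness.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) spans, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """mIoU over queries; a missing prediction (None) is assumed to score 0."""
    scores = [
        temporal_iou(p, g) if p is not None else 0.0
        for p, g in zip(preds, gts)
    ]
    return sum(scores) / len(scores)
```

For example, a predicted span of (14.0, 18.0) against a ground truth of (16.0, 20.0) overlaps for 2 s out of a 6 s union, giving an IoU of 1/3.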
Quickstart
The model ships with custom modeling code that adds two convenience methods (caption and find) directly to the model object. Loading with trust_remote_code=True returns a ready-to-use instance:
```python
import torch
from transformers import AutoModelForCausalLM

marlin = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
marlin.compile()  # optional: wraps torch.compile, faster after first call
```
Caption mode — marlin.caption()
```python
result = marlin.caption("video.mp4")
print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")
```
Optional kwargs:
- max_new_tokens=2048 (default): generation token cap.
- prompt=None: override the canonical training prompt (almost always leave as None).
- do_sample=False, temperature=1.0, top_p=1.0: sampling controls.
The model was trained on dense captions of variable length and will produce as much detail as it sees fit within max_new_tokens.
Find mode — marlin.find()
```python
result = marlin.find("video.mp4", event="a person enters the room")
print(result["raw"])        # "From 14.3 to 18.2." raw model output
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format
```
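If you call generate() yourself and need the same kind of span tuple, a small regex is enough. The following is a minimal sketch, assuming the raw text follows the "From X.X to Y.Y." pattern shown above. parse_span is a hypothetical helper, not the parser shipped in modeling_marlin.py, and rejecting spans whose end precedes the start is an added assumption.

```python
import re

# Matches "From 14.3 to 18.2." with an optional trailing period.
_SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\.?")


def parse_span(raw):
    """Return (start, end) in seconds, or None if the text does not match."""
    m = _SPAN_RE.search(raw)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    # Assumption: a span that ends before it starts is treated as a failure.
    return (start, end) if end >= start else None
```

Returning None rather than raising mirrors the documented behaviour of result["span"], so downstream code can use the same check for both paths.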
System requirements
- transformers >= 5.7.0 (for native qwen3_5 architecture)
- torch >= 2.11.0
- torchcodec (video decoding)
- qwen-vl-utils >= 0.0.14
- av (torchcodec system dep)
- pillow
Install:
```bash
pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow
```
Video preprocessing
The custom modeling code sets these env vars internally (matches the training-time setup). If you want to override them, set them in your shell before importing transformers:
| Env var | Default | What it does |
|---|---|---|
| FORCE_QWENVL_VIDEO_READER | torchcodec | Video decoder backend |
| VIDEO_MAX_PIXELS | 200704 | Max pixels per frame (~448×448) |
| FPS | 2.0 | Frame sampling rate |
| FPS_MAX_FRAMES | 240 | Cap on total frames (covers ~2 min videos) |
| FPS_MIN_FRAMES | 4 | Floor for very short videos |
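The arithmetic behind these defaults can be sketched in a few lines. This is a simplified model, assuming the frame count is the duration times FPS clamped to the floor and cap. The actual sampler in qwen-vl-utils applies additional rounding, and sampled_frames is a hypothetical helper, not part of any library.

```python
def sampled_frames(duration_s, fps=2.0, min_frames=4, max_frames=240):
    """Approximate number of frames sampled from a clip of duration_s seconds.

    Defaults mirror the FPS, FPS_MIN_FRAMES and FPS_MAX_FRAMES values in the
    table; the real sampler may round differently.
    """
    return int(max(min_frames, min(max_frames, round(duration_s * fps))))


def effective_fps(duration_s, **kwargs):
    """Sampling rate actually achieved once the floor and cap are applied."""
    return sampled_frames(duration_s, **kwargs) / duration_s
```

Under this model a 30 s clip yields 60 frames, a 1 s clip is raised to the floor of 4, and a 10 min clip hits the cap of 240, so its effective rate drops from 2.0 to 0.4 frames per second. That is why the cap "covers ~2 min videos": 240 frames at 2.0 FPS is exactly 120 s. Likewise, 448 × 448 = 200704 pixels matches VIDEO_MAX_PIXELS.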
Capabilities
- Caption (Mode 1): produces Scene: <paragraph> + Events: <X.X - Y.Y> <description> format.
- Find (Mode 2): given a natural-language event query, returns From X.X to Y.Y..
- Multichunk reasoning (limited in this checkpoint): <think>-style chunked-video reasoning with explicit chunk-time → source-time arithmetic. Not directly exposed via .caption() / .find(); use a raw prompt if needed.
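When working with raw generate() output in caption mode, the Scene/Events text can be turned into a dict with a short parser. The sketch below assumes one event per line in the form `<start - end> description`, which matches the format described above but has not been verified against the exact whitespace the model emits. parse_caption is a hypothetical helper, not the parser in modeling_marlin.py.

```python
import re

# One event per line: "<12.0 - 15.5> A person opens the door."
_EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*(.+)")


def parse_caption(text):
    """Split caption-mode text into a scene paragraph and a list of events."""
    scene_part, _, events_part = text.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = [
        {
            "start": float(m.group(1)),
            "end": float(m.group(2)),
            "description": m.group(3).strip(),
        }
        for m in _EVENT_RE.finditer(events_part)
    ]
    return {"caption": text, "scene": scene, "events": events}
```

The returned keys deliberately mirror those of marlin.caption() (caption, scene, events), so code written against the helper method can consume either source.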
Training data
- Caption mode: ANet, LSMDC, YC2, COIN, GOT-10k/LaSOT — Gemini-generated dense captions.
- Find mode: HC-STVG, VidSTG, TimeLens — ground-truth spans + multichunk variants.
Advanced — raw inference
If you want to bypass the helper methods and call generate() directly (e.g., for custom prompts), the standard transformers pattern works:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
processor = AutoProcessor.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)

# One user turn: the video plus a free-form text prompt.
messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated text is decoded.
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)
```
Notes on output
The model emits a <think> token at the start of every response (an artifact of training with add_non_thinking_prefix=True). The .caption() and .find() methods strip this automatically. If you're using generate() directly, strip <think>...</think> (with or without closing tag) from the start of the output.
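One way to do that stripping is sketched below. It removes a leading `<think>` tag, together with everything up to `</think>` when a closing tag is present, and leaves other text untouched. strip_think is a hypothetical helper and may not match the helper methods' internal behaviour exactly.

```python
import re

# Leading <think>, optionally followed by content and a closing </think>.
_THINK_RE = re.compile(r"^\s*<think>(?:.*?</think>)?\s*", re.DOTALL)


def strip_think(text):
    """Remove a leading <think> or <think>...</think> prefix, if any."""
    return _THINK_RE.sub("", text, count=1)
```

Applied to the decoded text from generate(), this leaves just the answer, for example "From 14.3 to 18.2." in find mode.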