video-scan

A 2B-parameter video VLM for dense captioning and natural-language temporal grounding. Given a video, it produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to (start, end) time spans in the video.

This repository is a redistribution of NemoStation/Marlin-2B packaged for internal use. The weights are unmodified. The redistribution is licensed under the Business Source License 1.1; see LICENSE and NOTICE for terms and attribution. The internal Python module name (modeling_marlin.py) and class name (MarlinForConditionalGeneration) are preserved verbatim so that loading with trust_remote_code=True via auto_map continues to work without modification.

Capabilities

Caption mode: returns Scene: <paragraph> followed by Events: <X.X - Y.Y> <description> lines.
Find mode: given a natural-language event description, returns the matching time span in the form From X.X to Y.Y. (the trailing period is part of the output).
Multichunk reasoning (limited): <think>-style chunked-video reasoning with chunk-time to source-time arithmetic. Not exposed through the .caption() / .find() helpers — use a raw prompt to access it.
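
The chunk-time to source-time arithmetic mentioned above amounts to adding a chunk's offset in the source video to a timestamp measured inside that chunk. The sketch below only illustrates the idea. The function name chunk_to_source, the chunk length, and the assumption of equal-length, non-overlapping chunks are all hypothetical, since the upstream chunking scheme is not documented here.

```python
def chunk_to_source(chunk_index, chunk_start, chunk_end, chunk_len_s):
    """Map a (start, end) span inside one chunk to source-video seconds.

    Assumes (hypothetically) that chunks are equal-length, non-overlapping,
    and numbered from 0. The real chunking scheme may differ.
    """
    offset = chunk_index * chunk_len_s  # where this chunk begins in the source
    return (offset + chunk_start, offset + chunk_end)


# An event at 3.5-7.0 s inside the third 60-second chunk (index 2).
print(chunk_to_source(2, 3.5, 7.0, chunk_len_s=60.0))
```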

Architecture

Fine-tune of Qwen3.5-2B with the video-capable visual tower kept intact. Custom modeling code in modeling_marlin.py exposes two convenience methods (.caption() and .find()) that wrap a single canonical training prompt per mode and parse the structured output into typed Python dicts. Raw .generate() is also available for custom prompts.

Component	Value
Base model	Qwen/Qwen3.5-2B
Parameters	2.21B (text + vision combined)
Precision	bfloat16
Storage on disk	~5.5 GB
Architecture string	`MarlinForConditionalGeneration`
`model_type`	`qwen3_5`
Context length	262144 tokens

Training (upstream)

The following describes the upstream training pipeline as documented by NemoStation. We have not retrained or modified the weights.

Data: ~400K clip-level annotations assembled from ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, and TimeLens, with dense re-annotations distilled from Gemini-3-Flash and targeted human review on the highest-impact splits.
Stage 1: supervised fine-tuning on the curated corpus with a fixed canonical prompt per mode and Tarsier-schema output formatting.
Stage 2: SimPO (Simple Preference Optimization) on a teacher-distilled preference set, scored against Gemini-3-Flash on factual accuracy, completeness, and temporal alignment.
Hardware: single H100.
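
For readers unfamiliar with SimPO, the per-pair objective from the SimPO paper can be sketched in plain Python as below. It compares the length-normalized log-probabilities of the preferred and rejected responses against a target margin. The function name simpo_pair_loss and the beta and gamma defaults are illustrative assumptions. Upstream does not document its hyperparameters or training code.

```python
import math


def simpo_pair_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
                    beta=2.0, gamma=1.0):
    """SimPO loss for a single preference pair.

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  response lengths in tokens (used for length normalization).
    beta, gamma: illustrative values, not upstream's settings.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) equals softplus(-margin); choose the branch
    # whose exp() argument is non-positive so it cannot overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the preferred response's average per-token log-probability exceeds the rejected one's by more than gamma / beta.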

Evaluation (upstream-reported)

Upstream benchmarks Marlin-2B on three suites:

CaReBench — arXiv:2501.00513
DREAM-1K — arXiv:2407.00634
TimeLens-Bench — arXiv:2512.14698

Headline results reported by NemoStation: the model tops the CaReBench leaderboard at the 2B scale and scores +6.4 mIoU over Qwen2.5-VL-7B on TimeLens-Bench (Charades / ActivityNet / QVHighlights). These numbers have not been independently re-verified in this repository.
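
For context on the mIoU figure, temporal IoU between a predicted span and a ground-truth span is the length of their overlap divided by the length of their union, and mIoU is the mean over all queries. The sketch below shows that arithmetic only. The function names are hypothetical, and scoring an unparseable prediction as 0 is an assumption of this sketch. The exact TimeLens-Bench protocol may differ.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Mean temporal IoU. A prediction of None counts as IoU 0 (assumption)."""
    ious = [0.0 if p is None else temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious) if ious else 0.0
```

A predicted span of (0.0, 10.0) against a ground truth of (5.0, 15.0) overlaps for 5 seconds out of a 15-second union, giving an IoU of one third.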

Quickstart

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
model.compile()  # optional — wraps torch.compile, faster after first call

Caption mode

result = model.caption("video.mp4")

print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")

Optional kwargs:

max_new_tokens=2048 — generation token cap (default).
prompt=None — override the canonical training prompt. Almost always leave as None.
do_sample=False, temperature=1.0, top_p=1.0 — sampling controls.
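
If you obtain caption text outside the .caption() helper (for example from a raw generate() call), the Scene / Events layout can be parsed with a small regular expression. This is a minimal sketch, not the parser used in modeling_marlin.py. The function name parse_caption and the pattern are assumptions based on the format described in this README.

```python
import re

# Matches lines such as "<0.0 - 3.5> A person opens the fridge."
EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*(.+)")


def parse_caption(text):
    """Split raw caption text into a scene paragraph and a list of events."""
    scene_part, _, events_part = text.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = []
    for line in events_part.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:  # silently skip lines that do not follow the event format
            events.append({
                "start": float(match.group(1)),
                "end": float(match.group(2)),
                "description": match.group(3).strip(),
            })
    return {"scene": scene, "events": events}
```

Lines that do not match the event pattern are dropped rather than raising, so compare the number of parsed events with the raw text if completeness matters.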

Find mode

result = model.find("video.mp4", event="a person enters the room")

print(result["raw"])        # "From 14.3 to 18.2." raw model output
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format
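
When working with raw model output instead of the .find() helper, the span can be recovered with a regular expression. The sketch below is an assumption about how such parsing might look, not the helper's actual implementation. The name parse_span is hypothetical, and returning None for a reversed span is a choice made here.

```python
import re

# Matches the trained output format, e.g. "From 14.3 to 18.2."
SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)")


def parse_span(raw):
    """Return (start, end) in seconds, or None if the text does not parse."""
    match = SPAN_RE.search(raw)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Treat a reversed span as a parse failure (a choice of this sketch).
    return (start, end) if start <= end else None


print(parse_span("From 14.3 to 18.2."))
```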

Raw inference

To bypass the helper methods and call generate() directly:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
processor = AutoProcessor.from_pretrained(
    "cudabenchmarktest/video-scan", trust_remote_code=True
)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)

Output format notes

The model emits a <think> token at the start of every response (an artifact of training with add_non_thinking_prefix=True). The .caption() and .find() helpers strip this automatically. When calling generate() directly, strip any leading <think>...</think> block (with or without closing tag) from the output before parsing.
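
One way to do this stripping is sketched below. The function name strip_think and the pattern are assumptions rather than the code used by the helpers. When the closing tag is missing, this version removes only the opening <think> tag and the whitespace after it.

```python
import re

# Leading <think>, optionally followed by everything up to the first </think>.
THINK_RE = re.compile(r"^\s*<think>(?:.*?</think>)?\s*", re.DOTALL)


def strip_think(text):
    """Remove a leading <think> block, with or without a closing tag."""
    return THINK_RE.sub("", text, count=1)


print(strip_think("<think>\n\n</think>\n\nFrom 14.3 to 18.2."))
```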

Requirements

transformers >= 5.7.0 (for native qwen3_5 architecture)
torch >= 2.11.0
torchcodec (video decoding)
qwen-vl-utils >= 0.0.14
av (torchcodec system dependency)
pillow

pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow

Video preprocessing

The custom modeling code sets these environment variables internally to match the training-time setup. Override them in your shell before importing transformers if needed.

Env var	Default	Purpose
`FORCE_QWENVL_VIDEO_READER`	`torchcodec`	Video decoder backend
`VIDEO_MAX_PIXELS`	`200704`	Max pixels per frame (~448x448)
`FPS`	`2.0`	Frame sampling rate
`FPS_MAX_FRAMES`	`240`	Cap on total frames (~2 min at 2 FPS)
`FPS_MIN_FRAMES`	`4`	Floor for very short videos
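
To see how these settings interact, the sketch below estimates how many frames would be sampled from a video of a given duration. It is a rough approximation under stated assumptions: the function name estimated_frame_count is hypothetical, and the real sampler may additionally round the count (for example to a multiple of a frame factor), which is ignored here.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {"FPS": "2.0", "FPS_MAX_FRAMES": "240", "FPS_MIN_FRAMES": "4"}


def estimated_frame_count(duration_s, env=None):
    """Approximate number of sampled frames for a video of duration_s seconds."""
    env = os.environ if env is None else env
    fps = float(env.get("FPS", DEFAULTS["FPS"]))
    max_frames = int(env.get("FPS_MAX_FRAMES", DEFAULTS["FPS_MAX_FRAMES"]))
    min_frames = int(env.get("FPS_MIN_FRAMES", DEFAULTS["FPS_MIN_FRAMES"]))
    wanted = round(duration_s * fps)
    # Clamp the requested count between the floor and the cap.
    return max(min_frames, min(wanted, max_frames))
```

With the defaults, a 30-second video yields about 60 frames, while a 10-minute video is capped at 240. If those frames are spread uniformly, the effective rate for the longer video drops to roughly 0.4 FPS, which may matter for short events in long videos.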

License and attribution

This redistribution is licensed under the Business Source License 1.1. The full license text is in LICENSE. The Qwen3.5-2B base weights remain under Apache License 2.0 — see LICENSE-QWEN-BASE and NOTICE.

Key terms of BSL 1.1 as applied here:

Copy, modify, redistribute, and non-production use are permitted.
Production use is permitted except for offering this work to third parties on a hosted or embedded basis in a way that competes with NemoStation's paid version(s).
Internal organizational use is explicitly not a competitive offering.
On the Change Date (two years after upstream public release), the license converts to Apache License 2.0.

The "Marlin" name and any logos are trademarks of NemoStation and are not granted by this license. The class identifier MarlinForConditionalGeneration and the module name modeling_marlin.py are preserved only because auto_map requires them for trust_remote_code loading; they do not imply trademark use beyond technical interoperability.

Upstream source: https://huggingface.co/NemoStation/Marlin-2B