video-scan

A 2B-parameter video VLM for dense captioning and natural-language temporal grounding. Given a video, it produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to (start, end) time spans in the video.

This repository is a redistribution of NemoStation/Marlin-2B packaged for internal use. Weights are unmodified. Licensed under the Business Source License 1.1 — see LICENSE and NOTICE for terms and attribution. The internal Python module name (modeling_marlin.py) and class name (MarlinForConditionalGeneration) are preserved verbatim so that trust_remote_code=True loading via auto_map continues to work without modification.

Capabilities

  • Caption mode: returns Scene: <paragraph> followed by Events: <X.X - Y.Y> <description> lines.
  • Find mode: given a natural-language event description, returns the matching time span in the form From X.X to Y.Y. (the trailing period is part of the output).
  • Multichunk reasoning (limited): <think>-style chunked-video reasoning with chunk-time to source-time arithmetic. Not exposed through the .caption() / .find() helpers — use a raw prompt to access it.
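
The two output formats above can be parsed with a few regular expressions. The following is a minimal sketch, not the parser shipped in modeling_marlin.py: the function names parse_caption and parse_span are hypothetical, and the exact whitespace and line layout of real model output is an assumption based only on the format strings listed above.

```python
import re

# Assumed layout: one event per line, "<start - end> description".
EVENT_RE = re.compile(r"^\s*<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*(.+?)\s*$")
# Assumed layout: "From X.X to Y.Y." with an optional trailing period.
SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\.?")


def parse_caption(text):
    """Split caption-mode text into a scene paragraph and a list of events."""
    scene_part, _, events_part = text.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = []
    for line in events_part.splitlines():
        match = EVENT_RE.match(line)
        if match:  # silently skip lines that do not look like events
            events.append({
                "start": float(match.group(1)),
                "end": float(match.group(2)),
                "description": match.group(3),
            })
    return {"scene": scene, "events": events}


def parse_span(text):
    """Return (start, end) in seconds from find-mode text, or None if absent."""
    match = SPAN_RE.search(text)
    if match is None:
        return None
    return (float(match.group(1)), float(match.group(2)))


sample = (
    "Scene: A kitchen.\n"
    "Events:\n"
    "<0.0 - 3.5> A person opens the fridge.\n"
    "<3.5 - 7.0> They pour milk."
)
parsed = parse_caption(sample)
span = parse_span("From 14.3 to 18.2.")
```

This is mainly useful when calling generate() directly; the .caption() and .find() helpers already return parsed results.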

Architecture

Fine-tune of Qwen3.5-2B with the video-capable visual tower kept intact. Custom modeling code in modeling_marlin.py exposes two convenience methods (.caption() and .find()) that wrap a single canonical training prompt per mode and parse the structured output into typed Python dicts. Raw .generate() is also available for custom prompts.

Component            Value
Base model           Qwen/Qwen3.5-2B
Parameters           2.21B (text + vision combined)
Precision            bfloat16
Storage on disk      ~5.5 GB
Architecture string  MarlinForConditionalGeneration
model_type           qwen3_5
Context length       262144 tokens

Training (upstream)

The following describes the upstream training pipeline as documented by NemoStation. We have not retrained or modified the weights.

  • Data: ~400K clip-level annotations assembled from ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, and TimeLens, with dense re-annotations distilled from Gemini-3-Flash and targeted human review on the highest-impact splits.
  • Stage 1: supervised fine-tuning on the curated corpus with a fixed canonical prompt per mode and Tarsier-schema output formatting.
  • Stage 2: SimPO (Simple Preference Optimization) on a teacher-distilled preference set, scored against Gemini-3-Flash on factual accuracy, completeness, and temporal alignment.
  • Hardware: single H100.

Evaluation (upstream-reported)

Upstream benchmarks Marlin-2B on three suites:

Headline numbers reported by NemoStation: Marlin-2B tops the CaReBench leaderboard at the 2B scale and scores +6.4 mIoU over Qwen2.5-VL-7B on TimeLens-Bench (Charades / ActivityNet / QVHighlights). These numbers have not been independently re-verified in this repository.

Quickstart

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
model.compile()  # optional — wraps torch.compile, faster after first call

Caption mode

result = model.caption("video.mp4")

print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")

Optional kwargs:

  • max_new_tokens=2048 — generation token cap (default).
  • prompt=None — override the canonical training prompt. Almost always leave as None.
  • do_sample=False, temperature=1.0, top_p=1.0 — sampling controls.

Find mode

result = model.find("video.mp4", event="a person enters the room")

print(result["raw"])        # "From 14.3 to 18.2." raw model output
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format
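
Because span can be None and the model may occasionally return an implausible span, downstream code may want to validate the result before using it. The sketch below assumes only the three result keys shown above (raw, span, format_ok); the usable_span function and its video_duration parameter are hypothetical and not part of the model's API.

```python
def usable_span(result, video_duration=None):
    """Return a sanity-checked (start, end) tuple, or None if the result is unusable.

    `result` is a dict shaped like the .find() return value shown above.
    `video_duration` (seconds) is an optional, hypothetical extra used to clamp the end.
    """
    span = result.get("span")
    if span is None or not result.get("format_ok", False):
        return None
    start, end = span
    if start < 0 or end <= start:
        return None  # reject negative or empty/reversed spans
    if video_duration is not None:
        end = min(end, video_duration)
        if end <= start:
            return None  # span lies entirely past the end of the video
    return (start, end)


ok = {"raw": "From 14.3 to 18.2.", "span": (14.3, 18.2), "format_ok": True}
bad = {"raw": "unparseable output", "span": None, "format_ok": False}

print(usable_span(ok))
print(usable_span(ok, video_duration=16.0))
print(usable_span(bad))
```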

Raw inference

To bypass the helper methods and call generate() directly:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
processor = AutoProcessor.from_pretrained(
    "cudabenchmarktest/video-scan", trust_remote_code=True
)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)

Output format notes

The model emits a <think> token at the start of every response (an artifact of training with add_non_thinking_prefix=True). The .caption() and .find() helpers strip this automatically. When calling generate() directly, strip any leading <think>...</think> block (with or without closing tag) from the output before parsing.
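
One possible way to strip the prefix from raw generate() output is sketched below. This is an illustration, not the logic used inside the helpers: the strip_think name is hypothetical, and the handling of an unclosed tag (only the bare tag is removed) is an assumption about what the output looks like in that case.

```python
import re

# Leading <think>, optionally followed by anything up to the first </think>.
THINK_RE = re.compile(r"^\s*<think>(?:.*?</think>)?\s*", re.DOTALL)


def strip_think(text):
    """Remove a leading <think>...</think> block, or a bare leading <think> tag."""
    return THINK_RE.sub("", text, count=1)


closed = strip_think("<think>\n\n</think>\n\nFrom 1.0 to 2.0.")
unclosed = strip_think("<think>From 1.0 to 2.0.")
untouched = strip_think("Scene: A kitchen.")
```

Text without a leading <think> tag passes through unchanged, so the function is safe to apply to every response.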

Requirements

  • transformers >= 5.7.0 (for native qwen3_5 architecture)
  • torch >= 2.11.0
  • torchcodec (video decoding)
  • qwen-vl-utils >= 0.0.14
  • av (torchcodec system dependency)
  • pillow

pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow

Video preprocessing

The custom modeling code sets these environment variables internally to match the training-time setup. Override them in your shell before importing transformers if needed.

Env var                    Default     Purpose
FORCE_QWENVL_VIDEO_READER  torchcodec  Video decoder backend
VIDEO_MAX_PIXELS           200704      Max pixels per frame (~448x448)
FPS                        2.0         Frame sampling rate
FPS_MAX_FRAMES             240         Cap on total frames (~2 min at 2 FPS)
FPS_MIN_FRAMES             4           Floor for very short videos
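
To get a feel for how these defaults interact, the frame budget for a given video length can be estimated as below. This is a rough approximation under stated assumptions: the function names are hypothetical, and the real sampling logic in qwen-vl-utils may round frame counts differently, so treat the results as estimates rather than exact values.

```python
def sampled_frame_count(duration_s, fps=2.0, min_frames=4, max_frames=240):
    """Estimate how many frames are sampled for a video of `duration_s` seconds.

    Defaults mirror FPS, FPS_MIN_FRAMES and FPS_MAX_FRAMES from the table above.
    """
    wanted = int(duration_s * fps)
    return max(min_frames, min(wanted, max_frames))


def effective_fps(duration_s, **kwargs):
    """Estimated frames per second of source video once the cap or floor applies."""
    return sampled_frame_count(duration_s, **kwargs) / duration_s


print(sampled_frame_count(30))    # short clip: sampled at the nominal rate
print(sampled_frame_count(1))     # very short clip: floor applies
print(sampled_frame_count(600))   # 10-minute video: cap applies
print(effective_fps(600))         # sampling becomes sparser than 2 FPS
```

Past roughly two minutes the cap takes over, so longer videos are sampled more sparsely and timestamps for brief events may be less precise.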

License and attribution

This redistribution is licensed under the Business Source License 1.1. The full license text is in LICENSE. The Qwen3.5-2B base weights remain under Apache License 2.0 — see LICENSE-QWEN-BASE and NOTICE.

Key terms of BSL 1.1 as applied here:

  • Copy, modify, redistribute, and non-production use are permitted.
  • Production use is permitted except for offering this work to third parties on a hosted or embedded basis in a way that competes with NemoStation's paid version(s).
  • Internal organizational use is explicitly not a competitive offering.
  • On the Change Date (two years after upstream public release), the license converts to Apache License 2.0.

The "Marlin" name and any logos are trademarks of NemoStation and are not granted by this license. The class identifier MarlinForConditionalGeneration and the module name modeling_marlin.py are preserved only because auto_map requires them for trust_remote_code loading; they do not imply trademark use beyond technical interoperability.

Upstream source: https://huggingface.co/NemoStation/Marlin-2B
