RF-DETR-Temporal

RF-DETR-Temporal extends Roboflow's RF-DETR from a single-image detector into a multi-frame, motion-aware detector. It stacks three consecutive frames and fuses them with a small temporal pre-embedding module placed in front of the unmodified pretrained DINOv2 patch embed, so existing RF-DETR weights load verbatim and training starts at exact single-frame parity, then learns temporal cues as a residual.

It targets moving objects (e.g. smoke or fire) that are typically small, distant, and semi-transparent in surveillance video: precisely the regime where a single frame is weakest and inter-frame motion is the discriminative cue that a still-image detector throws away. The 9-channel design exists to exploit that motion.

Derived from roboflow/rf-detr (Apache-2.0); significant changes were made (see Attribution & license).

Pipeline

Figure: pipeline comparison (baseline vs preembed vs bgsub)

The three input / fusion configurations. Only the front (the input and the orange TemporalPreEmbed block) differs; the patch embed → DINOv2 → LW-DETR decoder chain downstream is identical to upstream RF-DETR and loads its pretrained weights verbatim.

At initialisation the residual branch R ≡ 0 and (for bgsub/bgsubcoh) the add-weight term vanishes on static input, so the backbone receives exactly the current frame, which makes the model identical to the single-frame baseline. The temporal contribution is learned from there.

What's different from upstream

| Aspect                 | Upstream RF-DETR                                                   | RF-DETR-Temporal                                                          |
|------------------------|--------------------------------------------------------------------|---------------------------------------------------------------------------|
| Input                  | 1 frame, (B,3,H,W)                                                 | 3 stacked frames, (B,9,H,W), last = current                               |
| Multi-channel          | widens the patch-embed conv (sums channels → averages frames)      | TemporalPreEmbed reduces 9→3 before the unmodified 3ch patch embed        |
| Pretrained patch embed | lossily widened                                                    | loaded unchanged; new module absorbed by strict=False                     |
| Init behaviour         | -                                                                  | exact single-frame parity; temporal learned as a residual                 |
| New config             | -                                                                  | in_channels, temporal_fusion ∈ {none, preembed, bgsub, bgsubcoh}          |
| Training               | PyTorch-Lightning stack                                            | standalone DDP script + manifest loader + temporal-aware augmentation     |
| Tooling                | -                                                                  | diagnostics/ suite (parity, motion-/size-stratified eval, …)              |

The detector, decoder, heads, loss, and matcher are unchanged; the only model change is the new module inside the backbone plus two config fields and a one-line gate in the weight loader.

How it works (short)

  • The pitfall. Naively widening the patch-embed Conv2d to 9 channels makes the embedding compute the temporal average of the frames: a low-pass / motion-blur op that destroys the change signal and feeds the backbone an out-of-distribution blurred image.
  • The fix: TemporalPreEmbed. Reduce 9→3 channels with a small motion-aware module before the unmodified 3-channel patch embed, so DINOv2/RF-DETR weights load verbatim and the model starts at single-frame parity. Three fusion modes (preembed, bgsub, bgsubcoh) trade off how the temporal/motion signal is injected.
  • The data lever. A size census showed the validation set was dominated by large instances with the small-object regime nearly empty, so the detection bottleneck was data, not architecture. A small-object copy-paste augmentation manufactures the missing small (optionally moving) targets.
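The averaging pitfall from the first bullet can be checked on a single pixel. The sketch below is illustrative only: it assumes the 9-channel kernel is built by tiling the pretrained RGB weights across the three frames and dividing by 3, which may differ from how upstream actually widens the conv. All function names are hypothetical.

```python
import math

def conv_response(weights, values):
    """Dot product of a 1x1 kernel with one pixel's channel values."""
    return sum(w * v for w, v in zip(weights, values))

def widen(weights_rgb, num_frames=3):
    """ASSUMPTION: tile the RGB weights over frames and rescale by 1/num_frames."""
    return [w / num_frames for _ in range(num_frames) for w in weights_rgb]

def temporal_mean(frames):
    """Per-channel average over frames."""
    n = len(frames)
    return [sum(f[c] for f in frames) / n for c in range(len(frames[0]))]

def widened_response(weights_rgb, frames):
    """Response of the widened 9-channel kernel on three stacked RGB pixels."""
    stacked = [v for f in frames for v in f]  # 9 values, frame-major order
    return conv_response(widen(weights_rgb, len(frames)), stacked)

w_rgb = [0.5, -0.25, 1.0]  # arbitrary stand-in for pretrained weights

# A bright blob crosses this pixel in the middle frame only ...
moving = [[0.0, 0.0, 0.0], [0.9, 0.9, 0.9], [0.0, 0.0, 0.0]]
# ... versus a static, uniformly dim pixel with the same temporal mean.
static = [[0.3, 0.3, 0.3], [0.3, 0.3, 0.3], [0.3, 0.3, 0.3]]

r_moving = widened_response(w_rgb, moving)
r_static = widened_response(w_rgb, static)
r_mean = conv_response(w_rgb, temporal_mean(moving))

# The widened kernel cannot tell the moving pixel from the static one, and its
# output equals the original 3-channel kernel applied to the averaged frame.
print(math.isclose(r_moving, r_static), math.isclose(r_moving, r_mean))
```

Under this assumption the widened embedding is exactly the original embedding of the time-averaged image, so any change between frames is lost before the backbone sees it.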

The only architectural change is the front module. In preembed it is a zero-init residual on motion (frame) differences. It adds nothing at initialisation, so the model starts exactly at the single-frame baseline, and learns the temporal cue as a residual from there:

Figure: TemporalPreEmbed (preembed), detailed schematic
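As a rough per-pixel illustration of the zero-init residual idea, the toy below adds a linear function of frame differences to the current frame, with the linear weights initialised to zero. It is not the repository's TemporalPreEmbed: the class name, the choice of differences (current minus each earlier frame), and the single linear layer are all assumptions made for the sketch.

```python
def frame_diffs(frames):
    """Differences of the current (last) frame against each earlier frame."""
    current = frames[-1]
    return [[c - p for c, p in zip(current, prev)] for prev in frames[:-1]]

class ZeroInitResidual:
    """Toy fusion for one pixel: out = current + W @ diffs, with W starting at 0."""

    def __init__(self, num_diff_values=6, out_channels=3):
        # Zero initialisation: the residual branch contributes nothing at first.
        self.weight = [[0.0] * num_diff_values for _ in range(out_channels)]

    def __call__(self, frames):
        current = frames[-1]
        d = [v for diff in frame_diffs(frames) for v in diff]  # 6 values
        residual = [sum(w * x for w, x in zip(row, d)) for row in self.weight]
        return [c + r for c, r in zip(current, residual)]

# Three RGB values for one pixel; the last entry is the current frame.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

fuse = ZeroInitResidual()
out_at_init = fuse(frames)      # identical to the current frame

fuse.weight[0][0] = 0.5         # pretend training moved one weight off zero
out_after = fuse(frames)        # channel 0 now carries a motion term
```

At initialisation the 3-channel output equals the current frame, which is what lets the downstream pretrained patch embed see in-distribution input; once a weight moves away from zero, the output picks up a term proportional to the frame difference.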

Full details: docs/architecture.md · docs/temporal-fusion.md · docs/training.md · docs/findings.md

Quick start

pip install uv && uv sync --all-groups        # PyTorch >=2.2,<3; transformers >=5,<6; Python >=3.10

# provide your data as two manifests, data_manifests/{train,valid}.txt, with one clip per line:
#   /abs/frame0|/abs/frame1|/abs/frame2|<labels>
# where <labels> are YOLO "cls cx cy w h" boxes for the LAST (current) frame. See docs/training.md.

# train on 4 GPUs (coherence-gated motion add-weight + moving small-object augmentation)
DATA_DIR=data_manifests NUM_GPUS=4 RESOLUTION=952 \
TEMPORAL_FUSION=bgsubcoh AUG_SMALLOBJ_P=0.5 AUG_SMALLOBJ_MOTION=14 \
uv run --no-sync python train_temporal_base_v4.py

Env-var reference, data format, ONNX export and inference: docs/training.md.
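The manifest line format described in the Quick start comments can be read with a few lines of Python. This is a minimal sketch, not the repository's loader: `Clip` and `parse_manifest_line` are hypothetical names, and it assumes that multiple boxes are written as whitespace-separated groups of five numbers and that an empty labels field means no boxes. docs/training.md is the authority on the real format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, float, float, float, float]  # cls, cx, cy, w, h (normalised)

@dataclass
class Clip:
    frames: Tuple[str, ...]  # three frame paths, last = current frame
    boxes: List[Box]         # YOLO boxes for the last (current) frame

def parse_manifest_line(line: str) -> Clip:
    """Parse '/abs/frame0|/abs/frame1|/abs/frame2|<labels>' into a Clip."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 3 frame paths and a labels field, got {len(parts)} fields")
    *frame_paths, labels = parts

    # ASSUMPTION: boxes are flat, whitespace-separated groups of 5 tokens.
    tokens = labels.split()
    if len(tokens) % 5 != 0:
        raise ValueError("labels must contain a multiple of 5 values")

    boxes: List[Box] = []
    for i in range(0, len(tokens), 5):
        cls, cx, cy, w, h = tokens[i:i + 5]
        boxes.append((int(cls), float(cx), float(cy), float(w), float(h)))
    return Clip(frames=tuple(frame_paths), boxes=boxes)

clip = parse_manifest_line(
    "/abs/frame0|/abs/frame1|/abs/frame2|0 0.5 0.5 0.1 0.2 1 0.25 0.75 0.05 0.05"
)
empty = parse_manifest_line("/abs/frame0|/abs/frame1|/abs/frame2|")
```

Because `|` is the field separator, this sketch would mis-split any frame path that itself contains a `|` character.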

Results

Accuracy is class-averaged mAP@0.5. "Aggregate" is on the real validation set (dominated by large instances); "small-object" is on a deterministic synthetic small-object set, since the real data has almost no small instances (the data bottleneck; see below).

| Configuration                           | Aggregate mAP@0.5 | Synthetic small-object mAP@0.5 |
|-----------------------------------------|-------------------|--------------------------------|
| single-frame baseline                   | 0.875             | -                              |
| naive 9-channel (temporal averaging)    | ~0.892            | -                              |
| preembed (temporal)                     | 0.907             | ~0.06                          |
| preembed + small-object augmentation    | ~0.88             | ~0.80                          |
| bgsub (plain motion add-weight)         | 0.891             | -                              |
| bgsubcoh + moving-object augmentation   | in progress       | in progress                    |

Reading: the temporal fix recovers and exceeds the single-frame baseline; the plain motion add-weight raises motion-region attention ~3× but did not improve aggregate detection over preembed (0.891 vs 0.907), an honest negative result; and the real lever for the moving/small regime was data: synthetic small-object augmentation lifts small-object mAP from ≈0.06 to ≈0.80. Full record, caveats, and the motion-stratified breakdown: docs/findings.md.
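The small-object copy-paste augmentation credited above can be pictured with a toy version that pastes one small patch into all three frames along a straight line and emits a YOLO box for the last frame. This is an assumption-laden sketch, not the training script's augmentation: it uses grayscale frames as nested lists, a hard paste with no blending, and a hypothetical per-frame shift `(dx, dy)` in pixels whose relation to AUG_SMALLOBJ_MOTION is not specified here.

```python
def paste_moving_patch(frames, patch, x_last, y_last, dx, dy):
    """Paste `patch` into every frame, in place.

    The last (current) frame gets the patch with its top-left corner at
    (x_last, y_last); each earlier frame gets it shifted back by (dx, dy)
    per frame step. Returns the normalised YOLO box (cx, cy, w, h) of the
    patch in the last frame.
    """
    height, width = len(frames[0]), len(frames[0][0])
    ph, pw = len(patch), len(patch[0])
    n = len(frames)
    for t, frame in enumerate(frames):
        steps_back = n - 1 - t
        x0 = x_last - steps_back * dx
        y0 = y_last - steps_back * dy
        for r in range(ph):
            for c in range(pw):
                y, x = y0 + r, x0 + c
                if 0 <= y < height and 0 <= x < width:  # clip at the borders
                    frame[y][x] = patch[r][c]
    return ((x_last + pw / 2) / width, (y_last + ph / 2) / height,
            pw / width, ph / height)

# Three blank 8x16 frames and a 2x2 "object".
frames = [[[0.0] * 16 for _ in range(8)] for _ in range(3)]
patch = [[1.0, 1.0], [1.0, 1.0]]

box = paste_moving_patch(frames, patch, x_last=10, y_last=3, dx=4, dy=0)
```

With `dx=0, dy=0` the same function produces a static pasted object, which corresponds to the "optionally moving" wording in the description of the augmentation.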

Speed

The temporal extension is near-zero overhead: the DINOv2 backbone and LW-DETR decoder are unchanged, and the fusion module adds just 438 parameters (0.001% of the model), a few convolutions at input resolution. Per-inference network latency therefore matches upstream RF-DETR Base (real-time class) on the same hardware; the only added runtime cost is decoding 3 frames instead of 1 per inference (I/O, not compute). Absolute FPS is not separately benchmarked here.

Repository layout

src/rfdetr/                              # upstream RF-DETR (Apache-2.0), minimally modified
└── models/backbone/temporal_fusion.py  # ★ TemporalPreEmbed (9ch→3ch motion fusion)
train_temporal_base_v4.py               # ★ DDP training entrypoint + dataset + augmentations
diagnostics/                            # ★ probes + stratified evaluators
export_onnx.py                          # ONNX export (9-channel temporal model)
docs/                                   # detailed design / training / findings

Generated locally and git-ignored: runs/, onnx_exports/, data_manifests/, and all *.pth/*.onnx/*.mp4 artifacts (see .gitignore).

Attribution & license

Derivative work of RF-DETR by Roboflow, licensed under the Apache License 2.0, and itself released under Apache-2.0. The upstream LICENSE is retained and all upstream source files keep their original license headers. Per Apache §4, this notice states that significant changes were made: a new temporal pre-embedding module and fusion modes, 9-channel input wiring, a small-object augmentation, and standalone training/diagnostics tooling. The DINOv2-with-Registers backbone code is itself derived from HuggingFace Transformers (see that file's header).
