DriveBench: General-Purpose Driving Scene Encoder
Author: Nikhil Upadhyay | MSc Business Analytics | Dublin Business School
Project: PRECOG-AV
Overview
DriveBench is the first general-purpose driving scene encoder trained with safety-focused multi-task supervision across 25 countries and 298,326 real driving clips, the largest geographic scale in driving representation learning.
Each clip is encoded into a 256-dimensional DriveBench embedding that simultaneously captures danger context, geographic driving patterns, time-of-day risk, radar sensor health, and traffic density. Use these embeddings as you would ImageNet features, but for driving scenes.
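As a rough illustration of the ImageNet-features analogy, the sketch below fits a linear probe on frozen 256-dimensional vectors. It runs on synthetic stand-in data, not real DriveBench embeddings, and the helper names `fit_linear_probe` and `predict` are hypothetical. A closed-form ridge regression onto one-hot labels is only one simple choice of probe.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, l2=1.0):
    """Fit a ridge-regression probe onto one-hot labels (closed form)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    Y = np.eye(n_classes)[y]                         # one-hot targets
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])         # regularized Gram matrix
    return np.linalg.solve(A, Xb.T @ Y)              # shape (D + 1, n_classes)

def predict(W, X):
    """Predict the class with the largest probe output."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W).argmax(axis=1)

# Synthetic stand-in for frozen embeddings: two Gaussian blobs in 256-d.
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 256))
y = rng.integers(0, 2, size=400)
X = centers[y] + 0.5 * rng.normal(size=(400, 256))

# Train on the first 300 rows, evaluate on the remaining 100.
W = fit_linear_probe(X[:300], y[:300], n_classes=2)
acc = (predict(W, X[300:]) == y[300:]).mean()
print(f"held-out probe accuracy: {acc:.3f}")
```

With real data, `X` would be the DriveBench embeddings and `y` the labels of the downstream task; the encoder itself stays frozen.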
Results
| Task | Metric | Score | Random Baseline |
|---|---|---|---|
| Danger Anticipation | AUC | 0.8385 | 0.500 |
| Geographic Region | Accuracy | 0.4438 | 0.167 (6 classes) |
| Time of Day | Accuracy | 0.5168 | 0.250 (4 classes) |
| Radar Health | AUC | 1.0000 | 0.500 |
| TTC Regression | Pearson r | 0.3009 | 0.000 |
Tested on Greece and Bulgaria, two countries never seen during training.
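For readers who want to reproduce the kinds of metrics in the table, here is a minimal sketch of AUC, Pearson r, and the chance-level accuracy used as the random baseline. The function names are hypothetical, the toy inputs are made up, and the chance accuracy of 1/k assumes uniform guessing over k classes. This is not the project's evaluation code.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Uses a pairwise matrix, so it suits small samples.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

def chance_accuracy(n_classes):
    """Accuracy of uniform random guessing over n_classes."""
    return 1.0 / n_classes

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))       # 0.75
print(round(chance_accuracy(6), 3), chance_accuracy(4))   # 0.167 0.25
```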
What makes this different
Existing driving pre-training approaches (DriveWorld, DriveTok, GASP) use geometric proxy tasks (depth prediction, occupancy, reconstruction) on 1 to 3 cities.
DriveBench uses safety-relevant supervision signals across 25 countries:
- Danger labels from physics-based TTC analysis (not manual annotation)
- Radar sensor health as a training signal
- Geographic region (6 regions, 25 countries)
- Time-of-day risk patterns (peak danger 13:00-15:00 confirmed)
- Traffic density
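To make the first signal in the list concrete, the sketch below derives a danger label from time-to-collision (TTC) under a constant-velocity assumption. The source does not describe the actual labeling pipeline, so the function names, the constant-velocity model, and the 2.0 s threshold are all assumptions chosen for illustration.

```python
import math

def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity TTC in seconds; infinite when the gap is not closing."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed

# Hypothetical threshold: the value used for the real labels is not given in the source.
DANGER_TTC_S = 2.0

def danger_label(ttc_s, threshold_s=DANGER_TTC_S):
    """Return 1 when TTC falls below the threshold, else 0."""
    return int(ttc_s < threshold_s)

ttc = time_to_collision(gap_m=15.0, ego_speed_mps=20.0, lead_speed_mps=10.0)
print(ttc, danger_label(ttc))  # 1.5 1
```

Because labels of this kind come from kinematics alone, they can be generated for every clip without manual annotation.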
Architecture
ViT-B/16 features (5 frames × 768-dim)
        ↓
TransformerEncoder (3 layers, 8 heads, 2048 FFN)
        ↓
DriveBench Embedding (256-dim) ← use this downstream
        ↓
5 multi-task heads:
    Danger head → AUC 0.84
    Region head → Acc 0.44 (6 regions)
    Time-of-day → Acc 0.52 (4 buckets)
    Radar head → AUC 1.00
    TTC regression → r = 0.30
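The five heads in the diagram can be pictured as small output layers sharing the 256-dimensional embedding. The sketch below is a guess at that layout. The class `MultiTaskHeads`, the single-linear-layer heads, the choice of loss functions, and the unweighted sum are all assumptions, since the source gives neither the head architecture nor the loss weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """Hypothetical linear heads on top of a 256-dim scene embedding."""

    def __init__(self, embed_dim=256, n_regions=6, n_time_buckets=4):
        super().__init__()
        self.danger = nn.Linear(embed_dim, 1)               # binary logit
        self.region = nn.Linear(embed_dim, n_regions)       # 6-way logits
        self.time_of_day = nn.Linear(embed_dim, n_time_buckets)  # 4-way logits
        self.radar = nn.Linear(embed_dim, 1)                # binary logit
        self.ttc = nn.Linear(embed_dim, 1)                  # scalar regression

    def forward(self, z):
        return {
            "danger": self.danger(z).squeeze(-1),
            "region": self.region(z),
            "time_of_day": self.time_of_day(z),
            "radar": self.radar(z).squeeze(-1),
            "ttc": self.ttc(z).squeeze(-1),
        }

def multitask_loss(out, tgt):
    """Unweighted sum of per-task losses (the real weighting is an unknown)."""
    return (
        F.binary_cross_entropy_with_logits(out["danger"], tgt["danger"])
        + F.cross_entropy(out["region"], tgt["region"])
        + F.cross_entropy(out["time_of_day"], tgt["time_of_day"])
        + F.binary_cross_entropy_with_logits(out["radar"], tgt["radar"])
        + F.smooth_l1_loss(out["ttc"], tgt["ttc"])
    )

# Random embeddings and targets, only to check shapes and that the loss is finite.
torch.manual_seed(0)
z = torch.randn(8, 256)
targets = {
    "danger": torch.randint(0, 2, (8,)).float(),
    "region": torch.randint(0, 6, (8,)),
    "time_of_day": torch.randint(0, 4, (8,)),
    "radar": torch.randint(0, 2, (8,)).float(),
    "ttc": torch.rand(8) * 10,
}
heads = MultiTaskHeads()
outputs = heads(z)
loss = multitask_loss(outputs, targets)
print({k: tuple(v.shape) for k, v in outputs.items()}, float(loss))
```

Sharing one embedding across all heads is what pushes it to carry danger, region, time-of-day, radar, and TTC information at once.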
Usage
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download


class DriveBenchModel(nn.Module):
    """Temporal encoder: 5 frames of ViT-B/16 features -> 256-dim embedding."""

    # n_regions is accepted for compatibility but unused by the encoder defined here.
    def __init__(self, embed_dim=256, n_frames=5, n_regions=6):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, 768))
        self.pos_embed = nn.Embedding(n_frames + 1, 768)  # +1 for the CLS token
        layer = nn.TransformerEncoderLayer(
            d_model=768, nhead=8, dim_feedforward=2048,
            dropout=0.1, batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=3)
        self.norm = nn.LayerNorm(768)
        self.projector = nn.Sequential(
            nn.Linear(768, 512), nn.GELU(), nn.Dropout(0.15),
            nn.Linear(512, embed_dim), nn.LayerNorm(embed_dim))

    def encode(self, x):
        B = x.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)              # prepend CLS: (B, 6, 768)
        pos = torch.arange(x.shape[1], device=x.device)
        x = x + self.pos_embed(pos)
        x = self.norm(self.transformer(x))
        return self.projector(x[:, 0])              # project the CLS position


path = hf_hub_download("Trazemag/DriveBench", "drivebench_best.pt")
model = DriveBenchModel()
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ckpt = torch.load(path, map_location="cpu", weights_only=False)
# If the checkpoint also stores task-head weights that this class does not define,
# load_state_dict may need strict=False.
model.load_state_dict(ckpt["model_state"])
model.eval()

# Input: (batch, 5, 768) ViT-B/16 features from 5 consecutive frames
# Output: (batch, 256) DriveBench embedding, returned by model.encode(x)
# Use as features for any downstream driving task
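The encoder expects input of shape (batch, 5, 768), so per-frame features have to be grouped into runs of 5 consecutive frames first. The sketch below shows one way to do that with NumPy. The helper `make_clip_windows` and its `stride` parameter are hypothetical, and the source does not say how frames were sampled (frame rate or spacing), so treat the windowing policy as an assumption.

```python
import numpy as np

def make_clip_windows(frame_feats, n_frames=5, stride=1):
    """Group per-frame features (T, D) into windows of consecutive frames (N, n_frames, D)."""
    frame_feats = np.asarray(frame_feats)
    if frame_feats.shape[0] < n_frames:
        raise ValueError("need at least n_frames frames")
    win = np.lib.stride_tricks.sliding_window_view(frame_feats, n_frames, axis=0)
    # sliding_window_view puts the window axis last: (N, D, n_frames) -> (N, n_frames, D)
    win = win.transpose(0, 2, 1)
    # Keep every `stride`-th window and copy into a contiguous, writable array.
    return np.ascontiguousarray(win[::stride])

# Random stand-in for ViT-B/16 features of a 12-frame video.
feats = np.random.default_rng(0).normal(size=(12, 768)).astype(np.float32)
clips = make_clip_windows(feats)
print(clips.shape)  # (8, 5, 768)
```

The resulting array can be converted with `torch.from_numpy(clips)` and passed to `model.encode` inside a `torch.no_grad()` block.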
Pre-computed Embeddings
All 298,326 embeddings are precomputed; download and use them directly:
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "Trazemag/DriveBench-Embeddings",
    "drivebench_embeddings.npz",
    repo_type="dataset")
data = np.load(path)
embeddings = data["embeddings"]  # (298326, 256)
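One simple thing to do with the downloaded array is nearest-neighbour scene retrieval by cosine similarity. The sketch below uses a small random matrix in place of the real `embeddings` array, and the helper `top_k_similar` is hypothetical; whether cosine similarity is the best metric for these embeddings is not stated in the source.

```python
import numpy as np

def top_k_similar(embeddings, query, k=5):
    """Return indices and cosine similarities of the k rows closest to `query`."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = emb @ q                     # cosine similarity to every row
    idx = np.argsort(-sims)[:k]        # highest similarity first
    return idx, sims[idx]

# Random stand-in with the same width as the real (298326, 256) array.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1000, 256)).astype(np.float32)

idx, sims = top_k_similar(fake_embeddings, fake_embeddings[42], k=3)
print(idx[0])  # 42, since a vector is most similar to itself
```

For the full set of embeddings it is worth normalizing the matrix once and reusing it across queries.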
Training Data
Built on the NVIDIA PhysicalAI-AV dataset (gated; request access on Hugging Face).
Danger labels available at Trazemag/PRECOG-Labels.
Related Models
| Model | Task | Link |
|---|---|---|
| PRECOG-SENSE | Radar health from camera | Trazemag/PRECOG-SENSE |
| PRECOG-HERALD | Danger anticipation | Trazemag/PRECOG-HERALD |
| DriveBench | General scene encoder | This model |
Citation
@misc{upadhyay2026drivebench,
  title  = {DriveBench: General-Purpose Driving Scene Encoder
            via Multi-Task Safety-Focused Pre-training across 25 Countries},
  author = {Upadhyay, Nikhil},
  year   = {2026},
  url    = {https://github.com/TrazeMaG/PRECOG-AV}
}