ObjectForesight-DiT (EPIC-KITCHENS-100)

📄 Paper (arXiv:2601.05237) · 📦 Dataset: raivn/ObjectForesight-EPIC · 🛠️ Code: RustinS/ObjectForesight

The main model from ObjectForesight, a 3D object-centric dynamics model that predicts H=8 future 6-DoF object poses from a single egocentric observation (scene point cloud + the object's recent pose context). This is the DiT (diffusion-transformer) variant trained on EPIC-KITCHENS-100.

ObjectForesight architecture

Model

PoserV1 = PTv3 scene encoder (PointTransformer V3 / Sonata, 50.6M) + DiT diffusion temporal head (132.7M) → 183.25M params.

| Component | Details |
| --- | --- |
| Encoder | PTv3, embed_dim=768, in_channels=6 (camera-xyz ⊕ object-centric-xyz), attn_obj pooling, voxel grid 0.005 m |
| Temporal head | DiT, 12 layers / 768-d / 12 heads, adaln_zero conditioning, cosine β-schedule, v-prediction, T=1000, 50 DDIM steps |
| Input | scene point cloud [N,3] (depth-lifted, voxel-downsampled to ~4096 pts) + context_len=3 frames of [t(3), rot6d(6)] + bbox + object-in-camera pose |
| Output | [H=8, 9] future poses, [t_x, t_y, t_z, rot6d(6)] per frame; 6D rotation → SO(3) via Gram-Schmidt |
| Training data | raivn/ObjectForesight-EPIC, frame_skips=0, IoU-drop filtering |
| Checkpoint | epoch 134 / step 22k; batch 128; AdamW, cosine LR 2e-4 → 1e-5, warmup 500, wd 0.01 |
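
The Output row says each 9-D pose carries a 6D rotation that is mapped to SO(3) via Gram-Schmidt. The sketch below shows that mapping in plain Python. It assumes the six numbers are two stacked 3-vectors that become the first two columns of the rotation matrix; the repository may order them differently (for example as rows), and the function names here are hypothetical, not the repo's API.

```python
import math


def rot6d_to_matrix(r6):
    """Map a 6D rotation vector to a 3x3 rotation matrix via Gram-Schmidt.

    Assumption: r6 = [a1 (3 values), a2 (3 values)], two unnormalized vectors
    that become the first two columns of R.
    """
    a1, a2 = list(r6[:3]), list(r6[3:6])

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def normalize(u):
        n = math.sqrt(dot(u, u))
        return [x / n for x in u]

    b1 = normalize(a1)
    # Remove the component of a2 that lies along b1, then normalize.
    proj = dot(b1, a2)
    b2 = normalize([y - proj * x for x, y in zip(b1, a2)])
    # The cross product b1 x b2 completes a right-handed orthonormal frame.
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    # Columns of R are b1, b2, b3.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def pose9d_to_matrix(p):
    """Split one pose [t_x, t_y, t_z, rot6d(6)] into (translation, R)."""
    return list(p[:3]), rot6d_to_matrix(p[3:9])
```

Applying `pose9d_to_matrix` to each of the 8 rows of a predicted `[8, 9]` trajectory would give one translation and one rotation matrix per future frame.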

Reported metrics (EPIC-KITCHENS-100, from the paper)

ADE/FDE = average/final translation displacement error (m, ↓); ARE/FRE = average/final rotation error (°, ↓); DES/RES = error slope over the horizon (↓).

| Model | ADE | FDE | DES | ARE | FRE | RES |
| --- | --- | --- | --- | --- | --- | --- |
| ObjectForesight-DiT (this model) | 0.019 | 0.035 | 0.005 | 7.98° | 13.93° | 1.86° |
| ObjectForesight-AR (baseline) | 0.067 | 0.074 | 0.002 | 9.48° | 12.58° | 0.93° |
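
For readers who want to compute comparable numbers, the metric definitions above can be sketched as follows. ADE/FDE and the geodesic rotation angle are standard; treating DES/RES as a least-squares slope of per-step error against the step index is an assumption about the paper's definition, and all function names are hypothetical. The paper's evaluation code remains the reference.

```python
import math


def displacement_errors(pred_t, gt_t):
    """Per-step Euclidean error (m), plus ADE and FDE, for one trajectory."""
    per_step = [math.dist(p, g) for p, g in zip(pred_t, gt_t)]
    ade = sum(per_step) / len(per_step)  # average over the horizon
    fde = per_step[-1]                   # error at the final step
    return per_step, ade, fde


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_pred^T R_gt) equals the sum of element-wise products.
    tr = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def error_slope(per_step):
    """Least-squares slope of error vs. step index (assumed reading of DES/RES).

    Needs at least two steps.
    """
    n = len(per_step)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(per_step) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, per_step))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

ARE and FRE would then be the mean and the last value of `rotation_error_deg` evaluated at each of the 8 future steps.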

Files

| File | Size | Description |
| --- | --- | --- |
| model.safetensors | 0.73 GB | Inference weights, pickle-free. 183.25M params, fp32. |
| best.pt | 0.73 GB | Same weights as a torch checkpoint (state_dict + training metrics) for the repo's loader. |
| config.yaml | n/a | Exact architecture + data-preprocessing recipe that defines this model. |
| architecture.png | n/a | Model diagram. |
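
As a quick sanity check on a download, the listed file size follows from the parameter count and dtype. The helper below is a hypothetical illustration that assumes decimal gigabytes and ignores the small header and metadata overhead.

```python
def checkpoint_size_gb(num_params, bytes_per_param=4):
    """Approximate raw tensor payload in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 183.25M fp32 parameters at 4 bytes each
print(round(checkpoint_size_gb(183.25e6), 2))  # 0.73
```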

Usage

This is a weights-only release; the model definition lives in RustinS/ObjectForesight. It needs CUDA-compiled deps for the PTv3 encoder:

pip install spconv-cu124          # match your CUDA (e.g. 12.4)
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.8.0+cu124.html
pip install flash-attn --no-build-isolation   # optional; falls back to SDPA if absent
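
Before building the model, it can help to confirm which of these packages are visible to the interpreter. The sketch below uses only the standard library. The import names (`spconv`, `torch_scatter`, `flash_attn`) are the usual module names for those pip packages, and the helper itself is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the names in `names` that cannot be found by the import system."""
    return [n for n in names if importlib.util.find_spec(n) is None]


REQUIRED = ["torch", "safetensors", "spconv", "torch_scatter"]
OPTIONAL = ["flash_attn"]  # per the note above, the model falls back to SDPA without it

if __name__ == "__main__":
    print("missing required:", missing_modules(REQUIRED))
    print("missing optional:", missing_modules(OPTIONAL))
```

This only checks that the modules can be located. It does not verify that the installed wheels match your CUDA or torch version.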

Load the weights into PoserV1 (built from config.yaml):

import torch
from safetensors.torch import load_file
# from objectforesight repo:
from src.models.poser_v1.builder import build_poser_v1
from src.utils.config_adapter import apply_config_adapter   # builds the model cfg

model = build_poser_v1(**model_cfg)          # model_cfg from config.yaml (encoder + temporal)
sd = load_file("model.safetensors")          # raw inference weights
model.load_state_dict(sd, strict=False)      # only the tied `dit.*` alias is reported missing
model.eval().cuda()

# One observation -> 8 future 6-DoF poses.
# `batch` comes from the dataset loader. `ctx_9d` is not defined in this snippet;
# it is presumably the batch's context_init_9d tensor (an assumption, check the repo).
with torch.no_grad():
    cond = model.condition_from_batch(batch)
    future = model.sample(cond["scene_pcd"], cond["context_vec"],
                          T_cam_anchor_obj=cond["T_cam_anchor_obj"],
                          steps=50, sampler="ddim", ctx_tokens_9d=ctx_9d)   # -> [B, 8, 9]
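
The sampling call runs 50 DDIM steps over a T=1000 cosine schedule with v-prediction. The scalar sketch below illustrates what those three settings mean. It is not the repository's sampler: the offset `s=0.008`, the evenly spaced timestep subsequence, and every function name are assumptions, and the real model works on `[B, 8, 9]` tensors rather than scalars.

```python
import math

T = 1000  # number of training timesteps, from the model table


def alpha_bar(t, T=T, s=0.008):
    """Cumulative signal level under a cosine schedule (s is an assumed offset)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def ddim_timesteps(num_steps=50, T=T):
    """Evenly spaced timesteps in descending order (assumed spacing)."""
    stride = T // num_steps
    return list(range(0, T, stride))[::-1]


def v_to_x0_eps(x_t, v, ab):
    """Recover the clean sample and the noise from a v-prediction.

    Uses x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps
    and  v   = sqrt(ab) * eps - sqrt(1 - ab) * x0.
    """
    a, b = math.sqrt(ab), math.sqrt(1.0 - ab)
    return a * x_t - b * v, b * x_t + a * v


def ddim_step(x_t, v, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0) from level ab_t to ab_prev."""
    x0, eps = v_to_x0_eps(x_t, v, ab_t)
    return math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * eps
```

With 50 steps the network is evaluated at every 20th training timestep, which is why sampling costs 50 forward passes of the temporal head rather than 1000.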

Inputs come from the companion dataset raivn/ObjectForesight-EPIC, whose bundled loader (SceneSequenceDataset) produces the exact batch contract above (scene_pcd, context_init_9d, context_bbox_norm, context_T_cam_anchor_obj).
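
If you bring your own depth-lifted points instead of using the bundled loader, the scene cloud needs the voxel downsampling described in the Model section (0.005 m grid). The sketch below shows one common variant, averaging the points in each voxel. Whether the repository keeps centroids or a representative point, and how it reaches ~4096 points afterwards, is not stated here, so treat the function as a hypothetical illustration.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel=0.005):
    """Average all points that fall into the same cubic voxel of side `voxel` (m)."""
    buckets = defaultdict(list)
    for p in points:
        # Integer voxel index along each axis.
        key = tuple(math.floor(c / voxel) for c in p)
        buckets[key].append(p)
    # One centroid per occupied voxel, in order of first occurrence.
    return [
        [sum(q[i] for q in pts) / len(pts) for i in range(3)]
        for pts in buckets.values()
    ]
```

The output size depends on scene extent, so a further sampling or padding step (not shown) would be needed to hit a fixed point count.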

best.pt loads via the repo's own utilities with no missing/unexpected keys (with `model` built as in the snippet above):

from src.models.poser_v1.utils.checkpoint import resolve_and_load_state_dict
sd, _ = resolve_and_load_state_dict("best.pt", map_location="cpu", prefer_ema=False)
model.load_state_dict(sd, strict=False)
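
Because both snippets pass `strict=False`, a mismatched checkpoint would load silently. A small check on the result makes the expectations stated above explicit: only `dit.*` keys missing for model.safetensors, nothing missing for best.pt. The helper is hypothetical and works on plain lists of key names.

```python
def unexplained_keys(missing, unexpected, allowed_missing_prefixes=("dit.",)):
    """Return keys from a non-strict load that the model card does not account for."""
    prefixes = tuple(allowed_missing_prefixes)
    # Missing keys are acceptable only if they start with an allowed prefix.
    bad_missing = [k for k in missing if not k.startswith(prefixes)]
    # Unexpected keys are never acceptable.
    return bad_missing + list(unexpected)
```

In PyTorch, `load_state_dict` returns an object with `missing_keys` and `unexpected_keys`, so a typical use would be `result = model.load_state_dict(sd, strict=False)` followed by `assert not unexplained_keys(result.missing_keys, result.unexpected_keys)`. For best.pt, pass `allowed_missing_prefixes=()` so that any missing key is flagged.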

License & attribution

Released under CC BY-NC 4.0, inherited from EPIC-KITCHENS-100 (the model is trained on derivatives of that data). Non-commercial research use only. You must cite ObjectForesight and EPIC-KITCHENS-100 and comply with the EPIC-KITCHENS terms. Do not use this model to identify or infer private information about individuals depicted in the source video.

Citation

@article{soraki2026objectforesight,
  title   = {ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos},
  author  = {Soraki, Rustin and Bharadhwaj, Homanga and Farhadi, Ali and Mottaghi, Roozbeh},
  journal = {arXiv preprint arXiv:2601.05237},
  year    = {2026}
}
@article{damen2022rescaling,
  title   = {Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100},
  author  = {Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Furnari, Antonino and
             Kazakos, Evangelos and Ma, Jian and Moltisanti, Davide and Munro, Jonathan and
             Perrett, Toby and Price, Will and Wray, Michael},
  journal = {International Journal of Computer Vision (IJCV)},
  year    = {2022}
}

Built with (please also cite): PointTransformer V3 / Sonata · EPIC-KITCHENS-100. See the code repository for full references.
