arxiv:2402.02519

SIMPL: A Simple and Efficient Multi-agent Motion Prediction Baseline for Autonomous Driving

Published on Feb 4, 2024

Abstract

This paper presents a Simple and effIcient Motion Prediction baseLine (SIMPL) for autonomous vehicles. Unlike conventional agent-centric methods with high accuracy but repetitive computations and scene-centric methods with compromised accuracy and generalizability, SIMPL delivers real-time, accurate motion predictions for all relevant traffic participants. To achieve improvements in both accuracy and inference speed, we propose a compact and efficient global feature fusion module that performs directed message passing in a symmetric manner, enabling the network to forecast future motion for all road users in a single feed-forward pass and mitigating accuracy loss caused by viewpoint shifting. Additionally, we investigate the continuous trajectory parameterization using Bernstein basis polynomials in trajectory decoding, allowing evaluations of states and their higher-order derivatives at any desired time point, which is valuable for downstream planning tasks. As a strong baseline, SIMPL exhibits highly competitive performance on Argoverse 1 & 2 motion forecasting benchmarks compared with other state-of-the-art methods. Furthermore, its lightweight design and low inference latency make SIMPL highly extensible and promising for real-world onboard deployment. We open-source the code at https://github.com/HKUST-Aerial-Robotics/SIMPL.
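To make the last point concrete: a Bezier (Bernstein-basis) trajectory is a weighted sum of control points, so positions and higher-order derivatives (velocity, acceleration) can be queried at any continuous time. Below is a minimal illustrative sketch, not code from the paper; the control-point count, the 6 s horizon, and the function names are assumptions made for the example.

```python
# Minimal sketch of Bezier (Bernstein-basis) trajectory evaluation.
# NOT the paper's implementation: control-point count, horizon, and names are assumed.
import math
import numpy as np

def bernstein_basis(n: int, t: np.ndarray) -> np.ndarray:
    """Bernstein basis b_{i,n}(t) for i = 0..n and t in [0, 1]; shape (len(t), n+1)."""
    i = np.arange(n + 1)
    coef = np.array([math.comb(n, k) for k in i])
    return coef * (t[:, None] ** i) * ((1.0 - t[:, None]) ** (n - i))

def eval_bezier(ctrl: np.ndarray, t: np.ndarray, horizon: float = 6.0):
    """Evaluate position and velocity of a 2D Bezier trajectory.

    ctrl: (n+1, 2) control points (e.g. output of a trajectory decoder, hypothetical).
    t:    query times in seconds within [0, horizon].
    """
    n = ctrl.shape[0] - 1
    s = t / horizon                                  # normalize time to [0, 1]
    pos = bernstein_basis(n, s) @ ctrl               # (len(t), 2) positions
    # The derivative of a degree-n Bezier is a degree-(n-1) Bezier over the
    # forward differences of the control points (the 1/horizon comes from the chain rule).
    diff = n * (ctrl[1:] - ctrl[:-1])
    vel = (bernstein_basis(n - 1, s) @ diff) / horizon
    return pos, vel

# Usage: one predicted mode with 6 control points, queried at arbitrary times.
ctrl_pts = np.array([[0, 0], [2, 0.2], [5, 1.0], [9, 2.5], [14, 4.0], [20, 5.0]], float)
pos, vel = eval_bezier(ctrl_pts, t=np.array([0.0, 1.5, 3.0, 6.0]))
print(pos, vel)
```

The practical point the abstract makes is that a downstream planner can query smooth states and derivatives at any timestamp instead of interpolating a fixed-step waypoint sequence.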

Community

Proposes SIMPL (Simple and effIcient Motion Prediction baseLine): a generalised method for predicting the future motion of all road participants at once, built around a novel global feature fusion module. Introduces an instance-centric scene representation, a symmetric fusion transformer (SFT), and a parameterization of trajectories with Bernstein polynomials (Bezier curves), all running in real time. Predicts multiple possible future trajectories with scores (a multimodal distribution).

Framework: the instance-centric scene representation is encoded by an MLP into relative pose embeddings (RPE); separate encoders produce actor and map tokens; the encoded tokens and the RPE are fed to the SFT, whose output drives the Bezier-based motion decoder (see the Bernstein/Bezier evaluation sketch under the abstract above).

SFT (a code sketch of one layer follows below): source and target embeddings are concatenated with the RPE (one-to-one, vectorised via expand and repeat) and passed through an MLP to obtain context embeddings; cross-attention uses the target as query and the context as key/value, followed by a residual connection (add with the query) and normalization, then a feed-forward layer with another residual and norm to produce the updated tokens; the RPE is likewise updated through an MLP with a residual (add with the input RPE) and norm. Multiple SFT layers are stacked. The output actor tokens go to an MLP head that produces K trajectories (as Bezier control points) and a softmax classification for their probabilities.

Training uses a regression loss and a classification loss; the regression loss combines a smooth-L1 loss on position coordinates with a yaw loss (cosine similarity); a rough sketch of this objective follows the SFT code below. Trained and tested on the Argoverse 1 and 2 motion forecasting datasets. Outperforms MacFormer and HiVT and is comparable to Wayformer on Argoverse 1 with fewer parameters, evaluated with minADE (average displacement error), Brier-minFDE, etc.; just behind QCNet on Argoverse 2 while using far fewer parameters. Ablations cover the feature fusion module (SFT) embedding size (dimensionality), the trajectory parameterization, and the auxiliary loss function (heading angle). Can be extended with a refinement module (re-encode, collect nearby instances, fuse, then decode) for a slight further improvement. From HKUST and DJI.
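A minimal PyTorch sketch of one SFT layer as described above, assuming the fusion is implemented with standard multi-head cross-attention; class and parameter names (SFTLayer, d_model=128, n_heads=8) are hypothetical and not taken from the released code.

```python
# Illustrative sketch of one symmetric fusion transformer (SFT) layer, NOT the authors' code.
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        # MLP that fuses [source token, target token, RPE] into a context embedding
        self.ctx_mlp = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        # Branch that updates the relative pose embeddings (RPE) with residual + norm
        self.rpe_mlp = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.rpe_norm = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor, rpe: torch.Tensor):
        # tokens: (N, d) actor/map instance tokens; rpe: (N, N, d) pairwise embeddings
        N, d = tokens.shape
        src = tokens.unsqueeze(0).expand(N, N, d)      # source token for each pair (i, j)
        tgt = tokens.unsqueeze(1).expand(N, N, d)      # target token repeated along j
        pair = torch.cat([src, tgt, rpe], dim=-1)      # (N, N, 3d) one-to-one concatenation
        ctx = self.ctx_mlp(pair)                       # (N, N, d) context embeddings

        # Cross-attention: each target token attends over its N context embeddings
        q = tokens.unsqueeze(1)                        # (N, 1, d) queries
        out, _ = self.attn(q, ctx, ctx)                # (N, 1, d)
        x = self.norm1(tokens + out.squeeze(1))        # residual + norm
        x = self.norm2(x + self.ffn(x))                # feed-forward + residual + norm

        rpe = self.rpe_norm(rpe + self.rpe_mlp(pair))  # RPE update with residual + norm
        return x, rpe

# Usage (shapes only): 32 instance tokens with 128-dim features
layer = SFTLayer()
tok, rel = torch.randn(32, 128), torch.randn(32, 32, 128)
tok, rel = layer(tok, rel)
```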

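And a rough sketch of the training objective summarized above: winner-take-all regression on the best of K modes (smooth-L1 on positions plus a cosine-similarity yaw term) together with a classification loss on the mode scores. The best-mode selection criterion (final-displacement error) and the loss weights are assumptions, not values from the paper.

```python
# Assumed sketch of a SIMPL-style loss; weights and mode selection are illustrative only.
import torch
import torch.nn.functional as F

def simpl_style_loss(pred_xy, pred_yaw, scores, gt_xy, gt_yaw, w_yaw=0.5, w_cls=0.1):
    """pred_xy: (K, T, 2), pred_yaw: (K, T), scores: (K,) logits,
    gt_xy: (T, 2), gt_yaw: (T,)."""
    # Winner-take-all: pick the mode with the smallest final-position error (assumed criterion)
    fde = torch.linalg.norm(pred_xy[:, -1] - gt_xy[-1], dim=-1)    # (K,)
    best = torch.argmin(fde)
    reg = F.smooth_l1_loss(pred_xy[best], gt_xy)                   # position term
    # Yaw term: 1 - cos(delta) equals 1 minus the cosine similarity of unit heading vectors
    yaw = 1.0 - torch.cos(pred_yaw[best] - gt_yaw).mean()
    cls = F.cross_entropy(scores.unsqueeze(0), best.unsqueeze(0))  # mode classification
    return reg + w_yaw * yaw + w_cls * cls
```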
Links: GitHub (https://github.com/HKUST-Aerial-Robotics/SIMPL)
