SIMPL: A Simple and Efficient Multi-agent Motion Prediction Baseline for Autonomous Driving
Abstract
This paper presents a Simple and effIcient Motion Prediction baseLine (SIMPL) for autonomous vehicles. Unlike conventional agent-centric methods with high accuracy but repetitive computations and scene-centric methods with compromised accuracy and generalizability, SIMPL delivers real-time, accurate motion predictions for all relevant traffic participants. To achieve improvements in both accuracy and inference speed, we propose a compact and efficient global feature fusion module that performs directed message passing in a symmetric manner, enabling the network to forecast future motion for all road users in a single feed-forward pass and mitigating accuracy loss caused by viewpoint shifting. Additionally, we investigate the continuous trajectory parameterization using Bernstein basis polynomials in trajectory decoding, allowing evaluations of states and their higher-order derivatives at any desired time point, which is valuable for downstream planning tasks. As a strong baseline, SIMPL exhibits highly competitive performance on Argoverse 1 & 2 motion forecasting benchmarks compared with other state-of-the-art methods. Furthermore, its lightweight design and low inference latency make SIMPL highly extensible and promising for real-world onboard deployment. We open-source the code at https://github.com/HKUST-Aerial-Robotics/SIMPL.
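The continuous trajectory parameterization mentioned in the abstract follows the standard Bernstein/Bézier formulation; below is a minimal sketch of that math, with the curve degree and time normalization left generic since the abstract does not pin them down.

```latex
% Bezier curve of degree n over normalized time t in [0, 1],
% defined by control points P_0, ..., P_n:
B_{i,n}(t) = \binom{n}{i}\, t^{i} (1 - t)^{n-i}
\qquad
\mathbf{p}(t) = \sum_{i=0}^{n} B_{i,n}(t)\, \mathbf{P}_i
% The derivative is again a Bezier curve, so velocity (and higher-order
% derivatives) can be evaluated in closed form at any query time t:
\mathbf{p}'(t) = n \sum_{i=0}^{n-1} B_{i,n-1}(t)\,\bigl(\mathbf{P}_{i+1} - \mathbf{P}_i\bigr)
```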
Community
Proposes SIMPL (Simple and effIcient Motion Prediction baseLine): a generalized method for forecasting the future motion of all road participants at once, built around a novel global feature fusion module. Introduces an instance-centric scene representation with a symmetric fusion transformer (SFT) and parameterizes trajectories with Bernstein basis polynomials (Bézier curves), running in real time. Predicts multiple possible future trajectories per agent with confidence scores (a multimodal distribution).

Framework: the instance-centric scene representation is encoded by an MLP to obtain relative pose embeddings (RPE); separate encoders produce actor and map tokens; the concatenated tokens and the RPE are fed to the SFT, whose output drives a Bézier-based motion decoder that predicts control points (see the sketches below).

SFT: source and target embeddings are concatenated with the RPE (one-to-one, vectorized via expand/repeat) and passed through an MLP to get context embeddings; cross-attention uses the target tokens as queries and the context as keys/values, followed by a residual connection (add with the query) and normalization, then a feed-forward block with another residual + norm to produce the updated tokens. The RPE is updated in parallel by passing the context through an MLP with a residual connection (add with the input RPE) and normalization. Multiple SFT layers are stacked.

Decoding and training: the updated actor tokens go to an MLP head that outputs K trajectories (as Bézier control points) and a classification branch with softmax for the mode probabilities. Training uses a regression loss plus a classification loss; the regression term combines a smooth-L1 loss on position coordinates with a yaw loss based on cosine similarity.

Results: trained and evaluated on the Argoverse 1 and 2 motion forecasting datasets. Outperforms MacFormer and HiVT and is comparable to Wayformer on Argoverse 1 while using fewer parameters, measured with minADE (average displacement error), Brier-minFDE, and related metrics; just behind QCNet on Argoverse 2, again with a smaller parameter count. Includes ablations on the feature fusion module (SFT), embedding size (dimensionality), trajectory parameterization, and auxiliary loss functions (heading angle). Can be extended with a refinement module (re-encode, collect nearby instances, fuse, then decode), giving a slight performance improvement. From HKUST and DJI.
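A minimal sketch of one SFT layer as described in the notes above. The hidden size, number of heads, and module layout are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class SFTLayer(nn.Module):
    """One symmetric-fusion-transformer layer: token update + RPE update."""

    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        # Context MLP fuses (source token, target token, relative pose embedding).
        self.ctx_mlp = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        # RPE branch: MLP + residual + norm, so later layers see a refined RPE.
        self.rpe_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.rpe_norm = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor, rpe: torch.Tensor):
        # tokens: (N, D) instance tokens (actors + map elements)
        # rpe:    (N, N, D) pairwise relative pose embeddings
        n, d = tokens.shape
        src = tokens.unsqueeze(0).expand(n, n, d)    # source token j for each pair (i, j)
        tgt = tokens.unsqueeze(1).expand(n, n, d)    # target token i repeated along j
        ctx = self.ctx_mlp(torch.cat([src, tgt, rpe], dim=-1))   # (N, N, D) context

        # Each target token attends over its N context embeddings (key = value = ctx).
        q = tokens.unsqueeze(1)                       # (N, 1, D) queries
        out, _ = self.attn(q, ctx, ctx)               # (N, 1, D)
        tokens = self.norm1(tokens + out.squeeze(1))  # residual + norm
        tokens = self.norm2(tokens + self.ffn(tokens))

        rpe = self.rpe_norm(rpe + self.rpe_mlp(ctx))  # updated RPE for the next layer
        return tokens, rpe
```

Stacking several such layers and passing the final actor tokens to a motion decoder mirrors the pipeline summarized above.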
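A minimal sketch, under assumed shapes and hyper-parameters, of a Bézier-based motion decoder and the loss terms mentioned above (smooth-L1 on positions, a cosine yaw term, cross-entropy over mode scores). The curve degree, number of modes, and the use of curve tangents for yaw are illustrative choices, not necessarily the authors' exact configuration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def bernstein_basis(degree: int, steps: int) -> torch.Tensor:
    """(steps, degree + 1) Bernstein basis values sampled on t in [0, 1]."""
    t = torch.linspace(0.0, 1.0, steps).unsqueeze(1)                  # (T, 1)
    i = torch.arange(degree + 1, dtype=torch.float32).unsqueeze(0)    # (1, n+1)
    binom = torch.tensor([math.comb(degree, k) for k in range(degree + 1)],
                         dtype=torch.float32)
    return binom * t ** i * (1.0 - t) ** (degree - i)                 # (T, n+1)


class BezierDecoder(nn.Module):
    def __init__(self, d_model: int = 128, num_modes: int = 6,
                 degree: int = 5, steps: int = 30):
        super().__init__()
        self.num_modes, self.degree = num_modes, degree
        # K sets of 2-D control points and K mode logits per actor token.
        self.reg_head = nn.Linear(d_model, num_modes * (degree + 1) * 2)
        self.cls_head = nn.Linear(d_model, num_modes)
        self.register_buffer("basis", bernstein_basis(degree, steps))       # (T, n+1)
        self.register_buffer("dbasis", bernstein_basis(degree - 1, steps))  # (T, n)

    def forward(self, actor_tokens: torch.Tensor):
        # actor_tokens: (A, D)
        a = actor_tokens.shape[0]
        ctrl = self.reg_head(actor_tokens).view(a, self.num_modes,
                                                self.degree + 1, 2)
        logits = self.cls_head(actor_tokens)                            # (A, K)
        # Sample the continuous curves at T discrete timestamps.
        traj = torch.einsum("tn,aknd->aktd", self.basis, ctrl)          # (A, K, T, 2)
        # Closed-form derivative -> velocities -> headings from curve tangents.
        dctrl = self.degree * (ctrl[:, :, 1:] - ctrl[:, :, :-1])        # (A, K, n, 2)
        vel = torch.einsum("tn,aknd->aktd", self.dbasis, dctrl)         # (A, K, T, 2)
        yaw = torch.atan2(vel[..., 1], vel[..., 0])                     # (A, K, T)
        return traj, yaw, logits


def training_loss(traj, yaw, logits, gt_traj, gt_yaw, best_mode):
    """Winner-take-all style loss: regress only the selected mode, classify modes."""
    idx = torch.arange(traj.size(0))
    reg = F.smooth_l1_loss(traj[idx, best_mode], gt_traj)       # position term
    yaw_err = 1.0 - torch.cos(yaw[idx, best_mode] - gt_yaw)     # cosine yaw term
    cls = F.cross_entropy(logits, best_mode)                    # mode classification
    return reg + yaw_err.mean() + cls
```

At inference time, a softmax over the logits gives the per-mode probabilities mentioned above; `best_mode` here is assumed to be the mode closest to the ground truth, a common winner-take-all choice rather than a detail taken from the paper.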
Links: GitHub