
Closed-loop framework with two diffusion model-based policies: an evaluator to predict human intent, and a copilot to provide optimal trajectories and ensure smooth control transitions during safety-critical situations.
Safe handover in shared autonomy for vehicle control is well-established in modern vehicles. However, avoiding accidents often requires action several seconds in advance. This necessitates both understanding human driver behavior and having an expert control strategy for seamless intervention when a collision or unsafe state is predicted.
We propose Diffusion-SAFE, a closed-loop shared autonomy framework leveraging diffusion models to: (1) predict human driving behavior for detection of potential risks, (2) generate safe expert trajectories, and (3) enable smooth handovers by blending human and expert policies over a short time horizon. Unlike prior works that use engineered score functions to rate driving performance, our approach enables both performance evaluation and optimal action sequence generation from demonstrations. By adjusting the forward and reverse processes of the diffusion-based copilot, our method mimics the driver's behavior before intervention and gradually shifts control authority, mitigating abrupt takeovers and yielding smooth transitions.
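To make the closed-loop logic concrete, here is a minimal Python sketch of one decision step. The `evaluator` and `copilot` interfaces (`evaluator.nll(...)`, `copilot.partial_diffusion(...)`) are hypothetical stand-ins for the trained diffusion models; the actual implementation may differ.

```python
def shared_autonomy_step(obs, human_actions, evaluator, copilot,
                         tau_nll=5.0, gamma=0.5):
    """One closed-loop step: keep human control unless the evaluator flags risk.

    Assumptions: `evaluator.nll(obs, human_actions)` returns the negative
    log-likelihood of the human's planned action sequence under the learned
    human-driving model; `copilot.partial_diffusion(obs, human_actions, gamma)`
    noises the human plan up to ratio `gamma` of the forward process and
    denoises it back toward the expert policy.
    """
    score = evaluator.nll(obs, human_actions)
    if score <= tau_nll:                 # human behavior looks safe; pass it through
        return human_actions, False
    # Safety-critical: hand over by partially re-diffusing the human plan.
    safe_actions = copilot.partial_diffusion(obs, human_actions, gamma)
    return safe_actions, True
```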
We evaluated Diffusion-SAFE in both simulation (CarRacing-v0) and the real world (a ROS-based race car), measuring human driving similarity, safety, and computational efficiency. Results demonstrate a 98.5% successful handover rate, highlighting the framework's effectiveness in progressively correcting human actions and continuously sampling optimal robot actions. Code will be released upon publication.
Diffusion-SAFE architecture: The evaluator model processes observations and action sequences, sampling future action sequences aligned with human intent in a simulated environment. The copilot model generates and executes expert action sequences when the human performance score falls below a predefined threshold \( \tau_{NLL} \).
Noise Estimator Architecture: U-Net design with residual connections, positional embedding of step \( t \), and conditioning vector \( \mathbf{C}_{0:t_{\text{obs}}} \). Double convolution block (DC in the figure).
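For illustration, here is a minimal PyTorch sketch of one conditional residual double-convolution ("DC") block and the sinusoidal step embedding. This is not the exact architecture from the paper; channel sizes, activations, and normalization are assumptions.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPosEmb(nn.Module):
    """Positional embedding of the diffusion step t (standard sinusoidal encoding)."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / (half - 1))
        args = t[:, None].float() * freqs[None, :]
        return torch.cat([args.sin(), args.cos()], dim=-1)

class ConditionalResBlock(nn.Module):
    """Double-convolution block with a residual connection, modulated by the
    concatenation of the step embedding and the conditioning vector C_{0:t_obs}."""
    def __init__(self, in_ch, out_ch, cond_dim):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv1d(out_ch, out_ch, 3, padding=1)
        self.norm1 = nn.GroupNorm(1, out_ch)
        self.norm2 = nn.GroupNorm(1, out_ch)
        self.cond_proj = nn.Linear(cond_dim, out_ch)   # injects [emb(t), C] per channel
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, cond):
        # x: (batch, channels, horizon) action-sequence features; cond: (batch, cond_dim)
        h = torch.relu(self.norm1(self.conv1(x)))
        h = h + self.cond_proj(cond)[:, :, None]       # broadcast over the horizon
        h = torch.relu(self.norm2(self.conv2(h)))
        return h + self.skip(x)
```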
Here we showcase four different simulated scenes randomly generated in Gym CarRacing-v0.
By changing \( \gamma \), we can adjust the balance between preserving human input and following the safe behavior of the copilot: when \( \gamma \) is small, human intent is well preserved but alignment with \( P_{\text{copilot}} \) is limited; in contrast, larger values of \( \gamma \) lead the system to prioritize aligning with the copilot policy over human input.
Comparison of Our Partial Diffusion Method and Simple Blending in the Handover Process:
Our approach (via forward diffusion ratio \( \gamma \))
Simple blending: \( \mathbf{a}_{blend} = k \mathbf{a}_{H} + (1 - k) \mathbf{a}_{copilot} \)
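A minimal sketch, under standard DDPM assumptions, contrasting the two handover strategies above: the \( \gamma \)-controlled partial diffusion and the simple linear blending baseline. `copilot_denoise_step` is a hypothetical wrapper around one reverse step of the trained copilot, not part of the released code.

```python
import torch

def partial_diffusion_handover(human_actions, copilot_denoise_step, alphas_cumprod, gamma):
    """Noise the human action sequence gamma of the way through the forward process,
    then let the copilot's reverse process denoise it back into a safe plan."""
    T = len(alphas_cumprod)
    k = int(gamma * (T - 1))                 # small gamma: stay close to the human plan
    a_bar = alphas_cumprod[k]
    noise = torch.randn_like(human_actions)
    a_k = a_bar.sqrt() * human_actions + (1 - a_bar).sqrt() * noise   # forward q(a_k | a_0)
    for step in reversed(range(k + 1)):      # reverse process guided by the copilot
        a_k = copilot_denoise_step(a_k, step)
    return a_k

def simple_blending(human_actions, copilot_actions, k_blend):
    """Baseline: linear interpolation between human and copilot action sequences."""
    return k_blend * human_actions + (1 - k_blend) * copilot_actions
```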
Ablation studies are conducted for both the evaluator and the copilot. The horizon is measured in steps, where each step corresponds to 0.1 s. \( \textbf{Bold} \) indicates the best result, while \( \textit{Italic} \) indicates the second-best result.
Ablation table for the evaluator model
Ablation table for the copilot model
In this work, we utilize the ability of the diffusion policy to inherently express multimodal distributions. Our method is compared against the following multimodal baselines: LSTM-GMM and Behavior Transformer (BeT). The results are summarized in the tables below. The horizon is measured in steps, where each step corresponds to 0.1 s. \( \textbf{Bold} \) indicates the best result, while \( \textit{Italic} \) indicates the second-best result.
Baseline Comparison table for the evaluator model
Baseline Comparison table for the copilot model
We conducted real-world experiments on a ROS-based race car equipped with a Jetson Orin Nano for onboard processing. A Windows 11 computer running Motive software streamed data from a 13-camera OptiTrack motion capture system.
Specifically, we use the motion capture system to obtain the car's real-time pose in the world frame, convert that pose into the pixel frame, and use the pixel-frame pose to crop image patches in real time for dataset collection.
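As an illustration of this pipeline, here is a small sketch assuming a hypothetical 3x3 homography `H` calibrated between the track plane and the map image; the actual calibration and cropping code may differ.

```python
import numpy as np

def world_to_pixel(p_world, H):
    """Project a planar world-frame position (x, y) into pixel coordinates
    using a hypothetical plane-to-image homography H (3x3)."""
    p = H @ np.array([p_world[0], p_world[1], 1.0])
    return (p[:2] / p[2]).astype(int)

def crop_patch(map_image, p_world, H, size=64):
    """Crop a (size x size) patch of the track map centered on the car's
    current pose (bounds checking omitted for brevity)."""
    u, v = world_to_pixel(p_world, H)
    half = size // 2
    return map_image[v - half:v + half, u - half:u + half]
```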
Real-World Experiment Results: Columns represent unseen maps. Rows represent different initial conditions ('start' points), human-driver temporal strategies, and the corresponding handovers.
Here are four real-world demos showing smooth and successful handovers on various unseen tracks.
@article{DiffusionSAFE,
  author  = {Yunxin Fan and Monroe Kennedy III},
  title   = {Diffusion-SAFE: Shared Autonomy Framework with Diffusion for Safe Human-to-Robot Driving Handover},
  journal = {arXiv},
  year    = {2025},
}