
Closed-loop framework with two diffusion model-based policies: an evaluator to predict human intent, and a copilot to provide optimal trajectories and ensure smooth control transitions during safety-critical situations.
Safe handover in shared autonomy for vehicle control is well-established in modern vehicles. However, avoiding accidents often requires action several seconds in advance. This necessitates both understanding human driver behavior and having an expert control strategy for seamless intervention when a collision or unsafe state is predicted.
We propose Diffusion-SAFE, a closed-loop shared autonomy framework leveraging diffusion models to: (1) predict human driving behavior for detection of potential risks, (2) generate safe expert trajectories, and (3) enable smooth handovers by blending human and expert policies over a short time horizon. Unlike prior works that use engineered score functions to rate driving performance, our approach enables both performance evaluation and optimal action sequence generation from demonstrations. By adjusting the forward and reverse processes of the diffusion-based copilot, our method mimics the driver's behavior before intervention and gradually shifts control authority, mitigating abrupt takeovers and yielding smooth transitions.
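To make the closed-loop logic concrete, here is a minimal Python sketch of one decision step. The `evaluator` and `copilot` interfaces (`evaluator.nll(...)`, `copilot.partial_diffusion(...)`) are hypothetical stand-ins for the trained diffusion models; the actual implementation may differ.

```python
def shared_autonomy_step(obs, human_actions, evaluator, copilot,
                         tau_nll=5.0, gamma=0.5):
    """One closed-loop step: keep human control unless the evaluator flags risk.

    Assumptions: `evaluator.nll(obs, human_actions)` returns the negative
    log-likelihood of the human's planned action sequence under the learned
    human-driving model; `copilot.partial_diffusion(obs, human_actions, gamma)`
    noises the human plan up to ratio `gamma` of the forward process and
    denoises it back toward the expert policy.
    """
    score = evaluator.nll(obs, human_actions)
    if score <= tau_nll:                 # human behavior looks safe; pass it through
        return human_actions, False
    # Safety-critical: hand over by partially re-diffusing the human plan.
    safe_actions = copilot.partial_diffusion(obs, human_actions, gamma)
    return safe_actions, True
```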
We evaluated Diffusion-SAFE in both simulation (CarRacing-v0) and the real world (a ROS-based race car), measuring human driving similarity, safety, and computational efficiency. Results demonstrate a 98.5% successful handover rate, highlighting the framework's effectiveness in progressively correcting human actions and continuously sampling optimal robot actions. Code will be released upon publication.
Diffusion-SAFE architecture: The evaluator model processes observations and action sequences, sampling future action sequences aligned with human intent in a simulated environment. The copilot model generates and executes expert action sequences when the human performance score falls below a predefined threshold \( \tau_{NLL} \).
Noise Estimator Architecture: U-Net design with residual connections, positional embedding of step \( t \), and conditioning vector \( \mathbf{C}_{0:t_{\text{obs}}} \). Double convolution block (DC in the figure).
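For illustration, here is a minimal PyTorch sketch of one conditional residual double-convolution ("DC") block and the sinusoidal step embedding. This is not the exact architecture from the paper; channel sizes, activations, and normalization are assumptions.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPosEmb(nn.Module):
    """Positional embedding of the diffusion step t (standard sinusoidal encoding)."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / (half - 1))
        args = t[:, None].float() * freqs[None, :]
        return torch.cat([args.sin(), args.cos()], dim=-1)

class ConditionalResBlock(nn.Module):
    """Double-convolution block with a residual connection, modulated by the
    concatenation of the step embedding and the conditioning vector C_{0:t_obs}."""
    def __init__(self, in_ch, out_ch, cond_dim):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv1d(out_ch, out_ch, 3, padding=1)
        self.norm1 = nn.GroupNorm(1, out_ch)
        self.norm2 = nn.GroupNorm(1, out_ch)
        self.cond_proj = nn.Linear(cond_dim, out_ch)   # injects [emb(t), C] per channel
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, cond):
        # x: (batch, channels, horizon) action-sequence features; cond: (batch, cond_dim)
        h = torch.relu(self.norm1(self.conv1(x)))
        h = h + self.cond_proj(cond)[:, :, None]       # broadcast over the horizon
        h = torch.relu(self.norm2(self.conv2(h)))
        return h + self.skip(x)
```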
Here we showcase four different simulated scenes randomly generated in Gym CarRacing-v0.
By changing \( \gamma \), we can adjust the balance between preserving human input and following the safe behavior of the copilot: when \( \gamma \) is small, human intent is well preserved but alignment with \( P_{\text{copilot}} \) is limited; in contrast, larger values of \( \gamma \) lead the system to prioritize aligning with the copilot policy over human input.
Comparison of Our Partial Diffusion Method and Simple Blending in the Handover Process:
Our approach (via forward diffusion ratio \( \gamma \))
Simple blending: \( \mathbf{a}_{blend} = k \mathbf{a}_{H} + (1 - k) \mathbf{a}_{copilot} \)
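A minimal sketch, under standard DDPM assumptions, contrasting the two handover strategies above: the \( \gamma \)-controlled partial diffusion and the simple linear blending baseline. `copilot_denoise_step` is a hypothetical wrapper around one reverse step of the trained copilot, not part of the released code.

```python
import torch

def partial_diffusion_handover(human_actions, copilot_denoise_step, alphas_cumprod, gamma):
    """Noise the human action sequence gamma of the way through the forward process,
    then let the copilot's reverse process denoise it back into a safe plan."""
    T = len(alphas_cumprod)
    k = int(gamma * (T - 1))                 # small gamma: stay close to the human plan
    a_bar = alphas_cumprod[k]
    noise = torch.randn_like(human_actions)
    a_k = a_bar.sqrt() * human_actions + (1 - a_bar).sqrt() * noise   # forward q(a_k | a_0)
    for step in reversed(range(k + 1)):      # reverse process guided by the copilot
        a_k = copilot_denoise_step(a_k, step)
    return a_k

def simple_blending(human_actions, copilot_actions, k_blend):
    """Baseline: linear interpolation between human and copilot action sequences."""
    return k_blend * human_actions + (1 - k_blend) * copilot_actions
```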
Ablation studies are conducted for both the evaluator and the copilot. The horizon is measured in steps, where each step corresponds to 0.1 s. \( \textbf{Bold} \) indicates the best result, while \( \textit{Italic} \) indicates the second-best result.
Ablation table for the evaluator model
Ablation table for the copilot model
In this work, we utilize the ability of the diffusion policy to inherently express multimodal distributions. Our method is compared against the following multimodal baselines: LSTM-GMM and Behavior Transformer (BeT). The results are summarized in the tables below. The horizon is measured in steps, where each step corresponds to 0.1 s. \( \textbf{Bold} \) indicates the best result, while \( \textit{Italic} \) indicates the second-best result.
Baseline Comparison table for the evaluator model
Baseline Comparison table for the copilot model
We conducted real-world experiments on a ROS-based race car equipped with a Jetson Orin Nano for onboard processing. A Windows 11 computer running Motive software streamed data from a 13-camera OptiTrack motion capture system.
Specifically, we use the motion capture system to obtain the car's real-time pose in the world frame, convert that pose into the pixel frame, and use the pixel-frame pose to crop image patches in real time for dataset collection.
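As an illustration of this pipeline, here is a small sketch assuming a hypothetical 3x3 homography `H` calibrated between the track plane and the map image; the actual calibration and cropping code may differ.

```python
import numpy as np

def world_to_pixel(p_world, H):
    """Project a planar world-frame position (x, y) into pixel coordinates
    using a hypothetical plane-to-image homography H (3x3)."""
    p = H @ np.array([p_world[0], p_world[1], 1.0])
    return (p[:2] / p[2]).astype(int)

def crop_patch(map_image, p_world, H, size=64):
    """Crop a (size x size) patch of the track map centered on the car's
    current pose (bounds checking omitted for brevity)."""
    u, v = world_to_pixel(p_world, H)
    half = size // 2
    return map_image[v - half:v + half, u - half:u + half]
```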
Real-World Experiment Results: Columns represent unseen maps. Rows represent different initial conditions ('start' points), human-driver temporal strategies, and the corresponding handovers.
Here are four real-world demos showing smooth and successful handovers on various unseen tracks.
@article{DiffusionSAFE,
  author  = {Yunxin Fan and Monroe Kennedy III},
  title   = {Diffusion-SAFE: Shared Autonomy Framework with Diffusion for Safe Human-to-Robot Driving Handover},
  journal = {arXiv},
  year    = {2025},
}