fix: synchronize controller spawning with robot boot using hardware awaiter #1806

Open
srvald wants to merge 13 commits into UniversalRobots:main from srvald:booting_driver_error

@srvald srvald commented Jun 1, 2026

Description

This PR addresses a race condition during the UR robot bringup sequence. Previously, controller_spawners would attempt to load controllers before the physical hardware had fully initialized.

This premature spawning caused the spawners to time out while waiting for the controller_manager services and exit with an error, as can be seen in the following example:

~/ws_rolling/ur_ros2_driver > ros2 launch ur_robot_driver ur_control.launch.py ur_type:=ur3e robot_ip:=10.54.4.16
[INFO] [launch]: All log files can be found below /home/mirserv/.ros/log/2026-06-01-08-19-02-528240-ia-li-9yp6qn3.unirobotts.local-76696
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [dashboard_client-2]: process started with pid [76701]
[INFO] [controller_stopper_node-4]: process started with pid [76703]
[INFO] [ros2_control_node-1]: process started with pid [76700]
[INFO] [robot_state_helper-3]: process started with pid [76702]
[INFO] [urscript_interface-5]: process started with pid [76704]
[INFO] [robot_state_publisher-6]: process started with pid [76705]
[INFO] [rviz2-7]: process started with pid [76706]
[INFO] [trajectory_until_node-8]: process started with pid [76707]
[INFO] [spawner-9]: process started with pid [76708]
[INFO] [spawner-10]: process started with pid [76709]
[spawner-9] [FATAL] [1780294753.712275335] [spawner_joint_state_broadcaster]: Could not contact service /controller_manager/list_controllers
[spawner-10] [INFO] [1780294755.972623035] [spawner_forward_velocity_controller]: waiting for service /controller_manager/list_controllers to become available...
[ERROR] [spawner-9]: process has died [pid 76708, exit code 1, cmd '/home/mirserv/ws_rolling/ur_ros2_driver/install/controller_manager/lib/controller_manager/spawner --controller-manager /controller_manager --controller-manager-timeout 10 joint_state_broadcaster io_and_status_controller speed_scaling_state_broadcaster force_torque_sensor_broadcaster tcp_pose_broadcaster ur_configuration_controller friction_model_controller joint_trajectory_controller --ros-args --params-file /tmp/launch_params_nt6rhiz1 --params-file /tmp/launch_params_o1xa4z78'].
[spawner-10] [FATAL] [1780294765.980977257] [spawner_forward_velocity_controller]: Could not contact service /controller_manager/list_controllers
[ERROR] [spawner-10]: process has died [pid 76709, exit code 1, cmd '/home/mirserv/ws_rolling/ur_ros2_driver/install/controller_manager/lib/controller_manager/spawner --controller-manager /controller_manager --controller-manager-timeout 10 --inactive forward_velocity_controller forward_position_controller forward_effort_controller force_mode_controller passthrough_trajectory_controller freedrive_mode_controller tool_contact_controller motion_primitive_forward_controller --ros-args --params-file /tmp/launch_params_by8uu0gi --params-file /tmp/launch_params_kkpsyn1r'].

Solution

Introduced a dedicated synchronization node (ur_hardware_awaiter.py) that acts as a gatekeeper. It ensures the physical robot is reachable and the ROS 2 controller manager is fully responsive before allowing the launch sequence to execute the controller spawners.

Changes Made

  • ur_hardware_awaiter.py: Added a new Python lifecycle node with a dual-check validation system:
    1. Opens TCP sockets to verify that the physical UR network interfaces (ports 30001, 30002, 30004) are reachable. (Note: port 29999 is not checked, to maintain compatibility with PolyScope X.)
    2. Uses an asynchronous service client to ping /controller_manager/list_controllers to ensure the manager is actually alive and responsive.
    3. Non-blocking Timer Logic: Implemented a 10-second periodic timer to avoid blocking the main thread. To account for ROS 2 timer behavior (which waits before the first execution), an immediate preliminary check is performed upon initialization. If the robot is already fully booted and the service is available, the timer is canceled immediately, ensuring zero delay in the startup sequence.
  • Launch File (launch.py):
    • Integrated the awaiter node into the boot sequence.
    • Used RegisterEventHandler with OnProcessExit to strictly delay controller_spawners until the awaiter exits with a success code.
    • Added conditional logic to bypass the awaiter entirely when use_mock_hardware is set to true, maintaining fast startup times for simulations.
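
The TCP reachability check from the first item can be sketched with the standard library alone. This is a minimal illustration, not the code in ur_hardware_awaiter.py: the function names port_is_open and robot_is_reachable and the 1-second timeout are assumptions; only the port numbers come from the description above.

```python
import socket

# Ports named in the PR description. Port 29999 is deliberately left out,
# as the description notes. Everything else here is an illustrative assumption.
UR_PORTS = (30001, 30002, 30004)


def port_is_open(host: str, port: int, timeout_sec: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host, connects, and raises OSError
        # (refused, unreachable, timed out) when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout_sec):
            return True
    except OSError:
        return False


def robot_is_reachable(host: str, ports=UR_PORTS, timeout_sec: float = 1.0) -> bool:
    """Return True only if every listed port accepts a connection."""
    return all(port_is_open(host, port, timeout_sec) for port in ports)
```

A caller would use something like robot_is_reachable("10.54.4.16") and log which stage failed. A successful connect only shows that the port accepts connections, which is why the description pairs it with the list_controllers ping.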

Result

The awaiter runs at the start of the launch and checks the sockets every 10 seconds (matching other node timeouts):

[ur_hardware_awaiter.py-9] [INFO] [1780293664.406038309] [ur_hardware_awaiter]: Awaiting robot initialization at IP 10.54.4.16...
[ur_hardware_awaiter.py-9] [INFO] [1780293665.407508626] [ur_hardware_awaiter]: System is still initializing. Retrying in 10.0 seconds...

Once the UR client library is fully connected and the hardware interface is initialized, the awaiter unblocks the spawners:

[ur_hardware_awaiter.py-9] [INFO] [1780293864.409172749] [ur_hardware_awaiter]: Service found in registry. Pinging to verify it is responsive...
[ur_hardware_awaiter.py-9] [INFO] [1780293864.505356299] [ur_hardware_awaiter]: Service responded successfully. Controller spawner is unblocked.
[INFO] [ur_hardware_awaiter.py-9]: process has finished cleanly [pid 66379]
[INFO] [spawner-10]: process started with pid [68274]
[INFO] [spawner-11]: process started with pid [68275]

If the robot is already fully booted, the initial check passes instantly without triggering the 10-second wait:

[ur_hardware_awaiter.py-9] [INFO] [1780295496.805358990] [ur_hardware_awaiter]: Awaiting robot initialization at IP 10.54.4.16...
[ur_hardware_awaiter.py-9] [INFO] [1780295497.058702224] [ur_hardware_awaiter]: Service found in registry. Pinging to verify it is responsive...
[ur_hardware_awaiter.py-9] [INFO] [1780295497.060046771] [ur_hardware_awaiter]: Service responded successfully. Controller spawner is unblocked.
[INFO] [ur_hardware_awaiter.py-9]: process has finished cleanly [pid 83394]

If use_mock_hardware:=true, the hardware awaiter is not used:

~/ws_rolling/ur_ros2_driver > ros2 launch ur_robot_driver ur_control.launch.py ur_type:=ur5e robot_ip:=192.168.56.101 use_mock_hardware:=true
[INFO] [launch]: All log files can be found below /home/mirserv/.ros/log/2026-06-01-08-16-34-930421-ia-li-9yp6qn3.unirobotts.local-74946
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [ros2_control_node-1]: process started with pid [74967]
[INFO] [robot_state_publisher-2]: process started with pid [74968]
[INFO] [rviz2-3]: process started with pid [74969]
[INFO] [trajectory_until_node-4]: process started with pid [74970]
[INFO] [spawner-5]: process started with pid [74971]
[INFO] [spawner-6]: process started with pid [74972]
[robot_state_publisher-2] [INFO] [1780294595.601324289] [robot_state_publisher]: Robot initialized
[ros2_control_node-1] [INFO] [1780294595.644906685] [controller_manager]: Using Steady (Monotonic) clock for triggering controller manager cycles.
[ros2_control_node-1] [INFO] [1780294595.669633976] [controller_manager]: Subscribing to '/robot_description' topic for robot description.
[ros2_control_node-1] [INFO] [1780294595.669735736] [controller_manager]: update rate is 500 Hz
[ros2_control_node-1] [INFO] [1780294595.669750998] [controller_manager]: Overruns handling is : enabled
[ros2_control_node-1] [INFO] [1780294595.669760803] [controller_manager]: Spawning controller_manager RT thread with scheduler priority: 50
[ros2_control_node-1] [INFO] [1780294595.670302549] [controller_manager]: Successful set up FIFO RT scheduling policy with priority 50.
[ros2_control_node-1] [INFO] [1780294595.931139361] [controller_manager]: Received robot description from topic.
[ros2_control_node-1] [INFO] [1780294595.931278015] [controller_manager]: Enforcing command limits is disabled. Command limits from URDF will be ignored.
[ros2_control_node-1] [INFO] [1780294595.961021240] [controller_manager]: Loading hardware 'ur5e' 
[ros2_control_node-1] [INFO] [1780294595.971426541] [controller_manager]: Loaded hardware 'ur5e' from plugin 'mock_components/GenericSystem'
[ros2_control_node-1] [INFO] [1780294595.971630189] [controller_manager]: Initialize hardware 'ur5e' 
[ros2_control_node-1] [INFO] [1780294595.986438334] [controller_manager]: Successful initialization of hardware 'ur5e'
[ros2_control_node-1] [INFO] [1780294595.990269208] [controller_manager]: Activating component 'ur5e'.
[ros2_control_node-1] [INFO] [1780294595.990481656] [resource_manager]: 'configure' hardware 'ur5e' 
[ros2_control_node-1] [INFO] [1780294595.990672834] [resource_manager]: Successful 'configure' of hardware 'ur5e'
[ros2_control_node-1] [INFO] [1780294595.990767637] [resource_manager]: 'activate' hardware 'ur5e' 
[ros2_control_node-1] [INFO] [1780294595.990808725] [resource_manager]: Successful 'activate' of hardware 'ur5e'
[ros2_control_node-1] [INFO] [1780294595.992243385] [controller_manager]: Registering statistics for : ur5e
[ros2_control_node-1] [INFO] [1780294595.992396041] [controller_manager]: Resource Manager has been successfully initialized. Starting Controller Manager services...
[spawner-5] [INFO] [1780294596.254464295] [spawner_joint_state_broadcaster]: waiting for service /controller_manager/list_controllers to become available...
[rviz2-3] [INFO] [1780294596.363580801] [rviz2]: Stereo is NOT SUPPORTED
[rviz2-3] [INFO] [1780294596.363814046] [rviz2]: OpenGl version: 4.6 (GLSL 4.6)
[rviz2-3] [INFO] [1780294596.394431639] [rviz2]: Stereo is NOT SUPPORTED
[spawner-5] [INFO] [1780294596.525132447] [spawner_joint_state_broadcaster]: Setting controller param "params_file" to "['/tmp/launch_params_b2uhhsfm']" for joint_state_broadcaster
[ros2_control_node-1] [INFO] [1780294596.528219094] [controller_manager]: Loading controller : 'joint_state_broadcaster' of type 'joint_state_broadcaster/JointStateBroadcaster'
[ros2_control_node-1] [INFO] [1780294596.528268452] [controller_manager]: Loading controller 'joint_state_broadcaster'
[ros2_control_node-1] [INFO] [1780294596.533071151] [controller_manager]: Controller 'joint_state_broadcaster' node arguments: '--ros-args --params-file /tmp/launch_params_b2uhhsfm'
[spawner-5] [INFO] [1780294596.555583026] [spawner_joint_state_broadcaster]: Loaded joint_state_broadcaster

Related Issues & Comments

  • Addresses the discussion in "Connecting to robot during boot" (#349).
  • This approach implements one of the solutions suggested by @fmauch. While other alternatives were discussed (such as implementing a flag directly within the controllers to defer spawning), that approach would require upstream modifications in ROS 2 core packages. This PR resolves the problem cleanly on the driver side without needing upstream changes.
  • Note: While checking the list_controllers service alone could technically suffice, the TCP socket check is included as a preliminary precaution to provide better diagnostic logs.

Testing

Tested on a real robot running PolyScope X 10.13.0 and in URSim 5.25.1.


Note

Medium Risk
Changes default bringup ordering for real hardware; failed awaiter now blocks all controller loading, though behavior is easier to diagnose than silent spawner crashes.

Overview
Fixes a bringup race where controller spawners started in parallel with ros2_control_node and could fail when /controller_manager/list_controllers was not yet available.

Adds ur_controller_manager_awaiter.py, which loops until ListControllers is reachable and returns successfully (with retries/timeouts). It exits immediately with success when use_mock_hardware is true so simulation startup stays unchanged.

ur_control.launch.py no longer starts spawners at launch: it runs the awaiter first and uses RegisterEventHandler / OnProcessExit to start the active and inactive spawner nodes only after the awaiter exits with code 0; a non-zero exit logs a message and skips spawning.
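
The retry-then-exit-code contract described in this overview can be sketched in plain Python. This is a simplified model under stated assumptions, not the script from the PR: the name await_controller_manager, the max_attempts limit, and the injectable is_responsive and sleep callables are all hypothetical. In the real node, is_responsive would wrap the ListControllers service call.

```python
import time
from typing import Callable


def await_controller_manager(
    is_responsive: Callable[[], bool],
    use_mock_hardware: bool = False,
    check_interval_sec: float = 10.0,
    max_attempts: int = 30,  # hypothetical limit, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return exit code 0 once is_responsive() succeeds, 1 if every attempt fails."""
    if use_mock_hardware:
        # Nothing to wait for in simulation: report success right away.
        return 0
    for attempt in range(max_attempts):
        # The first check happens immediately, so an already booted system
        # never pays the check interval.
        if is_responsive():
            return 0
        if attempt < max_attempts - 1:
            sleep(check_interval_sec)
    return 1
```

A script entry point would pass the result to sys.exit, so that the launch file's OnProcessExit handler can start the spawners on code 0 and skip them otherwise.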

CMakeLists.txt installs the new script under lib/ur_robot_driver.

Reviewed by Cursor Bugbot for commit 90ccc21.

srvald changed the title from "feat: synchronize controller spawning with robot boot using hardware awaiter" to "fix: synchronize controller spawning with robot boot using hardware awaiter" on Jun 1, 2026

srvald commented Jun 3, 2026

In the last commit, I implemented a change suggested by Cursor: using self.client.service_is_ready() instead of self.client.wait_for_service(timeout_sec=0.1). This makes theoretical sense as it prevents the timer callback from blocking the single-threaded executor.

However, in practice, if the robot is already fully booted, service_is_ready() performs an instantaneous check and might miss the service by a millisecond if it is just initializing. When this happens, the node is penalized by waiting the full check_interval (10 seconds) before trying again.

On the other hand, using self.client.wait_for_service(timeout_sec=0.1) provides a small 100 ms window that catches the service coming online "on the fly". Even though it is a blocking call, 100 ms is negligible and yields better boot performance when the robot hardware is already up.
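
The trade-off between the two calls can be modeled without ROS. The sketch below is a pure-Python stand-in, not rclpy code: ready_within and its parameters are hypothetical names. With grace_sec=0.0 it behaves like a single instantaneous check, comparable to service_is_ready(); with grace_sec=0.1 it gives the short bounded wait that wait_for_service(timeout_sec=0.1) provides.

```python
import time
from typing import Callable


def ready_within(
    is_ready: Callable[[], bool],
    grace_sec: float = 0.1,
    poll_sec: float = 0.01,
) -> bool:
    """Poll is_ready() until it returns True or grace_sec has elapsed."""
    deadline = time.monotonic() + grace_sec
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            # The grace window is over: give up and let the caller decide
            # whether to wait a full check interval before retrying.
            return False
        time.sleep(poll_sec)
```

The blocking time is bounded by roughly grace_sec plus one poll interval, which is the cost being weighed against a missed check followed by a full 10-second retry.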

@urfeex urfeex left a comment

Thank you for the contribution, I have a couple of thoughts:

  • I'm not in favor of mixing hardware-connection checks with controller_manager preconditions. The components already check for themselves whether they can connect to the hardware.
  • In the long term I would like to reduce the logic in the launch file, so adding another perform(context) call and event handler doesn't seem desirable. To avoid this, the awaiter itself could check for mock hardware and simply exit when run on mock_hw, so we wouldn't need separate branches in the launch file.

Alternatively, we could start the hardware in unconfigured state by default and add a separate node that handles transitions internally. This would have the downside that the node would have to know the names of all hardware interfaces and controllers it should start. AFAIK, the ros2_control project is currently working on something like this.

We could add this as an intermediate solution until upstream has finished their work and then take that as a cleanup motivation.

@srvald srvald requested a review from urfeex June 11, 2026 08:33

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 90ccc21.

trajectory_until_node,
] + controller_spawners
controller_manager_awaiter,
spawn_controllers_event,

Late OnProcessExit registration race

Medium Severity

Controller spawners are started only via OnProcessExit on the awaiter, but spawn_controllers_event is listed after controller_manager_awaiter in nodes_to_start. If the awaiter process exits before that handler is registered, launch may never run the spawner nodes, leaving the stack without loaded controllers.

codecov Bot commented Jun 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 5.00%. Comparing base (1b121b7) to head (90ccc21).
⚠️ Report is 584 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##            main   #1806      +/-   ##
========================================
+ Coverage   3.59%   5.00%   +1.41%     
========================================
  Files         13      34      +21     
  Lines        947    4255    +3308     
  Branches     152     500     +348     
========================================
+ Hits          34     213     +179     
- Misses       843    4037    +3194     
+ Partials      70       5      -65     
Flag        Coverage Δ
unittests   5.00% <ø> (+1.41%) ⬆️
