fix: synchronize controller spawning with robot boot using hardware awaiter #1806
srvald wants to merge 13 commits into
In the last commit, I implemented a change suggested by Cursor: using However, in practice, if the robot is already fully booted, On the other hand, using
urfeex left a comment
Thank you for the contribution, I have a couple of thoughts:
- I don't think I'm in favor of mixing hardware connection with controller_manager pre-conditions. The components already check for themselves whether they are able to connect to the hardware.
- In the long term I would like to clean up the launch file so that it contains less logic, so adding another perform(context) and event handler doesn't seem desirable in my opinion. To avoid this, the awaiter itself could check for mock hardware and simply exit when run on mock_hw, so we won't need the different branches in the launch file.
Alternatively, we could start the hardware in unconfigured state by default and add a separate node that handles transitions internally. This would have the downside that the node would have to know the names of all hardware interfaces and controllers it should start. AFAIK, the ros2_control project is currently working on something like this.
We could add this as an intermediate solution until upstream has finished their work and then take that as a cleanup motivation.
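The suggestion above, letting the awaiter itself detect mock hardware and exit, could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `parse_launch_bool`, `run_awaiter`, and the `wait_for_hardware` callback are hypothetical names, and the sketch assumes the `use_mock_hardware` launch argument reaches the script as a string.

```python
from typing import Callable


def parse_launch_bool(text: str) -> bool:
    """Interpret a launch-argument string such as 'true' or 'false'."""
    return text.strip().lower() in ("true", "1", "yes")


def run_awaiter(use_mock_hardware: str, wait_for_hardware: Callable[[], bool]) -> int:
    """Return a process exit code: 0 lets the spawners start, 1 blocks them.

    `wait_for_hardware` is a hypothetical callback that blocks until the
    robot is ready and returns False if it gives up.
    """
    if parse_launch_bool(use_mock_hardware):
        # Mock hardware has nothing to wait for, so succeed immediately.
        return 0
    return 0 if wait_for_hardware() else 1
```

With this shape the launch file could start the same awaiter in both cases, and the script's entry point would pass the result of `run_awaiter(...)` to `sys.exit`.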
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 90ccc21.
    trajectory_until_node,
    ] + controller_spawners
    controller_manager_awaiter,
    spawn_controllers_event,
Late OnProcessExit registration race
Medium Severity
Controller spawners are started only via OnProcessExit on the awaiter, but spawn_controllers_event is listed after controller_manager_awaiter in nodes_to_start. If the awaiter process exits before that handler is registered, launch may never run the spawner nodes, leaving the stack without loaded controllers.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1806 +/- ##
========================================
+ Coverage 3.59% 5.00% +1.41%
========================================
Files 13 34 +21
Lines 947 4255 +3308
Branches 152 500 +348
========================================
+ Hits 34 213 +179
- Misses 843 4037 +3194
+ Partials 70 5 -65


Description
This PR addresses a race condition during the UR robot bringup sequence. Previously, controller_spawners would attempt to load controllers before the physical hardware had fully initialized.
This premature spawning caused the controller_manager to hang or crash due to timeouts, as can be seen in the following example:
Solution
Introduced a dedicated synchronization node (ur_hardware_awaiter.py) that acts as a gatekeeper. It ensures the physical robot is reachable and the ROS 2 controller manager is fully responsive before allowing the launch sequence to execute the controller spawners.
Changes Made
- ur_hardware_awaiter.py: Added a new Python lifecycle node with a dual-check validation system:
  - /controller_manager/list_controllers to ensure the manager is actually alive and responsive.
- launch.py):
  - Integrated the awaiter node into the boot sequence.
  - RegisterEventHandler with OnProcessExit to strictly delay controller_spawners until the awaiter exits with a success code.
  - use_mock_hardware is set to true, maintaining fast startup times for simulations.
Result
The awaiter is executed at the start, periodically checking sockets every 10 seconds (matching other node timeouts):
[ur_hardware_awaiter.py-9] [INFO] [1780293664.406038309] [ur_hardware_awaiter]: Awaiting robot initialization at IP 10.54.4.16...
[ur_hardware_awaiter.py-9] [INFO] [1780293665.407508626] [ur_hardware_awaiter]: System is still initializing. Retrying in 10.0 seconds...
Once the UR client library is fully connected and the hardware interface is initialized, the awaiter unblocks the spawners:
[ur_hardware_awaiter.py-9] [INFO] [1780293864.409172749] [ur_hardware_awaiter]: Service found in registry. Pinging to verify it is responsive...
[ur_hardware_awaiter.py-9] [INFO] [1780293864.505356299] [ur_hardware_awaiter]: Service responded successfully. Controller spawner is unblocked.
[INFO] [ur_hardware_awaiter.py-9]: process has finished cleanly [pid 66379]
[INFO] [spawner-10]: process started with pid [68274]
[INFO] [spawner-11]: process started with pid [68275]
If the robot is already fully booted, the initial check passes instantly without triggering the 10-second wait:
[ur_hardware_awaiter.py-9] [INFO] [1780295496.805358990] [ur_hardware_awaiter]: Awaiting robot initialization at IP 10.54.4.16...
[ur_hardware_awaiter.py-9] [INFO] [1780295497.058702224] [ur_hardware_awaiter]: Service found in registry. Pinging to verify it is responsive...
[ur_hardware_awaiter.py-9] [INFO] [1780295497.060046771] [ur_hardware_awaiter]: Service responded successfully. Controller spawner is unblocked.
[INFO] [ur_hardware_awaiter.py-9]: process has finished cleanly [pid 83394]
If use_mock_hardware:=true, the hardware awaiter will not be used:
Related Issues & Comments
list_controllers service alone could technically suffice, the TCP socket check is included as a preliminary precaution to provide better diagnostic logs.
Testing
It has been tested on a real robot running PolyScope X 10.13.0 and in URSim 5.25.1.
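The preliminary TCP socket check and the periodic retry described above could look roughly like the sketch below. It uses only the Python standard library and is not the code from this PR: the function names, the attempt limit, and the connection timeout are assumptions, and the port to probe is left as a required argument because the description does not state it.

```python
import socket
import time


def is_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def wait_until_reachable(host: str, port: int, retry_period_s: float = 10.0,
                         max_attempts: int = 30, sleep=time.sleep) -> bool:
    """Poll host:port until it accepts a connection or attempts run out.

    `max_attempts` is an assumed limit; `sleep` is injectable so the loop
    can be exercised without real delays.
    """
    for attempt in range(1, max_attempts + 1):
        if is_reachable(host, port):
            return True
        if attempt < max_attempts:
            print(f"System is still initializing. Retrying in {retry_period_s} seconds...")
            sleep(retry_period_s)
    return False
```

Because the first probe runs before any sleep, an already booted robot passes immediately, which matches the behavior shown in the logs above.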
Note
Medium Risk
Changes default bringup ordering for real hardware; failed awaiter now blocks all controller loading, though behavior is easier to diagnose than silent spawner crashes.
Overview
Fixes a bringup race where controller spawners started in parallel with ros2_control_node and could fail when /controller_manager/list_controllers was not yet available.
Adds ur_controller_manager_awaiter.py, which loops until ListControllers is reachable and returns successfully (with retries/timeouts). It exits immediately with success when use_mock_hardware is true so simulation startup stays unchanged.
ur_control.launch.py no longer starts spawners at launch: it runs the awaiter first and uses RegisterEventHandler/OnProcessExit to start the active and inactive spawner nodes only after the awaiter exits with code 0; a non-zero exit logs a message and skips spawning.
CMakeLists.txt installs the new script under lib/ur_robot_driver.
Reviewed by Cursor Bugbot for commit 90ccc21. Bugbot is set up for automated code reviews on this repo.
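The exit-code gating summarized in the overview can be illustrated with a plain function. This is a hedged sketch rather than the launch file's actual code: `actions_after_awaiter` is a hypothetical helper and the log wording is invented. In a real launch file, logic like this would presumably sit inside the callable passed as `on_exit` to `OnProcessExit`, with the return code read from the process-exit event.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def actions_after_awaiter(returncode: int, spawners: Sequence[T]) -> List[T]:
    """Return the actions to start once the awaiter process has exited.

    Exit code 0 releases every spawner; any other code releases none.
    """
    if returncode == 0:
        return list(spawners)
    # Illustrative message only; the PR's actual log text is not shown here.
    print(f"Awaiter exited with code {returncode}; skipping controller spawners.")
    return []
```

Keeping the decision in a small pure function makes the "skip on failure" branch easy to unit-test without starting a launch system.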