Once again, it’s time to learn from my mistakes.

ROS has become the standard way to build software for robots. Qt is a common and easy-to-use framework for creating GUIs that runs on almost anything with a display. Why might you want to use the two of them together? Perhaps your robot, like ours, has a human-machine interface. Maybe there’s some sort of display that users can interact with in order to control the robot. In this situation, you might turn to something like Qt to make your life easier. But if you’re not careful, it could actually make your life a lot harder.

ROS and Qt can clearly be integrated: several of ROS’ own tools, including RViz and rqt, are built on Qt. It’s possible that this is all trivial if you are using C++. I haven’t tried that. I’m using Python (specifically the PySide6 bindings), and it has not been smooth sailing. My goal is to share some of what I’ve learned, so that you can (hopefully) avoid making the same errors.

The Mysterious Crashing Problem

Early last year, we started to have a problem with the MARS-X HMI crashing. It would reliably crash if you let it run for long enough. Usually, “long enough” was the annoying threshold of around 30 minutes.

What’s more, the crashes were brutal. The HMI is written as a Python application, which uses Qt and runs as a ROS node on the robot controller. However, instead of crashing nicely with a stack trace, as a Python application should, it was segfaulting. Poking around in the debugger, I saw that the crash seemed to be happening deep within a Qt painting routine.

Then things got fun.

Over the course of a week, I started debugging this problem in earnest. Initially, I had a huge amount of trouble reproducing the bug in the lab. As soon as I removed the controller from the rest of the robot and tried to run the HMI application in isolation, it would work perfectly. I found this endlessly frustrating, until I stumbled upon a reliable way to reproduce the crash: if I connected a camera module to the controller and set the HMI to display a live feed from it, the application would crash within minutes!

The HMI application displays a live camera feed.

This was my first clue as to what was going on. The initial debugging had led me to believe the problem was with Qt, but now it was becoming clear that ROS was involved too. Over the course of an unhappy afternoon spent with the robot at the Citra research farm, I finally had a breakthrough when I started digging into ROS subscribers.

Threading Troubles

So, first a quick primer on how ROS subscribers work. To subscribe to a topic in ROS, you have to create a subscriber, and then call rospy.spin(), like this:

rospy.Subscriber(topic, MessageType, callback)
rospy.spin()

The existence of the blocking rospy.spin() call might lead you to believe that this is implemented under the hood using a select() call, or something like that. In other words, you might be tempted to believe that the spinner is running an event loop, and callback() is being run in the same thread. Not so! Internally, rospy uses threads to implement subscribers, so callbacks will not run in the main thread; rospy.spin() just blocks to keep the node alive until shutdown. (As a matter of fact, roscpp, the C++ client library, also supports explicitly multi-threaded spinners that make this more obvious.) This by itself can be a source of weird race conditions when multiple subscribers touch the same data, but that’s not my main concern here.
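To make the threading point concrete without needing a ROS install, here is a minimal sketch that uses plain Python threads. FakeSubscriber is a hypothetical stand-in that I made up for illustration; it is not how rospy is actually implemented, it only mimics the one behavior that matters here: the callback runs on a thread other than the one that set up the subscription.

```python
import threading


class FakeSubscriber:
    """Hypothetical stand-in for a subscriber (NOT rospy's real implementation)."""

    def __init__(self, callback):
        self._callback = callback

    def deliver(self, msg):
        # Mimic rospy's behavior: run the callback on a background thread
        # instead of on the thread that created the subscriber.
        worker = threading.Thread(target=self._callback, args=(msg,))
        worker.start()
        worker.join()


def callback_ran_on_caller_thread():
    """Return True if the callback executed on the thread that called us."""
    callback_thread_ids = []

    def callback(msg):
        # Record which thread the callback is actually running on.
        callback_thread_ids.append(threading.get_ident())

    FakeSubscriber(callback).deliver("frame")
    return callback_thread_ids[0] == threading.get_ident()
```

In a real node, the equivalent check is to log threading.current_thread().name from inside the subscriber callback and compare it with the name logged from the main thread.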

Internally, the HMI application was pretty simple. Whenever we needed to react to data coming from ROS (such as displaying an incoming image from a camera), we would create a ROS subscriber. The callback for this subscriber might do all kinds of things, but eventually, it would start manipulating the Qt UI (by painting the camera frame, for example).

How we handled updates from ROS in the original HMI application.

Eventually, I realized that the callback was being called in a separate thread, and this wasn’t playing nice with Qt. I’m not sure exactly what was going wrong inside Qt, but its GUI classes are not thread-safe: widgets and painting are only supposed to be touched from the main (GUI) thread, and you’re supposed to use the signals and slots mechanism to pass data between threads instead of just YOLOing it. Evidently, there was some kind of race condition that was causing Qt to blow up whenever we touched the painting code from another thread.

To get around this, I had to very carefully isolate ROS and Qt. In the current version of the dashboard, whenever a ROS subscriber callback is triggered, the first order of business is to write the message to a queue. I used Qt’s QThread class to spawn a separate thread that’s dedicated to handling the messages from this queue. Since QThreads are Qt objects, we can safely use signals and slots on them. My thread has a signal, appropriately called new_message, which gets emitted every time we pop a message off the queue. This signal can then be connected to a corresponding slot in the code that updates the UI.

How we currently implement updates to the UI triggered by ROS.

Qt’s signal/slot mechanism handles thread-safety internally (as long as you’re working with Qt objects): when a signal is emitted from a different thread than the one the receiving object lives in, Qt queues the call and runs the slot in the receiver’s thread. This new design therefore completely sidesteps the race condition. It doesn’t matter that ROS runs its callbacks in a separate thread, because all the callback does now is write to a queue.

Wrapping Up

This single change was enough to fix the crashing problem. It has never happened again, despite the HMI application getting a lot more complex and pulling data from more sources. Overall, I wouldn’t say that Qt and ROS are happy around each other, but at least they’re not trading blows.

Why was connecting the camera enough to reproduce this issue? That’s pretty simple. Out of all the HMI functionality, the camera generates by far the most ROS messages. A single camera generates one message per frame at 30 FPS, and we can have up to six of them! The HMI code is written such that each message triggers an update to the UI. Therefore, when you’re displaying a camera feed, you have a lot of chances to hit this race condition.

So, what did we learn? That multithreaded programming sucks, but we should have already known that. I’m not sure if anyone will see this, but if I can save at least one other person from having to debug this issue, I’ll feel vindicated.

Hey, maybe next time I’ll write about the issue where certain Python packages were breaking OpenGL acceleration in Qt! Boy, was that a trip…
