I did a previous post on the CANBus, which can be summed up as:

  • CAN is better than the alternatives for robotics
  • Switching to CAN is actually kind of complicated

I thought I covered the hardware side of things pretty well in that earlier post, and in this one, I was planning on talking about the software changes. As usual, fate had other plans. The upshot is that we’re still going to talk about software, but then I’m going to take a few paragraphs to whine about all the problems we’ve had with our hardware setup over the past few weeks.

I think we’re really getting into the Christmas spirit around here.

Software Stack

All of the hardware on the robot (most critically, the BMS and motors) support communication over CAN and USB. However, as you might have guessed, the protocols are quite different. Specifically, our hardware uses CANopen, which is several layers of protocols built on top of CAN that attempt to define a standard, interoperable interface for many different types of equipment. ROS has a third-party CANopen driver which is built on top of SocketCAN in the Linux kernel and works on the Jetson. However, this library does not support all the features we needed, so Rui decided to write his own.

In software engineering, the phrase “I decided to write my own” usually marks the start of some kind of journey, and this case did not disappoint. Rui’s CANopen library was hacked together with ChatGPT. It was enough to move the wheels, but it had some race conditions that could quickly ruin your day if you tried to do anything more complicated. Specifically, it did not handle multiple processes trying to read from the same SocketCAN interface very well.

To fix that, I split out the CAN reading into its own separate ROS node. This node is called "can_bridge" and does nothing except read from the SocketCAN interface and write all the frames it receives to a ROS topic. ROS topics automatically handle the annoying multi-consumer bookkeeping (ensuring every consumer sees every incoming message), which makes them much better suited to distributing CAN frames than simply having every process read directly from the CAN interface.
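The fan-out behavior that a topic provides can be sketched in plain Python (the names here are illustrative, not the actual can_bridge code, and a real ROS topic does this across processes rather than in one):

```python
from collections import deque

class FrameTopic:
    """Minimal stand-in for a ROS topic: every subscriber gets its own
    queue, so one consumer reading a frame never steals it from another."""
    def __init__(self):
        self._queues = []

    def subscribe(self):
        q = deque()
        self._queues.append(q)
        return q

    def publish(self, frame):
        # Copy the frame into every subscriber's queue.
        for q in self._queues:
            q.append(frame)

topic = FrameTopic()
a, b = topic.subscribe(), topic.subscribe()
topic.publish((0x181, b"\x01\x02"))
# Both consumers now see the same frame -- unlike two readers racing
# on a single shared SocketCAN socket.
```

This is the whole trick: can_bridge is the only process that touches the socket, and everyone else just subscribes.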

I also wrote a class called “CanClient” that abstracts all of this away and presents a nice interface for reading CAN frames that has similar semantics to raw SocketCAN. Internally, it uses an asynchronous spinner to read messages from the topic into a queue, and then returns those messages one-by-one when someone calls the “Read()” method. You can also tell the CanClient to ignore any messages not coming from a specific CAN node, which is useful for implementing CANopen.
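A rough sketch of the CanClient semantics (names lowercased and simplified; on the robot the frames arrive via a ROS subscriber callback on an async spinner, not by hand):

```python
from collections import deque

class CanClient:
    """Sketch of the CanClient interface: frames arrive asynchronously
    and read() hands them out one at a time, optionally filtered by
    CANopen node ID (the low 7 bits of the 11-bit COB-ID)."""
    def __init__(self, node_id=None):
        self._queue = deque()
        self._node_id = node_id

    def on_frame(self, can_id, data):
        # In the real client, this is the topic-subscriber callback.
        if self._node_id is not None and (can_id & 0x7F) != self._node_id:
            return  # ignore frames from other CANopen nodes
        self._queue.append((can_id, data))

    def read(self):
        """Return the next queued frame, or None if the queue is empty,
        mimicking a non-blocking SocketCAN read."""
        return self._queue.popleft() if self._queue else None

client = CanClient(node_id=0x05)
client.on_frame(0x585, b"\x43\x00\x10\x00\x92\x01\x02\x00")  # from node 5: kept
client.on_frame(0x586, b"\x00" * 8)                          # from node 6: dropped
```

The node-ID filter is what makes implementing CANopen on top of this pleasant: an SDO client can subscribe to exactly one server's replies and ignore the rest of the bus.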

Diagram of the CAN software stack. I added an abstraction layer that sits between the CANopen driver and the kernel SocketCAN driver.

Importantly, this abstraction layer sits between SocketCAN and Rui’s original CANopen code. It operates on raw CAN frames and is not aware of CANopen in any way. Diverting every single CAN frame through ROS does add some latency, but in practice, we’ve found that it’s not enough to have practical consequences. This architecture can handle thousands of frames per second without breaking a sweat.

Hardware Issues

In my previous post, I remember making a flippant comment about how the wireless link between the two sides of the robot wasn’t actually a reliability problem, because it had very good latency. Boy, did that age poorly.

One of the types of CANopen services is called an SDO, which allows a client device to read and write values in an object dictionary on a server device. Like most client-server protocols, this one has a fair amount of overhead, and requires several frames to go back and forth in order to complete the request. If any of those frames go MIA, the request will usually fail. SDOs are used extensively on MARS, most notably when commanding a position/velocity on the motors.
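For a concrete sense of the framing, here's a minimal expedited SDO read request per the CiA 301 spec (the object index is just an example; this isn't MARS code):

```python
import struct

def sdo_upload_request(node_id, index, subindex):
    """Build the 8-byte 'initiate upload' (read) request for an expedited
    SDO transfer, per CiA 301. The client sends it on COB-ID 0x600 + node,
    and the server answers on 0x580 + node -- so even a single small read
    costs a full round trip, and losing either frame times the request out."""
    can_id = 0x600 + node_id
    # ccs=2 (initiate upload) in the top bits of the command byte -> 0x40.
    data = struct.pack("<BHB4x", 0x40, index, subindex)
    return can_id, data

# e.g. reading object 0x6041 from node 5
can_id, data = sdo_upload_request(0x05, 0x6041, 0x00)
```

Larger values need a segmented transfer with even more back-and-forth, which is why SDOs are so sensitive to dropped frames.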

Pretty much from the point where we first started using CAN, I’ve noted some reliability issues with SDOs. Frequently, these requests would just start timing out, or failing with weird errors. Physically, this manifested as motors on the robot occasionally behaving erratically or going into protective stop mode, which is not something you want to happen while the robot is driving and the department head is watching. An important clue was that this only happened when there was a large amount of traffic on the bus. However, the smoking gun was when I temporarily replaced the wireless link with a physical connection between the two sides of the robot, and saw the errors completely vanish.

I noticed something else from that experiment, which was that over the physical connection, the robot was trying to send >2000 messages per second. Even taking our one millisecond latency on the wireless link at face value, this kind of traffic would still be enough to overload it. Therefore, I figured that what was happening was not so much wireless unreliability, per se, and more that the poor Arduinos simply couldn't keep up with the traffic on the bus.
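Back-of-the-envelope, assuming the Arduino relays one frame per packet, store-and-forward:

```python
bus_frames_per_sec = 2000                       # observed over the wired link
frame_interval_ms = 1000 / bus_frames_per_sec   # a new frame every 0.5 ms

per_packet_ms = 1.0                             # measured wireless latency
max_relay_rate = 1000 / per_packet_ms           # ~1000 frames/s ceiling if the
                                                # bridge handles one frame at a time
backlog_per_sec = bus_frames_per_sec - max_relay_rate  # frames piling up every second
```

With frames arriving twice as fast as the bridge can forward them, queues overflow and frames get dropped no matter how "reliable" the radio itself is.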

I decided to attack the problem from both sides: increasing the bandwidth of the wireless link, while simultaneously reducing the demands on it. For the first goal, I modified my hand-built wireless protocol to pack more than one CAN frame into each ESPNow packet. This reduced the overhead a bit and allowed us to eke out a bit more throughput. For the second goal, I bit the bullet and made the Arduinos partially aware of the CANopen protocol, so that they could automatically filter out frames that didn’t need to go to the other side of the robot. I’m not a huge fan of weakening the separation between protocol layers, but this did significantly reduce traffic. Even more impactful, however, were changes to the ROS motor driver code that prevented it from sending new motor commands unless the commanded velocity had actually changed.
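The frame-packing change is easy to sketch. ESP-NOW caps each packet's payload at 250 bytes; the wire format below (4-byte little-endian CAN ID, 1-byte length, then data) is illustrative rather than MARS's actual protocol:

```python
import struct

ESPNOW_MAX_PAYLOAD = 250  # ESP-NOW's per-packet payload limit

def pack_frames(frames):
    """Pack as many CAN frames as fit into one ESP-NOW payload.
    Each frame is encoded as: 4-byte LE CAN ID, 1-byte length, data.
    Returns (payload, remaining_frames) so the caller can send the rest
    in the next packet."""
    payload = bytearray()
    for i, (can_id, data) in enumerate(frames):
        encoded = struct.pack("<IB", can_id, len(data)) + data
        if len(payload) + len(encoded) > ESPNOW_MAX_PAYLOAD:
            return bytes(payload), frames[i:]
        payload += encoded
    return bytes(payload), []

# A full 8-byte frame costs 13 bytes encoded, so one packet can carry
# 250 // 13 = 19 frames instead of 1, amortizing the per-packet latency.
frames = [(0x180 + n, bytes(8)) for n in range(25)]
payload, leftover = pack_frames(frames)
```

Batching like this trades a little latency (waiting to fill a packet) for a large multiple of effective throughput, which is exactly the trade we wanted here.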

Taken together, these interventions were enough to at least improve the situation. I’m still seeing occasional SDO request failures, but they don’t seem to be frequent enough to impact the operation of the robot. The whole thing is a work in progress, and I’ll probably continue to modify the code in the search for perfect reliability. Then again, Dr. Li managed to break the robot HMI last week just by poking at it for 30 seconds, so CANBus may not actually be our biggest reliability problem.

I tested the robot in the parking lot at Frazier-Rogers with all my modifications, and it seemed to work okay. Maybe it’s ready for an actual field now? We’ll see.

MARS-X driving in a parking lot.
