I finally got my hands on one of those Raspberry Pi AI Hat+ kits. For the uninitiated, this is a first-party hat which contains a Hailo 8 or 8L processor that can be used to accelerate computer vision tasks. Those videos that you’ve seen of people doing real-time computer vision on a Pi are probably using this hardware. Well, I like to do real-time computer vision too!
Real-Time Flower Counting: The Past
I previously developed a real-time flower counting approach for MARS that relies on three cameras. I implemented a multi-object tracking pipeline that runs on the robot’s Nvidia Jetson AGX controller. It detects flowers in every frame from the cameras using a YOLO object detector, and then uses a custom association model to handle the tracking. After filtering duplicate counts from multiple cameras and fusing the results with GPS data, I was eventually able to localize all the flowers in a field. Neat!

One of the problems I found with this approach is scalability. Surprisingly, the detector turns out to be the most computationally expensive part of the pipeline, primarily because we have to run it on every frame from every camera. Initially, I ran the detector on the Jetson’s GPU, which worked well. The Jetson is able to keep up with three cameras. But as soon as you start adding more cameras, detection quickly becomes a bottleneck.
Real-Time Flower Counting: The Present
How to get around the detection bottleneck? My solution was to push detection onto the cameras, freeing up the Jetson for bigger and better things. This is where the AI Hat+ with its Hailo 8 comes in. The second revision of the MARS camera module features, among other upgrades, an integrated AI Hat+. When I designed it, the idea was that, sometime in the nebulous future, this hardware would enable on-camera detection. Of course, this turned out to be more trouble than expected.

Software Implementation
Internally, the cameras are running a lightly-customized version of rpicam-apps, with a ROS layer on top. Because rpicam-apps already includes software support for running object detectors on the Hailo, integrating Hailo-based detection into the MARS camera software was relatively easy. I did have to modify the provided code slightly, in order to enable the extraction and transmission of raw image features from the YOLO model as well as bounding boxes. This is necessary because image features are a required input for the track association model that runs on the Jetson.
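As a rough sketch of what the camera ends up publishing per frame, each detection carries its bounding box plus the raw feature vector that the Jetson-side association model consumes. The names and message layout here are hypothetical illustrations, not the actual MARS code:

```python
# Hypothetical per-frame payload from the camera: YOLO bounding boxes
# plus a raw feature vector per detection, flattened into a dict that
# maps onto a ROS message. Names/layout are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Detection:
    box: tuple            # (x1, y1, x2, y2) in pixels
    score: float
    class_id: int
    features: list = field(default_factory=list)  # raw image features for association

def pack_frame(frame_id: int, stamp_ns: int, detections: list) -> dict:
    """Flatten detections into a dict mirroring a ROS message layout."""
    return {
        "frame_id": frame_id,
        "stamp_ns": stamp_ns,
        "boxes": [d.box for d in detections],
        "scores": [d.score for d in detections],
        "class_ids": [d.class_id for d in detections],
        # One feature vector per detection; the track association model
        # running on the Jetson consumes these directly.
        "features": [d.features for d in detections],
    }
```

The point is just that boxes alone aren't enough; the feature vectors have to ride along in the same message so tracking stays in sync with detection.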

Hailo Compilation
Clearly, running an existing model on the Hailo, and even integrating it into ROS, is not overly difficult. The main difficulty is getting your custom model to run on the Hailo. “But Daniel,” I hear you ask. “I built my model with PyTorch! Surely I can just run model.to('hailo') or something.” Not so fast!
The Hailo 8 might live on the PCIe bus like a GPU, and it might do the fancy AI things like a GPU, but it is not a GPU. It is not really a general-purpose processor at all. From what I understand, it is an extremely specialized piece of hardware with its own integrated memory and custom instruction set. As such, “executables” for the Hailo (which are really models with included weights) are stored in .hef files. What is a .hef file? Who knows, really? All we know is that we give the proprietary Hailo compiler our model, and it spits out a .hef file. Then we give that to the Hailo accelerator and, ideally, it does all the model-y things.
But I’m way oversimplifying, of course…
Hailo compilation is actually a three-part process. You start with an ONNX version of your model (which should be easy to get if you are using YOLO). First, you parse the ONNX into a .har file. Then, you optimize the .har file. Finally, you generate the .hef file. Let’s look at each of these steps in more detail.
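The overall flow can be sketched as three calls against a compiler “runner” object. The real tool is Hailo’s Dataflow Compiler (its Python SDK exposes a ClientRunner class); the method names below mirror that API but are illustrative rather than verbatim, so treat this as a map, not a recipe:

```python
# Sketch of the three-step Hailo compilation flow (parse -> optimize ->
# compile), written against a generic runner object. Method names are
# modeled on Hailo's SDK but are assumptions, not guaranteed verbatim.

def onnx_to_hef(runner, onnx_path, model_name, end_node_names,
                calib_data, har_path, hef_path):
    # Step 1: parse the ONNX model into a Hailo Archive (.har).
    runner.translate_onnx_model(onnx_path, model_name,
                                end_node_names=end_node_names)
    runner.save_har(har_path)
    # Step 2: post-training quantization against representative images.
    runner.optimize(calib_data)
    # Step 3: compile to the .hef binary that runs on the accelerator.
    hef_bytes = runner.compile()
    with open(hef_path, "wb") as f:
        f.write(hef_bytes)
    return hef_path
```

Each step has its own failure modes and hardware demands, which is what the rest of this section is about.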
Note that my own method is heavily based on this example. I don’t know who this guy is, but I owe him my life. Or some sexual favors, at least. If you want to reproduce my method, look there for details (at least until I am allowed to make my own conversion scripts public).
Parsing the Model
In this step, the compiler will parse the ONNX model. The model will be converted to a “Hailo Archive” (.har) file, which is what will be used for the rest of the process. Performing this step requires you to know the end node names of your model, which can be a little tricky. Different YOLO variants will have different end nodes. You typically have to load the model in Netron and poke around near the bottom until you find something that looks right. For my specific use-case, I also added an additional end node for the image features that I wanted to extract.
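If you’d rather not eyeball Netron, you can at least narrow down candidates programmatically: the nodes that produce the graph’s declared outputs are usually the end nodes you want, or sit right next to them. This sketch is written against a minimal graph interface that mirrors a loaded ONNX graph (nodes with names and output tensor lists), without assuming exact onnx-package details:

```python
# Narrow down end-node candidates: report the nodes whose output tensors
# are declared graph outputs. The graph interface here (graph.node,
# graph.output, node.name, node.output) mirrors an ONNX GraphProto.

def find_end_nodes(graph):
    """Return names of nodes whose outputs are graph outputs."""
    graph_outputs = {o.name for o in graph.output}
    end_nodes = []
    for node in graph.node:
        # A node is an end-node candidate if any tensor it produces
        # is one of the graph's declared outputs.
        if any(out in graph_outputs for out in node.output):
            end_nodes.append(node.name)
    return end_nodes
```

For a custom end node (like my image-feature tap), you still have to know which internal tensor you want; this only saves you the hunt for the standard outputs.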
This step isn’t particularly computationally intensive and can be run on pretty much any machine. In fact, I had trouble running it on HiperGator because it apparently didn’t respect the CPU limits for the job: it tried to use all 192 or whatever CPUs on the node, ran out of memory, and crashed. Brilliant.
Optimizing the Model
The next major step in the compilation process is optimizing the model. This step performs some sort of post-training quantization process, which requires some representative calibration data. I’ve found empirically that the choice of calibration data can have a significant effect on the final model’s performance. I always use the maximum amount (1024 images), and make it as representative as possible of the data the model will see at inference time. For instance, for the MARS camera, I typically try to use calibration data captured with the actual camera.
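For concreteness, here’s a minimal, hypothetical sketch of assembling a calibration set: up to 1024 representative frames stacked into a single array. The (H, W, 3) uint8 layout is an assumption on my part; check what your compiler version actually expects:

```python
# Sketch of building a calibration dataset for the optimization step.
# The (N, H, W, 3) uint8 layout is an assumption; in practice you would
# load real frames captured with the target camera, not synthetic data.
import numpy as np

def build_calib_set(images, size=(640, 640), max_images=1024):
    """Stack pre-resized uint8 HWC images into a (N, H, W, 3) array."""
    batch = []
    for img in images:
        # Images must already match the model's input resolution.
        assert img.shape == (*size, 3), "resize images before calibration"
        batch.append(img.astype(np.uint8))
        if len(batch) >= max_images:
            break
    return np.stack(batch, axis=0)
```

The part that actually matters is upstream of this code: which 1024 images you feed in. Frames from the deployment camera, under deployment lighting, beat generic dataset images every time.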
This step is very computationally intensive. As a matter of fact, during the quantization step, the compiler fine-tunes the model directly, which necessitates an Nvidia GPU for any decently-sized model. This would be less annoying if the Hailo compiler didn’t use an old-ass version of TensorFlow. Because of this, I usually don’t even bother trying to run this on the B200 nodes on HiperGator (which are known to get snippy about older versions of CUDA). Initially, I had some issues with running out of VRAM, but luckily, the Hailo compiler allows you to force a smaller batch size during the optimization process. I also sometimes have problems with running out of conventional RAM, but I haven’t found a way around that. (Specifically, some of the larger models I’ve compiled have a peak memory usage of >120 GB during optimization.)
Building the HEF
Once the model is optimized, it is time to use the compiler to build a .hef file. This is what will run on the Hailo device itself. This process can take hours, but it doesn’t require a GPU. I have seen it use a lot of CPU cores, though.
Annoyingly, this step in the process can fail after running for hours. In my experience, this usually happens with large models, when the compiler is struggling to find a way to fit the model within the accelerator’s resource constraints. In this situation, my advice is to first try reducing memory usage by decreasing input resolution. Obviously, this decreases the quality of the outputs, but it can have a dramatic effect on both whether the model compiles at all and how fast it ultimately runs.
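The payoff is roughly quadratic: for a convolutional detector, activation memory and compute both scale approximately with the number of input pixels, so a modest resolution drop buys a lot. A quick back-of-the-envelope helper:

```python
# Back-of-the-envelope for why dropping input resolution helps so much:
# activation memory and compute for a convolutional detector scale
# roughly with input pixel count, so 640x640 -> 384x384 is ~2.8x less.

def pixel_scale_factor(hw_old, hw_new):
    """Approximate memory/compute reduction from a resolution change."""
    return (hw_old[0] * hw_old[1]) / (hw_new[0] * hw_new[1])
```

This is a crude model (it ignores fixed per-layer overheads and the weights themselves), but it matches the direction of what I see: resolution is the single biggest lever when the compiler can’t fit a model.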

Lessons from the Hailo
In summary, getting stuff to compile on the Hailo is really annoying, but you can usually get it to work eventually if you’re stubborn. That being said, it is clear that certain types of models will be much easier to compile than others. For instance, it seems that YOLO models are pretty well-supported on the Hailo, even when you customize them. I suggest having a look at the Hailo Model Zoo to get a feel for what types of models are supported.
I had a much worse experience trying to get my EfficientPlantSAM model to work on the Hailo 8. This model is a vision transformer, which the Hailo seems to struggle with a lot more than a normal CNN. I had difficulty with the initial parsing, and the optimization step used way too much memory, but I managed to work around that. Finally, I got to the .hef generation step and found that my model would reliably cause the compiler to hard crash.
After deploying some choice profanities, I regrouped and found that I was able to successfully compile a portion of the model, but the memory requirements were so astronomical that I had to reduce the input size to nearly a third of the original. But after all that, it ran on the Hailo 8 at 40 FPS!
So, at the end of the day, the Hailo 8 is an interesting little chip, but it’s not the right tool for everything. A Jetson with TensorRT is still going to be your friend for anything large and/or transformer-y, despite the extra cost and power consumption. You can bet I’ll continue to use the Hailo in the future, though. And I’m not just saying that because I (technically Dr. Li) spent $960 on AI Hats.