HailoRT Multiple Devices?

I’ve been testing out the HailoRT API, in an attempt to build a basic example app that uses more than one Hailo8 device to speed up inference on a single model. I built the object_detection example and altered the HailoInfer class initialization to specify which Hailo devices to use when creating the VDevice:

hailort::VDevice::create(const std::vector<std::string> &device_ids);

This works, and I can verify that the specified device list is being used by watching hailortcli monitor.
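For reference, here is roughly how I'm selecting devices (a sketch only, not compiled here; I'm assuming hailort::Device::scan() for enumerating PCIe device IDs, and the VDevice::create overload above):

```cpp
// Sketch, not a complete program. Device::scan() is an assumption on my part;
// VDevice::create(device_ids) is the overload quoted above.
#include "hailo/hailort.hpp"

auto all_ids = hailort::Device::scan();                 // e.g. {"0000:01:00.0", ...}
std::vector<std::string> chosen(all_ids->begin(),
                                all_ids->begin() + 2);  // pick the first two chips
auto vdevice = hailort::VDevice::create(chosen);        // VDevice spanning those two
```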

However, the fps throughput does not change whether I assign 1, 2, 3, or 4 connected devices; I always get ~280 fps. Interestingly, per-device utilization drops with each device I add.

Here is a chart showing the results I found in the hailortcli monitor:

1 device: 95%
2 devices: 45% / 45%
3 devices: 30% / 30% / 30%
4 devices: 23% / 23% / 23% / 23%

It seems the work is being spread across the devices, which decreases per-device utilization but still completes the inference job in the same amount of time as with fewer devices.

It was my understanding that adding devices should allow the Hailo scheduler to fully utilize all of the devices. I am writing a more general inference manager to do this now, but if anyone at Hailo can chime in as to why this behavior occurs, it would be very helpful.

Perhaps there is a different design pattern to use? If so, I’d love to see an example. I’d love to make use of your scheduler if I can. No sense reinventing the wheel.

Thanks in advance for any help.

Hi @KromIsGood, welcome to the Hailo Community! :grin:

It’s possible the bottleneck is coming from another part of the app. A good first step is to check the FPS when running inference only. You can test this using 4 devices with:
hailortcli run2 --device-count 4 set-net HEF_PATH

Hi @nina-vilela - thanks for your response. I tried out the hailortcli run2 command, and it seems that the model I’m using doesn’t scale well with more than one device:

$ hailortcli run2 --device-count 1 set-net yolov8n.hef
yolov8n: fps: 426.76
$ hailortcli run2 --device-count 4 set-net yolov8n.hef
yolov8n: fps: 550.50

I would have assumed that this should get 3-4x performance scaling with four devices. Is this not correct?

I did some more investigating after reading this post. The initial PC I was testing on used an external Thunderbolt-to-PCIe enclosure, which inherently limits the connection to four lanes. I had assumed the Hailo devices would need less bandwidth than they apparently do.

I moved the devices to a PCIe x8 switch card (all four M.2 Hailo8 devices), like the setup the user described in the post above. Running the test again, I got ~760 fps. Faster, but I still get the same result whether I set the device count to 1, 2, 3, or 4. Huh.

After this, I tried a larger model: yolov8m. With this model I started seeing scaling:

fps = 66, 132, 193, 240 with 1-4 devices. Fairly linear.

At this point I guess the limit may be PCIe bandwidth. But I'm still uncertain; it could also be that small models like YOLOv8 nano are simply too small to scale across multiple devices.

I then tested some more HEF models and compared the results with the model zoo performance expectations. The following models all seemed to scale and match the fps performance fairly well:

yolov8m, yolov8m_pose, vit_pose_small, mspn_regnetx_800mf

Next, I removed the devices from the PCIe M.2 switch card to verify that it wasn't somehow throttling performance. Two Hailo8 devices, each plugged into a verified x16 PCIe gen3 slot, gave the same result for yolov8n: ~750 fps with both --device-count 1 and --device-count 2.

At this point, it seems like there may be a scaling issue with the small yolov8n model. Any comments or suggestions?

@KromIsGood Great testing, we really appreciate you sharing your results with us.

We will look into it, thank you.

@nina-vilela Thanks for responding - it’s nice that you all are active in the community forum! Very helpful.

One note, in case it could be a factor: the PC host is an i9-7900X on the X299 chipset, with 44 PCIe lanes. I mention it in case the CPU generation or its PCIe implementation affects performance. I will soon be testing on a newer Threadripper Pro host.