I did some more investigating after reading this post. The initial PC I was testing on used an external Thunderbolt to PCIe connection, which inherently limits the number of lanes to four. I had assumed that Hailo devices would need less bandwidth than they apparently do.
I moved the devices to a PCIe x8 switch card (all four m.2 Hailo8 devices), like the setup described by the user in the link above. Running this again, I got fps=760. Faster, but I get the same result when setting the number of devices to 1, 2, 3, or 4. Huh.
After this, I tried a larger model: yolov8m. With this model I started seeing scaling:
fps = 66, 132, 193, 240. Fairly linear.
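To put a number on "fairly linear", here is a small sketch that turns the measured fps values into speedup and per-device efficiency relative to the single-device run. The function name `scaling_efficiency` is just an illustrative helper, not part of any Hailo tool, and the yolov8n series assumes roughly 760 fps at every device count, as reported above.

```python
def scaling_efficiency(fps_by_device_count):
    """Return (devices, fps, speedup, efficiency) tuples.

    fps_by_device_count[i] is the measured fps with i + 1 devices.
    Speedup is relative to the single-device run; efficiency is
    speedup divided by the number of devices (1.0 = perfectly linear).
    """
    baseline = fps_by_device_count[0]
    rows = []
    for devices, fps in enumerate(fps_by_device_count, start=1):
        speedup = fps / baseline
        rows.append((devices, fps, speedup, speedup / devices))
    return rows


# Measurements from this thread (device counts 1 through 4).
measurements = {
    "yolov8m": [66, 132, 193, 240],
    "yolov8n": [760, 760, 760, 760],  # assumption: ~760 fps at every count
}

for model, series in measurements.items():
    print(model)
    for devices, fps, speedup, efficiency in scaling_efficiency(series):
        print(f"  devices={devices} fps={fps} "
              f"speedup={speedup:.2f}x efficiency={efficiency:.0%}")
```

For yolov8m the efficiency stays above 90% up to four devices, while for yolov8n it drops in proportion to the device count, since the fps does not change at all.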
My guess at this point is that the limit may be determined by PCIe bandwidth. But I’m still uncertain. Could it be that small models like yolov8n are simply too small to scale across multiple devices?
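As a rough sanity check on the bandwidth guess, the sketch below estimates how much input data has to cross PCIe per second at a given frame rate. It rests on assumptions that are not from my measurements: a 640x640x3 uint8 input tensor for both models, output tensors and protocol overhead ignored, and the theoretical PCIe gen3 per-lane rate (8 GT/s with 128b/130b encoding, about 985 MB/s). The helper names are made up for illustration.

```python
# Theoretical PCIe gen3 payload rate per lane: 8 GT/s with 128b/130b
# encoding, converted to bytes/s. Real throughput is lower because of
# protocol overhead, which this estimate ignores.
PCIE_GEN3_LANE_BYTES_PER_S = 8e9 * (128 / 130) / 8


def input_bandwidth_bytes_per_s(fps, width=640, height=640, channels=3,
                                bytes_per_value=1):
    """Bytes per second of raw input frames sent to the device(s).

    Assumes a 640x640x3 uint8 input by default (an assumption, check the
    actual input shape of the HEF being used). Output tensors are ignored.
    """
    return fps * width * height * channels * bytes_per_value


def fraction_of_link(fps, lanes):
    """Input bandwidth as a fraction of the theoretical gen3 link rate."""
    return input_bandwidth_bytes_per_s(fps) / (lanes * PCIE_GEN3_LANE_BYTES_PER_S)


for label, fps in [("yolov8n @ 760 fps", 760), ("yolov8m @ 240 fps", 240)]:
    gb_per_s = input_bandwidth_bytes_per_s(fps) / 1e9
    print(f"{label}: {gb_per_s:.2f} GB/s of input data")
    for lanes in (1, 2, 4):
        print(f"  x{lanes} gen3 link: {fraction_of_link(fps, lanes):.0%} of theoretical")
```

Under these assumptions, yolov8n at 760 fps needs roughly 0.93 GB/s of input traffic in total, which is close to the theoretical rate of a single gen3 lane, whereas yolov8m at 240 fps needs about a third of that. This does not prove where the bottleneck is, since it depends on how many lanes each device actually negotiates and on how frames are distributed across devices.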
I then tested some more HEF models and compared the results with the model zoo performance expectations. The following models all seemed to scale and match the fps performance fairly well:
yolov8m, yolov8m_pose, vit_pose_small, mspn_regnetx_800mf
Next, I removed the devices from the PCIe m.2 switch card to verify that the card was not somehow throttling performance. Two Hailo8 devices, each plugged into a verified x16 PCIe gen3 slot, gave the same result for yolov8n: ~750fps with both --device-count 1 and --device-count 2.
At this point, it seems like there may be a scaling issue with the small yolov8n model. Any comments or suggestions?