Hi,
I’ve just started using the Hailo-8 chip and I don’t know much about the hailort library APIs yet.
First, I started by analyzing vstreams_example.c.
However, shortcut network used in vstreams_example.c doesn’t perform any meaningful image processing and, src_data used as input image is just a random data, not an image.
So, I decided to use yolov7-tiny.hef instead of the shortcut network and use meaningful images instead of random data.
By running hailortcli parse-hef yolov7_tiny.hef, I was able to obtain the following output:
As far as I know, final output’s depth of YOLO network is Bx(C+5) where B is the number of bounding box predicted by each grid and C is the number of class.
the output of yolov7_tiny.hef has depth of 255.
For yolov7_tiny.hef, I guess B=3 and C = 80, which results in B x (C+5) = 255. Is this right?
Parsing yolov7_tiny.hef provide 3 output vstreams which have different height and width(20, 40, 80) but same depth(255). Why does yolov7_tiny.hef have multiple output, and which one do I use to check the result of detection? (and does the width and height correspond to the number of grid?)
In your reply, you mentioned that the FCR means the first channel is sent to the HW. I still don’t fully understand this.
If the result from a (FCR)20x20x255 of output_vstream is sent to the dst_data array from Hailo, does FCR mean that the values stored in the order 1x1x1, 1x2x1, 1x3x1 … 20x20x1, 1x1x2, 1x2x2, 1x3x2 … 20x20x2,
and so on in the dst_data array?
Could you perhaps explain with an example?
In yolov7_tiny when using the Hailo SW, it depends - if you use the Hailo NMS, then the output will be a single output in a shape of (C,B) where C is the number classes the model was trained on and B in the maximum number of bboxes (defined in the optimization step) and it will include the bbox decoding & extraction and NMS.
If you don’t use the Hailo NMS, for yolov7 you’ll have 3 outputs with shape that corresponds to the input resolution of the model and the number of classes. The height and width are different because each of them is used for a different object size - the bigger grid is used for the bigger anchors and stride, the medium for the medium and the small for the small. The number of channels doesn’t change because it’s the same calculation you described in section 1 for all.
There are 3 output because from there in the model start the postprocessing ops that the Hailo SW doesn’t support. You need to implement this ops yourself on the CPU, and this is why we recommend to use the Hailo NMS option when possible.
The order the data enters the device matters, as the device knows to handle specific order (NHWC), so in case the model’s format order is different, some pre-infer would be done by the Hailo SW.
For FCR, it’s described in the HailoRT user guide:
The output would be aligned as regular multi-dimensional array, so it depends on how you iterate over it.