How to interpret raw output

I’m using the C library to run the Hailo-8L on a Raspberry Pi 5. I can split an image into frames and pixels and feed them into the configured .hef file, and I do get output, but I don’t know how to interpret it.

I’m using YOLOv7 from the Model Zoo, which parse-hef reports as:
Output yolov7/yolov5_nms_postprocess FLOAT32, HAILO NMS(number of classes: 80, maximum bounding boxes per class: 80, maximum frame size: 128320)
But I don’t know which values are which. When I run the model, I get output like the snippet below. The first 5 make sense: classification, confidence, and 4 coords. But after all of the zeros there is a classification followed by 7 floating-point numbers, then a possible classification followed by 5 floating-point numbers, and that doesn’t make any sense to me.

Example output:

{ 2e0, 2.9149818e-1, 1.3322696e-1, 1.0043838e0, 8.3363557e-1, 8.901919e-1, 8.9297575e-1, 7.028085e-2, 9.282986e-1, 8.92289e-2, 2.3529352e-1, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 1e0, 7.443513e-1, 1.754624e-1, 8.730015e-1, 2.6385126e-1, 3.3716178e-1, 1e0, 9.2661124e-1, 1.9907206e-3, 9.98977e-1, 1.4516605e-1, 6.3420063e-1, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, 0e0, -3.0316488e-13,

Hey @nicholas.young ,

Let me help explain how to read those YOLOv7 outputs you’re getting from Hailo. The data might look a bit confusing at first, but it’s actually organized in a pretty specific way.

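Your output stream is in what we call the Hailo NMS (Non-Maximum Suppression) format, and it is grouped by class rather than written as one flat list of detections. Here’s what each piece means:

For each of the 80 classes, in class order, you’ll see a group of numbers that follows this pattern:

[bbox_count, then bbox_count boxes of: y_min, x_min, y_max, x_max, score]

So if you see something like {2e0, ...}, that 2 at the start is not a class label. It is the number of boxes found for the first class (index 0), and the ten floats after it are those two boxes, five floats each. For each box:

  • y_min, x_min, y_max, x_max tell you where the object is in the frame
  • score is how confident the model is about this detection

After each class’s boxes, you’ll see either:

  • A count of 1 or more, followed by that many boxes for the next class
  • A 0, meaning that class has no detections (so a long run of zeros is a run of empty classes, not padding)

The class label is never stored; it is implied by the position of the group in the buffer. This also matches the maximum frame size that parse-hef reports: 80 classes × (1 count + 80 boxes × 5 floats) × 4 bytes = 128320 bytes. Anything in the buffer past the last class’s entry is unused and can hold leftover values, which may be what the stray -3.0316488e-13 at the end of your dump is.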
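To make the layout concrete, here is a minimal C sketch of walking such a buffer. It assumes the by-class layout that HailoRT documents for FLOAT32 HAILO NMS output: for each class, one float holding the box count, followed by that many boxes of five floats in the order y_min, x_min, y_max, x_max, score (the field order of hailo_bbox_float32_t in hailort.h). The names detection_t and parse_nms_by_class are made up for this example and are not part of HailoRT.

```c
#include <stddef.h>

/* Hypothetical output record for this sketch. class_id is implied by the
 * position of the group in the buffer; it is not stored in the buffer. */
typedef struct {
    size_t class_id;
    float y_min, x_min, y_max, x_max, score;
} detection_t;

/* Walks an NMS-by-class buffer of buf_len floats and copies up to out_cap
 * detections into out. Returns the number of detections stored.
 * Assumed layout per class: [count, count * (y_min, x_min, y_max, x_max, score)]. */
size_t parse_nms_by_class(const float *buf, size_t buf_len,
                          size_t num_classes, size_t max_boxes_per_class,
                          detection_t *out, size_t out_cap)
{
    size_t pos = 0; /* read position, in floats */
    size_t n = 0;   /* detections stored so far */

    for (size_t cls = 0; cls < num_classes; ++cls) {
        if (pos >= buf_len) break;
        float count_f = buf[pos++];
        /* Stop on a count that is negative, NaN, or above the per-class limit. */
        if (!(count_f >= 0.0f && count_f <= (float)max_boxes_per_class)) break;
        size_t count = (size_t)count_f;

        for (size_t i = 0; i < count; ++i, pos += 5) {
            if (pos + 5 > buf_len) return n; /* truncated buffer */
            if (n == out_cap) continue;      /* out is full: skip, keep walking */
            out[n].class_id = cls;
            out[n].y_min = buf[pos];
            out[n].x_min = buf[pos + 1];
            out[n].y_max = buf[pos + 2];
            out[n].x_max = buf[pos + 3];
            out[n].score = buf[pos + 4];
            ++n;
        }
    }
    return n;
}
```

For the model in this thread you would call it with num_classes = 80 and max_boxes_per_class = 80, the two values parse-hef prints, and pass the output buffer cast to const float *.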

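The coordinates are normalized to your frame size, so they’ll fall roughly between 0 and 1. A value can land slightly outside that range (your sample contains 1.0043838), so clamp before use. Multiply the x values by the frame width and the y values by the frame height to get pixel locations.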
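As a rough illustration of that scaling step, here is a small C sketch. The names pixel_box_t, clamp01 and to_pixels are invented for the example, and the 640×480 frame size in the usage note is only a placeholder for whatever dimensions you are drawing on.

```c
/* Hypothetical pixel-space box for this sketch. */
typedef struct {
    int left, top, right, bottom;
} pixel_box_t;

/* Clamp a normalized coordinate into [0, 1]; the NMS output can
 * overshoot slightly (for example 1.0043838). */
static float clamp01(float v)
{
    if (v < 0.0f) return 0.0f;
    if (v > 1.0f) return 1.0f;
    return v;
}

/* Convert normalized box corners to pixel coordinates:
 * x values scale with the frame width, y values with the frame height. */
pixel_box_t to_pixels(float y_min, float x_min, float y_max, float x_max,
                      int frame_width, int frame_height)
{
    pixel_box_t p;
    p.left   = (int)(clamp01(x_min) * (float)frame_width);
    p.top    = (int)(clamp01(y_min) * (float)frame_height);
    p.right  = (int)(clamp01(x_max) * (float)frame_width);
    p.bottom = (int)(clamp01(y_max) * (float)frame_height);
    return p;
}
```

For example, to_pixels(0.25f, 0.5f, 1.0043838f, 0.75f, 640, 480) gives left 320, top 120, right 480, bottom 480, with the overshooting y_max clamped to the bottom edge of the frame.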

Why are they x_min and x_max instead of “x_left” or “x_right”? Is this because of the NMS step?

I’m also seeing that the y_max is sometimes larger than the y_min? What could cause this?
Person found! Confidence: 0.47459447, 0.15992427:1.0001111, 0.8834089, 0.92156434
Person found! Confidence: 0.46784216, 0.16221726:1.0045106, 0.8850374, 0.9176428
Person found! Confidence: 0.78671217, 0.63345283:0.92201316, 0.73821384, 0.3840052