How to interpret the YOLO outputs?

I’m following the Hailo tutorial notebook DFC_2_Model_Optimization_Tutorial - this ran fine out of the box, but I’ve tried to adapt it for YOLOv11n. The output has shape [80, 5, 100]; based on the documentation, this should be 80 classes, 5 values per detection, and up to 100 detections per class (based on the NMS YAML).

The important bits of code are all included below. If you want to see the notebook, it’s uploaded to Colab; of course, a lot of it won’t run in Colab as-is, since I’m running it locally in a Docker container. Colab was just an easy way to share the notebook.

If I look into the output for a specific class and ignore the many rows of zeros, I get something like this for that one class:

array([[0.39044172, 0.01051593, 0.61553806, 0.24089272, 0.91469216],
       [0.48778102, 0.8696294 , 0.6793776 , 0.9998992 , 0.5478364 ]],
      dtype=float32)
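
(For reference, those two rows were pulled out by slicing one class and dropping the all-zero rows; a minimal sketch, assuming the [80, 5, 100] layout, with out standing in for the raw output of a single image:)

import numpy as np

cls_id = 62
dets = out[cls_id].T                     # (5, 100) -> (100, 5): one row per detection slot
dets = dets[np.any(dets != 0, axis=1)]   # keep only the non-empty slots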

I’m unsure whether I’m just getting garbage predictions or misunderstanding the output format. Is this xywhn + confidence, or something else? For comparison, running model.predict() in Ultralytics (with yolo11n.pt) yields these results for the same image:

model = YOLO('yolo11n.pt')
results = model.predict(sample_dataset[0,:,:,:], imgsz=640, conf=0.2)
# Process results list
for result in results:
    boxes = result.boxes  # Boxes object for bounding box outputs
    masks = result.masks  # Masks object for segmentation masks outputs
    keypoints = result.keypoints  # Keypoints object for pose outputs
    probs = result.probs  # Probs object for classification outputs
    obb = result.obb  # Oriented boxes object for OBB outputs

boxes[boxes.cls==62]

that will yield this output:

ultralytics.engine.results.Boxes object with attributes:
cls: tensor([62., 62.], device='cuda:0')
conf: tensor([0.9115, 0.2777], device='cuda:0')
data: tensor([[6.1889e+00, 2.4986e+02, 1.5438e+02, 3.9446e+02, 9.1155e-01, 6.2000e+01],
        [5.5910e+02, 3.1260e+02, 6.4000e+02, 4.2976e+02, 2.7771e-01, 6.2000e+01]], device='cuda:0')
id: None
is_track: False
orig_shape: (640, 640)
shape: torch.Size([2, 6])
xywh: tensor([[ 80.2820, 322.1584, 148.1863, 144.6057],
        [599.5480, 371.1809,  80.9040, 117.1644]], device='cuda:0')
xywhn: tensor([[0.1254, 0.5034, 0.2315, 0.2259],
        [0.9368, 0.5800, 0.1264, 0.1831]], device='cuda:0')
xyxy: tensor([[  6.1889, 249.8556, 154.3751, 394.4613],
        [559.0960, 312.5987, 640.0000, 429.7631]], device='cuda:0')
xyxyn: tensor([[0.0097, 0.3904, 0.2412, 0.6163],
        [0.8736, 0.4884, 1.0000, 0.6715]], device='cuda:0')

The code I’ve used is as follows (the notebook is here):

import os

import numpy as np
import torch
import torchvision as tv
from PIL import Image

def preproc(image, output_height=640, output_width=640):
    # Resize only; normalization is added inside the model via the model script below
    preprocess = tv.transforms.Compose([
        tv.transforms.Resize((output_height, output_width)),
    ])

    data = np.array(preprocess(image))

    return data

data_batch_size = 1500
images_path = "../data/coco/images/val2017" 
images_list = [img_name for img_name in os.listdir(images_path) if os.path.splitext(img_name)[1] == ".jpg"]
calib_dataset = np.zeros((data_batch_size, 640, 640, 3))
for idx, img_name in enumerate(sorted(images_list)):
    if idx==data_batch_size:
        break
    img = Image.open(os.path.join(images_path, img_name)).convert('RGB')
    img_preproc = preproc(img)
    calib_dataset[idx, :, :, :] = img_preproc

np.save("calib_set.npy", calib_dataset)

The above just sets up the calibration dataset. I used COCO 2017 for this via manual download (using the Ultralytics YAML), downloading only the val2017 set.
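
As a quick sanity check (my addition, not part of the tutorial), the calibration set should end up as unnormalized 0-255 images in NHWC layout, since normalization is added through the model script below:

print(calib_dataset.shape, calib_dataset.dtype)   # (1500, 640, 640, 3), float64 from np.zeros
print(calib_dataset.min(), calib_dataset.max())   # expected to span roughly 0-255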

# Second, we will load our parsed HAR from the Parsing Tutorial
model_name = "yolo11n"
hailo_model_har_name = f"{model_name}_hailo_model.har"
assert os.path.isfile(hailo_model_har_name), "Please provide valid path for HAR file"
runner = ClientRunner(har=hailo_model_har_name)
# By default it uses the hw_arch that is saved on the HAR. For overriding, use the hw_arch flag.
# Now we will create a model script that tells the compiler to add normalization at the beginning
# of the model (that is why we didn't normalize the calibration set;
# otherwise we would have to normalize it before using it)

# this was taken from the hailo github alls script for yolov11n
alls = """
normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv54, sigmoid)
change_output_activation(conv65, sigmoid)
change_output_activation(conv80, sigmoid)
nms_postprocess("./yolo11n_nms_config.json", meta_arch=yolov8, engine=cpu)
allocator_param(width_splitter_defuse=disabled)
"""

# Load the model script to ClientRunner so it will be considered on optimization
runner.load_model_script(alls)

# Call Optimize to perform the optimization process
runner.optimize(calib_dataset)

# Save the result state to a Quantized HAR file
quantized_model_har_path = f"{model_name}_quantized_model.har"
runner.save_har(quantized_model_har_path)

I then take this and run a basic inference to check the output:

sample_dataset = np.zeros((2, 640, 640, 3))
SAMPLE_IMAGE_PATH = "../data/coco/images/val2017/000000000139.jpg"
img = Image.open(SAMPLE_IMAGE_PATH).convert('RGB')
img_preproc = preproc(img)
sample_dataset[0, :, :, :] = img_preproc  # preproc() already returns a (640, 640, 3) numpy array

# Notice that we use the original images, because normalization is IN the model
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    modified_res = runner.infer(ctx, sample_dataset[:1, :, :, :])
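
Before slicing out a single class, it helps to confirm the structure of what infer returns; a small check (my addition), assuming the result comes back as a single array of shape (batch, classes, fields, detections), which is what the indexing below relies on:

print(type(modified_res))
print(np.asarray(modified_res).shape)   # expecting (1, 80, 5, 100)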

If I take a closer look at class_id == 62, the television class, there are two detections:

tv_dets = modified_res[:, 62, :, :].reshape(5,100) # this is classid=62, which is a television
tv_dets.transpose()[:2, :]

"""
this outputs:
array([[0.39044172, 0.01051593, 0.61553806, 0.24089272, 0.91469216],
       [0.48778102, 0.8696294 , 0.6793776 , 0.9998992 , 0.5478364 ]],
      dtype=float32)

compared to the ground truth for this class (62) in the same image:
62 0.127641 0.505153 0.233312 0.2227
62 0.934195 0.583462 0.127109 0.184812
"""

I can’t seem to find any info regarding the output format in the postprocessing docs.

Any suggestions or comments on something I’ve missed?

Since there is no edit functionality here, I’d like to add that if I take the quantized model .har file and feed it into the following:
!hailomz eval yolov11n --har /local/workspace/hailo_virtualenv/lib/python3.10/site-packages/hailo_tutorials/notebooks/yolo11n_quantized_model.har

I’ll get the expected results. My only concern here is that something in the background overrides my .har file and uses a default .har for yolo11n. But if my quantized HAR was actually used, it confirms the output makes sense and I just need additional formatting?

Evaluate annotation type *bbox*
DONE (t=17.04s).
Accumulating evaluation results...
DONE (t=3.40s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.390
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.547
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.424
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.207
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.427
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.571
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.320
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.525
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.566
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.332
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.630
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.770
<Hailo Model Zoo INFO> Done 5000 images AP=39.024 AP50=54.679

Hi @natsayin_nahin

Here is how to interpret the output. Please let me know if you have questions:

This is your output from Hailo after tv_dets.transpose()[:2, :] (with the confidence column dropped):

x = np.array([[0.39044172, 0.01051593, 0.61553806, 0.24089272],
              [0.48778102, 0.8696294 , 0.6793776 , 0.9998992 ]])

Multiply by the height and width of the model input (640):

x * 640
array([[249.8827008,   6.7301952, 393.9443584, 154.1713408],
       [312.1798528, 556.562816 , 434.801664 , 639.935488 ]])

The x/y pairs are swapped; in other words, the rows come out as [y_min, x_min, y_max, x_max]. For comparison, below are your PyTorch xyxy results:

xyxy: tensor([[  6.1889, 249.8556, 154.3751, 394.4613],
        [559.0960, 312.5987, 640.0000, 429.7631]], device='cuda:0')

Even after this, you still need to map the boxes back to the original image size: the original image is 640x426, but it was resized to 640x640 for the model input. You can then match the result against the ground truth, which is given with respect to the original image.
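
Putting those steps together, a minimal sketch of the conversion could look like this (my code, not an official Hailo utility; it assumes the NMS output rows are [y_min, x_min, y_max, x_max, score] normalized to the 640x640 model input, and that the original image is 640x426 as noted above):

import numpy as np

def hailo_dets_to_xyxy(dets, orig_w, orig_h):
    """Convert rows of [y1, x1, y2, x2, score] (normalized to the model input)
    into [x1, y1, x2, y2, score] in original-image pixel coordinates."""
    y1, x1, y2, x2, score = dets.T
    # The stretch-resize to 640x640 leaves normalized coordinates unchanged,
    # so scaling by the original width/height maps straight back to the source image.
    return np.stack([x1 * orig_w, y1 * orig_h, x2 * orig_w, y2 * orig_h, score], axis=1)

tv_xyxy = hailo_dets_to_xyxy(tv_dets.transpose()[:2, :], orig_w=640, orig_h=426)

The resulting boxes are in the original image's pixel space, so they can be compared directly with the COCO annotations.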