Retinaface Mobilenet V1 output

Hello everyone.

Based on one of your examples, I was able to run face detection (without GStreamer) with retinaface_mobilenet_v1, lightface_slim, scrfd_500m, scrfd_2.5g or scrfd_10g.

However I’m confused by the output.
For example retinaface_mobilenet_v1:

Architecture HEF was compiled for: HAILO8L
Network group name: retinaface_mobilenet_v1, Multi Context - Number of contexts: 3
    Network name: retinaface_mobilenet_v1/retinaface_mobilenet_v1
        VStream infos:
            Input  retinaface_mobilenet_v1/input_layer1 UINT8, NHWC(736x1280x3)
            Output retinaface_mobilenet_v1/conv41 UINT8, NHWC(92x160x8)
            Output retinaface_mobilenet_v1/conv42 UINT8, NHWC(92x160x4)
            Output retinaface_mobilenet_v1/conv43 UINT8, FCR(92x160x20)
            Output retinaface_mobilenet_v1/conv32 UINT8, NHWC(46x80x8)
            Output retinaface_mobilenet_v1/conv33 UINT8, NHWC(46x80x4)
            Output retinaface_mobilenet_v1/conv34 UINT8, FCR(46x80x20)
            Output retinaface_mobilenet_v1/conv23 UINT8, NHWC(23x40x8)
            Output retinaface_mobilenet_v1/conv24 UINT8, NHWC(23x40x4)
            Output retinaface_mobilenet_v1/conv25 UINT8, FCR(23x40x20)

I guess “retinaface_mobilenet_v1/conv25” is the final output?
What is this shape 23, 40, 20?
It’s not BBoxes or anything I’ve seen before.

Also the numbers do not make sense to me. The input shape is 736, 1280, 3 and one output example looks like this:

[[124 130 128 ... 130 126 132]
  [124 127 126 ... 124 123 125]
  [125 127 127 ... 127 124 128]
  ...
  [109 129 117 ... 143 115 143]
  [112 127 119 ... 133 117 135]
  [118 122 122 ... 128 123 128]]

My face was in the middle of the image, so my guess is those numbers are not y,x positions?

Thank you for any help!

So after some more digging:

            lNetworkGroups = self._hailoVDevice.configure(hailoHEF, dHailoCfgParams)

            self._hailoNetGrp = lNetworkGroups[0]  # type: pkHailoPlPY.ConfiguredNetwork
            self._hailoNetParams = self._hailoNetGrp.create_params()

            # lInfos:
            # [0]: 'direction', 'format', 'name', 'network_name', 'nms_shape', 'quant_info', 'shape'
            # [0].shape: tuple
            # [0].format: 'equals' method, 'flags' FormatFlags, 'order' FormatOrder, 'type' FormatType
            # [0].quant_info: 'limvals_max' float, 'limvals_min' float, 'qp_scale' float, 'qp_zp' float
            # pkHailoPl.FormatType: 'AUTO', 'FLOAT32', 'UINT16', 'UINT8'

            sLastOutputName = self._hailoNetGrp.get_sorted_output_names()[-1]

Is my assumption correct that “get_sorted_output_names()[-1]” returns the layer that should hold the detection information?

So the “correct” last layer for “scrfd_500m” looks like:

output: scrfd_500m/conv40 (20, 20, 20) FormatType.UINT8 ; 0.04532748833298683 119.0
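If I assume the two trailing numbers are the layer’s qp_scale and qp_zp from quant_info, then dequantizing the raw UINT8 values would look something like this (just my guess at the usual affine scheme, not verified against the docs):

```python
import numpy as np

def dequantize(raw, qp_scale, qp_zp):
    # Assumed affine dequantization: float = (raw - zero_point) * scale
    return (raw.astype(np.float32) - qp_zp) * qp_scale

# e.g. dequantize(raw_out, info.quant_info.qp_scale, info.quant_info.qp_zp)
```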

If I do more guessing :) I would say (20, 20, 20) means up to 20 faces can be detected?
If yes, what’s inside each (20, 20)?

Hey @chrime,

Glad to hear you’ve got face detection running on multiple models! Let’s clarify the confusion around the output layers and shapes for the retinaface_mobilenet_v1 model.

  1. Output Layer Interpretation:
    The layer ‘retinaface_mobilenet_v1/conv25’ with shape (23, 40, 20) is an output feature map, not the final bounding boxes or keypoints. In face detection models like RetinaFace, outputs typically represent:

    • Location (bbox) predictions
    • Face detection confidence scores
    • Landmark predictions (eyes, nose, mouth, etc.)

    The (23, 40, 20) shape can be seen as a grid: 23x40 is a downscaled spatial map of your original input image, and 20 likely combines information for multiple anchor boxes and features.

  2. Post-Processing:
    These raw outputs need post-processing, including:

    • Decoding bounding box coordinates
    • Applying Non-Maximum Suppression (NMS) to filter overlapping detections
    • Interpreting landmark and confidence scores

    The numbers you see are raw feature map values, not direct x, y coordinates for bounding boxes. You’ll need to apply specific post-processing steps (usually found in model documentation or example code) to get final bounding boxes and landmarks.

  3. Input vs Output Shape:
    Your input shape (736, 1280, 3) is processed through multiple network layers, which downscale spatial dimensions and increase feature channel depth, resulting in outputs like (23, 40, 20). This downscaling is common in convolutional networks for efficiency and to capture larger features.
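As a rough sketch of that arithmetic (the exact anchor layout is an assumption here and should be checked against the model zoo post-processing code): the grid sizes are simply the input size divided by each branch’s stride, and the channel counts plausibly split as 2 anchors per cell × 4 bbox values (8), × 2 class scores (4), and × 10 landmark coordinates (20):

```python
import numpy as np

H, W = 736, 1280          # model input size from the HEF
strides = (8, 16, 32)     # typical RetinaFace feature-map strides

# Each output branch is the input downscaled by its stride:
grids = [(H // s, W // s) for s in strides]
print(grids)  # [(92, 160), (46, 80), (23, 40)] -- matches the HEF output shapes

# Assumed per-cell layout for the stride-32 landmark output (conv25):
raw = np.zeros((23, 40, 20), dtype=np.uint8)      # placeholder for the real output
per_anchor = raw.reshape(23, 40, 2, 10)           # 2 anchors x 10 landmark values
per_point  = per_anchor.reshape(23, 40, 2, 5, 2)  # 5 landmarks x (x, y) per anchor
```

After a reshape like this you would still need to dequantize, add the anchor-center offsets, and run NMS — which is what the model zoo post-processing classes are there for.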

Let me know if you need help with post-processing or want more details on the specific outputs!

Best regards,
Omri

I found several classes for face detection post-processing in the GitHub repository hailo_model_zoo: hailo_model_zoo\hailo_model_zoo\core\postprocessing

For example: hailo_model_zoo.core.postprocessing.face_detection.scrfd.SCRFDPostProc

The method “tf_postproc” looks like it should do all the post-processing?

Thank you for your help

So my current (simplified) flow is:

hef = hailo_platform.HEF('data/models/hailo8l/face_detection/scrfd_500m.hef')

vd_prms = hailo_platform.VDevice.create_params()
v_device = hailo_platform.VDevice(vd_prms)

cfg_prms = hailo_platform.ConfigureParams.create_from_hef(hef=hef, interface=hailo_platform.HailoStreamInterface.PCIe)

net_grps = v_device.configure(hef, cfg_prms)
net_grp = net_grps[0]
net_grp_prms = net_grp.create_params()

vstr_main_input = net_grp.get_input_vstream_infos()[0]

vstr_main_output_name = net_grp.get_sorted_output_names()[-1]

vstr_main_output = None
vstr_outputs = net_grp.get_output_vstream_infos()
for vstr_info in vstr_outputs:
    if vstr_main_output_name in vstr_info.name:
        vstr_main_output = vstr_info
        break
        
vstr_prms_input = hailo_platform.InputVStreamParams.make(net_grp)
vstr_prms_output = hailo_platform.OutputVStreamParams.make(net_grp)

... some capture and resize stuff ...

input_data = { vstr_main_input.name: numpy.expand_dims(np_image_scaled, axis=0) }  # keyed by the *input* vstream name

with hailo_platform.InferVStreams(net_grp, vstr_prms_input, vstr_prms_output) as vstr_infer:

    # returns a dict with multiple output layer names as key
    # also it's batch/frame based
    results = vstr_infer.infer(input_data)
    
    # should be correct 'main' output for each HEF model?
    result_main = results[vstr_main_output_name]
    
    # load YAML config for model, in this example: hailo_model_zoo/cfg/base/scrfd.yaml
    ... some YAML magic ...

    post_proc = SCRFDPostProc(np_image_scaled.shape, anchors=yaml_data['postprocessing']['anchors'])

    # infer 1. batch/frame
    post_proc.tf_postproc(result_main[0])

However I get an exception in the last line: face detection failed: All branches must have the same number of output nodes

So the number seems to come from the YAML config:

  anchors:
    steps:
    - 8
    - 16
    - 32

However, none of the output layers has 8, 16, or 32 in its dimensions.

So … tf_postproc is not what I need?