Based on one of your examples, I was able to run face detection (without GStreamer) with retinaface_mobilenet_v1, lightface_slim, scrfd_500m, scrfd_2.5g, and scrfd_10g.
However, I'm confused by the output.
For example, retinaface_mobilenet_v1:
Glad to hear you’ve got face detection running on multiple models! Let’s clarify the confusion around the output layers and shapes for the retinaface_mobilenet_v1 model.
Output Layer Interpretation:
The layer ‘retinaface_mobilenet_v1/conv25’ with shape (23, 40, 20) is an output feature map, not the final bounding boxes or keypoints. In face detection models like RetinaFace, outputs typically represent:
Location (bbox) predictions
Face detection confidence scores
Landmark predictions (eyes, nose, mouth, etc.)
The (23, 40, 20) shape can be read as a grid: 23×40 is a downscaled spatial map of your original input image (each cell corresponds to a patch of the input), and the 20 channels most likely pack per-anchor predictions, for example two anchors times ten values each; the exact layout depends on the model configuration.
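To make that concrete, here is a small NumPy sketch of how such a grid can be read. The stride of 32 and the split into 2 anchors with 10 values each are assumptions chosen to fit the numbers; the actual channel layout and anchor count depend on the compiled model, so treat them as illustrative:

```python
import numpy as np

# Hypothetical reading of a (23, 40, 20) output as a stride-32 grid with
# 2 anchors per cell and 10 values per anchor. The real channel layout
# depends on the model; check the model zoo config before relying on it.
STRIDE = 32
NUM_ANCHORS = 2
VALUES_PER_ANCHOR = 10  # e.g., 5 landmarks x (x, y)

feature_map = np.random.rand(23, 40, 20).astype(np.float32)  # stand-in for real output

# Reshape so each grid cell holds NUM_ANCHORS predictions of VALUES_PER_ANCHOR each.
per_anchor = feature_map.reshape(23, 40, NUM_ANCHORS, VALUES_PER_ANCHOR)

# A cell at (row, col) corresponds to a STRIDE x STRIDE patch of the input image.
row, col = 11, 20
x_center = (col + 0.5) * STRIDE  # ~656 px in the 1280-wide input
y_center = (row + 0.5) * STRIDE  # ~368 px in the 736-high input
print(f"cell ({row}, {col}) covers input pixels around ({x_center:.0f}, {y_center:.0f})")
print("anchor 0 raw values:", per_anchor[row, col, 0])
```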
Post-Processing:
These raw outputs need post-processing, including:
Decoding bounding box coordinates
Applying Non-Maximum Suppression (NMS) to filter overlapping detections
Interpreting landmark and confidence scores
The numbers you see are raw feature map values, not direct x, y coordinates for bounding boxes. You’ll need to apply specific post-processing steps (usually found in model documentation or example code) to get final bounding boxes and landmarks.
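The decoding step is model-specific, but NMS itself is generic. Here is a minimal NumPy sketch of the suppression step, purely to illustrate the idea (it is not the model zoo's implementation):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.4):
    """Plain NumPy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box too strongly.
        order = order[1:][iou <= iou_threshold]
    return keep

# Example: box 1 heavily overlaps box 0 and gets suppressed.
boxes = np.array([[100, 100, 200, 200],
                  [105, 105, 205, 205],
                  [400, 300, 480, 380]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms(boxes, scores))  # -> [0, 2]
```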
Input vs Output Shape:
Your input shape (736, 1280, 3) is processed through multiple network layers, which downscale the spatial dimensions and increase the feature channel depth, resulting in outputs like (23, 40, 20); here the downscaling factor is 32 (736 / 32 = 23, 1280 / 32 = 40). This downscaling is common in convolutional networks for efficiency and to capture larger features.
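You can check this arithmetic directly. Assuming the usual RetinaFace output strides of 8, 16, and 32 (an assumption based on the standard architecture, not on your compiled model), only stride 32 reproduces the (23, 40) grid for a (736, 1280) input:

```python
# Spatial size of each output map is the input size divided by its stride.
for stride in (8, 16, 32):
    print(stride, 736 // stride, 1280 // stride)
# 8  -> 92 x 160
# 16 -> 46 x 80
# 32 -> 23 x 40   <- matches the (23, 40, 20) layer
```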
Let me know if you need help with post-processing or want more details on the specific outputs!
I found several classes for face detection post-processing in the GitHub repository hailo_model_zoo: hailo_model_zoo/hailo_model_zoo/core/postprocessing