Hailo Multi-Detection with 2 cameras

Hi everyone,

I tried the basic detection.py pipeline from hailo-rpi5-examples with 1 USB camera detected on /dev/video4; after changing the video source in hailo_rpi_common.py it worked.

Now I would like to use 2 cameras, connected to /dev/video4 and /dev/video10, but the code fails because the basic example is written for a single video source.
So I modified gstreamer_app.py and set the pipeline string to:
pipeline_string = (
"v4l2src device=/dev/video4 ! videoconvert ! videoscale ! video/x-raw,width=640,height=480 ! queue ! identity name=identity_callback ! mixer.sink_0 "
"v4l2src device=/dev/video10 ! videoconvert ! videoscale ! video/x-raw,width=640,height=480 ! queue ! mixer.sink_1 "
"compositor name=mixer sink_0::xpos=0 sink_1::xpos=640 ! videoconvert ! fpsdisplaysink name=hailo_display"
)

detection.py now runs with my 2 cameras, but without any object detections from Hailo, just a video stream.
(Detection only works when the pipeline is set to self.get_pipeline_string(), but then only with one camera.)

Does anyone know how to solve this with 2 cameras and multi-detection?
I also tested the TAPPAS multi_camera code, but I'm not familiar enough with .sh scripts to use it in my project. Thanks.

Hi @Theo_Vioux
Welcome to the Hailo community. You can try using our PySDK, which enables such scenarios: Running multiple models independently

Hi,

Thanks for your reply.
The final goal of my project is to use 4 cameras (in the long term; I'd first like to make it work with 2 cameras on 1 RPi 5), possibly spread over different RPi 5s (2x2, using the USB 3.0 ports), to do accurate 3D reconstruction and pose estimation for depth measurement between different points in space (my cameras are RGB-Depth).

So I have a real-time display constraint, which is why I use Hailo to speed up inference, but I also need to integrate a custom pipeline for depth and a re-ID model that memorizes detected people with labels (there are 4 to memorize). Does PySDK allow such a scenario? If not, what do you recommend?

Hi @Theo_Vioux
PySDK does allow such scenarios. Let me know if you need help with developing such a pipeline.

Hello,
I tested the example codes and they are really good, thank you very much for your work on them.
I managed to open the 2 streams of my RGB-D cameras smoothly with 2 different models, though with one model on one camera and the other model on the other.
I was now wondering whether it's possible to change the FPS of the video streams, which are displayed at 30 fps; I'd like to test 60, and possibly change the display resolution as well.
Also, is it possible to use 2 AI models per camera? For example, a model doing multi-person pose estimation by placing joints, plus a person re-identification model (in case of blind spots), both per camera in real time. Do you have example code for this, or something similar?

Otherwise, if this is too resource-intensive for real time and data pre-processing is required, do you have any ideas?

I was now wondering whether it's possible to change the FPS of the video streams, which are displayed at 30 fps; I'd like to test 60, and possibly change the display resolution as well.

The FPS and resolution of the source are controlled from outside. However, please note that at a certain point you may be bottlenecked by the CPU (video decode, resize, display).
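For example, if a stream is opened with OpenCV, a minimal sketch of requesting a different capture mode could look like the following (the values are only requests and the driver falls back to the nearest mode it actually supports; the device path is just an example):

import cv2

# Request 60 fps and a larger resolution from the capture device itself.
cap = cv2.VideoCapture("/dev/video4", cv2.CAP_V4L2)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
cap.set(cv2.CAP_PROP_FPS, 60)

# Check what the driver actually granted.
print("fps:", cap.get(cv2.CAP_PROP_FPS),
      "size:", cap.get(cv2.CAP_PROP_FRAME_WIDTH), "x", cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()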

Also, is it possible to use 2 AI models per camera? For example, a model doing multi-person pose estimation by placing joints, plus a person re-identification model (in case of blind spots), both per camera in real time. Do you have example code for this, or something similar?

Yes, it is possible to use 2 AI models per camera. They can run in series or in parallel. You can see some guides we published on this topic: A Comprehensive Guide to Building a Face Recognition System, A Comprehensive Guide to Building a License Plate Recognition (LPR) Systems
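As a minimal sketch of the parallel case (both models see the same full frame; the model names are just examples from our zoo, and your host/zoo settings may differ):

import degirum as dg

pose_model = dg.load_model(
    model_name="yolov8n_relu6_coco_pose--640x640_quant_hailort_hailo8_1",
    inference_host_address="@local",
    zoo_url="degirum/hailo",
    token="",
)
face_model = dg.load_model(
    model_name="yolov8n_relu6_face--640x640_quant_hailort_hailo8_1",
    inference_host_address="@local",
    zoo_url="degirum/hailo",
    token="",
)

def process(frame):
    # Both models run on the same full frame ("parallel"); a "series"
    # variant would instead crop each detected person and feed the crops
    # to the second model, as in the face recognition / LPR guides.
    pose_result = pose_model.predict(frame)
    face_result = face_model.predict(frame)
    return pose_result, face_result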

Hi,

I want to use my own pipeline for my cameras (Intel RealSense D435) to get depth measurements.
This is done with pyrealsense2, which provides the pipeline for the depth stream.
So I tried the code below (and many variants), but it fails at runtime with the following error:


for detection in result.predictions:  # Retrieve detected objects
                     ^^^^^^^^^^^^^^^^^^
AttributeError: 'DetectionResults' object has no attribute 'predictions'

Here is my code:

import degirum as dg
import degirum_tools
import degirum_tools.streams as dgstreams
import pyrealsense2 as rs
import numpy as np
import cv2

# Serial numbers of D435 cameras
serial_numbers = ['317422075525', '335622071790']

inference_host_address = "@local"
zoo_url = 'degirum/hailo'
token = ''
device_type = ['HAILORT/HAILO8']

# AI model configurations
configurations = [
    {
        "model_name": "yolov8n_relu6_coco_pose--640x640_quant_hailort_hailo8_1",
        "display_name": "Traffic Camera",
    },
    {
        "model_name": "yolov8n_relu6_face--640x640_quant_hailort_hailo8_1",
        "display_name": "Webcam Feed",
    },
]

# Wrapper class for RealSense with a cv2.VideoCapture compatible interface
class RealSenseWrapper:
    def __init__(self, serial):
        self.pipeline = rs.pipeline()
        config = rs.config()
        config.enable_device(serial)
        config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
        self.pipeline.start(config)

    def read(self):
        frames = self.pipeline.wait_for_frames()
        color_frame = frames.get_color_frame()
        if not color_frame:
            return False, None
        img = np.asanyarray(color_frame.get_data())
        return True, img  # Simulates cv2.VideoCapture().read()

    def release(self):
        self.pipeline.stop()

# Load models
models = [
    dg.load_model(
        model_name=cfg["model_name"],
        inference_host_address=inference_host_address,
        zoo_url=zoo_url,
        token=token,
        device_type=device_type
    )
    for cfg in configurations
]

# Initialize RealSense cameras
cameras = [RealSenseWrapper(serial) for serial in serial_numbers]

# Manual capture loop (instead of dgstreams.VideoSourceGizmo)
while True:
    for i, camera in enumerate(cameras):
        ret, frame = camera.read()
        if not ret:
            print(f"Error: No image for camera {serial_numbers[i]}")
            continue

        # Apply inference model
        result = models[i].predict(frame)

        # Draw detections on the image
        if result:  # Check if result contains detections
            print(result._inference_results)

            for detection in result.predictions:  # Retrieve detected objects
                x1, y1, x2, y2 = map(int, detection.bbox)  # Convert to integers
                label = detection.label
                conf = detection.confidence

                # Draw the bounding box on the image
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        
                # Add the label and confidence score
                cv2.putText(frame, f"{label} {conf:.2f}", (x1, y1 - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

        frame_result = frame  # Now, frame_result contains the annotated image

        # Display results
        cv2.imshow(f"Camera {i+1}", frame_result)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

#Release resources
for camera in cameras:
    camera.release()
cv2.destroyAllWindows()

I had to delete the following code from your example to be able to use my pipeline:

# define gizmos
sources = [dgstreams.VideoSourceGizmo(cfg["source"]) for cfg in configurations]
detectors = [dgstreams.AiSimpleGizmo(model) for model in models]
display = dgstreams.VideoDisplayGizmo(
    [cfg["display_name"] for cfg in configurations], show_ai_overlay=True, show_fps=True
)

# create pipeline
pipeline = (
    (source >> detector for source, detector in zip(sources, detectors)),
    (detector >> display[di] for di, detector in enumerate(detectors)),
)

# start composition
dgstreams.Composition(*pipeline).start()

Do you have any clue?

Hi @Theo_Vioux
The detection results object does not have an attribute called predictions. You can simply access the results as result.results, which is a list of dictionaries. Also, result.image_overlay gives you the image with bounding boxes and labels already drawn. See Running AI Model Inference | DeGirum Docs for detailed documentation. You can also see DeGirum/hailo_examples: DeGirum PySDK with Hailo AI Accelerators for usage examples.
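As a rough sketch, the drawing part of your loop could be reduced to something like the following (the dictionary keys below are the typical ones for detection models and should be checked against what your model actually returns):

result = models[i].predict(frame)

for det in result.results:
    bbox = det.get("bbox")      # [x1, y1, x2, y2]
    label = det.get("label")
    score = det.get("score")
    print(label, score, bbox)

# Or simply let PySDK draw the boxes and labels for you:
cv2.imshow(f"Camera {i+1}", result.image_overlay)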

Hello,

Thanks, it works fine with my own pipeline now.
However, with the cloud, the video streams can take between 30 seconds and 1 minute to launch, so I'd like to switch to local inference.
For this I would like to use my own models, such as yolov8m_pose.hef, or re-identification models that are not available in the model zoo in the form I need.
I understand from DeGirum that a model needs one .hef and two .json files to work with your tools, but some specific models don't come with a JSON, so it has to be written by hand.

Do you have a guide for this that could be applied to any model?

Hi @Theo_Vioux
The yolov8_pose model is in fact available in our zoo, so you can use its JSON as a template. For re-ID you can adapt the JSON from our face recognition model. Let me know if you need help making the JSONs.
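For switching to local, a sketch of what loading from a local folder could look like (assuming zoo_url is pointed at a directory containing the unzipped model folder with its .hef, model .json and labels .json; the path below is hypothetical):

import degirum as dg

local_zoo_path = "/home/pi/models"  # hypothetical folder containing unzipped model(s)
model = dg.load_model(
    model_name="yolov8n_relu6_coco_pose--640x640_quant_hailort_hailo8_1",
    inference_host_address="@local",  # run inference on the local Hailo device
    zoo_url=local_zoo_path,           # read the model files from disk
    token="",
)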

Hi, I’m going to rely on facial recognition for now, thanks.

Which of the following 5 models is the best for real-time accuracy? (And, ideally, not too demanding either, since with several cameras the video streams need to stay fluid.)

# Specify the model name 
face_det_model_name = "scrfd_10g--640x640_quant_hailort_hailo8_1"
# face_det_model_name = "scrfd_2.5g--640x640_quant_hailort_hailo8l_1"
# face_det_model_name = "scrfd_500m--640x640_quant_hailort_hailo8l_1"
# face_det_model_name = "yolov8n_relu6_widerface_kpts--640x640_quant_hailort_hailo8l_1"
# face_det_model_name = "retinaface_mobilenet--736x1280_quant_hailort_hailo8l_1"


Regarding the local download of the model, I saw that the zip comes with a Python file. What is it used for exactly? And between this file, the model .json, and the labels .json, which one(s) should be modified to optimize the model's performance and adjust its parameters?

Hi @Theo_Vioux
The Python file accompanying the model is the postprocessor code. The Hailo accelerator runs the compute-intensive part of the model, but the conversion of the model output into a human-readable format (bounding boxes, labels, and scores) is done in the postprocessor. Of the five models, we use the yolov8n version for performance (speed) reasons.
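Note that many postprocessor parameters can also be overridden on the loaded model object at runtime rather than by editing the .json files. A small sketch, reusing the face_det_model_name from your snippet (see the Running AI Model Inference docs for the full list of model properties):

import degirum as dg

model = dg.load_model(
    model_name=face_det_model_name,
    inference_host_address="@local",
    zoo_url="degirum/hailo",
    token="",
)
model.output_confidence_threshold = 0.5  # drop low-confidence detections at runtime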

I took photos of people (for the database) from the video stream, and they are recognized well; it works very nicely. However, what would interest me more is recognizing several people by face without going through a database that maps a person to a name: if a person is detected with certain characteristics, they would be assigned the label "unknown 1"; if another person is detected with different characteristics, they would reliably become "unknown 2"; and so on, without ever enrolling photos of people.

Hi @Theo_Vioux
Glad to hear face recognition is working well. Regarding your idea: from one frame to the next, how would you know it is the same "unknown 1"? You can build a dynamic database as you go along, but you will encounter a lot of challenges (most of which are not ML-model related). For example, you can enable tracking; however, trackers are not 100% accurate, so you will occasionally make mistakes due to occlusions. A lot of optimizations are possible depending on the use case, but there is no universal solution we can provide that would satisfy all needs. Solutions also depend on how accurate you want the final system to be.

Yes, indeed, I see these constraints too, but what I'm after is a way to differentiate people while recognizing them with a consistent label and reasonable accuracy.
Are there any reliable models for this? If not, it doesn't matter; I'll stick with the database, which already works very well.

Hi @Theo_Vioux
You can use the same model for this purpose: a face embeddings model can be thought of as a re-ID model that extracts features, and these features can be used in tracking. We experimented with such pipelines, but qualitatively they do not look good. I do not know what your final application is or what accuracy you need, but we can give you some pointers if you want to try.
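If you do want to experiment, a very simplified sketch of the "unknown N" idea is to keep a dynamic gallery of embeddings and assign a new label whenever no stored embedding is close enough (the similarity threshold below is made up and would need tuning for the embedding model you use):

import numpy as np

SIMILARITY_THRESHOLD = 0.6
gallery = []  # list of (label, embedding) pairs, built dynamically

def assign_label(embedding: np.ndarray) -> str:
    # Normalize so the dot product is a cosine similarity.
    embedding = embedding / np.linalg.norm(embedding)
    best_label, best_sim = None, -1.0
    for label, ref in gallery:
        sim = float(np.dot(embedding, ref))
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim >= SIMILARITY_THRESHOLD:
        return best_label
    # Nothing close enough: register a new anonymous identity.
    new_label = f"unknown {len(gallery) + 1}"
    gallery.append((new_label, embedding))
    return new_label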

Hi,

For my project, I need to do 3D reconstruction from RGB-D cameras, combined with AI models that are very accurate in distance estimation, and multi-person pose estimation whose joints are reconstructed in a 3D scene in real time. I then want to measure distances (e.g. like https://www.youtube.com/watch?v=az22inKXK7g ), ideally with an error of no more than 2% on the distance between points of interest and a fixed point (a fictitious radioactive source). The goal is to track the dose a person receives: since dose rate is inversely proportional to the distance squared, an accurate distance is crucial.
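For reference, the inverse-square relation I'm relying on, with illustrative numbers showing that a 2% distance error already becomes roughly a 4% dose error:

def dose_rate(source_strength: float, distance_m: float) -> float:
    # Dose rate falls off with the square of the distance to the source.
    return source_strength / distance_m**2

d, err = 2.0, 0.02  # true distance (m) and 2% distance error
print(dose_rate(1.0, d) / dose_rate(1.0, d * (1 + err)))  # ~1.04, i.e. ~4% dose error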

To achieve this, each person will have a label and a dose tracker. The challenge will also be to use 4 cameras, i.e. 2 RPi 5s, and to send each person's matrix of spatial coordinates to another computer (so I'll have to see which communication protocol is best) in order to obtain several real-time dose curves, one per person. The computer will also control when to start recording camera data from RPi 5 no. 1 and RPi 5 no. 2.
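As a first idea for the transport between the Pis and the computer (just a sketch using the standard library; ZeroMQ, MQTT or gRPC would be equally valid candidates, and the host/port are placeholders):

import socket
import numpy as np

def send_coordinates(host: str, port: int, person_id: int, coords: np.ndarray) -> None:
    # Send one person's joint coordinates as a small fixed-layout message:
    # a 3-int header (person id, rows, cols) followed by float32 data.
    header = np.array([person_id, coords.shape[0], coords.shape[1]], dtype=np.int32).tobytes()
    payload = coords.astype(np.float32).tobytes()
    with socket.create_connection((host, port)) as sock:
        sock.sendall(header + payload)

# Example: 17 joints with (x, y, z) per joint for person 1
# send_coordinates("192.168.1.50", 5005, 1, np.zeros((17, 3), dtype=np.float32))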