Is the video we see on the display the same image that the Hailo module receives for inference? Specifically, I’d like to confirm that the video format being fed to the model is correct.
If I’m right that the model processes 640×640 input, does the pipeline downscale a 1920×1080 frame to 640×640, or can it tile/segment the image into smaller blocks and process those instead?
In many cases, the image you see is different from the one used by the model. Many pipelines involve video conversions. A common difference is that models often require square inputs, while most video is in a landscape format (e.g., 4:3 or 16:9).
That depends on the use case. Both options are valid.
Scaling: In most cases, the video is scaled to match the model’s input size. This approach works well when objects are medium to large within the frame. Scaling can even use different factors for height and width, and during training, the model can learn to detect objects that appear stretched or squished in one dimension.
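To make the scaling concrete, here is a minimal sketch (assuming an OpenCV preprocessing step purely for illustration; the actual pipeline performs this conversion internally):

```python
import cv2

# Illustration only: a 1920x1080 frame resized straight to a 640x640 model
# input. Width shrinks by 640/1920 ~ 0.33 and height by 640/1080 ~ 0.59,
# so objects come out horizontally "squished" relative to the original.
frame = cv2.imread("frame_1920x1080.jpg")       # any BGR frame from the camera
model_input = cv2.resize(frame, (640, 640), interpolation=cv2.INTER_LINEAR)
print(model_input.shape)                        # (640, 640, 3)
```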
Tiling: Another option is to tile the image. This method is useful for detecting small objects, such as those farther away from the camera. You can find more details in the older Tappas tiling example: GitHub - Tappas - Tiling
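If it helps, here is a rough sketch of the idea behind tiling (the Tappas example implements this inside the GStreamer pipeline; the tile size and overlap below are illustrative values, not the example's defaults):

```python
import numpy as np

def make_tiles(frame, tile=640, overlap=64):
    """Cut a full-resolution frame into overlapping tile x tile crops.
    Returns (crop, x_offset, y_offset) tuples so detections can later be
    mapped back to full-frame coordinates. Assumes the frame is at least
    `tile` pixels in each dimension."""
    h, w = frame.shape[:2]
    step = tile - overlap
    tiles = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            # clamp the last row/column so every tile stays inside the frame
            y0, x0 = min(y, h - tile), min(x, w - tile)
            tiles.append((frame[y0:y0 + tile, x0:x0 + tile], x0, y0))
    return tiles

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # stand-in for a camera frame
print(len(make_tiles(frame)))                        # 8 tiles with these settings
```

Each tile goes through the same 640×640 model at full resolution, so small objects keep far more pixels than they would after scaling the whole frame down.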
Yes, we have cases where we need to detect small animals at a certain distance from the camera, so I think tiling will be better for us.
So, is there a reason why tiling is not used as the default? If multiple tiles can be processed at the same time, could you also cover the entire frame that way? Is this possible?
For tiling, do we need to train a model? Also, how can I apply this to my existing Raspberry Pi example projects?
Detecting small objects isn’t always needed. For example, a car can be recognized even in a tiny 16×16 image, but reading its license plate requires much higher resolution. A common approach is to downscale the full image to detect larger cars, then crop the high-resolution region containing the car and plate and use a second and third model to find the license plate and then decode it.
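As a rough sketch of that cascade (the three detect/read functions below are hypothetical placeholders, not a real Hailo API; the point is only the flow of detecting on the small frame and then cropping from the original):

```python
import cv2

def run_cascade(frame_hd, detect_cars, detect_plate, read_plate):
    """frame_hd: full-resolution frame; the three callables are placeholder
    models: car detector, plate detector, plate OCR."""
    small = cv2.resize(frame_hd, (640, 640))            # model 1 sees the downscaled frame
    sx, sy = frame_hd.shape[1] / 640, frame_hd.shape[0] / 640
    plates = []
    for (x0, y0, x1, y1) in detect_cars(small):
        # map the box back to full resolution and crop the original pixels
        car = frame_hd[int(y0 * sy):int(y1 * sy), int(x0 * sx):int(x1 * sx)]
        plate_box = detect_plate(car)                    # model 2 searches inside the crop
        if plate_box is not None:
            px0, py0, px1, py1 = plate_box
            plates.append(read_plate(car[py0:py1, px0:px1]))  # model 3 decodes the characters
    return plates
```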
The tiling itself is just cutting the image into smaller tiles. In many cases you can use the models as they are.
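Continuing the make_tiles() sketch above, the only extra work is shifting each tile's detections back into full-frame coordinates; the 640×640 model itself runs unchanged:

```python
def tile_detections_to_frame(tile_detections, x_offset, y_offset):
    """tile_detections: list of (x0, y0, x1, y1, score, label) boxes in tile
    pixel coordinates; the offsets come from make_tiles()."""
    return [(x0 + x_offset, y0 + y_offset, x1 + x_offset, y1 + y_offset, score, label)
            for (x0, y0, x1, y1, score, label) in tile_detections]
```

(Overlapping tiles can produce duplicate boxes for the same object, so a non-maximum-suppression step over the merged detections is usually added as well.)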
The issue I am having is that I’m not able to detect animals, such as a cat, in my garden. What sort of resolution would I need for this?
Regarding what you mentioned: if I have a 1920×1080 frame with a car and it gets converted to 640×640, isn't the frame already being downsampled in the existing pipeline? Also, in this process, should I be running multiple models doing different tasks, or is it just a matter of downsampling so that everything can run on the Hailo modules?
With tiling, instead of converting the entire 1920×1080 frame to 640×640, it processes it in smaller blocks. How can I include tiling in my existing sample project?
You will likely need to retrain a model with images from your garden. I suspect the model you are using has been trained on a dataset like COCO. Cats in these datasets likely look different (bigger, interacting with their owners, different lighting and backgrounds, …) than cats walking through your garden.
The model does not know what a cat is. It looks for patterns in the images.
Try holding your phone with images of cats from the internet in front of your camera. If they get detected, you know the application works and that you need to retrain the model with images closer to your use case.