Hi,
I am facing an issue when training YOLO models with imgsz values above 640px. The detections themselves are correct, but the bounding boxes are shifted.
I came across this post, where someone faced the same problem: Trouble running custom yolo.hef models with imgz = 1088 - #2 by omria
Honestly, I don’t understand what is happening here, what @omria was trying to explain, or how I would apply it to the dataset I use (WIDERFACE), which contains images with wildly different dimensions.
What I understand so far:
According to Ultralytics, using higher imgsz values shouldn’t be a problem. During training, the imgsz parameter just fixes one side of the model’s input size. For example, imgsz=1024 simply means that one side is fixed at 1024 (e.g. 1024 x XXXX), which would mean the aspect ratio of the training data is preserved and that letterboxing and padding are applied automatically.
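Just so we are talking about the same thing, this is roughly what I assume "letterboxing and padding" means during preprocessing. It is a minimal sketch, not Ultralytics' actual code; the 1024 target size and the 114 pad value are assumptions based on the defaults I have seen:

```python
import cv2
import numpy as np

def letterbox(img, target=1024, pad_value=114):
    """Resize keeping the aspect ratio, then pad the result onto a target x target canvas."""
    h, w = img.shape[:2]
    scale = target / max(h, w)                        # fit the longer side to the target
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(img, (new_w, new_h))
    pad_left = (target - new_w) // 2                  # centre horizontally
    pad_top = (target - new_h) // 2                   # centre vertically
    canvas = np.full((target, target, 3), pad_value, dtype=img.dtype)
    canvas[pad_top:pad_top + new_h, pad_left:pad_left + new_w] = resized
    return canvas, scale, (pad_left, pad_top)         # needed later to map boxes back
```

If the scale and the (pad_left, pad_top) offset are not undone on the detections afterwards, I would expect exactly the kind of shift I am seeing, which is part of why I am asking.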
This is my standard training command:
!yolo task=detect mode=train model=yolov8n.pt data=/root/datasets/datasets/UPSCALED_WIDERFACE_YOLO/yolodataset.yaml epochs=15 batch=4 imgsz=1024 plots=True device=0,1
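If it matters, the same run through the Python API would, as far as I know, look like this (nothing special beyond the CLI call above):

```python
from ultralytics import YOLO

# Same settings as the CLI command: yolov8n base model, imgsz=1024, two GPUs
model = YOLO("yolov8n.pt")
model.train(
    data="/root/datasets/datasets/UPSCALED_WIDERFACE_YOLO/yolodataset.yaml",
    epochs=15,
    batch=4,
    imgsz=1024,
    plots=True,
    device=[0, 1],
)
```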
My questions are:
- Why do these issues (shifted bounding boxes) only become apparent when using imgsz values above 640px?
- Is it possible that this discrepancy stems from a toolchain configuration that I could change?
- From my understanding, there can’t be a one-size-fits-all ratio or padding value for varied training data like WIDERFACE.
- Why would this need to be accounted for in the compilation or inference process in the first place, given a model trained on such varied data?
- And why does inference work correctly for a model trained with imgsz=640 on the same wildly varied training data, where padding and letterboxing also seem to be applied during training (standard Ultralytics configuration)?
- Has anyone encountered and resolved this issue by modifying the compile-time configuration or the post-processing pipeline? (See the sketch below for what I mean by post-processing.)
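To be explicit about the last point: by "post-processing" I mean mapping the boxes from the letterboxed model input back to the original frame. Roughly like this, a sketch with names I made up for illustration (undo_letterbox, pad_left, pad_top), not anything taken from the Hailo or Ultralytics code:

```python
def undo_letterbox(boxes_xyxy, scale, pad_left, pad_top):
    """Map [x1, y1, x2, y2] boxes from letterboxed-input coordinates back to the original image."""
    restored = []
    for x1, y1, x2, y2 in boxes_xyxy:
        restored.append((
            (x1 - pad_left) / scale,   # remove horizontal padding, then undo the resize
            (y1 - pad_top) / scale,    # remove vertical padding, then undo the resize
            (x2 - pad_left) / scale,
            (y2 - pad_top) / scale,
        ))
    return restored
```

My naive guess is that if the compiled model's post-processing assumes a different scale or padding than what was actually applied, the boxes end up offset by exactly that difference; but that is precisely the part I don't understand.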
I don’t get it…
The good news is that I could live with the status quo, because resizing the higher-resolution camera feed down to the model’s expected input dimensions is sufficient for the moment.
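Concretely, the workaround is nothing more than a plain resize of each frame before inference (a sketch; the camera source and the 1024x1024 input size are specific to my setup):

```python
import cv2

cap = cv2.VideoCapture(0)            # placeholder camera source
input_size = (1024, 1024)            # the model's expected (width, height) in my case

ok, frame = cap.read()
if ok:
    model_input = cv2.resize(frame, input_size)   # plain resize, no letterboxing
    # inference runs on model_input; boxes map back to the frame with a simple scale factor
```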
But still: I would appreciate clarification on what is happening here. Any material is welcome.
Thank you!