Optimization level 3 for custom YOLOv8s Pose model leads to high RAM usage and crash

Hello,
I’m working with a custom yolov8s_pose model that uses a different number of keypoints than the standard 17. I successfully converted the model to Hailo using optimization level 2 (the default, with 1024 calibration images and CUDA). However, the results are not accurate enough, and, given how unclear the documentation is on this topic, I’m making one last attempt before switching to a different accelerator.

I initially converted the ONNX model to HAR using:

hailomz parse --ckpt model.onnx yolov8s_pose

Then, following the SDK tutorial, I attempted a higher optimization level using a calibration dataset of 1024 images (all 640x640). Since the model already runs, I skipped directly to steps 4 and 5. My Python script imports the required libraries, loads the calibration images, and loads the model. If I use:

calib_dataset = image_dataset_normalized
runner.optimize(calib_dataset)

… the optimization defaults to level 2.
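
For reference, the surrounding script is roughly the following (a sketch of the ClientRunner flow from the DFC tutorial rather than my exact code; paths and the .npy file are illustrative):

import numpy as np
from hailo_sdk_client import ClientRunner

# Load the HAR produced by the parsing step
runner = ClientRunner(har="yolov8s_pose.har")

# 1024 calibration images, already 640x640, stacked into an (N, 640, 640, 3) array
image_dataset_normalized = np.load("calib_640x640.npy")

calib_dataset = image_dataset_normalized
runner.optimize(calib_dataset)

# Save the optimized model for the compilation step
runner.save_har("yolov8s_pose_optimized.har")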

I understood that to apply optimization level 3 or 4, I need to define a .alls script. So I modified the default yolov8s_pose.alls by adding the line suggested in the Hailo Dataflow Compiler tutorial. I then ran:

calib_dataset = image_dataset_normalized
runner.load_model_script("/home/user/yolov8s_pose.alls")
runner.optimize(calib_dataset)

The .alls file contains the following (only the first line was added):

model_optimization_flavor(optimization_level=3)
normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv71, sigmoid)
change_output_activation(conv58, sigmoid)
change_output_activation(conv44, sigmoid)
pre_quantization_optimization(equalization, policy=disabled)
quantization_param(output_layer3, precision_mode=a16_w16)
quantization_param(output_layer6, precision_mode=a16_w16)
quantization_param(output_layer9, precision_mode=a16_w16)
post_quantization_optimization(finetune, policy=enabled, learning_rate=0.00015)

The output shows some expected changes (see below), but my machine used all 32 GB of free RAM and eventually crashed. I’m wondering if I did something wrong, because it seems excessive that one conversion step requires hardware capable of optimizing 20 Hailo chips in parallel. I hope I made a mistake, because otherwise I can’t proceed with any optimization beyond the default one, which isn’t delivering good enough results.

Partial log output:

[info] Loading model script commands to yolov8s_pose from /home/btsuser/yolov8s_pose.alls
[info] Found model with 3 input channels, using real RGB images for calibration instead of sampling random data.
[info] Starting Model Optimization
[info] Model received quantization params from the hn
[info] MatmulDecompose skipped
[info] Starting Mixed Precision
[info] Model Optimization Algorithm Mixed Precision is done (completion time is 00:00:00.41)
[info] LayerNorm Decomposition skipped
[info] Starting Statistics Collector
[info] Using dataset with 64 entries for calibration
...
[info] Model Optimization Algorithm Statistics Collector is done (completion time is 00:00:24.27)
[info] Output layer yolov8s_pose/conv44 with sigmoid activation was detected. Forcing its output range to be [0, 1] (original range was [1.7221204018369463e-07, 8.037192310439423e-05]).
[info] Output layer yolov8s_pose/conv58 with sigmoid activation was detected. Forcing its output range to be [0, 1] (original range was [4.72151384656172e-08, 0.0037292195484042168]).
[info] Output layer yolov8s_pose/conv71 with sigmoid activation was detected. Forcing its output range to be [0, 1] (original range was [9.105431075795423e-09, 0.013369864784181118]).
[info] Starting Fix zp_comp Encoding
[info] Model Optimization Algorithm Fix zp_comp Encoding is done (completion time is 00:00:00.00)
[info] Matmul Equalization skipped
[info] Starting MatmulDecomposeFix
[info] Model Optimization Algorithm MatmulDecomposeFix is done (completion time is 00:00:00.00)
[info] No shifts available for layer yolov8s_pose/conv1/conv_op, using max shift instead. delta=3.2951
[info] No shifts available for layer yolov8s_pose/conv1/conv_op, using max shift instead. delta=1.6475
[info] Finetune encoding skipped
[info] Bias Correction skipped
[warning] Dataset is larger than dataset_size in Adaround. Increasing the algorithm dataset size might improve the results
[info] Starting Adaround
[info] The algorithm Adaround will use up to 7.97 GB of storage space
[info] Using dataset with 256 entries for Adaround
[info] Using dataset with 64 entries for bias correction
Adaround: 1%|   | 1/81 [00:09<11:55,  8.95s/blocks, Layers=['yolov8s_pose/conv1
...
THEN SUDDENLY
[warning] DALI is not installed, using tensorflow dataset for layer by layer train. Using DALI will improve train time significantly. To install it use: pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110 nvidia-dali-tf-plugin-cuda110
[warning] Dataset isn't shuffled without DALI. To remove this warning add the following model script command: `post_quantization_optimization(adaround, shuffle=False)
...
THEN TRAINING STARTED AND RAN UNTIL THE CRASH
Training:   0%|                                   | 0/2560 [00:00<?, ?batches/s]
...

Can you confirm if my .alls script and usage are correct? Is such high memory usage expected? How can I properly perform optimization level 3 (or 4) on a machine with 32 GB RAM?

Thanks in advance.

Just to clarify: the green dots are the original model, the pink crosses are the .hef output with optimization level 2. To me it seems that there is strong degradation.

For the advanced optimization algorithms, we recommend using a machine with a supported GPU.

Are you using real images for the calibration, including the required pre-processing?

You can also use the model-zoo for the optimization:
hailomz optimize --har ./my_har.har --calib-data-path /path/to/real/images

Yes. I am using a yolov8s_base model with all 1024 entries of the calibration path at 640x640 pixels, as required by the input size of the yolov8s_pose model, taken from the validation dataset. The machine has an RTX 4070 GPU and 32 GB of RAM. It seems that a non-standard optimization level, e.g. 3, needs more memory. I have already tried your suggestion, but it uses the default optimization level 2 (which is selected only if you use a GPU), and that is not good enough, as you can see from my screenshot.

Thanks. I’m looking again at the output in your initial message, and it seems clear that random images were fed. Can you double-check that?

I used 1024 images from the COCO2017 validation dataset: I selected around 650 images containing a single person and about 350 background images (animals, objects, etc.), because in order to perform pose estimation, one also needs to perform detection. Is this approach incorrect? What should I modify?

Moreover, the resolution of the images in the COCO validation dataset varies from image to image and is always smaller than 640×640. To make them consistent with the fixed input resolution, I brought all images to 640×640 by padding them with black borders.
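
The padding step is roughly the following (a sketch of the idea rather than my exact code; the helper name is mine):

import cv2
import numpy as np

def pad_to_640(img):
    # Assumes a 3-channel image; scale the longer side down to 640 if needed,
    # then center the result on a black 640x640 canvas
    h, w = img.shape[:2]
    scale = min(1.0, 640.0 / max(h, w))
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.zeros((640, 640, 3), dtype=resized.dtype)
    top = (640 - resized.shape[0]) // 2
    left = (640 - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas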

Hi again @Simone_Tortorella,
It seems that we have recently fixed a bug that might have affected your case. Indeed, the ‘random data’ message was wrong from your perspective, but it hinted that the tool had recognized your calibration data otherwise.
R&D have merged a fix into the DFC and it will be part of the next release (July’25).

Hi @Simone_Tortorella,

When you select optimization level 3, it enables both equalization and Adaround across all layers, using 256 images and 320 epochs by default.

If you don’t explicitly configure Adaround’s parameters, it will use the defaults described here. Adaround is the most resource-intensive optimization we offer, and the default is a high batch size and a large number of epochs, so I’d recommend lowering both. Also, since only 256 images are being used out of the 1024 you’re passing, it’s a good idea to set dataset_size explicitly.

I also noticed you’re using finetune alongside Adaround. It’s best to stick to one or the other, not both. In the end, your script should look something like this:

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv71, sigmoid)
change_output_activation(conv58, sigmoid)
change_output_activation(conv44, sigmoid)
pre_quantization_optimization(equalization, policy=disabled)
quantization_param(output_layer3, precision_mode=a16_w16)
quantization_param(output_layer6, precision_mode=a16_w16)
quantization_param(output_layer9, precision_mode=a16_w16)
model_optimization_flavor(optimization_level=3)
post_quantization_optimization(adaround, policy=enabled, batch_size=4, dataset_size=1024, epochs=16, cache_compression=enabled)

Additionally:

  • Ensure that your GPU is being used (a quick check is sketched below)
  • Install DALI, as recommended by the warning message
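
A quick way to check the first point, assuming you run it in the same Python environment where the DFC is installed (the optimization algorithms run on TensorFlow, as the DALI warning above indicates):

import tensorflow as tf

# An empty list means the finetune/Adaround steps will fall back to the CPU
print(tf.config.list_physical_devices("GPU"))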

Can I try with optimization level 2, but more than 1024 images and more than 4 epochs? What should I add?

Optimization level 2 enables QFT (the finetune algorithm). You can find an explanation of the parameters of all of Hailo’s post-quantization algorithms in the DFC guide.
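
For example, to give finetune more data and more epochs, you can add an explicit finetune command to your model script. A rough sketch is below; the exact parameter names and allowed ranges should be verified against the finetune section of the DFC guide:

from hailo_sdk_client import ClientRunner

runner = ClientRunner(har="yolov8s_pose.har")
# ... build calib_dataset as before ...

# The command can also simply be appended as a line to your .alls file
alls_extra = (
    "post_quantization_optimization(finetune, policy=enabled, "
    "dataset_size=1024, epochs=8, learning_rate=0.0001)\n"
)
runner.load_model_script(alls_extra)
runner.optimize(calib_dataset)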