Hello,
I’m working with a custom yolov8s_pose model that uses a different number of keypoints than the standard 17. I successfully converted the model for Hailo at optimization level 2 (the default, with 1024 calibration images and CUDA). However, the results are not accurate enough, and because the documentation is unclear on how to go further, I’m making one last attempt before switching accelerators.
I initially converted the ONNX model to HAR using:
hailomz parse --ckpt model.onnx yolov8s_pose
Then, following the SDK tutorial, I attempted a higher optimization level with a calibration dataset of 1024 images (all 640x640). Since the model already runs, I skipped directly to steps 4 and 5. My Python script imports the required libraries and loads the images and the model. If I use:
calib_dataset = image_dataset_normalized
runner.optimize(calib_dataset)
… the optimization defaults to level 2.
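For context, this is roughly how my script prepares image_dataset_normalized and the runner. It is a simplified sketch: the image directory is a placeholder, and I feed raw 0-255 pixel values because the .alls script shown further down already adds an on-chip normalization layer.

import numpy as np
from PIL import Image
from pathlib import Path
from hailo_sdk_client import ClientRunner

# Load the HAR produced by the hailomz parse step
runner = ClientRunner(har="yolov8s_pose.har")

# Build a (N, 640, 640, 3) float32 array from the calibration images
calib_dir = Path("/home/user/calib_images")  # placeholder path
paths = sorted(calib_dir.glob("*.jpg"))[:1024]
image_dataset_normalized = np.zeros((len(paths), 640, 640, 3), dtype=np.float32)
for i, p in enumerate(paths):
    img = Image.open(p).convert("RGB").resize((640, 640))
    image_dataset_normalized[i] = np.asarray(img, dtype=np.float32)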
I understood that to apply optimization level 3 or 4, I need to define a .alls script. So I modified the default yolov8s_pose.alls by adding the line suggested in the Hailo Dataflow Compiler tutorial. I then ran:
calib_dataset = image_dataset_normalized
runner.load_model_script("/home/user/yolov8s_pose.alls")
runner.optimize(calib_dataset)
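(For completeness, these are the remaining steps I intend to run once optimization no longer crashes, assuming the API behaves as in the DFC tutorial notebooks:)

# Save the quantized model and compile it to a HEF
runner.save_har("yolov8s_pose_quantized.har")
hef = runner.compile()
with open("yolov8s_pose.hef", "wb") as f:
    f.write(hef)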
The .alls file contains the following (only the first line was added):
model_optimization_flavor(optimization_level=3)
normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv71, sigmoid)
change_output_activation(conv58, sigmoid)
change_output_activation(conv44, sigmoid)
pre_quantization_optimization(equalization, policy=disabled)
quantization_param(output_layer3, precision_mode=a16_w16)
quantization_param(output_layer6, precision_mode=a16_w16)
quantization_param(output_layer9, precision_mode=a16_w16)
post_quantization_optimization(finetune, policy=enabled, learning_rate=0.00015)
The output shows the expected changes (see below), but the process consumed all 32 GB of free RAM and eventually crashed the machine. I’m wondering whether I did something wrong, because it seems excessive for a single conversion step to require a workstation powerful enough to optimize 20 Hailo chips in parallel. I hope I made a mistake, because otherwise I can’t go beyond the default optimization level, which isn’t delivering good enough results.
Partial log output:
[info] Loading model script commands to yolov8s_pose from /home/btsuser/yolov8s_pose.alls
[info] Found model with 3 input channels, using real RGB images for calibration instead of sampling random data.
[info] Starting Model Optimization
[info] Model received quantization params from the hn
[info] MatmulDecompose skipped
[info] Starting Mixed Precision
[info] Model Optimization Algorithm Mixed Precision is done (completion time is 00:00:00.41)
[info] LayerNorm Decomposition skipped
[info] Starting Statistics Collector
[info] Using dataset with 64 entries for calibration
...
[info] Model Optimization Algorithm Statistics Collector is done (completion time is 00:00:24.27)
[info] Output layer yolov8s_pose/conv44 with sigmoid activation was detected. Forcing its output range to be [0, 1] (original range was [1.7221204018369463e-07, 8.037192310439423e-05]).
[info] Output layer yolov8s_pose/conv58 with sigmoid activation was detected. Forcing its output range to be [0, 1] (original range was [4.72151384656172e-08, 0.0037292195484042168]).
[info] Output layer yolov8s_pose/conv71 with sigmoid activation was detected. Forcing its output range to be [0, 1] (original range was [9.105431075795423e-09, 0.013369864784181118]).
[info] Starting Fix zp_comp Encoding
[info] Model Optimization Algorithm Fix zp_comp Encoding is done (completion time is 00:00:00.00)
[info] Matmul Equalization skipped
[info] Starting MatmulDecomposeFix
[info] Model Optimization Algorithm MatmulDecomposeFix is done (completion time is 00:00:00.00)
[info] No shifts available for layer yolov8s_pose/conv1/conv_op, using max shift instead. delta=3.2951
[info] No shifts available for layer yolov8s_pose/conv1/conv_op, using max shift instead. delta=1.6475
[info] Finetune encoding skipped
[info] Bias Correction skipped
[warning] Dataset is larger than dataset_size in Adaround. Increasing the algorithm dataset size might improve the results
[info] Starting Adaround
[info] The algorithm Adaround will use up to 7.97 GB of storage space
[info] Using dataset with 256 entries for Adaround
[info] Using dataset with 64 entries for bias correction
Adaround: 1%| | 1/81 [00:09<11:55, 8.95s/blocks, Layers=['yolov8s_pose/conv1
...
THEN SUDDENLY:
[warning] DALI is not installed, using tensorflow dataset for layer by layer train. Using DALI will improve train time significantly. To install it use: pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110 nvidia-dali-tf-plugin-cuda110
[warning] Dataset isn't shuffled without DALI. To remove this warning add the following model script command: `post_quantization_optimization(adaround, shuffle=False)
...
THEN TRAINING STARTED AND RAN UNTIL THE CRASH:
Training: 0%| | 0/2560 [00:00<?, ?batches/s]
...
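The only workaround I can think of, which I have not verified, is to scale down or disable the heavy post-quantization stages in the model script, borrowing the parameter names from the Model Zoo .alls files and from the shuffle warning above, for example:

model_optimization_flavor(optimization_level=3)
post_quantization_optimization(adaround, policy=disabled)
post_quantization_optimization(finetune, policy=enabled, learning_rate=0.00015, dataset_size=256)

But I would rather understand whether this memory usage is expected before turning off the very algorithms that level 3 is supposed to add.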
Can you confirm whether my .alls script and usage are correct? Is such high memory usage expected? And how can I properly perform optimization level 3 (or 4) on a machine with 32 GB of RAM?
Thanks in advance.