YOLOv8s total distill loss is not decreasing when converting to HEF

Hello Hailo community,

I have a question regarding ONNX to HEF conversion. I have a custom dataset that consists of a single label (car).
When I convert the model to a HEF file, the total distill loss does not decrease. I suspect this is why I can't
get good accuracy when I run my model at runtime on the Hailo-8. Detection performance is pretty good when I infer directly with PyTorch.
Can you help me with this? How can I solve this issue?

This is how I export my trained YOLOv8s model:

$: yolo export model=best.pt format=onnx opset=11 imgsz=640
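
As a quick sanity check before conversion, the exported model can be inspected to confirm that it loads and exposes the expected 'images' input (a minimal sketch, assuming onnxruntime is installed in the same environment):

$: python3 -c "import onnxruntime as ort; i = ort.InferenceSession('best.onnx').get_inputs()[0]; print(i.name, i.shape)"

For this export it should print images [1, 3, 640, 640].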

I also use a Docker environment for the HEF conversion. These are the versions of the tools:

$: hailo --version
[info] Current Time: 12:16:11, 12/27/24
[info] CPU: Architecture: x86_64, Model: Intel(R) Xeon(R) W-2195 CPU @ 2.30GHz, Number Of Cores: 36, Utilization: 0.8%
[info] Memory: Total: 125GB, Available: 52GB
[info] System info: OS: Linux, Kernel: 5.19.0-46-generic
[info] Hailo DFC Version: 3.29.0
[info] HailoRT Version: 4.19.0
[info] PCIe: No Hailo PCIe device was found
[info] Running `hailo --version`
HailoRT v4.19.0
Hailo Dataflow Compiler v3.29.0

The following is the model script that I use:

quantization_param([conv42, conv53, conv63], force_range_out=[0.0, 1.0])
normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv42, sigmoid)
change_output_activation(conv53, sigmoid)
change_output_activation(conv63, sigmoid)
nms_postprocess("../../postprocess_config/yolov8s_nms_config.json", meta_arch=yolov8, engine=cpu)

I use this command to compile:

hailomz compile --ckpt /local/shared_with_docker/best.onnx --calib-path /local/shared_with_docker/images/ --yaml hailo_model_zoo/hailo_model_zoo/cfg/networks/yolov8s.yaml --start-node-names images --classes 1 --hw-arch hailo8
<Hailo Model Zoo INFO> Start run for network yolov8s ...
<Hailo Model Zoo INFO> Initializing the hailo8 runner...
[info] Translation started on ONNX model yolov8s
[info] Restored ONNX model yolov8s (completion time: 00:00:00.22)
[info] Extracted ONNXRuntime meta-data for Hailo model (completion time: 00:00:00.77)
[info] NMS structure of yolov8 (or equivalent architecture) was detected.
[info] In order to use HailoRT post-processing capabilities, these end node names should be used: /model.22/cv2.0/cv2.0.2/Conv /model.22/cv3.0/cv3.0.2/Conv /model.22/cv2.1/cv2.1.2/Conv /model.22/cv3.1/cv3.1.2/Conv /model.22/cv2.2/cv2.2.2/Conv /model.22/cv3.2/cv3.2.2/Conv.
[info] Start nodes mapped from original model: 'images': 'yolov8s/input_layer1'.
[info] End nodes mapped from original model: '/model.22/cv2.0/cv2.0.2/Conv', '/model.22/cv3.0/cv3.0.2/Conv', '/model.22/cv2.1/cv2.1.2/Conv', '/model.22/cv3.1/cv3.1.2/Conv', '/model.22/cv2.2/cv2.2.2/Conv', '/model.22/cv3.2/cv3.2.2/Conv'.
[info] Translation completed on ONNX model yolov8s (completion time: 00:00:01.24)
[info] Saved HAR to: /local/workspace/yolov8s.har
<Hailo Model Zoo INFO> Preparing calibration data...
[info] Loading model script commands to yolov8s from /local/workspace/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov8s.alls
[info] Loading model script commands to yolov8s from string
[info] Starting Model Optimization
[info] Using default optimization level of 2
[info] Model received quantization params from the hn
[info] Starting Mixed Precision
[info] Mixed Precision is done (completion time is 00:00:00.76)
[info] LayerNorm Decomposition skipped
[info] Starting Statistics Collector
[info] Using dataset with 64 entries for calibration
Calibration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:26<00:00,  2.39entries/s]
[info] Statistics Collector is done (completion time is 00:00:28.84)
[warning] The force_range command has been used, notice that its behavior was changed on this version. The old behavior forced the range on the collected calibration set statistics, but allowed the range to change during the optimization algorithms.
The new behavior forces the range throughout all optimization stages.
The old method could be restored by adding the flag weak_force_range_out=enabled to the force_range command on the following layers ['yolov8s/conv42', 'yolov8s/conv53', 'yolov8s/conv63']
[info] Starting Fix zp_comp Encoding
[info] Fix zp_comp Encoding is done (completion time is 00:00:00.00)
[info] Matmul Equalization skipped
[info] Finetune encoding skipped
[info] Bias Correction skipped
[info] Adaround skipped
[info] Starting Quantization-Aware Fine-Tuning
[warning] Dataset is larger than expected size. Increasing the algorithm dataset size might improve the results
[info] Using dataset with 1024 entries for finetune

127/128 [============================>.] - ETA: 0s - total_distill_loss: 16.7721 - _distill_loss_yolov8s/conv41: 1.0086 - _distill_loss_yolov8s/conv42: 1.3785 - _distill_loss_yolov8s/conv52: 0.6110 - _distill_loss_yolov8s/conv53: 3.1424 - _distill_loss_yolov8s/conv62: 0.4986 - _distill_loss_yolov8s/conv63: 3.9918 - _distill_loss_yolov8s/conv46: 2.2524 - _distill_loss_yolov8s/conv35: 1.6849 - _distill_loss_yolov8s/conv5
128/128 [==============================] - ETA: 0s - total_distill_loss: 16.7585 - _distill_loss_yolov8s/conv41: 1.0093 - _distill_loss_yolov8s/conv42: 1.3750 - _distill_loss_yolov8s/conv52: 0.6116 - _distill_loss_yolov8s/conv53: 3.1264 - _distill_loss_yolov8s/conv62: 0.5003 - _distill_loss_yolov8s/conv63: 3.9918 - _distill_loss_yolov8s/conv46: 2.2532 - _distill_loss_yolov8s/conv35: 1.6854 - _distill_loss_yolov8s/conv5
128/128 [==============================] - 34s 265ms/step - total_distill_loss: 16.7451 - _distill_loss_yolov8s/conv41: 1.0100 - _distill_loss_yolov8s/conv42: 1.3716 - _distill_loss_yolov8s/conv52: 0.6122 - _distill_loss_yolov8s/conv53: 3.1106 - _distill_loss_yolov8s/conv62: 0.5021 - _distill_loss_yolov8s/conv63: 3.9919 - _distill_loss_yolov8s/conv46: 2.2540 - _distill_loss_yolov8s/conv35: 1.6860 - _distill_loss_yolov8s/conv57: 2.2067
[info] Quantization-Aware Fine-Tuning is done (completion time is 00:09:19.10)
[info] Starting Layer Noise Analysis
Full Quant Analysis: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:10<00:00, 65.12s/iterations]
[info] Layer Noise Analysis is done (completion time is 00:02:14.49)
[info] Model Optimization is done
[info] Saved HAR to: /local/workspace/yolov8s.har
[info] Loading model script commands to yolov8s from /local/workspace/hailo_model_zoo/hailo_model_zoo/cfg/alls/generic/yolov8s.alls
[info] To achieve optimal performance, set the compiler_optimization_level to "max" by adding performance_param(compiler_optimization_level=max) to the model script. Note that this may increase compilation time.
[info] Loading network parameters
[info] Starting Hailo allocation and compilation flow
[info] Adding an output layer after conv41
[info] Adding an output layer after conv42
[info] Adding an output layer after conv52
[info] Adding an output layer after conv53
[info] Adding an output layer after conv62
[info] Adding an output layer after conv63
[info] Using Single-context flow
[info] Resources optimization guidelines: Strategy -> GREEDY Objective -> MAX_FPS
[info] Resources optimization params: max_control_utilization=75%, max_compute_utilization=75%, max_compute_16bit_utilization=75%, max_memory_utilization (weights)=75%, max_input_aligner_utilization=75%, max_apu_utilization=75%
[info] Using Single-context flow
[info] Resources optimization guidelines: Strategy -> GREEDY Objective -> MAX_FPS
[info] Resources optimization params: max_control_utilization=75%, max_compute_utilization=75%, max_compute_16bit_utilization=75%, max_memory_utilization (weights)=75%, max_input_aligner_utilization=75%, max_apu_utilization=75%

Validating context_0 layer by layer (100%)


● Finished                                       

[info] Solving the allocation (Mapping), time per context: 59m 59s
Context:0/0 Iteration 4: Trying parallel mapping...  
          cluster_0  cluster_1  cluster_2  cluster_3  cluster_4  cluster_5  cluster_6  cluster_7  prepost 
 worker0  *          *          *          *          *          *          *          *          V       
 worker1  *          *          *          *          *          *          *          *          V       
 worker2  V          V          V          V          V          V          V          V          V       
 worker3  V          V          V          V          V          V          V          V          V       

Reverts on cluster mapping: 0
Reverts on inter-cluster connectivity: 0
Reverts on pre-mapping validation: 0
Reverts on split failed: 0

[info] Iterations: 4
Reverts on cluster mapping: 0
Reverts on inter-cluster connectivity: 0
Reverts on pre-mapping validation: 2
Reverts on split failed: 0
[info] +-----------+---------------------+---------------------+--------------------+
[info] | Cluster   | Control Utilization | Compute Utilization | Memory Utilization |
[info] +-----------+---------------------+---------------------+--------------------+
[info] | cluster_0 | 100%                | 62.5%               | 57%                |
[info] | cluster_1 | 100%                | 81.3%               | 93%                |
[info] | cluster_2 | 87.5%               | 62.5%               | 33.6%              |
[info] | cluster_3 | 100%                | 87.5%               | 41.4%              |
[info] | cluster_4 | 100%                | 79.7%               | 71.9%              |
[info] | cluster_5 | 31.3%               | 35.9%               | 12.5%              |
[info] | cluster_6 | 75%                 | 87.5%               | 69.5%              |
[info] | cluster_7 | 6.3%                | 6.3%                | 1.6%               |
[info] +-----------+---------------------+---------------------+--------------------+
[info] | Total     | 75%                 | 62.9%               | 47.6%              |
[info] +-----------+---------------------+---------------------+--------------------+
[info] Successful Mapping (allocation time: 38s)
[info] Compiling context_0...
[info] Bandwidth of model inputs: 9.375 Mbps, outputs: 4.16565 Mbps (for a single frame)
[info] Bandwidth of DDR buffers: 12.5 Mbps (for a single frame)
[info] Bandwidth of inter context tensors: 0.0 Mbps (for a single frame)
[info] Building HEF...
[info] Successful Compilation (compilation time: 22s)
[info] Saved HAR to: /local/workspace/yolov8s.har
<Hailo Model Zoo INFO> HEF file written to yolov8s.hef

How many images have you used for the calibration?
How do you measure the accuracy? Single image or a complete dataset?

I have two datasets, one for training and one for validation. My training dataset consists of 69,650 images and my validation dataset consists of 18,520 images. For calibration, I used the same validation dataset. In addition, I used the following model script for another HEF conversion attempt:

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv42, sigmoid)
change_output_activation(conv53, sigmoid)
change_output_activation(conv63, sigmoid)
quantization_param([conv42, conv53, conv63], force_range_out=[0.0, 1.0], weak_force_range_out=enabled)
post_quantization_optimization(finetune, policy=enabled, dataset_size=16384)
model_optimization_config(calibration, batch_size=4, calibset_size=16384)
nms_postprocess("../../postprocess_config/yolov8s_nms_config.json", meta_arch=yolov8, engine=cpu)

However, the total distill loss does not decrease this time either.

When I measure the accuracy, I use the entire validation dataset for testing, but the detection list is 99% empty. It rarely detects cars, even though everything looks fine when I infer with PyTorch. The runtime code was taken from Hailo-Application-Code-Examples/runtime/python/detection_with_tracker at main · hailo-ai/Hailo-Application-Code-Examples · GitHub

This line is not “good”. Doing the calibration part of the optimization on the whole dataset will set the outliers as the limvals. I would remove that line.
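
For example, instead of pointing --calib-path at the full validation set, you could copy out a random subset first. A rough sketch (the calib_subset directory name and the 1024-image size are just illustrative choices; the default calibration uses far fewer images, as the "64 entries" line in your log shows):

$: mkdir -p /local/shared_with_docker/calib_subset
$: ls /local/shared_with_docker/images | shuf -n 1024 | xargs -I{} cp /local/shared_with_docker/images/{} /local/shared_with_docker/calib_subset/

Then pass --calib-path /local/shared_with_docker/calib_subset/ to hailomz compile.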

Thank you for your fast reply. This time, I used the following model script for the conversion:

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv42, sigmoid)
change_output_activation(conv53, sigmoid)
change_output_activation(conv63, sigmoid)
quantization_param([conv42, conv53, conv63], force_range_out=[0.0, 1.0], weak_force_range_out=enabled)
post_quantization_optimization(finetune, policy=enabled, dataset_size=16384)
nms_postprocess("../../postprocess_config/yolov8s_nms_config.json", meta_arch=yolov8, engine=cpu)

However, the total distill loss is very high again, especially “_distill_loss_yolov8s/conv42”. The following is the output of the conversion process:

2046/2048 [============================>.] - ETA: 0s - total_distill_loss: 12792.6229 - _distill_loss_yolov8s/conv41: 1.1548 - _distill_loss_yolov8s/conv42: 12775.4624 - _distill_loss_yolov8s/conv52: 0.6606 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.4443 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 2.5335 - _distill_loss_yolov8s/conv57: 2.3281 - _distill_loss_yolo
2047/2048 [============================>.] - ETA: 0s - total_distill_loss: 12786.3911 - _distill_loss_yolov8s/conv41: 1.1547 - _distill_loss_yolov8s/conv42: 12769.2309 - _distill_loss_yolov8s/conv52: 0.6606 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.4442 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 2.5335 - _distill_loss_yolov8s/conv57: 2.3280 - _distill_loss_yolo
2048/2048 [==============================] - ETA: 0s - total_distill_loss: 12780.1671 - _distill_loss_yolov8s/conv41: 1.1547 - _distill_loss_yolov8s/conv42: 12763.0073 - _distill_loss_yolov8s/conv52: 0.6606 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.4442 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 2.5334 - _distill_loss_yolov8s/conv57: 2.3278 - _distill_loss_yolo
2048/2048 [==============================] - 953s 465ms/step - total_distill_loss: 12773.9491 - _distill_loss_yolov8s/conv41: 1.1547 - _distill_loss_yolov8s/conv42: 12756.7898 - _distill_loss_yolov8s/conv52: 0.6606 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.4441 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 2.5334 - _distill_loss_yolov8s/conv57: 2.3276 - _distill_loss_yolov8s/conv35: 2.0392
[info] Quantization-Aware Fine-Tuning is done (completion time is 01:12:06.51)

Do you have any other suggestions regarding this problem?

I have a few options. First, we don’t use finetune on the yolov8s; would the results not be OK without it?

On the yolov8m, we use finetune, but with a very low learning_rate:
post_quantization_optimization(finetune, policy=enabled, learning_rate=0.000025)

One last comment: on detection networks, if you want to specify the actual nodes for the finetune, it is better to use the ones before the last, or to let the tool select them automatically.
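
For reference, this is roughly what the first option would look like as a full model script (a sketch based on your script above, assuming policy=disabled is accepted here the same way policy=enabled is):

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv42, sigmoid)
change_output_activation(conv53, sigmoid)
change_output_activation(conv63, sigmoid)
post_quantization_optimization(finetune, policy=disabled)
nms_postprocess("../../postprocess_config/yolov8s_nms_config.json", meta_arch=yolov8, engine=cpu)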

This time, I used the following settings:

1- Model script:

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv42, sigmoid)
change_output_activation(conv53, sigmoid)
change_output_activation(conv63, sigmoid)
post_quantization_optimization(finetune, policy=enabled, learning_rate=0.000025, dataset_size=16384)
nms_postprocess("../../postprocess_config/yolov8s_nms_config.json", meta_arch=yolov8, engine=cpu)

Output of conversion:

2045/2048 [============================>.] - ETA: 1s - total_distill_loss: 18.1488 - _distill_loss_yolov8s/conv41: 1.0866 - _distill_loss_yolov8s/conv42: 1.4610 - _distill_loss_yolov8s/conv52: 0.6317 - _distill_loss_yolov8s/conv53: 3.6032 - _distill_loss_yolov8s/conv62: 0.4541 - _distill_loss_yolov8s/conv63: 4.1087 - _distill_loss_yolov8s/conv57: 2.3764 - _distill_loss_yolov8s/conv35: 1.9292 - _distill_loss_yolov8s/con
2046/2048 [============================>.] - ETA: 0s - total_distill_loss: 18.1485 - _distill_loss_yolov8s/conv41: 1.0866 - _distill_loss_yolov8s/conv42: 1.4609 - _distill_loss_yolov8s/conv52: 0.6318 - _distill_loss_yolov8s/conv53: 3.6030 - _distill_loss_yolov8s/conv62: 0.4541 - _distill_loss_yolov8s/conv63: 4.1087 - _distill_loss_yolov8s/conv57: 2.3763 - _distill_loss_yolov8s/conv35: 1.9293 - _distill_loss_yolov8s/con
2047/2048 [============================>.] - ETA: 0s - total_distill_loss: 18.1479 - _distill_loss_yolov8s/conv41: 1.0866 - _distill_loss_yolov8s/conv42: 1.4608 - _distill_loss_yolov8s/conv52: 0.6318 - _distill_loss_yolov8s/conv53: 3.6025 - _distill_loss_yolov8s/conv62: 0.4541 - _distill_loss_yolov8s/conv63: 4.1086 - _distill_loss_yolov8s/conv57: 2.3762 - _distill_loss_yolov8s/conv35: 1.9293 - _distill_loss_yolov8s/con
2048/2048 [==============================] - ETA: 0s - total_distill_loss: 18.1469 - _distill_loss_yolov8s/conv41: 1.0867 - _distill_loss_yolov8s/conv42: 1.4605 - _distill_loss_yolov8s/conv52: 0.6318 - _distill_loss_yolov8s/conv53: 3.6020 - _distill_loss_yolov8s/conv62: 0.4541 - _distill_loss_yolov8s/conv63: 4.1086 - _distill_loss_yolov8s/conv57: 2.3761 - _distill_loss_yolov8s/conv35: 1.9293 - _distill_loss_yolov8s/con
2048/2048 [==============================] - 973s 475ms/step - total_distill_loss: 18.1459 - _distill_loss_yolov8s/conv41: 1.0867 - _distill_loss_yolov8s/conv42: 1.4602 - _distill_loss_yolov8s/conv52: 0.6318 - _distill_loss_yolov8s/conv53: 3.6015 - _distill_loss_yolov8s/conv62: 0.4541 - _distill_loss_yolov8s/conv63: 4.1085 - _distill_loss_yolov8s/conv57: 2.3760 - _distill_loss_yolov8s/conv35: 1.9292 - _distill_loss_yolov8s/conv46: 2.4979
[info] Quantization-Aware Fine-Tuning is done (completion time is 01:13:14.53)

2- Model script:

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv42, sigmoid)
change_output_activation(conv53, sigmoid)
change_output_activation(conv63, sigmoid)
nms_postprocess("../../postprocess_config/yolov8s_nms_config.json", meta_arch=yolov8, engine=cpu)

Output of conversion:

124/128 [============================>.] - ETA: 1s - total_distill_loss: 24.4006 - _distill_loss_yolov8s/conv41: 1.5659 - _distill_loss_yolov8s/conv42: 4.0000 - _distill_loss_yolov8s/conv52: 0.9290 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.7317 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 3.3517 - _distill_loss_yolov8s/conv57: 3.0120 - _distill_loss_yolov8s/conv35: 2.8103
125/128 [============================>.] - ETA: 1s - total_distill_loss: 24.3983 - _distill_loss_yolov8s/conv41: 1.5656 - _distill_loss_yolov8s/conv42: 4.0000 - _distill_loss_yolov8s/conv52: 0.9287 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.7310 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 3.3515 - _distill_loss_yolov8s/conv57: 3.0111 - _distill_loss_yolov8s/conv35: 2.8104
126/128 [============================>.] - ETA: 0s - total_distill_loss: 24.3967 - _distill_loss_yolov8s/conv41: 1.5652 - _distill_loss_yolov8s/conv42: 4.0000 - _distill_loss_yolov8s/conv52: 0.9285 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.7308 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 3.3512 - _distill_loss_yolov8s/conv57: 3.0108 - _distill_loss_yolov8s/conv35: 2.8102
127/128 [============================>.] - ETA: 0s - total_distill_loss: 24.3951 - _distill_loss_yolov8s/conv41: 1.5650 - _distill_loss_yolov8s/conv42: 4.0000 - _distill_loss_yolov8s/conv52: 0.9284 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.7305 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 3.3512 - _distill_loss_yolov8s/conv57: 3.0102 - _distill_loss_yolov8s/conv35: 2.8098
128/128 [==============================] - ETA: 0s - total_distill_loss: 24.3917 - _distill_loss_yolov8s/conv41: 1.5646 - _distill_loss_yolov8s/conv42: 4.0000 - _distill_loss_yolov8s/conv52: 0.9284 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.7303 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 3.3508 - _distill_loss_yolov8s/conv57: 3.0087 - _distill_loss_yolov8s/conv35: 2.8089
128/128 [==============================] - 60s 470ms/step - total_distill_loss: 24.3884 - _distill_loss_yolov8s/conv41: 1.5643 - _distill_loss_yolov8s/conv42: 4.0000 - _distill_loss_yolov8s/conv52: 0.9283 - _distill_loss_yolov8s/conv53: 4.0000 - _distill_loss_yolov8s/conv62: 0.7301 - _distill_loss_yolov8s/conv63: 4.0000 - _distill_loss_yolov8s/conv46: 3.3505 - _distill_loss_yolov8s/conv57: 3.0072 - _distill_loss_yolov8s/conv35: 2.8081
[info] Quantization-Aware Fine-Tuning is done (completion time is 00:12:16.04)

I think I have solved the issue. I used the following model script:

normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv42, sigmoid)
change_output_activation(conv53, sigmoid)
change_output_activation(conv63, sigmoid)
model_optimization_flavor(compression_level=0, optimization_level=0)
nms_postprocess("../../postprocess_config/yolov8s_nms_config.json", meta_arch=yolov8, engine=cpu)

Do you have any technical explanation for why the solution above works? I came up with this solution based on my intuition.