Performance difference between PyTorch (.pt) and compiled Hailo (.hef) models

Hello,

I noticed a clear difference in performance between my original PyTorch model (.pt) and the compiled Hailo Executable Format (.hef) version. While the .pt model works fine, the .hef model sometimes fails to detect objects or behaves differently at inference time.

For example, on a screenshot of the same scene, the .hef model in production may detect nothing, whereas the .pt model detects all objects correctly.

I initially thought that the production model wasn’t receiving images of the same input size, but even after adjusting this, it didn’t improve. (It’s possible I did something incorrectly during this adjustment.)

One possible cause could be the optimization level used during compilation. I’m currently using level 0, and my local machine has no GPU, which might affect how certain operations are executed.

I’m trying to compile on Google Colab, which has NVIDIA GPUs, but the Hailo SDK doesn’t seem to detect the GPU, likely because it expects specific Hailo hardware (Hailo-8 or Hailo-15), not generic CUDA GPUs.

Additionally, I don’t fully understand the calibration dataset requirements: how different the images should be, how many images are needed, and whether I should only include images from my production cameras.

Thank you in advance.

Hi @ALEXFER

There could be a couple of reasons for the behavior you are seeing.

  1. In your inference script, you could be passing BGR image arrays instead of RGB.
  2. During compilation, you might not have used calibration images from your use case.

One way to narrow down the root cause is to measure the mAP of the HEF file on your validation set and compare it with the mAP of the PyTorch checkpoint. You can also try compiling using our cloud compiler, which allows you to upload some calibration images: Early Access to DeGirum Cloud Compiler

Hi,

Thank you for your insights.

Just to clarify, my dataset is black and white, so there’s no issue with RGB vs BGR.

Regarding the calibration images: do they need to be very diverse? My videos come from surveillance cameras where the backgrounds barely change (only 3–4 different backgrounds), and it’s mainly certain elements in the scene that change. In this case, how many images would you recommend for calibration?

Also, my model was optimized at level 0 because I didn’t have access to an NVIDIA GPU. Could the difference in optimization level be affecting the performance of the compiled HEF?

Finally, I have already submitted a request to access the DeGirum Cloud Compiler.

Thanks again for your help

Hey @ALEXFER,

Yes, higher optimization levels should give you better FPS from the HEF file.

Since you’re working with fixed surveillance cameras, focus on lighting variety and object diversity rather than background changes. Use 100-300 well-selected frames from actual production footage, including different times of day, object occlusion levels, motion blur, and varying object distances/sizes. Avoid using too many similar frames.

About grayscale input: Most Hailo networks expect 3-channel RGB input. If you’re using grayscale, make sure you’re either duplicating the single channel three times [gray, gray, gray] or have custom parsing for single-channel input. The key is keeping your calibration images, original model, and inference pipeline all consistently using the same format (either grayscale or 3-channel).
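
Here’s a minimal sketch of the channel-duplication approach, just with NumPy/OpenCV (the file path and input size are placeholders for your own setup). The same function should be applied both to your calibration images and to frames at inference time:

import cv2
import numpy as np

def to_three_channel(gray_frame, size=(640, 640)):
    """Resize a single-channel frame and stack it into 3 identical channels."""
    resized = cv2.resize(gray_frame, size)
    # [gray, gray, gray] -> (H, W, 3), matching a model trained on 3-channel input
    return np.stack([resized, resized, resized], axis=-1)

frame = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
model_input = to_three_channel(frame)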

For calibration, how should I determine if two images are too similar? For example, if an object moves slightly between frames but the background remains the same, would these images be considered redundant or can both be used? Also, does the Hailo optimization level affect only the FPS of the HEF file, or can it also influence the model’s accuracy? The logs mention that using optimization level 0 is not recommended for production and might reduce precision—could you clarify how it impacts performance?

Figuring out if your calibration images are too similar:

Think of it this way - if you’re showing the AI the same thing over and over, it’s not really learning anything new. When you have an object that barely moves against the same background, those images are basically twins from the AI’s perspective. The neural network sees nearly identical patterns and responds the same way to both.

The whole point of calibration is to show the system different scenarios so it can handle variety. If you feed it a bunch of nearly identical images, you’re not adding value - you’re just creating clutter in your dataset. It’s like studying for a test by reading the same page multiple times instead of covering different chapters.

So if your object only moved a tiny bit and everything else stayed the same - the lighting, background, size - just pick one frame and move on. You’ll get better results.
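
If you want something concrete to thin out near-duplicates, here is a rough sketch (not a Hailo tool, just NumPy/OpenCV) that keeps a frame only when it differs enough from the last kept frame; the threshold and folder are assumptions you would tune for your own footage:

import glob
import cv2
import numpy as np

def select_diverse_frames(frame_paths, diff_threshold=8.0):
    # Keep a frame only if its mean absolute difference from the last kept frame is large enough
    kept, last = [], None
    for path in sorted(frame_paths):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue
        img = cv2.resize(img, (320, 320)).astype(np.float32)
        if last is None or np.abs(img - last).mean() > diff_threshold:
            kept.append(path)
            last = img
    return kept

calib_frames = select_diverse_frames(glob.glob("calib_frames/*.png"))  # placeholder folder
print(f"kept {len(calib_frames)} frames")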

Hailo’s optimization levels are a speed vs. accuracy trade-off:

Here’s the thing people often miss - when you crank up the optimization for speed, you’re not just changing how fast it runs. You’re actually changing how well it works.

Level 0 is like putting your car in “sport mode” - sure, you’ll go faster, but you’re burning through your engine’s precision to get there. The compiler basically says “forget the fine-tuning, let’s just go fast!” This means your model might start making sloppier decisions.

As you move up to levels 1, 2, and 3, you’re telling the system “okay, I’ll take a speed hit if it means my results are more reliable.” Level 3 is the most careful - it takes its time to make sure the model stays sharp when it gets converted to run on the chip.

Why Level 0 is basically just for testing:

When Hailo throws up those warnings about Level 0, they’re not kidding around. Your model might look fast on paper, but it’s cutting corners on the stuff that actually matters for real-world performance. It’s skipping the careful calibration work that keeps your model making good decisions.

Think of Level 0 as a “proof of concept” setting - great for showing off raw speed numbers or troubleshooting, but not something you’d want to ship to customers who actually need accurate results.
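
For reference, the optimization level is set through the model script you pass to the Dataflow Compiler before quantization. This is only a sketch assuming the ClientRunner Python API from the DFC tutorials (the HAR path, the level, and the calibration array are placeholders; the same model_optimization_flavor command shows up in the reply below), so double-check it against the DFC docs for your SDK version:

import numpy as np
from hailo_sdk_client import ClientRunner

# HAR produced by the parsing/translation step (placeholder path)
runner = ClientRunner(har="my_model.har")

# Raise the optimization level; 0 is the quickest to run but gives the roughest quantization
runner.load_model_script("model_optimization_flavor(optimization_level=2)\n")

# Calibration set: preprocessed frames shaped like the model input (placeholder file)
calib_data = np.load("calib_set.npy")
runner.optimize(calib_data)

# Compile to HEF and save
hef = runner.compile()
with open("my_model.hef", "wb") as f:
    f.write(hef)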


Hi @ALEXFER
I had a very similar problem with our custom model: the output from the HEF file at optimization level 0 was pretty much random, with no relationship to the PyTorch output. As omria suggested, we had to go to the maximum optimization level

model_optimization_flavor(optimization_level=4)

to get reasonable results. The optimization took about 24 hours to complete on a high-end CPU, so it is not ideal for iterating over models.
During the compile step, we found we also had to max out the optimization:

performance_param(compiler_optimization_level=max)
allocator_param(timeout=100h)

One other thing I found useful is the set of suggestions in the Dataflow Compiler manual, section 5.3 Model Optimization. Basically, the idea is to run the model in the Python simulator after parsing, after optimization, and finally after compilation, using:

InferenceContext.SDK_NATIVE

InferenceContext.SDK_FP_OPTIMIZED

InferenceContext.SDK_QUANTIZED

The first two should give you pretty much identical results to your PyTorch / ONNX versions.
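
In case it helps, this is roughly how that comparison can be scripted with the DFC Python API (a sketch, assuming a single-output model; the HAR paths and sample batch are placeholders):

import numpy as np
from hailo_sdk_client import ClientRunner, InferenceContext

sample = np.load("sample_batch.npy")  # placeholder, preprocessed the same way as the calibration data

def run_in_context(har_path, ctx_type):
    runner = ClientRunner(har=har_path)
    with runner.infer_context(ctx_type) as ctx:
        return runner.infer(ctx, sample)

# After parsing: pure float emulation, should closely match the PyTorch / ONNX output
native_out = run_in_context("my_model_parsed.har", InferenceContext.SDK_NATIVE)

# After optimization: float emulation with the model-script changes applied
fp_out = run_in_context("my_model_optimized.har", InferenceContext.SDK_FP_OPTIMIZED)

# Quantized emulation: a big drop here points at calibration or optimization level
quant_out = run_in_context("my_model_optimized.har", InferenceContext.SDK_QUANTIZED)

print(np.abs(native_out - quant_out).max())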

Hope this helps.