Inference speed and batch size

Omer · October 23, 2024, 5:30am

Hi @roiyim,
Using a batch size will speed up the model when it's compiled into multiple contexts. That happens when the model is too large to fit the Hailo8 resources as a whole, so it is "broken down" into several contexts. Only one context is loaded onto the device at a time, while the others wait in the host platform's memory. With a batch, each loaded context can process several frames before the next one is swapped in, so the switching cost is spread across the whole batch.
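To make the effect concrete, here is a rough back-of-the-envelope sketch. It is not HailoRT code and the numbers are made up. The function name, the per-frame compute time and the per-context switch cost are all assumptions for illustration. It only models the idea that the context-switch overhead is paid once per batch instead of once per frame.

```python
def per_frame_latency_ms(num_contexts, batch_size,
                         compute_ms_per_frame, switch_ms_per_context):
    """Toy cost model (hypothetical, not measured on real hardware).

    A single-context model stays resident on the device, so it pays no
    switching cost. A multi-context model loads every context once per
    batch, so that overhead is shared by all frames in the batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if num_contexts <= 1:
        switch_cost_ms = 0.0
    else:
        switch_cost_ms = num_contexts * switch_ms_per_context
    total_ms = batch_size * compute_ms_per_frame + switch_cost_ms
    return total_ms / batch_size


if __name__ == "__main__":
    # Assumed values: 5 ms of compute per frame, 3 ms to load one context.
    for contexts in (1, 3):
        for batch in (1, 8):
            latency = per_frame_latency_ms(contexts, batch, 5.0, 3.0)
            print(f"contexts={contexts} batch={batch} -> {latency} ms/frame")
```

In this toy model the single-context case does not change with batch size, while the 3-context case improves noticeably, which matches the behaviour described above.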

It might be that the higher optimization level also applies more 4-bit compression, so one of the models is compiled into a single context, or into fewer contexts, than the other one. That could explain why batch size makes a bigger difference for one model than for the other.

Regards,