Hi, I am trying to figure out how to run multiple models on a single Hailo chip without context switching.
I have discovered two options: 1. compile two network groups, each containing one network, into a single HEF file, or 2. compile the two networks into a single network group that happens to fit into one context, which also comes out as a single HEF file.
I have some questions regarding each approach and the deployment of these HEF files.
If an HEF file has two network groups, and each network group is single-context, does that mean inference requires context switching regardless of the sizes of the networks in the HEF file?
Does the process of activating and deactivating each network group involve context switching?
If an HEF file has two networks compiled into a single network group (which also happens to fit into a single context), my understanding is that there should be no need to constantly activate and deactivate network groups, since there is only one. However, if the two networks are supposed to run at different rates, e.g. network A has to run 3 times for every run of network B, how should I configure the network vstreams? Can the vstreams run simultaneously at different rates without interfering with each other?
Compiling two networks into a single network group is not intended for simultaneous execution. Its primary purpose is pre- or post-processing of a network, followed by concatenation with the backbone of the original network. Running two networks concurrently may result in synchronization issues. I recommend compiling your networks into two separate HEF files and switching between them. The Model Scheduler then handles switching, activation, and deactivation of HEFs automatically and efficiently. Moreover, it allows you to manage model priorities at a high level. For detailed information, please refer to the HailoRT documentation in the Developer Zone on our site. Additionally, you can find an example demonstrating how to use and configure the Model Scheduler in the Hailo examples; kindly check the example named ‘switch_network_groups_example.cpp’.
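To make the setup concrete, here is a minimal sketch of what that looks like with the HailoRT C++ API, roughly along the lines of ‘switch_network_groups_example.cpp’; the HEF file names are placeholders and error handling is abbreviated, so treat this as an illustration rather than a drop-in implementation:

```cpp
// Sketch: two HEF files on one VDevice with the Model Scheduler enabled.
// "model_a.hef" / "model_b.hef" are placeholder names.
#include "hailo/hailort.hpp"
#include <iostream>

using namespace hailort;

int main()
{
    // Selecting a scheduling algorithm on the VDevice enables the Model Scheduler.
    hailo_vdevice_params_t params{};
    hailo_init_vdevice_params(&params);
    params.scheduling_algorithm = HAILO_SCHEDULING_ALGORITHM_ROUND_ROBIN;

    auto vdevice = VDevice::create(params);
    if (!vdevice) {
        std::cerr << "Failed to create VDevice" << std::endl;
        return vdevice.status();
    }

    // Configure both HEFs on the same VDevice; the scheduler will activate
    // and deactivate the network groups automatically as frames arrive.
    for (const auto &path : {"model_a.hef", "model_b.hef"}) {
        auto hef = Hef::create(path);
        if (!hef) { return hef.status(); }
        auto network_groups = vdevice.value()->configure(hef.value());
        if (!network_groups) { return network_groups.status(); }
        // ... build vstreams / inference pipelines per network group here ...
    }
    return HAILO_SUCCESS;
}
```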
Thank you for your response. I have some additional questions following your suggestions.
If two HEF files are being switched on a single Hailo chip, doesn’t that mean each model’s weights will be copied to the chip’s memory every time that model has to run? I think this would be similar to context switching?
I think it would be easier to communicate if I share my situation:
1. I have two independent models.
2. I only have one Hailo chip to run both models.
3. The two models’ weights are small enough to fit together in one context.
4. One model will run much less frequently than the other, i.e. model A should run most of the time, and model B should run once in a while. (Can the Model Scheduler handle such a situation?)
Thank you for your follow-up questions. I appreciate the opportunity to clarify.
You’re correct in noting that switching between HEF files involves transferring the model’s weights to the chip’s memory each time the model runs, akin to a form of context switching. However, the Model Scheduler provides effective control over this process. For instance, you can configure it to run Model A most of the time and only switch to Model B when there are, for example, 10 frames waiting for processing in the queue. This approach minimizes the overhead of switching and ensures a controlled interval between switches.
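As a sketch of how that threshold-based behavior can be configured: recent HailoRT releases expose set_scheduler_priority, set_scheduler_threshold, and set_scheduler_timeout on ConfiguredNetworkGroup, though you should verify against your version’s headers; the 10-frame threshold and 100 ms timeout below are just the example numbers from above:

```cpp
// Sketch (assumes the network groups were configured as in the earlier example):
// keep model A running most of the time, and batch up frames for model B.
#include "hailo/hailort.hpp"
#include <chrono>

using namespace hailort;

hailo_status configure_scheduler(ConfiguredNetworkGroup &model_a,
                                 ConfiguredNetworkGroup &model_b)
{
    // Model A should run most of the time: give it the highest priority.
    auto status = model_a.set_scheduler_priority(HAILO_SCHEDULER_PRIORITY_MAX);
    if (HAILO_SUCCESS != status) { return status; }

    // Only switch to model B once 10 frames are waiting in its queue...
    status = model_b.set_scheduler_threshold(10);
    if (HAILO_SUCCESS != status) { return status; }

    // ...or after 100 ms, so queued frames are not starved indefinitely.
    status = model_b.set_scheduler_timeout(std::chrono::milliseconds(100));
    return status;
}
```

The threshold keeps switches infrequent (amortizing the cost of loading model B’s weights over a batch of frames), while the timeout bounds the latency of model B’s results.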
It’s important to consider that the mapping of a network to the Hailo chip is more efficient and yields higher frames per second (fps) when there is ample free space on the chip. Loading two networks simultaneously might result in a less optimal mapping on the chip.
Additionally, when running two independent networks on the chip concurrently, there’s a risk of one network being inactive for a prolonged interval while the other is operational, potentially leading to errors.
Whether to use two separate HEF files or a single HEF file depends on the specifics of your models and requirements. Factors such as the number of layers, weight size, and PCIe congestion all play a role in determining the optimal configuration. In 99.9 percent of cases, the scheduler manages such scenarios efficiently.
It’s worth noting that context switching through the scheduler tends to be more efficient than manual switching, as it can initiate the switch before the last image exits the model, allowing inference and switching to overlap.