I’m working with a setup that includes two or three Hailo-8 M.2 AI Acceleration modules, and I have a couple of questions regarding their capabilities:
Is it possible to run different neural network models on separate Hailo accelerators? For example, having one model run on one Hailo-8 module and another model on a different module simultaneously?
Can I distribute the workload of a single model across multiple Hailo accelerators? Specifically, is there a way to offload part of the neural network (e.g., some layers or weights) to one accelerator and the rest to another to achieve sequential acceleration?
I would appreciate any insights or examples on how to configure and optimize the use of multiple Hailo-8 modules for these purposes.
Running Multiple Models on Hailo-8 M.2 AI Acceleration Modules
Hailo-8 M.2 AI acceleration modules support both scenarios: running different neural network models concurrently on separate modules, and splitting a single model's workload across multiple modules. Here's how:
1. Running Different Models on Separate Hailo-8 Modules
Hailo’s architecture allows execution of different neural network models on separate accelerators concurrently.
Key Components:
HailoRT framework
Multi-Process Service
Model Scheduler
Implementation:
Open each module by its device ID and load one model (HEF) per module
Use the Multi-Process Service when several processes need to share the same device
Enable the Model Scheduler to let HailoRT manage switching between models automatically
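The steps above can be sketched with HailoRT's Python API (`hailo_platform`). This is an illustrative sketch, not a drop-in implementation: the HEF file names and PCIe device IDs are placeholders, and the `device_ids` field on the VDevice parameters is assumed (field names can vary between HailoRT versions).

```python
# Sketch: run a different model on each Hailo-8 module, one process per module.
# Assumes HailoRT's Python package (hailo_platform); HEF paths and device IDs
# below are placeholders -- take real IDs from `hailortcli scan`.
from multiprocessing import Process

from hailo_platform import (HEF, VDevice, ConfigureParams, HailoStreamInterface,
                            InferVStreams, InputVStreamParams, OutputVStreamParams)

def run_model(hef_path, device_id, frames):
    params = VDevice.create_params()
    params.device_ids = [device_id]  # pin this process to one module (assumed field name)
    with VDevice(params) as target:
        hef = HEF(hef_path)
        cfg = ConfigureParams.create_from_hef(hef=hef, interface=HailoStreamInterface.PCIe)
        network_group = target.configure(hef, cfg)[0]
        in_params = InputVStreamParams.make(network_group)
        out_params = OutputVStreamParams.make(network_group)
        with network_group.activate():
            with InferVStreams(network_group, in_params, out_params) as pipeline:
                for frame in frames:
                    results = pipeline.infer(frame)
                    # ... post-process results ...

# Each process owns one module and one model, so both run truly in parallel:
p1 = Process(target=run_model, args=("detector.hef", "0000:01:00.0", frames_a))
p2 = Process(target=run_model, args=("classifier.hef", "0000:02:00.0", frames_b))
p1.start(); p2.start()
p1.join(); p2.join()
```

Using separate processes (rather than threads) keeps each module's driver state isolated; if several processes need the *same* module, that is where the Multi-Process Service comes in.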
2. Distributing a Single Model Across Multiple Accelerators
A single network can be split into parts at compile time, and HailoRT can then run those parts on different accelerators as a cascade, giving a form of model parallelism.
Key Features:
Sequential acceleration
Model Scheduler
Cascaded Networks Structure (TAPPAS framework)
Implementation:
Split the model at compile time so that each resulting HEF targets one device
Chain the parts with TAPPAS cascaded-network pipelines for flexible workload distribution
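As a sketch of the cascaded approach, assuming the network has already been split into two HEFs at compile time (e.g. with the Hailo Dataflow Compiler): the file names, device IDs, input-layer name, and the `device_ids` parameter field are all placeholders/assumptions, and in practice you may need to remap part A's output names onto part B's input names.

```python
# Sketch: cascade two compiled halves of one network across two Hailo-8 modules.
# model_part_a.hef / model_part_b.hef are placeholder names for a model split
# at compile time; device IDs come from `hailortcli scan`.
import numpy as np

from hailo_platform import (HEF, VDevice, ConfigureParams, HailoStreamInterface,
                            InferVStreams, InputVStreamParams, OutputVStreamParams)

def make_stage(target, hef_path):
    # Configure one compiled model part on the given device.
    hef = HEF(hef_path)
    cfg = ConfigureParams.create_from_hef(hef=hef, interface=HailoStreamInterface.PCIe)
    ng = target.configure(hef, cfg)[0]
    return ng, InferVStreams(ng, InputVStreamParams.make(ng), OutputVStreamParams.make(ng))

params_a = VDevice.create_params()
params_a.device_ids = ["0000:01:00.0"]  # module for the first half (assumed field name)
params_b = VDevice.create_params()
params_b.device_ids = ["0000:02:00.0"]  # module for the second half

with VDevice(params_a) as dev_a, VDevice(params_b) as dev_b:
    ng_a, pipe_a = make_stage(dev_a, "model_part_a.hef")
    ng_b, pipe_b = make_stage(dev_b, "model_part_b.hef")
    with ng_a.activate(), ng_b.activate(), pipe_a as stage_a, pipe_b as stage_b:
        frame = {"input_layer1": np.zeros((1, 224, 224, 3), dtype=np.float32)}
        mid = stage_a.infer(frame)  # first half on module A
        # Assumes part B's input names match part A's output names;
        # otherwise rename the keys of `mid` before this call.
        out = stage_b.infer(mid)    # second half on module B
```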
3. Sequential Acceleration
This approach runs consecutive portions of the network on different devices as a pipeline: each module processes its part of the model and passes the intermediate results to the next.
Key Components:
Stream Multiplexer
Model Scheduler Optimizations
Benefits:
Higher aggregate throughput, since every module works on a different frame at the same time
Better utilization of each device
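The throughput benefit of such a pipeline can be illustrated in plain Python, with each threaded stage standing in for one Hailo-8 module running its portion of the model (the `work` callables are stand-ins for per-device inference):

```python
# Illustration only: a two-stage pipeline where each stage stands in for one
# Hailo-8 module. While stage 2 processes frame N, stage 1 already works on
# frame N+1, which is where the throughput gain comes from.
import queue
import threading

def stage(in_q, out_q, work):
    # Consume items until the None sentinel, forward results downstream.
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)
            break
        out_q.put(work(item))

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
t1 = threading.Thread(target=stage, args=(q_in, q_mid, lambda x: x + 1))   # "first half"
t2 = threading.Thread(target=stage, args=(q_mid, q_out, lambda x: x * 2))  # "second half"
t1.start(); t2.start()

for i in range(4):
    q_in.put(i)
q_in.put(None)  # sentinel: no more frames

results = []
while (item := q_out.get()) is not None:
    results.append(item)
t1.join(); t2.join()
# results == [2, 4, 6, 8]  -- (i + 1) * 2 for i in 0..3, order preserved by the queues
```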
Configuration
To achieve these setups:
Use the HailoRT VDevice API to group modules and assign models to devices
Enable HailoRT features such as the Model Scheduler, Multi-Process Service, and Stream Multiplexer where they help
Verify the setup with hailortcli before wiring up the application
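A few hailortcli commands are useful when bringing such a setup online. The `--device-id` flag on `run` is an assumption here; confirm the exact flag with `hailortcli run --help` on your installed version.

```shell
# List the Hailo devices HailoRT can see (one entry per M.2 module)
hailortcli scan

# Query the firmware/identity of the connected module(s)
hailortcli fw-control identify

# Quick sanity benchmark of a compiled model on one specific module
# (device ID taken from the scan output; flag name assumed)
hailortcli run detector.hef --device-id 0000:01:00.0
```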
This approach allows for efficient utilization of multiple Hailo accelerators, whether running different models simultaneously or distributing a single model’s workload.