Hi @Park_Hanyu,
Welcome to the Hailo Community!
There are two separate networks, one for text encoding (used for encoding the prompts) and another for image encoding. The inference results from both are then used to get the probabilities for the prompts.