Argmax 16bit output

Hi!
I am using a Hailo-8 and I am having a problem forcing a 16-bit output on an argmax layer.
I have a dense layer that produces a tensor with 5000 values, and argmax finds the maximum along that dimension.
The problem is that compilation works, but inference crashes with an error saying that an 8-bit output is not enough.
I tried to force a 16-bit output on the argmax layer (I read its name from the Hailo profiler), but nothing changed. I used a model script with a quantization param.
It finds the node and compiles fine, but the connection from argmax to the output node remains 8-bit, although the argmax layer itself seems to have changed from 8-bit to 16-bit.
Is it possible to use a 16-bit output here? How?
Thank you

Hi @Luca_Gessi
Welcome to the Hailo community.
Is the argmax layer towards the end of the model, or is it somewhere in the middle? If it is towards the end, you can try specifying the preceding node as input and running argmax on the CPU, as it is not a very expensive op.
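
As a rough illustration of the CPU route suggested above, the sketch below runs argmax directly on a raw quantized tensor. The tensor contents, `scale`, and `zero_point` are made-up placeholders, not values read from any real model. It relies only on the fact that dequantization with a positive scale preserves ordering, so the index can be taken without dequantizing first, and on the host the index is not limited to 8 bits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw (quantized) output of the layer feeding argmax: 5000 uint8 values.
raw = rng.integers(0, 256, size=5000, dtype=np.uint8)

# Hypothetical quantization parameters; real ones would come from the compiled model.
scale, zero_point = 0.05, 12.0
dequantized = (raw.astype(np.float32) - zero_point) * scale

# A positive scale keeps the ordering, so both arrays share the same first maximum.
idx_raw = int(np.argmax(raw))
idx_deq = int(np.argmax(dequantized))
print(idx_raw, idx_deq, idx_raw == idx_deq)
```

Whether this is fast enough depends entirely on the required rate, which is discussed further down the thread.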

Hi Shashi.
Thank you for your response.
Yes, it’s true that it could be done on the CPU, but the CPU in question is an ARM one that I would like to keep as free as possible for other tasks. We have to achieve a very high FPS, so I want to do as much as I can on the Hailo-8.
The argmax is the last layer of the network.
Is this a limitation of the Hailo-8? It seems strange to be limited to just 256 output values.
Do you need any other information?
Can you confirm that argmax can work with 16-bit output data?
Thank you

Hi @Luca_Gessi
I think someone from the Hailo team can answer whether argmax can work with 16-bit output data. I was just suggesting this as an alternative, but it looks like it is not viable in your case.

Hi @Luca_Gessi,

Could you clarify what you mean by the inference crashing? It would be helpful to have the error messages too.

Hi nina-vilela.
Below is the output from when I tried to run the model:

[HailoRT] [error] CHECK failed - Output format type UINT8 can’t represent possible range 5000 for Argmax op
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_OPERATION(6)
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_OPERATION(6)
[HailoRT] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_OPERATION(6)
[HailoRT CLI] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_OPERATION(6)
[HailoRT CLI] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_OPERATION(6)
[HailoRT CLI] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_OPERATION(6) - Error failed running inference
[HailoRT CLI] [error] CHECK_SUCCESS failed with status=HAILO_INVALID_OPERATION(6) - Error while running inference
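
The first error line in the log above comes down to simple range arithmetic: an argmax over 5000 entries can return any index from 0 to 4999, which does not fit in an unsigned 8-bit value. The short sketch below only illustrates that arithmetic with NumPy's type limits; it does not call or model HailoRT in any way.

```python
import numpy as np

num_classes = 5000           # size of the axis that argmax reduces over (from the thread)
max_index = num_classes - 1  # largest index argmax can return

for dtype in (np.uint8, np.uint16):
    limit = np.iinfo(dtype).max
    fits = max_index <= limit
    print(f"{np.dtype(dtype).name}: max value {limit}, can hold index {max_index}: {fits}")
```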

After this I tried to force a 16-bit argmax output, but it did not work. It seems that even with the model script, the output of argmax remains 8-bit.
I am using HailoRT version 4.21.0 on the target board.
Thank you

@nina-vilela , do you have any ideas?

I sent you a pm, please check your inbox :slight_smile:


@Luca_Gessi,
I want to stress again the point that @shashi raised earlier. We’ve just run a naive test to measure the time it takes to execute argmax on a Raspberry Pi 5. It took 0.48 sec to run 10k iterations on a 5000-element tensor. The effect in htop was barely noticeable.

This is the code I’ve used:

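#!/usr/bin/env python3

import time

import numpy as np

# Time 10,000 argmax calls, each on a fresh 5000-element vector.
# Note: generating the random vector is inside the timed loop.
s = time.time()
for i in range(10000):
    vector = np.random.rand(5000)
    argmax_index = np.argmax(vector)

e = time.time()
print(f"Time took: {e-s:.2f} sec")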


If this is what it takes, is this still meaningful?
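
A variant of this measurement that may be closer to a batched workload is sketched below. It is an assumption-laden sketch, not a result: the batch size of 128, the `float32` dtype, and the repeat count are chosen here for illustration. It generates the data once outside the timed region and reduces a whole batch with a single `np.argmax` call along the class axis.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
batch, classes, repeats = 128, 5000, 100  # illustrative sizes, not from a real pipeline

# Generate the data once so that only argmax is timed.
scores = rng.random((batch, classes), dtype=np.float32)

start = time.perf_counter()
for _ in range(repeats):
    indices = np.argmax(scores, axis=1)  # one index per row of the batch
elapsed = time.perf_counter() - start

print(f"{elapsed / (repeats * batch) * 1e6:.2f} us per argmax")
```

The printed figure depends on the machine, so no expected value is given here.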

Hi Nadav.
Thank you for your response.
Based on your numbers: 0.48 sec / 10K = 48 µs per argmax, which is around 20K FPS. That is not enough for our application, hence the idea of using the accelerator.
I would like to use the Hailo as much as possible because we have 2 algorithms to compute.
The first one is similar to resnet50, and it has to achieve around 2500 FPS with 128x128-pixel images.
The second one has to run around 256 times for each image, so 2500 * 128 = 320000 FPS.
I think that using the CPU is not feasible. Maybe I am wrong.
I have done some tests for the second network. With batch size 63, without argmax, it takes 0.02 ms to compute. Argmax done in HW should be really fast.
Thank you
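
The arithmetic in this post can be written out as a small calculation. It uses only the figures quoted in the thread (0.48 sec for 10,000 iterations, 2500 images per second, 128 runs per image) and assumes the single-threaded benchmark rate is representative, which may not hold on a different CPU or with batched calls.

```python
measured_total_s = 0.48   # reported time for the whole benchmark loop
iterations = 10_000       # number of argmax calls in that loop

per_call_s = measured_total_s / iterations
cpu_rate = 1.0 / per_call_s            # argmax calls per second on the CPU

images_per_s = 2500
runs_per_image = 128
required_rate = images_per_s * runs_per_image

print(f"per call: {per_call_s * 1e6:.0f} us")
print(f"CPU rate: {cpu_rate:,.0f} argmax/s")
print(f"required: {required_rate:,} argmax/s")
print(f"required / CPU rate: {required_rate / cpu_rate:.2f}x")
```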

OK,
So the application that you’re looking to implement consumes a stream of 128x128 images/patches at 2500 FPS, and then runs 256 argmax operations (is the output tensor 16x16x5000?).
This is quite an extreme use case, not similar to anything that we’ve tried before. I am not sure if this is feasible on a single Hailo-8 device.

128, not 256.
We have to try, maybe it works :slight_smile:
I think that the first results are promising. The resnet50 is just an example from our first tests; we could try to simplify it if we need to.
Thank you

Ok, I understand. So we need to find a way to reduce the length of the Argmax. As you can see in the user guide, the current implementation of Argmax supports up to 64 features.

Maybe there’s a way to reduce that number after the ‘resnet50’ phase?

The main problem is that compilation cannot convert the argmax output to 16-bit, so the Hailo crashes at runtime because it cannot represent 5000 values with an 8-bit output. See above. As soon as I can, I will share the files with @nina-vilela.
For now, thank you for your assistance.

I understand now what you are saying. You mean that part of the problem I described is that argmax can handle only 64 values (features), not the whole 5000?

Yes, this is the main caveat, our current implementation can handle up to 64 values and not the whole 5000.

If you are sure about that and nothing can be done, I think that for now we cannot use the Hailo-8. I hope that future SW releases (if it is a SW limitation) will unlock this.

Just to understand, and to try to work around this limitation:
If we design a “macro” argmax with 64 argmax layers in parallel, with their outputs feeding another argmax, would that be feasible?
64*64=4096.
Thank you
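
The "macro" argmax idea can be checked numerically with a small NumPy sketch. It says nothing about whether such a structure can be compiled for the device; `two_stage_argmax` is a hypothetical helper written for this illustration. Two points it makes visible: the first stage has to pass on each group's maximum value as well as its local index, and with groups of 64 a 5000-element vector needs 79 groups, so the second stage would itself exceed 64 inputs.

```python
import numpy as np


def two_stage_argmax(x: np.ndarray, group: int = 64) -> int:
    """Argmax of a 1-D array via per-group argmax, then argmax over the group maxima."""
    n = x.shape[0]
    n_groups = -(-n // group)  # ceiling division
    # Pad with the lowest representable value so padding can never win.
    low = -np.inf if np.issubdtype(x.dtype, np.floating) else np.iinfo(x.dtype).min
    padded = np.full(n_groups * group, low, dtype=x.dtype)
    padded[:n] = x
    blocks = padded.reshape(n_groups, group)

    local_idx = blocks.argmax(axis=1)      # stage 1: winning position inside each group
    local_max = blocks.max(axis=1)         # stage 1 must also forward the winning value
    best_group = int(local_max.argmax())   # stage 2: winning group
    return best_group * group + int(local_idx[best_group])


rng = np.random.default_rng(0)
x = rng.random(5000).astype(np.float32)
print(two_stage_argmax(x) == int(np.argmax(x)))
print("groups needed for 5000 values:", -(-5000 // 64))
```

Because both stages pick the first maximum, ties resolve the same way as a flat `np.argmax`.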