Software suite giving No available GPU from within docker

If I run these from within the docker container

lspci | grep "NVIDIA"

0001:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)

nvidia-smi
bash: nvidia-smi: command not found

And when I run optimize I get "No available GPU".

This is what I get from outside the docker container:

lspci | grep "NVIDIA"
0001:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)

nvidia-smi
Fri Jan 17 23:01:36 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           On  | 00000001:00:00.0 Off |                  Off |
| N/A   24C    P0              23W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

This is running on an Azure server.

Hi @rosslote,

Make sure you are using an Ubuntu 20/22 host machine when running the docker.
The Docker already contains the right CUDA and CUDNN for the specific version of the suite you are using. However, you still have to take care of two things:

  • Install Nvidia driver 525 outside the docker, to have better compatibility with CUDA 11.8 (used inside the docker).
  • Make sure to install nvidia-docker2, as explained in the Hailo AI SW Suite User Guide

My host is Ubuntu 20.04 but the driver is 535. It seems I have no control over that, as this is an Azure ML instance. I’ve tried uninstalling it and reinstalling the correct driver, but it just gives me problems.

I have installed nvidia-docker2 but that doesn’t seem to change anything.

@rosslote did you install nvidia-docker2 before or after the Hailo Docker?
Please do the following:

  • Stick with Nvidia driver 535, if you cannot modify it. Make sure the GPU is working correctly outside the docker.
  • Make sure you followed these instructions to install nvidia-docker2
  • Delete the docker container and create a new one
  • Look at the commands in the hailo_ai_sw_suite_docker_run.sh script which check for GPU availability, as explained here: GPU not detected in Hailo Suite - #5 by klausk
    The script uses two internal variables, NVIDIA_GPU_EXIST and NVIDIA_DOCKER_EXIST, to decide whether to enable GPU support the first time the docker container is launched.
    From within the new docker, you should be able to get an output when running the following commands.
    NVIDIA_GPU_EXIST corresponds to the output of:
    lspci | grep "VGA compatible controller: NVIDIA"
    
    NVIDIA_DOCKER_EXIST corresponds to the output of:
    dpkg -l | grep 'nvidia-docker\|nvidia-container-toolkit'
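
    As a rough sketch of how the two checks above could combine: the helper name gpu_support_enabled is hypothetical, and the assumption that both outputs must be non-empty is mine, not copied from hailo_ai_sw_suite_docker_run.sh.

```shell
# Hypothetical helper, not taken from hailo_ai_sw_suite_docker_run.sh.
# $1: output of the lspci check, $2: output of the dpkg check.
# Assumption: GPU support is enabled only when both outputs are non-empty.
gpu_support_enabled() {
    [ -n "$1" ] && [ -n "$2" ]
}

# On a real host the two arguments would be filled like this:
#   NVIDIA_GPU_EXIST=$(lspci | grep "VGA compatible controller: NVIDIA")
#   NVIDIA_DOCKER_EXIST=$(dpkg -l | grep 'nvidia-docker\|nvidia-container-toolkit')
#   gpu_support_enabled "$NVIDIA_GPU_EXIST" "$NVIDIA_DOCKER_EXIST" && echo "GPU support on"
```

    If either command prints nothing, the helper returns non-zero, which would match a container being created without GPU support.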
    

I gave up in the end and used the DFC directly. I think it may be something to do with the GPU we were trying to use.

Hi @rosslote,
Glad that the GPU works when using the DFC directly.

FYI, the reason why it did not work with the Docker is probably related to the way your GPU is listed by lspci:

lspci | grep "NVIDIA"

0001:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)

Indeed, in order to detect the GPU correctly, the docker run script performs the following checks the first time it is launched:

lspci | grep "VGA compatible controller: NVIDIA"
dpkg -l | grep 'nvidia-docker\|nvidia-container-toolkit'

This means that:

  • nvidia-docker2 must be installed before creating the container
  • the GPU device must appear as a “VGA compatible controller”. Your device - listed as a “3D controller” - was filtered out by the grep "VGA compatible controller: NVIDIA" command.

This is under investigation and will be fixed in the next releases.

If you want to use the GPU in the docker, you can modify the grep command in the docker script to look for both “VGA compatible controller: NVIDIA” and “3D controller: NVIDIA”. Then you can create a new container.
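
A minimal sketch of such a widened check, assuming an extended-regex grep is acceptable in the script. The function name has_nvidia_gpu is hypothetical and is not part of the Hailo script; it reads lspci-style text from stdin so it can be tried without touching the script itself.

```shell
# Hypothetical helper: succeeds if the lspci-style text on stdin lists an
# NVIDIA device as either a "VGA compatible controller" or a "3D controller".
has_nvidia_gpu() {
    grep -Eq '(VGA compatible controller|3D controller): NVIDIA'
}

# Usage on a real host:
#   lspci | has_nvidia_gpu && echo "NVIDIA GPU detected"
```

With the Tesla V100 line shown above ("3D controller: NVIDIA Corporation GV100GL ..."), this pattern matches, whereas the original "VGA compatible controller: NVIDIA" pattern does not.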

Thanks, I’ll try that next time. Although, I’m quite sure I tried modifying the grep to pick it up and it caused it to break in other ways. I don’t have the results available to me right now though, sorry.