It’s important to add the following extensions to your talconfig.yaml for bootstrap:
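A minimal sketch of such an entry, assuming talconfig.yaml follows the talhelper schema and that the two NVIDIA extensions named in the Talos documentation are the ones required. The extension names are an assumption (some Talos releases use suffixed variants such as `-production` or `-lts`), and the scratch file name is purely illustrative:

```shell
# Write the fragment to a scratch file (hypothetical name) so that an existing
# talconfig.yaml is not overwritten; merge it into the GPU node's entry by hand.
cat > talconfig-extensions.example.yaml <<'EOF'
# Goes under the GPU node in talconfig.yaml (talhelper schema assumed).
schematic:
  customization:
    systemExtensions:
      officialExtensions:
        # Assumed names; check the extension list for your Talos release.
        - siderolabs/nonfree-kmod-nvidia
        - siderolabs/nvidia-container-toolkit
EOF
```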
GPU Talos Patch
Additionally, you will need to create the following patch file gpu.yaml in the patches folder of clustertool:
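As a hedged sketch, a patch along the lines of the Talos NVIDIA documentation could look like the following. The module list and sysctl are taken from that documentation, not from this guide, so the patch intended here may contain further settings:

```shell
# Create the patches folder if needed and write the patch file.
mkdir -p patches
cat > patches/gpu.yaml <<'EOF'
# Load the NVIDIA kernel modules at boot (assumed content, see note above).
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
EOF
```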
This patch file then needs to be added to the talconfig.yaml:
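A possible way to reference the file, assuming the talhelper convention of prefixing file paths with `@`. Whether the entry belongs under a single node, under `worker`, or under `controlPlane` depends on where the GPU sits; the scratch file name is illustrative:

```shell
# Write the fragment to a scratch file (hypothetical name); merge it by hand.
cat > talconfig-patches.example.yaml <<'EOF'
# Goes under the GPU node (or under worker/controlPlane) in talconfig.yaml.
patches:
  - "@./patches/gpu.yaml"
EOF
```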
Adding it to your cluster
If it’s a fresh bootstrap, you can simply follow the clustertool guide on how to bootstrap your cluster.
If it is an existing cluster, you will need to run clustertool talos upgrade to add the extensions and clustertool talos apply to add the patch.
Testing
Run the following commands and check that the outputs shown appear in your command output:
Check the Modules
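A sketch of the module check, assuming `talosctl` is already configured for the cluster. `NODE_IP` and its default address are placeholders for your GPU node:

```shell
# Placeholder address; replace with the IP of your GPU node.
NODE_IP="${NODE_IP:-192.168.1.10}"

# List the loaded kernel modules and keep only the NVIDIA ones.
talosctl -n "$NODE_IP" read /proc/modules | grep nvidia \
  || echo "no nvidia modules listed on $NODE_IP"
```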
Output (the numbers and hex values may be different):
Check the Extensions
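A sketch of the extension check under the same assumptions (`talosctl` configured, `NODE_IP` a placeholder for your GPU node):

```shell
# Placeholder address; replace with the IP of your GPU node.
NODE_IP="${NODE_IP:-192.168.1.10}"

# Show the system extensions installed on the node; the NVIDIA ones should be listed.
talosctl -n "$NODE_IP" get extensions \
  || echo "could not query extensions on $NODE_IP"
```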
Output (the numbers and hex values may be different):
Read Driver Version
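A sketch of reading the driver version, again assuming a configured `talosctl` and a placeholder `NODE_IP`:

```shell
# Placeholder address; replace with the IP of your GPU node.
NODE_IP="${NODE_IP:-192.168.1.10}"

# The file only exists once the NVIDIA kernel module is loaded.
talosctl -n "$NODE_IP" read /proc/driver/nvidia/version \
  || echo "driver version not readable on $NODE_IP"
```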
Output (the numbers and hex values may be different):
Testing the GPU
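One way to run such a test is a throwaway pod that calls `nvidia-smi`, in the style of the Talos NVIDIA documentation. The pod name and the CUDA image tag are assumptions; pick an image that matches your driver version. If your cluster selects the NVIDIA runtime through a RuntimeClass instead of the containerd default, the pod additionally needs that runtime class set:

```shell
# Assumed image tag; any CUDA base image that ships nvidia-smi should work.
IMAGE="nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04"

# Start a one-off pod, run nvidia-smi in it, and remove the pod afterwards.
kubectl run nvidia-test --restart=Never -ti --rm \
  --image "$IMAGE" \
  -- nvidia-smi \
  || echo "GPU test pod did not complete successfully"
```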
Output (the warning about PodSecurity can be ignored):
Nvidia-Device-Plugin
If all of the previous tests were successful, your GPU is ready to be used with the Nvidia-Device-Plugin.
An example helm-release.yaml can be seen below:
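A hedged sketch of such a release, assuming Flux manages the cluster and the NVIDIA device plugin chart is used with time-slicing. The API version, namespaces, interval, and scratch file name are assumptions; `replicas: 5` is chosen so that the GPU can be shared by up to 5 workloads:

```shell
# Write the example to a scratch file (hypothetical name) for review.
cat > helm-release.example.yaml <<'EOF'
apiVersion: helm.toolkit.fluxcd.io/v2   # older Flux versions use v2beta2
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: nvidia-device-plugin       # assumed namespace
spec:
  interval: 15m
  chart:
    spec:
      chart: nvidia-device-plugin
      sourceRef:
        kind: HelmRepository
        name: nvdp
        namespace: flux-system          # assumed namespace
  values:
    config:
      default: "default"
      map:
        default: |-
          version: v1
          sharing:
            timeSlicing:
              resources:
                - name: nvidia.com/gpu
                  replicas: 5
EOF
```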
Don’t forget to add the required repository file nvdp.yaml to the repositories/helm folder and to reference it in the corresponding kustomization.yaml:
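A sketch of the repository definition, assuming a Flux `HelmRepository` in the `flux-system` namespace. The API version, namespace, and interval are assumptions; the URL is the public chart repository of the NVIDIA device plugin:

```shell
# Create the repository definition next to the other Helm repositories.
mkdir -p repositories/helm
cat > repositories/helm/nvdp.yaml <<'EOF'
apiVersion: source.toolkit.fluxcd.io/v1   # older Flux versions use v1beta2
kind: HelmRepository
metadata:
  name: nvdp
  namespace: flux-system                  # assumed namespace
spec:
  interval: 2h
  url: https://nvidia.github.io/k8s-device-plugin
EOF

# The kustomization.yaml in the same folder then needs an entry such as
# "- nvdp.yaml" under its "resources:" list.
```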
Example of GPU Assignment
The following shows an example of how to add the GPU to a chart. Depending on the chart, you may need to adapt the workload name.
If you followed this guide, the GPU can be assigned to up to 5 different charts.
The number 1 always stays the same; it is not increased for a second chart that uses the GPU.
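A sketch of what such an assignment might look like in a chart's values, assuming the chart exposes container resources under a `workload` key. The key layout, the workload and container name `main`, and the scratch file name are assumptions; the resource name `nvidia.com/gpu` is the one advertised by the NVIDIA device plugin:

```shell
# Write the values fragment to a scratch file (hypothetical name).
cat > gpu-values.example.yaml <<'EOF'
# Goes into the values of the chart that should use the GPU.
workload:
  main:                      # workload name; may differ per chart
    podSpec:
      containers:
        main:
          resources:
            limits:
              nvidia.com/gpu: 1   # stays 1 for every chart
EOF
```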
Troubleshooting
If all the extensions, modules, and the driver version are present but the GPU test shows something similar to:
then the patch wasn’t added properly. This can be fixed by manually adding the patch with the following command:
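A hedged sketch of applying the patch by hand with `talosctl`, assuming the patch file lives at patches/gpu.yaml and `NODE_IP` (a placeholder) points at the GPU node. The exact command intended by this guide may differ:

```shell
# Placeholder address; replace with the IP of your GPU node.
NODE_IP="${NODE_IP:-192.168.1.10}"

# Patch the node's machine config with the contents of the patch file.
talosctl -n "$NODE_IP" patch machineconfig --patch @patches/gpu.yaml \
  || echo "patching $NODE_IP failed"
```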
This should fix the error, and the GPU test should then display the desired output.