
Adding an Nvidia-GPU to your Cluster

Prerequisites

  • Having your GPU isolated
  • Having passed the GPU through to your Talos machine

Extensions

It's important to add the following extensions to your talconfig.yaml for bootstrap:

schematic:
  customization:
    systemExtensions:
      officialExtensions:
        - siderolabs/nonfree-kmod-nvidia-lts
        - siderolabs/nvidia-container-toolkit-lts
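
Where exactly this schematic block lives depends on your talconfig.yaml layout. A minimal sketch, assuming a talhelper-style per-node entry; the hostname, IP address and install disk are placeholders:

nodes:
  - hostname: gpu-node-01        # placeholder hostname of the GPU node
    ipAddress: 192.168.178.9     # placeholder IP
    installDisk: /dev/sda        # placeholder install disk
    schematic:
      customization:
        systemExtensions:
          officialExtensions:
            - siderolabs/nonfree-kmod-nvidia-lts
            - siderolabs/nvidia-container-toolkit-lts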

GPU Talos Patch

Additionally, you will need to create the following patch file gpu.yaml in the patches folder of clustertool:

- op: add
  path: /machine/kernel
  value:
    modules:
      - "name": "nvidia"
      - "name": "nvidia_uvm"
      - "name": "nvidia_drm"
      - "name": "nvidia_modeset"
- op: add
  path: /machine/sysctls
  value:
    "net.core.bpf_jit_harden": 1

This patch file then needs to be added to the talconfig.yaml:

patches:
  - "@./patches/gpu.yaml"

Adding it to your cluster

If it's a fresh bootstrap, you can simply follow the clustertool guide on how to bootstrap your cluster. If it is an existing cluster, you will need to run clustertool talos upgrade to add the extensions and clustertool talos apply to add the patch, as shown below.
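
A short sketch of the two commands for an existing cluster (exact flags and prompts may differ between clustertool versions):

clustertool talos upgrade
clustertool talos apply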

Testing

Run the following commands and check that the shown outputs appear in your command output:

Check the Modules

talosctl read /proc/modules

Output (the numbers and hex values may be different):

nvidia_uvm 1884160 - - Live 0xffffffffc3f79000 (PO)
nvidia_drm 94208 - - Live 0xffffffffc42f6000 (PO)
nvidia_modeset 1531904 - - Live 0xffffffffc4159000 (PO)
nvidia 62754816 - - Live 0xffffffffc039e000 (PO)

Check the Extensions

talosctl get extensions

Output (the node IP and the versions may be different):

192.168.178.9 runtime ExtensionStatus 3 1 nonfree-kmod-nvidia-lts 535.183.06-v1.8.4
192.168.178.9 runtime ExtensionStatus 4 1 nvidia-container-toolkit-lts 535.183.06-v1.16.1

Read Driver Version

talosctl read /proc/driver/nvidia/version

Output (the driver version and build date may be different):

NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.183.06 Wed Jun 26 06:46:07 UTC 2024
GCC version: gcc version 13.3.0 (GCC)

Testing the GPU

kubectl run \
nvidia-test \
--restart=Never \
-ti --rm \
--image nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04 \
--overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
nvidia-smi

Output (the warning about PodSecurity can be ignored):

Mon Dec 16 12:54:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1660 ... On | 00000000:00:0A.0 Off | N/A |
| 24% 26C P8 5W / 125W | 1MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
pod "nvidia-test" deleted

Nvidia-Device-Plugin

If all of the previous tests were successful, your GPU is ready to be used with the Nvidia-Device-Plugin. An example helm-release.yaml can be seen below:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: nvidia-device-plugin
spec:
  interval: 5m
  chart:
    spec:
      # renovate: registryUrl=https://charts.truechartsoci.org
      chart: nvidia-device-plugin
      version: 0.17.0
      sourceRef:
        kind: HelmRepository
        name: nvdp
        namespace: flux-system
      interval: 5m
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    image:
      repository: nvcr.io/nvidia/k8s-device-plugin
      tag: v0.17.0
    runtimeClassName: nvidia
    config:
      map:
        default: |-
          version: v1
          flags:
            migStrategy: none
          sharing:
            timeSlicing:
              renameByDefault: false
              failRequestsGreaterThanOne: false
              resources:
                - name: nvidia.com/gpu
                  replicas: 5
    gfd:
      enabled: true

Don’t forget to add the required repository nvdp.yaml into the repositories/helm folder and to add it to the required kustomization.yaml:

---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: nvdp
  namespace: flux-system
spec:
  interval: 1h
  url: https://nvidia.github.io/k8s-device-plugin
  timeout: 3m
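
A minimal sketch of a matching kustomization.yaml entry; the relative path is an assumption based on your repository layout:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./nvdp.yaml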

Example of GPU Assignment

The following shows an example of how to add the GPU to a chart. Depending on the chart, you may need to adapt the workload name.

resources:
  limits:
    nvidia.com/gpu: 1
workload:
  main:
    podSpec:
      runtimeClassName: "nvidia"

If you followed this guide, the GPU can be assigned to up to 5 different charts, since time-slicing was configured with 5 replicas. The limit nvidia.com/gpu: 1 always stays the same and is not increased for a second chart using the GPU; a quick check is shown below.
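
One way to verify this is to look at the node's allocatable resources once the device plugin is running; with the time-slicing configuration above, nvidia.com/gpu should be reported as 5 (the node name is a placeholder):

kubectl describe node gpu-node-01 | grep nvidia.com/gpu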

Troubleshooting

If all the extensions and modules are present and the driver version can be read, but testing the GPU shows something similar to:

failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to generate new device filter program from existing programs: unable to create new device filters program: load program: invalid argument: last insn is not an exit or jmp
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

then the patch wasn't added properly. This can be fixed by manually applying the patch with the following command:

talosctl patch mc --patch @gpu.yaml

This should fix the error, and the GPU test should then display the desired output.
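
To double-check that the sysctl from the patch is active afterwards, you can read it back from the node; the expected value is 1:

talosctl read /proc/sys/net/core/bpf_jit_harden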