Using GPUs on LKE

Akamai's GPU instances are available for deployment on LKE, enabling you to run your GPU-accelerated workloads on Akamai's managed Kubernetes service. These instances utilize NVIDIA GPUs, including NVIDIA RTX 4000 Ada and NVIDIA Quadro RTX 6000. This document outlines several options for installing the NVIDIA software components needed to configure GPU-enabled workloads.

Install NVIDIA software

There are two primary ways to install the software components needed to use NVIDIA GPUs within Kubernetes:

  • NVIDIA Kubernetes device plugin: A DaemonSet that advertises GPUs as consumable resources and enables you to schedule GPU workloads.
  • NVIDIA GPU operator: A Kubernetes operator that automates the configuration and management of NVIDIA GPUs on Kubernetes clusters.

NVIDIA Kubernetes device plugin

This DaemonSet is NVIDIA's implementation of the Kubernetes device plugin framework and advertises GPUs as consumable resources. The following example command installs v0.17.1 of the plugin. For the latest installation instructions and versions, see the NVIDIA/k8s-device-plugin GitHub repository. You must have kubectl installed and configured to use an LKE cluster with GPU worker nodes.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
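
After the DaemonSet is running, you can optionally confirm that your GPU worker nodes advertise the nvidia.com/gpu resource. The DaemonSet name and namespace below reflect the static manifest above; adjust them if you install the plugin a different way.

# Confirm the device plugin pods are running (the static manifest deploys to kube-system):
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset

# Check that each GPU node reports an allocatable nvidia.com/gpu resource:
kubectl describe nodes | grep -i "nvidia.com/gpu"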

NVIDIA GPU operator

The NVIDIA GPU operator automatically configures all of the software required to use GPUs on your cluster and worker nodes. While this generally includes the NVIDIA drivers and the NVIDIA Container Toolkit, both are disabled in the instructions below because they are installed automatically on LKE GPU worker nodes. To learn more about this operator and for the most recent instructions, see NVIDIA's Installing the NVIDIA GPU Operator guide.

Before continuing, both kubectl and Helm should be installed on your local machine. The kubectl context should be set to an LKE cluster using GPU worker nodes.
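
As a quick sanity check, you can confirm that your tooling is in place and pointed at the intended cluster, for example:

kubectl config current-context   # should reference your LKE cluster's kubeconfig context
kubectl get nodes                # should list your GPU worker nodes in a Ready state
helm version                     # confirms Helm is installed locally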

  1. Add the NVIDIA Helm repository to your local machine.

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
      && helm repo update
    
  2. Install the GPU operator on your cluster. Since the drivers and the container toolkit are already installed on LKE GPU worker nodes, both are disabled in the Helm values below.

    helm install --wait --generate-name \
      -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false
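
Once the Helm release is installed, you can verify that the operator's components are healthy and that your GPU worker nodes expose the nvidia.com/gpu resource. For example:

# The operator's pods should reach a Running or Completed state:
kubectl get pods -n gpu-operator

# GPU worker nodes should report allocatable nvidia.com/gpu capacity:
kubectl describe nodes | grep -i "nvidia.com/gpu"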
    

NVIDIA driver installation

When you deploy GPU instances on LKE, the NVIDIA drivers are installed automatically. For transparency, the installation script is included below; you do not need to take any action.

# Only run on instances with an NVIDIA GPU attached.
if lspci | grep -qi nvidia; then
  # Add the NVIDIA Container Toolkit repository and its signing key.
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  # Enable the Debian repository components that provide the NVIDIA driver packages.
  add-apt-repository -y contrib non-free non-free-firmware
  # Install the NVIDIA driver, matching kernel headers, and the NVIDIA Container Toolkit.
  apt update
  apt install -y nvidia-driver linux-headers-cloud-amd64 nvidia-container-toolkit
  # Set the NVIDIA runtime as containerd's default, then reboot so the driver loads.
  nvidia-ctk runtime configure --runtime=containerd --set-as-default
  reboot
fi
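
If you want to confirm the result on a particular worker node, you can run the following from a shell on that node (for example, through a Lish console session):

nvidia-smi             # lists the node's GPUs and the installed driver version
nvidia-ctk --version   # confirms the NVIDIA Container Toolkit is installed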

Configure workloads to use GPUs

Once NVIDIA's software has been installed, you can configure pods to consume GPU resources. This is done by using the nvidia.com/gpu: n key-value pair within the resource limits of your workload's manifest file, where n is the number of GPUs that should be consumed. Here's an example:

---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: app
    image: example-image
    resources:
      limits:
        memory: 24Gi
        cpu: 6
        nvidia.com/gpu: 1
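
To verify end to end that a pod can reach a GPU, you can run a short-lived test pod that requests one GPU and prints the output of nvidia-smi. The pod name, file name, and CUDA image tag below are only examples; substitute any CUDA base image available to you.

---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example CUDA base image; substitute any tag available to you.
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Apply the manifest (saved here as gpu-smoke-test.yaml) and check the pod's logs. The output should list the GPU attached to the node that ran the pod:

kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test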

For a more in-depth example of running GPU-accelerated workloads on LKE, see the Deploy a Chatbot and RAG Pipeline for AI Inferencing on LKE guide.