Using GPUs on LKE
Akamai's GPU instances are available for deployment on LKE, enabling you to run your GPU-accelerated workloads on Akamai's managed Kubernetes service. These instances utilize NVIDIA GPUs, including NVIDIA RTX 4000 Ada and NVIDIA Quadro RTX 6000. This document outlines several options for installing the NVIDIA software components needed to configure GPU-enabled workloads.
Install NVIDIA software
There are two primary ways to install the software components needed to use NVIDIA GPUs within Kubernetes:
- NVIDIA Kubernetes device plugin: A DaemonSet that advertises GPUs as consumable resources and enables you to schedule GPU workloads.
- NVIDIA GPU operator: A Kubernetes operator that automates the configuration and management of NVIDIA GPUs on Kubernetes clusters.
NVIDIA Kubernetes device plugin
This DaemonSet is NVIDIA's implementation of the Kubernetes device plugin framework and advertises GPUs as consumable resources. The following example command installs v0.17.1 of the plugin. For the latest installation instructions and versions, see the NVIDIA/k8s-device-plugin GitHub repository. You must have kubectl installed and configured to use an LKE cluster with GPU worker nodes.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
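After the DaemonSet starts, each GPU worker node advertises its GPUs under the nvidia.com/gpu resource. As an optional check (not required for installation), you can list the allocatable GPU count per node with a custom-columns query; the column names here are arbitrary:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"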
NVIDIA GPU operator
The NVIDIA GPU operator automatically configures the software required to use GPUs on your cluster and worker nodes. While this normally includes the NVIDIA drivers and container toolkit components, both are disabled in the instructions below because they are installed automatically on LKE GPU worker nodes. To learn more about this operator and for the most recent instructions, see NVIDIA's Installing the NVIDIA GPU Operator guide.
Before continuing, both kubectl and Helm should be installed on your local machine. The kubectl context should be set to an LKE cluster using GPU worker nodes.
- Add the NVIDIA Helm repository to your local machine:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
- Install the GPU operator on your cluster. Since the drivers and container toolkit are installed automatically on LKE GPU worker nodes, both are disabled in the operator:
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false \
    --set toolkit.enabled=false
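Once the Helm release is installed, you can optionally confirm that the operator's components are healthy by listing the pods in the gpu-operator namespace and waiting for them to report Running or Completed:
kubectl get pods -n gpu-operator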
NVIDIA driver installation
When deploying GPU instances on LKE, the NVIDIA drivers are automatically installed. For transparency, the installation script is included below. You do not need to take any action.
# Only runs on instances with an NVIDIA GPU attached.
if lspci | grep -qi nvidia; then
    # Add the NVIDIA container toolkit repository and its signing key.
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    # Enable the Debian repository components that provide the NVIDIA driver.
    add-apt-repository -y contrib non-free non-free-firmware
    apt update
    # Install the driver, matching kernel headers, and container toolkit.
    apt install -y nvidia-driver linux-headers-cloud-amd64 nvidia-container-toolkit
    # Configure containerd to use the NVIDIA runtime by default, then reboot.
    nvidia-ctk runtime configure --runtime=containerd --set-as-default
    reboot
fi
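If you want to verify the result, you can log in to a GPU worker node (for example, over SSH or Lish) after it finishes rebooting and run the following command, which prints the detected GPUs and driver version. This is purely informational; no action is required on your part.
nvidia-smi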
Configure workloads to use GPUs
Once NVIDIA's software has been installed, you can configure pods to consume GPU resources. To do so, add the nvidia.com/gpu: n key-value pair to the resource limits in your workload's manifest file, where n is the number of GPUs the container should consume. Here's an example:
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
    - name: app
      image: example-image
      resources:
        limits:
          memory: 24Gi
          cpu: 6
          nvidia.com/gpu: 1
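To try a manifest like this, replace example-image with a GPU-capable image of your choice (example-image is only a placeholder), save the manifest to a file (gpu-workload.yaml is an assumed name here), then apply it and confirm that the pod schedules onto a GPU worker node:
kubectl apply -f gpu-workload.yaml
kubectl get pod gpu-workload -o wide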
For a more in-depth example of running GPU-accelerated workloads on LKE, see the Deploy a Chatbot and RAG Pipeline for AI Inferencing on LKE guide.