vNUMA topology

Accelerator workloads are sensitive to how CPUs, system memory, and GPUs are physically connected. Virtual non-uniform memory access (vNUMA) exposes that physical layout to the virtual machine so that software can optimize for it. On a dual-socket server, half the GPUs and half the memory are local to one CPU socket, while the other half are local to the second socket. The operating system, job schedulers, and GPU communication libraries need to see those groupings in order to place work and memory on the same side of the interconnect.

What the guest used to see (Legacy Topology)

Before the 8-GPU Beta, Akamai Cloud Computing instances presented a flat, uniform abstraction:

One NUMA node — all 128 vCPUs and the full pool of system memory appeared as a single undifferentiated domain.
No GPU-to-socket affinity — the guest could not tell which GPU was attached to which CPU socket or which half of system memory.
No GPU-to-memory grouping — every GPU appeared equidistant from every memory bank.

From the guest's perspective, the topology looked like this:

Legacy Topology — Single Synthetic NUMA Node
┌──────────────────────────────────────────────────────────────────────────┐
│  NUMA Node 0                                                             │
│  ├── vCPUs:        0-127  (128 cores)                                    │
│  ├── Memory:       ~1.4 TB                                               │
│  └── GPUs:         [0, 1, 2, 3, 4, 5, 6, 7]                              │
│                     (no association to vCPUs or memory banks)            │
└──────────────────────────────────────────────────────────────────────────┘

For a single-GPU workload, this abstraction is harmless. For an 8-GPU instance running NCCL, PyTorch Distributed, or any peer-to-peer (P2P) workload, it hides the physical reality that each GPU has a preferred set of vCPUs and a preferred memory bank. The Linux scheduler, CUDA, and collective-communication libraries all optimize for locality, but they can only optimize for what they can see.

What the guest sees now (8-GPU Beta with vNUMA Topology)

With the 8-GPU NVIDIA RTX PRO™ 6000 Blackwell Server Edition Beta plans, the hypervisor exposes the physical two-socket layout as two distinct NUMA nodes inside the guest. Each node contains exactly half the vCPUs, half the memory, and half the GPUs:

vNUMA Topology — Two Distinct NUMA Nodes
┌─────────────────────────────────────┐    ┌─────────────────────────────────────┐
│  NUMA Node 0                        │    │  NUMA Node 1                        │
│  ├── vCPUs:  0-63  (64 cores)       │    │  ├── vCPUs:  64-127  (64 cores)     │
│  ├── Memory: ~704 GB                │    │  ├── Memory: ~704 GB                │
│  └── GPUs:   [0, 1, 2, 3]           │    │  └── GPUs:   [4, 5, 6, 7]           │
│                                     │    │                                     │
│  Local to CPU Socket 0              │    │  Local to CPU Socket 1              │
└─────────────────────────────────────┘    └─────────────────────────────────────┘

You can verify this inside the guest with standard Linux tooling. The awk filter replaces each node's long CPU list with a count:

root@localhost:~# numactl --hardware | awk '/cpus:/ {print $1, $2, $3, NF-3, "cores"; next} 1'

Legacy 8-GPU Plans:

available: 1 nodes (0)
node 0 cpus: 128 cores
node 0 size: 1418720 MB
node 0 free: 1416997 MB
node distances:
node   0
  0:  10

vNUMA 8-GPU Plans:

available: 2 nodes (0-1)
node 0 cpus: 64 cores
node 0 size: 709544 MB
node 0 free: 353480 MB
node 1 cpus: 64 cores
node 1 size: 709559 MB
node 1 free: 269339 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10

node 0 and node 1 map 1:1 to the two physical CPU sockets.
Each node exposes 64 vCPU cores and roughly half of the total system memory.
The distance matrix (10 for local, 32 for remote) tells the OS scheduler that cross-socket memory access costs roughly 3.2× as much as local access. These values are relative weights normalized so that local access is 10, not measured latencies.
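
The same check can be scripted. The following is a minimal sketch, assuming the raw (unfiltered) `numactl --hardware` output format shown above; the function names `parse_numactl_hardware` and `relative_remote_cost` are illustrative, and `SAMPLE` is built from the values on this page rather than captured from a live instance.

```python
import re

def parse_numactl_hardware(text):
    """Parse raw `numactl --hardware` output.

    Returns (cpus, distances): cpus maps node id -> CPU count,
    distances maps node id -> that node's row of the distance matrix.
    """
    cpus = {}
    distances = {}
    in_distances = False
    for line in text.splitlines():
        m = re.match(r"node (\d+) cpus:(.*)", line)
        if m:
            # Raw output lists every CPU id, so the count is the token count.
            cpus[int(m.group(1))] = len(m.group(2).split())
            continue
        if line.startswith("node distances:"):
            in_distances = True
            continue
        if in_distances:
            # Matrix rows look like "  0:  10  32"; the header row has no colon.
            m = re.match(r"\s*(\d+):\s+(.*)", line)
            if m:
                distances[int(m.group(1))] = [int(v) for v in m.group(2).split()]
    return cpus, distances

def relative_remote_cost(distances, src, dst):
    """Ratio of remote to local distance (assumes node ids are 0..N-1)."""
    return distances[src][dst] / distances[src][src]

# Sample text assembled from the vNUMA values shown on this page.
SAMPLE = "\n".join([
    "available: 2 nodes (0-1)",
    "node 0 cpus: " + " ".join(str(c) for c in range(0, 64)),
    "node 0 size: 709544 MB",
    "node 1 cpus: " + " ".join(str(c) for c in range(64, 128)),
    "node 1 size: 709559 MB",
    "node distances:",
    "node   0   1",
    "  0:  10  32",
    "  1:  32  10",
])

cpus, distances = parse_numactl_hardware(SAMPLE)
print(cpus)                                    # {0: 64, 1: 64}
print(relative_remote_cost(distances, 0, 1))   # 3.2
```

On a live guest, the text could instead come from `subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout`, provided `numactl` is installed.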

Why the grouping matters

Instead of one undifferentiated pool of 128 vCPUs and eight GPUs, the guest now has two well-defined domains. This lets the software stack make locality-aware decisions at every layer:

Capability	Legacy Topology	vNUMA Topology
Guest NUMA visibility	Single synthetic node	Mirrors physical host nodes
vCPU / memory / GPU grouping	No association	Each GPU tied to a specific 64-core + ~700 GB domain
OS scheduler behavior	Blind to memory locality	Can keep threads near their memory; `numactl` binds threads and allocations to a node, and `numastat` reports per-node memory usage
Framework ring topology (NCCL, PyTorch)	Assumes all peers equidistant	Builds bandwidth-aware communication rings
P2P memory copies	May silently route across slower inter-socket bus	Stays inside the local PCIe domain when possible
Data pipelines	Dataloader workers may run on Node 1 while tensors live on Node 0	Workers and tensors can be aligned on the same NUMA node
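
To see the GPU grouping from inside a guest, one option is to read the `numa_node` attribute that Linux publishes for each PCI device in sysfs. The sketch below assumes NVIDIA GPUs (PCI vendor ID `0x10de`) that report a display-controller class; `gpu_numa_nodes` is a hypothetical helper name, and the PCI addresses it returns will not necessarily be in the same order as CUDA device indices unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"

def gpu_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> NUMA node for NVIDIA display-class devices.

    Returns an empty dict if the sysfs directory does not exist.
    A value of -1 means the kernel has no NUMA affinity for that device.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    result = {}
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # skip devices with missing or unreadable attributes
        # PCI class 0x03xxxx covers display controllers (VGA and 3D).
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            result[dev.name] = node
    return result

if __name__ == "__main__":
    for address, node in gpu_numa_nodes().items():
        print(f"{address} -> NUMA node {node}")
```

With the NVIDIA driver installed, `nvidia-smi topo -m` gives a comparable view, including each GPU's CPU and NUMA affinity.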

Binding a workload to one NUMA node

You can pin an entire training run to the local resources of a single NUMA node so that its host memory traffic stays off the slower inter-socket link:

numactl --cpunodebind=0 --membind=0 python3 train_model.py

With this binding:

The Python process and its dataloader workers run on vCPUs 0–63.
The process's host memory allocations come from Node 0's local memory pool.
GPUs 0–3 sit on the same socket. numactl does not restrict GPU visibility, so also set CUDA_VISIBLE_DEVICES=0,1,2,3 to keep the run on those GPUs.

This works out of the box with unmodified Ubuntu, Debian, Rocky Linux, and other supported images. No guest-OS configuration is required; the vNUMA and PCIe topology changes are applied automatically at boot.
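
The binding above extends naturally to one process per NUMA node. The following is a minimal sketch, assuming the GPU-to-node layout shown earlier (GPUs 0–3 on Node 0, GPUs 4–7 on Node 1) and that CUDA device indices follow that layout. `NODE_GPUS`, `numa_launch`, and `run_on_each_node` are hypothetical names, and `train_model.py` stands in for your own script.

```python
import os
import subprocess

# Assumed mapping, taken from the vNUMA layout described on this page.
NODE_GPUS = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

def numa_launch(node, script, gpus=None):
    """Build the argv and extra environment for a node-local run."""
    gpus = NODE_GPUS[node] if gpus is None else gpus
    argv = [
        "numactl",
        f"--cpunodebind={node}",  # run only on this node's vCPUs
        f"--membind={node}",      # allocate host memory only from this node
        "python3",
        script,
    ]
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
    return argv, env

def run_on_each_node(script):
    """Start one bound process per NUMA node and wait for all of them."""
    procs = []
    for node in sorted(NODE_GPUS):
        argv, extra_env = numa_launch(node, script)
        procs.append(subprocess.Popen(argv, env={**os.environ, **extra_env}))
    return [p.wait() for p in procs]
```

For example, `numa_launch(1, "train_model.py")` yields a `numactl --cpunodebind=1 --membind=1 python3 train_model.py` command with `CUDA_VISIBLE_DEVICES=4,5,6,7`. Whether two independent node-local runs or one 8-GPU job is the better fit depends on the workload.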

Updated 20 days ago