vNUMA topology

Accelerator workloads are sensitive to how CPUs, system memory, and GPUs are physically connected. Virtual non-uniform memory access (vNUMA) exposes that physical layout to the virtual machine so that software inside the guest can take advantage of it. On a dual-socket server, half the GPUs and half the memory are local to one CPU socket, while the other half are local to the second socket. The operating system, job schedulers, and GPU communication libraries need to see those groupings in order to place work and memory on the same side of the inter-socket interconnect.

What the guest used to see (Legacy Topology)

Before the 8-GPU Beta, Akamai Cloud Computing instances presented a flat, uniform abstraction:

  • One NUMA node — all 128 vCPUs and the full pool of system memory appeared as a single undifferentiated domain.
  • No GPU-to-socket affinity — the guest could not tell which GPU was attached to which CPU socket or which half of system memory.
  • No GPU-to-memory grouping — every GPU appeared equidistant from every memory bank.

From the guest's perspective, the topology looked like this:

Legacy Topology — Single Synthetic NUMA Node
┌──────────────────────────────────────────────────────────────────────────┐
│  NUMA Node 0                                                             │
│  ├── vCPUs:        0-127  (128 cores)                                    │
│  ├── Memory:       ~1.4 TB                                               │
│  └── GPUs:         [0, 1, 2, 3, 4, 5, 6, 7]                              │
│                     (no association to vCPUs or memory banks)            │
└──────────────────────────────────────────────────────────────────────────┘

For a single-GPU workload, this abstraction is harmless. For an 8-GPU instance running NCCL, PyTorch Distributed, or any peer-to-peer (P2P) workload, it hides the physical reality that each GPU has a preferred set of vCPUs and a preferred memory bank. The Linux scheduler, CUDA, and collective-communication libraries all optimize for locality, but they can only optimize for what they can see.

What the guest sees now (8-GPU Beta with vNUMA Topology)

With the 8-GPU NVIDIA RTX PRO™ 6000 Blackwell Server Edition Beta plans, the hypervisor exposes the physical two-socket layout as two distinct NUMA nodes inside the guest. Each node contains exactly half the vCPUs, half the memory, and half the GPUs:

vNUMA Topology — Two Distinct NUMA Nodes
┌─────────────────────────────────────┐    ┌─────────────────────────────────────┐
│  NUMA Node 0                        │    │  NUMA Node 1                        │
│  ├── vCPUs:  0-63  (64 cores)       │    │  ├── vCPUs:  64-127  (64 cores)     │
│  ├── Memory: ~704 GB                │    │  ├── Memory: ~704 GB                │
│  └── GPUs:   [0, 1, 2, 3]           │    │  └── GPUs:   [4, 5, 6, 7]           │
│                                     │    │                                     │
│  Local to CPU Socket 0              │    │  Local to CPU Socket 1              │
└─────────────────────────────────────┘    └─────────────────────────────────────┘

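You can verify this inside the guest with standard Linux tooling. The awk filter below only collapses each node's long CPU list into a core count; every other line of output passes through unchanged: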

root@localhost:~# numactl --hardware | awk '/cpus:/ {print $1, $2, $3, NF-3, "cores"; next} 1'

Legacy 8-GPU Plans:

available: 1 nodes (0)
node 0 cpus: 128 cores
node 0 size: 1418720 MB
node 0 free: 1416997 MB
node distances:
node   0
  0:  10

vNUMA 8-GPU Plans:

available: 2 nodes (0-1)
node 0 cpus: 64 cores
node 0 size: 709544 MB
node 0 free: 353480 MB
node 1 cpus: 64 cores
node 1 size: 709559 MB
node 1 free: 269339 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10

In the vNUMA output:

  • node 0 and node 1 map 1:1 to the two physical CPU sockets.
  • Each node exposes 64 vCPU cores and roughly half of the total system memory.
  • The distance matrix (10 for local, 32 for remote) tells the OS that cross-socket memory access is weighted at roughly 3.2× the cost of local access. These values are relative distances on a scale where 10 means local, not measured latencies.
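To make the 3.2× figure concrete, here is a minimal Python sketch that parses the distance section of `numactl --hardware` output and computes the remote-to-local ratio. The function names (`parse_distances`, `remote_to_local_ratio`) are illustrative rather than part of any library, and the sample text is the vNUMA output shown above; in practice you would likely feed in the live command output instead.

```python
import re

# Sample copied from the vNUMA plan output above (already filtered through awk).
VNUMA_SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 64 cores
node 0 size: 709544 MB
node 0 free: 353480 MB
node 1 cpus: 64 cores
node 1 size: 709559 MB
node 1 free: 269339 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10
"""


def parse_distances(text):
    """Return {(from_node, to_node): distance} from `numactl --hardware` output."""
    lines = text.splitlines()
    # Locate the "node distances:" header; raises ValueError if it is missing.
    start = [ln.strip() for ln in lines].index("node distances:")
    # The next line is the column header, e.g. "node   0   1".
    columns = [int(tok) for tok in lines[start + 1].split()[1:]]
    distances = {}
    for row in lines[start + 2:]:
        match = re.match(r"\s*(\d+):\s+(.*)", row)
        if not match:
            break  # end of the matrix
        src = int(match.group(1))
        for dst, value in zip(columns, match.group(2).split()):
            distances[(src, dst)] = int(value)
    return distances


def remote_to_local_ratio(distances):
    """Largest remote distance divided by the smallest local distance."""
    local = [d for (a, b), d in distances.items() if a == b]
    remote = [d for (a, b), d in distances.items() if a != b]
    if not remote:
        return 1.0  # single-node (legacy-style) topology
    return max(remote) / min(local)


if __name__ == "__main__":
    print(remote_to_local_ratio(parse_distances(VNUMA_SAMPLE)))
```

On a single-node topology the helper returns 1.0, which gives a quick programmatic check for whether the guest sees more than one NUMA domain.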

Why the grouping matters

Instead of one undifferentiated pool of 128 vCPUs and eight GPUs, the guest now has two well-defined domains. This lets the software stack make locality-aware decisions at every layer:

| Capability | Legacy Topology | vNUMA Topology |
|---|---|---|
| Guest NUMA visibility | Single synthetic node | Mirrors physical host nodes |
| vCPU / memory / GPU grouping | No association | Each GPU tied to a specific 64-core + ~700 GB domain |
| OS scheduler behavior | Blind to memory locality | Can keep threads near their memory; numactl binds them to a node and numastat reports where memory landed |
| Framework ring topology (NCCL, PyTorch) | Assumes all peers equidistant | Builds bandwidth-aware communication rings |
| P2P memory copies | May silently route across slower inter-socket bus | Stays inside the local PCIe domain when possible |
| Data pipelines | Dataloader workers may run on one socket while tensors sit on the other | Aligned on the same NUMA node |
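The GPU-to-node grouping can also be read directly from sysfs, where Linux publishes a `numa_node` file for each PCI device. The sketch below groups NVIDIA display-class devices by that value. It assumes the standard sysfs layout under `/sys/bus/pci/devices`, NVIDIA's PCI vendor ID `0x10de`, and PCI base class `0x03` (display controller); the function name `gpus_by_numa_node` is hypothetical. A value of -1 means the kernel has no affinity information for that device.

```python
from collections import defaultdict
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpus_by_numa_node(pci_root="/sys/bus/pci/devices"):
    """Map NUMA node id -> list of PCI addresses of NVIDIA display devices."""
    root = Path(pci_root)
    if not root.is_dir():
        return {}  # not Linux, or sysfs is not mounted
    groups = defaultdict(list)
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries that lack the expected attribute files
        # Class codes look like 0x030200; base class 0x03 is "display controller".
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            groups[node].append(dev.name)
    return dict(groups)


if __name__ == "__main__":
    for node, devices in sorted(gpus_by_numa_node().items()):
        print(f"NUMA node {node}: {devices}")
```

If the layout matches the diagram above, you would expect four PCI addresses under each of the two nodes. Note that PCI addresses are not CUDA device indices, so mapping one to the other needs a separate lookup.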

Binding a workload to one NUMA node

You can pin an entire training run to a single NUMA node so that its CPU threads and host-memory allocations stay on one socket and avoid the slower inter-socket link:

numactl --cpunodebind=0 --membind=0 python3 train_model.py

This ensures:

  • The Python process and its dataloader workers run on vCPUs 0–63.
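  • Memory the process allocates comes only from Node 0's local pool. --membind is strict, so an allocation fails rather than spilling to Node 1 if Node 0 runs out.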
  • The targeted GPUs (0–3, if you also set CUDA_VISIBLE_DEVICES=0,1,2,3) remain on the same socket.
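If wrapping the command in numactl is inconvenient, the CPU half of the binding can be approximated from inside the Python process. The sketch below reads a node's CPU list from sysfs and applies it with `os.sched_setaffinity`, which is Linux-only. The helper names `parse_cpulist` and `pin_to_node` are illustrative. This is not a full substitute for `--membind`: it sets no memory policy and relies on the kernel's default behavior of preferring memory local to the CPU that first touches a page.

```python
import os


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-63' or '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node, sysfs_root="/sys/devices/system/node"):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    with open(f"{sysfs_root}/node{node}/cpulist") as fh:
        cpus = parse_cpulist(fh.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return cpus
```

Calling `pin_to_node(0)` early in the training script, before dataloader workers are started, should let those workers inherit the same CPU set, since child processes inherit the parent's affinity mask.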

This works out of the box with unmodified Ubuntu, Debian, Rocky Linux, and other supported images. No guest-OS configuration is required; the vNUMA and PCIe topology changes are applied automatically at boot.
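To use all eight GPUs while keeping each process local, one option is to start one process per NUMA node, each bound the same way as the single-node command above. The sketch below only builds the environment and argument list for each node. It assumes the GPU numbering from the diagram (GPUs 0-3 on node 0, GPUs 4-7 on node 1) matches the CUDA device indices, and `node_launch_plan` is a hypothetical helper. How the two processes coordinate, for example through a distributed launcher, is outside its scope.

```python
import shlex

GPUS_PER_NODE = 4  # assumption taken from the layout above
NUM_NODES = 2


def node_launch_plan(node, script="train_model.py"):
    """Return (extra_env, argv) that binds one process to a NUMA node and its GPUs."""
    if not 0 <= node < NUM_NODES:
        raise ValueError(f"node must be in 0..{NUM_NODES - 1}, got {node}")
    first_gpu = node * GPUS_PER_NODE
    gpu_ids = range(first_gpu, first_gpu + GPUS_PER_NODE)
    extra_env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [
        "numactl",
        f"--cpunodebind={node}",
        f"--membind={node}",
        "python3",
        script,
    ]
    return extra_env, argv


if __name__ == "__main__":
    for node in range(NUM_NODES):
        env, argv = node_launch_plan(node)
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, shlex.join(argv))
```

To launch for real, you could pass the result to `subprocess.Popen(argv, env={**os.environ, **extra_env})`, once per node.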