GPU Server CPU PCIe Lane Allocation and PCIe Switch Role

NVIDIA HGX B300 8-GPU System Block Diagram

Contents:
  • 1. Interconnect Between Dual CPUs
    • 1.1. Intel Platform (Xeon Scalable Series)
    • 1.2. AMD Platform (EPYC Series)
    • 1.3. Resources Consumed by CPU Interconnect
  • 2. NUMA Architecture and PCIe Lane “Ownership”
    • 2.1. NUMA Architecture
    • 2.2. PCIe Lane “Ownership”
    • 2.3. Configuration Recommendations (Golden Rule)
  • 3. Role of PCIe Switch
    • 3.1. Core Role: Breaking Physical Constraints for Optimal Topology
    • 3.2. Addressing GPU-CPU Ratio Issues (“Peer-to-Peer” vs “Non-Peer-to-Peer” Access)
    • 3.3. Enhancing Maintainability and Flexibility (Important!)
  • 4. Principles for CPU PCIe Lane Allocation to PCIe Switch
    • 4.1. Typical Allocation Schemes (Mainstream Platform Example)
    • 4.2. Core Allocation Principles
    • 4.3. Reasons for Not Allocating All Lanes to GPUs

1. Interconnect Between Dual CPUs

In a dual-socket server, the two CPUs are connected by a dedicated high-speed interconnect bus. This interconnect typically does not consume the PCIe lanes available for external devices, but it does use internal bus resources on the CPU.

The interconnect technology depends on the CPU vendor and platform. Modern mainstream server interconnect technologies include Intel’s Ultra Path Interconnect (UPI) and AMD’s Infinity Fabric.

Intel® Xeon® CPU Max 9470 Processor specifications

1.1. Intel Platform (Xeon Scalable Series)

  • Technology name: Intel Ultra Path Interconnect (UPI)
  • Physical form: Each CPU has a dedicated UPI link port. On a dual-socket motherboard, a dense cluster of pins and a corresponding socket are located near the CPU sockets. These are used for installing a UPI interconnect cable or for directly routing the connection through the motherboard PCB layers.
  • Partial motherboard view of an NVIDIA HGX B300 8-GPU system (Intel platform)

1.2. AMD Platform (EPYC Series)

  • Technology name: Infinity Fabric (IF)
  • Architecture features: AMD EPYC CPUs use an advanced “Chiplet” design, resulting in a more integrated interconnect.
  • Connection method: In a dual-socket server, the two EPYC CPUs are directly connected via dedicated Infinity Fabric xGMI links, implemented through dedicated pins and PCB wiring on the motherboard.
  • Partial motherboard view of a vendor’s NVIDIA HGX B300 8-GPU system (AMD platform)

Common characteristics: These technologies are ultra-high-speed, low-latency buses designed specifically for CPU-to-CPU communication. They far exceed the PCIe standard of their era in performance and are used to transmit critical data such as cache coherence traffic and memory accesses.

1.3. Resources Consumed by CPU Interconnect

1)No consumption of “user-available” PCIe lanes:

  • The PCIe x16 and x8 slot lanes on the motherboard come directly from the CPU. The UPI or IF link between CPUs is an independent physical and logical path, separate from these external PCIe lanes.
  • For example, CPU spec sheets often list two numbers, like “64 PCIe lanes + 3 UPI links”, indicating that the interconnect links are in addition to the lanes (they do not subtract from the 64 lanes available for GPUs, NVMe SSDs, NICs, etc.).

Intel Xeon CPU Max 9480

2)Consumes CPU internal resources and physical pins: 

  • While the interconnect does not use the user-available PCIe lanes, implementing UPI/IF functionality requires CPU die area, transistors, and package pins. From a system-resource perspective, this is overhead. 
  • It can be understood that within the CPU there is a “traffic hub” connecting the memory controller, PCIe controller, UPI/IF interconnect module, etc., and the interconnect module has its own dedicated “in/out” port.

2. NUMA Architecture and PCIe Lane “Ownership”

The key challenge in GPU server PCIe lane allocation is ensuring GPUs are “plugged into the right slots” and that software “runs on the right CPU”, to avoid data traveling long distances over the CPU interconnect bus. This is critical for extracting the maximum performance from the system.

2.1. NUMA Architecture

1)Dual-CPU NUMA architecture: After connecting two CPUs, the system forms a NUMA (Non-Uniform Memory Access) architecture. PCIe lane allocation across the system is strictly constrained by this NUMA setup. Key points include:

  • Local memory: Each CPU is directly connected to a portion of memory, which can be accessed with the lowest latency.
  • Remote memory: A CPU can access the memory attached to the other CPU via the interconnect bus, but with higher latency and lower speed.
  • PCIe device affinity: Each PCIe slot is hardwired to a specific CPU. If a GPU is plugged into a slot on CPU0, but the process using it runs on CPU1, then the GPU must access memory (or communicate with the CPU) across the interconnect bus, which degrades performance.

2)When configuring a dual-socket server, it is crucial for optimal performance to place devices (GPUs, high-speed NICs, etc.) in appropriate slots and bind processes to the correct CPU. This ensures that most memory and I/O traffic stays local to each CPU.

2.2. PCIe Lane “Ownership”

1)In a dual-CPU system, each CPU’s PCIe lanes are independent and only available locally. This is a key consideration.

2)Resource distribution (for example, suppose each CPU provides 48 PCIe 5.0 lanes):

  • CPU 0: Has its own 48 PCIe lanes and its directly attached portion of memory (local memory).
  • CPU 1: Has its own 48 PCIe lanes and its directly attached local memory.
  • Interconnect bus: A single high-speed UPI/IF link connects the two CPUs, allowing them to access each other’s memory and devices.

3)GPU placement performance differences:

  • Scenario A (ideal): Both GPUs are plugged into slots on CPU0. A process running on CPU0 that uses these GPUs will have them access CPU0’s local memory, resulting in the shortest and fastest data path.
  • Scenario B (to avoid): One GPU is plugged into CPU0 and the other into CPU1. If a process on CPU0 needs to use the GPU on CPU1, the data path is: CPU0 memory -> UPI/IF interconnect -> CPU1 -> PCIe -> GPU. This adds significant latency and consumes precious interconnect bandwidth, reducing GPU compute efficiency; the sketch below shows how to avoid this by binding a process to its GPU’s node.
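To avoid Scenario B in practice, a launcher can look up which NUMA node a GPU is wired to and pin the workload there. Below is a minimal Python sketch for Linux; the PCI address and the `train.py` workload are hypothetical placeholders, and it assumes `numactl` is installed and that sysfs exposes `numa_node` in the usual location.

```python
# Sketch: bind a workload to the NUMA node that owns a given GPU (Linux).
# The PCI address and workload below are hypothetical placeholders.
import pathlib
import subprocess

GPU_PCI_ADDR = "0000:17:00.0"  # placeholder: find yours with `lspci`

# sysfs reports the NUMA node a PCIe device is wired to (-1 = unknown)
node = int(pathlib.Path(
    f"/sys/bus/pci/devices/{GPU_PCI_ADDR}/numa_node").read_text())

cmd = ["python", "train.py"]  # placeholder workload
if node >= 0:
    # Pin both the CPU cores and the memory allocations to the GPU's local
    # node, so traffic does not cross the UPI/IF interconnect.
    cmd = ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + cmd
subprocess.run(cmd, check=True)
```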

2.3. Configuration Recommendations (Golden Rule)

Block Diagram of an HGX H200 NVL server motherboard

1)Check the motherboard manual: Before assembling or deploying a GPU server, consult the motherboard manual’s Motherboard Block Diagram and Motherboard Layout. These diagrams clearly indicate which CPU each physical slot is connected to.

HGX H200 NVL server Motherboard Layout

2)Balance load and optimize paths:

  • If running a single task that uses multiple GPUs in parallel (e.g., training a large model), try to install all GPUs in slots belonging to the same CPU. This ensures that the process and its data reside within the same NUMA node.
  • If running multiple independent GPU tasks, distribute them evenly across both CPUs. However, ensure that for each task, the GPUs and memory it uses are as local as possible to the CPU on which that task runs.

3)Use NUMA-aware tools: On Linux systems, use tools such as numactl, lstopo, etc., to view the NUMA topology. When launching tasks, bind the CPU and memory nodes appropriately to keep workloads and data local.
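Besides numactl and lstopo, the same topology information can be read programmatically from sysfs. The following Python sketch (assuming a Linux host with the standard sysfs layout) lists every display-class PCIe device together with its NUMA node, which is a quick way to verify slot-to-CPU affinity before binding tasks:

```python
# Sketch: list each GPU-class PCIe device and the NUMA node it belongs to.
# Assumes a Linux host with the standard sysfs layout.
import glob
import pathlib

for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
    p = pathlib.Path(dev)
    pci_class = (p / "class").read_text().strip()
    # PCI class 0x03xxxx covers VGA (0x0300xx) and 3D (0x0302xx) controllers
    if pci_class.startswith("0x03"):
        node = (p / "numa_node").read_text().strip()
        print(f"{p.name}  class={pci_class}  numa_node={node}")
```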

3. Role of PCIe Switch

PCIe switches cannot increase the total number of physical lanes provided by the CPUs. Their core function is to “redistribute and aggregate” these lanes to address key issues in GPU deployments, such as topology flexibility and maintainability.

Partial motherboard view of a vendor’s NVIDIA HGX B300 8-GPU system (AMD platform) with dual PCIe switches (PEX89072)

3.1. Core Role: Breaking Physical Constraints for Optimal Topology

1)Each CPU’s PCIe lanes come out through a limited number of physical interfaces (e.g., a few x16 slots). Without a switch:

  • GPUs can only be plugged into those few x16 slots on each CPU.
  • If GPUs need high-speed peer-to-peer communication (e.g., via NVLink/NVSwitch), their placement would be completely constrained by the fixed motherboard wiring, resulting in a very inflexible layout.

2)Role of PCIe Switch: The switch allows a set of CPU lanes (for example, 16 PCIe lanes) to be “expanded” by the switch chip into multiple x16 downstream interfaces (e.g., eight x16 ports), although those eight ports share the bandwidth of the original 16 upstream lanes. More importantly, the switch can intelligently configure connectivity among its downstream ports. This provides an ideal, structured physical wiring foundation for NVLink interconnects between GPUs. In HGX motherboards, this complex combination of PCB routing and switch logic implements full or partial NVLink topologies among the GPUs.
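The fan-out described above is easy to quantify. The numbers below are illustrative assumptions, not vendor specifications: a 16-lane uplink fanned out to eight x16 downstream ports, with PCIe 5.0 taken as roughly 4 GB/s per lane per direction.

```python
# Illustrative oversubscription math for the fan-out described above.
GB_PER_LANE = 4.0        # approx. PCIe 5.0 bandwidth per lane per direction
uplink_lanes = 16        # CPU lanes feeding the switch
downstream_ports = 8     # x16 ports fanned out by the switch
downstream_lanes = downstream_ports * 16

print(f"Uplink: {uplink_lanes * GB_PER_LANE:.0f} GB/s shared by "
      f"{downstream_ports} x16 ports")
print(f"Oversubscription: {downstream_lanes // uplink_lanes}:1")
# Each GPU keeps a full x16 link *to the switch*; contention only appears
# when several GPUs talk to the CPU/memory through the uplink at once.
# Peer-to-peer traffic that stays below the switch consumes no uplink
# bandwidth at all.
```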

3.2. Addressing GPU-CPU Ratio Issues (“Peer-to-Peer” vs “Non-Peer-to-Peer” Access)

1)CPU lane shortage: In an 8-GPU server configuration, a fundamental problem is that no single CPU can supply enough PCIe lanes to connect all 8 GPUs (each GPU typically needs x16 lanes).

2)Switch solution (example of a typical 8-GPU HGX):

  • Each of the two CPUs provides a portion of its PCIe lanes (e.g., 48 out of 64 lanes each) to one or more PCIe switch chips.
  • The switch aggregates these lanes and distributes them to the 8 GPU physical interfaces.
  • Result: Each GPU is physically connected to the switch, and the switch determines which CPU uplink each GPU uses.
  • Key advantage: The switch configuration allows any GPU to access the memory of both CPUs or to have its PCIe bandwidth dynamically allocated between them, as quantified in the sketch below. This flexibility is much greater than the traditional scheme of fixing 4 GPUs on CPU0 and 4 on CPU1, resulting in much better load balancing under the NUMA architecture.
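As a rough worked example of this scheme, assume each CPU donates 48 PCIe 5.0 lanes to a switch serving four GPUs, each with an x16 link (all figures are illustrative assumptions):

```python
# Rough worked example of the typical dual-switch 8-GPU scheme above.
# All figures are assumptions for illustration, not vendor specifications.
GB_PER_LANE = 4.0          # approx. PCIe 5.0 per lane per direction
uplink_lanes_per_cpu = 48  # lanes each CPU donates to its switch
gpus_per_switch = 4        # GPUs hanging off each switch
gpu_link_lanes = 16        # each GPU's link into the switch

uplink_bw = uplink_lanes_per_cpu * GB_PER_LANE   # 192 GB/s per switch
per_gpu_demand = gpu_link_lanes * GB_PER_LANE    # 64 GB/s per GPU
per_gpu_share = uplink_bw / gpus_per_switch      # 48 GB/s when all burst

print(f"Switch uplink: {uplink_bw:.0f} GB/s")
print(f"Per-GPU x16 demand: {per_gpu_demand:.0f} GB/s, "
      f"fair share when all 4 GPUs burst: {per_gpu_share:.0f} GB/s")
```

The mild undersubscription (48 vs 64 GB/s per GPU in this example) is usually acceptable because bulk GPU-to-GPU traffic rides NVLink rather than the PCIe uplink.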

3.3. Enhancing Maintainability and Flexibility (Important!)

1)Hot-plug support: PCIe switches typically provide better hot-plug (hot-swap) support, which is essential in high-availability data centers.

2)Bandwidth management and QoS: A switch can manage downstream GPU PCIe traffic with priority controls (like a network switch), preventing one GPU’s burst traffic from blocking the communication of others.

3)Fault isolation: If a downstream link or GPU fails, the switch can isolate the fault to some extent, preventing it from affecting the entire PCIe hierarchy.

In integrated AI servers like NVIDIA HGX, the PCIe switch is a critical “enabling” component. It cannot magically create more PCIe lanes, but by redistributing and managing them it addresses the challenges of topology, flexibility, scalability, and manageability in large-scale GPU deployments, so that the system can operate at optimal performance. The introduction of the PCIe switch is a key step in evolving servers from “multiple independent GPUs” to an “integrated AI compute unit”.

4. Principles for CPU PCIe Lane Allocation to PCIe Switch

In high-end GPU servers (such as NVIDIA HGX, DGX, and various OEM 8-GPU systems), the allocation of PCIe lanes from the CPUs to the PCIe switch is a carefully balanced design choice that directly affects system performance and cost.

4.1. Typical Allocation Schemes (Mainstream Platform Example)

1)A typical dual-socket high-end GPU server is usually based on:

  • CPU platform: Intel Xeon Scalable or AMD EPYC
  • GPU: 8 x NVIDIA H100/A100/H800, etc.
  • Goal: Provide PCIe connections for all 8 GPUs

2)Common allocation pattern: Each CPU allocates 32 or 48 PCIe lanes to the PCIe switches that connect to the GPUs.

Let’s look at a specific, typical diagram of PCIe lane allocation for an Intel-platform (dual 4th/5th Gen Xeon) 8-GPU server:

The core design principle is not to maximize the bandwidth of any single GPU, but to ensure that all GPUs can simultaneously and evenly access both CPUs and the entire system memory, thus maximizing aggregate bandwidth.

4.2. Core Allocation Principles

1)Balance principle (core): Ensure each GPU can efficiently access either CPU and its local memory in a peer-to-peer manner. This optimizes NUMA performance. If all lanes originated from one CPU, GPUs under the other CPU would have only “remote” memory access with much higher latency. Therefore, typically both CPUs contribute equal numbers of lanes to the switch (for example, 32 lanes each or 48 lanes each).

2)Separation of direct and switched paths:

  • Switched Path: The majority of lanes (e.g., the 32/48 per CPU mentioned above) connect GPUs via the PCIe switch, providing a flexible topology and balanced GPU access.
  • Direct Path: A portion of each CPU’s lanes must bypass the switch to directly connect critical I/O devices, such as high-speed network cards (200/400Gb Ethernet, InfiniBand HDR/NDR) that require low-latency, deterministic x16 bandwidth; local NVMe storage (ultra-fast SSDs for cache); and management/BMC interfaces. These devices achieve optimal performance when directly attached to the CPU.

3)Aggregate bandwidth matching principle: The total uplink bandwidth allocated to the switch (for example, dual sockets each providing 48 lanes of PCIe 4.0 = 2 × 48 × 2 GB/s ≈ 192 GB/s) should match the GPUs’ aggregate demand for system memory and network bandwidth; see the worked check after this list. The design should ensure that this uplink path does not become the bottleneck. For instance, when 8 H100 GPUs synchronize via NVLink, they place tremendous load on CPU memory accesses; sufficient uplink PCIe bandwidth is required to “feed” them.

4)Topology optimization principle: The allocation scheme should be co-designed with the GPU interconnect topology (NVLink). In HGX designs, the placement of PCIe switches and the uplink bandwidth allocation are arranged to complement the NVLink switches, allowing GPUs to communicate with each other at high speed via NVLink while also communicating efficiently with CPUs, memory, and the network through balanced PCIe uplinks.
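As a worked check of the bandwidth-matching principle in point 3, using the document’s PCIe 4.0 figure of ≈2 GB/s per lane per direction (the per-GPU host-traffic figure below is a hypothetical placeholder):

```python
# Worked check of the aggregate bandwidth-matching example in point 3.
GB_PER_LANE_GEN4 = 2.0    # approx. PCIe 4.0 bandwidth per lane per direction
lanes_per_cpu = 48
sockets = 2

uplink_total = sockets * lanes_per_cpu * GB_PER_LANE_GEN4
print(f"Total switch uplink: {uplink_total:.0f} GB/s")   # ≈ 192 GB/s

# Compare against the aggregate demand the uplinks must absorb, e.g. GPUs
# staging training data from host memory. The figure below is a placeholder:
assumed_per_gpu_host_traffic = 20.0  # GB/s, hypothetical steady-state load
gpus = 8
demand = gpus * assumed_per_gpu_host_traffic
print(f"Assumed host-memory demand: {demand:.0f} GB/s -> "
      f"{'OK' if demand <= uplink_total else 'uplink is the bottleneck'}")
```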

4.3. Reasons for Not Allocating All Lanes to GPUs

NVIDIA HGX B300 8-GPU System Block Diagram

1)This is the most important trade-off:

  • If all lanes are given to GPUs: Critical I/O such as networking and storage would also have to connect downstream of the switch, sharing uplink bandwidth with the GPUs. This increases network latency and variability, which can be disastrous in AI training and HPC clusters.
  • If too few lanes are reserved: Network bandwidth can become a bottleneck for GPU-to-external communication, similarly limiting overall performance. A simple lane-budget sketch follows below.
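A small lane-budget helper makes this trade-off concrete. The split below (48 lanes to the GPU switch and 16 reserved for a direct-attach NIC per CPU) mirrors the typical scheme described earlier; all numbers are illustrative assumptions.

```python
# Illustrative per-CPU lane budget for the trade-off described above.
TOTAL_LANES_PER_CPU = 64          # hypothetical usable lanes per socket

budget = {
    "GPU switch uplink": 48,      # switched path: feeds 4 GPUs per CPU
    "400G NIC (direct)": 16,      # direct path: latency-sensitive network
    # NVMe and BMC interfaces are often served by chipset or spare lanes.
}

used = sum(budget.values())
assert used <= TOTAL_LANES_PER_CPU, "over-committed PCIe lanes!"
for use, lanes in budget.items():
    print(f"{use:>20}: x{lanes}")
print(f"{'spare':>20}: x{TOTAL_LANES_PER_CPU - used}")
```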

2)Example solutions:

  • NVIDIA DGX H100 system: Uses dual Intel Xeon CPUs. The total uplink PCIe bandwidth serving its eight H100 GPUs is carefully designed to balance with CPU memory bandwidth, NVLink bandwidth, and the Quantum-2 InfiniBand network bandwidth. Typically, each CPU contributes a large number of lanes to the internal switch fabric.
  • AMD EPYC Platform: Since a single EPYC CPU provides a very large number of PCIe lanes (e.g., up to 128 lanes in Genoa), the allocation can be more flexible. More complex asymmetric allocations are possible, even allowing the topology to be completed with fewer switch chips.

In professional GPU servers, the number of lanes a CPU assigns to the PCIe switch is not a fixed value, but the result of system-level design. The goal is to achieve the best bandwidth balance among GPUs, CPU memory, high-speed network, and storage, while also meeting the GPUs’ need for peer-to-peer access and low-latency communication. In typical dual-socket 8-GPU systems, allocating 32 or 48 lanes per CPU to the GPU switch has been validated as the “sweet spot” for optimizing overall performance.
