Typical GPU Server System Topologies: Direct Attach vs. PCIe Switch Comparison

Main Contents:
  • 1. Core Definitions and Topology
    • 1.1. Direct Attach
    • 1.2. PCIe Switch 89104/89144 Type (Broadcom PEX89000 Series)
  • 2. Direct Attach vs. PEX89104 vs. PEX89144
    • 2.1. Hardware and Scalability
    • 2.2. Communication Performance (Bandwidth / Latency)
    • 2.3. Software and Compatibility
    • 2.4. Application Scenarios
  • 3. Key Selection Criteria
  • 4. Conclusion

1. Core Definitions and Topology

1.1. Direct Attach

1) Topology: GPUs are attached via PCIe x16 directly to the CPU's PCIe Root Complex, with no intervening switch chip.

2) Typical Configuration: dual-socket CPU (e.g., Intel Xeon/AMD EPYC); each CPU provides 64–128 PCIe lanes, directly connecting 4–8 GPUs (x16 each).

3) Links: the CPU ↔ GPU connection is point-to-point; GPU-to-GPU communication must be forwarded through the CPU (PCIe tree topology).
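The lane budget above can be checked with simple arithmetic. This is a minimal sketch, not vendor tooling: the helper name `direct_attach_fits` and the 128-lane default budget are illustrative assumptions based on the ranges quoted in this section.

```python
def direct_attach_fits(num_gpus: int, lanes_per_gpu: int = 16,
                       cpu_lanes: int = 128) -> bool:
    """True if every GPU can attach at full width with no switch chip."""
    return num_gpus * lanes_per_gpu <= cpu_lanes

print(direct_attach_fits(8))   # 8 x16 GPUs need 128 lanes: fits a 128-lane budget
print(direct_attach_fits(10))  # 160 lanes needed: exceeds the budget
```

This is why direct attach tops out around 8 GPUs on a dual-socket platform: the GPUs alone consume the entire lane budget, leaving nothing for NICs or NVMe.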

1.2. PCIe Switch 89104/89144 Type (Broadcom PEX89000 Series)

1) Topology: The CPU connects to the PCIe switch chip via uplink ports (x16/x8), and all GPUs attach to the switch’s downlink ports, forming a star (switched) topology.

2) Chip Specifications (PCIe 5.0, 32 GT/s):

PEX89104: 104 lanes, configurable into multiple x16/x8/x4 ports; typically used to fan out to 8–12 GPUs.

PEX89144: 144 lanes, for larger-scale expansion, supports 12–16 GPUs or multi-host sharing.

3) Links: CPU ↔ Switch ↔ GPU; GPUs can communicate directly through the switch without involving the CPU.
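The same lane arithmetic applies to the switch chips. A hedged sketch, assuming one x16 uplink by default (the helper `max_x16_downlinks` is hypothetical; the real chips offer more flexible port configurations, and the 8–16 GPU counts quoted above are typically reached with narrower x8 links or multiple switches):

```python
def max_x16_downlinks(total_lanes: int, uplink_lanes: int = 16) -> int:
    """Full-width x16 downstream ports left after reserving the uplink lanes."""
    return (total_lanes - uplink_lanes) // 16

print(max_x16_downlinks(104))  # PEX89104: (104 - 16) // 16 = 5 x16 downlinks
print(max_x16_downlinks(144))  # PEX89144: (144 - 16) // 16 = 8 x16 downlinks
```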

2. Direct Attach vs. PEX89104 vs. PEX89144

2.1. Hardware and Scalability

| Comparison Item | Direct Attach | PCIe Switch 89104 | PCIe Switch 89144 |
| --- | --- | --- | --- |
| Max GPU Count | 8 GPUs (limited by CPU lanes) | 12–16 GPUs | 16–24 GPUs |
| CPU Lane Usage | All lanes used (e.g., 8 × x16 = 128 lanes) | Uplink ports only (e.g., 2 × x16 = 32 lanes) | Uplink ports only (e.g., 2 × x16 = 32 lanes) |
| Remaining Lanes | Very few / none | Plenty (available for NICs, NVMe, FPGAs) | Even more available |
| Multi-Host Support | Not supported | Supported | Supported |
| Hardware Cost | Low (no switch chip) | Medium (single/dual 89104) | High (single/dual 89144) |
| Power Consumption | Low | Medium (~38 W per chip) | Medium–High (~45–50 W per chip) |

2.2. Communication Performance (Bandwidth / Latency)

1) CPU ↔ GPU

  • Direct Attach: No forwarding delay, full bandwidth (PCIe 5.0 x16: ≈128 GB/s bidirectional).
  • Switch: Adds switch forwarding delay (~115 ns), and uplink bandwidth is shared (multiple GPUs contend).

2) GPU ↔ GPU (key difference)

  • Direct Attach: Traffic must be forwarded through the CPU, resulting in high latency, halved bandwidth, and proneness to congestion.
  • Switch: The switch forwards directly, yielding lower latency and higher bandwidth, and supports GPU-to-GPU peer-to-peer (Fabric Link).

3) Typical Data (PCIe 5.0)

  • Direct Attach: inter-GPU communication latency ≈ 300–500 ns, effective bandwidth ≈ 30–50 GB/s.
  • PEX89104/144: inter-GPU communication latency ≈ 150–200 ns, effective bandwidth ≈ 80–100 GB/s.
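The practical impact of those figures can be seen with a simple latency + size/bandwidth transfer model. A minimal sketch, assuming mid-range values from the ranges above (400 ns / 40 GB/s for direct attach, 175 ns / 90 GB/s for the switch); the helper `transfer_time_us` is illustrative, not a measurement tool:

```python
def transfer_time_us(size_bytes: float, latency_ns: float, bw_gb_s: float) -> float:
    """One-shot GPU-to-GPU transfer time in microseconds: latency + size/bandwidth."""
    return latency_ns / 1_000 + size_bytes / (bw_gb_s * 1e9) * 1e6

msg = 64 * 2**20  # a 64 MiB gradient shard
direct   = transfer_time_us(msg, latency_ns=400, bw_gb_s=40)
switched = transfer_time_us(msg, latency_ns=175, bw_gb_s=90)
print(f"direct ≈ {direct:.0f} µs, switched ≈ {switched:.0f} µs")
```

For large messages the bandwidth term dominates, so the switch's roughly 2× effective bandwidth matters far more than its lower hop latency.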

2.3. Software and Compatibility

1) Direct Attach: Simple driver requirements, best compatibility, no extra configuration; virtualization (SR-IOV) support is limited.

2) Switch: Requires switch driver/management software; supports advanced features (such as NTB, hot-plug, and bandwidth scheduling) and provides stronger virtualization and resource-pooling capabilities.

2.4. Application Scenarios

1) Direct Attach: Small-scale AI training/inference (≤8 GPUs), single-precision compute, general-purpose computing, cost-sensitive scenarios, or cases with minimal GPU-to-GPU communication.

2) PEX89104: Medium to large-scale AI training (8–16 GPUs), HPC, multi-GPU collaboration, or deployments requiring expansion slots for additional NICs/NVMe, cloud servers, or compute pooling.

3) PEX89144: Ultra-large-scale AI training (≥16 GPUs), multi-host GPU sharing, high-density compute nodes, and scenarios with extremely high scalability requirements.

3. Key Selection Criteria

1) GPU count: ≤8 GPUs → Direct Attach; 8–16 GPUs → PEX89104; ≥16 GPUs → PEX89144.

2) Inter-GPU Communication: communication-intensive (e.g., large-model training) → prefer Switch; sparse communication → Direct Attach.

3) Scalability Needs: requiring additional high-speed NICs, NVMe, or FPGAs → prefer Switch.

4) Cost and Power: limited budget or power-sensitive → Direct Attach; emphasis on performance and scalability → Switch.
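The criteria above can be encoded as a small decision helper. A sketch under stated assumptions: the function `recommend_topology` is hypothetical, and the overlapping boundaries in the rules (8 and 16 GPUs) are resolved here by the communication/expansion flags:

```python
def recommend_topology(num_gpus: int, comm_intensive: bool = False,
                       needs_expansion: bool = False) -> str:
    """Map the selection criteria in Section 3 onto a topology choice."""
    if num_gpus >= 16:
        return "PEX89144"
    if num_gpus > 8 or comm_intensive or needs_expansion:
        return "PEX89104"
    return "Direct Attach"

print(recommend_topology(4))                       # Direct Attach
print(recommend_topology(12))                      # PEX89104
print(recommend_topology(8, comm_intensive=True))  # PEX89104
print(recommend_topology(24))                      # PEX89144
```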

4. Conclusion

1) Direct Attach: Simple, low-cost, and low-latency, but with poor scalability and an inter-GPU communication bottleneck; suitable for small-scale scenarios.

2) PEX89104: Balanced expansion and performance, making it a mainstream choice for medium to large-scale AI/HPC.

3) PEX89144: Maximum expansion capability, aimed at ultra-large-scale clusters and high-density deployments.
