Plain-Language GPU Server Hardware Composition and Disassembly: What Sales Need to Know
GPU Server Overview and Exterior
Although professional GPU servers from different vendors differ in specific design details, most share a similar overall structure. Understanding these vendor designs helps deepen knowledge of GPU server hardware components and structures.
Semi-disassembled view of an NVIDIA DGX A100 server:
ASUS (vendor) HGX H100 server GPU module and front tray (partially pulled):
Front view / front panel of some vendors’ GPU servers:
Rear view / rear panel of some vendors’ air-cooled GPU servers:
Rear view / rear panel of a vendor’s liquid-cooled GPU server:
GPU Server Modules and Components
The following diagram shows the major modules of a GPU server:
The following figure shows the individual parts within a GPU server:
Among the parts of a GPU server, the two core modules (excluding the chassis) are:
1. The GPU node (the GPU module board and attached components).
2. The CPU compute node, also called the front head (the CPU compute board and attached components).
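For readers who find a structured view easier to follow, the breakdown above can be sketched as a small data model. This is only an illustrative sketch: the class and field names (`Module`, `GpuServer`, `core_modules`) are hypothetical and do not come from any vendor tool.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """One top-level module of the server (hypothetical model, for illustration only)."""
    name: str
    description: str


@dataclass
class GpuServer:
    """A GPU server seen as a chassis plus its two core modules."""
    chassis: str
    core_modules: list[Module] = field(default_factory=list)


server = GpuServer(
    chassis="Chassis",
    core_modules=[
        Module("GPU node", "GPU module board and attached components"),
        Module("CPU compute node", "CPU compute board and attached components"),
    ],
)

# List the two core modules and what each one contains.
for module in server.core_modules:
    print(f"{module.name}: {module.description}")
```

Everything described in the later sections belongs to one of these two modules, which is a useful way to keep track of the parts when talking to customers.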
GPU Module and Composition
GPU module overview:
Key components of a GPU module:
- GPU module board (UBB, Universal Baseboard): Hosts multiple GPUs integrated into a matrix platform, providing high-speed data exchange among GPUs and between GPUs and CPUs.
- OAM GPU module: A GPU form factor based on the Open Accelerator Module (OAM) standard (e.g., SXM A100 GPUs) that can be installed on the UBB.
- NVSwitch chips: Enable ultra-high-speed communication between multiple GPUs.
- GPU heatsink / cooler: Provides thermal dissipation for the GPUs.
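To get a feel for why a dedicated switching layer between GPUs is useful, the short sketch below counts how many distinct GPU-to-GPU pairs exist on one board. It assumes 8 GPUs per board (as on an NVIDIA DGX A100); other systems may differ. It does not model NVSwitch hardware or bandwidth, only the number of pairs that may need to exchange data.

```python
from itertools import combinations

# Assumption: 8 GPUs on the board, as on an NVIDIA DGX A100; other systems differ.
GPU_COUNT = 8

gpus = [f"GPU{i}" for i in range(GPU_COUNT)]

# Every unordered pair of GPUs may need to exchange data with each other.
gpu_pairs = list(combinations(gpus, 2))

print(f"{GPU_COUNT} GPUs form {len(gpu_pairs)} distinct GPU-to-GPU pairs")
```

With 8 GPUs there are 28 such pairs, which gives a rough sense of why GPU-to-GPU traffic is handled by switch chips on the module board.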
CPU Compute Node (Front Head) and Components
The CPU compute node (front head) includes the following parts:
- 1-CPU compute node chassis cover: Mounts on the CPU compute node chassis to protect internal components.
- 2-Storage controller card: Provides RAID support for SAS/SATA drives, including RAID configuration and expansion, firmware updates, and remote setup.
- 3-Riser / Riser card: Adapter card that allows PCIe cards to be mounted into the server via riser connections.
- 4-Supercapacitor mounting bracket: Used to secure supercapacitors within the chassis.
- 5-Server management module: Provides a variety of I/O interfaces and out-of-band management (BMC/HDM) functionality.
- 6-OCP adapter module: Used to install OCP network cards (designed according to Open Compute Project specifications).
- 7-Air shroud / airflow guide: Provides cooling airflow channels for CPU heatsinks and memory, and accommodates supercapacitor placement.
- 8-CPU heatsink shroud: Provides cooling for the CPU.
- 9-Memory (RAM): Temporarily stores compute data for the CPU and exchanges data with persistent storage. Supports DDR5 RDIMM or LRDIMM memory.
- 10-CPU: Integrates the memory controller and PCIe controller, supplying the server’s primary compute capability.
- 11-Standard PCIe network card: A NIC that installs into a standard PCIe slot.
- 12-Rear drive cage: Supports expansion of rear-mounted drives.
- 13-Network card adapter: Available in 4-slot and 8-slot variants to accommodate different numbers of NICs.
- 14-OCP network card: Can only be installed in the OCP adapter module.
- 15-Busbar: Connects power between the power conversion board and the PCIe switch adapter board.
- 16-Power conversion board: Supplies power to GPU nodes and reports power status to the motherboard.
- 17-Encryption / security module: Provides cryptographic services to enhance data security.
- 18-M.2 SSD card: Provides storage media for the server.
- 19-Supercapacitor: Supplies power to storage controller flash in the event of unexpected power loss to protect data.
- 20-CPU compute node power module: Provides power conversion for the CPU compute node. Supports hot-swap and 1+1 redundancy.
- 21-GPU power module: Provides power conversion for GPU nodes, fans, front drive bays, and NIC adapter modules. Supports hot-swap and typically 3+3 redundancy.
- 22-PCIe switch adapter board: Used to extend PCIe signals; houses a PCIe switch to efficiently interconnect GPUs with storage and NICs.
- 23-Motherboard: One of the server’s most critical components; hosts CPU, memory, BIOS, PCIe slots, and other fundamental system components.
- 24-Compute node chassis: The enclosure that houses all compute node components.
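The 1+1 and 3+3 redundancy figures quoted for the power modules can be made concrete with a little arithmetic. The sketch below assumes the common reading of "N+N": N modules are enough to carry the load, and N more are installed as spares. The helper `survives_failures` and the configuration labels are hypothetical; actual behavior depends on the vendor's design and on the real power draw.

```python
def survives_failures(installed: int, required: int, failed: int) -> bool:
    """Return True if enough power modules remain after `failed` modules fail."""
    if not 0 <= failed <= installed:
        raise ValueError("failed must be between 0 and installed")
    return installed - failed >= required


# Assumed reading of N+N redundancy: N modules carry the load, N more are spares.
configs = {
    "CPU compute node (1+1)": (2, 1),  # (installed, required)
    "GPU power (3+3)": (6, 3),
}

for name, (installed, required) in configs.items():
    tolerated = installed - required
    print(f"{name}: tolerates up to {tolerated} failed module(s)")
```

Under this reading, a 1+1 configuration keeps running with one failed module, and a 3+3 configuration keeps running with up to three failed modules.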