AMD: The AI-driven storage revolution, and the new trend of DPU-accelerated storage access
Overview
The rapid development of artificial intelligence (AI), especially the rise of generative AI, is profoundly changing the direction of storage system design and optimization. From the rapid adoption of ChatGPT to the expansion of large language models (LLMs), AI applications pose unprecedented challenges to the performance, capacity, and efficiency of storage systems. At the same time, the advent of data processing units (DPUs) offers new possibilities for meeting these challenges. In this article, we explore the trends in AI’s impact on storage systems, analyze the advantages of the AMD GPU ecosystem in the AI field and the application opportunities for DPUs in AI scenarios, and demonstrate, through a real-world case study, the significant effect of DPUs in accelerating storage access.
1. The impact of AI on storage systems
The rapid development of generative AI and its wide range of applications in areas such as text, image, and video generation.
The rapid growth of model size and its need for efficient storage and compute.
The impact of the imbalance in storage and bandwidth growth on the performance of AI systems.
The innovation of frameworks like DeepSpeed Zero-Infinity and the role of MLPerf benchmarking.
The importance of high throughput and high-capacity storage in AI systems.
2. AI ecosystem for AMD GPUs
The differences between Radeon GPUs and Instinct GPUs and their application scenarios.
Features of the open-source ROCm software stack and the deep learning frameworks it supports.
A detailed introduction to the HIP environment, compilers, debugging tools, etc.
3. DPU opportunities in AI scenarios
“Data center tax” issues due to hardware and software complexity.
How DPUs accelerate networking, storage, and compute tasks to improve data center performance and efficiency.
The launch of a number of DPU products and their application potential in AI scenarios.
4. Case Study: DPU-Accelerated Large Model Training
NVMe/TCP hardware acceleration and peer-to-peer communication optimization.
The collaborative work of AMD EPYC CPU, MI300 GPU and MangoBoost DPU.
Performance improvements and CPU core usage optimizations in FIO and DeepSpeed workloads.
Key Takeaways
L1 – Trends in AI’s Impact on Storage Systems
The rise of AI
The chart shows the rapid development of artificial intelligence (especially generative AI) in the global market, most notably breakthrough applications such as ChatGPT reaching 100 million users in just a few months. The chart also shows the projected growth of the global generative AI market, which is expected to grow significantly in the coming years, especially in the services sector. In addition, different application areas of AI, such as chatbots, image generation, and video generation, are gaining widespread adoption.
Overall, AI is fast becoming an important technological force, and its use cases are expanding day by day, covering a wide range of fields, from text generation to image and video generation.
Large Language Models (LLMs)
The diagram illustrates several of the leading large language models (LLMs) and highlights their sheer size and complexity. It displays important models such as ChatGPT, GPT-4, ERNIE 4.0, and Claude 3 Opus, along with their sizes and parameter counts across different fields and applications. Model sizes range from hundreds of millions to hundreds of billions of parameters and continue to grow.
In addition, as the scale of the model continues to expand, how to build an efficient LLM AI system has become an important issue. Currently, AI technology is in a stage of rapid development, involving increasingly complex and diverse models.
About LifeArchitect
LifeArchitect.ai is a website created by artificial intelligence expert Dr. Alan D. Thompson, dedicated to in-depth analysis of the development and application of artificial intelligence in the post-2020 era. The website has been hailed as the “gold standard for understanding, visualizing, and enabling post-2020 AI capabilities.” It offers more than 100 papers and articles, 300 videos, and a regularly published newsletter, The Memo, designed to provide in-depth insights to major AI labs, government and international organizations, research institutions, and individuals interested in the AI revolution.
For those interested in AI, LifeArchitect.ai is a valuable starting point.
AI systems are increasingly challenged by data-oriented tasks
The chart highlights the imbalance between the rapid growth of AI hardware computing power and the much slower growth of data storage and transmission bandwidth over the past decade. While hardware compute performance has increased by a factor of 60,000, storage bandwidth and interconnect bandwidth have increased by factors of only 100 and 30, respectively; in other words, compute has outgrown storage bandwidth by roughly 600x and interconnect bandwidth by roughly 2,000x. As a result, current large language models (LLMs) can no longer be processed by a single GPU card, especially for tasks that require large amounts of memory and data bandwidth, and GPU memory growth is much slower than the growth in LLM scale.
As models grow, AI systems require more storage and compute resources, and larger storage solutions may need to be adopted in the future.
Cite
In a recent article, Andy shared an AMD GPU VP’s thoughts on GPU business planning and HPC/AI strategy, which is worth a look. AMD’s influence in the data center compute market has surpassed Intel’s, and its competition with NVIDIA will intensify through cooperation with open organizations.
Storage-optimized AI frameworks and benchmarks
The chart shows two key trends: storage-aware training frameworks such as DeepSpeed ZeRO-Infinity, which extend model state across GPU, CPU, and NVMe storage; and benchmarks such as MLPerf, which measure how well storage systems keep accelerators fed with data.
The need for decoupled storage in AI systems
The diagram highlights that in AI systems, especially when dealing with large-scale AI models and data, the optimization and separation of storage resources becomes critical. Traditional storage methods tightly integrate compute and storage resources, while a decoupled storage architecture can improve storage throughput and capacity by separating the storage server from the GPU server.
Key takeaways:
The storage server uses multiple SSDs to effectively support the storage and access of large AI models and data.
By separating storage and computing, the system can flexibly allocate storage resources based on the needs of different AI workloads and avoid overprovisioning.
By equipping the GPU server with network interface cards (NICs) rather than filling it with local SSDs, the GPU server’s bandwidth efficiency improves and chassis space is saved.
The decoupled storage architecture provides a more efficient way for AI systems to allocate storage and compute resources, especially when processing large amounts of data and complex calculations, and this architecture helps improve overall performance.
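To make the trade-off concrete, here is a back-of-envelope sizing sketch; the per-GPU read rate, NIC speed, and SSD speed below are illustrative assumptions, not figures from the presentation.

```python
# Back-of-envelope sizing for a decoupled-storage GPU node.
# All figures below are illustrative assumptions, not numbers from the source.
import math

GPUS_PER_NODE = 8
READ_PER_GPU_GBPS = 3.0    # assumed training-data read rate per GPU (GB/s)
NIC_GBPS = 200 / 8         # one 200 Gbit/s NIC is roughly 25 GB/s
SSD_GBPS = 7.0             # one PCIe Gen4 NVMe SSD is roughly 7 GB/s

node_demand = GPUS_PER_NODE * READ_PER_GPU_GBPS  # 24 GB/s for the whole node

nics_needed = math.ceil(node_demand / NIC_GBPS)  # NICs to reach remote storage
ssds_needed = math.ceil(node_demand / SSD_GBPS)  # local SSDs for the same bandwidth

print(f"node demand: {node_demand:.0f} GB/s")
print(f"NICs needed: {nics_needed}, local SSDs needed: {ssds_needed}")
```

Under these assumptions, a single 200 Gbit NIC slot delivers what would otherwise take several local SSD slots, which is exactly the bandwidth-efficiency and space argument made above.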
L2 – The AI Ecosystem of AMD GPUs
AMD GPUs
The chart shows AMD’s two types of GPUs:
Radeon GPUs: mainly aimed at the consumer market, especially gaming. They can also be applied to AI and HPC tasks, although they are not designed specifically for these areas.
Instinct GPUs: designed for data center environments with a focus on AI and high-performance computing, including the Instinct MI series accelerators such as the MI300.
In addition, the ROCm development platform is an open-source platform that supports both GPU types (Radeon and Instinct), providing developers with flexible development tools.
ROCm development platform
The diagram details the ROCm (Radeon Open Compute) software stack, emphasizing that it is an open-source platform that provides developers with access to AMD GPU computing resources. ROCm consists of multiple libraries, tools, and runtime environments, enabling developers to efficiently develop and optimize tasks such as AI and HPC.
Key takeaways:
ROCm supports multiple deep learning frameworks (such as TensorFlow, PyTorch, etc.).
ROCm offers a variety of math and communication libraries to accelerate computation for AI and HPC applications. At the same time, system management tools and performance analysis tools also help developers optimize system performance.
ROCm supports multiple operating systems and is compatible with AMD’s Radeon and Instinct GPUs, ensuring extensive hardware support.
ROCm provides developers with a complete toolchain that supports AMD’s GPU architecture, helping to improve the efficiency and scalability of GPU computing tasks.
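As a quick illustration of the stack in practice, the sketch below checks whether a ROCm build of PyTorch can see an AMD GPU. It assumes a PyTorch wheel built for ROCm is installed; such builds reuse the familiar torch.cuda API and set torch.version.hip.

```python
# Minimal sanity check for PyTorch on ROCm.
# Assumes a ROCm build of PyTorch is installed.
import torch

# On ROCm builds, torch.version.hip is a version string; on CUDA builds it is None.
print("HIP runtime:", torch.version.hip)

# ROCm builds expose AMD GPUs through the familiar torch.cuda API.
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x  # runs on the AMD GPU via ROCm/HIP
    print("matmul ok:", tuple(y.shape))
else:
    print("no ROCm-visible GPU found")
```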
Note
Beyond advanced-process hardware, GPU applications depend even more critically on efficient upper-layer software libraries, which require large numbers of professional algorithm engineers. NVIDIA has a first-mover advantage in this field and has gathered top talent globally, which is why NVIDIA is so sought after: behind its commercial success lies years of foundational investment. This stands in contrast to the “fast-follower” theory of some domestic manufacturers, who hope to obtain others’ algorithmic capabilities through brute-force extraction.
ROCm Developer Tools
The diagram describes the development tools related to ROCm, that is, the tools and environments developers need to build AI and HPC applications on the ROCm platform. ROCm provides several kinds of support, including:
HIP: an environment for developing GPU-accelerated applications, including runtime libraries and kernel extensions.
Compilers: these include HIPCC, the compiler driver for HIP C++ code (historically implemented as a Perl wrapper around Clang), and FLANG, the LLVM-based Fortran compiler.
HIPIFY: tooling that converts existing CUDA code to HIP code, helping developers migrate to the ROCm platform.
These tools provide comprehensive support for developers to more easily develop efficient GPU computing applications on the ROCm platform.
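To give a feel for what the CUDA-to-HIP conversion involves, the toy sketch below mimics the mechanical renaming step on a small snippet. The real tools (hipify-perl, hipify-clang) perform much deeper source translation; this is only a conceptual illustration.

```python
# Toy illustration of the CUDA -> HIP porting idea: the HIP runtime API
# mirrors CUDA's with a "hip" prefix, so much of a port is mechanical renaming.
# The real tools (hipify-perl / hipify-clang) handle far more than this.
import re

cuda_snippet = """
cudaMalloc(&d_buf, n * sizeof(float));
cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
kernel<<<blocks, threads>>>(d_buf, n);
cudaFree(d_buf);
"""

# Rename cudaXxx API calls and enums like cudaMemcpyHostToDevice to hip equivalents.
hip_snippet = re.sub(r"\bcuda([A-Z]\w*)", r"hip\1", cuda_snippet)
print(hip_snippet)
# cudaMalloc -> hipMalloc, cudaMemcpy -> hipMemcpy, etc.; the <<<...>>> kernel
# launch syntax is also supported by HIP's compiler on AMD GPUs.
```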
L3 – DPU Opportunities in AI Scenarios
Modern data centers are struggling to scale
The chart illustrates the top two challenges faced by modern data centers when dealing with AI, big data, and cloud applications:
As more and more devices in the data center are added to the network (including NICs, SSDs, GPUs, NPUs, etc.), hardware complexity and management requirements increase significantly.
Emerging technologies, such as virtualization and NVMe-oF, have led to the growth of software stacks, making system management more complex.
These challenges have led to what is known as “data center tax” – a significant increase in CPU overhead and burden to support these complex tasks. For example, in Google and Facebook’s data centers, CPU overhead is 22-27% and 31-83%, respectively.
Key takeaway: growing hardware and software complexity imposes a “data center tax” that consumes a significant share of CPU resources, motivating the offload of infrastructure tasks to dedicated hardware.
DPUs accelerate data processing for all kinds of infrastructure
The chart highlights the role of DPUs (Data Processing Units) in data centers, especially for AI, big data, and cloud applications. Designed to accelerate the management of network, storage, and compute resources, DPUs can dramatically improve data center performance and scalability and reduce total cost of ownership (TCO) by optimizing data transfer, storage, and processing efficiency.
Key takeaways:
DPUs (including IPUs and ultra-intelligent NICs) can accelerate tasks such as virtualization, networking, and storage, and optimize the operation of hardware devices such as GPUs and NPUs.
DPUs help improve data center scalability, performance, and total cost of ownership, making them more efficient and economical.
DPUs are particularly well-suited for data-intensive tasks such as AI, big data, and cloud applications, where they can accelerate and improve the efficiency of resource management.
The introduction of DPUs offers data centers a more efficient and flexible infrastructure management solution, especially when dealing with modern and complex applications.
Note
Can it be understood this way: the CPU, as a centralized data-processing module built on the earliest x86 CISC complex instruction set, could no longer keep up with application requirements due to power consumption and execution efficiency. In the early 1990s, the computing industry began shifting to RISC reduced instruction sets (distilling the commonly used instructions that account for roughly 80% of execution), which gave rise to ARM and mobile computing. Today, decades later, with the explosion of data intelligence, the limited set of data-processing algorithms and functions is being extracted once again and cast into dedicated hardware, all driven by computing efficiency and data-intensive applications.
Reduced Instructions vs. Complex Instructions
Reduced Instruction Set Computing (RISC) and Complex Instruction Set Computing (CISC) are two different computer architectures, each with its unique design philosophy and performance characteristics. The following are their main differences:
1. Complexity of the instruction set: RISC keeps instructions simple and fixed-length, while CISC offers rich, variable-length instructions that can perform multi-step operations in a single instruction.
2. The number and type of instructions: RISC ISAs contain a small set of general-purpose instructions, while CISC ISAs include many specialized instructions and addressing modes.
3. Efficiency of instruction execution: most RISC instructions complete in a single cycle and pipeline well, while CISC instructions may take multiple cycles and are often decoded internally into micro-operations.
4. Design of hardware and software: RISC shifts complexity to the compiler, enabling simpler, lower-power hardware, while CISC places complexity in hardware and microcode, easing the burden on software at the cost of chip complexity.
A variety of DPU products entered the market
The chart illustrates DPU products from multiple vendors that are specifically designed to accelerate infrastructure and I/O processing in data centers. Unlike traditional CPUs and GPUs, DPUs are designed to handle data flow, storage operations, and network communication, thereby reducing the burden on the main processor and improving overall system efficiency.
Key takeaways:
DPUs (Data Processing Units) are primarily used to accelerate network, storage, and I/O-related tasks to improve the efficiency of data transmission and processing.
DPU products include the FPGA-based Alveo SmartNIC and ASIC-based products such as the Pensando DPU and the BlueField DPU, which provide more efficient solutions for different application scenarios.
Well-known companies such as Intel, NVIDIA, AMD, and MangoBoost are all driving the development and adoption of DPU products aimed at improving data center performance and scalability.
Application opportunities of DPUs in AI systems
The diagram illustrates the critical role of DPUs in AI systems, especially when dealing with high-bandwidth communication between GPU servers and storage data transfer, where they can provide significant performance gains. By leveraging technologies such as RDMA and NVMe over Fabric, DPUs are able to accelerate data interaction between GPUs and storage, improving the efficiency of the overall AI system.
Key takeaways:
DPUs can optimize communication between GPUs, peer-to-peer communication within nodes, and high-speed data transfer between GPUs and storage.
By leveraging efficient networking and storage technologies, DPUs can solve the I/O bottlenecks faced by GPU servers and further improve the overall performance of AI systems.
For example, the use of AMD GPUs and MangoBoost DPUs to optimize the communication between GPUs and remote storage demonstrates the power of DPUs in solving storage and data transfer problems.
This demonstrates the potential of DPUs in data-intensive AI applications, especially when dealing with large-scale data.
L4 – Case Study: DPU-Accelerated Training of Large Models
MangoBoost GPU Storage Boost (GSB) – (1) NVMe/TCP hardware acceleration
The diagram shows how the NVMe/TCP stack can be offloaded to the DPU (Data Processing Unit) to simplify the storage communication software stack and significantly improve data transfer performance. MangoBoost DPUs use hardware acceleration mechanisms, including FPGAs for NVMe/PCIe virtualization and protocol conversion, along with embedded ARM processors that handle the control path, increasing data processing speed throughout the storage system.
Key takeaways:
By offloading the NVMe/TCP stack to the DPU, MangoBoost improves the performance of the storage system and reduces the burden on the CPU.
The FPGA handles protocol conversion on the data path, while the ARM processor manages the control path.
With hardware acceleration, the software stack becomes simpler, improving the efficiency and performance of the overall system.
DPU acceleration enables more efficient storage operations and data transfers, especially in data centers and storage-intensive applications.
This diagram shows how MangoBoost optimizes data transfer and storage operations through innovative hardware acceleration solutions, which has great potential for applications in areas such as AI and high-performance computing.
Note
To put it bluntly: the TCP protocol stack that would normally be processed at the OS level is pushed out into the DPU via NVMe/TCP offload. Combined with AI applications, the analysis becomes clear: however much CPU the TCP-layer network communication consumes during inference is exactly what can be cut from the CPU and handed to the DPU. This also explains why a device doing network-communication optimization is nonetheless called a data processing unit: beyond this, in AI scenarios the DPU can certainly do much more. That also answers why today’s DPUs are only getting started.
MangoBoost GPU Storage Boost (GSB) – (2) Peer-to-peer communication
The diagram shows how the data transfer path is optimized by enabling peer-to-peer communication between the DPU and the GPU, addressing the performance bottlenecks caused by CPU, memory, and PCIe contention in traditional architectures. With peer-to-peer communication, GPUs and DPUs exchange data directly, reducing transfer latency and improving the overall efficiency of the storage system.
Key takeaways:
Direct data exchange between the DPU and GPU removes bottlenecks in the transfer path, improving data transfer speed and efficiency.
Direct GPU-to-DPU communication avoids contention for CPU, memory, and PCIe resources, improving overall performance in data centers and compute-intensive applications.
This solution can significantly optimize the GPU data path, reduce latency, and improve the performance of the storage system.
The design of this architecture is important for applications that require efficient, low-latency data processing, such as AI, machine learning, and big data analytics.
GPU Storage Boost (GSB): File API
The diagram shows how MangoBoost’s hardware acceleration solution can be used to optimize the file API, so that user applications (e.g., FIO, DeepSpeed) can use hardware acceleration to increase data transfer speed, especially when processing high-performance computing and storage tasks.
Key takeaways:
With MangoBoost’s hardware acceleration, the file system and storage protocols are processed more efficiently, reducing the burden on the CPU and the network.
MangoBoost’s DPU and GPU work together to accelerate data exchange through peer-to-peer communication, improving storage and compute efficiency.
The Mango DPU works with the GPU to further optimize GPU storage acceleration and improve the performance of applications.
This architecture is suitable for application scenarios that require efficient storage and fast computing, such as deep learning, big data analytics, and other fields, especially when large amounts of data need to be accessed quickly.
Note
This should be compared with Qistor’s solution, summarized a few days ago. In essence, both accelerate data access with dedicated hardware: Qistor implements KV storage acceleration on an FPGA and abstracts an object-oriented storage API, while MangoBoost exposes general file APIs. The former could be seen as a subset of the latter, but Qistor is acceleration hardware aimed directly at storage; the two approaches ought to be integrated, as keeping them separate is hard for customers to accept.
For more information about Qistor’s hardware implementation of the KV storage solution, refer to the earlier write-up.
MangoFile
The table illustrates the role of the MangoFile library in the data transfer path. By exchanging data directly between GPU memory and storage devices, MangoFile provides efficient file I/O operations. It builds on the ROCm and ROCK kernel drivers, uses peer-to-peer direct access to reduce the burden on the CPU, and uses the NVMe driver for DMA address mapping and storage command submission to achieve fast data transfer.
Key takeaways:
MangoFile registers GPU memory, obtains the corresponding information, and performs file I/O operations.
The ROCm and ROCK core drivers enable direct peer-to-peer communication between the GPU and the storage device, optimizing the data transmission path.
Through the NVMe driver, MangoFile manages data transfers efficiently, ensuring low latency and high throughput.
The MangoFile library simplifies file I/O operations, makes full use of GPU storage acceleration, and improves the performance of the file system.
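The source does not show MangoFile’s actual API, so the sketch below is a hypothetical illustration of the usage pattern the table describes (register GPU memory, then issue file I/O directly into it). Every module and function name here is invented for illustration.

```python
# Hypothetical sketch of a GPU-direct file I/O pattern like the one the
# MangoFile table describes. The "mangofile" module and its functions are
# invented for illustration; the real MangoFile API is not shown in the source.
import torch
import mangofile  # hypothetical binding, not a real package

# Allocate a buffer in GPU memory (a ROCm build of PyTorch is assumed).
buf = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# 1) Register the GPU buffer so the library can obtain its DMA mapping.
handle = mangofile.register_gpu_buffer(buf.data_ptr(), buf.numel())

# 2) Read file data directly into GPU memory: the NVMe driver DMAs the
#    blocks peer-to-peer into the GPU, without staging through a CPU buffer.
with mangofile.open("/mnt/dataset/shard-0000.bin", "rb") as f:
    f.read_into(handle, offset=0, length=buf.numel())

mangofile.unregister(handle)
```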
Test System
The diagram shows a high-performance test system equipped with AMD’s EPYC CPUs, MI300 GPUs, and MangoBoost DPUs, which are particularly suitable for data-intensive tasks such as AI computing and storage acceleration. The DPUs and GPUs in the configuration are connected at high speed via PCIe Gen5, ensuring high bandwidth and low latency for the system.
Key takeaways:
The system is equipped with AMD EPYC CPU and AMD MI300 GPU to provide powerful support for parallel computing and graphics processing.
Accelerate data transfer and storage operations with MangoBoost DPUs, especially in scenarios that require high bandwidth, where the DPU is able to improve overall performance.
High-speed connectivity is provided for the individual components (CPU, GPU, DPU) to ensure efficient data flow.
The system runs Ubuntu 22.04.3 LTS, with GRUB kernel parameters configured for large-scale data processing and virtualization.
Assessment 1: FIO Benchmark
Control group (left): GPU system with an ordinary NIC. Experiment group (right): GPU system with the MangoBoost DPU.
FIO Micro Benchmark – Results
The solution delivers a 1.7x to 2.6x bandwidth increase, approaching network line speed, with the gains most pronounced at 1 MB and 2 MB block sizes, indicating that the GPU storage acceleration scheme also effectively reduces the CPU burden.
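The actual FIO job file is not given in the source, but a large-block sequential-read microbenchmark of the kind described might look roughly like this; the device path, queue depth, and job count are assumptions.

```python
# Sketch of a large-block sequential-read FIO job like the one described
# (the actual job file is not given in the source; device path, queue depth,
# and job count here are assumptions).
import subprocess

subprocess.run(
    ["fio",
     "--name=seqread",
     "--filename=/dev/nvme1n1",  # assumed: the NVMe/TCP-attached namespace
     "--rw=read",                # sequential reads
     "--bs=1M",                  # 1 MB blocks (results also cite 2 MB)
     "--direct=1",               # bypass the page cache
     "--ioengine=libaio",
     "--iodepth=32",
     "--numjobs=4",
     "--runtime=60", "--time_based",
     "--group_reporting"],
    check=True,
)
```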
Assessment 2: DeepSpeed Workload – Software Setup
The diagram shows how MangoBoost DPUs can be leveraged to accelerate data exchange by modifying DeepSpeed‘s backend (its swap module), especially in workloads such as high-performance computing (HPC) and deep learning. With the accelerated swap mode enabled, the speed of data exchange between GPU and storage improves significantly.
Key takeaways:
By enabling the MangoBoost DPU, the normal swap mode is replaced with an accelerated swap mode to optimize data transfer.
MangoBoost’s storage engine offloads I/O operations to hardware, with the DPU providing efficient peer-to-peer data transfer, significantly improving the efficiency of data flow between storage and compute.
Leveraging the synergy of Mango DPUs and AMD GPUs, the performance of the entire DeepSpeed workload is boosted.
The value of the DeepSpeed project in AI scenarios
Reference reading: How does DeepSpeed optimize inference performance from the storage layer?
1. Memory and parallelism optimization: DeepSpeed, part of Microsoft’s large-scale AI initiative, contains a powerful memory- and parallelism-optimization toolkit designed for efficient large-scale model training and inference on modern GPU clusters. It scales across heterogeneous memory (GPU, CPU, and NVMe), significantly improving computing efficiency.
2. Reducing the burden on the GPU: during inference, DeepSpeed effectively reduces GPU load by offloading model parameters to NVMe storage and the KV cache to CPU memory, improving inference efficiency, especially when working with large-scale models.
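As an illustration of what such offload looks like in practice, here is a minimal ZeRO-Infinity-style DeepSpeed configuration that swaps parameters and optimizer state to NVMe. Paths and buffer sizes are illustrative, and this shows the stock DeepSpeed swap mechanism, not MangoBoost’s DPU-accelerated swap backend.

```python
# Minimal ZeRO-Infinity-style DeepSpeed config that offloads parameters and
# optimizer state to NVMe. Paths and sizes are illustrative; this is the stock
# DeepSpeed swap mechanism, not MangoBoost's modified swap backend.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,                      # ZeRO stage 3: partition parameters too
        "offload_param": {
            "device": "nvme",            # swap parameters to NVMe storage
            "nvme_path": "/local_nvme",  # assumed mount point of the SSD/target
            "buffer_count": 5,
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
        },
    },
    "aio": {                             # async I/O engine used for swapping
        "block_size": 1048576,           # 1 MB I/O blocks
        "queue_depth": 32,
    },
}

# Typical use (model definition omitted for brevity):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```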
DeepSpeed Workloads – Results
The chart illustrates the significant performance gains of MangoBoost GPU Storage Acceleration (GSB) in DeepSpeed workloads:
GPU storage acceleration delivers 1.7x the data-transfer bandwidth of the traditional approach of CPU buffers with software NVMe/TCP.
GPU storage acceleration significantly reduces the use of CPU cores, saving 25 CPU cores, making computing resources more efficient.
This optimization provides significant performance gains in AI training frameworks, especially in terms of data transfer and computational efficiency.
Summary
AMD Instinct GPUs and AMD ROCm™ software: powerful hardware and software support for AI workloads.
MangoBoost provides a comprehensive DPU solution, GPU Storage Boost (GSB), to optimize data transfer and processing.
FIO benchmark: 1.7x to 2.6x higher throughput and 22-37 CPU cores saved.
DeepSpeed workload: up to 1.7x higher throughput and 25 CPU cores saved.