AMD: AI-driven storage revolution, DPU accelerates the new trend of storage access

.

Overview

The rapid development of artificial intelligence (AI), and especially the rise of generative AI, is profoundly changing the direction of storage system design and optimization. From the rapid adoption of ChatGPT to the expansion of large language models (LLMs), AI applications pose unprecedented challenges to the performance, capacity, and efficiency of storage systems. At the same time, the advent of data processing units (DPUs) offers new possibilities for meeting these challenges. In this article, we explore how AI is reshaping storage systems, analyze the advantages of the AMD GPU ecosystem in the AI field and the application opportunities for DPUs in AI scenarios, and use a real-world case study to demonstrate the significant effect of DPUs in accelerating storage access.

.

1. The impact of AI on storage systems

 The rise of AI

The rapid development of generative AI and its wide range of applications in areas such as text, image, and video generation.

 Large Language Models (LLMs)

The rapid growth of model size and its need for efficient storage and compute.

 Challenges of data-oriented tasks

The impact of the imbalance in storage and bandwidth growth on the performance of AI systems.

 Storage-optimized AI framework

The innovation of frameworks like DeepSpeed Zero-Infinity and the role of MLPerf benchmarking.

 Separate storage requirements

The importance of high throughput and high-capacity storage in AI systems.

2. AI ecosystem for AMD GPUs

 AMD GPU classification

The differences between Radeon GPUs and Instinct GPUs and their application scenarios.

 ROCm development platform

Features of the open source software stack with supported deep learning frameworks.

 ROCm Developer Tools

A detailed introduction to the HIP environment, compilers, debugging tools, etc.

3. DPU opportunities in AI scenarios

 The challenges of the modern data center

“Data center tax” issues due to hardware and software complexity.

 The role of the DPU

Accelerate networking, storage, and compute tasks to improve data center performance and efficiency.

 DPU market development

The launch of a number of DPU products and their application potential in AI scenarios.

4. Case Study: DPU-Accelerated Large Model Training

 MangoBoost GPU Storage Acceleration (GSB)

NVMe/TCP hardware acceleration and point-to-point communication optimization.

 Test system configuration

The collaborative work of AMD EPYC CPU, MI300 GPU and MangoBoost DPU.

 Benchmark results

Performance improvements and CPU core usage optimizations in FIO and DeepSpeed workloads.

What you will gain from this article

1.Gain a deeper understanding of the impact of AI on storage systems and its future trends.
2.Learn about the benefits and applications of the AMD GPU ecosystem in AI scenarios.
3.Learn about the key role of DPUs in AI storage optimization and real-world cases.
4.See the significant performance improvement and resource optimization effect of DPUs in accelerating storage access.

.

L1 – Trend of the impact of AI on storage systems

The rise of AI

The chart shows the rapid development of artificial intelligence (especially generative AI) in the global market, most notably the breakthrough of applications represented by ChatGPT, which reached 100 million users in just a few months. The chart also shows the projected growth of the global generative AI market, which is expected to expand significantly in the coming years, especially in the services sector. Moreover, different AI application areas, such as chatbots, image generation, and video generation, are seeing widespread adoption.

Overall, AI is fast becoming an important technological force, and its use cases are expanding day by day, covering a wide range of fields, from text generation to image and video generation.

.

Large Language Models (LLMs)

The diagram illustrates several of the leading large language models (LLMs) and highlights their sheer size and complexity. It mainly shows several important models, such as ChatGPT, GPT-4, ERNIE 4.0, and Claude 3 Opus, along with their sizes and parameter counts across different fields and applications. Models range in size from hundreds of millions to hundreds of billions of parameters and continue to grow.

In addition, as the scale of the model continues to expand, how to build an efficient LLM AI system has become an important issue. Currently, AI technology is in a stage of rapid development, involving increasingly complex and diverse models.

About LifeArchitect

LifeArchitect.ai is a website created by artificial intelligence expert Dr. Alan D. Thompson, dedicated to in-depth analysis and elaboration of the development and application of artificial intelligence in the post-2020 era. The website has been hailed as the “gold standard for understanding, visualizing, and enabling post-2020 AI capabilities.” It offers more than 100 papers and articles, 300 videos, and a regularly published newsletter, The Memo, designed to provide in-depth insights to major AI labs, government and international organizations, research institutions, and individuals interested in the AI revolution.

For those interested in AI, LifeArchitect.ai provides the following value:

1.In-depth research and analysis: The website contains a wealth of detailed research on AI models, datasets, algorithms, and applications, helping readers gain an in-depth understanding of the latest advances and trends in AI.
2.The Latest Artificial Intelligence Reports: Dr. Thompson regularly publishes annual review and forward-looking reports on artificial intelligence, such as “Integrated AI: The sky is comforting (2023 AI retrospective)”, providing readers with a comprehensive perspective on the field of artificial intelligence.
3.Multimedia resources: The website provides rich video content covering all aspects of artificial intelligence, helping readers understand complex concepts and techniques more intuitively.
4.Professional AI Model Analysis: LifeArchitect.ai provides comprehensive analysis and comparison of large language models (LLMs), including GPT-3, GPT-4, PaLM, and other models, specifically regarding the size, capabilities, and training data of these models.
5.AI tools and resources: The website provides a variety of AI tools and resources, such as ALPrompt and Datasets Table, for researchers and developers to refer to and use.

.

AI systems are increasingly challenged by data-oriented tasks

The chart highlights the imbalance between the rapid growth of AI hardware computing power and the much slower growth of data storage and transmission bandwidth over the past decade. While hardware computing performance has increased by a factor of 60,000, storage and interconnect bandwidth have only increased by factors of 100 and 30, respectively. As a result, current large language models (LLMs) can no longer be processed by a single GPU card, especially for tasks that require large amounts of memory and data bandwidth, and GPU memory growth is much slower than the growth in LLM scale.

As models grow, AI systems require more storage and compute resources, and larger storage solutions may need to be adopted in the future.

===

 In the upper-left figure, the gray line compares the evolution of computing power across different data-center GPUs (giving a sense of the compute scale of each GPU), the green line at the bottom represents the high-bandwidth memory (HBM) capacity trend, and the blue line represents the peak rate of GPU interconnect technologies (PCIe and NVLink).
 In the lower-left figure, the red line is the evolving model parameter size as datasets continue to grow, and the green line is the standard memory capacity of data-center GPU cards.
 The lower-right figure shows the parameter counts of different large models and the matching AMD data center GPUs.

Reference

In a recent article, Andy shared the thoughts of AMD's GPU VP on GPU business planning and HPC/AI strategy, which is worth a look. AMD's influence in the data-center compute market has surpassed Intel's, and its competition with NVIDIA will intensify as it cooperates with open-standards organizations.

.

Storage-optimized AI frameworks and benchmarks

The chart shows two key trends:

1.Storage-optimized AI frameworks: As AI models become larger, traditional memory and storage architectures are no longer able to effectively support these massive computing tasks. DeepSpeed Zero-Infinity is an innovative framework that makes the training of large-scale AI models more efficient and avoids memory bottlenecks by allowing data to spill over from GPU memory into CPU memory and NVMe SSDs (a hedged configuration sketch follows this list).
2.The advent of AI benchmarks: To assess the impact of storage on AI systems, benchmarks like MLPerf have emerged. These tests measure the efficiency of AI training tasks supported by storage and computing resources, helping to optimize hardware and system configurations.
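To make the idea concrete, the following is a minimal sketch, assuming a local NVMe mount at /mnt/nvme (the path, batch size, and precision settings are illustrative), of a ZeRO-Infinity style DeepSpeed configuration that lets parameters and optimizer state spill from GPU memory to NVMe:

```python
# Minimal sketch of a ZeRO-Infinity style DeepSpeed config; values are illustrative.
import json

ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer state
        "offload_param":     {"device": "nvme", "nvme_path": "/mnt/nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/mnt/nvme"},
    },
    "bf16": {"enabled": True},
}

# Written to disk so it can be passed to the DeepSpeed launcher, e.g.:
#   deepspeed train.py --deepspeed --deepspeed_config ds_config.json
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Whether NVMe offload actually pays off then depends on the storage throughput behind that path, which is exactly where the decoupled storage and DPU discussion below comes in.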

.

The need for decoupled storage in AI systems

The diagram highlights that in AI systems, especially when dealing with large-scale AI models and data, the optimization and separation of storage resources becomes critical. Traditional storage methods tightly integrate compute and storage resources, while a decoupled storage architecture can improve storage throughput and capacity by separating the storage server from the GPU server.

Key takeaways:

1.High throughput and high-capacity storage

The storage server uses multiple SSDs to effectively support the storage and access of large AI models and data.

2.Flexible configuration

By separating storage and computing, the system can flexibly allocate storage resources based on the needs of different AI workloads and avoid overprovisioning.

3.Improve bandwidth utilization

By accessing remote storage through a network interface card (NIC) instead of attaching SSDs directly to the GPU server, you can improve the bandwidth efficiency of your GPU server and save space.

The decoupled storage architecture provides a more efficient way for AI systems to allocate storage and compute resources, especially when processing large amounts of data and complex calculations, and this architecture helps improve overall performance.
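As a concrete illustration, decoupled storage of this kind is typically attached over NVMe-oF. The following is a minimal sketch, assuming the standard nvme-cli tool is installed and run with root privileges; the target address and subsystem NQN are placeholders:

```python
# Attach a remote NVMe namespace over TCP from a GPU server (sketch; values are placeholders).
import subprocess

target_ip = "192.0.2.10"                       # hypothetical storage-server address
nqn = "nqn.2024-01.example:ai-dataset"         # hypothetical NVMe subsystem NQN

subprocess.run(
    ["nvme", "connect",
     "-t", "tcp",        # NVMe/TCP transport
     "-a", target_ip,    # target address
     "-s", "4420",       # standard NVMe/TCP service port
     "-n", nqn],         # subsystem to attach
    check=True,
)
# The remote namespace then shows up as a local block device (e.g. /dev/nvme1n1)
# that the training nodes can mount or read directly.
```

With a plain software NVMe/TCP stack, this path consumes host CPU cycles; the DPU case study later in the article offloads exactly this protocol processing.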

.

L2 – The AI Ecosystem of AMD GPUs

AMD GPUs

The chart shows AMD’s two types of GPUs:

1.Radeon GPU

These GPUs are mainly used in the consumer market, especially in gaming. However, they can also be applied to AI and HPC tasks, although they are not designed specifically for these areas.

2.Instinct GPU

These are GPUs designed for data center environments with a focus on AI and high-performance computing. They include:

 CDNA architecture, optimized for AI and HPC applications.
 HBM (High Bandwidth Memory) for efficient data transmission.
 Infinity Fabric provides high-speed data interconnection.

In addition, the ROCm development platform is an open-source platform that supports both GPU types (Radeon and Instinct), providing developers with flexible development tools.

.

ROCm development platform

The diagram details the ROCm (Radeon Open Compute) software stack, emphasizing that it is an open-source platform that provides developers with access to AMD GPU computing resources. ROCm consists of multiple libraries, tools, and runtime environments, enabling developers to efficiently develop and optimize tasks such as AI and HPC.

Key takeaways:

1.Extensive support framework

ROCm supports multiple deep learning frameworks (such as TensorFlow, PyTorch, etc.).

2.Feature-rich libraries and tools

ROCm offers a variety of math and communication libraries to accelerate computation for AI and HPC applications. At the same time, system management tools and performance analysis tools also help developers optimize system performance.

3.Cross-platform support

ROCm supports multiple operating systems and is compatible with AMD’s Radeon and Instinct GPUs, ensuring extensive hardware support.

ROCm provides developers with a complete toolchain that supports AMD’s GPU architecture, helping to improve the efficiency and scalability of GPU computing tasks.
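As a quick illustration of the framework support mentioned above, the sketch below assumes a ROCm build of PyTorch is installed on a machine with an AMD GPU; on ROCm builds PyTorch reuses the torch.cuda namespace, so existing CUDA-style code generally runs unchanged:

```python
# Verify that a ROCm build of PyTorch can see and use an AMD GPU (sketch).
import torch

print("ROCm/HIP version:", torch.version.hip)   # a version string on ROCm builds, None otherwise
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    device = torch.device("cuda")                # maps to the ROCm backend on AMD GPUs
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device=device)
    y = x @ x                                    # matrix multiply executed on the GPU
    print("Result norm:", y.norm().item())
```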

Note

In addition to advanced process nodes on the hardware side, GPU applications depend even more on efficient upper-layer software libraries, which require large numbers of specialist algorithm engineers. NVIDIA has a first-mover advantage in this field and has gathered top talent from around the world, which is why it is so sought after; behind its commercial success lies a great deal of foundational investment. This is a far cry from the “follower theory” of some domestic manufacturers, who hope to obtain other people's algorithm capabilities through brute-force shortcuts.

.

ROCm Developer Tools

The diagram describes the development tools related to ROCm, mainly for developers to develop the tools and environments required to develop AI and HPC applications on the ROCm platform. ROCm provides developers with a variety of supports, including:

1.HIP environment

This is an environment for developing GPU-accelerated applications, including runtime libraries and kernel extensions.

2.Compilers

These include HIPCC, the compiler driver for HIP C++ code (historically implemented as a Perl script) that supports GPU computation, and FLANG, the LLVM-based Fortran compiler.

3.Hipify

This is a tool that converts existing CUDA code to ROCm code to help developers migrate to the ROCm platform.

4.ROCm CMake: A tool that simplifies the building of ROCm applications.
5.ROCgdb: A tool for debugging ROCm applications.

These tools provide comprehensive support for developers to more easily develop efficient GPU computing applications on the ROCm platform.
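For illustration, here is a minimal sketch of the migration step described above, assuming hipify-perl from a ROCm installation is on the PATH; the input file name is hypothetical:

```python
# Convert a CUDA source file to HIP using ROCm's hipify-perl tool (sketch).
import subprocess

cuda_src = "vector_add.cu"        # hypothetical CUDA input file
hip_src = "vector_add.hip.cpp"    # converted HIP output file

# hipify-perl rewrites CUDA API calls (cudaMalloc, kernel <<<...>>> launches, ...)
# into their HIP equivalents and writes the converted source to stdout.
with open(hip_src, "w") as out:
    subprocess.run(["hipify-perl", cuda_src], stdout=out, check=True)

print(f"Converted {cuda_src} -> {hip_src}; it can now be built with hipcc.")
```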

.

L3 – DPU Opportunities in AI Scenarios

The modern data center is no longer scalable

The chart illustrates the top two challenges faced by modern data centers when dealing with AI, big data, and cloud applications:

1.Hardware complexity

As more and more devices in the data center are added to the network (including NICs, SSDs, GPUs, NPUs, etc.), hardware complexity and management requirements increase significantly.

2.Software complexity

Emerging technologies, such as virtualization and NVMe-oF, have led to the growth of software stacks, making system management more complex.

These challenges have led to what is known as “data center tax” – a significant increase in CPU overhead and burden to support these complex tasks. For example, in Google and Facebook’s data centers, CPU overhead is 22-27% and 31-83%, respectively.

Key takeaways:

 Data centers face ever-increasing hardware and software complexity.
 As more hardware and software components are introduced, the overhead of CPUs increases.
 To meet these challenges, more efficient hardware and software architecture design is required.

.

DPUs accelerate data processing for all kinds of infrastructure

The chart highlights the role of DPUs (Data Processing Units) in data centers, especially when dealing with AI, big data, and cloud applications. Designed to accelerate the management of network, storage, and compute resources, DPUs can dramatically improve data center performance and scalability and reduce total cost of ownership (TCO) by optimizing data transfer, storage, and processing efficiency.

Key takeaways:

1.DPU acceleration function

DPUs (including IPUs and smart NICs) can accelerate tasks such as virtualization, networking, and storage, and optimize the operation of hardware devices such as GPUs and NPUs.

2.Boost data center performance

DPUs help improve data center scalability and performance and reduce total cost of ownership, making data centers more efficient and economical.

3.Wide range of applications

DPUs are particularly well-suited for data-intensive tasks such as AI, big data, and cloud applications, where they can accelerate and improve the efficiency of resource management.

The introduction of DPUs offers data centers a more efficient and flexible infrastructure management solution, especially when dealing with modern and complex applications.

Note

Can it be understood this way: starting from the earliest x86 CISC complex instruction sets, the CPU, as a centralized data-processing module, could no longer keep up with application demands because of power consumption and execution efficiency. In the early 1990s the computing industry began to shift toward RISC reduced instruction sets (distilling the 80% of execution commands that are most commonly used), which gave rise to ARM and mobile computing. Today, nearly 40 years later, with the explosion of data intelligence, the limited set of algorithms/functions used for data processing is being extracted once again and implemented as dedicated hardware. All of this is driven by computing efficiency and data-intensive applications.

Reduced Instructions vs. Complex Instructions

Reduced Instruction Set Computing (RISC) and Complex Instruction Set Computing (CISC) are two different computer architectures, each with its unique design philosophy and performance characteristics. The following are their main differences:

1. Complexity of the instruction set

 RISC (Reduced Instruction Set Computing)
· The RISC architecture has a small instruction set, and each instruction performs a very simple task, usually completing an operation with one instruction.
· Each instruction is usually of a fixed length and each instruction is executed for the same period (i.e., all instructions are usually executed within a single clock cycle).
· The design idea is to improve the performance of the processor through simple and efficient instructions.
 CISC (Complex Instruction Set Computing)
· The instruction set of the CISC architecture is complex, and a single instruction may perform multiple operations, such as loading, storing, and adding, at once.
· CISC instructions are often of variable lengths and execution times are uneven, and some instructions may require multiple clock cycles to complete.
· The design goal of CISC is to reduce the number of instructions in a program through complex instructions, thereby reducing memory usage.

2. The number and type of instructions

 RISC
· There are fewer instructions (usually dozens), each of which performs a simple and specific task.
· For example, most of the instructions in the RISC architecture are basic operations such as data transmission, addition, subtraction, and jumping.
 CISC
· Contains more instructions (often hundreds or more) that can perform more complex tasks.
· The CISC instruction set contains more complex operations such as string processing, multiplication, division, etc., and can complete multiple steps in a single instruction.

3. Efficiency of instruction execution

 RISC
· Because the instructions are simple, each instruction is usually executed at a fixed time and can be completed within one clock cycle, which is more efficient.
· RISC’s hardware design is generally simple, allowing it to make efficient use of pipelining technology and is widely used in modern processors.
 CISC
· Due to the complexity of instructions, the execution time of instructions is not fixed, and some instructions may require multiple clock cycles to complete, so the execution efficiency of a single instruction may be low.
· However, since each instruction accomplishes more operations, the program may contain fewer instructions, theoretically saving memory and increasing code density.

4. Design of hardware and software

 RISC
· Due to the simplicity of the RISC instruction set and the relatively simple implementation of the hardware, the design relies more on the speed of the hardware and the pipelining technology to improve performance.
· Software programming usually requires more instructions to accomplish a task, but each instruction is executed very quickly.
 CISC
· The CISC instruction set is complex, and the hardware needs more powerful decoding capabilities to handle a variety of instructions of different lengths and complexity.
· Software programming can accomplish more tasks with a small number of complex instructions, potentially reducing the size of the program.

.

A variety of DPU products entered the market

The chart illustrates DPU products from multiple vendors that are specifically designed to accelerate infrastructure and I/O processing in data centers. Unlike traditional CPUs and GPUs, DPUs are designed to handle data flow, storage operations, and network communication, thereby reducing the burden on the main processor and improving overall system efficiency.

Key takeaways:

1.The role of the DPU

DPUs (Data Processing Units) are primarily used to accelerate network, storage, and I/O-related tasks to improve the efficiency of data transmission and processing.

2.FPGA and ASIC products

DPU products include the FPGA-based Alveo SmartNIC and ASIC-based products such as the Pensando DPU and BlueField DPU, which provide more efficient solutions for different application scenarios.

3.Market participants

Well-known companies such as Intel, NVIDIA, AMD, and MangoBoost are all driving the development and adoption of DPU products aimed at improving data center performance and scalability.

.

Application opportunities of DPUs in AI systems

The diagram illustrates the critical role of DPUs in AI systems, especially when dealing with high-bandwidth communication between GPU servers and storage data transfer, where they can provide significant performance gains. By leveraging technologies such as RDMA and NVMe over Fabric, DPUs are able to accelerate data interaction between GPUs and storage, improving the efficiency of the overall AI system.

Key takeaways:

1.The DPU accelerates the network and storage

DPUs can optimize communication between GPUs, point-to-point communication within nodes, and high-speed data transfer between GPUs and storage.

2.Improve system performance

By leveraging efficient networking and storage technologies, DPUs can solve the I/O bottlenecks faced by GPU servers and further improve the overall performance of AI systems.

3.Specific application cases

For example, the use of AMD GPUs and MangoBoost DPUs to optimize the communication between GPUs and remote storage demonstrates the power of DPUs in solving storage and data transfer problems.

This demonstrates the potential of DPUs in data-intensive AI applications, especially when dealing with large-scale data.

.

L4 – Case Study: DPU-Accelerated Training of Large Models

MangoBoost GPU Storage Acceleration (GSB) – (1) NVMe/TCP hardware acceleration

The diagram shows how the NVMe/TCP stack can be offloaded to the DPU (Data Processing Unit) to simplify the storage communication software stack and significantly improve data transfer performance. MangoBoost DPUs use hardware acceleration mechanisms, including the use of FPGAs for NVMe/PCIe virtualization and protocol conversion, as well as embedded ARM processors to handle the control path, increasing data processing speeds throughout the storage system.

Key takeaways:

1.Hardware acceleration

By offloading the NVMe/TCP stack to the DPU, MangoBoost improves the performance of the storage system and reduces the burden on the CPU.

2.FPGA and ARM collaboration

The FPGA handles protocol conversion on the data path, while the ARM processor is responsible for managing the control path.

3.Simplify the software stack

With hardware acceleration, the software stack becomes simpler, improving the efficiency and performance of the overall system.

4.Efficient storage of communications

DPU acceleration enables more efficient storage operations and data transfers, especially in data centers and storage-intensive applications.

This diagram shows how MangoBoost optimizes data transfer and storage operations through innovative hardware acceleration solutions, which has great potential for applications in areas such as AI and high-performance computing.

Note

To put it plainly, the TCP protocol stack that would normally be processed at the OS level is offloaded onto the DPU through NVMe/TCP. Combined with AI applications, the point is to work out clearly how many CPU resources are consumed by TCP-layer network communication during inference, and how much of that can be cut away from the CPU and handed to the DPU. This is also why the data processing unit (DPU), despite being named after data processing, is so often used for network communication offload; beyond that, in AI scenarios the DPU can certainly do much more. This also answers why today's DPU market is only just getting started.

.

MangoBoost GPU Storage Acceleration (GSB) – (2) Point-to-point communication

The diagram shows how to optimize the data transfer path by enabling point-to-point communication between the DPU and the GPU, addressing the performance bottlenecks caused by CPU, memory, and PCIe contention in traditional architectures. With point-to-point communication, GPUs and DPUs can exchange data directly, reducing data transfer latency and improving the overall efficiency of the storage system.

Key takeaways:

1.Peer-to-peer communication

Enables direct data exchange between the DPU and GPU, reducing bottlenecks in the data transfer path and improving data transfer speed and efficiency.

2.Resolve resource contention

Direct GPU-to-DPU communication avoids competition for CPU, memory, and PCIe resources, improving overall performance in data centers and compute-intensive applications.

3.Data transfer optimization

This solution can significantly optimize the GPU data path, reduce latency, and improve the performance of the storage system.

The design of this architecture is important for applications that require efficient, low-latency data processing, such as AI, machine learning, and big data analytics.

.

GPU Storage Acceleration (GSB): File API

The diagram shows how MangoBoost’s hardware acceleration solution can be used to optimize the file API, so that user applications (e.g., FIO, DeepSpeed) can use hardware acceleration to increase data transfer speed, especially when processing high-performance computing and storage tasks.

Key takeaways:

1.Optimized file API

With MangoBoost’s hardware acceleration, the file system and storage protocols are processed more efficiently, reducing the burden on the CPU and the network.

2.Hardware acceleration

MangoBoost’s DPU and GPU work together to accelerate data exchange through peer-to-peer communication, improving storage and compute efficiency.

3.GPU works with DPU

The Mango DPU works with the GPU to further optimize GPU storage acceleration and improve the performance of applications.

This architecture is suitable for application scenarios that require efficient storage and fast computing, such as deep learning, big data analytics, and other fields, especially when large amounts of data need to be accessed quickly.

Note

This article is worth comparing with the Qistor solution summarized a few days ago. In essence, both accelerate data access with dedicated hardware: Qistor achieves KV storage acceleration through an FPGA and abstracts an object-oriented storage API, while MangoBoost here exposes a general file API; the former can be seen as a subset of the latter. Qistor, however, is acceleration hardware aimed directly at storage, and the two approaches really ought to be integrated, as keeping them too separate is hard for customers to accept.

For more information about Qistor’s hardware implementation of the KV storage solution, please refer to the following topics:

· Technological breakthrough: The hardware-accelerated key-value storage solution proposed by Qistor has achieved a revolutionary breakthrough in the storage abstraction layer by migrating core algorithms such as LSM trees to FPGA/ASIC hardware, significantly improving storage efficiency and performance.
· Application scenarios: This technology has shown great potential in high-performance computing scenarios such as AI training and vector databases, and has been verified to improve performance by 10-100 times in ultra-large-scale applications such as Twitter and Facebook.

.

MangoFile

The table illustrates the role of the MangoFile library in the data transfer process. By exchanging data directly between GPU memory and storage devices, the MangoFile library provides efficient file I/O operations. It uses the ROCm and ROCK kernel drivers, relies on peer-to-peer direct transfers to reduce the burden on the CPU, and uses the NVMe driver for DMA address mapping and storage-command submission to achieve fast data transfer.

Key takeaways:

1.Memory registration and read/write

MangoFile registers GPU memory, obtains the corresponding information, and performs file I/O operations.

2.ROCm/ROCK driver support

The ROCm and ROCK kernel drivers enable direct peer-to-peer communication between the GPU and the storage device, optimizing the data transmission path.

3.NVMe driver

Through the NVMe driver, MangoFile is able to efficiently manage data transfers, ensuring low latency and high efficiency.

4.File I/O operations

The MangoFile library simplifies file I/O operations, makes full use of GPU storage acceleration, and improves the performance of the file system.
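To make the workflow above easier to picture, here is a purely illustrative sketch; the module name mangofile and the functions register_gpu_buffer, read_into, and unregister are hypothetical stand-ins, not MangoBoost's actual API, and only mirror the register/read/unregister steps described in the table:

```python
# Purely illustrative pseudocode for a GPU-direct file read; the mangofile API is hypothetical.
import torch
import mangofile  # hypothetical Python binding for the MangoFile library

# 1. Allocate a GPU buffer and register its memory so the DPU/NVMe path can DMA into it.
buf = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")
handle = mangofile.register_gpu_buffer(buf.data_ptr(), buf.numel())

# 2. Read a checkpoint shard from (remote) NVMe storage straight into GPU memory,
#    bypassing the CPU bounce buffer.
mangofile.read_into(handle, path="/mnt/nvme/llama-shard-00.bin", offset=0)

# 3. Unregister the memory when the I/O is complete.
mangofile.unregister(handle)
```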

.

Test system

The diagram shows a high-performance test system equipped with AMD’s EPYC CPUs, MI300 GPUs, and MangoBoost DPUs, which are particularly suitable for data-intensive tasks such as AI computing and storage acceleration. The DPUs and GPUs in the configuration are connected at high speed via PCIe Gen5, ensuring high bandwidth and low latency for the system.

Key takeaways:

1.High-performance hardware

The system is equipped with AMD EPYC CPU and AMD MI300 GPU to provide powerful support for parallel computing and graphics processing.

2.Introduction of the MangoBoost DPU

Accelerate data transfer and storage operations with MangoBoost DPUs, especially in scenarios that require high bandwidth, where the DPU is able to improve overall performance.

3.PCIe Gen5 and 100Gbps networking

High-speed connectivity is provided for the individual components (CPU, GPU, DPU) to ensure efficient data flow.

4.Operating system and configuration

The system runs Ubuntu 22.04.3 LTS, with GRUB parameters configured for large-scale data processing and virtualization.

.

Assessment 1: FIO benchmark

Control group (left): GPU system with an ordinary network card. Experiment group (right): GPU system with the MangoBoost DPU.
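For reference, benchmarks of this kind are typically driven by FIO sweeps over block sizes. The following is a minimal sketch, assuming fio is installed; the test target path and job parameters are illustrative, not the exact settings used in this study:

```python
# Sweep FIO sequential reads over several block sizes and report bandwidth (sketch).
import json
import subprocess

target = "/mnt/nvme/testfile"   # hypothetical test file or block device

for bs in ["256k", "1m", "2m"]:
    result = subprocess.run(
        ["fio", "--name=seqread", f"--filename={target}",
         "--rw=read", f"--bs={bs}", "--ioengine=libaio",
         "--direct=1", "--iodepth=32", "--numjobs=4",
         "--runtime=30", "--time_based", "--group_reporting",
         "--output-format=json"],
        capture_output=True, text=True, check=True,
    )
    job = json.loads(result.stdout)["jobs"][0]
    bw_mib = job["read"]["bw"] / 1024   # fio reports bandwidth in KiB/s
    print(f"bs={bs}: {bw_mib:.0f} MiB/s")
```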

FIO Micro Benchmark – Results

 Data Movement Bandwidth:
· The chart on the left shows the bandwidth results for different block sizes; GPU storage acceleration (yellow) provides higher bandwidth than the CPU-buffer plus software NVMe/TCP solution (gray).
· The GPU storage acceleration solution delivers a 1.7x to 2.6x bandwidth increase, approaching the network line-rate bandwidth.
· CPU usage exceeds 80%, and bandwidth increases significantly at the 1MB and 2MB block sizes, indicating that the GPU storage acceleration scheme effectively reduces the CPU burden.
 Data Movement Latency
· The chart on the right shows latency at different block sizes. GPU storage acceleration reduces latency by 25% compared with the CPU-buffer plus software NVMe/TCP solution, showing lower average, 90th-percentile, and 99th-percentile latency.
· At the 256KB block size in particular, latency is reduced by 20%.
 CPU Cores Used
· The third chart shows the number of CPU cores used at different block sizes. The GPU storage acceleration solution saves 22 to 36 CPU cores, with the largest saving at the 2MB block size.

.

Assessment 2: DeepSpeed Workload – Software Setup

The diagram shows how MangoBoost DPUs can be leveraged to accelerate data exchange by modifying DeepSpeed's swap backend (the tensor-swapping module), especially in workloads such as high-performance computing (HPC) and deep learning. By enabling the accelerated swap mode, the speed of data exchange between the GPU and storage is significantly improved (a configuration sketch follows the key takeaways below).

Key takeaways:

1.Accelerated swap mode

By enabling the MangoBoost DPU, the traditional swap path (normal swap mode) is replaced with an accelerated swap mode to optimize data transfer.

2.MangoFile and the DPU

The MangoFile library accelerates I/O operations through hardware acceleration, and the DPU provides efficient point-to-point data transfer, significantly improving the efficiency of data flow between storage and compute.

3.Optimize compute and storage interactions

Leveraging the synergy of Mango DPUs and AMD GPUs, the performance of the entire DeepSpeed workload is boosted.
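Where such swapping is driven through DeepSpeed itself, the knob a storage-accelerated backend would most likely tune is the asynchronous-I/O section of the DeepSpeed configuration. The following is a hedged sketch; the values are illustrative, not the settings used in this test:

```python
# Illustrative DeepSpeed asynchronous-I/O (aio) settings for tensor swapping to NVMe.
ds_aio_config = {
    "aio": {
        "block_size": 1048576,    # 1 MiB I/O requests
        "queue_depth": 32,        # outstanding requests per submission thread
        "thread_count": 8,        # parallel I/O submission threads
        "single_submit": False,
        "overlap_events": True,
    }
}
# In practice this section is merged into the main DeepSpeed config (e.g. ds_config.json)
# alongside the ZeRO offload settings shown earlier in the article.
```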

The value of the DeepSpeed project in AI scenarios

Reference reading: How does DeepSpeed optimize inference performance from the storage layer?

1. Memory and parallel optimization: DeepSpeed, part of Microsoft's Large-Scale AI Initiative, contains a powerful memory- and parallelism-optimization toolkit specifically designed for efficient large-scale model training and inference on modern GPU clusters. It scales across heterogeneous memory (GPU, CPU, and NVMe), significantly improving computing efficiency.

2. Reducing the burden on the GPU: During inference, DeepSpeed effectively reduces the burden on the GPU by offloading model parameters to NVMe storage and the KV cache to CPU memory, thereby improving inference efficiency, especially when working with large-scale models.

.

DeepSpeed Workloads – Results

The chart illustrates the significant performance gains of MangoBoost GPU Storage Acceleration (GSB) in DeepSpeed workloads:

1.Higher bandwidth

GPU storage acceleration provides 1.7 times more bandwidth for data transfer than traditional CPU buffers and software NVMe/TCP.

2.Reduce CPU core usage

GPU storage acceleration significantly reduces the use of CPU cores, saving 25 CPU cores, making computing resources more efficient.

3.Advantages in AI training frameworks

This optimization provides significant performance gains in AI training frameworks, especially in terms of data transfer and computational efficiency.

.

Summary

 Efficient storage systems become a key factor in AI systems:
· In AI computing, the GPU needs to be kept busy with computation, but because of the limits of local device memory it cannot hold all of the AI models, data, and parameters.
 AMD offers an advanced AI ecosystem:
· AMD Instinct™ GPUs and AMD ROCm™ software provide powerful hardware and software support for AI workloads.
 Data Processing Unit (DPU): improves the efficiency and performance of the storage system.
· MangoBoost provides comprehensive DPU solutions, such as GPU-storage-boost, to optimize data transfer and processing.
 Case Study: Llama training with the MangoBoost storage solution:
· MicroBenchmark improvements: 1.7x to 2.6x higher throughput and 22-37 CPU cores saved.
· AI training storage-access improvements: throughput up to 1.7x higher and 25 CPU cores saved.

.
