Analysis of AI storage requirements: from data pipelines to model optimization

.

Overview

With the rapid development of artificial intelligence, AI is being applied ever more widely across industries. The performance of AI systems, however, depends not only on raw compute power but also on how well the storage system is optimized. AI workloads are multi-stage, highly parallel, and diverse, so their storage requirements differ greatly from those of traditional applications. From data ingestion to model inference, each phase places different demands on the storage system. Understanding these requirements not only helps optimize system performance, but also enables more efficient storage solutions for enterprise AI applications, improving overall competitiveness.

1.AI data pipelines and workflows
 Data ingestion, cleaning, feature engineering, model training, evaluation and tuning, inference, and data archiving
 Analysis of storage requirements at each stage
2.AI storage requirements
 Data ingestion: large volume, sequential writes
 Data cleaning: large capacity, random reads/writes
 Feature engineering: large capacity, random reads/writes
 Model training: high-throughput reads/writes and parallel operations
 Model inference: heavy random reads, low write requirements
 Data archiving: massive capacity
3.Key tools and techniques
 MLPerf benchmarks
 SDXI data transmission interface
 Computational storage
 GPU acceleration
4.Storage networking and architectural challenges
 Performance bottlenecks
 Model scalability
 Reliability
 Storage network optimization

Key takeaways

1.Gain a deeper understanding of the multi-phase nature of AI workloads and their impact on storage systems.
2.Understand the storage requirements at each stage and optimize the storage configuration to improve the performance of AI applications.
3.Learn how key tools and technologies such as MLPerf, SDXI, and computational storage can help optimize storage.
4.Learn how to address storage network and architecture challenges to improve the overall efficiency of your AI system.

.

Why is storage for AI workloads so special?

 AI is a multi-stage workload
· Most traditional workloads, such as databases, have predictable access patterns
· AI has very different workload patterns at different stages
 Optimization goals may differ
· Optimize GPU utilization, not transaction response time
· Optimize data scientist productivity
 Highly parallel operations
 The performance and capacity of different AI tasks vary widely

AI storage differs from traditional storage in that AI workloads are typically multi-stage, with different requirements and access patterns at each stage. The optimization targets also differ, focusing more on GPU utilization and data scientist productivity than on transaction response time. In addition, AI tasks involve highly parallel operations, and different AI tasks vary widely in their performance and capacity requirements.

.

AI data pipelines/workflows

The different stages of AI processing are depicted in the image, which mainly includes the following parts:

1.Data Ingest: Data is fed into the system as a training data source for AI models.
2.Data Cleaning: ETL process, data cleaning, and preprocessing are required to ensure data quality and consistency.
3.Feature Engineering: Transform data into features that the model can understand, including feature creation, refining, scaling, vectorization, and more.
4.Model Training: Includes steps such as training, checkpointing, recovery, and validation of the model to optimize the initial performance of the model.
5.Model Eval and Tuning: Evaluate model performance using metrics such as precision, adjust hyperparameters, and tune the model.
6.Inference: Analyze and make predictions on new data using a model that has already been trained.
7.Data Archive: Store and archive processed and trained models and data.
8.Business Value: The actual effect of AI model application is fed back to the production data and model tuning improvement, increasing business value.

Reference

In a previous article, Solidigm offered a more practical storage selection scheme built around the I/O characteristics of the data pipeline; it is worth reading alongside this one.

 This article describes the entire process of AI data processing in detail, including data ingestion, preparation, training, checkpoint preservation, inference, and archiving.
 It highlights the read/write I/O characteristics of each stage and recommends suitable storage, such as the QLC-based P5336 and the TLC-based P5520.

.

Model building phase

The authors emphasize that, for the AI industry, most attention and resource investment goes into the model building stage, which covers the first five steps in the figure above.

In the process:

 Heavily used resources
· Data Scientist
· Computing resources
· Storage resources
· GPU resources

Target:

 Generate a trained model that goes through various stages of training and tuning to achieve optimal performance.

Note:

 The goal of this process is to produce trained models; these do not directly generate business value unless your business is selling the models themselves (e.g., LLM vendors).

.

Model inference applications

 Stages of generating business value:
· In the inference stage, the trained model is used to analyze production data and ultimately turn it into business value.
 Key Resources:
· Compute Resources
· GPU Resources
· Production Data
· Heavily used
· Should be used more efficiently
 Objectives:
· Generate business value, which is the ultimate goal of the entire AI process.

Note

At the time of SDC24, DeepSeek had not yet released its current V3/R1 versions, and inference demand had not yet fully taken off. Looking back from today, retaining and persisting enterprise production data is an indispensable part of later gaining insight into how production actually behaves. Amid the industry rush to boost productivity with large models, it is worth thinking calmly: where does the enterprise's data live, and in what form should it be fed to the large model?

.

Take data ingestion, for example

The image discusses the data ingestion process and how AI affects every aspect of how an enterprise captures, stores, and accesses business data. Companies are already collecting data, but applying AI may change how that data is processed and stored, and therefore how efficiently it can be used.

===

1.Your business processes generate data today
2.Do you have storage ready for this ingested data?
 Or do you not?

Business data is already being captured, but:

 How does AI affect the data you capture?
 How does AI affect the way you store business data?
 How does AI affect the way you access business data?

.

Real-world example: data flow before AI

Before companies adopted AI, data ingestion relied primarily on sequential writes, and a large share of the generated data was discarded; only a small fraction was read back and saved, largely at random. This is likely to change once AI is applied, since AI can help make far more efficient use of this data.

.

AI-powered data flows

The image shows how AI technology can be used to extract value from data.

Generated data flows through business logic and AI-enhanced business logic, ultimately producing data insights. The saved data enables possible future business insights, and the diagram also reflects the read/write characteristics at each step.

Note

Recently, a common request has been: "DeepSeek is so good, can you use it to look up XXX data for me?" Compared with full-parameter model inference, small models used for data filtering and insight within production processes are actually a huge market, and models will become throughput machines for logs and process data.

.

Storage characteristics of data workflows

Data cleansing

 Raw data must be prepared for AI use
· Logs, images, videos, documents, etc.
 Data must be organized before it becomes training data
· Remove noise
· Deduplication
· Standardization
· Privacy & Ethics (e.g., de-identification of PII, removal of bias, etc.)
 Data is read from ingestion storage
 The cleaned data needs to be written to the storage for feature engineering
 This process may be partially automated through AI

The legend on the left shows the storage requirements of data cleaning: large capacity, with sequential writes and random reads.
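
As a rough illustration of these cleaning steps, here is a minimal pandas sketch; the file names, column names, and exact cleaning rules are hypothetical and would depend on the actual dataset.

```python
import pandas as pd

# Read raw ingested data (a sequential read from ingest storage);
# file and column names below are placeholders for illustration only.
df = pd.read_csv("ingested_logs.csv")

df = df.drop_duplicates()                                        # deduplication
df = df.dropna(subset=["user_id", "event"])                      # drop obviously noisy rows
df["event"] = df["event"].str.strip().str.lower()                # standardization
df = df.drop(columns=["email", "ip_address"], errors="ignore")   # basic PII removal

# Write the cleaned data out for feature engineering (a sequential write).
df.to_parquet("cleaned/events.parquet", index=False)
```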

.

Feature engineering

The picture illustrates the main steps of feature engineering. Data scientists play the role of translators in this process, turning raw data into numbers that AI can process. Feature engineering involves multiple steps such as exploring data, extracting features, and transforming data types. This process is typically computationally intensive and highly parallel.

===

 Data scientists act as translators
· Raw data → food for AI (sequences of numbers)
 Explore data – identify patterns, outliers, relationships, and more.
 Divide the data into training and test sets
 Feature Extraction – distill key features into a form the model can consume
 Data Transformation – Transforming (vectorizing) data types
 Usually highly parallel

The legend on the left shows the storage requirements of feature engineering: large capacity, with mostly random reads and writes.
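
A minimal sketch of the split-and-transform steps above, using scikit-learn on a toy numeric dataset (the shapes and feature count are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and labels standing in for explored/extracted features.
X = np.random.rand(1_000, 20)
y = np.random.randint(0, 2, size=1_000)

# Divide the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One common transformation: scale features to zero mean and unit variance.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```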

.

Model training phase

Several aspects of GPU and storage planning are discussed, emphasizing the importance of balancing storage performance with GPU requirements, understanding data sources, and optimizing for known workloads. Specific GPU benchmarking tools and training bandwidth requirements are also mentioned.

===

Model training – general storage planning

 GPUs drive cost

– Maximizing GPU utilization optimizes investment

 Design a balanced architecture
· Balance storage performance with GPU requirements
 Consider data sources
· You may need to have both file and object access
 If the training workload is known – match the storage performance with the workload
· AI GPU benchmarks can demonstrate the peak performance of various models
· The MLCommons MLPerf training benchmark is a great source
· Determine the size of the training samples
· Estimate the required read bandwidth by multiplying sample throughput by sample size (see the sketch after this list)
 For general-purpose training, you may need to support the maximum read speed of the GPU
· Current high-end GPUs can consume on the order of 1 GB/s per GPU, and this rate keeps increasing
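
As a back-of-the-envelope sketch of the bandwidth estimate mentioned above: multiply the sample throughput each GPU can sustain by the average sample size and the number of GPUs. All numbers below are assumptions for illustration, not benchmark results.

```python
# Assumed inputs for illustration only.
samples_per_sec_per_gpu = 2_500          # e.g. taken from an MLPerf-style training result
avg_sample_size_bytes = 150 * 1024       # e.g. a ~150 KB preprocessed image
num_gpus = 8

required_read_bw = samples_per_sec_per_gpu * avg_sample_size_bytes * num_gpus
print(f"Sustained read bandwidth needed: {required_read_bw / 1e9:.2f} GB/s")
# With these numbers: 2,500 * 153,600 * 8 ≈ 3.07 GB/s
```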

Does the training sample size here refer to the parameter scale or to the raw data? How is the sample size determined?

The size of the training sample usually refers to the size of the raw data, not the size of the parameters. “Sample” here refers to an instance of data used for training, typically input data (such as images, text, sounds, etc.) and their corresponding labels or targets.

Determining the size of the training sample can be considered in the following ways:

1.The type and size of the dataset: The amount of training data depends on the specific task and model. For example, an image classification task may require thousands to millions of images as training samples; a text task may require thousands of sentences or documents.
2.Task complexity: More complex tasks, such as deep neural network models, usually require more training samples to obtain better generalization capabilities.
3.Availability of data sources: Sometimes the amount of data is limited or difficult to obtain, and needs to be supplemented by methods such as data augmentation, synthetic data, or transfer learning.
4.Model size and computational needs: If your model is very large, you may need more training samples to avoid overfitting and to better exploit the model’s potential.
The number of model parameters (e.g., 7B, 14B, etc.) usually refers to the total number of all trainable parameters in the model. These parameters are the weights and biases that the model learns during training. For deep learning models, especially large pre-trained models (e.g., GPT, BERT, etc.), these parameters directly affect the model’s capabilities, capacity, and performance of training and inference.

How to determine the number of model parameters:

1.Model architecture: The structure of the model and the number of layers are key factors in determining the number of parameters. For example, each layer in a Transformer model is usually composed of multiple sublayers, each with its own weight matrices, biases, and other parameters. In a Transformer model, each layer will have:
 A self-attention layer
 A feed-forward neural network layer
 Each layer contains multiple matrices and vectors that participate in the model's computation.
2.Number of neurons per layer: The number of neurons in each layer (also known as the dimension of the hidden layer) determines the number of parameters in each layer. Larger dimensions usually mean more parameters.
3.Number of layers: The depth of the model (i.e., the number of layers) is also a factor that affects the amount of parameters. With each additional layer, the number of parameters of the model increases dramatically.
4.Types of parameters: The model includes not only a weight matrix, but also biases, activation functions, and so on. Each type of parameter affects the total number of parameters.

For example:

Suppose a model has 3 layers, each layer has 1000 neurons, and each neuron is connected to each neuron in the previous layer, then the number of parameters of the model is approximately:

 The number of parameters per layer

1000 (input neurons) × 1000 (output neurons) + 1000 (bias) = 1,001,000 parameters

 The total number of parameters

If there are 3 layers, the total number of parameters is 1,001,000 × 3 = 3,003,000 parameters.

For more complex models, such as GPT-3 (with 175 billion parameters), the increase in the number of parameters is mainly achieved by deepening the number of layers and increasing the number of neurons per layer.
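
The same arithmetic can be checked with a few lines of PyTorch (a minimal sketch; the three fully connected 1000-neuron layers mirror the worked example above):

```python
import torch.nn as nn

# Three fully connected layers of 1000 neurons each, fed by 1000 inputs.
model = nn.Sequential(
    nn.Linear(1000, 1000),
    nn.Linear(1000, 1000),
    nn.Linear(1000, 1000),
)

total_params = sum(p.numel() for p in model.parameters())
print(total_params)  # 3 * (1000*1000 weights + 1000 biases) = 3,003,000
```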

.

Checkpoint mechanism

The problems that can arise during model training are discussed, focusing on the storage performance implications of the checkpointing mechanism.

Checkpoints are used to save the state of the model, such as weights and biases, so that it can be recovered if an error occurs during training. Writes to files are usually sequential, and there may be multiple sequential writes in parallel.

Pausing training costs money, and recovery often requires high sequential read bandwidth with many parallel reads to restore state to multiple GPUs. The performance of the storage system directly determines how quickly checkpoints can be saved and restored, which in turn affects overall training efficiency (a minimal sketch follows below).

===

Model training – what can happen when something goes wrong

 Checkpoints

— Save model weights and other states

· Model weights represent a large investment when training takes a long time
· The checkpoint saves the state so that training can be restarted after an error
 Checkpoint files are written sequentially
· There may be multiple sequential writes in parallel
 When training is paused – performance is money
· Checkpoint recovery is the reverse process
· High sequential reads, multiple parallel reads are restored to multiple GPUs
 Storage performance depends on save/restore time objectives

The legend on the left shows the storage requirements of checkpointing: capacity is not the main concern, but read and write performance requirements are high, and the I/O is entirely sequential.
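
For intuition about the I/O pattern, here is a minimal PyTorch-style checkpoint save/restore sketch; it assumes a `model` and `optimizer` from some training loop and is not the presenter's code.

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    # One large sequential write; with multiple ranks, several such writes run in parallel.
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Recovery is the reverse: large sequential reads fanned out to the GPUs.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]
```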

.

Model evaluation and tuning

Key aspects of model evaluation and tuning are discussed. Evaluation measures how well the model's results match expectations (accuracy) and how it handles errors relative to correct results (precision/recall).

In addition, other commonly used evaluation metrics such as the F1 score and AUC-ROC are presented. The tuning section emphasizes adjusting the model’s hyperparameters to improve the evaluation results, and notes that tuning produces a dataset containing the model’s parameters. Finally, the size of the model parameters is fixed and depends on the number of weights, which is critical in neural networks.

The legend on the left shows that the requirements for the storage system during model evaluation and tuning are similar to those of checkpoints.

===

Reference

Model evaluation and tuning

 Evaluate

– Measure how well the model results match expectations

· Accuracy

– What is the probability of being correct?

· Precision/recall

– Rough measures of the rate of errors versus correct results

· Other metrics

– Such as the F1 score and AUC-ROC (area under the receiver operating characteristic curve)

 Tuning

– Adjust hyperparameters to improve evaluation

· Generate a dataset that contains model parameters
· An internal representation of a neural network
· The size of the model parameters is constant and depends on the number of weights

F1 score and AUC-ROC are two important metrics to evaluate the performance of machine learning models. Here’s a closer look at the two metrics:

1.F1 Score

The F1 score is the harmonic mean of precision and recall. It is a combined measure of the model's accuracy and completeness, especially useful when the classes are imbalanced.

Features of F1 scores:

 Range

The range of F1 scores is [0, 1], with 1 being perfect precision and recall, and 0 being the worst performance.

 Balance considerations

When precision and recall are unbalanced, F1 scores can provide a more comprehensive evaluation and avoid relying too much on one metric.

Applicable Scenarios:

 The F1 score is typically used in scenarios where both false positives and false negatives matter, such as medical diagnosis and fraud detection.
2.AUC-ROC (Area Under the Curve / Receiver Operating Characteristic)

AUC (Area Under the Curve) is the area under the Receiver Operating Characteristic Curve (ROC), which is used to evaluate the performance of binary classification models.

Features of AUC-ROC:

 Independent of thresholds

AUC provides an overall assessment of the model across different classification thresholds.

 Advantages

It is suitable for imbalanced classes and evaluates the overall performance of the model.

Applicable Scenarios:

 AUC-ROC is often used to evaluate models on binary classification problems, such as credit scoring and disease prediction. It is particularly useful for imbalanced datasets because it accounts for performance at different thresholds.

Summary:

 F1 score

Ideal for scenarios that require a balance between precision and recall, especially when classes are imbalanced.

 AUC-ROC

A comprehensive metric that describes classifier performance across different thresholds, especially useful under class imbalance.
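
A small scikit-learn sketch of both metrics on toy data (the labels and scores below are made up; F1 is the harmonic mean 2·P·R/(P+R), and AUC-ROC is computed from the predicted scores):

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                    # toy ground-truth labels
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard predictions at one threshold
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]    # predicted probabilities

p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print("F1      =", f1_score(y_true, y_pred))          # equals 2*p*r / (p + r)
print("AUC-ROC =", roc_auc_score(y_true, y_score))    # threshold-independent
```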

.

Characteristics of the inference session

The slides discuss the concept of inference and its application in business. Inference refers to using a model, after it has been trained, to process production data and generate business value.

It includes types such as retrieval-augmented generation (RAG) based on large language models, predictive analytics, computer vision, and anomaly detection (e.g., for malware or fraud). Access patterns also vary with the type of inference; RAG in particular can generate database-like random workloads. These points are important for understanding how models are applied in production and for optimizing the inference process.

The legend on the left shows the storage requirements of model inference: capacity requirements are modest, write performance requirements are low (mostly random writes), while read performance is demanding (mainly fast retrieval to feed the model) and consists entirely of random reads.
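
To see why RAG produces database-like random reads, here is a toy retrieval step with NumPy (the embedding store, dimensions, and top-k value are all assumptions; real systems use a vector database or ANN index):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 384)).astype(np.float32)   # stand-in embedding store
query = rng.standard_normal(384).astype(np.float32)               # embedded user question

# Cosine similarity of the query against every stored embedding.
scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
top_k = np.argsort(scores)[-5:][::-1]
print(top_k)  # scattered document IDs: fetching their text chunks is a burst of random reads
```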

.

Data archiving

The importance of archiving in AI was discussed. Although archiving is not typically a core part of AI, it is important for AI storage, especially in some AI applications where archiving data may be required by law or regulation.

Unlike the traditional concept of “archiving”, the archived data mentioned here may need to be extracted later for subsequent training or analysis. Archival storage doesn’t have high performance requirements, but it needs to be “fast enough” to recover quickly when needed. Often, archived data is referred to as “cold storage,” and as datasets continue to grow, archive storage expands. This type of storage requires low-cost and low-carbon footprints, and even offers zero-power storage solutions such as DNA storage and optical storage technologies.

The legend on the left shows the storage requirement of the data archiving phase: massive capacity.

.

Key tools and technologies for AI infrastructure

A standard for measuring computational efficiency – the MLPerf benchmark.

Several commonly used benchmarking tools are highlighted, in particular the MLPerf benchmark provided by MLCommons.

MLPerf has multiple categories, covering training, inference, storage, and more. The inference segment is also subdivided into different scenarios, such as mobile devices, micro devices, data centers, and edge devices. In addition, some benchmark results of training algorithms (AlgoPerf) are mentioned to evaluate the performance of different training algorithms. These benchmarks can help evaluate how different hardware and models perform in specific tasks.

About the MLPerf benchmark

 MLCommons is an open organization of experts from academia, industry, and other fields to advance machine learning and artificial intelligence technologies. Its mission is to promote the advancement of AI technology by creating and promoting open-source benchmarks.
 MLPerf‘s benchmarks cover multiple machine learning tasks, including training and inference. It simulates real-world machine learning workloads to test the performance of hardware and software in real-world applications.
 Objectives: The primary purpose of MLPerf is to provide a fair, repeatable, and transparent approach to the evaluation of AI hardware and software. It helps users and organizations understand how to choose the most appropriate hardware and configuration for different types of machine learning tasks.

.

Accelerator – SDXI

SDXI is a standard data transfer interface developed by SNIA (Storage Networking Industry Association) to simplify the movement and processing of data.

Future versions of SDXI will support additional features such as encryption/decryption and compression/decompression, further enhancing its application in the data processing process. With these additional features, SDXI optimizes the efficiency and security of data transfers, helping to improve the overall performance of storage and compute tasks.

Reference

For more technical reports on SDXI, please refer to:

 DREAM: Data Acceleration – SDXI, DPU and Storage

.

Main content: The article details the design background, key features, and role in the storage access path of SDXI (Smart Data Accelerator Interface), a memory-to-memory data movement and acceleration interface. SDXI is designed to optimize data transfers, reduce CPU load, and support tiered storage and efficient data access for high-performance computing and AI applications.

.

Accelerator – Computational storage

Computational storage is a technology defined by SNIA (Storage Networking Industry Association) and NVM Express (NVMe).

Computational storage provides an open platform that allows computing functions to be integrated directly into the storage device, enabling computational operations to take place close to the data, reducing latency and increasing efficiency in data transfer. Common features include encryption/decryption, compression/decompression, data filtering, and preparation of training data. These features provide more efficient computational support for data processing, especially in applications such as machine learning.

Reference

For a more detailed report on computational storage, please refer to:

.

This article details the definition, architecture, and performance of computational storage, emphasizing moving compute tasks to where the data resides, reducing data movement and improving efficiency. The article also discusses the benefits of computational storage, including reduced network bottlenecks and lower energy consumption.

.

Accelerator – Graphics Processing Units (GPUs)

The advantages of GPUs in parallel computing are discussed, especially for AI computing tasks.

The GPU’s ability to efficiently handle many similar computations in a matrix is at the heart of its massively parallel computing capability. Compared to CPUs, GPUs can perform many calculations at the same time, dramatically reducing compute time while improving energy efficiency (see the sketch after the list below). In addition, data center GPUs are often equipped with high-bandwidth memory (HBM) to meet the demands of large-scale data processing.

===

Accelerator – GPU

 Parallel operations
· AI computing can be highly parallelized
· It is common to operate on multiple similar calculations in a matrix
· GPUs are designed to handle this type of computation in a massively parallel fashion
· CPUs can usually only do one computation at a time
 Parallel operations not only significantly reduce computation time, but also improve energy efficiency
 HBM (high-bandwidth memory) is typically found on data center GPUs
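
A minimal PyTorch sketch of the point about parallelism: the same matrix multiply runs on CPU cores or, after a host-to-device copy, on thousands of GPU cores in parallel (the matrix sizes are arbitrary):

```python
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

c_cpu = a @ b                                # runs on the CPU

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()        # data must first cross PCIe/NVLink
    c_gpu = a_gpu @ b_gpu                    # tiles are processed by many CUDA cores at once
    torch.cuda.synchronize()                 # wait for the asynchronous kernel to finish
```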

.

Although GPUs are extremely efficient for training, they also present some challenges in use.

First, GPU programming is more complex than CPU programming and requires deeper technical expertise. Second, GPUs typically have higher power consumption, which raises cost and cooling requirements and adds operational complexity. In addition, GPU hardware is expensive, and transferring data to and from the GPU introduces latency that can affect compute performance.

Note

Of the three accelerators introduced here, only GPUs are truly mature; SDXI and computational storage have for now stayed out of the spotlight because of the heavy software and ecosystem investment they require. But as AI scenarios continue to mature, the data transfer efficiency advantages of the other two accelerators will shine in inference scenarios.

.

Understanding storage networking

Because storage devices and networks typically add latency when processing data, the speed of the entire system is often limited by the performance of these components. Especially in high-performance computing and AI tasks, storage and network bottlenecks can limit data transfer and processing speeds. To get the most out of the GPU, it is essential to keep it constantly fed with data.
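
One common way to keep GPUs fed is to overlap storage I/O with compute using parallel, prefetching data loaders. A minimal PyTorch sketch follows (the dataset, batch size, and worker counts are illustrative assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a real training set on shared storage.
train_dataset = TensorDataset(
    torch.randn(10_000, 3, 224, 224),
    torch.randint(0, 10, (10_000,)),
)

loader = DataLoader(
    train_dataset,
    batch_size=256,
    num_workers=8,       # parallel reader processes hide storage latency
    prefetch_factor=4,   # each worker keeps several batches in flight
    pin_memory=True,     # page-locked buffers speed up host-to-GPU copies
)

for images, labels in loader:
    pass  # the GPU training step would consume each batch here
```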

.

Network architecture

The image shows the network design of an accelerated computing data center, with different network layers distinguished by color.

 Green is the management/service network, based on Ethernet; the technology is mature and the requirements are modest;
 Yellow is the interconnect network between GPU nodes; current solutions include InfiniBand and RoCE v2, with Ultra Ethernet (UEC) as the open interconnect effort under development;
 Purple is the interconnect between GPUs within a node; solutions differ by hardware vendor, e.g. Nvidia uses NVLink while AMD backs UALink.

.

Ultra Ethernet UEC

Introducing Ultra Ethernet, an open project hosted under the Linux Foundation to provide a highly scalable, low-latency, and highly reliable network solution capable of supporting network topologies of up to one million nodes.

Ultra Ethernet integrates the latest congestion management techniques and low-latency protocols, and is designed with security in mind. The project combines the knowledge and technology of many experts to advance networking, and its specification is due to be made public by the end of the year.

.

UALink

Introducing UALink, a technology for inter-accelerator communication designed for massive scaling. The initial focus will be on implementing memory sharing between accelerators (e.g., GPUs), specifically DDR and HBM memory.

At the same time, UALink provides a low-latency, high-bandwidth network that supports hundreds of accelerators operating in a single node, with simple load/store semantics and software coherency. With transfer rates up to 200 Gbps and roots in the Infinity Fabric protocol, UALink can support large-scale accelerator network deployments. It can be complemented by other scale-out approaches, such as Ultra Ethernet (UEC), to further improve system performance.

.

Several of the biggest challenges associated with storage are discussed.

First, performance: the storage system should stall the GPUs as little as possible to maximize compute efficiency. Second, as models continue to scale, storage solutions need to scale with them to handle the sheer volume of data. Third, reliability is key, especially during training, where losing checkpoint data can mean losing a huge amount of work.

Finally, the slide asks the question of how SNIA (Storage Networking Industry Association) can help address these challenges, suggesting that there may be technologies or standards that can provide support.

.
