Benchmarking Storage for AI Workloads

Choose the right storage for your AI infrastructure

Yifeng Jiang
6 min read · Jan 19, 2024

As ChatGPT and Large Language Models (LLMs) gain popularity, we are receiving more inquiries than before from customers and prospects about how to benchmark storage for AI workloads. In this blog, I explain why storage performance is important for AI and introduce several benchmark tools, with examples.

Note that all the tests and results included in this blog are unofficial. The sole purpose of this blog is to demonstrate the usage of these test tools. Test results from the MLPerf Storage tool are not verified by the MLCommons Association.

Why Benchmark Storage for AI?

Storage is probably one of the last things many data scientists care about. Modern GPU servers normally ship with terabytes of fast local NVMe SSDs. It should just work. So why bother talking about storage? Well, large AI systems, like those used to fine-tune LLMs, are most likely built on remote shared storage. This is not just because large datasets don’t fit on a single server: shared storage also allows teams to easily share GPUs and data across the cluster. In such a system, it is important to ensure the shared storage is fast enough to deliver data to many GPUs and keep them busy.

Shared storage is not as simple as local SSDs. Sometimes, it doesn’t just work… In fact, if not chosen properly, the shared storage, and anything between it and the GPUs, can become a bottleneck in an AI system.

A simplified data flow for AI training

So how do we know whether the storage is fast enough for AI? Let’s test.

Benchmarking Storage for AI workloads

Generally, an AI storage benchmark plan can be structured around three layers of testing:

  1. General storage function and performance tests.
  2. GPU and storage related benchmarks.
  3. AI full stack benchmarks.

In this blog I cover only layers 1 and 2; the third layer (AI full stack) requires extensive resources and is beyond this blog’s scope.

General storage performance test

Although model training is the main workload that requires high-performance storage, it is only one part of an AI system. By running general storage performance tests, even those not directly related to AI, we make sure the storage system performs well for the other workloads in the AI system.

Fio and Vdbench are two popular general-purpose storage benchmark tools. Both support generating various I/O workloads, including random writes, sequential reads and others, against the target storage. Below is a Fio test example.

fio -rw=randwrite -name=fio-randwrite -numjobs=8 -bs=4k \
-ioengine=libaio -direct=1 -iodepth=16 \
-size=10G -runtime=180 \
-group_reporting \
-directory=/mnt/fbs/fio/randwrite

This is a random write test running in 8 parallel processes with a 4KB block size. The job uses the libaio asynchronous I/O engine and bypasses the OS page cache (direct=1). In this test, Fio writes 8 x 10GB files into a FlashBlade NFS mount directory.

Here is the output from the above example:

Starting 8 processes
fio-randwrite: Laying out IO file (1 file / 10240MiB)
fio-randwrite: Laying out IO file (1 file / 10240MiB)
...
Run status group 0 (all jobs):
WRITE: bw=130MiB/s (136MB/s), 130MiB/s-130MiB/s (136MB/s-136MB/s), io=22.8GiB (24.5GB), run=180005-180005msec
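
Reads matter at least as much as writes for training, since each epoch streams the dataset back from storage. A sequential read variant of the same Fio job could look like the sketch below (same example mount path; the block size is raised to 1MB to emphasize throughput rather than IOPS):

fio -rw=read -name=fio-seqread -numjobs=8 -bs=1m \
-ioengine=libaio -direct=1 -iodepth=16 \
-size=10G -runtime=180 \
-group_reporting \
-directory=/mnt/fbs/fio/seqread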

It is good practice to monitor the storage’s performance metrics during these tests. The chart below shows that, during the random write run, the FlashBlade was serving at low latency (<4ms), low throughput and relatively high IOPS (34K). Does this mean FlashBlade is slow? Absolutely not. From experience, we know this is far from what FlashBlade can deliver.

FlashBlade performance metrics example
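
If an array-side dashboard like the one above is not available, client-side NFS statistics give a rough picture of throughput, operations per second and latency. One option is nfsiostat from the nfs-utils package (the mount point below is this blog’s example path):

# report per-mount NFS ops/s, throughput and RTT every 5 seconds
nfsiostat 5 /mnt/fbs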

If the general benchmark result is far below expectations and estimates, we should investigate and repeat the test until we reach reasonable results or conclusions. Maybe the target storage is slow, maybe not. Before moving on to GPU-related tests, we want to make sure the testing environment is set up properly, and general storage benchmarks are very good for that.
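
Vdbench drives the same kinds of workloads from a parameter file rather than command-line flags. As a sketch, the following file-system test streams 1MB sequential reads from 8 threads; the parameter names follow Vdbench’s fsd/fwd/rd file-system syntax, while the paths, file counts and sizes are just example values:

# write a minimal Vdbench parameter file, then run it
# (run from the Vdbench install directory)
cat > seqread.vdb <<'EOF'
fsd=fsd1,anchor=/mnt/fbs/vdbench,depth=2,width=4,files=64,size=1g
fwd=fwd1,fsd=fsd1,operation=read,fileio=sequential,fileselect=random,xfersize=1m,threads=8
rd=rd1,fwd=fwd1,fwdrate=max,format=yes,elapsed=180,interval=5
EOF
./vdbench -f seqread.vdb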

Metadata performance test

Some AI/HPC workloads are bound by metadata operations, which can be troublesome for some storage systems, so it is important to also run metadata benchmarks. MDTest is a tool for evaluating the metadata performance of a file system. It is based on MPI and designed for testing parallel file systems as well.

The following example runs an MDTest metadata test with 2 processes against a FlashBlade NFS volume. It measures various metadata operations including directory/file creation, stat and removal. In this example, FlashBlade handled directory stat at 142,734 operations per second.

mpirun -n 2 mdtest -d /mnt/fbs/mdtest -I 20 -z 5 -b 2 -R

2 tasks, 2520 files/directories

SUMMARY rate: (of 1 iterations)
Operation Max Min Mean Std Dev
--------- --- --- ---- -------
Directory creation 3270.391 3270.391 3270.391 0.000
Directory stat 142734.684 142734.684 142734.684 0.000
Directory rename 1753.931 1753.931 1753.931 0.000
Directory removal 2866.653 2866.653 2866.653 0.000
File creation 1958.908 1958.908 1958.908 0.000
File stat 165996.263 165996.263 165996.263 0.000
File read 5887.985 5887.985 5887.985 0.000
File removal 2934.778 2934.778 2934.778 0.000
Tree creation 1198.884 1198.884 1198.884 0.000
Tree removal 1361.163 1361.163 1361.163 0.000

To increase the load, I changed the number of tasks to 4 (mpirun -n 4), and directory stat jumped to 300,859 operations per second.
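
To push the load beyond a single client, the same job can be spread across several hosts that all mount the FlashBlade volume. A sketch, assuming Open MPI and a hypothetical hosts file named ./hosts listing the client nodes:

# 8 mdtest tasks distributed across the clients listed in ./hosts;
# /mnt/fbs must be mounted on every client
mpirun -n 8 --hostfile ./hosts \
mdtest -d /mnt/fbs/mdtest -I 20 -z 5 -b 2 -R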

General storage performance tests ensure that the storage is fast for general I/O workloads, but often it is not clear how relevant they are to AI. For example, suppose we have the budget to buy 32 GPUs: how do we know whether the storage can deliver training data fast enough to saturate those GPUs during LLM fine-tuning? This is where the second layer of testing comes in.

GPU and storage related benchmarks

GPU and storage related benchmarks measure storage performance for AI workloads with computing accelerators such as GPUs. MLPerf Storage is a benchmark suite from MLCommons, the AI engineering consortium, that measures how fast a storage system can supply data to AI training workloads. It currently supports two deep learning workloads: U-Net3D for image segmentation and BERT for natural language processing. I want to highlight that MLPerf Storage does not require a GPU to run. Instead, it simulates NVIDIA V100 GPUs by inserting sleep times (determined by running the workloads on real hardware) into the tests.

Steps to install and run the benchmark are well documented on the MLPerf Storage web page. The training dataset has to be generated on the target storage first, and then the benchmark is run against it. Below is an example of both steps for the U-Net3D workload with 8 simulated GPUs against a FlashBlade NFS volume.
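
First, generating the dataset. This is a sketch: the flag names follow the v0.5 CLI and may differ in other releases, and the --num-parallel value is just an example.

cd mlperf/storage
./benchmark.sh datagen --workload unet3d \
--num-parallel 8 \
--param dataset.num_subfolders_train=16 \
--param dataset.num_files_train=3515 \
--param dataset.data_folder=/mnt/fbs/unet3d_data

With the dataset in place, the run step looks like this: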

cd mlperf/storage
./benchmark.sh run --workload unet3d \
--num-accelerators 8 \
--results-dir /mnt/fbs/unet3d_results/run-5/pod-1 \
--param dataset.data_folder=/mnt/fbs/unet3d_data \
--param dataset.num_subfolders_train=16 \
--param dataset.num_files_train=3515

Here is the output (Result not verified by MLCommons Association):

[INFO] Averaged metric over all epochs
[METRIC] ==========================================================
[METRIC] Training Accelerator Utilization [AU] (%): 94.3205 (0.7827)
[METRIC] Training Throughput (samples/second): 21.2567 (0.2411)
[METRIC] Training I/O Throughput (MB/second): 2971.8861 (33.7126)
[METRIC] train_au_meet_expectation: success
[METRIC] ==========================================================

The output shows that FlashBlade was able to supply data fast enough to keep the 8 simulated GPUs busy, with an accelerator utilization (AU) of 94%. On the storage backend, FlashBlade was running at 3.67 GB/s throughput.

FlashBlade performance with MLPerf Storage

To demonstrate a failure scenario, I changed the number of simulated GPUs to 16. The test then failed with an AU of 39%, meaning the GPUs were idle, waiting for data, more than half of the time. (Result not verified by the MLCommons Association.)

./benchmark.sh run --workload unet3d \
--num-accelerators 16 \
--results-dir /mnt/fbs/unet3d_results_16a/run-1/pod-1 \
--param dataset.data_folder=/mnt/fbs/unet3d_data \
--param dataset.num_subfolders_train=16 \
--param dataset.num_files_train=6400


[INFO] Averaged metric over all epochs
[METRIC] ==========================================================
[METRIC] Training Accelerator Utilization [AU] (%): 39.4715 (0.3201)
[METRIC] Training Throughput (samples/second): 25.4332 (0.2140)
[METRIC] Training I/O Throughput (MB/second): 3555.8004 (29.9142)
[METRIC] train_au_meet_expectation: fail

This time, FlashBlade was running at 4 GB/s, which is close to the limit of what a single test pod (the test was run on Kubernetes) can drive from the FlashBlade. To pass the 16-GPU test, I would need to run the benchmark on a bigger pod or on multiple pods, but that is another day’s job.
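
A back-of-the-envelope check with the numbers from the two runs above makes the gap concrete: the successful 8-GPU run drew roughly 2,972 MB/s, or about 371 MB/s per simulated V100, so 16 accelerators want nearly 6 GB/s of steady read throughput, well beyond the ~4 GB/s the single pod delivered here.

# rough throughput requirement for 16 simulated GPUs,
# extrapolated from the 8-GPU run (2,971.9 MB/s)
awk 'BEGIN { per = 2971.9 / 8; printf "16 GPUs need ~%.1f GB/s\n", 16 * per / 1000 }'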

Conclusion

In this blog, I started with why benchmarking storage for AI is important. I then introduced the three layers of testing, followed by tools and examples for general and GPU-related storage benchmarks, including Fio, MDTest and MLPerf Storage. These cover the basic and most important aspects of storage performance in an AI system.

Once we know the storage is fast enough to keep the GPUs busy, we should also consider other factors such as ease of operation, data and system reliability, features and cost when choosing storage for AI infrastructure. Stay tuned for another blog on these topics.

