Smaller is Better — Big Data System in 2023

Consolidating and accelerating big data with fast S3, Kubernetes and Spark RAPIDS

Yifeng Jiang
4 min read · Nov 28, 2022

A few years ago, I worked on a project to build a large-scale Hadoop & Spark cluster from scratch. We ended up using 1,000+ servers to create a 50+ PB HDFS cluster and run thousands of Hive and Spark jobs on it. I enjoyed the whole process, from choosing servers and designing scalable networks to setting up and tuning the Hadoop & Spark cluster. At that time, bigger was better. I even enjoyed watching the metrics rise on the monitoring dashboard. But if you ask me whether I want to do it again, probably not. In 2023, if I were to build another one, I would make the system as small as possible and, of course, as fast as possible. I choose to go small for a couple of reasons, including ESG (Environmental, Social, and Governance), operations, and efficiency. So how can we do that? Let’s find out in this blog.

Consolidating and accelerating storage

The key is consolidation and acceleration, for both storage and compute. I will start with the easy one: storage. I work for Pure Storage, a storage company specialised in all-flash SAN, file and object storage. (As a data and machine learning engineer, why do you work at a storage company? Yes, I get this question all the time; I will write about that another time.) My opinion might be a little biased, but for me it is an easy and obvious choice: I am replacing HDFS with FlashBlade S3 for storage consolidation and acceleration.

With the default 3x replication, the 50 PB HDFS cluster I worked on before provides 16.67 PB of usable storage capacity, at the cost of 1,000 servers occupying around 50 racks in the datacenter.

To provide the same usable capacity with the latest FlashBlade//S, I would only need around 12 FlashBlade chassis, which fit in 2 racks (each chassis is 5 rack units), assuming the power supply is sufficient. 50 racks vs. 2, a 25x consolidation! Who wants to manage 50 racks when you can do it with 2? Not to mention the simplicity of operating the FlashBlade storage appliance compared to the complexity of HDFS software running on 1,000 servers.
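The consolidation arithmetic above can be sketched in a few lines of shell. The 50 PB, 3x replication, 12 chassis, and 5 rack units figures come from the text; the 42U standard rack size is my assumption:

```shell
# 3x HDFS replication: 50 PB raw -> usable capacity
usable=$(awk 'BEGIN{printf "%.2f", 50/3}')
echo "Usable HDFS capacity: ${usable} PB"    # 16.67 PB

# 12 FlashBlade chassis x 5 rack units each, in assumed 42U racks
ru=$((12 * 5))                 # 60 rack units total
racks=$(( (ru + 41) / 42 ))    # ceiling division
echo "Racks needed: ${racks}"  # 2 racks, vs. ~50 for the HDFS cluster
```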

FlashBlade//S Specifications (source: Pure Storage FlashBlade//S)

Regarding performance, FlashBlade S3 is the fastest object storage I have ever seen. For big data workloads, observations from benchmarks I ran previously suggest FlashBlade S3 is as fast as HDFS running on local SSD. This is also a future-proof data lake storage. While HDFS is mostly used only for big data, thanks to FlashBlade’s support of both fast S3 and NFS, I can further consolidate other workloads, such as AI/ML and cloud-native applications, onto FlashBlade — a unified fast file and object storage.
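As a rough sketch, pointing Spark at an S3-compatible endpoint such as FlashBlade is mostly a matter of Hadoop s3a connector configuration. The endpoint, credential variables, and job file below are placeholders, not values from this article:

```shell
# Hypothetical spark-submit flags for an on-prem S3 data lake via the s3a connector
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=https://flashblade.example.com \
  --conf spark.hadoop.fs.s3a.access.key="$S3_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.secret.key="$S3_SECRET_KEY" \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  my_etl_job.py   # reads/writes s3a://... paths instead of hdfs://...
```

With this in place, existing jobs typically only need their hdfs:// paths swapped for s3a:// ones.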

Consolidating and accelerating compute

For compute, we will use Kubernetes, GPU and the Spark RAPIDS Accelerator.

In my last blog, I shared how I built an open data lakehouse with Spark, Delta Lake and Trino on FlashBlade S3. This data lakehouse system combines the strengths of the data lake and the data warehouse in a way that is open, simple, and runs anywhere. In fact, it runs on a Kubernetes cluster. I also shared my approaches and workflow for running Spark on Kubernetes in detail. Ideally, I want just one computer in the datacenter — a single Kubernetes cluster across many servers, where many different workloads share resources like processes sharing CPU and RAM in a computer.
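For reference, a minimal Spark-on-Kubernetes submission looks roughly like this. The API server address, container image, and namespace are placeholders, not values from my cluster:

```shell
# Hypothetical cluster-mode submission to a Kubernetes cluster
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --name lakehouse-etl \
  --conf spark.kubernetes.container.image=my-registry/spark:3.3.1 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.executor.instances=4 \
  local:///opt/spark/jobs/etl.py   # job file baked into the container image
```

Kubernetes then launches the driver and executors as pods, so Spark jobs share the cluster with other workloads.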

By using Kubernetes, we are not just doing compute consolidation, but also acceleration. I have configured my Kubernetes cluster for GPU scheduling using the NVIDIA GPU Operator. This accelerates compute-heavy operations, including some big data and most machine learning tasks. I wrote about distributed deep learning with Horovod on Kubernetes in a previous blog. Today, I want to briefly talk about Spark RAPIDS, an accelerator for Apache Spark that leverages GPUs to accelerate processing via the RAPIDS libraries. The idea is to build a unified framework for big data ETL and ML/DL. Together with a unified fast file and object storage like FlashBlade, we can build a single pipeline, from ingest to data preparation to model training.
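Enabling the Spark RAPIDS Accelerator is itself mostly configuration: load the plugin and request GPU resources for executors. A minimal sketch, where the jar version and resource amounts are placeholders (com.nvidia.spark.SQLPlugin is the documented plugin class):

```shell
# Hypothetical job submission with the RAPIDS Accelerator enabled
spark-submit \
  --jars rapids-4-spark_2.12-23.02.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  my_etl_job.py
```

With these settings, supported SQL and DataFrame operations run on the GPU, while unsupported ones fall back to the CPU.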

A single pipeline, from ingest to data preparation/ETL to model training

In a quick lab test on a small dataset (200GB), using the NVIDIA Decision Support (NDS) benchmark with 6 x A10 GPUs, the overall runtime was 4x faster than CPU-only processing. I am optimistic that with bigger datasets, faster GPUs, newer software, and further tuning, the speed gain will be even bigger. Even assuming only a 4x acceleration, that is a big efficiency boost: reducing the number of compute servers to a quarter is a big win. Stay tuned for a follow-up blog on the details of getting started and benchmarking with Spark RAPIDS.

Conclusion

Just as today’s smartphones are more powerful than the computer servers of 20 years ago, I envision that big data systems in a few years will be much smaller in datacenter footprint, but far more powerful in capacity, performance and computing power.

Smaller is better, for big data systems in 2023.



Software & solutions engineer, big data and machine learning, jogger, hiker, traveler, gamer.