Comparing Big Data Performance with Different Data Lake Storages

Big data benchmarks using TPC-DS and YCSB with HDFS, FlashBlade S3, and Amazon S3

Yifeng Jiang
Jun 16, 2022

Every big data user I talk to has a data lake and data warehouse use case. It typically starts with Hadoop, using HDFS as the data lake and Apache Spark for distributed processing. A data warehouse is always there because everyone likes SQL. The big trend in this area is embracing an as-a-service model and architecture. This is probably influenced by the cloud, but it does not stop at the cloud; hybrid cloud, or even on-premise big data users are doing the same.

The key enabler of such an architecture is separating compute and storage. This is obvious in the cloud. There are big benefits to doing so, including the flexibility to operate and scale compute and storage independently, better resource utilisation, cost savings, and enabling big data analytics in virtual machines and containers. However, one concern I hear quite often, especially from on-premise customers, is performance: won't big data be slow on remote storage? In this blog, I will describe two benchmarks I ran to help customers compare end-to-end big data system performance with different data lake storages. I use TPC-DS for Hive and YCSB for HBase to compare the performance of different types of workloads on three data lake storages: HDFS, FlashBlade S3, and Amazon S3.

The conclusions drawn from the benchmarks are:

  • Remote storage is not inherently slow for big data. Object storage like FlashBlade S3 is as fast as, or even faster than, local HDFS for some workloads.
  • Data locality in big data is overrated in 2022, when networks are fast. It matters more for latency-oriented workloads such as HBase reads.
  • Not all object storage performs the same. FlashBlade S3 delivers 1.5~4.5x the speed of Amazon S3 for Hive and 20~40x for HBase.

I also noticed that, compared to HDFS, FlashBlade is faster with Hive but slower with HBase. Why does it behave like that? Let me describe the test methodology, observations, and charts before finally answering this question.

Test Methodology and Environment

I use Cloudera CDP to install and configure Hive and HBase to use one of the data lake storages. I run three rounds of tests on the same storage and take the average of the three runs as the final result for that storage. I then collect and compare results from each storage. Tests were executed on a 1TB dataset for each benchmark.

Tests for FlashBlade and HDFS were performed in our on-premise lab, with a 30-blade (2-chassis) FlashBlade, and HDFS running on local SATA SSDs. There are multiple 40Gbps network links from the FlashBlade to the hosts. I use 10 hosts in total, running 1 CDP master and 9 worker nodes. Each host has:

  • 2 x AMD EPYC 8-core processors, 32 vCPUs in total at 3194MHz.
  • 128GB DRAM
  • 980GB SATA SSD running the OS and HDFS
  • 40Gbps network

For Amazon S3, I use 10 x m6a.8xlarge EC2 instances, which have the closest spec to the hosts in our lab.

On the software side, I use CentOS 7.9 in the lab and Ubuntu 20.04 on EC2, because the CentOS 7 AMI did not support the AMD CPUs at the time. I install Cloudera CDP Base 7.1.7 on the nodes. I configured Hive and HBase to use FlashBlade S3 and Amazon S3 through the Hadoop S3A connector, which ships with CDP. I made minimal Hive on Tez configuration changes:

tez.am.resource.memory.mb=4096
hive.tez.container.size=8096
hive.auto.convert.join.noconditionaltask.size=1252698795
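
To point Hive and HBase at FlashBlade S3 through the S3A connector, a handful of additional properties are needed. A minimal sketch is shown below; the endpoint and credentials are placeholders rather than the values from my lab, and Amazon S3 normally needs no endpoint override:

# S3A connector settings (illustrative placeholders only)
fs.s3a.endpoint=https://<flashblade-data-vip>
fs.s3a.access.key=<ACCESS_KEY>
fs.s3a.secret.key=<SECRET_KEY>
fs.s3a.path.style.access=true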

For everything else, including HDFS and HBase, I use the default CDP configurations.

TPC-DS for Hive Results and Observations

TPC-DS is an industry-standard benchmark that models a decision support system. It is widely used to measure big data system performance. It specifies the table schemas, data generation, and 99 SQL queries to execute.

I use hive-testbench (open-sourced by Hortonworks), which implements the TPC-DS specification in Hive, to generate the TPC-DS data and run the 99 queries (a command sketch follows the list below). For these tests, I:

  • Make minimal changes to test S3 using Hive external tables.
  • Generate a 1TB TPC-DS dataset (scale factor 1000) in ORC format.
  • Execute the queries sequentially in each run.
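
For reference, a typical hive-testbench invocation looks roughly like the following; the exact script names and options can differ between versions of the repository, so treat this as a sketch rather than the literal commands from my runs:

# Build the TPC-DS data generator (one-time)
./tpcds-build.sh
# Generate the 1TB dataset and create ORC tables (scale factor 1000)
FORMAT=orc ./tpcds-setup.sh 1000
# Run the 99 queries sequentially against the generated schema
perl runSuite.pl tpcds 1000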

The key observations from these tests: FlashBlade S3 is at the same performance level as HDFS, and it delivers 1.5~4.5x the speed of Amazon S3 for Hive.

Average execution time across successful queries

For small queries, HDFS tends to be slightly faster (around 5%) than FlashBlade. For big queries, FlashBlade tends to be faster than HDFS, sometimes several times faster. Compared to Amazon S3, FlashBlade delivers 1.5~4.5x the speed for I/O-heavy queries.

Execution time of the top 10 I/O-heavy queries

While FlashBlade succeeded with all 99 queries, HDFS failed two (q17 and q29) due to out-of-memory errors. This is most likely because some caching mechanisms are enabled by default when using Hive with HDFS, and they can compete with query execution for memory. For Amazon S3, q25 failed to complete due to a timeout (2.5 hours). I also noticed that FlashBlade was running at 20~30% load during the tests, indicating that it could sustain more I/O with more compute hosts.

Below is a table of all 99 TPC-DS queries' execution times with the three data lake storages.

TPC-DS for Hive on Tez — query execution time comparison between data lake storages

YCSB for HBase Results and Observations

Open-sourced by Yahoo!, the YCSB benchmark is an industry-standard benchmark that evaluates the performance of different key-value data stores for a common set of workloads. It supports many key-value stores, including HBase. When testing on FlashBlade S3 and Amazon S3, HBase was configured to store its table files (HFiles) on S3 while keeping its write-ahead logs (WALs) on HDFS, which is the minimum configuration required to use HBase with S3.
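
For reference, that split is typically expressed with two HBase properties along these lines; the bucket and NameNode paths below are placeholders, not my lab values:

# HBase storage layout for the S3 runs (illustrative placeholders)
hbase.rootdir=s3a://<hbase-bucket>/hbase
hbase.wal.dir=hdfs://<namenode>:8020/hbase-wal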

I use YCSB to generate table data and run the 6 common workloads with the following parameters:

  • 1 HBase master, 9 region servers
  • Generate 1TB dataset: 1 billion records, 1KB each
  • Data load and each workload execution run with 32 threads in parallel on the master node (see the command sketch after this list)
  • Run 10 million operations for each workload. Time out after 15 minutes.
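
The load and run commands look roughly like the following, shown here for workload A; the HBase binding name (hbase20 in this sketch) and the classpath depend on the YCSB and HBase versions, so treat this as an illustration rather than the exact commands from my runs:

# Load 1 billion 1KB records into HBase (100% INSERT)
bin/ycsb load hbase20 -cp /etc/hbase/conf -P workloads/workloada \
  -p table=usertable -p columnfamily=family \
  -p recordcount=1000000000 -threads 32

# Run workload A: 10 million operations, capped at 15 minutes
bin/ycsb run hbase20 -cp /etc/hbase/conf -P workloads/workloada \
  -p table=usertable -p columnfamily=family \
  -p operationcount=10000000 -p maxexecutiontime=900 -threads 32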

The YCSB benchmark has six built-in common key-value store workloads:

  • Workload A: 50% read, 50% update
  • Workload B: 95% read, 5% update
  • Workload C: 100% read
  • Workload D: read/update/insert ratio 95/0/5
  • Workload E: scan/insert ratio 95/5
  • Workload F: read/read-modify-write ratio 50/50

Together with the 100% insert data load, YCSB measures performance for all major HBase operations, reporting throughput (operations/sec) and latency (microseconds), including average, 95th percentile, and 99th percentile latency.
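
To give a sense of what that looks like, below is the shape of a YCSB result file; the metric names are YCSB's own, but the values are elided rather than taken from my runs:

[OVERALL], RunTime(ms), ...
[OVERALL], Throughput(ops/sec), ...
[READ], AverageLatency(us), ...
[READ], 95thPercentileLatency(us), ...
[READ], 99thPercentileLatency(us), ...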

I have excluded the Amazon S3 results from the charts below because the gap between it and the other two storages is so big (up to 100x for some HBase operations) that the charts would be difficult to read. For example, HBase READ latency with Amazon S3 is 153ms, compared to 3.8ms with FlashBlade and 1.5ms with HDFS.

Compared to HDFS, FlashBlade S3 throughput is lower for HBase, except for INSERT. HDFS delivers 1.5x the throughput on average across workloads.

Average HBase throughput across workloads

For the HBase data load, which is all INSERT, HDFS and FlashBlade deliver almost the same throughput.

HBase throughput breakdown by workload

Both FlashBlade and HDFS deliver low latency for HBase READ and SCAN, although FlashBlade's latency is higher.

FlashBlade latency for READ is 2.4x that of HDFS: 3.8ms vs. 1.5ms.

95th percentile latency for HBase READ, averaged across workloads

FlashBlade latency for SCAN is 4x that of HDFS: 16ms vs. 4ms.

95th percentile latency for HBase SCAN

Although the gaps look big in the charts, both are actually at very low latency. The difference is only a few milliseconds (the charts are in microseconds), which is not a big deal for most real-world workloads.

It is interesting to note that FlashBlade latency for HBase UPDATE and INSERT is slightly lower than that of HDFS. Both FlashBlade and HDFS deliver low latency for HBase UPDATE and INSERT.

FlashBlade latency for UPDATE is slightly lower than HDFS: 0.60ms vs. 0.64ms.

95th percentile latency for HBase UPDATE

FlashBlade latency for INSERT is 12% lower than HDFS: 0.74ms vs. 0.84ms.

95th percentile latency for HBase INSERT

During YCSB tests, FlashBlade was at 30~50% load, indicating that it can support more I/O.

Below is a table of YCSB for HBase throughput (ops/s) and latency (microseconds) with the three data lake storages.

YCSB for HBase — throughput (ops/s) and latency (microseconds) comparison between data lake storages

Performance in a Big Data System

Both FlashBlade and HDFS are much faster than Amazon S3 in the two benchmarks. However, between FlashBlade and HDFS, performance for Hive and HBase differs, as shown in the results above. Why does this happen? What impacts performance in a big data system? My answer would be everything in the stack: server hardware, OS, network, storage, the big data software stack, and the data applications. And the details in each of these layers matter. For example, things that can impact performance in a data lake storage include, but are not limited to:

  • Local or remote storage
  • HDD vs. SSD vs. direct-flash (as used in FlashBlade)
  • Large or small I/O, sequential vs. random I/O
  • HDFS on Linux filesystem vs. S3 on highly optimised storage OS (Purity//FB)

With these in mind, let me try to answer why FlashBlade is faster with Hive but slower with HBase when compared to HDFS.

  • Hive is throughput-oriented, with large I/O.
  • HBase is latency-oriented, with small I/O for reads and large I/O for writes.
  • FlashBlade (or any remote storage) cannot beat local SSD latency (laws of physics), but remote storage can deliver higher throughput. In fact, FlashBlade delivers high throughput at low latency for big data systems.
  • HDFS I/O is not always local. In a busy cluster, node-level or even rack-level local I/O rarely happens. In some real-world large clusters, I have seen overall performance increase after disabling I/O locality; there are YARN configurations that control how long to wait for local I/O before scheduling tasks to remote nodes (a configuration sketch follows this list).
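
As one hedged example, on a cluster using the YARN Capacity Scheduler, the node-locality delay is controlled by a property like the one below; the value counts missed scheduling opportunities rather than seconds, and 0 is shown here only to illustrate turning the wait off:

# Capacity Scheduler: how many scheduling opportunities to skip while waiting
# for a node-local container (0 = schedule immediately, even on a remote node)
yarn.scheduler.capacity.node-locality-delay=0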

Designing a big data system is all about trade-offs among performance, availability, resilience, scalability, simplicity, flexibility, cost, and more. FlashBlade S3 is similar to, or a little slower than, HDFS on local SSDs, but it is extremely simple, efficient, and flexible. Amazon S3 may have chosen scalability and resilience over performance. HDFS is quite complex to operate and less flexible, but it may deliver the best performance for some big data workloads.

When choosing a data lake storage, instead of asking which is the fastest, I think the right question is which is the best fit, and everyone's answer could be different.

So, which data lake storage best suits your needs? See you in my next blog.
