Accelerating Apache Spark with RAPIDS on GPU
Getting started, and benchmarking Spark RAPIDS on Kubernetes and fast S3
Quite a lot of big data and most machine learning workloads are compute-bond. While machine learning, especially deep learning, has been leveraging GPU to accelerate the computation for years, GPU acceleration for big data is relatively new. It is good to see active developments of Spark RAPIDS in this area.
Spark RAPIDS is an accelerator for Apache Spark that leverages GPUs to accelerate computation via the RAPIDS libraries. The idea is to build a unified framework that consolidates and accelerates both big data and machine learning workloads. Recently, I have integrated Spark RAPIDS into my lab environment with Kubernetes and FlashBlade. So that I can run a single pipeline, from ingest to data preparation/ETL to model training, in a single environment.
In my previous blog, I briefly expressed my idea of consolidating and accelerating big data with Kubernetes, GPU and Spark RAPIDS, and FlashBlade. Today I will explain how to get started with this setup. I will also share more details about benchmarking Spark RAPIDS on Kubernetes and FlashBlade S3.
Getting started with Spark RAPIDS on Kubernetes
I have written about how to use Apache Spark with Kubernetes in my previous blog. To add GPU support on top of that, aka adding Spark RAPIDS support, we will need:
- Build the Spark image using CUDA enabled base images, such as the nvidia/cuda images.
- Retrieve the RAPIDS Spark jar, and put it under
Spark/jars
directory. - Copy the GPU discovery script to
spark/bin
directory.
A bonus tip here is to always set timezone to UTC in the Spark image. As I will show later, this could make a big performance difference in Spark jobs, because as of Spark RAPIDS version 22.10, for timestamp data, only data with UTC zone can run on GPUs.
Let’s test the image with the Spark Pi example. Add or modify the following GPU-related settings in spark-pi.yaml
to enable GPU acceleration.
spec:
type: Scala
mode: cluster
image: "<your-image-repo>/apache-spark:3.3.0-rapids22.10"
mainClass: org.apache.spark.examples.SparkPi
...
executor:
instances: 2
cores: 1
memory: 4G
gpu:
name: "nvidia.com/gpu"
quantity: 1
sparkConf:
"spark.executor.resource.gpu.amount": "1"
"spark.rapids.memory.pinnedPool.size": "2G"
"spark.executor.resource.gpu.discoveryScript": "/opt/spark/bin/getGpusResources.sh"
"spark.executor.resource.gpu.vendor": "nvidia.com"
"spark.plugins": "com.nvidia.spark.SQLPlugin"
"spark.rapids.sql.explain": "NOT_ON_GPU"
In this example, I gave 1 GPU to each of the Spark executor. The job should output logs like the below:
2023-02-10 07:34:35,932 WARN rapids.RapidsPluginUtils: RAPIDS Accelerator 22.10.0 using cudf 22.10.0.
2023-02-10 07:34:35,965 WARN rapids.RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 20.
2023-02-10 07:34:35,970 WARN rapids.RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
2023-02-10 07:34:35,970 WARN rapids.RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
...
Pi is roughly 3.1327356636783183
Another example to test with Spark SQL and dataframe.
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df=sqlContext.createDataFrame([1,2,3], "int").toDF("value")
df.createOrReplaceTempView("df")
sqlContext.sql("SELECT * FROM df WHERE value<>1").explain()
sqlContext.sql("SELECT * FROM df WHERE value<>1").show()
sc.stop()
This time, I set spark.rapids.sql.explain=ALL
, to log what can or cannot run on GPU, like the below.
2022-11-14 17:14:40,858 INFO rapids.GpuOverrides: Plan conversion to the GPU took 20.62 ms
2022-11-14 17:14:40,896 INFO rapids.GpuOverrides: GPU plan transition optimization took 6.38 ms
2022-11-14 17:14:41,112 WARN rapids.GpuOverrides:
*Exec <FilterExec> will run on GPU
*Expression <And> (isnotnull(value#0) AND NOT (value#0 = 1)) will run on GPU
*Expression <IsNotNull> isnotnull(value#0) will run on GPU
*Expression <Not> NOT (value#0 = 1) will run on GPU
*Expression <EqualTo> (value#0 = 1) will run on GPU
! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
@Expression <AttributeReference> value#0 could run on GPU
2022-11-14 17:14:41,143 INFO rapids.GpuOverrides: Plan conversion to the GPU took 52.40 ms
2022-11-14 17:14:41,161 INFO rapids.GpuOverrides: GPU plan transition optimization took 14.54 ms
== Physical Plan ==
GpuColumnarToRow false
+- GpuFilter (gpuisnotnull(value#0) AND NOT (value#0 = 1)), true
+- GpuRowToColumnar targetsize(2147483647)
+- *(1) Scan ExistingRDD[value#0]
Note that there are some incompatibilities with Apache Spark. At the time of writing this blog, Spark RAPIDS is still under active development. It may not be ready for some people, but it seems to be promising and exciting, as you will see in my benchmark result.
Benchmarking Spark RAPIDS
To compare Spark performance running on and off GPU, I use the NVIDIA NDS benchmark, which is derived from the TPC-DS benchmark.
Here is my test environment:
- Storage: FlashBlade S3 (previous generation).
- Network: Multiple 40Gbps links from FlashBlade to the hosts.
- Compute: 6 Spark executor pods on Kubernetes. Each pod has 4 vCPU, 24GB RAM. For the on GPU test only, 1 x NVIDIA A10 GPU is given to each pod.
- Spark RAPIDS 22.10 with Apache Spark 3.3.0.
In each of the Spark executor pod, it is set to run 2 tasks on the same executor sharing a single GPU.
"spark.rapids.sql.concurrentGpuTasks": "2" # the number of tasks that are actively sharing the GPU
"spark.executor.resource.gpu.amount": "1"
"spark.task.resource.gpu.amount": "0.5" # 2 tasks on the same executor to share a single GPU
Testing with 200GB NDS dataset in this environment, Spark with GPU is 4x faster, compared to CPU only runs.
During the test, I set spark.rapids.sql.explain=NOT_ON_GPU
. This helps figure out what operations were not on GPU. For example, in my first run, I noticed that timestamp operation (which is very common) on non-UTC zone data were not on GPU. As mentioned above, I changed timezone to UTC in the Spark image. A small change, but a big speed up.
I didn’t compare the infrastructure cost, but on the Spark RAPIDS homepage, it claims 50% cost saving with GPU/TPU on public cloud.
Conclusion
Before the benchmark, I was a little skeptical about Spark on GPU performance, because unlike machine learning workloads, many Spark operations are not necessarily GPU friendly. Gaining 4x speed up in Spark in such a small configuration is amazing. I am sure with bigger dataset and more powerful GPUs, the speed up will be even higher.
Spark cluster with tens to hundreds servers are common in enterprises. Imagine you can run your Spark jobs 4x plus faster, without adding new server. This could be game changing, because otherwise you will need 4x servers, rack spaces, cables and everything associated. It is true you must add some GPU cards, but you need GPU for your AI / machine leaning workloads anyway.
It is especially beneficial when accelerating Spark with Spark RAPIDS on GPU powered Kubernetes cluster, together with fast object storage like FlashBlade. Because Kubernetes makes it very easy and flexible to share CPU and GPU resource, and FlashBlade eliminates I/O bottlenecks between the compute and storage.
Keep GPU busy, make Spark faster.