Join top executives in San Francisco on July 11-12 to hear how leaders are integrating and optimizing AI investments for success.
Following the initial rise of Hadoop, data teams across industries have adopted Apache Spark as the go-to framework for distributed big data processing. The open-source platform has largely replaced Hadoop’s MapReduce by enabling faster in-memory processing of datasets and by handling use cases that Hadoop could not. Spark is also more accessible in terms of APIs, and it is backed by sufficient fault tolerance.
However, with the amount of data in the world predicted to grow to 221 zettabytes by 2026, it is difficult for organizations to get a grip on the data they have. At current processing speeds, companies will experience latencies in enterprise applications like analytics. And if they move to increase speeds, costs rise.
That’s why teams should look at the option of accelerating Spark with GPUs, through Rapids, said Sameer Raheja, senior director of engineering at Nvidia, at the ongoing GTC 2023 conference.
GPU-accelerated Apache Spark
To handle future data demands with Spark, Raheja suggested running the framework with Nvidia GPUs. A plugin jar like Rapids Accelerator for Apache Spark, he said, can allow Spark batch processing to run on GPUs without any code changes.
This, he said, will not only enable teams to run massive data jobs faster at a lower cost than is possible with CPUs, it will also drive power savings.
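As a rough illustration of what "without any code changes" can look like in practice, the sketch below builds a spark-submit command that attaches the plugin jar and turns it on through configuration only. The plugin class `com.nvidia.spark.SQLPlugin` and the `spark.rapids.sql.enabled` setting come from the Rapids Accelerator documentation; the jar path and application file name are hypothetical placeholders, and the helper functions are ours, not part of any Nvidia or Spark API.

```python
from typing import Dict, List


def rapids_conf(enabled: bool = True) -> Dict[str, str]:
    """Spark settings that load the Rapids Accelerator plugin."""
    return {
        # Plugin entry point documented for the Rapids Accelerator.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Switch that toggles GPU execution of SQL/DataFrame operations.
        "spark.rapids.sql.enabled": str(enabled).lower(),
    }


def spark_submit_args(app: str, plugin_jar: str, conf: Dict[str, str]) -> List[str]:
    """Assemble a spark-submit command line; the application itself is untouched."""
    args = ["spark-submit", "--jars", plugin_jar]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Hypothetical jar location and job file, for illustration only.
    command = spark_submit_args(
        app="etl_job.py",
        plugin_jar="/opt/jars/rapids-4-spark.jar",
        conf=rapids_conf(),
    )
    print(" ".join(command))
```

The existing job file is passed through as-is; only the launch configuration differs between a CPU run and a GPU run.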
Rapids Accelerator for Apache Spark combines the power of the Rapids cuDF library and the scale of the Spark distributed computing framework. The Rapids Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and remote direct memory access capabilities.
Using the Nvidia decision support benchmark (an adaptation of the industry-standard TPC-DS benchmark, with 100 modified queries), the company compared a Rapids-based GPU-accelerated Google Cloud Dataproc Spark distribution with one based on CPUs. The GPU nodes did a power run of all 100 queries in just 31 minutes, versus 176 minutes taken by the CPU nodes.
Since the GPU run took less time, it also proved more affordable than the CPU run, costing just $7.20 compared with $32.52. The GPU run was also five times more power-efficient.
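The reported run times and costs can be turned into a speedup and a savings percentage with a few lines of arithmetic. The sketch below uses only the four figures quoted above; the function names are ours.

```python
def speedup(cpu_minutes: float, gpu_minutes: float) -> float:
    """How many times faster the GPU run finished than the CPU run."""
    return cpu_minutes / gpu_minutes


def cost_savings(cpu_cost: float, gpu_cost: float) -> float:
    """Fraction of the CPU run's cost saved by the GPU run."""
    return (cpu_cost - gpu_cost) / cpu_cost


if __name__ == "__main__":
    # Figures reported for the Dataproc comparison.
    print(f"Speedup: {speedup(176, 31):.2f}x")
    print(f"Cost savings: {cost_savings(32.52, 7.20):.1%}")
```

On these numbers the GPU run finishes roughly 5.7 times faster and costs about 78% less than the CPU run.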
“For anyone who’s running big data workloads and managing a budget … performance, cost and efficiency are key factors, and Rapids Accelerator for Spark addresses all three,” Raheja emphasized.
He added that similar benchmark results were seen on other clouds and Spark distributions with configurations closely matching that of Dataproc. For example, a Rapids-accelerated AWS EMR distribution saw 42% cost savings, while AWS Databricks Photon and Azure Databricks Photon delivered 39% and 34% cost savings, respectively.
How it works
The key to these benefits is Apache Spark 3, which brings column-based processing and resource-aware scheduling of custom resources. This allows teams to schedule tasks on accelerator resources like GPUs.
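To make resource-aware scheduling more concrete, the sketch below pairs the two Spark 3 settings that request GPUs (`spark.executor.resource.gpu.amount` and `spark.task.resource.gpu.amount`) with a simplified estimate of how many tasks one executor can run at once. The slot calculation is our approximation of the scheduler's behavior rather than Spark's actual code, and the example values are assumptions chosen for illustration.

```python
import math
from typing import Dict


def gpu_scheduling_conf(executor_gpus: int, task_gpu_amount: float) -> Dict[str, str]:
    """Spark 3 settings that ask the scheduler to account for GPUs."""
    return {
        "spark.executor.resource.gpu.amount": str(executor_gpus),
        "spark.task.resource.gpu.amount": str(task_gpu_amount),
    }


def concurrent_task_slots(
    executor_gpus: int,
    task_gpu_amount: float,
    executor_cores: int,
    task_cpus: int = 1,
) -> int:
    """Simplified estimate: tasks per executor is capped by both GPUs and cores."""
    if task_gpu_amount <= 0:
        raise ValueError("task_gpu_amount must be positive")
    if task_gpu_amount < 1:
        # A fractional amount lets several tasks share one GPU.
        gpu_slots = executor_gpus * math.floor(1 / task_gpu_amount)
    else:
        gpu_slots = executor_gpus // int(task_gpu_amount)
    cpu_slots = executor_cores // task_cpus
    return min(gpu_slots, cpu_slots)


if __name__ == "__main__":
    # Assumed example: one GPU per executor, shared by four tasks, 8 cores.
    print(gpu_scheduling_conf(1, 0.25))
    print(concurrent_task_slots(executor_gpus=1, task_gpu_amount=0.25, executor_cores=8))
```

In this model, giving each task a quarter of a GPU caps the executor at four concurrent tasks even when more CPU cores are available.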
“You can continue to write your application in the APIs you’re familiar with: SQL, Python, R, Java and Scala. Spark provides distributed and scale-up compute power, Spark 3.x provides resource-aware scheduling, and the Rapids Accelerator for Apache Spark plugin provides transparency for applications to run on Nvidia GPUs, enabling acceleration in cooperation with [the] Spark core engine’s built-in processor,” Raheja said.
Currently, the Rapids Spark accelerator is available on and built into Amazon EMR, Cloudera CDP, Databricks ML runtime, Azure Synapse Analytics, Google Cloud Dataproc, and open-source Apache Spark 3.x distributions, either on-premises or in the cloud.
The 2023 Nvidia GTC event runs through March 23.