
Run Apache Spark and Apache Iceberg write jobs 2x faster with Amazon EMR


Amazon EMR runtime for Apache Spark provides a high-performance runtime environment while maintaining API compatibility with open source Apache Spark and the Apache Iceberg table format. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts, and AWS Glue use the optimized runtimes.

In this post, we demonstrate the write performance benefits of using the Amazon EMR 7.12 runtime for Spark and Iceberg compared to open source Spark 3.5.6 with Iceberg 1.10.0 tables on a 3 TB merge workload.

Write Benchmark Methodology

Our benchmarks demonstrate that Amazon EMR 7.12 can run 3 TB merge workloads more than 2 times faster than open source Spark 3.5.6 with Iceberg 1.10.0, delivering significant improvements for data ingestion and ETL pipelines while providing the advanced features of Iceberg, including ACID transactions, time travel, and schema evolution.

Benchmark workload

To evaluate the write performance improvements in Amazon EMR 7.12, we chose a merge workload that reflects common data ingestion and ETL patterns. The benchmark consists of 37 basic merge operations on TPC-DS 3 TB tables, testing the performance of INSERT, UPDATE, and DELETE operations. The workload is inspired by established benchmarking approaches from the open source community, including Delta Lake's merge benchmark methodology and the LST-Bench framework. We combined and adapted these approaches to create a comprehensive test of Iceberg write performance on AWS, with an initial focus on copy-on-write performance only.

Workload characteristics

The benchmark executes 37 basic sequential merge queries that modify TPC-DS fact tables. The 37 queries are organized into three categories:

  • Inserts (queries m1-m6): Adding new records to tables with varying data volumes. These queries use source tables with 5-100% new records and zero matches, testing pure insert performance at different scales.
  • Upserts (queries m8-m16): Modifying existing records while inserting new ones. These upsert operations combine different ratios of matched and non-matched records (for example, 1% matches with 10% inserts, or 99% matches with 1% inserts), representing typical scenarios where data is both updated and augmented.
  • Deletes (queries m7, m17-m37): Removing records with varying selectivity. These range from small, targeted deletes affecting 5% of files and rows to large-scale deletions, including partition-level deletes that can be optimized to metadata-only operations.
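As an illustration, an upsert in the style of queries m8-m16 can be expressed as a single Spark SQL MERGE statement. The catalog and source table names below are hypothetical stand-ins for the benchmark's generated tables; the join keys are standard TPC-DS web_returns columns:

```sql
-- Hypothetical upsert in the m8-m16 style: matched rows are updated in place,
-- unmatched source rows are inserted. With copy-on-write, any data file
-- containing a matched row is rewritten.
MERGE INTO my_catalog.tpcds.web_returns AS t
USING staged_web_returns AS s
  ON t.wr_order_number = s.wr_order_number
 AND t.wr_item_sk = s.wr_item_sk
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

Varying the fraction of source rows that match the target reproduces the matched/non-matched ratios described above.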

The queries operate on the table state created by earlier operations, simulating real ETL pipelines where subsequent steps depend on earlier transformations. For example, the first six queries insert between 607,000 and 11.9 million records into the web_returns table. Later queries then update and delete from this modified table, testing read-after-write performance. Source tables were generated by sampling the TPC-DS web_returns table with controlled match/non-match ratios for consistent test conditions across the benchmark runs.

The merge operations vary in scale and complexity:

  • Small operations affecting 607,000 records
  • Large operations modifying over 12 million records
  • Selective deletes requiring file rewrites
  • Partition-level deletes optimized to metadata operations

Benchmark configuration

We ran the benchmark on identical hardware for both Amazon EMR 7.12 and open source Spark 3.5.6 with Iceberg 1.10.0:

  • Cluster: 9 r5d.4xlarge instances (1 primary, 8 workers)
  • Compute: 144 total vCPUs, 1,152 GB memory
  • Storage: 2 x 300 GB NVMe SSD per instance
  • Catalog: Hadoop catalog
  • Data format: Parquet files on Amazon S3
  • Table format: Apache Iceberg (default: copy-on-write mode)
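A Hadoop catalog backed by S3, as used in this configuration, can be set up with Spark properties along these lines. This is a sketch under standard Iceberg conventions; the catalog name and warehouse path are placeholders:

```
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=hadoop
spark.sql.catalog.my_catalog.warehouse=s3://<your-bucket>/iceberg-warehouse
```

Copy-on-write is Iceberg's default write mode, so it needs no explicit table property here.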

Benchmark results

We compared benchmark results for Amazon EMR 7.12 to open source Spark 3.5.6 and Iceberg 1.10.0. We ran the 37 merge queries in three sequential iterations and took the average runtime across those iterations for comparison. The following table shows the results averaged across the three iterations:

Amazon EMR 7.12 (seconds) | Open source Spark 3.5.6 + Iceberg 1.10.0 (seconds) | Speedup
443.58 | 926.63 | 2.08x

The average runtime for the three iterations on Amazon EMR 7.12 with Iceberg enabled was 443.58 seconds, a 2.08x speedup compared to open source Spark 3.5.6 and Iceberg 1.10.0. The following figure presents the total runtimes in seconds.

The following table summarizes the metrics.

Metric | Amazon EMR 7.12 on EC2 | Open source Spark 3.5.6 and Iceberg 1.10.0
Average runtime in seconds | 443.58 | 926.63
Geometric mean over queries in seconds | 6.40746 | 18.50945
Cost* | $1.58 | $2.68

*Detailed cost estimates are discussed later in this post.

The following chart shows the per-query performance improvement of Amazon EMR 7.12 relative to open source Spark 3.5.6 and Iceberg 1.10.0. The extent of the speedup varies from one query to another, with Amazon EMR outperforming open source Spark with Iceberg tables throughout; the largest speedup, up to 13.3 times faster, is on query m31. The horizontal axis arranges the TPC-DS 3 TB benchmark queries in descending order of the performance improvement seen with Amazon EMR, and the vertical axis shows the magnitude of this speedup as a ratio.

Performance optimizations in Amazon EMR

Amazon EMR 7.12 achieves over 2x faster write performance through systematic optimizations across the write execution pipeline. These improvements span several areas:

  • Metadata-only delete operations: When deleting entire partitions, EMR can now optimize these operations to metadata-only changes, eliminating the need to rewrite data files. This significantly reduces the time and cost of partition-level delete operations.
  • Bloom filter joins for merge operations: Enhanced join strategies using bloom filters reduce the amount of data that must be read and processed during merge operations, particularly benefiting queries with selective predicates.
  • Parallel file write-out: Optimized parallelism during the write phase of merge operations improves throughput when writing filtered results back to Amazon S3, reducing overall merge operation time. We balanced this parallelism against read performance to optimize the workload as a whole.
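To illustrate the first optimization: a delete whose predicate aligns with partition boundaries can be satisfied by dropping the affected data files from table metadata rather than rewriting them. The sketch below assumes a table partitioned by the hypothetical column wr_returned_date_sk:

```sql
-- Hypothetical partition-level delete. Because the predicate covers whole
-- partitions, the engine can drop the matching data files in an Iceberg
-- metadata commit without reading or rewriting any of them.
DELETE FROM my_catalog.tpcds.web_returns
WHERE wr_returned_date_sk < 2450900
```

A delete on a non-partition column, by contrast, would require rewriting every data file that contains a matching row under copy-on-write.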

These optimizations work together to deliver consistent performance improvements across diverse write patterns. The result is significantly faster data ingestion and ETL pipeline execution while maintaining Iceberg's ACID guarantees and data consistency.

Cost comparison

Our benchmark provides the total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. For additional insight, we also examine the cost aspect. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR expenses.

  • Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
    • r5d.4xlarge hourly rate = $1.152 per hour
  • Root Amazon EBS cost = number of instances * Amazon EBS per GB-hour rate * root EBS volume size * job runtime in hours
  • Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR rate * job runtime in hours
    • r5d.4xlarge Amazon EMR rate = $0.27 per hour
  • Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost
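The formulas above can be sketched as a short calculation. The EC2 ($1.152/hour) and EMR ($0.27/hour) rates come from this post; the EBS rate is our assumption of the standard gp2 price ($0.10 per GB-month, at roughly 730 hours per month), since the post does not state it:

```python
# Cost-estimate sketch for the 9-node r5d.4xlarge cluster used in the benchmark.
EC2_HOURLY = 1.152          # r5d.4xlarge On-Demand rate, USD/hour (from the post)
EMR_HOURLY = 0.27           # r5d.4xlarge EMR rate, USD/hour (from the post)
EBS_GB_HOURLY = 0.10 / 730  # assumed gp2 rate, USD per GB-hour
INSTANCES = 9               # 1 primary + 8 workers
EBS_GB = 20                 # root EBS volume size per instance, GB

def estimate_cost(runtime_seconds, include_emr_fee=True):
    """Total job cost: EC2 + root EBS (+ the EMR fee for the EMR runtime)."""
    hours = runtime_seconds / 3600
    ec2 = INSTANCES * EC2_HOURLY * hours
    ebs = INSTANCES * EBS_GB_HOURLY * EBS_GB * hours
    emr = INSTANCES * EMR_HOURLY * hours if include_emr_fee else 0.0
    return ec2 + ebs + emr

print(round(estimate_cost(443.58), 2))                         # EMR 7.12 -> 1.58
print(round(estimate_cost(926.63, include_emr_fee=False), 2))  # open source -> 2.68
```

Under these assumptions the estimates reproduce the $1.58 and $2.68 totals in the table that follows.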

The calculations show that the Amazon EMR 7.12 benchmark yields a 1.7x cost-efficiency improvement over open source Spark 3.5.6 and Iceberg 1.10.0 in running the benchmark job.

Metric | Amazon EMR 7.12 | Open source Spark 3.5.6 and Iceberg 1.10.0
Runtime in seconds | 443.58 | 926.63
Number of EC2 instances (includes primary node) | 9 | 9
Amazon EBS size | 20 GB | 20 GB
Amazon EC2 cost (total runtime cost) | $1.28 | $2.67
Amazon EBS cost | $0.00 | $0.01
Amazon EMR cost | $0.30 | $0.00
Total cost | $1.58 | $2.68
Cost savings | Amazon EMR 7.12 is 1.7 times better | Baseline

Run open source Spark benchmarks on Iceberg tables

We used separate EC2 clusters, each equipped with 9 r5d.4xlarge instances, for testing both open source Spark 3.5.6 and Amazon EMR 7.12 on the Iceberg workload. The primary node had 16 vCPUs and 128 GB of memory, and the eight worker nodes collectively had 128 vCPUs and 1,024 GB of memory. We ran tests using the Amazon EMR default settings to showcase the typical user experience and minimally adjusted the Spark and Iceberg settings to maintain a balanced comparison.

The following table summarizes the Amazon EC2 configuration for the primary node and eight worker nodes of type r5d.4xlarge.

EC2 instance | vCPU | Memory (GiB) | Instance storage (GB) | EBS root volume (GB)
r5d.4xlarge | 16 | 128 | 2 x 300 NVMe SSD | 20

Benchmarking instructions

Follow the steps below to run the benchmark:

  1. For the open source run, create a Spark cluster on Amazon EC2 using Flintrock with the configuration described previously.
  2. Set up the TPC-DS source data with Iceberg in your S3 bucket.
  3. Build the benchmark application JAR from source to run the benchmark and collect the results.
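Step 1 can be sketched with Flintrock along the lines below. The cluster name, key pair, identity file, and AMI are placeholders, and the flag names should be checked against your Flintrock version:

```shell
# Launch a 1-primary, 8-worker Spark cluster matching the benchmark setup.
flintrock launch benchmark-cluster \
  --num-slaves 8 \
  --spark-version 3.5.6 \
  --ec2-instance-type r5d.4xlarge \
  --ec2-key-name <your-key-pair> \
  --ec2-identity-file <path/to/your-key.pem> \
  --ec2-ami <amazon-linux-ami-id>
```

Steps 2 and 3 are covered in detail in the repository linked below.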

Detailed instructions are provided in the emr-spark-benchmark GitHub repository.

Summarize the results

After the Spark job finishes, retrieve the test result file from the output S3 bucket at s3:///benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. You can do this either through the Amazon S3 console, by navigating to the specified bucket location, or by using the AWS Command Line Interface (AWS CLI). The Spark benchmark application organizes the data by creating a timestamp folder and placing a summary file inside a folder labeled summary.csv. The output CSV files contain four columns without headers:

  • Query name
  • Median time
  • Minimum time
  • Maximum time

With the data from three separate test runs, each with one iteration, we can calculate the average and geometric mean of the benchmark runtimes.
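That summarization step can be sketched as a small script. The four-column, headerless layout follows the description above; the function name is ours:

```python
# Compute the average and geometric mean of median query runtimes from a
# headerless summary CSV (query name, median, min, max).
import csv
import math

def summarize(csv_path):
    """Return (average, geometric mean) of the median runtimes in the CSV."""
    medians = []
    with open(csv_path, newline="") as f:
        for name, median, min_t, max_t in csv.reader(f):
            medians.append(float(median))
    avg = sum(medians) / len(medians)
    # Geometric mean via log-space sum, which avoids overflow on long runs.
    geomean = math.exp(sum(math.log(t) for t in medians) / len(medians))
    return avg, geomean
```

Averaging the per-run results from the three test runs then yields the totals reported earlier.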

Clean up

To help prevent future charges, delete the resources you created by following the instructions in the Cleanup section of the GitHub repository.

Summary

Amazon EMR is continuously improving the EMR runtime for Spark when used with Iceberg tables, achieving write performance that is over 2 times faster than open source Spark 3.5.6 and Iceberg 1.10.0 with EMR 7.12 on 3 TB merge workloads. This represents a significant improvement for data ingestion and ETL pipelines, helping deliver a 1.7x cost reduction while maintaining Iceberg's ACID guarantees. We encourage you to keep up to date with the latest Amazon EMR releases to fully benefit from ongoing performance improvements.

To stay informed, subscribe to the RSS feed for the AWS Big Data Blog, where you will find updates on the EMR runtime for Spark and Iceberg, as well as configuration best practices and tuning recommendations.


About the authors

Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.

Akshaya KP is a software development engineer for Amazon EMR at Amazon Web Services.

Hari Kishore Chaparala is a software development engineer for Amazon EMR at Amazon Web Services.

Giovanni Matteo is the Senior Manager for the Amazon EMR Spark and Iceberg team.
