Friday, October 24, 2025

Optimize Amazon EMR runtime for Apache Spark with EMR S3A


With the Amazon EMR 7.10 runtime, Amazon EMR has introduced EMR S3A, an improved implementation of the open source S3A file system connector. This enhanced connector is now automatically set as the default S3 file system connector for Amazon EMR deployment options, including Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, and Amazon EMR on AWS Outposts, while maintaining full API compatibility with open source Apache Spark.

In the Amazon EMR 7.10 runtime for Apache Spark, the EMR S3A connector shows performance comparable to EMRFS for read workloads, as demonstrated by the TPC-DS query benchmark. The connector’s most significant performance gains are evident in write operations, with a 7% improvement in static partition overwrites and a 215% improvement for dynamic partition overwrites when compared to EMRFS. In this post, we showcase the enhanced read and write performance advantages of using the Amazon EMR 7.10.0 runtime for Apache Spark with EMR S3A as compared to EMRFS and the open source S3A file system connector.

Read workload performance comparison

To evaluate the read performance, we used a test environment based on Amazon EMR runtime version 7.10.0 running Spark 3.5.5 and Hadoop 3.4.1. Our testing infrastructure featured an Amazon Elastic Compute Cloud (Amazon EC2) cluster comprised of nine r5d.4xlarge instances. The primary node has 16 vCPU and 128 GB memory, and the eight core nodes have a total of 128 vCPU and 1024 GB memory.

The performance evaluation was conducted using a comprehensive testing methodology designed to provide accurate and meaningful results. For the source data, we chose the 3 TB scale factor, which contains 17.7 billion records, approximately 924 GB of compressed data partitioned in Parquet file format. The setup instructions and technical details can be found in the GitHub repository. We used Spark’s in-memory data catalog to store metadata for TPC-DS databases and tables.

To provide a fair and accurate comparison between EMR S3A, EMRFS, and the open source S3A implementation, we implemented a three-phase testing approach:

  • Phase 1: Baseline performance:
    • Established a baseline using the default Amazon EMR configuration with EMR’s S3A connector
    • Created a reference point for subsequent comparisons
  • Phase 2: EMRFS analysis:
    • Maintained the default file system as EMRFS
    • Preserved other configuration settings
  • Phase 3: Open source S3A testing:
    • Modified only the hadoop-aws.jar file by replacing it with the open source Hadoop S3A 3.4.1 version
    • Maintained identical configurations across other components

This controlled testing environment was crucial for our evaluation for the following reasons:

  • We could isolate the performance impact specifically to the S3A connector implementation
  • It removed potential variables that could skew the results
  • It provided accurate measurements of performance improvements between Amazon’s S3A implementation and the open source alternative

Test execution and results

Throughout the testing process, we maintained consistency in test conditions and configurations, making sure any observed performance differences could be directly attributed to the S3A connector implementation differences. A total of 104 SparkSQL queries were run sequentially in 10 iterations, and the average of each query’s runtime across those 10 iterations was used for comparison. The average of the 10 iterations’ runtime on the Amazon EMR 7.10 runtime for Apache Spark with EMR S3A was 1116.87 seconds, which is 1.08 times faster than open source S3A and comparable with EMRFS. The following figure illustrates the total runtime in seconds.

The following table summarizes the metrics.

Metric | OSS S3A | EMRFS | EMR S3A
Average runtime in seconds | 1208.26 | 1129.64 | 1116.87
Geometric mean over queries in seconds | 7.63 | 7.09 | 6.99
Total cost * | $6.53 | $6.40 | $6.15

*Detailed cost estimates are discussed later in this post.
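As a rough illustration of how per-query timings can be rolled up into the two runtime metrics above, the following Python sketch averages each query over its iterations, sums those averages into a total runtime, and takes the geometric mean over queries. The function name `summarize` and the sample timings are hypothetical, and the exact aggregation used by the benchmark is an assumption; the benchmark's own tooling may differ.

```python
from statistics import geometric_mean, mean


def summarize(per_query_runtimes):
    """Aggregate benchmark timings.

    per_query_runtimes maps a query name to a list of runtimes in seconds,
    one entry per iteration (hypothetical input shape).
    Returns (total runtime of the per-query averages, geometric mean over queries).
    """
    # Average each query over its iterations.
    per_query_avg = {q: mean(runs) for q, runs in per_query_runtimes.items()}
    # Total runtime: sum of the per-query averages.
    total = sum(per_query_avg.values())
    # Geometric mean over queries, which dampens the effect of a few long queries.
    geomean = geometric_mean(list(per_query_avg.values()))
    return total, geomean


# Made-up timings for two queries over two iterations (not benchmark data).
sample = {"q1": [1.0, 3.0], "q2": [7.0, 9.0]}
total_seconds, geomean_seconds = summarize(sample)
print(total_seconds, geomean_seconds)
```

The geometric mean is reported alongside the total because a total is dominated by the longest queries, while the geometric mean weights every query's relative change equally.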

The following chart demonstrates the per-query performance improvement of EMR S3A relative to open source S3A on the Amazon EMR 7.10 runtime for Apache Spark. The extent of the speedup varies from one query to another, with EMR S3A outperforming open source S3A by up to 1.51 times for q3. The horizontal axis arranges the TPC-DS 3 TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.

Read cost comparison

Our benchmark outputs the total runtime and geometric mean figures to measure the Spark runtime performance. The cost metric can provide us with additional insights. Cost estimates are computed using the following formulas. They consider Amazon EC2, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR costs, but don’t include Amazon Simple Storage Service (Amazon S3) GET and PUT costs.

  • Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
    • r5d.4xlarge hourly rate = $1.152 per hour
  • Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
  • Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR price * job runtime in hours
    • r5d.4xlarge Amazon EMR price = $0.27 per hour
  • Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost
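The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, and rounding each component to whole cents before summing is an assumption. The rates ($1.152 and $0.27 per hour), the nine instances, and the 0 GB root volume come from this post; the EBS per GB-hour rate is not stated, so a placeholder of 0 is passed, which has no effect while the volume size is 0.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount):
    """Round a Decimal dollar amount to whole cents (half-up; an assumption)."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def estimate_cost(instances, runtime_hours, ec2_rate, emr_rate,
                  ebs_rate_per_gb_hour, root_ebs_gb):
    """Apply the three cost formulas and return each component plus the total."""
    n = Decimal(instances)
    hours = Decimal(runtime_hours)
    ec2 = to_cents(n * Decimal(ec2_rate) * hours)
    ebs = to_cents(n * Decimal(ebs_rate_per_gb_hour) * Decimal(root_ebs_gb) * hours)
    emr = to_cents(n * Decimal(emr_rate) * hours)
    return {"ec2": ec2, "ebs": ebs, "emr": emr, "total": ec2 + ebs + emr}


# Nine r5d.4xlarge instances for a 0.5 hour job; EBS rate is a placeholder.
costs = estimate_cost(9, "0.5", "1.152", "0.27", "0", 0)
print(costs["ec2"], costs["emr"], costs["total"])  # prints: 5.18 1.22 6.40
```

Decimal arithmetic is used here so that amounts landing exactly on a half cent, such as 9 * 0.27 * 0.5 = 1.215, round predictably.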

The following table summarizes these costs.

Metric | EMRFS | EMR S3A | OSS S3A
Runtime in hours | 0.5 | 0.48 | 0.51
Number of EC2 instances | 9 | 9 | 9
Amazon EBS size | 0 GB | 0 GB | 0 GB
Amazon EC2 cost | $5.18 | $4.98 | $5.29
Amazon EBS cost | $0.00 | $0.00 | $0.00
Amazon EMR cost | $1.22 | $1.17 | $1.24
Total cost | $6.40 | $6.15 | $6.53
Cost savings | Baseline | EMR S3A is 1.04 times better than EMRFS | EMR S3A is 1.06 times better than OSS S3A

Write workload performance comparison

We conducted benchmark tests to assess the write performance of the Amazon EMR 7.10 runtime for Apache Spark.

Static table/partition overwrite

We evaluated the static table/partition overwrite write performance of the different file systems by executing the following INSERT OVERWRITE Spark SQL query. The SELECT * FROM range(...) clause generated data at execution time. This produced approximately 15 GB of data across exactly 100 Parquet files in Amazon S3.

SET rows=4e9; -- 4 Billion
SET partitions=100;
INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}'
USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});

The test environment was configured as follows:

  • EMR cluster with emr-7.10.0 release label
  • Single m5d.2xlarge instance (primary group)
  • Eight m5d.2xlarge instances (core group)
  • S3 bucket in the same AWS Region as the EMR cluster
  • The trial_id property used a UUID generator to avoid conflict between test runs
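One way to give every trial its own output prefix is to generate the trial_id on the client and fill it into the statement text before submitting it. The Python sketch below is only an illustration under that assumption: the helper `render_trial` and the bucket name are hypothetical, the row count is written as an integer rather than 4e9, and the benchmark itself may have set these properties differently.

```python
import uuid
from string import Template

# Statement text with the same ${...} placeholders as the query above.
STATEMENT = Template(
    "INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}'\n"
    "USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});"
)


def render_trial(bucket, rows=4_000_000_000, partitions=100):
    """Return (trial_id, SQL text) with a fresh random UUID per call."""
    trial_id = str(uuid.uuid4())
    sql = STATEMENT.substitute(
        bucket=bucket, trial_id=trial_id, rows=rows, partitions=partitions
    )
    return trial_id, sql


# "example-bucket" is a placeholder bucket name, not one used in the benchmark.
trial_id, sql = render_trial("example-bucket")
print(sql)
```

Because each call produces a new UUID, repeated trials write to distinct S3 prefixes and never overwrite one another's output.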

Results

After running 10 trials for each file system, we captured and summarized query runtimes in the following chart. EMR S3A averaged only 26.4 seconds, whereas EMRFS and open source S3A averaged 28.4 seconds and 31.4 seconds, a 1.07 times and 1.19 times improvement for EMR S3A, respectively.

Dynamic partition overwrite

We also evaluated the write performance by executing the following INSERT OVERWRITE dynamic partition Spark SQL query, which joins the TPC-DS 3 TB partitioned Parquet tables web_sales and date_dim. The query inserts approximately 2,100 partitions, where each partition contains one Parquet file, with a combined size of approximately 31.2 GB in Amazon S3.

SET spark.sql.sources.partitionOverwriteMode=DYNAMIC;
INSERT OVERWRITE TABLE  PARTITION(wsdt_year,wsdt_month, wsdt_day) 
SELECT ws_order_number,ws_quantity,ws_list_price,ws_sales_price,
ws_net_paid_inc_ship_tax,ws_net_profit,dt.d_year as wsdt_year,dt.d_moy 
as wsdt_month,dt.d_dom as wsdt_day FROM web_sales, date_dim dt 
WHERE ws_sold_date_sk = d_date_sk;

The test environment was configured as follows:

  • EMR cluster with emr-7.10.0 release label
  • Single r5d.4xlarge instance (master group)
  • Five r5d.4xlarge instances (core group)
  • Approximately 2,100 partitions with one Parquet file each
  • Combined size of approximately 31.2 GB in Amazon S3

Results

After running 10 trials for each file system, we captured and summarized query runtimes in the following chart. EMR S3A averaged only 90.9 seconds, whereas EMRFS and open source S3A averaged 286.4 seconds and 1,438.5 seconds, a 3.15 times and 15.82 times improvement for EMR S3A, respectively.
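The "times faster" ratios and the percentage improvements quoted in this post are two views of the same numbers. The short Python sketch below, with hypothetical helper names, derives both from the averaged runtimes reported above. The last digit of a recomputed ratio can differ from a published figure, because the published figures may have been derived from unrounded averages.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def percent_improvement(ratio):
    """Convert a speedup ratio into a percentage improvement."""
    return (ratio - 1.0) * 100.0


# Averaged dynamic partition overwrite runtimes reported above, in seconds.
emr_s3a, emrfs, oss_s3a = 90.9, 286.4, 1438.5

emrfs_ratio = speedup(emrfs, emr_s3a)
print(round(emrfs_ratio, 2))                    # prints: 3.15
print(round(percent_improvement(emrfs_ratio)))  # prints: 215
print(speedup(oss_s3a, emr_s3a))
```

A 3.15 times speedup over EMRFS therefore corresponds to the 215% improvement for dynamic partition overwrites cited at the start of the post.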

Summary

Amazon EMR consistently enhances its Apache Spark runtime and S3A connector, delivering continuous performance improvements that help big data customers run analytics workloads more cost-effectively. Beyond performance gains, the strategic shift to S3A introduces important advantages, including enhanced standardization, improved cross-platform portability, and robust community-driven support, all while maintaining or surpassing the performance benchmarks established by the previous EMRFS implementation.

We recommend that you stay up to date with the latest Amazon EMR release to take advantage of the latest performance and feature benefits. Subscribe to the AWS Big Data Blog’s RSS feed to learn more about the Amazon EMR runtime for Apache Spark, configuration best practices, and tuning advice.


About the authors

Giovanni Matteo Fumarola

Giovanni is the Senior Manager for the Amazon EMR Spark and Iceberg group. He is an Apache Hadoop Committer and PMC member. He has been focusing on the big data analytics space since 2013.

Sushil Kumar Shivashankar

Sushil is the Engineering Manager for the Amazon EMR Hadoop and Flink team at Amazon Web Services. With a focus on big data analytics since 2014, he leads development, optimizations, and growth strategies for the Hadoop and Flink business in Amazon EMR.

Narayanan Venkateswaran

Narayanan is a Senior Software Development Engineer in the Amazon EMR group. He works on developing Hadoop components in Amazon EMR. He has over 20 years of work experience in the industry across multiple companies, including Sun Microsystems, Microsoft, Amazon, and Oracle. Narayanan also holds a PhD in databases with a focus on horizontal scalability in relational stores.

Syed Shameerur Rahman

Syed is a Software Development Engineer at Amazon EMR. He is interested in highly scalable, distributed computing. He is an active contributor to open source projects like Apache Hive, Apache Tez, Apache ORC, and Apache Hadoop, and has contributed important features and optimizations. During his free time, he enjoys exploring new places and trying new foods.

Rajarshi Sarkar

Rajarshi is a Software Development Engineer at Amazon EMR. He works on cutting-edge features of Amazon EMR and is also involved in open source projects such as Apache Hive, Iceberg, Trino, and Hadoop. In his spare time, he likes to travel, watch movies, and hang out with friends.
