
Apache Spark encryption performance improvement with Amazon EMR 7.9


The Amazon EMR runtime for Apache Spark is a performance-optimized runtime that is 100% API compatible with open source Apache Spark. With Amazon EMR release 7.9.0, the EMR runtime for Apache Spark introduces significant performance improvements for encrypted workloads, supporting Spark version 3.5.5.

For compliance and security requirements, many customers need to enable Apache Spark's native storage encryption (spark.io.encryption.enabled = true) in addition to Amazon Simple Storage Service (Amazon S3) encryption (such as server-side encryption (SSE) or AWS Key Management Service (AWS KMS)). This feature encrypts shuffle files, cached data, and other intermediate data written to local disk during Spark operations, protecting sensitive data at rest on Amazon EMR cluster instances.
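If you want to see what enabling this looks like in practice, the following is a minimal sketch that sets the property cluster-wide at creation time through the spark-defaults classification; the cluster name, instance count, and placeholders are illustrative and not taken from the benchmark setup later in this post.

# Minimal sketch (illustrative values): enable Spark I/O encryption
# cluster-wide through the spark-defaults classification
aws emr create-cluster \
  --name "spark-io-encryption-example" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --configurations '[{"Classification":"spark-defaults","Properties":{"spark.io.encryption.enabled":"true"}}]' \
  --instance-type r5d.4xlarge \
  --instance-count 3 \
  --ec2-attributes SubnetId=<your-subnet-id> \
  --use-default-roles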

Industries subject to regulations such as the Health Insurance Portability and Accountability Act (HIPAA) for healthcare, the Payment Card Industry Data Security Standard (PCI-DSS) for financial services, the General Data Protection Regulation (GDPR) for personal data, and the Federal Risk and Authorization Management Program (FedRAMP) for government often require encryption of all data at rest, including temporary files on local storage. While Amazon S3 encryption protects data in object storage, Spark's I/O encryption secures the intermediate shuffle and spill data that Spark writes to local disk during distributed processing. That data never reaches Amazon S3, but it may contain sensitive information extracted from source datasets. Typically, encrypted operations carry additional computational overhead that can affect overall job performance.

With the built-in encryption optimizations of Amazon EMR 7.9.0, customers can see significant performance improvements in their Apache Spark applications without requiring any application changes. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we observed up to 20% faster performance with the EMR 7.9 optimized Spark runtime compared to Spark without these optimizations. Individual results may vary depending on specific workloads and configurations.

In this post, we analyze the results from our benchmark tests comparing the Amazon EMR 7.9 optimized Spark runtime against Spark 3.5.5 without encryption optimizations. We walk through a detailed cost analysis and provide step-by-step instructions to reproduce the benchmark.

Results observed

To evaluate the performance improvements, we used an open source Spark performance test utility derived from the TPC-DS performance test toolkit. We ran the tests on two nine-node (eight core nodes and one primary node) r5d.4xlarge Amazon EMR 7.9.0 clusters, comparing two configurations:

  • Baseline: EMR 7.9.0 cluster with a bootstrap action installing Spark 3.5.5 without encryption optimizations
  • Optimized: EMR 7.9.0 cluster using the EMR Spark 3.5.5 runtime with encryption optimizations

Both tests used data stored in Amazon Simple Storage Service (Amazon S3). All data processing was configured identically apart from the Spark runtime version.

To maintain benchmarking consistency and ensure an equal comparison, we disabled Dynamic Resource Allocation (DRA) in both test configurations. This approach eliminates variability from dynamic scaling so that we can measure pure computational performance improvements; a sketch of the relevant submit flags follows.
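The post's step commands don't show the DRA settings explicitly, but disabling DRA typically comes down to a single Spark property, with executors then pinned explicitly. The executor count below is illustrative, not the benchmark's exact sizing.

# Illustrative sketch: DRA off, executor count pinned explicitly
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=32 \
  --class com.amazonaws.eks.tpcds.BenchmarkSQL \
  <application-jar> <application-arguments>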

The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between the baseline and Amazon EMR 7.9 optimized configurations:

Configuration | Total runtime (seconds) | Geometric mean (seconds) | Performance improvement
Baseline (Spark 3.5.5 without optimization) | 1,485 | 10.24 | -
EMR 7.9 (with encryption optimization) | 1,176 | 8.15 | 20% faster

We observed that our TPC-DS tests with the Amazon EMR 7.9 optimized Spark runtime completed about 20% faster based on total runtime and 20% faster based on geometric mean compared to the baseline configuration.

The encryption optimizations in Amazon EMR 7.9 deliver performance benefits through:

  • Improved shuffle encryption and decryption operations, reducing overhead during data exchange without compromising security
  • Better memory management for intermediate results

Cost analysis

The performance improvements of the Amazon EMR 7.9 optimized Spark runtime translate directly into lower costs. We realized approximately 20% cost savings running the benchmark application with encryption optimizations compared to the baseline configuration, thanks to reduced hours of Amazon EMR, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Elastic Block Store (Amazon EBS) usage with General Purpose SSD (gp2) volumes.

The following table summarizes the cost comparison in the us-east-1 AWS Region:

Configuration | Runtime (hours) | Estimated cost | Total EC2 instances | Total vCPUs | Total memory (GiB) | Root device (EBS)
Baseline: Spark 3.5.5 without optimization, 1 primary and 8 core nodes | 0.41 | $5.28 | 9 | 144 | 1,152 | 64 GiB gp2
Amazon EMR 7.9 with optimization, 1 primary and 8 core nodes | 0.33 | $4.25 | 9 | 144 | 1,152 | 64 GiB gp2

Cost breakdown

Formulas used:

  • Amazon EMR cost – number of instances × EMR hourly rate × runtime hours
  • Amazon EC2 cost – number of instances × EC2 hourly rate × runtime hours
  • Amazon EBS cost – (EBS cost per GB-month ÷ hours in a month) × EBS volume size × number of instances × runtime hours

Note: EBS is priced monthly ($0.10 per GB-month), so we divide by 730 hours to convert to an hourly rate. EMR and EC2 are already priced hourly, so no conversion is needed.

Baseline configuration (0.41 hours):

  • Amazon EMR cost – 9 × $0.27 × 0.41 = $1.00
  • Amazon EC2 cost – 9 × $1.152 × 0.41 = $4.25
  • Amazon EBS cost – $0.10/730 × 64 × 9 × 0.41 = $0.032
  • Total cost – $5.28

EMR 7.9 optimized configuration (0.33 hours):

  • Amazon EMR cost – 9 × $0.27 × 0.33 = $0.80
  • Amazon EC2 cost – 9 × $1.152 × 0.33 = $3.42
  • Amazon EBS cost – $0.10/730 × 64 × 9 × 0.33 = $0.026
  • Total cost – $4.25

Total cost savings: 20% per benchmark run, which scales linearly with your production workload frequency.
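As a sanity check, the following sketch reproduces the arithmetic above using the post's us-east-1 rates; the cost function is ours, not part of the benchmark tooling.

#!/usr/bin/env bash
# Reproduce the cost breakdown: 9 x r5d.4xlarge with 64 GiB gp2 each
cost() {  # usage: cost <runtime-hours>
  awk -v h="$1" 'BEGIN {
    emr = 9 * 0.27  * h               # EMR hourly rate per instance
    ec2 = 9 * 1.152 * h               # EC2 On-Demand rate for r5d.4xlarge
    ebs = (0.10 / 730) * 64 * 9 * h   # gp2 GB-month rate converted to hourly
    printf "EMR=$%.2f EC2=$%.2f EBS=$%.3f total=$%.2f\n",
           emr, ec2, ebs, emr + ec2 + ebs
  }'
}
cost 0.41   # baseline:  total=$5.28
cost 0.33   # optimized: total=$4.25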

Set up EMR benchmarking

For detailed instructions and scripts, see the companion GitHub repository.

Prerequisites

To set up Amazon EMR benchmarking, start by completing the following prerequisite steps:

  1. Configure your AWS Command Line Interface (AWS CLI) by running aws configure to point to your benchmarking account.
  2. Create an S3 bucket for test data and results (see the sketch after this list).
  3. Copy the TPC-DS 3 TB source data from a publicly available dataset to your S3 bucket using the following command:
    aws s3 cp s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned s3://<your-bucket>/BLOG_TPCDS-TEST-3T-partitioned --recursive

    Replace <your-bucket> with the name of the S3 bucket you created in step 2.

  4. Build or download the benchmark application JAR file (spark-benchmark-assembly-3.3.0.jar).
  5. Ensure you have appropriate AWS Identity and Access Management (IAM) roles for EMR cluster creation and Amazon S3 access.
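For step 2, a one-line sketch (bucket name and Region are placeholders to adjust):

aws s3 mb s3://<your-bucket> --region us-east-1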

Deploy the baseline EMR cluster (without optimization)

Step 1: Launch an EMR 7.9.0 cluster with a bootstrap action

The baseline configuration uses a bootstrap action to install Spark 3.5.5 without encryption optimizations. We have made the bootstrap script publicly available in an S3 bucket for your convenience.

Create the default Amazon EMR roles:

aws emr create-default-roles

Now create the cluster:

aws emr create-cluster \
  --name "EMR-7.9-Baseline-Spark-3.5.5" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=<your-subnet-id>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --bootstrap-actions \
    Path=s3://spark-ba/install-spark-3-5-5-no-encryption.sh,Name="install spark 3.5.5 without encryption optimization" \
  --log-uri s3://<your-bucket>/logs/baseline/

Note: The bootstrap script is available in a public S3 bucket at s3://spark-ba/install-spark-3-5-5-no-encryption.sh. This script installs Apache Spark 3.5.5 without the encryption optimizations present in the Amazon EMR runtime.

Step 2: Submit the benchmark job to the baseline cluster

Next, submit the Spark job using the following command:

aws emr add-steps \
  --cluster-id <your-cluster-id> \
  --steps 'Type=Spark,Name="EMR-7.9-Baseline-Spark-3.5.5 Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=false","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3://<your-bucket>/jar/spark-benchmark-assembly-3.3.0.jar","s3://<your-bucket>/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://<your-bucket>/blog/BASELINE_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'
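The step takes a while to complete; one way to watch its state, using the cluster ID and the step ID returned by add-steps, is:

# Poll the step state (PENDING -> RUNNING -> COMPLETED)
aws emr describe-step \
  --cluster-id <your-cluster-id> \
  --step-id <your-step-id> \
  --query 'Step.Status.State'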

Deploy the optimized EMR cluster (with encryption optimization)

Step 1: Launch an EMR 7.9.0 cluster with the EMR Spark runtime

The optimized configuration uses the EMR 7.9.0 Spark runtime without any bootstrap actions:

aws emr create-cluster \
  --name "EMR-7.9-Optimized-Native-Spark" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=<your-subnet-id>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --log-uri s3://<your-bucket>/logs/optimized/

Example:

aws emr create-cluster \
  --name "EMR-7.9-Optimized-Native-Spark" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=subnet-08a5f71f92bc8a801 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --use-default-roles \
  --log-uri s3://aws-logs-123456789012-us-west-2/elasticmapreduce/

Step 2: Submit the benchmark job to the optimized cluster

Next, submit the Spark job using the following command:

aws emr add-steps \
  --cluster-id <your-cluster-id> \
  --steps 'Type=Spark,Name="EMR-7.9-Optimized-Native-Spark Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=true","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3://<your-bucket>/jar/spark-benchmark-assembly-3.3.0.jar","s3://<your-bucket>/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://<your-bucket>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

Benchmark command parameters explained

The Amazon EMR Spark step uses the following parameters:

  • EMR step configuration:
    • Type=Spark: Specifies that this is a Spark application step
    • Name="EMR-7.9-Baseline-Spark-3.5.5 Step": Human-readable name for the step
    • ActionOnFailure=CONTINUE: Continue with other steps if this one fails
  • Spark submit arguments:
    • --deploy-mode client: Run the driver on the primary node (not cluster mode)
    • --class com.amazonaws.eks.tpcds.BenchmarkSQL: Main class for the TPC-DS benchmark
  • Application parameters:
    • JAR file: s3://<your-bucket>/jar/spark-benchmark-assembly-3.3.0.jar
    • Input data: s3://<your-bucket>/blog/BLOG_TPCDS-TEST-3T-partitioned (3 TB TPC-DS dataset)
    • Output location: s3://<your-bucket>/blog/BASELINE_TPCDS-TEST-3T-RESULT (S3 path for results; the optimized run writes to OPTIMIZED_TPCDS-TEST-3T-RESULT)
    • TPC-DS tools path: /opt/tpcds-kit/tools (local path on EMR nodes)
    • Format: parquet (output format)
    • Scale factor: 3000 (3 TB dataset size)
    • Iterations: 3 (run each query 3 times for averaging)
    • Collect results: false (don't collect results to the driver)
    • Query list: "q1-v2.4,q10-v2.4,...,ss_max-v2.4" (all 104 TPC-DS queries)
    • Final parameter: true (enable detailed logging and metrics)
  • Query coverage:
    • All 103 standard TPC-DS v2.4 queries (q1-v2.4 through q99-v2.4, including the a/b variants)
    • Plus the ss_max-v2.4 query for additional testing, for 104 queries in total
    • Each query runs 3 times to calculate average performance
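Put together, the Args array corresponds to a spark-submit invocation along these lines; this is a sketch with the query list abbreviated (the full list is the one shown in the step commands):

spark-submit \
  --deploy-mode client \
  --conf spark.io.encryption.enabled=true \
  --class com.amazonaws.eks.tpcds.BenchmarkSQL \
  s3://<your-bucket>/jar/spark-benchmark-assembly-3.3.0.jar \
  s3://<your-bucket>/blog/BLOG_TPCDS-TEST-3T-partitioned \
  s3://<your-bucket>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT \
  /opt/tpcds-kit/tools parquet 3000 3 false "q1-v2.4,...,ss_max-v2.4" true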

Summarize the results

  1. Download the test result files from both output S3 locations:
    # Baseline results
    aws s3 cp s3://<your-bucket>/blog/BASELINE_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./baseline-results.csv

    # Optimized results
    aws s3 cp s3://<your-bucket>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./optimized-results.csv

  2. The CSV files contain four columns (without headers):
    • Query name
    • Median time (seconds)
    • Minimum time (seconds)
    • Maximum time (seconds)
  3. Calculate performance metrics for comparison:
    • Average time per query: AVERAGE(median, min, max) for each query
    • Total runtime: sum of all median times
    • Geometric mean: GEOMEAN(average times) across all queries
    • Speedup: calculate the ratio between baseline and optimized for each query
  4. Create the comparison analysis: Speedup = (Baseline Time - Optimized Time) / Baseline Time × 100%. A short script that computes these metrics follows.
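If you prefer to compute steps 3 and 4 from the command line, the following sketch assumes the four-column CSV layout described above and the file names from step 1; it reports total runtime from the median times and the geometric mean of the per-query averages.

#!/usr/bin/env bash
# Summarize a results CSV with rows of: query,median,min,max (no header)
summarize() {
  awk -F',' '
    { avg = ($2 + $3 + $4) / 3     # average of median/min/max per query
      total += $2                  # total runtime = sum of median times
      logsum += log(avg); n++ }    # accumulate logs for the geometric mean
    END { printf "total=%.0fs geomean=%.2fs\n", total, exp(logsum / n) }
  ' "$1"
}

summarize ./baseline-results.csv
summarize ./optimized-results.csv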

Testing configuration details

The following table summarizes the test environment used for this post:

Parameter | Value
EMR release | emr-7.9.0 (both configurations)
Baseline Spark version | 3.5.5 (installed through a bootstrap action)
Baseline bootstrap script | s3://spark-ba/install-spark-3-5-5-no-encryption.sh (public)
Optimized Spark version | Amazon EMR Spark runtime
Cluster size | 9 nodes (1 primary and 8 core)
Instance type | r5d.4xlarge
vCPUs per node | 16
Memory per node | 128 GB
Instance storage | 600 GB SSD
EBS volume | 64 GB gp2 (2 volumes per instance)
Total vCPUs | 144 (9 × 16)
Total memory | 1,152 GB (9 × 128)
Dataset | TPC-DS 3 TB (Parquet format)
Queries | 104 queries (TPC-DS v2.4)
Iterations | 3 runs per query
DRA | Disabled for consistent benchmarking

Clean up

To avoid incurring future costs, delete the resources you created:

  1. Terminate both EMR clusters:
    aws emr terminate-clusters --cluster-ids <baseline-cluster-id> <optimized-cluster-id>

  2. Delete the S3 test results if they are no longer needed:
    aws s3 rm s3://<your-bucket>/blog/BASELINE_TPCDS-TEST-3T-RESULT/ --recursive
    aws s3 rm s3://<your-bucket>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/ --recursive
    aws s3 rm s3://<your-bucket>/logs/ --recursive

  3. Remove any IAM roles created specifically for this testing.

Key findings

  • Up to 20% performance improvement using the Amazon EMR 7.9 Spark runtime, with no code changes required
  • 20% cost savings thanks to reduced runtime
  • Significant gains for shuffle-heavy, join-intensive workloads
  • 100% API compatibility with open source Apache Spark
  • Simple migration from custom Spark builds to the EMR runtime
  • Easy benchmarking using publicly available bootstrap scripts

Conclusion

You can run your Apache Spark workloads up to 20% faster and at lower cost, without making any changes to your applications, by using the Amazon EMR 7.9.0 optimized Spark runtime. This improvement is achieved through numerous optimizations in the EMR Spark runtime, including enhanced encryption handling, improved data serialization, and optimized shuffle operations.

To learn more about Amazon EMR 7.9 and best practices, see the EMR documentation. For configuration guidance and tuning advice, subscribe to the AWS Big Data Blog.


If you're running Spark workloads on Amazon EMR today, we encourage you to test the EMR 7.9 Spark runtime with your production workloads and measure the improvements specific to your use case.


About the authors

Sonu Kumar Singh

Sonu is a Senior Solutions Architect with more than 13 years of experience, specializing in the analytics and healthcare domains. He has been instrumental in catalyzing transformative shifts in organizations by enabling data-driven decision-making, fueling innovation and growth. He enjoys it when something he designed or created brings a positive impact.

Roshin Babu

Roshin is a Sr. Specialist Solutions Architect at AWS, where he collaborates with the sales team to support public sector customers. His role focuses on creating innovative solutions that solve complex business challenges while driving increased adoption of AWS analytics services. When he's not working, Roshin is passionate about exploring new places, discovering great food, and enjoying soccer both as a player and a fan.

Polaris Jhandi

Polaris is a Cloud Application Architect with AWS Professional Services. He has a background in AI/ML and big data. He is currently working with customers to migrate their legacy mainframe applications to the AWS Cloud.

Zheng Yuan

Zheng is a Software Engineer on the Amazon EMR Spark team, where he focuses on enhancing the performance of the Spark execution engine across various use cases.
