Analyze Amazon EMR on Amazon EC2 cluster utilization with Amazon Athena and Amazon QuickSight

October 28, 2024


Gaining granular visibility into application-level costs on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) clusters presents an opportunity for customers looking for ways to further optimize resource utilization and implement fair cost allocation and chargeback models. By breaking down the usage of individual applications running in your EMR cluster, you can unlock several benefits:

Informed workload management – Application-level cost insights empower organizations to prioritize and schedule workloads effectively. Resource allocation decisions can be made with a better understanding of cost implications, potentially improving overall cluster performance and cost-efficiency.
Cost optimization – With granular cost attribution, organizations can identify cost-saving opportunities for individual applications. They can right-size underutilized resources or prioritize optimization efforts for applications that are driving high usage and costs.
Transparent billing – In multi-tenant environments, organizations can implement fair and transparent cost allocation models based on individual application resource consumption and associated costs. This fosters accountability and enables accurate chargebacks to tenants.

In this post, we guide you through deploying a comprehensive solution in your Amazon Web Services (AWS) environment to analyze Amazon EMR on EC2 cluster usage. By using this solution, you’ll gain a deep understanding of the resource consumption and associated costs of individual applications running in your EMR cluster. This will help you optimize costs, implement fair billing practices, and make informed decisions about workload management, ultimately improving the overall efficiency and cost-effectiveness of your Amazon EMR environment. This solution has only been tested on Spark workloads running on EMR on EC2 that uses YARN as its resource manager. It hasn’t been tested on workloads from other frameworks that run on YARN, such as Hive or Tez.

Solution overview

The solution works by running a Python script on the EMR cluster’s primary node to collect metrics from the YARN resource manager and correlate them with cost and usage details from the AWS Cost and Usage Reports (AWS CUR). The script, activated by a cron job, makes HTTP requests to the YARN resource manager to collect two types of metrics: cluster metrics from the path /ws/v1/cluster/metrics and application metrics from the path /ws/v1/cluster/apps. The cluster metrics contain utilization information of cluster resources, and the application metrics contain utilization information of an application or job. These metrics are stored in an Amazon Simple Storage Service (Amazon S3) bucket.
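As a rough illustration of this collection step, the following Python sketch builds the two resource manager URLs and pulls the relevant objects out of the JSON responses. The helper names and the base URL are assumptions made for illustration; the actual script from the GitHub repository may be organized differently.

```python
import json
from urllib.request import urlopen

# REST paths named in this post.
CLUSTER_METRICS_PATH = "/ws/v1/cluster/metrics"
CLUSTER_APPS_PATH = "/ws/v1/cluster/apps"


def build_url(base_url, path):
    """Join the resource manager base URL and a REST path."""
    return base_url.rstrip("/") + path


def fetch_json(url, timeout=10):
    """GET a resource manager endpoint and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_cluster_metrics(payload):
    """Return the metrics object from a /cluster/metrics response."""
    return payload.get("clusterMetrics", {})


def extract_apps(payload):
    """Return the application list from a /cluster/apps response.

    The "apps" value can be null when no application has run yet,
    so fall back to an empty list in that case.
    """
    apps = payload.get("apps") or {}
    return apps.get("app", [])


# Example call on the primary node (assumed base URL, YARN default port):
#   apps = extract_apps(fetch_json(build_url("http://localhost:8088", CLUSTER_APPS_PATH)))
```

Keeping the parsing separate from the HTTP call makes the extraction logic easy to check against saved responses without a running cluster.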

There are two YARN metrics that capture the resource utilization information of an application or job:

memorySeconds – The memory (in MB) allocated to an application multiplied by the number of seconds the application ran
vcoreSeconds – The number of YARN vcores allocated to an application multiplied by the number of seconds the application ran

The solution uses memorySeconds to derive the cost of running the application or job. It can be modified to use vcoreSeconds instead if necessary.
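To make the two metrics concrete, here is a small worked example in Python. The container sizes and runtimes are hypothetical; YARN reports the aggregated values per application, so the arithmetic below only illustrates how such totals come about.

```python
def memory_seconds(allocated_mb, runtime_seconds):
    """Memory allocated (MB) multiplied by the seconds it was held."""
    return allocated_mb * runtime_seconds


def vcore_seconds(allocated_vcores, runtime_seconds):
    """vcores allocated multiplied by the seconds they were held."""
    return allocated_vcores * runtime_seconds


# Hypothetical job: 3 containers, each with 4096 MB and 2 vcores, for 600 s.
containers = [(4096, 2, 600)] * 3

total_memory_seconds = sum(memory_seconds(mb, s) for mb, _, s in containers)
total_vcore_seconds = sum(vcore_seconds(v, s) for _, v, s in containers)

print(total_memory_seconds)  # 7372800
print(total_vcore_seconds)   # 3600
```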

The metadata of the YARN metrics collected in Amazon S3 is created, stored, and represented as a database and tables in the AWS Glue Data Catalog, which is in turn available to Amazon Athena for further processing. You can then write SQL queries in Athena to correlate the YARN metrics with the cost and usage information from AWS CUR to derive the detailed cost breakdown of your EMR cluster by infrastructure and application. This solution creates two corresponding Athena views of the respective cost breakdowns, which become the data sources for Amazon QuickSight for visualization.

The following diagram shows the solution architecture.

EMR Cluster Usage Utility Solution Architecture

Prerequisites

To implement the solution, you need the following prerequisites:

Confirm that a CUR is created in your AWS account. It needs an S3 bucket to store the report files. Follow the steps described in Creating Cost and Usage Reports to create the CUR on the AWS Management Console. When creating the report, make sure the following settings are enabled:

- Include resource IDs
- Time granularity is set to hourly
- Report data integration to Athena

It can take up to 24 hours for AWS to start delivering reports to your S3 bucket. Thereafter, your CUR gets updated at least one time a day.

The solution needs Athena to run queries against the data from the CUR using standard SQL. To automate and streamline the integration of Athena with CUR, AWS provides an AWS CloudFormation template, crawler-cfn.yml, which is automatically generated in the same S3 bucket during CUR creation. Follow the instructions in Setting up Athena using AWS CloudFormation templates to integrate Athena with the CUR. This template creates an AWS Glue database that references the CUR, an AWS Lambda event, and an AWS Glue crawler that gets invoked by S3 event notification to update the AWS Glue database whenever the CUR gets updated.

Make sure to activate the AWS generated cost allocation tag, aws:elasticmapreduce:job-flow-id. This enables the field, resource_tags_aws_elasticmapreduce_job_flow_id, in the CUR to be populated with the EMR cluster ID, and it is used by the SQL queries in the solution. To activate the cost allocation tag from the management console, follow these steps:
- Sign in to the payer account’s AWS Management Console and open the AWS Billing and Cost Management console
- In the navigation pane, choose Cost Allocation Tags
- Under AWS generated cost allocation tags, choose the aws:elasticmapreduce:job-flow-id tag
- Choose Activate. It can take up to 24 hours for tags to activate.

The following screenshot shows an example of the aws:elasticmapreduce:job-flow-id tag being activated.

CostAllocationTag
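The relationship between the tag key and the CUR field name mentioned earlier can be sketched in a few lines of Python. The normalization rule here (lowercase, non-alphanumeric characters replaced by underscores, prefixed with resource_tags_) is an assumption inferred from the single example in this post, and the function name is hypothetical; other tag keys may be mapped differently.

```python
import re


def cur_tag_column(tag_key):
    """Derive an Athena-style column name from a cost allocation tag key.

    Assumed rule, checked only against the tag used in this post.
    """
    normalized = re.sub(r"[^a-z0-9]+", "_", tag_key.lower()).strip("_")
    return "resource_tags_" + normalized


print(cur_tag_column("aws:elasticmapreduce:job-flow-id"))
# resource_tags_aws_elasticmapreduce_job_flow_id
```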

You can now test this solution on an EMR cluster in a lab environment. If you’re not already familiar with EMR, follow the detailed instructions provided in Tutorial: Getting started with Amazon EMR to launch a new EMR cluster and run a sample Spark job.

Deploying the solution

To deploy the solution, follow the steps in the next sections.

Installing scripts to the EMR cluster

Download two scripts from the GitHub repository and save them into an S3 bucket:

emr_usage_report.py – Python script that makes the HTTP requests to the YARN resource manager
emr_install_report.sh – Bash script that creates a cron job to run the Python script every minute

To install the scripts, add a step to the EMR cluster through the console or the AWS Command Line Interface (AWS CLI) using the aws emr add-steps command.

Replace:

REGION with the AWS Region where the cluster is running (for example, Europe (Ireland) eu-west-1)
MY-BUCKET with the name of the bucket where the script is stored (for example, my.artifact.bucket)
MY_REPORT_BUCKET with the name of the bucket where you want to collect YARN metrics (for example, my.report.bucket)

aws emr add-steps \
--cluster-id j-XXXXXXXXXXXXX \
--steps Type=CUSTOM_JAR,Name="Install YARN reporter",Jar=s3://REGION.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://MY-BUCKET/emr-install_reporter.sh,s3://MY-BUCKET/emr_usage_reporter.py,MY_REPORT_BUCKET]
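If you prefer to generate the --steps value rather than edit it by hand, the substitution can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes both scripts sit at the root of the artifact bucket; the example values are the ones suggested in the placeholder list.

```python
def build_steps_argument(region, script_bucket, report_bucket,
                         install_script, reporter_script,
                         step_name="Install YARN reporter"):
    """Render the shorthand value passed to `aws emr add-steps --steps`."""
    jar = f"s3://{region}.elasticmapreduce/libs/script-runner/script-runner.jar"
    args = ",".join([
        f"s3://{script_bucket}/{install_script}",   # Bash installer
        f"s3://{script_bucket}/{reporter_script}",  # Python collector
        report_bucket,                              # destination for YARN metrics
    ])
    return f'Type=CUSTOM_JAR,Name="{step_name}",Jar={jar},Args=[{args}]'


steps = build_steps_argument(
    region="eu-west-1",
    script_bucket="my.artifact.bucket",
    report_bucket="my.report.bucket",
    install_script="emr-install_reporter.sh",
    reporter_script="emr_usage_reporter.py",
)
print(steps)
```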

You can now run some Spark jobs on your EMR cluster to start generating application usage metrics.

Launching the CloudFormation stack

When the prerequisites are met and you have the scripts deployed so that your EMR clusters are sending YARN metrics to an S3 bucket, the rest of the solution can be deployed using CloudFormation.

Before launching the stack, upload a copy of this QuickSight definition file into an S3 bucket. The CloudFormation template requires it to build the initial analysis in QuickSight. When ready, proceed to launch your stack to provision the remaining resources of the solution.

Choose Launch Stack.

This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. Make sure you create the stack in your intended Region.

The CloudFormation stack requires a few parameters, as shown in the following screenshot.

CloudFormationStack

The following table describes the parameters.

Parameter	Description
Stack name	A meaningful name for the stack; for example, `EMRUsageReport`
S3 configuration
`YARNS3BucketName`	Name of the S3 bucket where YARN metrics are stored
Cost Usage Report configuration
`CURDatabaseName`	Name of the Cost Usage Report database in AWS Glue
`CURTableName`	Name of the Cost Usage Report table in AWS Glue
AWS Glue Database configuration
`EMRUsageDBName`	Name of the AWS Glue database to be created for the EMR Cost Usage Report
`EMRInfraTableName`	Name of the AWS Glue table to be created for infrastructure usage metrics
`EMRAppTableName`	Name of the AWS Glue table to be created for application usage metrics
QuickSight configuration
`QSUserName`	Name of the QuickSight user in the default namespace to manage the EMR Usage Report resources in QuickSight
`QSDefinitionsFile`	S3 URI of the definition JSON file for the EMR Usage Report

Enter the parameter values from the preceding table.
Choose Next.
On the next screen, enter any necessary tags, an AWS Identity and Access Management (IAM) role, stack failure options, or advanced options if necessary. Otherwise, you can leave them as default.
Choose Next.
Review the details on the final screen and select the check boxes confirming AWS CloudFormation might create IAM resources with custom names or require CAPABILITY_AUTO_EXPAND.
Choose Create.

The stack takes a few minutes to create the remaining resources for the solution. After the CloudFormation stack is created, you can find the details of the resources created on the Outputs tab.
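For scripted deployments, the template parameters described earlier in this section could be assembled into the ParameterKey/ParameterValue list form that the CloudFormation CreateStack API accepts. The following Python sketch only builds and validates that list; the helper name is hypothetical and every value shown is a placeholder, not a default of the template.

```python
# Parameter keys as listed in the parameter table of this section.
REQUIRED_KEYS = (
    "YARNS3BucketName", "CURDatabaseName", "CURTableName",
    "EMRUsageDBName", "EMRInfraTableName", "EMRAppTableName",
    "QSUserName", "QSDefinitionsFile",
)


def to_cfn_parameters(values):
    """Validate values and convert them to ParameterKey/ParameterValue pairs."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise ValueError("missing parameters: " + ", ".join(missing))
    if not values["QSDefinitionsFile"].startswith("s3://"):
        raise ValueError("QSDefinitionsFile must be an S3 URI")
    return [{"ParameterKey": key, "ParameterValue": values[key]}
            for key in REQUIRED_KEYS]


# All values below are hypothetical placeholders.
example_values = {
    "YARNS3BucketName": "my.report.bucket",
    "CURDatabaseName": "my_cur_database",
    "CURTableName": "my_cur_table",
    "EMRUsageDBName": "emr_usage_report",
    "EMRInfraTableName": "emr_infra_usage",
    "EMRAppTableName": "emr_app_usage",
    "QSUserName": "my-quicksight-user",
    "QSDefinitionsFile": "s3://my.artifact.bucket/definitions.json",
}

parameters = to_cfn_parameters(example_values)
print(parameters[0])
```

The stack name is not part of this list because it is supplied separately from the template parameters.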

Reviewing the correlation results

The CloudFormation template creates two Athena views containing the correlated cost breakdown details of the YARN cluster and application metrics with the CUR. The CUR aggregates cost hourly, so the cost of running an application is prorated based on the hourly running cost of the EMR cluster.

The following screenshot shows the Athena view for the correlated cost breakdown details of YARN cluster metrics.

CorrelationResults

The following table describes the fields in the Athena view for YARN cluster metrics.

Field	Type	Description
`cluster_id`	string	ID of the cluster.
`family`	string	Resource type of the cluster. Possible values are compute instance, elastic map reduce instance, storage, and data transfer.
`billing_start`	timestamp	Start billing hour of the resource.
`usage_type`	string	A specific type or unit of the resource, such as BoxUsage:m5.xlarge for a compute instance.
`cost`	string	Cost associated with the resource.

The following screenshot shows the Athena view for the correlated cost breakdown details of YARN application metrics.

CostBreakdownYARNAppMetrics

The following table describes the fields in the Athena view for YARN application metrics.

Field	Type	Description
`cluster_id`	string	ID of the cluster
`id`	string	Unique identifier of the application run
`user`	string	User name
`name`	string	Name of the application
`queue`	string	Queue name from the YARN resource manager
`finalstatus`	string	Final status of the application
`applicationtype`	string	Type of the application
`startedtime`	timestamp	Start time of the application
`finishedtime`	timestamp	End time of the application
`elapsed_sec`	double	Time taken to run the application, in seconds
`memoryseconds`	bigint	The memory (in MB) allocated to an application multiplied by the number of seconds the application ran
`vcoreseconds`	int	The number of YARN vcores allocated to an application multiplied by the number of seconds the application ran
`total_memory_mb_avg`	double	Total amount of memory (in MB) available to the cluster in the hour
`memory_sec_cost`	double	Derived unit cost of memoryseconds
`application_cost`	double	Derived cost associated with the application based on memoryseconds
`total_cost`	double	Total cost of resources associated with the cluster for the hour
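One plausible way to read the derived fields is that the hourly cluster cost is spread over the memory available during that hour, and each application is charged for the memory-seconds it consumed. The Python sketch below works through that interpretation. The formulas are an assumption based on the field descriptions, not the view's actual SQL, and the numbers are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def memory_sec_cost(total_cost, total_memory_mb_avg):
    """Assumed unit cost of one MB-second for a given billing hour."""
    if total_memory_mb_avg <= 0:
        raise ValueError("total_memory_mb_avg must be positive")
    return total_cost / (total_memory_mb_avg * SECONDS_PER_HOUR)


def application_cost(memoryseconds, total_cost, total_memory_mb_avg):
    """Assumed prorated cost of an application within that hour."""
    return memoryseconds * memory_sec_cost(total_cost, total_memory_mb_avg)


# Hypothetical hour: the cluster cost 3.60 and had 102400 MB available.
# An application that used 7372800 MB-seconds consumed 2% of the
# 368640000 MB-seconds available, so it is charged 2% of the hourly cost.
cost = application_cost(7372800, total_cost=3.60, total_memory_mb_avg=102400)
print(round(cost, 4))  # 0.072
```

Under this reading, memory that no application used during the hour is not charged to any application, so the application costs for an hour can add up to less than the total cost.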

Building your own visualization

In QuickSight, the CloudFormation template creates two datasets that reference the Athena views as data sources, and a sample analysis. The sample analysis has two sheets, EMR Infra Spend and EMR App Spend. They have a prepopulated bar chart and pivot tables to demonstrate how you can use the datasets to build your own visualization to present the cost breakdown details of your EMR clusters.

The EMR Infra Spend sheet references the YARN cluster metrics dataset. There is a filter for date range selection and a filter for cluster ID selection. The sample bar chart shows the consolidated cost breakdown of the resources for each cluster during the period. The pivot table breaks them down further to show their daily expenditure.
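The two aggregations behind this sheet, a total per cluster and resource type and a daily total per cluster, can be reproduced outside QuickSight with a few lines of Python. The row class, its attribute names, and the sample rows are illustrative assumptions; in practice the rows would come from the Athena view for YARN cluster metrics.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class InfraRow(NamedTuple):
    """Illustrative stand-in for one row of the cluster metrics view."""
    cluster_id: str
    resource_type: str       # for example, compute instance or storage
    billing_start: datetime  # start of the billing hour
    cost: str                # the view exposes cost as a string


def total_by_cluster_and_resource(rows):
    """Consolidated cost per (cluster, resource type), as in the bar chart."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row.cluster_id, row.resource_type)] += float(row.cost)
    return dict(totals)


def daily_total_by_cluster(rows):
    """Cost per (cluster, calendar day), as in the pivot table."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row.cluster_id, row.billing_start.date())] += float(row.cost)
    return dict(totals)


# Hypothetical sample rows.
rows = [
    InfraRow("j-EXAMPLE1", "compute instance", datetime(2024, 10, 1, 0), "0.5"),
    InfraRow("j-EXAMPLE1", "compute instance", datetime(2024, 10, 1, 1), "0.25"),
    InfraRow("j-EXAMPLE1", "storage", datetime(2024, 10, 2, 0), "1.0"),
]

print(total_by_cluster_and_resource(rows))
print(daily_total_by_cluster(rows))
```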

The following screenshot shows the EMR Infra Spend sheet from the sample analysis created by the CloudFormation template.

The EMR App Spend sheet references the YARN application metrics dataset. There is a filter for date range selection and a filter for cluster ID selection. The pivot table in this sheet shows how you can use the fields in the dataset to present the cost breakdown details of the cluster by user, so you can examine the applications that were run, whether they completed successfully, the time and duration of each run, and the derived cost of the run.

The following screenshot shows the EMR App Spend sheet from the sample analysis created by the CloudFormation template.

Cleanup

If you no longer need the resources you created during this walkthrough, delete them to prevent incurring additional charges. To clean up your resources, complete the following steps:

On the CloudFormation console, delete the stack that you created using the template
Terminate the EMR cluster
Empty or delete the S3 bucket used for YARN metrics

Conclusion

In this post, we discussed how to implement a comprehensive cluster usage reporting solution that provides granular visibility into the resource consumption and associated costs of individual applications running in your Amazon EMR on EC2 cluster. By using Athena and QuickSight to correlate YARN metrics with cost and usage details from your Cost and Usage Report, this solution empowers organizations to make informed decisions. With these insights, you can optimize resource allocation, implement fair and transparent billing models based on actual application usage, and ultimately achieve greater cost-efficiency in your EMR environments. This solution will help you unlock the full potential of your EMR cluster, driving continuous improvement in your data processing and analytics workflows while maximizing return on investment.

About the authors

Boon Lee Eu is a Senior Technical Account Manager at Amazon Web Services (AWS). He works closely and proactively with Enterprise Support customers to provide advocacy and strategic technical guidance to help plan and achieve operational excellence in their AWS environment based on best practices. Based in Singapore, Boon Lee has over 20 years of experience in the IT and telecom industries.

Kyara Labrador is a Sr. Analytics Specialist Solutions Architect at Amazon Web Services (AWS) Philippines, specializing in big data and analytics. She helps customers design and implement scalable, secure, and cost-effective data solutions, as well as migrate and modernize their big data and analytics workloads to AWS. She is passionate about empowering organizations to unlock the full potential of their data.

Vikas Omer is the Head of Data & AI Solution Architecture for ASEAN at Amazon Web Services (AWS). With over 15 years of experience in the data and AI space, he is a seasoned leader who leverages his expertise to drive innovation and expansion in the region. Vikas is passionate about helping customers and partners succeed in their digital transformation journeys, focusing on cloud-based solutions and emerging technologies.

Lorenzo Ripani is a Big Data Solution Architect at AWS. He is passionate about distributed systems, open source technologies, and security. He spends most of his time working with customers around the world to design, evaluate, and optimize scalable and secure data pipelines with Amazon EMR.
