Introducing the Apache Spark troubleshooting agent for Amazon EMR and AWS Glue

December 16, 2025


The newly launched Apache Spark troubleshooting agent can eliminate hours of manual investigation for data engineers and scientists working with Amazon EMR or AWS Glue. Instead of navigating multiple consoles, sifting through extensive log files, and manually analyzing performance metrics, you can now diagnose Spark failures using simple natural language prompts. The agent automatically analyzes your workloads and delivers actionable recommendations, transforming a time-consuming troubleshooting process into a streamlined, efficient experience.

In this post, we show you how the Apache Spark troubleshooting agent helps analyze Apache Spark issues by providing detailed root causes and actionable recommendations. You'll learn how to streamline your troubleshooting workflow by integrating this agent with your existing monitoring solutions across Amazon EMR and AWS Glue.

Apache Spark powers critical ETL pipelines, real-time analytics, and machine learning workloads across thousands of organizations. However, building and maintaining Spark applications remains an iterative process where developers spend significant time troubleshooting. Spark application developers encounter operational challenges for a few different reasons:

Complex connectivity and configuration options to a variety of resources with Spark – Although this makes Spark a popular data processing platform, it often makes it challenging to find the root cause of inefficiencies or failures when Spark settings aren't configured optimally or correctly.
Spark's in-memory processing model and distributed partitioning of datasets across its workers – Although good for parallelism, this often makes it difficult for users to identify the inefficiencies that slow application execution, or the root cause of failures due to resource exhaustion, such as out-of-memory and disk exceptions.
Lazy evaluation of Spark transformations – Although lazy evaluation optimizes performance, it makes it challenging to accurately and quickly identify the application code and logic that caused the failure from the distributed logs and metrics emitted from different executors.
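
To make the lazy evaluation point concrete, here is a minimal plain-Python sketch. It does not use Spark: a generator stands in for a transformation, and the names `parse_amount` and `records` are purely illustrative. It shows how a problem introduced where a pipeline is defined only surfaces later, when an action forces evaluation.

```python
def parse_amount(record: str) -> float:
    """Hypothetical transformation: parse the second CSV field as a number."""
    return float(record.split(",")[1])


records = ["a,1.5", "b,2.0", "c,not-a-number"]

# "Transformation": building the generator runs no parsing at all,
# so the bad record goes unnoticed on this line.
amounts = (parse_amount(r) for r in records)

# "Action": evaluation happens only here, away from where the pipeline
# was defined, and this is where the failure finally appears.
try:
    total = sum(amounts)
    failure = None
except ValueError as exc:
    total = None
    failure = str(exc)

print(failure)
```

In a real Spark job the same separation exists between the line that declares a transformation and the stage that fails, except that the error is also spread across executor logs.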

Apache Spark troubleshooting agent architecture

This section describes the components of the troubleshooting agent and how they connect to your development environment. The troubleshooting agent provides a single conversational entry point for your Spark applications across Amazon EMR, AWS Glue, and Amazon SageMaker Notebooks. Instead of navigating different consoles, APIs, and log locations for each service, you interact with one Model Context Protocol (MCP) server through natural language using any MCP-compatible AI assistant of your choice, including custom agents you develop using frameworks such as Strands Agents.

Operating as a fully managed cloud-hosted MCP server, the agent removes the need to maintain local servers while keeping your data and code isolated and secure in a single-tenant system design. Operations are read-only and backed by AWS Identity and Access Management (IAM) permissions; the agent only has access to the resources and actions your IAM role grants. Additionally, tool calls are automatically logged to AWS CloudTrail, providing full auditability and compliance visibility. This combination of managed infrastructure, granular IAM controls, and CloudTrail integration helps keep your Spark diagnostic workflows secure, compliant, and fully auditable.

The agent builds on years of AWS expertise running millions of Spark applications at scale. It automatically analyzes Spark History Server data, distributed executor logs, configuration patterns, and error stack traces, and extracts relevant features and signals to surface insights that would otherwise require manual correlation across multiple data sources and a deep understanding of Spark and service internals.

Getting started

Complete the following steps to get started with the Apache Spark troubleshooting agent.

Prerequisites

Verify that you meet or have completed the following prerequisites.

System requirements:

Python 3.10 or higher
The uv package manager. For instructions, see installing uv.
AWS Command Line Interface (AWS CLI) (version 2.30.0 or later), installed and configured with appropriate credentials.
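
The system requirements above can be checked with a short script. The following is a minimal sketch, not part of the official setup: the helper names are hypothetical, and it assumes `aws --version` prints a string containing `aws-cli/<major>.<minor>.<patch>`.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 10)
MIN_AWS_CLI = (2, 30, 0)


def parse_aws_cli_version(output: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from `aws --version` style output."""
    match = re.search(r"aws-cli/(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized version string: {output!r}")
    return tuple(int(part) for part in match.groups())


def check_prerequisites() -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.10 or higher is required")
    if shutil.which("uv") is None:
        problems.append("uv was not found on PATH")
    if shutil.which("aws") is None:
        problems.append("AWS CLI was not found on PATH")
    else:
        result = subprocess.run(
            ["aws", "--version"], capture_output=True, text=True, check=False
        )
        # Read both streams in case the version is not written to stdout.
        try:
            version = parse_aws_cli_version(result.stdout + result.stderr)
        except ValueError as exc:
            problems.append(str(exc))
        else:
            if version < MIN_AWS_CLI:
                problems.append("AWS CLI 2.30.0 or later is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"MISSING: {problem}")
```

The script only confirms that the tools are present at the required versions; it does not validate that your credentials are configured.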

IAM permissions: Your AWS IAM profile needs permissions to invoke the MCP server and access your Spark workload resources. The AWS CloudFormation template in the setup documentation creates an IAM role with the required permissions. You can also manually add the required IAM permissions.

Set up using AWS CloudFormation

First, deploy the AWS CloudFormation template provided in the setup documentation. This template automatically creates the IAM roles with the permissions required to invoke the MCP server.

Deploy the template in the same AWS Region you run your workloads in. For this post, we use us-east-1.

From the AWS CloudFormation Outputs tab, copy and run the environment variable command:

export SMUS_MCP_REGION=us-east-1 && export IAM_ROLE=arn:aws:iam::111122223333:role/spark-troubleshooting-role-xxxxxx

Configure your AWS CLI profile:

aws configure set profile.smus-mcp-profile.role_arn ${IAM_ROLE}
aws configure set profile.smus-mcp-profile.source_profile default
aws configure set profile.smus-mcp-profile.region ${SMUS_MCP_REGION}
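
For reference, the three commands are expected to leave a named profile section in your AWS CLI config file (typically ~/.aws/config). The sketch below renders that section using Python's standard library so that you can compare it with your own file. The function name is hypothetical, and the exact whitespace written by the AWS CLI may differ.

```python
import configparser
import io


def render_profile(role_arn: str, region: str, source_profile: str = "default") -> str:
    """Render the config section the three `aws configure set` commands aim to produce."""
    config = configparser.ConfigParser()
    config["profile smus-mcp-profile"] = {
        "role_arn": role_arn,
        "source_profile": source_profile,
        "region": region,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_profile(
    "arn:aws:iam::111122223333:role/spark-troubleshooting-role-xxxxxx",
    "us-east-1",
))
```

With this profile in place, tools that use smus-mcp-profile assume the IAM role using the credentials of the default profile.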

Set up using Kiro CLI

You can use Kiro CLI to interact with the Apache Spark troubleshooting agent directly from your terminal.

Installation and configuration:

Install Kiro CLI.

Add both MCP servers, using the environment variables from the previous Set up using AWS CloudFormation section:

# Add Spark Troubleshooting MCP Server
kiro-cli-chat mcp add \
    --name "sagemaker-unified-studio-mcp-troubleshooting" \
    --command "uvx" \
    --args "[\"mcp-proxy-for-aws@latest\",\"https://sagemaker-unified-studio-mcp.${SMUS_MCP_REGION}.api.aws/spark-troubleshooting/mcp\", \"--service\", \"sagemaker-unified-studio-mcp\", \"--profile\", \"smus-mcp-profile\", \"--region\", \"${SMUS_MCP_REGION}\", \"--read-timeout\", \"180\"]" \
    --timeout 180000 \
    --scope global
# Add Spark Code Recommendation MCP Server
kiro-cli-chat mcp add \
    --name "sagemaker-unified-studio-mcp-code-rec" \
    --command "uvx" \
    --args "[\"mcp-proxy-for-aws@latest\",\"https://sagemaker-unified-studio-mcp.${SMUS_MCP_REGION}.api.aws/spark-code-recommendation/mcp\", \"--service\", \"sagemaker-unified-studio-mcp\", \"--profile\", \"smus-mcp-profile\", \"--region\", \"${SMUS_MCP_REGION}\", \"--read-timeout\", \"180\"]" \
    --timeout 180000 \
    --scope global

Verify your setup by running the /tools command in Kiro CLI to see the available Apache Spark troubleshooting tools.
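
The --args value is a JSON array passed as a single shell string, which makes it easy to get the quoting wrong. As a minimal sketch, the following hypothetical helper builds the same array with json.dumps, so you can generate the string rather than hand-escape it. The endpoint paths and flag values are taken from the commands above.

```python
import json


def build_mcp_args(region: str, endpoint_path: str) -> str:
    """Return the JSON array passed to --args, with quoting handled by json.dumps."""
    url = f"https://sagemaker-unified-studio-mcp.{region}.api.aws/{endpoint_path}/mcp"
    args = [
        "mcp-proxy-for-aws@latest",
        url,
        "--service", "sagemaker-unified-studio-mcp",
        "--profile", "smus-mcp-profile",
        "--region", region,
        "--read-timeout", "180",
    ]
    return json.dumps(args)


print(build_mcp_args("us-east-1", "spark-troubleshooting"))
print(build_mcp_args("us-east-1", "spark-code-recommendation"))
```

Each printed line can be passed as the single quoted argument to --args.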

Set up using Kiro IDE

Kiro IDE provides a visual development environment with built-in AI assistance for interacting with the Apache Spark troubleshooting agent.

Installation and configuration:

Install Kiro IDE.
MCP configuration is shared across Kiro CLI and Kiro IDE. Open the command palette using Ctrl + Shift + P (Windows / Linux) or Cmd + Shift + P (macOS) and search for Kiro: Open MCP Config.
Verify the contents of your mcp.json match the Set up using Kiro CLI section.
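
If you prefer to check the file programmatically, the following sketch reports which of the two server names from the Kiro CLI step are missing. It assumes that mcp.json stores servers under a top-level "mcpServers" object; that key and the helper name are assumptions for illustration, so adjust them to match the file you see.

```python
import json

EXPECTED_SERVERS = (
    "sagemaker-unified-studio-mcp-troubleshooting",
    "sagemaker-unified-studio-mcp-code-rec",
)


def missing_servers(config_text: str) -> list[str]:
    """List expected server names absent from an mcp.json document.

    Assumes servers are stored under a top-level "mcpServers" object.
    """
    servers = json.loads(config_text).get("mcpServers", {})
    return [name for name in EXPECTED_SERVERS if name not in servers]


# Sample document with only one of the two servers registered.
sample = json.dumps({
    "mcpServers": {
        "sagemaker-unified-studio-mcp-troubleshooting": {"command": "uvx"},
    }
})
print(missing_servers(sample))
```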

Using the troubleshooting agent

Next, we provide three reference architectures for using the troubleshooting agent in your existing workflows. We also provide the reference code and AWS CloudFormation templates for these architectures in the Amazon EMR Utilities GitHub repository.

Solution 1 – Conversational troubleshooting: Troubleshooting failed Apache Spark applications with Kiro CLI

When Spark applications fail across your data platform, your debugging approach would typically involve navigating different consoles for Amazon EMR-EC2, Amazon EMR Serverless, and AWS Glue, manually reviewing Spark History Server logs, checking error stack traces, analyzing resource utilization patterns, and then correlating this information to find the root cause and fix. The Apache Spark troubleshooting agent automates this entire workflow through natural language, providing a unified troubleshooting experience across the three platforms. Simply describe your failed applications, for example:

# Amazon EMR-EC2
Debug my failing Amazon EMR-EC2 step. Cluster id: 'j-xxxxx' Step id: 's-xxxxx'
# Amazon EMR Serverless
Troubleshoot my Amazon EMR Serverless job. Application id: 'xxxxx' Job run id: 'xxxxx'
# AWS Glue
Analyze my failed AWS Glue job. Job name: 'my-etl-job' Job run id: 'jr_xxxxx'

The agent automatically extracts Spark event logs and metrics, analyzes the error patterns, and provides a clear root cause explanation along with recommendations, all through the same conversational interface. The following video demonstrates the complete troubleshooting workflow across Amazon EMR-EC2, Amazon EMR Serverless, and AWS Glue using Kiro CLI:

Solution 2 – Agent-driven notifications: Integrate the Apache Spark troubleshooting agent into a monitoring workflow

In addition to troubleshooting from the command line, the troubleshooting agent can plug into your monitoring infrastructure to provide enriched failure notifications.

Production data pipelines require immediate visibility when failures occur. Traditional monitoring systems can alert you when a Spark job fails, but diagnosing the root cause still requires manual investigation and an analysis of what went wrong before remediation can begin.

You can integrate the Apache Spark troubleshooting agent into your existing monitoring workflows to receive root causes and recommendations as soon as you receive a failure notification. Here, we demonstrate two integration patterns that result in automatic root cause analysis within your existing workflows.

Apache Airflow integration

This first integration pattern uses Apache Airflow callbacks to automatically trigger troubleshooting when Spark job operators fail.

When any Amazon EMR-EC2, Amazon EMR Serverless, or AWS Glue job operator fails in an Apache Airflow DAG:

A callback invokes the Spark troubleshooting agent within a separate DAG.
The Spark troubleshooting agent analyzes the issue, establishes the root cause, and identifies code fix recommendations.
The Spark troubleshooting agent sends a comprehensive diagnostic report to a configured Slack channel.
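
The first step of this flow can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation from the repository: the DAG ID `spark_troubleshooting`, the conf keys, and the injected `trigger` function are all hypothetical. The sketch relies only on the `task_instance` and `exception` entries that Airflow passes to failure callbacks, and it injects the trigger so that it runs without Airflow installed.

```python
from types import SimpleNamespace
from typing import Any, Callable

# Hypothetical ID of the separate DAG that runs the troubleshooting agent.
TROUBLESHOOTING_DAG_ID = "spark_troubleshooting"


def build_trigger_conf(context: dict[str, Any]) -> dict[str, str]:
    """Collect the failed task's identifiers from an Airflow callback context."""
    ti = context["task_instance"]
    return {
        "failed_dag_id": ti.dag_id,
        "failed_task_id": ti.task_id,
        "failed_run_id": ti.run_id,
        "error": str(context.get("exception", "")),
    }


def make_failure_callback(trigger: Callable[[str, dict[str, str]], None]):
    """Return a function usable as `on_failure_callback` on a Spark job operator.

    `trigger` is injected so the sketch runs without Airflow installed; in a
    deployment it would start the troubleshooting DAG with the given conf.
    """
    def on_failure(context: dict[str, Any]) -> None:
        trigger(TROUBLESHOOTING_DAG_ID, build_trigger_conf(context))

    return on_failure


# Stand-in trigger and context so the flow can be exercised locally.
triggered = []
callback = make_failure_callback(lambda dag_id, conf: triggered.append((dag_id, conf)))
callback({
    "task_instance": SimpleNamespace(dag_id="etl", task_id="run_glue_job", run_id="manual__1"),
    "exception": RuntimeError("Glue job failed"),
})
print(triggered)
```

Attaching the returned function to an operator through its on_failure_callback argument is what keeps the change to an existing DAG small.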

The solution is available in the Amazon EMR Utilities GitHub repository (documentation) for immediate integration into your existing Apache Airflow deployments with a one-line change to your Airflow DAGs. The following video demonstrates this integration:

Amazon EventBridge integration

For event-driven architectures, this second pattern uses Amazon EventBridge to automatically invoke the troubleshooting agent when Spark jobs fail across your AWS environment.

This integration uses an AWS Lambda function that interacts with the Apache Spark troubleshooting agent through the Strands MCP Client.

When Amazon EventBridge detects failures from Amazon EMR-EC2 steps, Amazon EMR Serverless job runs, or AWS Glue job runs, it triggers the AWS Lambda function, which:

Uses the Apache Spark troubleshooting agent to analyze the failure
Identifies the root cause and generates code fix recommendations
Constructs a comprehensive analysis summary
Sends the summary to Amazon SNS
Delivers the analysis to your configured destinations (email, Slack, or other SNS subscribers)
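
The shape of such a function can be sketched as follows. This is a simplified illustration, not the reference implementation: the `analyze` and `publish` callables are hypothetical stand-ins for the agent call and the Amazon SNS publish call, and the event field names (`clusterId`, `stepId`, `applicationId`, `jobRunId`, `jobName`) are assumptions about the EventBridge event payloads that you should confirm against the events in your account.

```python
from typing import Callable, Optional


def build_prompt(event: dict) -> Optional[str]:
    """Map a failure event to a troubleshooting prompt, or None if unsupported."""
    source = event.get("source")
    detail = event.get("detail", {})
    if source == "aws.emr" and "stepId" in detail:
        return (
            "Debug my failing Amazon EMR-EC2 step. "
            f"Cluster id: '{detail['clusterId']}' Step id: '{detail['stepId']}'"
        )
    if source == "aws.emr-serverless":
        return (
            "Troubleshoot my Amazon EMR Serverless job. "
            f"Application id: '{detail['applicationId']}' Job run id: '{detail['jobRunId']}'"
        )
    if source == "aws.glue":
        return (
            "Analyze my failed AWS Glue job. "
            f"Job name: '{detail['jobName']}' Job run id: '{detail['jobRunId']}'"
        )
    return None


def handle_failure_event(
    event: dict,
    analyze: Callable[[str], str],
    publish: Callable[[str, str], None],
) -> bool:
    """Analyze one failure event and publish the summary; return False if skipped."""
    prompt = build_prompt(event)
    if prompt is None:
        return False
    analysis = analyze(prompt)
    publish("Spark failure analysis", f"{prompt}\n\n{analysis}")
    return True


# Local dry run with stand-ins for the agent and the notification topic.
sent = []
handled = handle_failure_event(
    {"source": "aws.glue", "detail": {"jobName": "my-etl-job", "jobRunId": "jr_xxxxx"}},
    analyze=lambda prompt: "(analysis returned by the agent)",
    publish=lambda subject, message: sent.append((subject, message)),
)
print(handled, sent)
```

Keeping the prompt construction separate from the agent and notification calls makes the event mapping easy to test without AWS access.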

This serverless approach provides centralized failure analysis across all your Spark platforms without requiring changes to individual pipelines. The following video demonstrates this integration:

A reference implementation of this solution is available in the Amazon EMR Utilities GitHub repository (documentation).

Solution 3 – Intelligent dashboards: Use the Apache Spark troubleshooting agent with Kiro IDE to visualize account-level application failures: what failed, why it failed, and how to fix it

Understanding the health of your Spark workloads across multiple platforms requires consolidating data from Amazon EMR (both EC2 and Serverless) and AWS Glue. Teams typically build custom monitoring solutions by writing scripts to query multiple APIs, aggregate metrics, and generate reports, which can be time consuming and require active maintenance.

With Kiro IDE and the Apache Spark troubleshooting agent, you can build comprehensive monitoring dashboards conversationally. Instead of writing custom code to aggregate workload metrics, you can describe what you want to track, and the agent generates a complete dashboard showing overall performance metrics, error category distributions for failures, success rates across platforms, and critical failures requiring immediate attention. Unlike traditional dashboards that only show KPIs and metrics about which application failed, this dashboard uses the Spark troubleshooting agent to provide insights on why the applications failed and how they can be fixed. The following video demonstrates building a multi-platform monitoring dashboard using Kiro IDE:

The prompt used in the demo:

Build a comprehensive monitoring dashboard for all of my Amazon EMR-EC2 steps, Amazon EMR Serverless jobs, and AWS Glue jobs for the last 30 days. Region: us-east-2.
Execution Plan:
1. List all of my Spark applications across these services from the last 30 days. You can store any intermediate results in files in this folder as .json, but VALIDATE outputs before moving on to the next step. It is crucial to verify the results before considering this done. You can write python script helpers to achieve this. Handle throttling and other exceptions gracefully. Make sure to cover all platforms: Amazon EMR-EC2, Amazon EMR Serverless, and AWS Glue.
2. Use the spark-troubleshooting-mcp to gather failure insights for each of my applications. Save this as .json as well.
3. Then, use this information to help build the dashboard as HTML. Name the file dashboard.html.
Dashboard Requirements:
- Information from all of my Amazon EMR-EC2, Amazon EMR Serverless, and AWS Glue applications should be present
- overall success rates across platforms
- error category distributions for failures as a pie chart
- failures from last 30 days requiring attention with root causes and recommendations. Include error category and show the root causes and recommendations as they are returned by the spark-troubleshooting-mcp
- configuration comparisons per each platform. Configuration includes versions, worker types / DPUs, etc.
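
To give a sense of the kind of Python helper the prompt allows the agent to write, the following sketch computes two of the requested dashboard inputs (success rates per platform and the error category distribution) from a list of run records. The record fields and category values are hypothetical; the actual intermediate .json files depend on what the agent generates.

```python
from collections import Counter, defaultdict


def summarize(runs: list[dict]) -> dict:
    """Compute per-platform success rates and error category counts.

    Each run is assumed to look like
    {"platform": ..., "status": "SUCCEEDED" | "FAILED", "error_category": ...}.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    categories = Counter()
    for run in runs:
        platform = run["platform"]
        totals[platform] += 1
        if run["status"] == "SUCCEEDED":
            successes[platform] += 1
        else:
            # Failures without a category are grouped under UNKNOWN.
            categories[run.get("error_category") or "UNKNOWN"] += 1
    return {
        "success_rate": {p: successes[p] / totals[p] for p in totals},
        "error_categories": dict(categories),
    }


runs = [
    {"platform": "EMR-EC2", "status": "SUCCEEDED"},
    {"platform": "EMR-EC2", "status": "FAILED", "error_category": "OUT_OF_MEMORY"},
    {"platform": "Glue", "status": "FAILED", "error_category": "OUT_OF_MEMORY"},
    {"platform": "Glue", "status": "FAILED"},
]
summary = summarize(runs)
print(summary)
```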

Clean up

To avoid incurring future AWS charges, delete the resources you created during this walkthrough:

Delete the AWS CloudFormation stack.
If you created an Amazon EventBridge rule for integration, delete those resources.

Conclusion

In this post, we demonstrated how the Apache Spark troubleshooting agent transforms hours of manual investigation into natural language conversations, reducing troubleshooting time from hours to minutes and making Spark expertise accessible to all. By integrating natural language diagnostics into your existing development tools, whether Kiro CLI, Kiro IDE, or other MCP-compatible AI assistants, your teams can focus on building innovative applications instead of debugging failures.

Special thanks

Special thanks to everyone who contributed from engineering and science to the launch of the Spark troubleshooting agent and the remote MCP service: Tony Rusignuolo, Anshi Shrivastava, Martin Ma, Hirva Patel, Pranjal Srivastava, Weijing Cai, Rupak Ravi, Bo Li, Vaibhav Naik, XiaoRun Yu, Tina Shao, Pramod Chunduri, Ray Liu, Yueying Cui, Savio Dsouza, Kinshuk Pahare, Tim Kraska, Santosh Chandrachood, Paul Meighan and Rick Sears.

Special thanks to all of our partners who contributed to the launch of the Spark troubleshooting agent and the remote MCP service: Karthik Prabhakar, Suthan Phillips, Basheer Sheriff, Kamen Sharlandjiev, Archana Inapudi, Vara Bonthu, McCall Peltier, Lydia Kautsky, Larry Weber, Jason Berkovitz, Jordan Vaughn, Amar Wakharkar, Subramanya Vajiraya, Boyko Radulov and Ishan Gaur.
