Agentic AI for observability and troubleshooting with Amazon OpenSearch Service

April 3, 2026

Amazon OpenSearch Service powers observability workflows for organizations, giving their Site Reliability Engineering (SRE) and DevOps teams a single pane of glass to aggregate and analyze telemetry data. During incidents, correlating signals and identifying root causes demands deep expertise in log analytics and hours of manual work. For many teams, this is the bottleneck that delays service recovery and burns engineering resources.

We recently showed how to build an Observability Agent using Amazon OpenSearch Service and Amazon Bedrock to reduce Mean Time to Resolution (MTTR). Now, Amazon OpenSearch Service brings many of these capabilities to the OpenSearch UI, with no additional infrastructure required. Three new agentic AI features help streamline investigations and reduce MTTR:

An Agentic Chatbot that can access the context and the underlying data that you’re looking at, apply agentic reasoning, and use tools to query data and generate insights on your behalf.
An Investigation Agent that deep-dives across signal data with hypothesis-driven analysis, explaining its reasoning at every step.
An Agentic Memory that supports both agents, so their accuracy and speed improve the more you use them.

In this post, we show how these capabilities work together to help engineers go from alert to root cause in minutes. We also walk through a sample scenario where the Investigation Agent automatically correlates data across multiple indices to surface a root cause hypothesis.

How the agentic AI capabilities work together

These AI capabilities are accessible from the OpenSearch UI through an Ask AI button, as shown in the following diagram, which provides an entry point for the Agentic Chatbot.

Agentic Chatbot

To open the chatbot interface, choose Ask AI.

The chatbot understands the context of the current page, so it knows what you’re looking at before you ask a question. You can ask questions about your data, initiate an investigation, or ask the chatbot to explain a concept. After it understands your request, the chatbot plans and uses tools to access data, including generating and running queries in the Discover page, and applies reasoning to produce a data-driven answer. You can also use the chatbot in the Dashboard page, initiating conversations from a particular visualization to get a summary, as shown in the following image.

Investigation agent

Many incidents are too complex to resolve with one or two queries. Now you can get the help of the investigation agent to handle these complex situations. The investigation agent uses the plan-execute-reflect agent, which is designed for solving complex tasks that require iterative reasoning and step-by-step execution. It uses a Large Language Model (LLM) as a planner and another LLM as an executor. When an engineer identifies a suspicious observation, like an error rate spike or a latency anomaly, they can ask the investigation agent to investigate. One of the important steps the investigation agent performs is re-evaluation. After executing each step, the agent reevaluates the plan using the planner and the intermediate results. The planner can adjust the plan if necessary, skip a step, or dynamically add steps based on this new information.

Using the planner, the agent generates a root cause analysis report led by the most likely hypothesis and recommendations, with full agent traces showing every reasoning step, all findings, and how they support the final hypotheses. You can provide feedback, add your own findings, iterate on the investigation goal, and review and validate each step of the agent’s reasoning. This approach mirrors how experienced incident responders work, but completes automatically in minutes. You can also use the “/investigate” slash command to initiate an investigation directly from the chatbot, building on an ongoing conversation or starting with a different investigation goal.
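The re-evaluation loop described above can be sketched in a few lines of Python. This is only an illustration of the plan-execute-reflect pattern, not the service's implementation: the names `run_investigation`, `make_plan`, `execute`, and `reflect` are hypothetical, and the planner and executor LLMs are replaced with toy functions so the control flow is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Investigation:
    goal: str
    plan: List[str]
    findings: List[str] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)


def run_investigation(
    goal: str,
    make_plan: Callable[[str], List[str]],
    execute: Callable[[str], str],
    reflect: Callable[[str, List[str], List[str]], List[str]],
    max_steps: int = 10,
) -> Investigation:
    """Execute one step at a time and let the planner revise the rest of the plan."""
    state = Investigation(goal=goal, plan=list(make_plan(goal)))
    steps_run = 0
    while state.plan and steps_run < max_steps:
        step = state.plan.pop(0)
        result = execute(step)
        state.findings.append(result)
        state.trace.append(f"{step} -> {result}")
        # Re-evaluation: the planner sees the remaining steps and all findings,
        # and may keep, drop, or add steps before the next iteration.
        state.plan = list(reflect(goal, state.plan, state.findings))
        steps_run += 1
    return state


# Toy stand-ins for the planner and executor (assumptions, not real LLM calls).
def toy_plan(goal: str) -> List[str]:
    return ["find slow spans", "check error logs"]


def toy_execute(step: str) -> str:
    return f"result of {step}"


def toy_reflect(goal: str, remaining: List[str], findings: List[str]) -> List[str]:
    # After the first finding, replace the generic log check with a targeted step.
    if len(findings) == 1:
        return ["fetch logs for slow trace IDs"]
    return remaining


report = run_investigation(
    "explain high latency", toy_plan, toy_execute, toy_reflect
)
for line in report.trace:
    print(line)
```

In this toy run the planner drops the generic "check error logs" step after the first result and adds a more specific one, which is the kind of plan adjustment the paragraph describes. The `max_steps` cap is an assumption added here so the loop always terminates.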

Agent in action

Automated query generation

Consider a situation where you’re an SRE or DevOps engineer and have received an alert that a key service is experiencing elevated latency. You log in to the OpenSearch UI, navigate to the Discover page, and select the Ask AI button. Without any expertise in the Piped Processing Language (PPL), you enter the question “find all requests with latency greater than 10 seconds”. The chatbot understands the context and the data that you’re looking at, thinks through the request, generates the right PPL command, and updates it in the query bar to get you the results. If the query runs into any errors, the chatbot can learn about the error, self-correct, and iterate on the query to get the results for you.
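To give a sense of what such a command might look like, here is a small Python helper that assembles a PPL query of the `source | where | fields` shape. The query the chatbot actually generates is not shown in this post; the index name `spans`, the field names, and the helper `build_slow_request_query` are all assumptions made for illustration.

```python
from typing import Sequence


def build_slow_request_query(
    index: str,
    duration_field: str,
    threshold_seconds: float,
    fields: Sequence[str],
) -> str:
    """Build a PPL query for requests slower than the threshold.

    Assumes the duration field stores nanoseconds (hypothetical schema).
    """
    threshold_nanos = int(threshold_seconds * 1_000_000_000)
    return (
        f"source={index} "
        f"| where {duration_field} > {threshold_nanos} "
        f"| fields {', '.join(fields)}"
    )


# Hypothetical index and field names; adjust to your own mapping.
query = build_slow_request_query(
    index="spans",
    duration_field="durationInNanos",
    threshold_seconds=10,
    fields=["traceId", "serviceName", "durationInNanos"],
)
print(query)
```

If your data stores latency in a different unit, the threshold conversion would need to change accordingly.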

Investigation and investigation management

For complex incidents that normally require manually analyzing and correlating multiple logs to find the potential root cause, you can choose Start Investigation to initiate the investigation agent. You can provide a goal for the investigation, along with any context or hypothesis that you want to guide it. For example, “identify the root cause of widespread high latency across services. Use TraceIDs from slow spans to correlate with detailed log entries in the related log indices. Analyze affected services, operations, error patterns, and any infrastructure or application-level bottlenecks without sampling”.

The agent, as part of the conversation, will offer to investigate any issue that you’re trying to debug.

The agent sets goals for itself, along with other relevant information such as indices and the relevant time range, and asks for your confirmation before creating a Notebook for this investigation. A Notebook is a way within the OpenSearch UI to develop a rich report that’s live and collaborative. This helps with the management of the investigation and allows for reinvestigation at a later date if necessary.

After the investigation starts, the agent performs a quick analysis of log sequences and data distribution to surface outliers. Then, it breaks the investigation into a series of actions and performs each action, such as querying for a specific log type and time range. It reflects on the results at every step and iterates on the plan until it reaches the most likely hypotheses. Intermediate results appear on the same page as the agent works, so that you can follow the reasoning in real time. For example, you find that the Investigation Agent accurately mapped out the service topology and used it as a key intermediate step in the investigation.
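The post does not say which statistical method the agent uses for this first pass. As a rough stand-in, the sketch below flags outliers in a list of latency values using the median absolute deviation (MAD), a common robust choice. The function name `flag_outliers`, the threshold of 3.5, and the sample latencies are assumptions made for illustration.

```python
from statistics import median
from typing import List, Sequence


def flag_outliers(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (nearly) identical; nothing stands out by this measure.
        return []
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up request latencies in milliseconds, with one slow request.
latencies_ms = [120, 135, 110, 128, 142, 11800, 125, 131]
outlier_indices = flag_outliers(latencies_ms)
print(outlier_indices)
```

A median-based score is used here instead of mean and standard deviation because a single very slow request would otherwise inflate the spread and hide itself.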

As the investigation completes, the investigation agent concludes that the most likely hypothesis is a fraud detection timeout. The associated finding shows a log entry from the payment service: “currency amount is too big, waiting for fraud detection”. This matches a known system design where large transactions trigger a fraud detection call that blocks the request until the transaction is scored and assessed. The agent arrived at this finding by correlating data across two separate indices: a metrics index where the original duration data lived, and a correlated log index where the payment service entries were stored. The agent linked these indices using trace IDs, connecting the latency measurement to the specific log entry that explained it.
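The trace ID correlation described above is essentially a join between two sets of records. The sketch below shows that idea on in-memory data; it is not how the agent queries OpenSearch. The record layout, the field names `traceId`, `durationSeconds`, `service`, and `message`, and the sample rows are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def correlate_by_trace_id(
    slow_spans: List[Dict], log_entries: List[Dict]
) -> List[Dict]:
    """Attach log entries to slow spans that share the same trace ID."""
    logs_by_trace = defaultdict(list)
    for entry in log_entries:
        logs_by_trace[entry["traceId"]].append(entry)

    correlated = []
    for span in slow_spans:
        # Spans without any matching log entry are skipped.
        for entry in logs_by_trace.get(span["traceId"], []):
            correlated.append(
                {
                    "traceId": span["traceId"],
                    "durationSeconds": span["durationSeconds"],
                    "service": entry["service"],
                    "message": entry["message"],
                }
            )
    return correlated


# Sample records standing in for the metrics index and the log index.
slow_spans = [
    {"traceId": "t-1", "durationSeconds": 12.4},
    {"traceId": "t-2", "durationSeconds": 10.9},
]
log_entries = [
    {"traceId": "t-1", "service": "payment", "message": "sample slow-path log line"},
    {"traceId": "t-3", "service": "cart", "message": "sample unrelated log line"},
]

matches = correlate_by_trace_id(slow_spans, log_entries)
for match in matches:
    print(match)
```

Only the span with trace ID `t-1` has a matching log entry in this sample, so it is the only one that appears in `matches`.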

After reviewing the hypothesis and the supporting evidence, you find the result reasonable and consistent with your domain knowledge and past experience with similar issues. You can now accept the hypothesis and review the request flow topology for the affected traces that were provided as part of the hypothesis investigation.

Alternatively, if you find that the initial hypothesis wasn’t helpful, you can review the alternative hypotheses at the bottom of the report and select one if it’s more accurate. You can also trigger a re-investigation with additional inputs or corrections to previous input so that the Investigation Agent can rework it.

Getting started

You can use any of the new agentic AI features (limits apply) in the OpenSearch UI at no cost. The new agentic AI features are ready to use in your OpenSearch UI applications, unless you have previously disabled AI features in any OpenSearch Service domains in your account. To enable or disable the AI features, navigate to the details page of the OpenSearch UI application in the AWS Management Console and update the AI settings from there. Alternatively, you can use the registerCapability API to enable the AI features or the deregisterCapability API to disable them. Learn more at Agentic AI in Amazon OpenSearch Service.

The agentic AI feature uses the identity and permissions of the logged-in user to authorize access to the connected data sources. Make sure that your users have the necessary permissions to access the data sources. For more information, see Getting Started with OpenSearch UI.

The investigation results are saved in the metadata system of OpenSearch UI and encrypted with a service managed key. Optionally, you can configure a customer managed key to encrypt all of the metadata with your own key. For more information, see Encryption and Customer Managed Key with OpenSearch UI.

The AI features are powered by the Claude Sonnet 4.6 model in Amazon Bedrock. Learn more at Amazon Bedrock Data Protection.

Conclusion

The new agentic AI capabilities announced for Amazon OpenSearch Service help reduce Mean Time to Resolution by providing a context-aware agentic chatbot for assistance, hypothesis-driven investigations with full explainability, and agentic memory for context consistency. With these capabilities, your engineering team can spend less time writing queries and correlating signals, and more time acting on confirmed root causes. We invite you to explore these capabilities and experiment with your applications today.
