
Amazon OpenSearch Ingestion 101: Set CloudWatch alarms for key metrics


Amazon OpenSearch Ingestion is a fully managed, serverless data pipeline that simplifies the process of ingesting data into Amazon OpenSearch Service and OpenSearch Serverless collections. Some key concepts include:

  • Source – Input component that specifies how the pipeline ingests data. Every pipeline has a single source, which can be either push-based or pull-based.
  • Processors – Intermediate processing units that can filter, transform, and enrich records before delivery.
  • Sink – Output component that specifies the destination(s) to which the pipeline publishes data. A pipeline can publish records to multiple destinations.
  • Buffer – The layer between the source and the sink. It serves as temporary storage for events, decoupling the source from the downstream processors and sinks. Amazon OpenSearch Ingestion also offers a persistent buffer option for push-based sources.
  • Dead-letter queues (DLQs) – Configure Amazon Simple Storage Service (Amazon S3) to capture records that fail to write to the sink, enabling error handling and troubleshooting.

This end-to-end data ingestion service can help you collect, process, and deliver data to your OpenSearch environments without the need to manage underlying infrastructure.

This post provides an in-depth look at setting up Amazon CloudWatch alarms for OpenSearch Ingestion pipelines. It goes beyond our recommended alarms to help identify bottlenecks in the pipeline, whether that's in the sink, the OpenSearch clusters data is being sent to, the processors, or the pipeline not pulling or accepting enough from the source. This post will help you proactively monitor and troubleshoot your OpenSearch Ingestion pipelines.

Overview

Monitoring your OpenSearch Ingestion pipelines is crucial for catching and addressing issues early. By understanding the key metrics and setting up the right alarms, you can proactively manage the health and performance of your data ingestion workflows. In the following sections, we provide details about alarm metrics for different sources, processors, and sinks. The specific values used for the threshold, period, and datapoints to alarm can vary based on your use case and requirements.

Prerequisites

To create an OpenSearch Ingestion pipeline, refer to Creating Amazon OpenSearch Ingestion pipelines. For creating CloudWatch alarms, refer to Create a CloudWatch alarm based on a static threshold.

You can enable logging for an OpenSearch Ingestion pipeline, which captures various log messages during pipeline operations and ingestion activity, including errors, warnings, and informational messages. For details on enabling and monitoring pipeline logs, refer to Monitoring pipeline logs.
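As a concrete starting point, alarms like those in the following sections can also be created with the AWS SDK instead of the console. This is a minimal sketch, assuming boto3 is installed and that the pipeline name and SNS topic ARN shown are placeholders you would replace; the AWS/OSIS namespace and PipelineName dimension follow the metric-math JSON shown later in this post.

```python
import json

# Hypothetical names -- substitute your own pipeline and notification topic.
PIPELINE_NAME = "my-ingestion-pipeline"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"

def static_threshold_alarm(metric_name, threshold, statistic="Sum"):
    """Build PutMetricAlarm parameters for an OpenSearch Ingestion metric."""
    return {
        "AlarmName": f"{PIPELINE_NAME}-{metric_name}",
        "Namespace": "AWS/OSIS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "PipelineName", "Value": PIPELINE_NAME}],
        "Statistic": statistic,
        "Period": 300,                       # 5-minute period
        "EvaluationPeriods": 1,              # 1 out of 1 datapoints
        "DatapointsToAlarm": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data points means no failures
        "AlarmActions": [SNS_TOPIC_ARN],
    }

params = static_threshold_alarm("requestsRejected.count", 0)
print(json.dumps(params, indent=2))
# To actually create the alarm, uncomment:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

The `put_metric_alarm` call is left commented out so the sketch can be inspected without AWS credentials.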

Sources

The entry point of your pipeline is often where monitoring should begin. By setting appropriate alarms for source components, you can quickly identify ingestion bottlenecks or connection issues. The following table summarizes key alarm metrics for different sources.

Source Alarm Description Recommended Action
HTTP/ OpenTelemetry requestsTooLarge.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The request payload size from the client (data producer) is larger than the maximum request payload size, resulting in the status code HTTP 413. The default maximum request payload size is 10 MB for HTTP sources and 4 MB for OpenTelemetry sources. The limit for HTTP sources can be increased for pipelines with persistent buffer enabled. The chunk size on the client can be reduced so that the request payload doesn't exceed the maximum size. You can examine the distribution of payload sizes of incoming requests using the payloadSize.sum metric.
HTTP requestsRejected.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The request was sent to the HTTP endpoint of the OpenSearch Ingestion pipeline by the client (data producer), but the request wasn't accepted by the pipeline, and it rejected the request with the status code 429 in the response. For persistent issues, consider increasing the minimum OCUs for the pipeline to allocate additional resources for request processing.
Amazon S3 s3ObjectsFailed.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The pipeline is unable to read some objects from the Amazon S3 source. Refer to REF-003 in the Reference Guide below.
Amazon DynamoDB Difference of totalOpenShards.max - activeShardsInProcessing.value

Threshold: >0

Statistic: Maximum (totalOpenShards.max) and Sum (activeShardsInProcessing.value)

Datapoints to alarm: 3 out of 3. Note: refer to REF-004 for more details on configuring this specific alarm.
It monitors alignment between the total open shards that need to be processed by the pipeline and the active shards currently in processing. The activeShardsInProcessing.value will go down periodically as shards close but should never misalign from totalOpenShards.max for longer than a few minutes. If the alarm is triggered, you can consider stopping and starting the pipeline; this option resets the pipeline's state, and the pipeline will restart with a new full export. It's non-destructive, so it doesn't delete your index or any data in DynamoDB. If you don't create a fresh index before you do this, you may see a high number of errors from version conflicts because the export tries to insert documents older than the current _version in the index. You can safely ignore these errors. For root cause analysis on the misalignment, you can reach out to AWS Support.
Amazon DynamoDB dynamodb.changeEventsProcessingErrors.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The number of processing errors for change events for a pipeline with stream processing for DynamoDB. If the metric reports increasing values, refer to REF-002 in the Reference Guide below.
Amazon DocumentDB documentdb.exportJobFailure.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The attempt to trigger an export to Amazon S3 failed. Review ERROR-level logs in the pipeline logs for entries beginning with "Received an exception during export from DocumentDB, backing off and retrying." These logs contain the complete exception details indicating the root cause of the failure.
Amazon DocumentDB documentdb.changeEventsProcessingErrors.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The number of processing errors for change events for a pipeline with stream processing for Amazon DocumentDB. Refer to REF-002 in the Reference Guide below.
Kafka kafka.numberOfDeserializationErrors.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The OpenSearch Ingestion pipeline encountered deserialization errors while consuming a record from Kafka. Review WARN-level logs in the pipeline logs and verify that serde_format is configured correctly in the pipeline configuration and that the pipeline role has access to the AWS Glue Schema Registry (if used).
OpenSearch opensearch.processingErrors.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
Processing errors were encountered while reading from the index. Ideally, the OpenSearch Ingestion pipeline would retry automatically, but for unknown exceptions, it might skip processing. Refer to REF-001 or REF-002 in the Reference Guide below to get the exception details that resulted in processing errors.
Amazon Kinesis Data Streams kinesis_data_streams.recordProcessingErrors.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The OpenSearch Ingestion pipeline encountered an error while processing the records. If the metric reports increasing values, refer to REF-002 in the Reference Guide below, which can help in identifying the cause.
Amazon Kinesis Data Streams kinesis_data_streams.acknowledgementSetFailures.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The pipeline encountered a negative acknowledgment while processing the streams, causing it to reprocess the stream. Refer to REF-001 or REF-002 in the Reference Guide below.
Confluence confluence.searchRequestsFailed.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
While attempting to fetch the content, the pipeline encountered an exception. Review ERROR-level logs in the pipeline logs for entries beginning with "Error while fetching content." These logs contain the complete exception details indicating the root cause of the failure.
Confluence confluence.authFailures.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The number of UNAUTHORIZED exceptions received while establishing the connection. Although the service should automatically renew tokens, if the metric shows an increasing value, review ERROR-level logs in the pipeline logs to identify why the token refresh is failing.
Jira jira.ticketRequestsFailed.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
While attempting to fetch the issue, the pipeline encountered an exception. Review ERROR-level logs in the pipeline logs for entries beginning with "Error while fetching issue." These logs contain the complete exception details indicating the root cause of the failure.
Jira jira.authFailures.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The number of UNAUTHORIZED exceptions received while establishing the connection. Although the service should automatically renew tokens, if the metric shows an increasing value, review ERROR-level logs in the pipeline logs to identify why the token refresh is failing.
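Most source alarms in the table above share the same shape (Sum statistic, 5-minute period, threshold >0, 1-of-1 datapoints), so they can be generated table-driven. A sketch, where the pipeline name is a placeholder and the metric names come from the table; the AWS/OSIS namespace and PipelineName dimension follow the metric-math JSON shown later in this post:

```python
# (metric, threshold, statistic) specs mirroring rows of the table above.
SOURCE_ALARMS = [
    ("requestsTooLarge.count", 0, "Sum"),
    ("requestsRejected.count", 0, "Sum"),
    ("s3ObjectsFailed.count", 0, "Sum"),
    ("dynamodb.changeEventsProcessingErrors.count", 0, "Sum"),
    ("kafka.numberOfDeserializationErrors.count", 0, "Sum"),
]

def build_alarms(pipeline_name, specs, period=300):
    """Expand each (metric, threshold, statistic) spec into PutMetricAlarm kwargs."""
    alarms = []
    for metric, threshold, stat in specs:
        alarms.append({
            "AlarmName": f"{pipeline_name}-{metric}",
            "Namespace": "AWS/OSIS",
            "MetricName": metric,
            "Dimensions": [{"Name": "PipelineName", "Value": pipeline_name}],
            "Statistic": stat,
            "Period": period,
            "EvaluationPeriods": 1,
            "DatapointsToAlarm": 1,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
        })
    return alarms

alarms = build_alarms("my-ingestion-pipeline", SOURCE_ALARMS)
print(f"{len(alarms)} alarm definitions ready")
# cw = boto3.client("cloudwatch")
# for a in alarms:
#     cw.put_metric_alarm(**a)
```

Keeping the specs in one list makes it easy to review and prune alarms later, which also helps control CloudWatch alarm costs.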

Processors

The following table provides details about alarm metrics for different processors.

Processor Alarm Description Recommended Action
AWS Lambda aws_lambda_processor.recordsFailedToSentLambda.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
Some of the records couldn't be sent to Lambda. In the case of high values for this metric, refer to REF-002 in the Reference Guide below.
AWS Lambda aws_lambda_processor.numberOfRequestsFailed.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The pipeline was unable to invoke the Lambda function. Although this situation shouldn't occur under normal circumstances, if it does, review the Lambda logs and refer to REF-002 in the Reference Guide below.
AWS Lambda aws_lambda_processor.requestPayloadSize.max

Threshold: >= 6292536

Statistic: MAXIMUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The payload size is exceeding the 6 MB limit, so the Lambda function can't be invoked. Consider revisiting the batching thresholds in the pipeline configuration for the aws_lambda processor.
Grok grok.grokProcessingMismatch.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The incoming data doesn't match the Grok pattern defined in the pipeline configuration. In the case of high values for this metric, review the Grok processor configuration and make sure the defined pattern matches the incoming data.
Grok grok.grokProcessingErrors.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The pipeline encountered an exception when extracting information from the incoming data according to the defined Grok pattern. In the case of high values for this metric, refer to REF-002 in the Reference Guide below.
Grok grok.grokProcessingTime.max

Threshold: >= 1000

Statistic: MAXIMUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The maximum amount of time that each individual record takes to match against patterns from the match configuration option. If the time taken is equal to or higher than 1 second, check the incoming data and the Grok pattern. The maximum amount of time during which matching occurs is 30,000 milliseconds, which is controlled by the timeout_millis parameter.
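Latency-style metrics such as grok.grokProcessingTime.max use the Maximum statistic and a >= comparison rather than the Sum/>0 pattern used by the error-count alarms. A minimal sketch of the corresponding alarm parameters (alarm and pipeline names are hypothetical; the namespace and dimension follow the metric-math JSON shown later in this post):

```python
import json

# Hypothetical names; grok.grokProcessingTime.max is reported in milliseconds.
params = {
    "AlarmName": "my-ingestion-pipeline-grokProcessingTime",
    "Namespace": "AWS/OSIS",
    "MetricName": "grok.grokProcessingTime.max",
    "Dimensions": [{"Name": "PipelineName", "Value": "my-ingestion-pipeline"}],
    "Statistic": "Maximum",   # latency alarms track the worst case, not a sum
    "Period": 300,
    "EvaluationPeriods": 1,
    "DatapointsToAlarm": 1,
    "Threshold": 1000,        # 1 second, per the table above
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}
print(json.dumps(params, indent=2))
# boto3.client("cloudwatch").put_metric_alarm(**params)  # uncomment to create
```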

Sinks and DLQs

The following table contains details about alarm metrics for different sinks and DLQs.

Sink Alarm Description Recommended Action
OpenSearch opensearch.bulkRequestErrors.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The number of errors encountered while sending a bulk request. Refer to REF-002 in the Reference Guide below, which can help to identify the exception details.
OpenSearch opensearch.bulkRequestFailed.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The number of errors received after sending the bulk request to the OpenSearch domain. Refer to REF-001 in the Reference Guide below, which can help to identify the exception details.
Amazon S3 s3.s3SinkObjectsFailed.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The OpenSearch Ingestion pipeline encountered a failure while writing the object to Amazon S3. Verify that the pipeline role has the necessary permissions to write objects to the specified S3 key. Review the pipeline logs to identify the specific keys where failures occurred.

Monitor the s3.s3SinkObjectsEventsFailed.count metric for granular details on the number of failed write operations.
Amazon S3 DLQ s3.dlqS3RecordsFailed.count

Threshold: >0

Statistic: SUM

Period: 5 minutes

Datapoints to alarm: 1 out of 1
For a pipeline with DLQ enabled, the records are either sent to the sink or to the DLQ (if they can't be sent to the sink). This alarm indicates the pipeline was unable to send the records to the DLQ due to some error. Refer to REF-002 in the Reference Guide below, which can help to identify the exception details.

Buffer

The following table contains details about alarm metrics for buffers.

Buffer Alarm Description Recommended Action
BlockingBuffer BlockingBuffer.bufferUsage.value

Threshold: >80

Statistic: AVERAGE

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The percent utilization, based on the number of records in the buffer. To investigate further, check whether the pipeline is bottlenecked by the processors or the sink by comparing timeElapsed.max metrics and analyzing bulkRequestLatency.max.
Persistent persistentBufferRead.recordsLagMax.value

Threshold: > 5000

Statistic: AVERAGE

Period: 5 minutes

Datapoints to alarm: 1 out of 1
The maximum lag in terms of the number of records stored in the persistent buffer. If the value for bufferUsage is low, increase the maximum OCUs. If bufferUsage is also high (>80), investigate whether the pipeline is bottlenecked by the processors or the sink.
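Because both buffer alarms lead to the same follow-up question (is the bottleneck the processors or the sink?), one option is a CloudWatch composite alarm that fires only when buffer usage and bulk-request latency are high at the same time, pointing at the sink. A sketch of the composite rule, assuming two child metric alarms with the hypothetical names shown already exist:

```python
def buffer_bottleneck_rule(pipeline: str) -> str:
    """Build a composite-alarm rule string: fire only when the buffer is
    filling AND bulk requests are slow, i.e. the sink is the likely
    bottleneck. The child alarm names are hypothetical and must match
    metric alarms you have already created."""
    return (f'ALARM("{pipeline}-bufferUsage") AND '
            f'ALARM("{pipeline}-bulkRequestLatency")')

rule = buffer_bottleneck_rule("my-ingestion-pipeline")
print(rule)
# boto3.client("cloudwatch").put_composite_alarm(
#     AlarmName="my-ingestion-pipeline-sink-bottleneck", AlarmRule=rule)
```

If only the buffer-usage alarm fires while latency stays low, the processors are the more likely suspect; Scenario 1 below walks through that triage in detail.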

Reference Guide

The following sections provide guidance for resolving common pipeline issues, along with general reference material.

REF-001: WARN-level Log Review

Review WARN-level logs in the pipeline logs to identify the exception details.

REF-002: ERROR-level Log Review

Review ERROR-level logs in the pipeline logs to identify the exception details.

REF-003: S3 Objects Failed

When troubleshooting increasing s3ObjectsFailed.count values, monitor these specific metrics to narrow down the root cause:

  • s3ObjectsAccessDenied.count – This metric increments when the pipeline encounters Access Denied or Forbidden errors while reading S3 objects. Common causes include:
    • Insufficient permissions in the pipeline role.
    • A restrictive S3 bucket policy not allowing the pipeline role access.
    • For cross-account S3 buckets, an incorrectly configured bucket_owners mapping.
  • s3ObjectsNotFound.count – This metric increments when the pipeline receives Not Found errors while attempting to read S3 objects.

For further assistance with the recommended actions, contact AWS Support.

REF-004: Configuring an alarm for the difference between totalOpenShards.max and activeShardsInProcessing.value for the Amazon DynamoDB source

  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
  2. In the navigation pane, choose Alarms, then All alarms.
  3. Choose Create alarm.
  4. Choose Select Metric.
  5. Select Source.
  6. Under Source, the following JSON can be used after updating the placeholder values.
    {
        "metrics": [
            [ { "expression": "m1-e1", "label": "Expression2", "id": "e2", "period": 900 } ],
            [ { "expression": "FLOOR((m2/15)+0.5)", "label": "Expression1", "id": "e1", "visible": false, "period": 900 } ],
            [ "AWS/OSIS", ".dynamodb.totalOpenShards.max", "PipelineName", "", { "stat": "Maximum", "id": "m1", "visible": false } ],
            [ ".", ".dynamodb.activeShardsInProcessing.value", ".", ".", { "stat": "Average", "id": "m2", "visible": false } ]
        ],
        "view": "timeSeries",
        "stacked": false,
        "interval": 900,
        "region": ""
    }
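If you prefer to create the REF-004 alarm programmatically, the same metric math can be passed to put_metric_alarm via its Metrics parameter. The following sketch mirrors the JSON above; the pipeline name is a placeholder, and it assumes the metric names carry the pipeline name as a prefix, as the elided names in the console JSON suggest:

```python
import json

PIPELINE = "my-ingestion-pipeline"  # hypothetical; also assumed as metric prefix

def shard_misalignment_alarm():
    """Alarm on totalOpenShards.max minus the (rounded) activeShardsInProcessing
    value, mirroring the metric-math JSON above (3 out of 3 datapoints)."""
    dims = [{"Name": "PipelineName", "Value": PIPELINE}]
    return {
        "AlarmName": f"{PIPELINE}-shard-misalignment",
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "Metrics": [
            {"Id": "e2", "Expression": "m1 - e1", "Label": "misalignment",
             "ReturnData": True},
            {"Id": "e1", "Expression": "FLOOR((m2/15)+0.5)", "ReturnData": False},
            {"Id": "m1", "ReturnData": False, "MetricStat": {
                "Metric": {"Namespace": "AWS/OSIS",
                           "MetricName": f"{PIPELINE}.dynamodb.totalOpenShards.max",
                           "Dimensions": dims},
                "Period": 900, "Stat": "Maximum"}},
            {"Id": "m2", "ReturnData": False, "MetricStat": {
                "Metric": {"Namespace": "AWS/OSIS",
                           "MetricName": f"{PIPELINE}.dynamodb.activeShardsInProcessing.value",
                           "Dimensions": dims},
                "Period": 900, "Stat": "Average"}},
        ],
    }

params = shard_misalignment_alarm()
print(json.dumps(params, indent=2))
# boto3.client("cloudwatch").put_metric_alarm(**params)  # uncomment to create
```

Only the e2 query returns data, so the alarm evaluates the difference expression rather than either raw metric.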

Let's review a couple of scenarios based on the above metrics.

Scenario 1 – Understand and Lower Pipeline Latency

Latency within a pipeline is built up of three main components:

  • The time it takes to send documents via bulk requests to OpenSearch,
  • the time it takes for data to go through the pipeline processors, and
  • the time that data sits in the pipeline buffer.

Bulk requests and processors (the first two items in the preceding list) are the root causes of the buffer building up and latency increasing.

To monitor how much data is being stored in the buffer, watch the bufferUsage.value metric. The only way to lower latency within the buffer is to optimize the pipeline processors and the sink bulk request latency, depending on which of those is the bottleneck.

The bulkRequestLatency metric measures the time taken to execute bulk requests, including retries, and can be used to monitor write performance to the OpenSearch sink. If this metric reports an unusually high value, it indicates that the OpenSearch sink may be overloaded, causing increased processing time. To troubleshoot further, review the bulkRequestNumberOfRetries.count metric to check whether the high latency is due to rejections from OpenSearch that lead to retries, such as throttling (429 errors) or other causes. If document errors are present, examine the configured DLQ to identify the failed document details. Additionally, the max_retries parameter can be configured in the pipeline configuration to limit the number of retries. However, if the documentErrors metric reports zero, the bulkRequestNumberOfRetries.count is also zero, and the bulkRequestLatency stays high, it's likely an indicator that the OpenSearch sink is overloaded. In this case, review the destination metrics for additional details.

If the bulkRequestLatency metric is low (for example, less than 1.5 seconds) and the bulkRequestNumberOfRetries metric is reported as 0, then the bottleneck is likely within the pipeline processors. To monitor the performance of the processors, review the .timeElapsed.avg metric. This metric reports the time taken for the processor to complete processing of a batch of records. For example, if a grok processor is reporting a much higher value than other processors for timeElapsed, it may be due to a slow grok pattern that can be optimized or even replaced with a more performant processor, depending on the use case.
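The triage logic in the two paragraphs above can be sketched as a small decision helper. This is illustrative only: the 1.5-second cutoff comes from the example in the text, while the 1-second processor cutoff and the function shape are assumptions.

```python
def diagnose_latency(document_errors: int, bulk_retries: int,
                     bulk_latency_s: float, max_processor_elapsed_s: float) -> str:
    """Classify the likely pipeline bottleneck from the metrics discussed above.
    Inputs: documentErrors.count, bulkRequestNumberOfRetries.count,
    bulkRequestLatency (seconds), and the highest <processor>.timeElapsed.avg."""
    if bulk_latency_s >= 1.5:
        if document_errors > 0:
            return "document errors: inspect the DLQ for failed documents"
        if bulk_retries > 0:
            return "retries/rejections from OpenSearch (e.g. 429 throttling)"
        return "sink overloaded: review destination (domain) metrics"
    if max_processor_elapsed_s > 1.0:
        return "processors: review <processor>.timeElapsed.avg, e.g. slow grok patterns"
    return "no obvious bottleneck in sink or processors"

print(diagnose_latency(0, 0, 4.2, 0.1))   # high latency, no errors/retries -> sink
print(diagnose_latency(0, 0, 0.3, 2.5))   # low latency, slow processor -> processors
```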

Scenario 2 – Understanding and Resolving Document Errors to OpenSearch

The documentErrors.count metric tracks the number of documents that failed to be sent by bulk requests. The failure can happen due to various causes such as mapping conflicts, invalid data formats, or schema mismatches. When this metric reports a non-zero value, it indicates that some documents are being rejected by OpenSearch. To identify the root cause, examine the configured dead-letter queue (DLQ), which captures the failed documents along with error details. The DLQ provides information about why specific documents failed, enabling you to identify patterns such as incorrect field types, missing required fields, or data that exceeds size limits. For example, see the sample DLQ objects for common issues below:

Mapper parsing exception:

{"dlqObjects": [{
        "pluginId": "opensearch",
        "pluginName": "opensearch",
        "pipelineName": "",
        "failedData": {
            "index": "",
            "indexId": null,
            "status": 400,
            "message": "failed to parse field [] of type [integer] in document with id ''. Preview of field's value: 'N/A' due to For input string: \"N/A\"",
            "document": {}
        },
        "timestamp": "…"
    }]}

Here, OpenSearch can't store the text string "N/A" in a field that accepts only numbers, so it rejects the document and stores it in the DLQ.

Limit of total fields exceeded:

{"dlqObjects": [{
        "pluginId": "opensearch",
        "pluginName": "opensearch",
        "pipelineName": "",
        "failedData": {
            "index": "",
            "indexId": null,
            "status": 400,
            "message": "Limit of total fields [] has been exceeded",
            "document": {}
        },
        "timestamp": "…"
    }]}

The index.mapping.total_fields.limit setting controls the maximum number of fields allowed in an index mapping, and exceeding this limit causes indexing operations to fail. You can check whether all of these fields are required, or leverage the various processors provided by OpenSearch Ingestion to transform the data.

Once these issues are identified, you can either correct the source data, adjust the pipeline configuration to transform the data appropriately, or modify the OpenSearch index mapping to accommodate the incoming data format.
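To spot patterns across many DLQ entries, you can group them by status code and message prefix. The following sketch uses an inline sample adapted from the mapper-parsing example above (the index and field names are hypothetical); in practice you would read the objects from the DLQ's S3 bucket instead:

```python
import json
from collections import Counter

# Inline sample DLQ object, adapted from the mapper-parsing example above.
SAMPLE = json.dumps({"dlqObjects": [{
    "pluginId": "opensearch",
    "failedData": {
        "index": "logs",      # hypothetical index name
        "status": 400,
        "message": "failed to parse field [bytes] of type [integer] ...",
    },
}]})

def summarize_dlq(raw_objects):
    """Count DLQ failures by (status, first few words of the error message)."""
    counts = Counter()
    for raw in raw_objects:
        for entry in json.loads(raw)["dlqObjects"]:
            failed = entry["failedData"]
            prefix = " ".join(failed["message"].split()[:4])
            counts[(failed["status"], prefix)] += 1
    return counts

summary = summarize_dlq([SAMPLE, SAMPLE])
for (status, prefix), n in summary.items():
    print(f"{n}x status={status}: {prefix}")
```

Grouping by a short message prefix collapses per-document variation (ids, field values) so the dominant failure mode stands out.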

Clean up

When setting up alarms for monitoring your OpenSearch Ingestion pipelines, it's important to be mindful of the potential costs involved. Each alarm you configure incurs charges based on the CloudWatch pricing model.

To avoid unnecessary expenses, we recommend carefully evaluating your alarm requirements and configuring them accordingly. Only set up the alarms that are essential for your use case, and regularly review your alarm configurations to identify and remove unused or redundant alarms.

Conclusion

In this post, we explored the monitoring capabilities for OpenSearch Ingestion pipelines through CloudWatch alarms, covering key metrics across various sources, processors, and sinks. Although this post highlights the most critical metrics, there's more to discover. For a deeper dive, refer to the following resources:

Effective monitoring through CloudWatch alarms is crucial for maintaining healthy ingestion pipelines and sustaining optimal data flow.


About the authors

Utkarsh Agarwal


Utkarsh is a Cloud Support Engineer in the Support Engineering team at AWS. He provides guidance and technical assistance to customers, helping them build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies and TV series, and of course cricket! Lately, he's also been trying to master foosball.

Ramesh Chirumamilla


Ramesh is a Technical Manager with Amazon Web Services. In his role, Ramesh works proactively to help craft and execute strategies to drive customers' adoption and use of AWS services. He uses his experience with Amazon OpenSearch Service to help customers cost-optimize their OpenSearch domains by helping them right-size and implement best practices.

Taylor Gray


Taylor is a Software Engineer on the Amazon OpenSearch Ingestion team at Amazon Web Services. He has contributed many features within both Data Prepper and OpenSearch Ingestion to enable scalable solutions for customers. In his free time, he enjoys pickleball, reading, and playing Rocket League.
