
Technical Approaches and Practical Tradeoffs



In the world of monitoring software, how you process telemetry data can significantly impact your ability to derive insights, troubleshoot issues, and manage costs.

There are two primary use cases for how telemetry data is leveraged:

  • Radar (monitoring of systems) usually falls into the bucket of known knowns and known unknowns. This leads to scenarios where some data is almost 'pre-determined' to behave, or be plotted, in a certain way – because we know what we're looking for.
  • Blackbox (debugging, RCA, etc.), on the other hand, is more about unknown unknowns — what we don't know and may need to hunt for to build an understanding of the system.

Understanding Telemetry Data Challenges

Before diving into processing approaches, it's important to understand the unique challenges of telemetry data:

  • Volume: Modern systems generate enormous amounts of telemetry data
  • Velocity: Data arrives in continuous, high-throughput streams
  • Variety: Multiple formats across metrics, logs, traces, profiles and events
  • Time-sensitivity: Value often decreases with age
  • Correlation needs: Data from different sources must be linked together

These characteristics create specific problems when choosing between ETL and ELT approaches.


ETL for Telemetry: Transform-First Architecture

Technical Architecture

In an ETL approach, telemetry data undergoes transformation before reaching its final destination:

Fig. 1 — ETL for Telemetry

A typical implementation stack might include:

  • Collection: OpenTelemetry, Prometheus, Fluent Bit
  • Transport: Kafka or Kinesis, or in-memory, as the buffering layer
  • Transformation: Stream processing
  • Storage: Time-series databases (Prometheus), specialized indices, or object storage (S3)

Key Technical Components

  1. Aggregation Strategies

Pre-aggregation significantly reduces data volume and query complexity. A typical pre-aggregation flow looks like this:

Fig. 2 — Aggregation Strategies

This transformation condenses raw data into 5-minute summaries, dramatically reducing storage requirements and improving query performance.

Example: For a gaming application handling millions of requests per day, raw request latency metrics (potentially billions of data points) can be grouped by service and endpoint, then aggregated into 5-minute (or 1-minute) windows. A single API call that generates 100 latency data points per second (8.64 million per day) is reduced to just 288 aggregated entries per day (one per 5-minute window), while still preserving the critical p50/p90/p99 percentiles needed for SLA monitoring.
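
To make the flow concrete, here is a minimal Python sketch of such a windowing step. The record shape (timestamp, service, endpoint, latency_ms) and the nearest-rank percentile are illustrative assumptions; a real stream processor would compute this incrementally rather than in batch:

```python
# A minimal sketch of 5-minute pre-aggregation; record shape is assumed.
from collections import defaultdict
import statistics

WINDOW_SECONDS = 300  # 5-minute windows, matching the example above

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    idx = max(0, int(round(p / 100 * len(sorted_vals))) - 1)
    return sorted_vals[idx]

def aggregate(points):
    """points: iterable of (timestamp, service, endpoint, latency_ms)."""
    buckets = defaultdict(list)
    for ts, service, endpoint, latency_ms in points:
        window_start = int(ts) - int(ts) % WINDOW_SECONDS
        buckets[(window_start, service, endpoint)].append(latency_ms)
    for (window_start, service, endpoint), vals in sorted(buckets.items()):
        vals.sort()
        yield {
            "window_start": window_start, "service": service,
            "endpoint": endpoint, "count": len(vals),
            "avg": statistics.fmean(vals),
            "p50": percentile(vals, 50),
            "p90": percentile(vals, 90),
            "p99": percentile(vals, 99),
        }
```

Each yielded row replaces up to 30,000 raw points (100/sec × 300 sec) per service/endpoint pair, which is where the 8.64M-to-288 reduction in the example comes from.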

  2. Cardinality Management

High-cardinality metrics can break time-series databases. The cardinality management process follows this pattern:

Fig. 3 — Cardinality Management

Effective strategies include:

  • Label filtering and normalization
  • Strategic aggregation of specific dimensions
  • Hashing techniques for high-cardinality values while preserving query patterns

Example: A microservice monitoring HTTP requests includes user IDs and request paths in its metrics. With 50,000 daily active users and thousands of unique URL paths, this creates millions of unique label combinations. The cardinality management system filters out user IDs entirely (configurable, too high cardinality), normalizes URL paths by replacing dynamic segments with placeholders (e.g., /users/123/profile becomes /users/{id}/profile), and applies consistent categorization to errors. This reduces unique time series from millions to hundreds, allowing the time-series database to function efficiently.
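
A minimal sketch of the filtering and normalization step follows; the drop list and the regex are illustrative assumptions (real systems would make both configurable):

```python
# Sketch of label filtering and path normalization; rules are illustrative.
import re

DROP_LABELS = {"user_id"}  # configurable: too high-cardinality to keep

# Replace numeric ID segments with a placeholder, e.g.
# /users/123/profile -> /users/{id}/profile
ID_SEGMENT = re.compile(r"/\d+(?=/|$)")

def normalize_labels(labels: dict) -> dict:
    out = {k: v for k, v in labels.items() if k not in DROP_LABELS}
    if "path" in out:
        out["path"] = ID_SEGMENT.sub("/{id}", out["path"])
    return out

print(normalize_labels({"path": "/users/123/profile", "user_id": "u-42"}))
# -> {'path': '/users/{id}/profile'}
```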

  3. Real-time Enrichment

Adding context to metrics during the transformation phase involves integrating external data sources:

Fig. 4 — Real-time Enrichment

This process adds critical business and operational context to raw telemetry data, enabling more meaningful analysis and alerting based on service importance, customer impact, and other factors beyond pure technical metrics.

Example: A payment processing service emits basic metrics like request counts, latencies, and error rates. The enrichment pipeline joins this telemetry with service registry data to add metadata about the service tier (critical), SLO targets (99.99% availability), and team ownership (payments-team). It then incorporates business context to tag transactions with their type (subscription renewal, one-time purchase, refund) and estimated revenue impact. When an incident occurs, alerts are automatically prioritized based on business impact rather than just technical severity, and routed to the appropriate team with rich context.
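
In code, the registry join might look something like this sketch; the registry contents and field names are assumptions for illustration:

```python
# Minimal enrichment sketch: joining metrics with a service registry.
SERVICE_REGISTRY = {
    "payments": {"tier": "critical", "slo": "99.99%", "team": "payments-team"},
}

def enrich(metric: dict, registry=SERVICE_REGISTRY) -> dict:
    meta = registry.get(metric.get("service"), {})
    return {
        **metric,
        "tier": meta.get("tier", "unknown"),
        "slo": meta.get("slo"),
        "owner": meta.get("team", "unassigned"),
    }

print(enrich({"service": "payments", "error_rate": 0.02}))
# Alert routing can now prioritize on tier/owner (business impact),
# not just the raw error_rate (technical severity).
```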

Technical Advantages

  • Query performance: Pre-calculated aggregates eliminate computation at query time
  • Predictable resource usage: Both storage and query compute are managed
  • Schema enforcement: Data conformity is guaranteed before storage
  • Optimized storage formats: Data can be stored in formats optimized for specific access patterns

Technical Limitations

  • Loss of granularity: Some detail is permanently lost
  • Schema rigidity: Adapting to new requirements requires pipeline changes
  • Processing overhead: Real-time transformation adds complexity and resource demands
  • Transformation-time decisions: Analysis paths must be known in advance

ELT for Telemetry: Raw Storage with Flexible Transformation

Technical Architecture

ELT architecture prioritizes getting raw data into storage, with transformations performed at query time:

Fig. 5 — ELT for Telemetry

A typical implementation might include:

  • Collection: OpenTelemetry, Prometheus, Fluent Bit
  • Transport: Direct ingestion without complex processing
  • Storage: Object storage (S3, GCS) or data lakes in Parquet format
  • Transformation: SQL engines (Presto, Athena), Spark jobs, or specialized OLAP systems

Key Technical Elements

Fig. 6 — Environment friendly-Uncooked-Storage

  1. Environment friendly Uncooked Storage

Optimizing for long-term storage of uncooked telemetry requires cautious consideration of file codecs and storage group:

This strategy leverages columnar storage codecs like Parquet with applicable compression (ZSTD for traces, Snappy for metrics), dictionary encoding, and optimized column indexing primarily based on widespread question patterns (trace_id, service, time ranges).

Instance:Ā A cloud-native utility generates 10TB of hint information day by day throughout its distributed providers. As a substitute of discarding or closely sampling this information, the entire hint data is captured utilizing OpenTelemetry collectors and transformed to Parquet format with ZSTD compression. Key fields like trace_id, service identify, and timestamp are listed for environment friendly querying. This strategy reduces the storage footprint by 85% in comparison with uncooked JSON whereas sustaining question efficiency. When a crucial customer-impacting concern occurred, engineers have been capable of entry full hint information from 3 months prior, figuring out a delicate sample of intermittent failures that might have been misplaced with conventional sampling.
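
As a sketch, converting raw trace records to compressed Parquet with pyarrow might look like this; the schema fields mirror the example above but are assumptions:

```python
# Sketch: raw trace records -> ZSTD-compressed Parquet via pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"trace_id": "4bf92f35", "service": "checkout", "ts": 1699358400,
     "duration_ms": 42.0},
]
table = pa.Table.from_pylist(records)

pq.write_table(
    table,
    "traces.parquet",
    compression="zstd",          # ZSTD for traces, as described above
    use_dictionary=["service"],  # dictionary-encode low-cardinality columns
)
```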

  2. Partitioning Strategies

Effective partitioning is crucial for query performance against raw telemetry. A well-designed partitioning strategy follows this hierarchy:

Fig. 7 — Partitioning Strategies

This partitioning approach enables efficient time-range queries while also allowing filtering by service and tenant, which are common query dimensions. The partitioning strategy is designed to:

  • Optimize for time-based retrieval (the most common query pattern)
  • Enable efficient tenant isolation for multi-tenant systems
  • Allow service-specific queries without scanning all data
  • Separate telemetry types for optimized storage formats per type

Example: A SaaS platform with 200+ enterprise customers uses this partitioning strategy for its observability data lake. When a high-priority customer reports an issue that occurred last Tuesday between 2-4pm, engineers can immediately query just those specific partitions: /year=2023/month=11/day=07/hour=1[4-5]/tenant=enterprise-x/*. This approach reduces the scan size from potentially petabytes to just a few gigabytes, enabling responses in seconds rather than hours. When comparing current performance against historical baselines, the time-based partitioning allows efficient month-over-month comparisons by scanning only the relevant time partitions.
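
A minimal sketch of producing such a layout with pyarrow's partitioned writes (the root path and column names are illustrative):

```python
# Sketch: write telemetry as a Hive-style partitioned dataset so readers
# can prune to year=/month=/day=/hour=/tenant= directories.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "year": [2023], "month": [11], "day": [7], "hour": [14],
    "tenant": ["enterprise-x"], "service": ["checkout"],
    "latency_ms": [42.0],
})

pq.write_to_dataset(
    table,
    root_path="telemetry-lake/traces",  # local dir here; object storage in practice
    partition_cols=["year", "month", "day", "hour", "tenant"],
)
# Produces telemetry-lake/traces/year=2023/month=11/day=7/hour=14/
# tenant=enterprise-x/<file>.parquet; queries filtering on those columns
# only touch the matching directories.
```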

  3. Query-time Transformations

SQL and analytical engines provide powerful query-time transformations. The query processing flow for on-the-fly analysis looks like this:

Fig. 8 — Query-time Transformations

This query flow demonstrates how complex analysis like calculating service latency percentiles, error rates, and usage patterns can be performed entirely at query time, without any pre-computation. The analytical engine applies optimizations like predicate pushdown, parallel execution, and columnar processing to achieve reasonable performance even against large raw datasets.

Example: A DevOps team investigating a performance regression discovered it only affected premium customers using a specific feature. Using query-time transformations against the ELT data lake, they wrote a single query that first filtered to the affected time period, joined customer tier information, extracted relevant attributes about feature usage, calculated percentile response times grouped by customer segment, and identified that premium customers with high transaction volumes were experiencing degraded performance only when a specific optional feature flag was enabled. This analysis would have been impossible with pre-aggregated data, since the customer segment + feature flag dimension hadn't previously been identified as important to monitor.
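
To illustrate the idea, a query-time analysis over raw Parquet could look like this DuckDB sketch; the path, column names, and the feature-flag filter are assumptions standing in for the team's actual query:

```python
# Sketch: ad-hoc percentiles and error rates computed at query time,
# directly over raw Parquet files; no pre-aggregation required.
import duckdb

sql = """
SELECT service,
       approx_quantile(duration_ms, 0.50) AS p50,
       approx_quantile(duration_ms, 0.99) AS p99,
       avg(CASE WHEN status_code >= 500 THEN 1.0 ELSE 0.0 END) AS error_rate
FROM read_parquet('lake/year=2023/month=11/day=7/**/*.parquet')
WHERE tenant = 'enterprise-x'
  AND feature_flag_x = true          -- a dimension never pre-aggregated
GROUP BY service
ORDER BY p99 DESC
"""
print(duckdb.sql(sql).df())
```

The key point is that feature_flag_x never had to be chosen as a dimension in advance; it is just another column in the raw data.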

Technical Advantages

  • Schema flexibility: New dimensions can be analyzed without pipeline changes
  • Cost-effective storage: Object storage is significantly cheaper than specialized DBs
  • Retroactive analysis: Historical data can be examined with new perspectives

Technical Limitations

  • Query performance challenges: Interactive analysis may be slow on large datasets
  • Resource-intensive analysis: Compute costs can be high for complex queries
  • Implementation complexity: Requires more sophisticated query tooling
  • Storage overhead: Raw data consumes significantly more space

Technical Implementation: The Hybrid Approach

Core Architecture Components

Implementation Strategy

  1. Dual-path processing

    Fig. 10 — Dual-path processing

Example: A global ride-sharing platform implemented a dual-path telemetry system that routes service health metrics and customer experience indicators (ride wait times, ETA accuracy) through the ETL path for real-time dashboards and alerting. Meanwhile, all raw data including detailed user journeys, driver activities, and application logs flows through the ELT path to cost-effective storage. When a regional outage occurred, operations teams used the real-time dashboards to quickly identify and mitigate the immediate issue. Later, data scientists used the preserved raw data to perform a comprehensive root cause analysis, correlating multiple factors that wouldn't have been visible in pre-aggregated data alone.
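
A minimal sketch of the fan-out, with hypothetical hot_path and cold_store sinks standing in for the real pipelines:

```python
# Dual-path sketch: every event lands in the cold (ELT) store, while hot
# operational signals also feed the real-time (ETL) path.
HOT_SIGNALS = {"service_health", "ride_wait_time", "eta_accuracy"}

def ingest(event: dict, hot_path: list, cold_store: list) -> None:
    cold_store.append(event)               # ELT: always preserve raw data
    if event.get("signal") in HOT_SIGNALS:
        hot_path.append(event)             # ETL: aggregate + alert in real time

hot, cold = [], []
ingest({"signal": "ride_wait_time", "value": 180}, hot, cold)
ingest({"signal": "app_log", "msg": "driver matched"}, hot, cold)
assert len(cold) == 2 and len(hot) == 1
```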

  2. Smart data routing

Fig. 11 — Smart Data Routing

Example: A financial services company deployed a smart routing system for their telemetry data. All data is preserved in the data lake, but critical metrics like transaction success rates, fraud detection signals, and authentication service health metrics are immediately routed to the real-time processing pipeline. Additionally, any security-related events such as failed login attempts, permission changes, or unusual access patterns are immediately sent to a dedicated security analysis pipeline. During a recent security incident, this routing enabled the security team to detect and respond to an unusual pattern of authentication attempts within minutes, while the complete context of user journeys and application behavior was preserved in the data lake for subsequent forensic analysis.
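
A sketch of what such predicate-based routing could look like; the rules mirror the example above, but the predicates and destination names are assumptions:

```python
# Rule-based routing sketch: everything to the lake, critical metrics to
# the real-time pipeline, security events to a dedicated pipeline.
CRITICAL_METRICS = {"txn_success_rate", "fraud_signal", "auth_health"}

RULES = [
    (lambda e: e.get("category") == "security", "security-pipeline"),
    (lambda e: e.get("metric") in CRITICAL_METRICS, "realtime-pipeline"),
]

def route(event: dict) -> list:
    destinations = ["data-lake"]           # every event is preserved
    destinations += [dest for pred, dest in RULES if pred(event)]
    return destinations

print(route({"category": "security", "metric": "failed_logins"}))
# -> ['data-lake', 'security-pipeline']
```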

  3. Unified query interface

Real-world Implementation Example

A specific engineering implementation at last9.io demonstrates how this hybrid approach works in practice:

For a large-scale Kubernetes platform with hundreds of clusters and thousands of services, we implemented a hybrid telemetry pipeline with:

  • Critical-path metrics processed through a pipeline that:
    • Performs dimensional reduction (limiting label combinations)
    • Pre-calculates service-level aggregations
    • Computes derived metrics like success rates and latency percentiles
  • Raw telemetry stored in a cost-effective data lake:
    • Partitioned by time, data type, and tenant
    • Optimized for typical query patterns
    • Compressed with appropriate codecs (ZSTD for traces, Snappy for metrics)
  • Unified query layer that:
    • Routes dashboard and alerting queries to pre-aggregated storage
    • Redirects exploratory and ad-hoc analysis to the data lake
    • Manages correlation queries across both systems (see the routing sketch below)

Fig. 12 — Unified query interface
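
As a sketch, the routing decision inside such a query layer might reduce to something like this; the backend names and heuristics are illustrative, not Last9's internals:

```python
# Unified query layer sketch: pick a backend per query.
def choose_backend(query: dict) -> str:
    if query.get("kind") in {"dashboard", "alert"}:
        return "pre-aggregated-store"   # low latency, pre-computed answers
    if query.get("correlates_tiers"):
        return "federated"              # stitch results from both systems
    return "data-lake"                  # exploratory / ad-hoc analysis

assert choose_backend({"kind": "dashboard"}) == "pre-aggregated-store"
assert choose_backend({"kind": "adhoc"}) == "data-lake"
```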

This approach delivered both the query performance needed for real-time operations and the analytical depth required for complex troubleshooting.

Decision Framework

When architecting telemetry pipelines, these technical considerations should guide your approach:

Decision Factor            | Use ETL      | Use ELT
-------------------------- | ------------ | ----------------
Query latency requirements | < 1 second   | Can wait minutes
Data retention needs       | Days/Weeks   | Months/Years
Cardinality                | Low/Medium   | Very high
Analysis patterns          | Well-defined | Exploratory
Budget priority            | Compute      | Storage
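
Restated as a rough scoring helper; the factors come from the table, but the majority-vote threshold is an illustrative assumption, not a prescriptive rule:

```python
# The decision table above as a simple majority vote across the factors.
def recommend(latency_slo_s: float, retention_days: int,
              cardinality: str, well_defined: bool,
              budget_priority: str) -> str:
    etl_votes = sum([
        latency_slo_s < 1,                # sub-second query latency
        retention_days <= 30,             # days/weeks of retention
        cardinality in {"low", "medium"},
        well_defined,                     # known analysis patterns
        budget_priority == "compute",
    ])
    return "ETL" if etl_votes >= 3 else "ELT"

print(recommend(0.5, 14, "low", True, "compute"))         # -> ETL
print(recommend(60, 365, "very high", False, "storage"))  # -> ELT
```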

Conclusion

The technical realities of telemetry data processing demand thinking beyond simple ETL vs. ELT paradigms. Engineering teams should architect tiered systems that leverage the strengths of both approaches:

  • ETL-processed data for operational use cases requiring immediate insights
  • ELT-processed data for deeper analysis, troubleshooting, and historical patterns
  • Metadata-driven routing to intelligently direct queries to the appropriate tier

This engineering-centric approach balances performance requirements with cost considerations while maintaining the flexibility required in modern observability systems.

About the author: Nishant Modak is the founder and CEO of Last9, a high-cardinality observability platform company backed by Sequoia India (now PeakXV). He has been an entrepreneur and has worked with large-scale companies for nearly 20 years.

Related Items:

From ETL to ELT: The Next Generation of Data Integration Success

Can We Stop Doing ETL Yet?

50 Years Of ETL: Can SQL For ETL Be Replaced?
