
How Nexthink built real-time alerts with Amazon Managed Service for Apache Flink


This post is cowritten with Nikos Tragaras and Raphaël Afanyan from Nexthink.

In this post, we describe Nexthink's journey as they implemented a new real-time alerting system using Amazon Managed Service for Apache Flink. We explore the architecture, the rationale behind key technology choices, and the Amazon Web Services (AWS) services that enabled a scalable and efficient solution.

Nexthink is a pioneering leader in digital employee experience (DEX). With a mission to empower IT teams and elevate workplace productivity, Nexthink's Infinity platform offers real-time visibility into end-user environments, actionable insights, and robust automation capabilities. By combining real-time analytics, proactive monitoring, and intelligent automation, Infinity enables organizations to deliver an optimal digital workspace.

In the past 5 years, Nexthink completed its transformation into a fully fledged cloud platform that processes trillions of events per day, reaching over 5 GB per second of aggregated throughput. Internally, Infinity comprises more than 300 microservices that use the power of Apache Kafka through Amazon Managed Streaming for Apache Kafka (Amazon MSK) for data ingestion and intra-service communication. The Nexthink ecosystem includes several hundred Micronaut-based Java microservices deployed in Amazon Elastic Kubernetes Service (Amazon EKS). The vast majority of microservices interact with Kafka through the Kafka Streams framework.

Nexthink alerting system

To help you understand Nexthink's journey toward a new real-time alerting solution, we begin by examining the existing system and the evolving requirements that led them to seek a new solution.

Nexthink's current alerting system provides near real-time notifications, helping users detect and respond to critical events quickly. While effective, this approach has limitations in scalability, flexibility, and real-time processing capabilities.

Nexthink gathers telemetry data from thousands of customers' laptops, covering CPU usage, memory, software versions, network performance, and more. Amazon MSK and ClickHouse serve as the backbone for this data pipeline. All endpoint data is ingested into multi-tenant Kafka topics, which are processed and finally stored in a ClickHouse database.

Using the current alerting system, clients can define monitoring rules in Nexthink Query Language (NQL), which are evaluated in near real time by polling the database every 15 minutes. Alerts are triggered when anomalies are detected against client-defined thresholds or long-term baselines. This process is illustrated in the following architecture diagram.

Initially, database polling allowed great flexibility in the evaluation of complex alerts. However, this approach placed heavy stress on the database. As the company grew and supported larger customers with more endpoints and monitors, the database experienced increasingly heavy loads.

Evolution to a new use case: Real-time alerts

As Nexthink expanded its data collection to include virtual desktop infrastructure (VDI), the need for real-time alerting became even more critical. Unlike traditional endpoints, such as laptops, where events are gathered every 5 minutes, VDI data is ingested every 30 seconds, significantly increasing the volume and frequency of data. The existing architecture relied on database polling to evaluate alerts, running at a 15-minute interval. This approach was inadequate for the new VDI use case, where alerts needed to be evaluated in near real time on messages arriving every 30 seconds. Simply increasing the polling frequency wasn't a viable option because it would place excessive load on the database, leading to performance bottlenecks and scalability challenges. To meet these new demands efficiently, we shifted to real-time alert evaluation directly on Kafka topics.

Technology choices

As we evaluated solutions for our real-time alerting system, we analyzed two main technology options: Apache Kafka Streams and Apache Flink. Each option had benefits and limitations that needed to be considered.

All Nexthink microservices up to that point integrated with Kafka using Apache Kafka Streams. In practice, we have observed several benefits:

  • Lightweight and seamless integration, with no need for additional infrastructure.
  • Low latency, using RocksDB as a local key-value store.
  • Team expertise. Nexthink teams had been writing microservices with Kafka Streams for a long time and felt very comfortable using it.

In some use cases, however, we found that there were significant limitations:

  • Scalability – Scalability was constrained by the tight coupling between the parallelism of microservices and the number of partitions in Kafka topics. Many microservices had already scaled out to match the partition count of the topics they consumed, limiting their ability to scale further. One potential solution was increasing the partition count. However, this approach introduced significant operational overhead, especially with microservices consuming topics owned by other domains. It required rebalancing the entire Kafka cluster and needed coordination across multiple teams. Moreover, such changes impacted downstream services, requiring careful reconfiguration of stateful processing. The alternative approach would be to introduce intermediate topics to redistribute the workload, but this would add complexity to the data pipeline and increase resource consumption on Kafka. These challenges made it clear that a more flexible and scalable approach was needed.
  • State management – Services that needed to create large KTables in memory had an increased startup time. Also, in cases where the internal state was large in volume, we found that it applied significant load to the Kafka cluster during the creation of the internal state.
  • Late event processing – In windowing operations, late events had to be managed manually with techniques that complicated the codebase.

Seeking an alternative that could help us overcome the challenges posed by our current system, we decided to evaluate Flink. Its robust streaming capabilities, scalability, and flexibility made it an excellent choice for building real-time alerting systems based on Kafka topics. Several advantages made Flink particularly appealing:

  • Native integration with Kafka – Flink offers native connectors for Kafka, which is a central component in the Nexthink ecosystem.
  • Event-time processing and support for late events – Flink allows messages to be processed based on the event time (that is, when the event actually occurred) even when they arrive out of order. This feature is crucial for real-time alerts because it ensures their accuracy.
  • Scalability – Flink's distributed architecture allows it to scale horizontally, independently of the number of partitions in the Kafka topics. This feature weighed heavily in our decision-making because the dependence on the number of partitions was a strong limitation in our platform up to this point.
  • Fault tolerance – Flink supports checkpoints, allowing managed state to be persisted and ensuring consistent recovery in case of failures. Unlike Kafka Streams, which relies on Kafka itself for long-term state persistence (adding extra load to the cluster), Flink's checkpointing mechanism operates independently and runs out of band, minimizing the impact on Kafka while providing efficient state management.
  • Amazon Managed Service for Apache Flink – Amazon Managed Service for Apache Flink is a fully managed service that simplifies the deployment, scaling, and management of Flink applications for real-time data processing. By eliminating the operational complexities of managing Flink clusters, AWS enables organizations to focus on building and running real-time analytics and event-driven applications efficiently. Amazon Managed Service for Apache Flink provided us with significant flexibility. It streamlined our evaluation process, which meant we could quickly set up a proof-of-concept environment without getting into the complexities of managing an internal Flink cluster. Moreover, by reducing the overhead of cluster management, it made Flink a viable technology choice and accelerated our delivery timeline.
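The checkpoint-based recovery model can be illustrated with a small plain-Java simulation (a hypothetical sketch, not Flink code): an operator keeps running counts, periodically snapshots its state together with the source offset it covers, and after a failure restores the snapshot and replays only the uncovered tail.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of checkpoint-style recovery: state is snapshotted
// out of band together with the source offset, so a restart replays only
// the events processed since the last checkpoint.
public class CheckpointSketch {
    Map<String, Integer> state = new HashMap<>();
    int offset = 0;

    // Last completed checkpoint: a copy of the state plus the offset it covers.
    Map<String, Integer> checkpointState = new HashMap<>();
    int checkpointOffset = 0;

    void process(String key) {
        state.merge(key, 1, Integer::sum);
        offset++;
    }

    void checkpoint() {
        checkpointState = new HashMap<>(state); // copy taken out of band, no Kafka involved
        checkpointOffset = offset;
    }

    void recover() {
        state = new HashMap<>(checkpointState); // discard uncheckpointed progress
        offset = checkpointOffset;
    }

    public static Map<String, Integer> run(List<String> events) {
        CheckpointSketch op = new CheckpointSketch();
        for (int i = 0; i < events.size(); i++) {
            op.process(events.get(i));
            if ((i + 1) % 3 == 0) op.checkpoint(); // periodic checkpoint
        }
        op.recover(); // simulate a crash after the last checkpoint
        for (int i = op.offset; i < events.size(); i++) {
            op.process(events.get(i)); // replay the uncovered tail
        }
        return op.state;
    }
}
```

After recovery and replay, the counts match what a failure-free run would produce, which is the consistency guarantee checkpoints provide.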

Solution

After careful evaluation of both options, we chose Apache Flink as our solution due to its superior scalability, robust event-time processing, and efficient state management capabilities. Here's how we implemented our new real-time alerting system.

The following diagram shows the solution architecture.

The first use case was to detect issues with VDI. However, our intention was to build a generic solution that would give us the option to onboard, in the future, existing use cases currently implemented through polling. We wanted to maintain a common way of configuring monitoring conditions and allow alert evaluation both with polling and in real time, depending on the type of system being monitored.

This solution comprises several parts:

  • Monitor configuration – Using Nexthink Query Language (NQL), the alerts administrator defines a monitor that specifies, for example:
    • Data source – VDI events
    • Time window – Every 30 seconds
    • Metric – Average network latency, grouped by desktop pool
    • Trigger condition(s) – Latency exceeding 300 ms for a continuous period of 5 minutes

This monitor configuration is then saved in an internally developed document store and propagated downstream in a Kafka topic.
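As an illustration, the monitor above could map to a simple configuration object once propagated on Kafka. The following is a hypothetical sketch; the field names and the `vdiLatencyExample` helper are assumptions for illustration, not Nexthink's actual schema.

```java
// Hypothetical sketch of a monitor configuration as it might travel downstream;
// field names are illustrative, not Nexthink's actual schema.
public record MonitorConfig(
        String dataSource,       // e.g. "vdi_events"
        int windowSeconds,       // aggregation window, e.g. 30
        String metric,           // e.g. "avg_network_latency"
        String groupBy,          // e.g. "desktop_pool"
        double thresholdMs,      // trigger threshold, e.g. 300 ms
        int sustainedSeconds     // how long the breach must persist, e.g. 300
) {
    // The example monitor from the bullet list above.
    public static MonitorConfig vdiLatencyExample() {
        return new MonitorConfig("vdi_events", 30, "avg_network_latency",
                "desktop_pool", 300.0, 300);
    }
}
```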

  • Data processing using Generic Stream Services – The Nexthink Collector, an agent installed on endpoints, captures and reports various kinds of activities from the VDI endpoints where it's installed. These events are forwarded to Amazon MSK in one of Nexthink's production virtual private clouds (VPCs) and are consumed by Java microservices running on Amazon EKS belonging to multiple domains within Nexthink.

One of them is Generic Stream Services, a system that processes the collected events and aggregates them in buckets of 30 seconds. This component works as a self-service for all the feature teams in Nexthink and can query and aggregate data from an NQL query. This way, we were able to maintain a unified user experience for monitor configuration using NQL, regardless of how alerts were evaluated. This component is broken down into two services:

    • GS processor – Consumes raw VDI session events and applies initial processing
    • GS aggregator – Groups and aggregates the data according to the monitor configuration
  • Real-time monitoring using Flink – Static threshold alerting and seasonal change detection, which identifies variations in data that follow a recurring pattern over time, are the two types of detection that we offer for VDI issues. The system splits the processing between two applications:
    • Baseline application – Calculates statistical baselines with seasonality using a time-of-day anomaly algorithm. For example, the latency by VDI client location or the CPU queue length of a desktop pool.
    • Alert application – Generates alerts based on user-defined thresholds when the expected values don't change over time, or on dynamic thresholds based on baselines, which trigger when a metric deviates from an expected pattern.

The following diagram illustrates how we join VDI metrics with monitor configurations, aggregate data using sliding time windows, and evaluate threshold rules, all within Apache Flink. From this process, alerts are generated and are then grouped and filtered before being processed further by the consumers of alerts.
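The static-threshold evaluation can be sketched in plain Java (an illustrative simulation under assumed semantics, not the actual Flink operator): with 30-second buckets and a "300 ms for 5 minutes" condition, an alert fires only when every bucket in the trailing 5-minute window breaches the threshold.

```java
import java.util.List;

// Illustrative sketch, not the actual Flink operator: fire an alert only when
// every 30-second latency bucket in the trailing 5-minute period exceeds the
// configured threshold.
public class ThresholdEvaluator {
    static final double THRESHOLD_MS = 300.0;
    static final int BUCKETS_REQUIRED = 10; // 5 min of 30-second buckets

    // Returns true if the last BUCKETS_REQUIRED bucket values all breach the threshold.
    public static boolean shouldTrigger(List<Double> latencyBuckets) {
        if (latencyBuckets.size() < BUCKETS_REQUIRED) {
            return false; // not enough history yet to cover the sustained period
        }
        return latencyBuckets
                .subList(latencyBuckets.size() - BUCKETS_REQUIRED, latencyBuckets.size())
                .stream()
                .allMatch(v -> v > THRESHOLD_MS);
    }
}
```

A single bucket back under the threshold resets the condition, which matches the "continuous period" wording of the trigger.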

  • Alert processing and notifications – After an alert is triggered (when a threshold is exceeded) or recovered (when a metric returns to normal levels), the system assesses its impact to prioritize response through the impact processing module. Alerts are then consumed by notification services that send messages through emails or webhooks. The alert and impact data are then ingested into a time series database.

Benefits of the new architecture

One of the key advantages of adopting a streaming-based approach over polling was its ease of configuration and management, especially for a small team of three engineers. There was no need for cluster management, so all we needed to do was provision the service and start coding.

Given our prior experience with Kafka and Kafka Streams, combined with the simplicity of a managed service, we were able to quickly develop and deploy a new alerting system without the overhead of complex infrastructure setup. We used Amazon Managed Service for Apache Flink to spin up a proof of concept within a few hours, which meant the team could focus on defining the business logic without concerns related to cluster management.

Initially, we were concerned about the challenges of joining multiple Kafka topics. With our previous Kafka Streams implementation, joined topics required identical partition keys, a constraint known as co-partitioning. This created an inflexible architecture, particularly when integrating topics across different business domains. Each domain naturally had its own optimal partitioning strategy, forcing difficult compromises.

Amazon Managed Service for Apache Flink solved this problem through its internal data partitioning capabilities. Although Flink still incurs some network traffic when redistributing data across the cluster during joins, the overhead is practically negligible. The resulting architecture is both more scalable (because topics can be scaled independently based on their specific throughput requirements) and easier to maintain, without complex partition alignment concerns.
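The contrast with co-partitioning can be sketched with a minimal hash redistribution in plain Java (illustrative only): records with the same key are routed to the same downstream parallel task based solely on the key and the operator's parallelism, the way Flink's keyBy shuffle works conceptually, so the source topics' partition counts no longer have to match.

```java
// Illustrative sketch of key-based redistribution: regardless of how many
// Kafka partitions a topic has, records with the same key are routed to the
// same downstream parallel task, so joined topics need not be co-partitioned.
public class KeyRouter {
    // Stable task assignment derived only from the record key and the
    // operator parallelism; the source topic's partition count plays no role.
    public static int taskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

Two streams keyed by the same field land on the same task for a given key even if one topic has 6 partitions and the other 24, which is what makes the join possible without co-partitioning.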

This significantly improved our ability to detect and respond to VDI performance degradations in real time while keeping our architecture clean and efficient.

Lessons learned

As with any new technology, adopting Flink for real-time processing came with its own set of challenges and insights.

One of the main difficulties we encountered was observing Flink's internal state. Unlike Kafka Streams, where the internal state is by default backed by a Kafka topic from which its contents can be visualized, Flink's architecture makes it inherently difficult to inspect what is happening inside a running job. This required us to invest in robust logging and monitoring practices to better understand what is happening during execution and debug issues effectively.

Another important insight emerged around late event handling, specifically, managing events with timestamps that fall within a time window's boundaries but arrive after that window has closed. Amazon Managed Service for Apache Flink addresses this challenge through its built-in watermarking mechanism. A watermark is a timestamp-based threshold that indicates when Flink should consider all events before a specific time to have arrived. This allows the system to make informed decisions about when to process time-based operations like window aggregations. Watermarks flow through the streaming pipeline, enabling Flink to track the progress of event time processing even with out-of-order events.

Although watermarks provide a mechanism to handle late data, they introduce challenges when dealing with multiple input streams operating at different speeds. Watermarks work well when processing events from a single source but can become problematic when joining streams with varying velocities. This is because they can lead to unintended delays or premature data discards. For example, a slow stream can hold back processing across the entire pipeline, and an idle stream might cause premature window closing. Our implementation required careful tuning of watermark strategies and allowed-lateness parameters to balance processing timeliness with data completeness.
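The watermark mechanics described above can be sketched with a minimal bounded-out-of-orderness generator in plain Java (an illustrative model; Flink's actual WatermarkGenerator interface differs): the watermark trails the highest event timestamp seen so far by a fixed lateness bound, and an event whose timestamp falls at or behind the watermark is considered late.

```java
// Minimal sketch of a bounded-out-of-orderness watermark: the watermark trails
// the maximum event timestamp seen so far by a fixed allowed lateness, and an
// event is late once its timestamp falls at or behind the watermark.
public class WatermarkSketch {
    final long maxOutOfOrdernessMs;
    long maxTimestampSeen = Long.MIN_VALUE;

    public WatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Observe an event; returns true if it arrived behind the current watermark.
    public boolean observe(long eventTimestampMs) {
        boolean late = eventTimestampMs <= currentWatermark();
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return late;
    }

    public long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - maxOutOfOrdernessMs;
    }
}
```

The multi-stream problem follows directly from this model: a joined operator's watermark is the minimum of its inputs' watermarks, so one slow or idle source drags the combined watermark and delays (or, with idleness handling, prematurely fires) every window downstream.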

Our transition from Kafka Streams to Apache Flink proved smoother than initially anticipated. Teams with Java backgrounds and prior experience with Kafka Streams found Flink's programming model intuitive and easy to use. The DataStream API offers familiar concepts and patterns, and Flink's more advanced features could be adopted incrementally as needed. This gradual learning curve gave our developers the flexibility to become productive quickly, focusing first on core stream processing tasks before moving on to more advanced concepts like state management and late event processing.

The future of Flink at Nexthink

Real-time alerting is now deployed to production and available to our clients. A major success of this project was the fact that we successfully introduced a technology as an alternative to Kafka Streams, with very little management overhead, guaranteed scalability, data-management flexibility, and comparable cost.

The impact on the Nexthink alerting system was significant because we no longer evaluate alerts solely through database polling. Therefore, we're already assessing the timeframe for onboarding other alerting use cases to real-time evaluation with Flink. This will alleviate database load and will also provide more accuracy in alert triggering.

Yet the impact of Flink isn't limited to the Nexthink alerting system. We now have a proven production-ready alternative for services whose scalability is limited by the number of partitions of the topics they're consuming. Thus, we're actively evaluating the option to convert more services to Flink to allow them to scale out more flexibly.

Conclusion

Amazon Managed Service for Apache Flink has been transformative for our real-time alerting system at Nexthink. By handling the complex infrastructure management, AWS enabled our team to deploy a sophisticated streaming solution in less than a month, keeping our focus on delivering business value rather than managing Flink clusters.

The capabilities of Flink have proven it to be more than an alternative to Kafka Streams. It has become a compelling first choice for both new projects and existing feature refactoring. Windowed processing, late event management, and stateful streaming operations have made complex use cases remarkably straightforward to implement. As our development teams continue to explore Flink's potential, we're increasingly confident that it will play a central role in Nexthink's real-time data processing architecture moving forward.

To get started with Amazon Managed Service for Apache Flink, explore the getting started resources and the hands-on workshop. To learn more about Nexthink's broader journey with AWS, visit the blog post on Nexthink's MSK-based architecture.


About the authors

Nikos Tragaras is a Principal Software Architect at Nexthink with around 20 years of experience in building distributed systems, from traditional architectures to modern cloud-native platforms. He has worked extensively with streaming technologies, focusing on reliability and performance at scale. Passionate about programming, he enjoys building clean solutions to complex engineering problems.

Raphaël Afanyan is a Software Engineer and Tech Lead of the Alerts team at Nexthink. Over the years, he has worked on designing and scaling data processing systems and played a key role in building Nexthink's alerting platform. He now collaborates across teams to bring innovative product ideas to life, from backend architecture to polished user interfaces.

Simone Pomata is a Senior Solutions Architect at AWS. He has worked enthusiastically in the tech industry for more than 10 years. At AWS, he helps customers succeed in building new technologies every day.

Subham Rakshit is a Senior Streaming Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build streaming architectures so they can get value from analyzing their streaming data. His two little daughters keep him occupied most of the time outside work, and he loves solving jigsaw puzzles with them. Connect with him on LinkedIn.

Lorenzo Nicora works as a Senior Streaming Solutions Architect at AWS, helping customers across EMEA. He has been building cloud-centered, data-intensive systems for over 25 years, working across industries both through consultancies and product companies. He has used open source technologies extensively and contributed to several projects, including Apache Flink.
