Monday, October 27, 2025

Project Flash update: Advancing Azure Virtual Machine availability monitoring


Flash enables rapid detection of issues originating from the Azure platform, helping teams respond quickly to infrastructure-related disruptions.

Previously, we shared an update on Project Flash as part of our Advancing Reliability blog series, reaffirming our commitment to helping Azure customers detect and diagnose virtual machine (VM) availability issues with speed and precision. This year, we’re excited to unveil the latest innovations that take VM availability monitoring to the next level, enabling customers to operate their workloads on Azure with even greater confidence. I’ve asked Yingqi (Halley) Ding, Technical Program Manager from the Azure Core Compute team, to walk us through the latest investments powering the next phase of Project Flash.

— Mark Russinovich, CTO, Deputy CISO, and Technical Fellow, Microsoft Azure.


Project Flash is a cross-division initiative at Microsoft. Its vision is to deliver precise telemetry, real-time alerts, and scalable monitoring, all within a unified, user-friendly experience designed to meet the diverse observability needs of virtual machine (VM) availability.

Flash addresses both platform-level and user-level challenges. It enables rapid detection of issues originating from the Azure platform, helping teams respond quickly to infrastructure-related disruptions. At the same time, it equips you with actionable insights to diagnose and resolve problems within your own environment. This dual capability supports high availability and helps ensure your business service-level agreements (SLAs) are consistently met. It’s our mission to ensure you can:

  • Gain clear visibility into disruptions, such as VM reboots and restarts, application freezes due to network driver updates, and 30-second host OS updates, with detailed insights into what happened, why it happened, and whether it was planned or unexpected.
  • Analyze trends and set alerts to speed up debugging and track availability over time.
  • Monitor at scale and build custom dashboards to stay on top of the health of all resources.
  • Receive automated root cause analyses (RCAs) that explain which VMs were affected, what caused the issue, how long it lasted, and what was done to fix it.
  • Receive real-time notifications for critical events, such as degraded nodes requiring VM redeployment, platform-initiated service healing, or in-place reboots triggered by hardware issues, empowering your teams to respond swiftly and minimize user impact.
  • Adapt recovery policies dynamically to meet changing workload needs and business priorities.

During our team’s journey with Flash, it has garnered widespread adoption from some of the world’s leading companies, spanning e-commerce, gaming, finance, hedge funds, and many other sectors. Their extensive use of Flash underscores its effectiveness and value in meeting the diverse needs of high-profile organizations.

At BlackRock, VM reliability is critical to our operations. If a VM is running on degraded hardware, we want to be alerted quickly so we have the maximum opportunity to mitigate the issue before it impacts users. With Project Flash, we receive a resource health event integrated into our alerting processes the moment an underlying node in Azure infrastructure is marked unallocatable, typically due to health degradation. Our infrastructure team then schedules a migration of the affected resource to healthy hardware at an optimal time. This ability to predictively avoid abrupt VM failures has reduced our VM interruption rate and improved the overall reliability of our investment platform.

— Eli Hamburger, Head of Infrastructure Hosting, BlackRock.

Suite of solutions available today

The Flash initiative has evolved into a robust, scalable monitoring framework designed to meet the diverse needs of modern infrastructure, whether you’re managing a handful of VMs or operating at massive scale. Built with reliability at its core, Flash empowers you to monitor what matters most, using the tools and telemetry that align with your architecture and operational model.

Flash publishes VM availability states and resource health annotations for detailed failure attribution and downtime analysis. The guide below outlines your options so you can choose the right Flash monitoring solution for your scenario.

| Solution | Description |
| --- | --- |
| Azure Resource Graph (general availability) | For investigations at scale, centralized resource repositories, and historical lookups, you can periodically consume resource availability telemetry across all workloads at once using Azure Resource Graph (ARG). |
| Event Grid system topic (public preview) | To trigger time-sensitive and critical mitigations, such as redeploying or restarting VMs to prevent end-user impact, you can receive alerts within seconds of critical changes in resource availability via event handlers in Event Grid. |
| Azure Monitor – Metrics (public preview) | To track trends, aggregate platform metrics (e.g., CPU, disk), and configure precise threshold-based alerts, you can consume an out-of-the-box VM availability metric via Azure Monitor. |
| Resource Health (general availability) | To perform instant and convenient per-resource health checks in the Portal UI, you can quickly view the RHC blade. You can also access a 30-day historical view of health checks for that resource to support fast and effective troubleshooting. |
Figure 1: Flash endpoints (flowchart)

What’s new?

Public preview: User vs platform dimension introduced for VM availability metric

Many customers have emphasized the need for user-friendly monitoring solutions that provide real-time, scalable access to compute resource availability data. This information is essential for triggering timely mitigation actions in response to availability changes.

Designed to meet this critical need, the VM availability metric is well suited for tracking trends, aggregating platform metrics (such as CPU and disk usage), and configuring precise threshold-based alerts. You can use this out-of-the-box VM availability metric in Azure Monitor.

Figure 2: VM availability metric (monitoring overview)

Now you can use the Context dimension to identify whether VM availability was influenced by Azure or by user-orchestrated activity. This dimension indicates, during any disruption or when the metric drops to zero, whether the cause was platform-triggered or user-driven. It can assume values of Platform, Customer, or Unknown.

Figure 3: Context dimension (graph)
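As a rough illustration of how the Context dimension could support downtime attribution, the sketch below counts unavailable minutes per Context value. It is a minimal example under stated assumptions: the `Sample` shape, the one-minute granularity, and the sample data are hypothetical, and it does not call Azure Monitor.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One datapoint of the VM availability metric (hypothetical shape)."""
    minute: int     # minutes since the start of the window
    value: float    # 1.0 = available, 0.0 = unavailable
    context: str    # "Platform", "Customer", or "Unknown"


def downtime_by_context(samples):
    """Count unavailable minutes per Context value.

    The context of samples where the VM is available is ignored.
    """
    totals = defaultdict(int)
    for s in samples:
        if s.value < 1.0:
            totals[s.context] += 1
    return dict(totals)


# Invented data for illustration only.
samples = [
    Sample(0, 1.0, "Unknown"),
    Sample(1, 0.0, "Platform"),
    Sample(2, 0.0, "Platform"),
    Sample(3, 1.0, "Unknown"),
    Sample(4, 0.0, "Customer"),
    Sample(5, 1.0, "Unknown"),
]

print(downtime_by_context(samples))
```

Splitting downtime this way separates disruptions you initiated (for example, a deliberate restart) from those attributed to the platform.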

The new dimension is also supported in Azure Monitor alert rules as part of the filtering process.

Figure 4: Azure Monitor alert rule (UI for creating an alert rule and selecting dimension values)
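To make the effect of dimension filtering concrete, here is a simplified model of a threshold alert that only considers samples whose Context matches the rule. This is a sketch, not how Azure Monitor evaluates rules internally: the `AlertRule` class, its fields, and the averaging logic are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical, simplified threshold rule with a dimension filter."""
    threshold: float      # fire when the average falls below this value
    window: int           # number of most recent samples to evaluate
    contexts: frozenset   # Context values included by the filter


def should_fire(rule, samples):
    """Evaluate the rule against (value, context) samples, oldest first."""
    recent = samples[-rule.window:]
    selected = [value for value, context in recent if context in rule.contexts]
    if not selected:
        return False  # nothing matched the dimension filter
    return sum(selected) / len(selected) < rule.threshold


# Alert only on platform-attributed drops in availability.
rule = AlertRule(threshold=1.0, window=5, contexts=frozenset({"Platform"}))

platform_outage = [(1.0, "Platform"), (0.0, "Platform")]
customer_restart = [(1.0, "Platform"), (0.0, "Customer")]

print(should_fire(rule, platform_outage))   # platform drop is included
print(should_fire(rule, customer_restart))  # customer drop is filtered out
```

With a filter like this, a user-initiated restart does not page the on-call team, while a platform-attributed drop does.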

Public preview: Enable sending health resources events to Azure Monitor alerts in Event Grid

Azure Event Grid is a highly scalable, fully managed Pub/Sub message distribution service that offers flexible message consumption patterns. Event Grid enables you to publish and subscribe to messages to support Internet of Things (IoT) solutions. Through HTTP, Event Grid enables you to build event-driven solutions, where a publisher service (such as Project Flash) announces its system state changes (events) to subscriber applications.

Figure 5: Event Grid system topics (list)
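On the subscriber side, an application receiving these events over HTTP might route them by event type. The sketch below is illustrative only: the two event type strings reflect my understanding of the health resources system topic and should be verified against the Event Grid documentation, and the sample payload and resource ID are invented. It accepts either `eventType` (Event Grid schema) or `type` (CloudEvents schema).

```python
import json

# Assumed event type names; confirm against the Event Grid documentation.
AVAILABILITY_CHANGED = (
    "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusChanged"
)
RESOURCE_ANNOTATED = (
    "Microsoft.ResourceNotifications.HealthResources.ResourceAnnotated"
)


def summarize_events(body):
    """Turn a delivered JSON body into (kind, subject) pairs."""
    parsed = json.loads(body)
    events = parsed if isinstance(parsed, list) else [parsed]
    notifications = []
    for event in events:
        # Event Grid schema uses "eventType"; CloudEvents uses "type".
        event_type = event.get("eventType") or event.get("type", "")
        subject = event.get("subject", "<unknown resource>")
        if event_type == AVAILABILITY_CHANGED:
            notifications.append(("availability", subject))
        elif event_type == RESOURCE_ANNOTATED:
            notifications.append(("annotation", subject))
        # Any other event type is ignored by this sketch.
    return notifications


# Invented resource ID and payload, for illustration only.
VM_ID = (
    "/subscriptions/0000/resourceGroups/demo-rg"
    "/providers/Microsoft.Compute/virtualMachines/demo-vm"
)
sample_body = json.dumps([
    {"eventType": AVAILABILITY_CHANGED, "subject": VM_ID},
    {"type": RESOURCE_ANNOTATED, "subject": VM_ID},
    {"eventType": "Some.Other.Event", "subject": VM_ID},
])

for kind, subject in summarize_events(sample_body):
    print(kind, subject)
```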

With the integration of Azure Monitor alerts as a new event handler, you can now receive low-latency notifications, such as VM availability changes and detailed annotations, via SMS, email, push notifications, and more. This combines Event Grid’s near real-time delivery with Azure Monitor’s direct alerting capabilities.

Figure 6: Event Grid subscription (UI for creating an event subscription)

To get started, simply follow the step-by-step instructions and begin receiving real-time alerts with Flash’s new offering.

What’s next?

Looking ahead, we plan to broaden our focus to include scenarios such as inoperable top-of-rack switches, failures in accelerated networking, and new classes of hardware failure prediction. In addition, we aim to continue improving data quality and consistency across all Flash endpoints, enabling more accurate downtime attribution and deeper visibility into VM availability.

For comprehensive monitoring of VM availability, including scenarios such as routine maintenance, live migration, service healing, and degradation, we recommend leveraging both Flash Health events and Scheduled Events (SE).

  • Flash Health events offer real-time insights into ongoing and historical availability disruptions, including VM degradation. This facilitates effective downtime management, supports automated mitigation strategies, and enhances root cause analysis.
  • Scheduled Events, in contrast, provide up to 15 minutes of advance notice prior to planned maintenance, enabling proactive decision-making and preparation. During this window, you may choose to acknowledge the event or defer actions based on your operational readiness.
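The acknowledge-or-defer decision for Scheduled Events can be sketched as follows. The example works on an already-parsed Scheduled Events document and makes no network calls. The field names (`Events`, `EventId`, `EventType`, `EventStatus`) and the `StartRequests` acknowledgement payload follow the Scheduled Events format as I understand it. The readiness policy and sample events are hypothetical.

```python
def plan_acknowledgements(document, is_ready):
    """Split scheduled events into ones to acknowledge now and ones to defer.

    Returns (payload, deferred_ids). payload is the body to send back to
    approve the events early, or None when nothing is approved.
    """
    approved, deferred = [], []
    for event in document.get("Events", []):
        if event.get("EventStatus") != "Scheduled":
            continue  # already started; nothing left to acknowledge
        if is_ready(event):
            approved.append({"EventId": event["EventId"]})
        else:
            deferred.append(event["EventId"])
    payload = {"StartRequests": approved} if approved else None
    return payload, deferred


# Invented sample document, for illustration only.
document = {
    "DocumentIncarnation": 1,
    "Events": [
        {"EventId": "A", "EventType": "Reboot", "EventStatus": "Scheduled"},
        {"EventId": "B", "EventType": "Freeze", "EventStatus": "Scheduled"},
        {"EventId": "C", "EventType": "Freeze", "EventStatus": "Started"},
    ],
}


def tolerate_freeze_only(event):
    """Hypothetical policy: short freezes are acceptable, reboots need draining."""
    return event["EventType"] == "Freeze"


payload, deferred = plan_acknowledgements(document, tolerate_freeze_only)
print(payload)
print(deferred)
```

Deferred events still proceed once their notice window ends, so the deferred list is best treated as a to-do list for draining or failing over the workload in the remaining time.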

For upcoming updates on the Flash initiative, we encourage you to follow the Advancing Reliability series!


