Wednesday, December 3, 2025

Infinite scale: The architecture behind the Azure AI superfactory


Today, we’re unveiling the next Fairwater site of Azure AI datacenters in Atlanta, Georgia. This purpose-built datacenter is connected to our first Fairwater site in Wisconsin, prior generations of AI supercomputers and the broader Azure global datacenter footprint to create the world’s first planet-scale AI superfactory. By packing computing power more densely than ever before, each Fairwater site is built to efficiently meet unprecedented demand for AI compute, push the frontiers of model intelligence and empower every person and organization on the planet to achieve more.

To meet this demand, we have reinvented how we design AI datacenters and the systems we run inside them. Fairwater is a departure from the traditional cloud datacenter model and uses a single flat network that can integrate hundreds of thousands of the latest NVIDIA GB200 and GB300 GPUs into a massive supercomputer. These innovations are a product of decades of experience designing datacenters and networks, as well as learnings from supporting some of the largest AI training jobs on the planet.

While the Fairwater datacenter design is well suited for training the next generation of frontier models, it is also built with fungibility in mind. Training has evolved from a single monolithic job into a range of workloads with different requirements (such as pre-training, fine-tuning, reinforcement learning and synthetic data generation). Microsoft has deployed a dedicated AI WAN backbone to integrate each Fairwater site into a broader elastic system that enables dynamic allocation of diverse AI workloads and maximizes GPU utilization of the combined system.

Below, we walk through some of the exciting technical innovations that support Fairwater, from the way we build datacenters to the networking within and across the sites.

Maximum density of compute

Modern AI infrastructure is increasingly constrained by the laws of physics. The speed of light is now a key bottleneck in our ability to tightly integrate accelerators, compute and storage with performant latency. Fairwater is designed to maximize the density of compute to minimize latency within and across racks and maximize system performance.

One of the key levers for driving density is improving cooling at scale. AI servers in the Fairwater datacenters are connected to a facility-wide cooling system designed for longevity, with a closed-loop approach that reuses the liquid continuously after the initial fill with no evaporation. The water used in the initial fill is equivalent to what 20 homes consume in a year and is only replaced if water chemistry indicates it’s needed (it’s designed for 6-plus years), making it extremely efficient and sustainable.

Liquid-based cooling also provides much higher heat transfer, enabling us to maximize rack and row-level power (~140 kW per rack, 1,360 kW per row) to pack compute as densely as possible inside the datacenter. State-of-the-art cooling also helps us maximize utilization of this dense compute in steady-state operations, enabling large training jobs to run performantly at high scale. After cycling through a system of cold plate paths across the GPU fleet, heat is dissipated by one of the largest chiller plants on the planet.
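As a rough check on the power figures above, dividing the row-level budget by the rack-level draw suggests how many racks fit within one row's power envelope. The sketch below is illustrative arithmetic only: the function name is hypothetical, and it assumes every rack draws the full ~140 kW at once, which the article does not state.

```python
def racks_per_row(row_kw: float, rack_kw: float) -> float:
    """Return how many racks drawing rack_kw fit in a row budget of row_kw.

    Hypothetical helper for illustration; assumes all racks run at full draw.
    """
    if rack_kw <= 0:
        raise ValueError("rack_kw must be positive")
    return row_kw / rack_kw


if __name__ == "__main__":
    # Figures quoted in the article: ~140 kW per rack, 1,360 kW per row.
    ratio = racks_per_row(1360.0, 140.0)
    print(f"{ratio:.1f} racks per row at full draw")
```

Under that assumption the ratio comes out a little under ten racks per row; the real layout may differ if racks are not all provisioned at peak draw.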

Rack-level direct liquid cooling.

Another way we’re driving compute density is with a two-story datacenter building design. Many AI workloads are very sensitive to latency, which means cable run lengths can meaningfully impact cluster performance. Every GPU in Fairwater is connected to every other GPU, so the two-story datacenter building approach allows for placement of racks in three dimensions to minimize cable lengths, which in turn improves latency, bandwidth, reliability and cost.

Two-story networking architecture.
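To see why cable length matters at this scale, the one-way propagation delay of a cable can be estimated from its length and the speed of light in the medium. The sketch below is a minimal illustration, not a model of Fairwater's cabling: the refractive index of 1.47 is an assumed typical value for optical fiber, and the function name is hypothetical.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum (exact by SI definition)


def propagation_delay_ns(length_m: float, refractive_index: float = 1.47) -> float:
    """One-way propagation delay in nanoseconds over a cable of length_m.

    refractive_index=1.47 is an assumed typical value for optical fiber;
    it is not a figure from the article.
    """
    if length_m < 0 or refractive_index < 1:
        raise ValueError("length must be >= 0 and refractive index >= 1")
    speed_m_per_s = C_VACUUM_M_PER_S / refractive_index
    return length_m / speed_m_per_s * 1e9


if __name__ == "__main__":
    for metres in (2, 30, 100):
        print(f"{metres:>4} m -> {propagation_delay_ns(metres):.1f} ns one way")
```

With that assumed index, each metre of fiber adds roughly 5 ns in each direction, so stacking racks vertically to shorten runs trims delay on every hop.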

High-availability, low-cost power

We’re pushing the envelope in serving this compute with cost-efficient, reliable power. The Atlanta site was selected with resilient utility power in mind and is capable of achieving 4×9 availability at 3×9 cost. By securing highly available grid power, we can also forgo traditional resiliency approaches for the GPU fleet (such as on-site generation, UPS systems and dual-corded distribution), driving cost savings for customers and faster time-to-market for Microsoft.
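Reading "4×9" and "3×9" as four nines (99.99%) and three nines (99.9%) of availability, which is the usual convention but an interpretation rather than something the article spells out, the gap can be expressed as expected downtime per year. The helper below is a hypothetical illustration of that arithmetic.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(nines: int) -> float:
    """Expected downtime per year for an availability of `nines` nines.

    For example, nines=3 means 99.9% availability (unavailability of 10**-3).
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4):
        print(f"{n} nines -> {annual_downtime_minutes(n):.2f} minutes per year")
```

On this reading, four nines allows under an hour of downtime a year, against several hours for three nines.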

We have also worked with our industry partners to codevelop power-management solutions to mitigate power oscillations created by large-scale jobs, a growing challenge in maintaining grid stability as AI demand scales. This includes a software-driven solution that introduces supplementary workloads during periods of reduced activity, a hardware-driven solution where the GPUs enforce their own power thresholds and an on-site energy storage solution to further mask power fluctuations without utilizing excess power.
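The software-driven idea, adding supplementary work when activity dips, can be pictured as filling the troughs of a power trace up to a target floor. The following is a toy sketch under stated assumptions, not Microsoft's mechanism: the function names, the floor value and the sample trace are all hypothetical.

```python
def supplementary_load(trace_kw: list[float], floor_kw: float) -> list[float]:
    """Extra load (kW) to add at each sample so power never drops below floor_kw.

    Toy model only: a real system would also account for ramp rates and
    scheduling delays, which are ignored here.
    """
    return [max(0.0, floor_kw - p) for p in trace_kw]


def smoothed_trace(trace_kw: list[float], floor_kw: float) -> list[float]:
    """Power trace after the supplementary load has been applied."""
    fill = supplementary_load(trace_kw, floor_kw)
    return [p + f for p, f in zip(trace_kw, fill)]


if __name__ == "__main__":
    trace = [100.0, 40.0, 95.0, 35.0, 100.0]  # hypothetical sample trace
    smooth = smoothed_trace(trace, floor_kw=80.0)
    print("swing before:", max(trace) - min(trace))
    print("swing after: ", max(smooth) - min(smooth))
```

The peak-to-trough swing seen by the grid shrinks because the troughs are raised while the peaks are left untouched.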

Cutting-edge accelerators and networking systems

Fairwater’s world-class datacenter design is powered by purpose-built servers, cutting-edge AI accelerators and novel networking systems. Each Fairwater datacenter runs a single, coherent cluster of interconnected NVIDIA Blackwell GPUs, with an advanced network architecture that can scale reliably beyond traditional Clos network limits with current-gen switches (hundreds of thousands of GPUs on a single flat network). This required innovation across scale-up networking, scale-out networking and networking protocol.

In terms of scale-up, each rack of AI accelerators houses up to 72 NVIDIA Blackwell GPUs, connected via NVLink for ultra-low-latency communication within the rack. Blackwell accelerators provide the highest compute density available today, with support for low-precision number formats like FP4 to increase total FLOPS and enable efficient memory use. Each rack provides 1.8 TB/s of GPU-to-GPU bandwidth, with over 14 TB of pooled memory available to each GPU.

Densely populated GPU racks with app-driven networking.
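The rack-level figures above lend themselves to a quick back-of-the-envelope calculation: how much of the pooled memory corresponds to each GPU, and how much space model weights need at different precisions. The sketch below is illustrative only; the helper names are hypothetical, the even split of the pool is an assumption, and the weight estimate ignores scaling factors, activations, optimizer state and caches.

```python
def per_gpu_share_gb(pooled_tb: float, gpus: int) -> float:
    """Even share of a pooled memory figure per GPU, in decimal GB.

    Assumes the pool is split evenly, which the article does not state.
    """
    if gpus <= 0:
        raise ValueError("gpus must be positive")
    return pooled_tb * 1000 / gpus


def weight_bytes(num_params: int, bits_per_param: int) -> float:
    """Bytes needed to hold num_params weights at the given bit width.

    Weights only; real formats such as FP4 also carry scaling metadata.
    """
    return num_params * bits_per_param / 8


if __name__ == "__main__":
    print(f"{per_gpu_share_gb(14, 72):.0f} GB per GPU if split evenly")
    trillion = 10**12
    for name, bits in (("FP16", 16), ("FP8", 8), ("FP4", 4)):
        print(f"{name}: {weight_bytes(trillion, bits) / 1e12:.1f} TB for 1T params")
```

Moving from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, which is the memory-efficiency effect the paragraph above refers to.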

These racks then use scale-out networking to create pods and clusters that enable all GPUs to function as a single supercomputer with minimal hop counts. We achieve this with a two-tier, Ethernet-based backend network that supports massive cluster sizes with 800 Gbps GPU-to-GPU connectivity. Relying on a broad Ethernet ecosystem and SONiC (Software for Open Networking in the Cloud, which is our own operating system for our network switches) also helps us avoid vendor lock-in and manage cost, as we can use commodity hardware instead of proprietary solutions.
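For context on why cluster size is tied to switch capability, the textbook capacity of a non-blocking two-tier leaf-spine (folded Clos) fabric grows with the square of the switch port count. The sketch below shows that standard formula only; it does not describe how Fairwater's network is actually built, and the radix values are hypothetical examples rather than figures from the article.

```python
def two_tier_max_endpoints(radix: int) -> int:
    """Maximum endpoints in a non-blocking two-tier leaf-spine fabric.

    Textbook model: each leaf uses radix/2 ports down and radix/2 ports up,
    and each spine port connects one leaf, so there are at most `radix`
    leaves with radix/2 endpoints each.
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even integer >= 2")
    return radix * (radix // 2)


if __name__ == "__main__":
    for radix in (64, 128, 512):  # hypothetical port counts
        print(f"radix {radix:>3}: up to {two_tier_max_endpoints(radix):,} endpoints")
```

Because capacity scales with the square of the radix, keeping the fabric to two tiers at very large GPU counts depends heavily on how many ports each switch can offer.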

Improvements across packet trimming, packet spray and high-frequency telemetry are core components of our optimized AI network. We’re also working to enable deeper control and optimization of network routes. Together, these technologies deliver advanced congestion control, rapid detection and retransmission and agile load balancing, ensuring ultra-reliable, low-latency performance for modern AI workloads.
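The load-balancing benefit of packet spray can be illustrated by comparing it with conventional per-flow hashing, where every packet of a flow follows the same path. This is a toy comparison under simplifying assumptions, not the article's implementation: the function names and flow identifier are hypothetical, spraying is modeled as plain round-robin, and receiver-side reordering is ignored.

```python
import zlib
from collections import Counter


def per_flow_load(flow_id: str, packets: int, num_paths: int) -> Counter:
    """Packets per path when the whole flow is hashed onto a single path."""
    load = Counter({path: 0 for path in range(num_paths)})
    chosen = zlib.crc32(flow_id.encode()) % num_paths  # deterministic hash
    load[chosen] += packets
    return load


def sprayed_load(packets: int, num_paths: int) -> Counter:
    """Packets per path when packets are sprayed round-robin over all paths."""
    load = Counter({path: 0 for path in range(num_paths)})
    for i in range(packets):
        load[i % num_paths] += 1
    return load


if __name__ == "__main__":
    hashed = per_flow_load("gpu17->gpu42", packets=1000, num_paths=4)
    sprayed = sprayed_load(packets=1000, num_paths=4)
    print("per-flow hashing, busiest path:", max(hashed.values()))
    print("packet spray, busiest path:    ", max(sprayed.values()))
```

A single large flow saturates one path under per-flow hashing, while spraying spreads the same packets evenly, at the cost of needing the receiver to tolerate out-of-order arrival.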

Planet scale

Even with these innovations, compute demands for large training jobs (now measured in trillions of parameters) are quickly outpacing the power and space constraints of a single facility. To serve these needs, we have built a dedicated AI WAN optical network to extend Fairwater’s scale-up and scale-out networks. Leveraging our scale and decades of hyperscale expertise, we delivered over 120,000 new fiber miles across the US last year, expanding AI network reach and reliability nationwide.

With this high-performance, high-resiliency backbone, we can directly connect different generations of supercomputers into an AI superfactory that exceeds the capabilities of a single site across geographically diverse locations. This empowers AI developers to tap our broader network of Azure AI datacenters, segmenting traffic based on their needs across scale-up and scale-out networks within a site, as well as across sites via the continent-spanning AI WAN.

This is a meaningful departure from the past, where all traffic had to ride the scale-out network regardless of the requirements of the workload. Not only does it provide customers with fit-for-purpose networking at a more granular level, it also helps create fungibility to maximize the flexibility and utilization of our infrastructure.
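The three-way traffic segmentation described above can be sketched as a simple placement rule: GPUs in the same rack communicate over scale-up, GPUs in different racks at one site over scale-out, and GPUs at different sites over the AI WAN. The code below is a hypothetical illustration of that rule; the class, function and tier labels are invented for the example, and a real system would presumably also weigh workload requirements rather than location alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLocation:
    """Hypothetical placement record: which site and rack a GPU sits in."""
    site: str
    rack: str


def network_tier(a: GpuLocation, b: GpuLocation) -> str:
    """Pick the narrowest network tier that connects two GPUs.

    Tier names are illustrative labels, not product or API identifiers.
    """
    if a.site != b.site:
        return "ai-wan"      # different sites: cross-site backbone
    if a.rack != b.rack:
        return "scale-out"   # same site, different racks: backend network
    return "scale-up"        # same rack: in-rack interconnect


if __name__ == "__main__":
    atl_r1 = GpuLocation("atlanta", "r1")
    atl_r2 = GpuLocation("atlanta", "r2")
    wis_r1 = GpuLocation("wisconsin", "r1")
    print(network_tier(atl_r1, atl_r1))
    print(network_tier(atl_r1, atl_r2))
    print(network_tier(atl_r1, wis_r1))
```

Note that rack names are only compared within a site, so identically named racks at two sites are still routed over the WAN tier.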

Putting it all together

The new Fairwater site in Atlanta represents the next leap in the Azure AI infrastructure and reflects our experience running the largest AI training jobs on the planet. It combines breakthrough innovations in compute density, sustainability and networking systems to efficiently serve the massive demand for computational power we’re seeing. It also integrates deeply with other AI datacenters and the broader Azure platform to form the world’s first AI superfactory. Together, these innovations provide a flexible, fit-for-purpose infrastructure that can serve the full spectrum of modern AI workloads and empower every person and organization on the planet to achieve more. For our customers, this means easier integration of AI into every workflow and the ability to create innovative AI solutions that were previously unattainable.

Find out more about how Microsoft Azure can help you integrate AI to streamline and strengthen development lifecycles here.

Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft’s cloud computing platform, generative AI solutions, data platforms and information and cybersecurity. These platforms and services help organizations worldwide solve urgent challenges and drive long-term transformation.

Editor’s note: An update was made to more clearly explain how we optimize our network.
