
At a glance
- Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).
- Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.
- CORPGEN introduces digital employees, with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.
- Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.
By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.
In our paper, “CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.
Introducing Multi-Horizon Task Environments
Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.
To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up: an agent can’t hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences; they form complex webs in which an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not merely resuming where the agent left off.
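The third and fourth weaknesses can be made concrete with a small sketch. It is an illustration only, not CORPGEN's implementation: the `Task` class, the `priority` field, and the example task names are all hypothetical. On every cycle the agent recomputes which tasks are unblocked and re-sorts them, rather than resuming the previous task.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int                      # hypothetical: lower number = more urgent
    depends_on: list[str] = field(default_factory=list)
    done: bool = False


def ready_tasks(tasks: dict[str, Task]) -> list[Task]:
    """Return unfinished tasks whose upstream work is all complete, most urgent first."""
    ready = [
        t for t in tasks.values()
        if not t.done and all(tasks[d].done for d in t.depends_on)
    ]
    return sorted(ready, key=lambda t: (t.priority, t.name))


# A small dependency web: the slides need both the report and the budget.
tasks = {
    "budget": Task("budget", priority=2),
    "report": Task("report", priority=1, depends_on=["budget"]),
    "slides": Task("slides", priority=3, depends_on=["report", "budget"]),
    "email": Task("email", priority=4),
}

first_cycle = [t.name for t in ready_tasks(tasks)]   # report is urgent but blocked
tasks["budget"].done = True
second_cycle = [t.name for t in ready_tasks(tasks)]  # report is now unblocked
```

In the first cycle only `budget` and `email` are actionable, even though `report` has the highest priority. Once `budget` is finished, `report` moves to the front while `slides` stays blocked.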
We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.
CORPGEN’s architecture
CORPGEN introduces digital employees: LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.

CORPGEN addresses each of the four weaknesses of concurrent task execution (memory overload, cross-task interference, dependency complexity, and reprioritization) in a targeted way. Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.
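As a rough illustration of the hierarchical idea, the sketch below nests an objective, its daily goals, and the individual actions under each goal. The `Plan` and `DailyGoal` classes and the example actions are assumptions made for this example, not names from CORPGEN. The point is that the next step is read off the structure rather than chosen by rescanning every task.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DailyGoal:
    description: str
    actions: list[str]          # hypothetical low-level actions for this goal
    completed: int = 0          # number of actions already carried out

    def next_action(self) -> str | None:
        if self.completed < len(self.actions):
            return self.actions[self.completed]
        return None


@dataclass
class Plan:
    objective: str
    daily_goals: list[DailyGoal] = field(default_factory=list)

    def next_step(self) -> tuple[str, str] | None:
        """Return (goal, action) for the first unfinished daily goal, if any."""
        for goal in self.daily_goals:
            action = goal.next_action()
            if action is not None:
                return goal.description, action
        return None

    def mark_done(self) -> None:
        """Record that the action returned by next_step() has been completed."""
        for goal in self.daily_goals:
            if goal.next_action() is not None:
                goal.completed += 1
                return


plan = Plan(
    "Deliver quarterly client report",
    [
        DailyGoal("Update budget sheet", ["open workbook", "refresh totals"]),
        DailyGoal("Draft report", ["open document", "insert totals table"]),
    ],
)

first = plan.next_step()
plan.mark_done()
plan.mark_done()            # budget sheet goal is now finished
third = plan.next_step()    # planning moves on to the next daily goal
```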
Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination. A tiered memory system enables selective recall of task-related information rather than retaining everything in active context. Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.
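A minimal sketch of tiered memory with selective recall might look like the following. Everything here is an assumption for illustration: the `TieredMemory` class, the two tiers, and the use of plain truncation as a stand-in for adaptive summarization (a real system would presumably summarize with a model).

```python
from __future__ import annotations


class TieredMemory:
    """Hypothetical two-tier store: a small active context plus a per-task archive."""

    def __init__(self, active_limit: int = 3, routine_chars: int = 40):
        self.active_limit = active_limit
        self.routine_chars = routine_chars
        self.active: list[tuple[str, str]] = []      # (task_id, note), oldest first
        self.archive: dict[str, list[str]] = {}      # task_id -> archived notes

    def observe(self, task_id: str, note: str, critical: bool = False) -> None:
        # Stand-in for adaptive summarization: routine notes are truncated,
        # critical notes are kept verbatim.
        stored = note if critical else note[: self.routine_chars]
        self.active.append((task_id, stored))
        # Move the oldest notes to the archive once the active tier is full.
        while len(self.active) > self.active_limit:
            old_task, old_note = self.active.pop(0)
            self.archive.setdefault(old_task, []).append(old_note)

    def recall(self, task_id: str) -> list[str]:
        """Selective recall: notes for one task only, archived notes first."""
        current = [n for t, n in self.active if t == task_id]
        return self.archive.get(task_id, []) + current


memory = TieredMemory(active_limit=2)
memory.observe("report", "Client asked for Q3 figures by Friday", critical=True)
memory.observe("budget", "Opened spreadsheet; nothing changed since the last check of the file")
memory.observe("report", "Draft saved")

report_notes = memory.recall("report")
budget_notes = memory.recall("budget")
```

Recalling by task returns only that task's notes, which is the isolation property the paragraph above describes: nothing about the budget spreadsheet appears when the agent works on the report.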
Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how these mechanisms fit together within CORPGEN’s architecture.

How digital employees collaborate
When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication.
There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many employees use. Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue.
When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving. The result is a digital organization that behaves like a real one without being explicitly programmed to do so.
Evaluating CORPGEN
We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.
Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.
Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.
Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgments roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.
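To make the second finding more tangible, here is one way an agent could store records of completed tasks and look up a structurally similar one. This is a sketch under stated assumptions, not the method used in CORPGEN: the `Experience` and `ExperienceStore` names, the abstract step labels, the Jaccard similarity measure, and the 0.5 threshold are all chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    task: str
    steps: tuple[str, ...]   # hypothetical abstract step types, e.g. "open_excel"


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap as |intersection| / |union|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ExperienceStore:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold            # assumed cut-off for "similar enough"
        self.records: list[Experience] = []

    def add(self, experience: Experience) -> None:
        self.records.append(experience)

    def most_similar(self, planned_steps: list[str]) -> Experience | None:
        """Return the best-matching past task, or None if nothing is close enough."""
        planned = set(planned_steps)
        best = max(
            self.records,
            key=lambda e: jaccard(set(e.steps), planned),
            default=None,
        )
        if best is None or jaccard(set(best.steps), planned) < self.threshold:
            return None
        return best


store = ExperienceStore()
store.add(Experience("Q2 budget summary", ("open_excel", "filter_rows", "make_chart", "save")))
store.add(Experience("Reply to vendor", ("open_outlook", "write_reply", "send")))

# A new spreadsheet task shares three of five distinct steps with the Q2 summary.
match = store.most_similar(["open_excel", "filter_rows", "make_pivot", "save"])
# A slide-deck task shares nothing with either record, so no experience is reused.
no_match = store.most_similar(["open_powerpoint", "add_slide"])
```

The threshold keeps the agent from reusing a record that only superficially resembles the new task. Where that cut-off should sit in practice is an open design choice that this sketch does not settle.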
Implications and looking forward
The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.
CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We’re also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.
Acknowledgments
This work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 team and the Mem0 project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation.
Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.
