In-depth evaluation of DS-STAR
Next, we conducted ablation studies to verify the effectiveness of DS-STAR’s individual components and to analyze the impact of the number of refinement rounds, specifically by measuring the iterations required to generate a sufficient plan.
Data File Analyzer: This agent is essential for high performance. Without the descriptions it generates (Variant 1), DS-STAR’s accuracy on hard tasks in the DABStep benchmark dropped sharply to 26.98%, underscoring the importance of rich data context for effective planning and implementation.
Router: The Router agent’s ability to determine whether a new step is needed or an incorrect step must be fixed is vital. When we removed it (Variant 2), DS-STAR could only add new steps sequentially, leading to worse performance on both easy and hard tasks. This demonstrated that it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps.
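To make the contrast between the full system and Variant 2 concrete, here is a minimal Python sketch of how a routing decision might be applied to a plan. It is an illustration under stated assumptions, not DS-STAR's actual implementation: the names `RouterDecision` and `apply_decision`, the two action labels, and the repair policy (truncating the plan at the faulty step and replacing it) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RouterDecision:
    """Hypothetical router output; the field names are assumptions."""
    action: str                       # "add_step" or "fix_step"
    step_index: Optional[int] = None  # 0-based index of the step to fix


def apply_decision(plan: List[str], decision: RouterDecision, new_step: str) -> List[str]:
    """Return a new plan; the input plan is left unmodified."""
    if decision.action == "add_step":
        # The only behavior available without a router (Variant 2): always append.
        return plan + [new_step]
    if decision.action == "fix_step":
        i = decision.step_index
        if i is None or not 0 <= i < len(plan):
            raise ValueError("fix_step needs a valid step_index")
        # Assumed repair policy: keep the steps before the faulty one,
        # drop it and everything after it, then continue with the new step.
        return plan[:i] + [new_step]
    raise ValueError(f"unknown action: {decision.action!r}")
```

With a three-step plan whose last step is wrong, a `fix_step` decision at index 2 yields a three-step plan with the faulty step replaced, whereas an `add_step` decision yields a four-step plan that still contains the faulty step. The second case is the situation the ablation describes.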
Generalizability Across LLMs: We also tested DS-STAR’s adaptability by using GPT-5 as the base model. This yielded promising results on the DABStep benchmark, indicating the framework’s generalizability. Interestingly, DS-STAR with GPT-5 performed better on easy tasks, while the Gemini-2.5-Pro version performed better on hard tasks.
