TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Computing · Artificial Intelligence

Key takeaway

AI coding assistants can introduce bugs that break existing tests, but researchers propose a new "test-driven" approach to track code impact and reduce regressions.


Quick Explainer

TDAD is a tool that aims to reduce code regressions in AI coding agents. It builds an abstract-syntax-tree (AST) derived code–test dependency graph, applies weighted impact analysis to identify the tests affected by a code change, and surfaces this information to the agent as a lightweight skill. With this context, agents can navigate dependencies more effectively and avoid breaking previously passing tests, easing a key tradeoff between thoroughness and speed when running tests. The paper also highlights a "TDD prompting paradox": procedural TDD instructions without graph context actually increased regressions for smaller models, underscoring the importance of providing contextual information over prescriptive workflows.

Deep Dive

Technical Deep Dive: TDAD - Test-Driven Agentic Development

Overview

This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology for reducing code regressions in AI coding agents. TDAD builds an abstract-syntax-tree (AST) derived code–test dependency graph, applies weighted impact analysis, and surfaces the results as a lightweight agent skill. The key findings are:

  • TDAD achieved a 70% reduction in test-level regressions across two models and frameworks.
  • A "TDD prompting paradox" was observed - procedural TDD instructions without graph context actually increased regressions for smaller models.
  • TDAD improved resolution from 24% to 32% and generation from 40% to 68% when deployed as an agent skill.
  • An autonomous auto-improvement loop raised resolution from 12% to 60% with 0% regression.

The work advocates for elevating regression rate to a first-class metric alongside resolution rate in AI coding benchmarks.

Problem & Context

  • AI coding agents can resolve real-world software issues, but frequently introduce regressions that break previously-passing tests.
  • Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied.
  • Detecting regressions is fundamentally hard for agents without dependency awareness - they face a tradeoff between running all tests (slow) and running only nearby tests (which misses indirect dependencies).
  • The standard SWE-bench evaluation already collects regression data, but it is not surfaced in rankings or reports.

Methodology

Architecture Overview

TDAD has two stages:

  1. Indexing: Parses the Python repository, builds an AST-derived code–test dependency graph.
  2. Impact Analysis: Identifies tests affected by changed files and exports a static test_map.txt file. The agent receives this file plus a 20-line skill definition.
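The indexing stage can be sketched with Python's built-in `ast` module. This is illustrative rather than TDAD's actual parser: it extracts only top-level function definitions and the names they call, whereas the real tool also handles classes, imports, and test discovery.

```python
import ast

def index_module(source: str, path: str) -> dict:
    """Map each function in one file to the set of names it calls.

    A minimal sketch of the indexing stage; edges from these call sets
    would become CALLS edges in the dependency graph.
    """
    tree = ast.parse(source, filename=path)
    functions = {}  # qualified name -> set of called names
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls = {
                c.func.id
                for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            }
            functions[f"{path}::{node.name}"] = calls
    return functions

example = "def helper():\n    return 1\n\ndef test_helper():\n    assert helper() == 1\n"
print(index_module(example, "mod.py"))
# → {'mod.py::helper': set(), 'mod.py::test_helper': {'helper'}}
```

Because the call from `test_helper` to `helper` is visible in the AST, the indexer can link the test to the code it exercises without ever running it.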

Graph Schema

The graph uses 4 node types (File, Function, Class, Test) and 5 edge types (CONTAINS, CALLS, IMPORTS, TESTS, INHERITS) to capture structural relationships.
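Since the paper's skill configuration uses NetworkX, the schema can be sketched as a directed graph whose node and edge types live in attributes. The node names and construction below are illustrative; only the type vocabulary comes from the paper.

```python
import networkx as nx

# Minimal sketch of the schema: "kind" attributes carry the node/edge types.
g = nx.DiGraph()
g.add_node("src/calc.py", kind="File")
g.add_node("calc.add", kind="Function")
g.add_node("tests/test_calc.py::test_add", kind="Test")

g.add_edge("src/calc.py", "calc.add", kind="CONTAINS")
g.add_edge("tests/test_calc.py::test_add", "calc.add", kind="TESTS")

# Tests affected by a changed function: follow TESTS edges backwards.
affected = [
    t for t, _, d in g.in_edges("calc.add", data=True) if d["kind"] == "TESTS"
]
print(affected)
# → ['tests/test_calc.py::test_add']
```

Storing edge types as attributes keeps one graph for all five relationships, so each impact strategy is just a different traversal filter.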

Impact Analysis Strategies

Four strategies run in parallel to score test impact:

  1. Direct: Tests that directly exercise changed code
  2. Transitive: Tests 1-3 call hops away from changed code
  3. Coverage: File-level dependency
  4. Imports: Tests that import changed files

Scores are merged using a weighted average with a confidence factor.
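A plausible shape for this merge is shown below. The weights and the agreement-based confidence factor are placeholders, since the summary does not publish the exact values; only the overall scheme (weighted average scaled by confidence) follows the paper.

```python
# Hypothetical strategy weights - the real values are not given in the paper.
WEIGHTS = {"direct": 1.0, "transitive": 0.6, "coverage": 0.4, "imports": 0.3}

def merge_scores(per_strategy: dict) -> dict:
    """Weighted average of per-strategy impact scores, scaled by a
    confidence factor that grows with the number of agreeing strategies."""
    merged = {}
    tests = {t for scores in per_strategy.values() for t in scores}
    den = sum(WEIGHTS[s] for s in per_strategy)
    for t in tests:
        num = sum(WEIGHTS[s] * scores.get(t, 0.0) for s, scores in per_strategy.items())
        agreement = sum(1 for scores in per_strategy.values() if t in scores)
        confidence = agreement / len(per_strategy)
        merged[t] = (num / den) * confidence
    return merged

scores = merge_scores({
    "direct": {"test_add": 1.0},
    "transitive": {"test_add": 0.5, "test_api": 0.5},
    "coverage": {"test_add": 1.0},
    "imports": {},
})
print(sorted(scores, key=scores.get, reverse=True))
# → ['test_add', 'test_api']
```

A test flagged by three strategies outranks one flagged only transitively, which lets the agent prioritize the highest-risk tests when time is limited.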

Agent Integration

TDAD integrates with agents through two static artifacts:

  • test_map.txt: Maps source files to test files
  • SKILL.md: 20-line definition to find, run, and fix related tests
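The summary does not specify the exact layout of test_map.txt, so the sketch below assumes a simple "source file: comma-separated test files" line format as one plausible rendering of the mapping.

```python
def write_test_map(mapping: dict) -> str:
    """Render a source-to-tests mapping as a flat text artifact.

    The one-line-per-source format here is an assumption; TDAD's actual
    file layout may differ.
    """
    lines = [f"{src}: {', '.join(tests)}" for src, tests in sorted(mapping.items())]
    return "\n".join(lines)

print(write_test_map({
    "src/calc.py": ["tests/test_calc.py"],
    "src/api.py": ["tests/test_api.py", "tests/test_calc.py"],
}))
```

A static file like this costs the agent one read instead of a graph query per turn, which is what makes the integration "lightweight".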

Data & Experimental Setup

Benchmark: SWE-bench Verified, a human-validated subset of 500 GitHub issues from 12 Python projects.

Models & Agents:

  • Phase 1 (100 instances): Qwen3-Coder 30B, local consumer hardware
  • Phase 2 (25 instances): Qwen3.5-35B-A3B, MLX on Apple Silicon, OpenCode v1.2.24 agent

Configurations:

  • Vanilla: Default prompt, no TDD or graph
  • TDD Prompt: Adds TDD workflow instructions
  • GraphRAG+TDD: TDD + SKILL.md + test_map.txt
  • Baseline: OpenCode agent, no TDAD
  • TDAD Skill: OpenCode + TDAD skill + NetworkX

Evaluation: SWE-bench Docker harness, measuring resolution rate, generation rate, test-level regression rate, and instance-level regression rate.
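The two regression metrics can be made concrete with a small sketch, assuming each instance records how many of its previously passing (pass-to-pass, P2P) tests broke after the agent's patch; the field names are hypothetical.

```python
def regression_rates(instances: list) -> tuple:
    """Compute test-level and instance-level regression rates.

    Test-level: broken P2P tests over all P2P tests.
    Instance-level: fraction of instances with at least one broken P2P test.
    """
    total_p2p = sum(i["p2p_total"] for i in instances)
    broken = sum(i["p2p_broken"] for i in instances)
    test_level = broken / total_p2p
    instance_level = sum(1 for i in instances if i["p2p_broken"] > 0) / len(instances)
    return test_level, instance_level

tl, il = regression_rates([
    {"p2p_total": 50, "p2p_broken": 3},
    {"p2p_total": 50, "p2p_broken": 0},
])
print(tl, il)  # → 0.03 0.5
```

The two metrics answer different questions: test-level measures how much damage a patch does, while instance-level measures how often any damage occurs at all.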

Results

Phase 1: Regression Reduction (100 instances)

  • GraphRAG+TDD achieved a 72% reduction in total pass-to-pass (P2P) failures (562 → 155) and a 70% reduction in test-level regression rate (6.08% → 1.82%).
  • TDD prompting without graph context actually increased regressions from 6.08% to 9.94%.
  • Resolution decreased only modestly (-2 pp) despite higher abstention, as GraphRAG's impact was on reducing damage rather than increasing fixes.

Phase 2: TDAD as an Agent Skill (25 instances)

  • TDAD skill improved resolution by +8 percentage points (24% → 32%) and generation by +28 pp (40% → 68%).
  • Both configurations had 0% regression rate on this instance set.
  • The key benefit was improved codebase navigation and generation, rather than regression reduction.

Auto-Improvement Loop

  • Over 15 iterations, 4 changes were accepted, raising resolution from 12% to 60% with 0% regression.
  • The most impactful change was simplifying the SKILL.md from 107 to 20 lines, confirming that smaller models benefit more from contextual information than procedural guidance.
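The accept/reject policy of such a loop can be sketched as follows. The `evaluate` and `propose` functions here are toy stand-ins for a benchmark run and a change proposal, not the paper's actual harness; the acceptance rule (keep a change only if resolution improves and regression stays at zero) is the part that mirrors the described loop.

```python
def improve(evaluate, propose, config, iterations=15):
    """Greedy auto-improvement: accept a proposed change only if it raises
    resolution without introducing any regression."""
    best_score = evaluate(config)[0]
    accepted = 0
    for _ in range(iterations):
        candidate = propose(config)
        resolution, regression = evaluate(candidate)
        if regression == 0.0 and resolution > best_score:
            config, best_score = candidate, resolution
            accepted += 1
    return config, best_score, accepted

# Toy example: each proposal shortens the skill file, and shorter files
# score better down to a 20-line floor (echoing the 107 -> 20 line finding).
propose = lambda cfg: {"skill_lines": max(20, cfg["skill_lines"] - 30)}
evaluate = lambda cfg: (1.0 - cfg["skill_lines"] / 200, 0.0)
cfg, score, accepted = improve(evaluate, propose, {"skill_lines": 107}, iterations=5)
print(cfg, accepted)  # → {'skill_lines': 20} 3
```

Once proposals stop improving resolution, the loop rejects them, so the configuration settles at the 20-line floor rather than degrading further.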

Interpretation

The results suggest that for AI agent tool design, surfacing contextual information (e.g., which tests to verify) outperforms prescribing procedural workflows (e.g., how to do TDD). Smaller models struggle to effectively use verbose prompts, but can leverage lightweight, fact-based guidance.

The paper advocates for elevating regression rate alongside resolution rate as a first-class metric in AI coding benchmarks, to avoid incentivizing agents to introduce regressions in pursuit of a higher fix rate.

Limitations & Uncertainties

  • Scale and statistical power: 100 and 25 instances are practical but small samples.
  • Language generalization: The work is limited to Python, requiring language-specific parsing.
  • Dynamic analysis: The static graph cannot capture dynamic dispatch or runtime-generated code.
  • Auto-loop scale: Evaluated on only 10 instances per iteration.

What Comes Next

  • Extend TDAD to support multiple languages via Tree-sitter.
  • Integrate dynamic coverage data to improve impact precision.
  • Evaluate with frontier models (e.g., Claude Opus 4.6, GPT-5.4) to test whether the TDD prompting paradox persists at larger scale.
  • Advocate for the community to adopt composite metrics that jointly capture resolution and regression.
