Tech corner - 29. May 2026

How AI Orchestration increased test coverage


Most teams know technical debt is expensive. The challenge is finding time to fix it without slowing down feature delivery. This article looks at how AI orchestration helped modernize a legacy test suite, improve coverage, and reduce execution times across a 15-year-old codebase.  

Imagine a product in active development for over 15 years, with a team constantly under pressure to ship at breakneck speed, sacrificing just a little bit of test coverage here and there over the years. The result? A feature-packed, complex monolith with over 330,000 lines of production code... and 15% line coverage.

But wait, it gets better. Not only is the coverage a disaster zone, but the existing 4,000 or so tests are also written in JUnit 4 with heavy PowerMock usage — a framework you really do not want to maintain in 2026. And to make matters worse, running those 4,000 tests takes 90 minutes. Sounds like fun, right?

So you have three debts stacked on top of each other:

  1. a coverage debt that would take an estimated 97–145 FTE-months to clear manually,
  2. a framework debt with 540 test files still on JUnit 4 and 239 using PowerMock,
  3. and a performance debt that makes automated test generation painfully slow to iterate on.

Even if you just handed every developer an AI coding assistant and told them to focus on tests, you are still looking at 25–40 FTE-months of effort. That is years of work competing directly with feature development, and no engineering organisation is going to accept that trade-off. You simply cannot stop shipping features for two years to pay back test debt.

This is exactly the kind of problem that AI orchestration is built for. Let me walk you through how we solved all three debts simultaneously in 33 days — without blocking the development team for a single day.

Why AI coding assistants alone are not enough 

The naive approach is tempting: spin up an agentic coding tool, point it at the lowest-coverage modules, and let it run.

Anyone who has tried this at scale runs into the problems quickly. Models have limited context. They lose track of progress. They need consistent, careful prompting for every class. They'll happily generate a test that compiles and passes while asserting nothing meaningful. And an interactive agent only works while someone is driving it: unless you happen to employ developers who enjoy working at 3am, that's when it stops.

The real solution is a programmatic orchestration harness — a system that strictly controls the sequence of steps, handles failures gracefully, verifies results through actual Maven test execution rather than trusting the model's judgment, and runs 24/7 without anyone watching over it. That's what we built.

The orchestrator is custom-built, with opencode as the underlying coding agent. One of the key design decisions was making it model-agnostic from the start. We began with GPT-5.4 for implementation and upgraded to GPT-5.5 mid-campaign when it became available, without interrupting the run. Low-complexity tasks use GPT-5.4 mini to keep token costs in check.
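To make the model-agnostic idea concrete, here is a minimal sketch of a routing table. The class, its field names, and the phase keys are hypothetical; only the model labels come from the text above, and the real orchestrator may be configured quite differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelRouting:
    """Hypothetical routing table: which model label serves which kind of task."""
    implementation: str = "GPT-5.4"
    low_complexity: str = "GPT-5.4 mini"
    reviewers: tuple = ("GPT-5.5", "Claude Sonnet 4.6")

    def models_for(self, phase: str, low_complexity: bool = False) -> list:
        """Return the model label(s) to use for a phase."""
        if phase == "review":
            return list(self.reviewers)
        if phase == "implementation":
            return [self.low_complexity if low_complexity else self.implementation]
        raise ValueError(f"unknown phase: {phase}")


routing = ModelRouting()
# Swapping the implementation model is a config change, not a code change.
upgraded = replace(routing, implementation="GPT-5.5")
```

Because the rest of the harness only asks the routing table for a label, an upgrade like the one described above would not need to touch the pipeline logic.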

For the review phase, we run two models in parallel: GPT-5.5 and Claude Sonnet 4.6. This was a deliberate bet on model diversity — different models have genuinely different strengths, and running them independently on the same code surfaces more categories of problems than a single model reviewing its own output. That bet paid off in practice, with the two models consistently flagging different things.

How the AI Orchestration pipeline works

Figure: How the orchestration pipeline works

The orchestrator runs a four-phase loop, module by module, until coverage targets are met. We deliberately did not attempt all 50 modules at once — keeping things scoped to one module at a time keeps context manageable, builds contained, and failures recoverable. One particularly large module we had to slice further by package.

The first phase is target selection. The orchestrator parses JaCoCo coverage reports and prioritizes the lowest-coverage classes. Critically, it excludes low-value targets entirely — generated code, DTOs, getter/setter-only files, constants, wrappers. We only targeted lines that were actually worth covering. This matters when you look at the final numbers.
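As an illustration of target selection, the sketch below reads class-level LINE counters from a JaCoCo XML report and ranks classes by coverage, lowest first. The exclusion rules (name suffixes and a `/generated/` path segment) are assumptions made for the example; the article does not say how low-value classes were actually detected.

```python
import xml.etree.ElementTree as ET

# Hypothetical exclusion heuristics, not the rules used in the campaign.
EXCLUDED_SUFFIXES = ("Dto", "DTO", "Constants")
EXCLUDED_PATH_PARTS = ("/generated/",)


def line_counter(cls):
    """Return (missed, covered) from the class-level LINE counter, or None."""
    # findall("counter") only sees direct children, so per-method counters are skipped.
    for counter in cls.findall("counter"):
        if counter.get("type") == "LINE":
            return int(counter.get("missed")), int(counter.get("covered"))
    return None


def select_targets(report_xml, limit=10):
    """Rank classes by line coverage, lowest first, skipping excluded classes."""
    root = ET.fromstring(report_xml)
    ranked = []
    for cls in root.iter("class"):
        name = cls.get("name", "")
        if name.endswith(EXCLUDED_SUFFIXES):
            continue
        if any(part in f"/{name}" for part in EXCLUDED_PATH_PARTS):
            continue
        counts = line_counter(cls)
        if counts is None or sum(counts) == 0:
            continue
        missed, covered = counts
        # Ties on ratio are broken by the number of missed lines (more missed first).
        ranked.append((covered / (missed + covered), -missed, name))
    ranked.sort()
    return [(name, round(ratio, 2)) for ratio, _, name in ranked[:limit]]


SAMPLE = """<report name="demo">
  <package name="com/acme">
    <class name="com/acme/OrderService" sourcefilename="OrderService.java">
      <method name="place" desc="()V" line="10">
        <counter type="LINE" missed="1" covered="0"/>
      </method>
      <counter type="LINE" missed="90" covered="10"/>
    </class>
    <class name="com/acme/OrderDto" sourcefilename="OrderDto.java">
      <counter type="LINE" missed="20" covered="0"/>
    </class>
    <class name="com/acme/PriceCalculator" sourcefilename="PriceCalculator.java">
      <counter type="LINE" missed="30" covered="30"/>
    </class>
  </package>
</report>"""

print(select_targets(SAMPLE))
# [('com/acme/OrderService', 0.1), ('com/acme/PriceCalculator', 0.5)]
```

Note that `OrderDto` has the worst raw coverage in the sample but never appears in the result, which is the point of excluding low-value targets before ranking.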

In the implementation phase, the agent generates or extends test files in test-only mode. Production source is never touched during this phase. Commits only happen on green builds — failed runs get discarded, not accumulated into a mess you have to untangle later.
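The commit-on-green rule can be sketched as a small gate around Maven and Git. This is an assumption-laden illustration, not the harness itself: the function names are hypothetical, it assumes the standard Maven layout (`<module>/src/test`), and it assumes the module's working tree was clean before generation started.

```python
import subprocess
from typing import Callable, Sequence

# A runner executes a command and returns its exit code; injectable for dry runs.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    return subprocess.run(cmd, check=False).returncode


def commit_if_green(module: str, message: str, runner: Runner = run) -> bool:
    """Commit generated tests only if the module's Maven test run succeeds."""
    test_dir = f"{module}/src/test"  # assumption: standard Maven layout
    # Verify with a real build instead of trusting the model's own claim.
    if runner(["mvn", "-q", "-pl", module, "test"]) == 0:
        runner(["git", "add", "--", test_dir])  # stage test sources only
        runner(["git", "commit", "-m", message])
        return True
    # Red build: discard the attempt instead of accumulating broken state.
    runner(["git", "checkout", "--", module])      # restore modified tracked files
    runner(["git", "clean", "-fd", "--", module])  # delete newly created files
    return False
```

Staging only the test directory is one simple way to enforce test-only mode at commit time; a stricter harness might also fail the run if anything outside that directory changed.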

Then comes the part that surprises most people. The review phase runs two independent models in parallel — one focused on structural quality and mock correctness, the other on assertion semantics. Findings are severity-ranked, and a separate remediation pass automatically fixes everything above the severity threshold before the branch is finalized.
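One way to picture the merge step is the sketch below, which combines findings from two reviewers, keeps the more severe entry when both flag the same location, and returns what the remediation pass should fix. The data model, the three severity levels, and the choice to treat the threshold as inclusive are all assumptions for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: Severity
    message: str
    reviewer: str


def to_remediate(findings_a, findings_b, threshold=Severity.MEDIUM):
    """Merge both reviewers' findings, rank by severity, keep those at or above the threshold."""
    merged = {}
    for finding in [*findings_a, *findings_b]:
        key = (finding.file, finding.line)
        # If both reviewers flag the same location, keep the more severe finding.
        if key not in merged or finding.severity > merged[key].severity:
            merged[key] = finding
    ranked = sorted(merged.values(), key=lambda f: (-f.severity, f.file, f.line))
    return [f for f in ranked if f.severity >= threshold]
```

Deduplicating by file and line is deliberately crude; two models rarely describe the same problem in the same words, so a real harness would likely need a fuzzier notion of "same finding".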

By the time a PR reaches a developer, it has already been reviewed by two AI models and remediated by a third pass. Developers see clean, already-reviewed code — not raw AI output. This is what makes the human review time so manageable in practice.

The review phase also examines the production code associated with each test, not just the tests themselves. It's instructed not to touch production source unless it finds a confirmed high-severity bug — because sometimes features depend on bugs, and you do not want an AI quietly "fixing" behavior the rest of the system relies on. In practice, several latent bugs surfaced that had gone undetected — most were flagged for the development team to investigate, and a few high-severity ones were addressed directly.
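Putting the phases together, the per-module loop might look roughly like the sketch below. It treats remediation as the fourth phase, which is a reading of the description above rather than something the article states outright, and every callable is a hypothetical stand-in for the real phase implementation.

```python
def run_module(module, target_coverage, measure, select_targets,
               implement, review, remediate, max_rounds=20):
    """Repeat select -> implement -> review -> remediate until the target is met."""
    rounds = 0
    while measure(module) < target_coverage and rounds < max_rounds:
        targets = select_targets(module)
        if not targets:
            break  # nothing worth covering is left in this module
        for target in targets:
            # implement() is expected to return True only after a green, committed build.
            if implement(target):
                findings = review(target)
                if findings:
                    remediate(target, findings)
        rounds += 1
    return measure(module)
```

The `max_rounds` guard is an added safety net for unattended runs: if a module stops improving, the loop ends and the module can be flagged for a human instead of burning tokens indefinitely.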

The results

Figure: Coverage growth throughout the campaign (axis: lines of code, thousands)

Let's talk about what actually happened across the campaign.

On coverage: we went from 15% to 84%, which is 172,000 covered lines across a 50-module Maven monorepo, exceeding the original 80% target. Total API cost landed at $4,000. Human time invested was around 4 weeks in total: one week of calibration upfront to tune the workflow and the agent skill, one week driving the orchestration suite, and around two weeks of distributed review, heavily front-loaded and tapering off as the system stabilized.

And that 84% is not padded. Because we excluded DTOs, generated classes, and trivial files from the start, every percentage point represents coverage of code that actually warranted testing. We did not inflate the number by going after easy wins.

Figure: Results after 33 days

On framework modernisation: all 540 files still importing JUnit 4 were migrated to JUnit 5 and Mockito, including all 239 PowerMock files. Migrating away from PowerMock is not mechanical. Replacing static mocking patterns requires understanding what the test was actually trying to verify before you can replace it correctly. This required genuine iteration on the agent skill, and that skill kept improving as it was also used in the review passes throughout the campaign.

On test suite performance: the suite went from roughly 4,000 tests running in 90 minutes to roughly 24,000 tests running in 40 minutes. That is an order-of-magnitude gain in throughput, from about 44 tests per minute to about 600, or roughly 13×. This work happened in parallel with the early phase of the campaign, AI-driven but manually orchestrated rather than part of the main harness. Without it, the iterative generation and verification loop would have been too slow to run safely at this scale.

How we validated AI-generated tests 

This is the question every experienced engineer will ask when they hear these numbers, and it is the right question to ask. So let's address it directly.

Any successful AI orchestration initiative must include mechanisms for validating generated code rather than relying on model output alone.

Before we finished calibrating the agent skill, roughly 5% of generated tests had issues flagged during the review phase. After calibration, that number dropped to effectively zero — not because the model stopped making mistakes, but because the review-and-remediation loop catches and fixes them before any human sees the output. That is the whole point of the self-correcting architecture.

All generated tests were also validated with Semgrep static analysis. Any valid findings were automatically fixed by the remediation pass — the same loop that handles AI code review findings. Nothing reached a developer with a static analysis issue still open.
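As a rough sketch of how static analysis output could feed the same remediation loop, the function below filters the JSON that Semgrep emits when run with `--json`. It assumes the commonly documented result shape (`results[].check_id`, `path`, `start.line`, `extra.severity`) and the INFO/WARNING/ERROR severity labels; the function name and the default threshold are hypothetical.

```python
import json

SEVERITY_ORDER = {"INFO": 0, "WARNING": 1, "ERROR": 2}


def open_findings(semgrep_json: str, min_severity: str = "WARNING"):
    """Return (path, line, rule id, severity) for findings at or above min_severity."""
    results = json.loads(semgrep_json).get("results", [])
    picked = []
    for result in results:
        severity = result.get("extra", {}).get("severity", "")
        # Unknown severity labels are kept rather than silently dropped.
        rank = SEVERITY_ORDER.get(severity, SEVERITY_ORDER["ERROR"])
        if rank >= SEVERITY_ORDER[min_severity]:
            picked.append((result["path"], result["start"]["line"],
                           result["check_id"], severity))
    return picked
```

A branch would then count as ready for human review only once this list is empty after remediation, or once every remaining entry has been judged a false positive.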

Developer review confirmed the quality. A few of the new tests went further — they caught real bugs in production code that had been sitting there undetected. Tests that legitimately failed because the implementation had a problem, not the test. That is the difference between coverage for its own sake and coverage that actually means something.

What was actually hard

A 15-year-old monolith accumulates complexity in ways that are not obvious from the outside. The biggest challenges were about calibration, not capability.

Balancing review thoroughness with efficiency was the central engineering challenge of the first week. A review pass that is too aggressive becomes a bottleneck; one that is too lenient lets problems slip through to developers. Getting the prompts, the severity thresholds, and the remediation boundaries right required real iteration — and that calibration is what most of the first week was spent on.

The PowerMock migration was the other significant challenge. 239 files is not a footnote, and PowerMock patterns are varied enough that early versions of the agent skill handled some of them poorly. The skill got substantially better over the course of the campaign, including through being used and refined in the review phases.

And then there is just the reality of a large multi-module monorepo with shared test infrastructure, Testcontainers, database-backed tests, and module-specific Maven configurations. The orchestrator had to handle all of this gracefully without cross-contaminating modules or leaving the repository in a broken state.

The actual comparison

Figure: The economics of test debt repayment

The point is not that AI writes good tests — though it does, when properly directed and verified. The point is that AI orchestration changes the nature of the problem entirely.

Without an orchestrator, you are still a human in the loop for every class, every module, every review cycle. At 210,000 coverable lines, that is still years of work regardless of how good your AI assistant is. With an orchestrator, your role shifts to calibration, milestone review, and handling the edge cases the system surfaces. Everything else runs around the clock, verifies its own output, and does not need nudging.

Three debts, 33 days. A modernised test foundation, 140,000 newly covered lines of high-value code, and a test suite that runs faster with ~6× as many tests as it had before. The development team kept shipping features throughout. That is what AI orchestration makes possible, and we are just getting started with what this kind of workflow can do for teams carrying years of technical debt they had resigned themselves to living with.

Key Takeaways

  1. AI orchestration can solve large-scale testing challenges that individual coding assistants cannot.
  2. Automated review and remediation workflows improve test quality significantly.
  3. Legacy test debt can be reduced without stopping feature development.
  4. Human oversight remains critical, but the effort shifts from implementation to calibration.

Author: Lukáš Chmelař

I got into software development at 15 by building websites for other people, and that early curiosity has been turning into working software ever since. My role is a healthy mix of architecture, development, team leadership, and business analysis—which mostly means I enjoy variety and have accepted context switching as a lifestyle. I like solving different kinds of problems, especially the messy ones where technology, people, and ideas all meet. Lately, I’ve been focused on helping teams become AI-native: adopting new tools and workflows in practical ways, building smarter, moving faster, and dealing with the kind of technical debt everyone knows exists, but nobody is excited to touch.
