Dev diary - April 2, 2026

AI test automation: Why context matters more than prompts


AI test automation promises faster and more efficient testing, but in practice, results often depend on one critical factor: context.

Generating automated tests with AI sounds straightforward: describe a scenario, run a prompt, and receive a ready-to-use test. In reality, the process rarely works that smoothly. The quality of AI-generated tests depends heavily on context, and in many real-world environments, that context is incomplete, fragmented, or entirely missing.

This Dev Diary explores a practical challenge: how to generate reliable test automation using AI when the available context is limited, especially in black-box testing scenarios.

AI test automation in two different realities

From practical experience, there are two fundamentally different realities when working with automated tests.

Understanding this distinction is critical, because it directly determines how much context is available to AI and how reliable the generated tests will be.

  1. In the first scenario, testers work closely with developers and have direct access to the application code. Tests are stored in the same repository as the application, and there is clear visibility into components, structure, and internal logic. In this environment, generating tests with AI becomes much easier because the relationship between the user interface and the underlying implementation is transparent. The model can infer structure, reuse patterns, and align with existing conventions.
  2. In the second scenario, the situation looks very different. Tests are separated from the application and maintained in an independent repository without access to source code. This architecture is common in regulated environments, vendor-managed systems, or enterprise setups where test automation must remain decoupled from development. From an AI perspective, however, this separation creates a serious limitation. There is no direct mapping between UI elements and application logic, no component hierarchy, and no reliable way to infer behavior from code.

In the second scenario, the only available inputs are typically:

  1. existing automated tests
  2. partial specifications
  3. screenshots or recordings
  4. the prompt itself

That combination is rarely sufficient to produce reliable automation on the first attempt.

Why prompts are not enough

In a black-box environment, everything depends on how well the test scenario is described. This reflects a broader challenge in working with AI, where relying solely on prompts often leads to inconsistent results, as discussed in AI Monthly Insights #3. Even when the prompt is carefully written, it cannot fully capture real user behavior. It cannot describe every interaction, timing nuance, visual dependency, or edge case that emerges during normal usage. Most importantly, prompts often fail to capture the reasoning process of the tester while navigating the system.

This gap between intent and execution is where AI-generated tests begin to lose reliability. The model may produce syntactically correct code, but the resulting test can be fragile, incomplete, or misaligned with real workflows — a limitation also visible when working with AI coding assistants in practice.

Why AI test automation fails without context

Instead of trying to write increasingly detailed prompts, the focus shifted to a different question: what if context could be captured automatically rather than manually described?

This idea led to building a small experimental tool called KiwiGen. You can check it out on our GitHub: https://github.com/hotovo/kiwigen

A more technical look at KiwiGen

KiwiGen is designed as a lightweight interaction recorder that observes how a tester works with a web application and converts those interactions into structured, machine-readable data. The goal is not to generate tests directly, but to create a high-quality contextual layer that can later be consumed by AI models or test automation frameworks.


At its core, KiwiGen listens to browser-level events and records user actions such as navigation, clicks, text input, selections, and assertions. These interactions are captured in a deterministic sequence that reflects the exact execution order of the workflow. Each recorded step contains metadata describing the target element, the action performed, and the relevant state of the application at that moment.
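To make the idea concrete, here is a minimal sketch of what such a recorded trace could look like. The field names and the `InteractionRecorder` class are illustrative assumptions, not KiwiGen's actual schema:

```typescript
// Illustrative model of one atomic recorded interaction.
// Field names are assumptions, not the real KiwiGen format.
type Action = "navigate" | "click" | "input" | "select" | "assert";

interface RecordedStep {
  index: number;     // position in the deterministic execution order
  action: Action;
  selector: string;  // semantic reference to the target element
  value?: string;    // typed text, chosen option, or expected value
  url: string;       // page the action happened on
}

class InteractionRecorder {
  private steps: RecordedStep[] = [];

  // Append a step, preserving the exact order of execution.
  record(action: Action, selector: string, url: string, value?: string): RecordedStep {
    const step: RecordedStep = { index: this.steps.length, action, selector, url, value };
    this.steps.push(step);
    return step;
  }

  // The trace is serialized as plain JSON so it stays inspectable.
  toJSON(): string {
    return JSON.stringify(this.steps, null, 2);
  }
}

const recorder = new InteractionRecorder();
recorder.record("navigate", "", "https://example.test/login");
recorder.record("input", "#username", "https://example.test/login", "alice");
recorder.record("click", "button[type=submit]", "https://example.test/login");
console.log(recorder.toJSON());
```

In a real recorder, `record` would be driven by browser-level event listeners; the sketch only shows the resulting data shape and ordering guarantee.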

The recording process intentionally focuses on stability and reproducibility. Instead of storing raw coordinates or visual snapshots, KiwiGen resolves elements using selectors and semantic identifiers. This design decision ensures that the recorded interactions remain usable even when the UI layout changes slightly. The system prioritizes maintainable selectors and consistent references over transient visual data.
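A selector-resolution strategy along these lines can be sketched as a simple priority list. The attribute priorities below (test id, then id, then an accessible label, then the tag) are an assumption for illustration, not KiwiGen's actual heuristics:

```typescript
// Sketch of selector resolution that prefers stable, semantic identifiers
// over positional or visual data. Priorities here are assumptions.
interface ElementInfo {
  tag: string;
  id?: string;
  testId?: string;     // e.g. a data-testid attribute
  ariaLabel?: string;
}

function resolveSelector(el: ElementInfo): string {
  if (el.testId) return `[data-testid="${el.testId}"]`;          // most stable
  if (el.id) return `#${el.id}`;
  if (el.ariaLabel) return `${el.tag}[aria-label="${el.ariaLabel}"]`;
  return el.tag;                                                  // last resort
}

console.log(resolveSelector({ tag: "button", testId: "login-submit" }));
// [data-testid="login-submit"]
```

Because the resolved reference is semantic rather than positional, a moved button or resized layout does not invalidate the recorded step.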

Another important aspect of the design is transparency. Every recorded interaction is stored in a structured format that can be inspected, modified, or replayed. The output is not hidden behind proprietary abstractions. Testers and developers can review the recorded steps, understand the sequence of actions, and refine the workflow before passing it to downstream tools.

Why interactions are recorded this way

The structure of the recorded data is intentionally simple and explicit. Each step represents a single atomic interaction with the application. This approach provides several practical advantages:

1. It preserves the chronological flow of the scenario. AI models perform significantly better when they receive step-by-step context rather than a summarized description of the workflow. The recorded sequence acts as a reliable execution trace that eliminates ambiguity.

2. It enables deterministic replay. Because each interaction is clearly defined, the same sequence can be executed repeatedly with predictable results. This property is essential for automated testing, where consistency is more important than speed.

3. It keeps the system framework-agnostic. By separating interaction recording from test execution logic, KiwiGen avoids tight coupling with any specific test automation framework.
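The second and third properties can be sketched together: a replay engine that walks the trace in order and dispatches each step to a pluggable handler. The handler map is what keeps the design framework-agnostic; the `dryRun` backend below is a hypothetical stand-in for a real Playwright or Selenium adapter:

```typescript
// Sketch of deterministic, framework-agnostic replay. Actions are
// dispatched to pluggable handlers, so no framework is baked in.
type Handler = (selector: string, value?: string) => void;

interface Step { action: string; selector: string; value?: string }

function replay(steps: Step[], handlers: Record<string, Handler>): string[] {
  const log: string[] = [];
  for (const step of steps) {
    const handler = handlers[step.action];
    if (!handler) throw new Error(`No handler for action: ${step.action}`);
    handler(step.selector, step.value);
    log.push(`${step.action} ${step.selector}`);
  }
  return log; // the same trace always yields the same execution order
}

// A trivial "dry run" backend that performs no real browser work;
// a production backend would call into Playwright, Selenium, etc.
const dryRun: Record<string, Handler> = {
  input: () => {},
  click: () => {},
};

const trace: Step[] = [
  { action: "input", selector: "#username", value: "alice" },
  { action: "click", selector: "button[type=submit]" },
];
console.log(replay(trace, dryRun));
```

Swapping `dryRun` for a real adapter changes the execution backend without touching the recorded trace.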

The role of voice commentary

One of the most distinctive features of KiwiGen is the integration of voice commentary during recording sessions. While the tester interacts with the application, they can verbally describe their intent, expectations, and observations. This commentary is captured alongside the interaction data and later transcribed into structured text.

Voice commentary plays a critical role in bridging the gap between technical actions and testing intent. A click alone does not explain why the action was performed. A typed value does not indicate what condition is being verified. By attaching spoken explanations to individual steps, the system captures the reasoning behind the workflow.

For AI models, this additional context dramatically improves interpretation accuracy. Instead of guessing the purpose of an interaction, the model receives explicit semantic guidance. The result is more meaningful assertions, better validation logic, and fewer incorrect assumptions.
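One plausible way to attach transcribed commentary to steps is by timestamp proximity: each step picks up the utterance spoken closest to it in time. This pairing heuristic is an assumption for illustration, not necessarily how KiwiGen aligns the two streams:

```typescript
// Sketch: pair transcribed voice utterances with recorded steps by
// timestamp proximity. The heuristic is an assumption, not KiwiGen's.
interface Step { index: number; timestamp: number; note?: string }
interface Utterance { timestamp: number; text: string }

function attachCommentary(steps: Step[], utterances: Utterance[]): Step[] {
  return steps.map(step => {
    // Pick the utterance closest in time to this step, if any exists.
    let best: Utterance | undefined;
    for (const u of utterances) {
      if (!best || Math.abs(u.timestamp - step.timestamp) < Math.abs(best.timestamp - step.timestamp)) {
        best = u;
      }
    }
    return best ? { ...step, note: best.text } : step;
  });
}

const annotated = attachCommentary(
  [{ index: 0, timestamp: 1000 }, { index: 1, timestamp: 5000 }],
  [
    { timestamp: 1200, text: "open the login page" },
    { timestamp: 4800, text: "verify the error banner" },
  ],
);
console.log(annotated);
```

The resulting `note` on each step is exactly the semantic guidance the model would otherwise have to guess.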

From recording to AI test automation

After the recording session is complete, the collected interaction data and voice annotations are combined into a structured scenario representation. This representation serves as an intermediate artifact between manual exploration and automated execution.

At this stage, the data can be processed in multiple ways. It can be provided directly to an AI model as contextual input for test generation, converted into reusable test steps, or integrated into an existing automation pipeline. The system does not assume a specific workflow or toolchain.

Most importantly, KiwiGen produces output that is framework-agnostic: it is not tied to any particular test automation framework.

This is not just a convenience feature — it is a deliberate architectural decision. The recorded scenario is stored in a neutral format that can be translated into Playwright, Cypress, Selenium, Robot Framework, or any other automation technology. This flexibility allows teams to adopt the tool without restructuring their existing infrastructure.
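As a rough illustration of such a translation step, here is a minimal code generator that turns a neutral trace into a Playwright test. This is a simplified sketch, not KiwiGen's actual exporter, and it covers only three action types:

```typescript
// Sketch: translate a neutral recorded trace into Playwright test code.
// Minimal generator for illustration; not KiwiGen's real exporter.
interface Step {
  action: "navigate" | "click" | "input";
  selector: string;  // for "navigate", the target URL
  value?: string;
}

function emit(step: Step): string {
  switch (step.action) {
    case "navigate": return `  await page.goto(${JSON.stringify(step.selector)});`;
    case "click":    return `  await page.click(${JSON.stringify(step.selector)});`;
    case "input":    return `  await page.fill(${JSON.stringify(step.selector)}, ${JSON.stringify(step.value ?? "")});`;
  }
}

function toPlaywright(name: string, steps: Step[]): string {
  return [
    `test(${JSON.stringify(name)}, async ({ page }) => {`,
    ...steps.map(emit),
    `});`,
  ].join("\n");
}

const trace: Step[] = [
  { action: "navigate", selector: "https://example.test/login" },
  { action: "input", selector: "#username", value: "alice" },
  { action: "click", selector: "button[type=submit]" },
];
console.log(toPlaywright("login flow", trace));
```

A Cypress or Selenium exporter would reuse the same trace and only swap the `emit` function, which is the point of keeping the intermediate format neutral.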

Why this matters in real projects

The primary benefit of this approach is not speed, but consistency. When AI operates with richer context, the output becomes more predictable and easier to maintain. There is less guesswork, fewer missing steps, and fewer unexpected failures.

This advantage becomes especially valuable in environments where:

  1. access to source code is restricted
  2. documentation is incomplete
  3. existing test suites provide limited guidance

Instead of relying on fragmented information, the system provides a unified representation of real user behavior.

The role of existing test suites

This method becomes even more powerful when combined with existing automated tests. Recorded workflows can be compared with established patterns, reused as templates, or integrated into regression suites.

Instead of starting from zero for every new scenario, teams can build on previous experience. This leads to improved consistency, better structure, and easier long-term maintenance.

A shift in how we think about AI in testing

AI is often treated as a tool that generates output from a prompt. In practice, the real value lies in how well the input is structured — something that reflects a broader principle that AI integrations are not magic, but engineering.

In testing, this means capturing real workflows instead of attempting to describe them perfectly. The closer the input reflects actual behavior, the more dependable the output becomes.

Final Thoughts

Generating test automation with AI is not primarily a tooling problem. It is a context problem. When context is weak, results are inconsistent. When context improves, the entire workflow becomes more stable and predictable.

KiwiGen represents one practical step toward solving that problem.

Author
Jozef Kováč

After transitioning from the world of pharmacy to software engineering, my passion for building things became my profession—and something I enjoy every day. As a test automation specialist, I take on challenges that many developers prefer to avoid (yes, writing tests). I constantly look beyond conventional tools and explore new paths to help deliver better, more reliable, and defect-free software. Many ideas for planning and designing ongoing projects come to me during runs in the local park, so when I return to the keyboard, I can focus on turning those ideas into reality.
