Engineering · May 26, 2026 · 7 min read · by Vijay Javvadi

How our AI Defect Analysis actually works under the hood

A look at the deterministic rules layer + LLM fallback that turns a failing test into a four-section plain-English narrative — and why we built it this way.

The most common reaction we get from new users of TestForge AI isn't to the test generation, or to the self-healing locators, or even to the per-step visual regression. The thing that consistently makes people stop and look more closely is the Defect Analysis panel — the four-section plain-English explanation that appears next to every failing test.

This post walks through how it actually works under the hood, why we architected it this way, and what we learned tuning it for production.

What you see in the UI

When a test fails, the Defect Analysis tab shows something like this:

What was expected: The submit button should be enabled after the form is filled out.

What happened: The submit button was found disabled even after the form was filled out, as reported at 'CheckoutFlow.spec.ts:88'.

Most likely causes: A required field is not being filled or recognised, leaving form validation incomplete. A timing issue means the enabled-state check runs before the form's validation logic fires. A recent code change broke the enable/disable logic tied to form completion.

Suggested fix: File a bug because the submit button remains disabled after all form fields are filled, and investigate 'CheckoutFlow.spec.ts:88' to confirm all required fields are populated and validation has settled before the assertion runs.

Above that, you see a category pill (real-defect, flake, environmental, infrastructure, expected-failure) with a confidence percentage. Below it, the per-step screenshot at the moment of failure, the matched-pattern chips, and — if the verdict is real-defect — a draft Jira ticket ready to paste.

That whole panel comes from a separate microservice on the platform called the Defect Analysis service. It has two distinct layers.

Layer one: the deterministic rules engine

The first thing the service does with every failure is route it through a rules-based classifier. The rules engine has five categories — infrastructure, environmental, flake, real-defect, expected-failure — and a signature library of regex patterns that match each. It also takes signals like how many times the test was already retried, how long it took to fail, and whether it's tagged as a known/expected failure.

The rules engine is fast, free, and accurate for the common cases. Most production regression failures are repeats of patterns the team has seen before (a timeout, a connection refused, a 500 from a downstream service), and the rules engine handles them in milliseconds.

Crucially, the rules engine also produces a confidence score. A failure that matches multiple signatures gets a high confidence; a failure that matches none gets the lowest possible confidence and is defaulted to real-defect so the team errs on the side of investigating rather than auto-dismissing.
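As a rough illustration of this layer (a sketch under assumptions, not the production implementation), a signature-based classifier with a confidence score might look like the following. The regex patterns, the `Verdict` type, the `classify` function, and the scoring formula are all hypothetical. The retry-count and time-to-fail signals mentioned above are left out for brevity, and only three of the five categories have signatures here.

```python
import re
from dataclasses import dataclass, field

# Hypothetical signature library; the real patterns are not shown in this post.
SIGNATURES = {
    "infrastructure": [r"connection refused", r"ECONNRESET", r"\b50[0234]\b"],
    "environmental": [r"ENOTFOUND", r"certificate has expired"],
    "flake": [r"timed? ?out", r"not attached to the DOM"],
}


@dataclass
class Verdict:
    category: str
    confidence: float
    matched: list = field(default_factory=list)


def classify(message: str, expected_failure: bool = False) -> Verdict:
    # A test tagged as a known/expected failure short-circuits everything else.
    if expected_failure:
        return Verdict("expected-failure", 1.0, ["tag:expected-failure"])

    hits = {}
    for category, patterns in SIGNATURES.items():
        matched = [p for p in patterns if re.search(p, message, re.IGNORECASE)]
        if matched:
            hits[category] = matched

    # No signature matched: lowest confidence, default to real-defect.
    if not hits:
        return Verdict("real-defect", 0.0)

    # Pick the category with the most matching signatures.
    category = max(hits, key=lambda c: len(hits[c]))
    # Illustrative scoring only: more matching signatures, higher confidence.
    confidence = min(1.0, 0.5 + 0.25 * len(hits[category]))
    return Verdict(category, confidence, hits[category])
```

With these made-up patterns, a message such as "Connection refused while calling payments API (HTTP 503)" matches two infrastructure signatures and scores high, while an assertion message that matches nothing falls through to real-defect with a confidence of zero.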

Layer two: the LLM analyst

When the rules engine's confidence drops below a threshold, the failure is escalated to the LLM analyst — built on Anthropic's Claude Sonnet. The LLM does three jobs sequentially:

  • Classify borderline cases, where the rules engine wasn't sure which category applied. The LLM reads the failure message, stack trace, retry count, and step text, and returns a category plus rationale.
  • Compose the user-facing narrative. Regardless of category, the LLM writes the four-section explanation you see in the UI. The prompt explicitly asks the model to quote concrete values (URLs, selectors, error messages) verbatim rather than paraphrasing them — this is what makes the narrative trustworthy.
  • Draft a Jira ticket, but only if the final verdict is real-defect. The draft includes a summary, Observed/Expected sections, repro steps, and a Suspected Root Cause section.

Each of these is a separate LLM call with its own prompt version and its own cache. A repeat failure with the same fingerprint hits the cache and skips the LLM entirely, which is why running the same flaky test five times in a day costs us one round of LLM calls, not five.
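The escalation check and the fingerprint cache can be sketched roughly as follows. Everything here is an assumption made for illustration: the threshold value, the way the fingerprint is computed (the post does not say which fields go into it or how they are normalised), and the names `needs_llm`, `fingerprint`, and `CachedAnalyst`. The `llm_call` argument stands in for whatever function actually calls the model.

```python
import hashlib
import re

ESCALATION_THRESHOLD = 0.6  # hypothetical value; the real threshold is not stated


def needs_llm(rules_confidence: float) -> bool:
    # Escalate only when the rules engine is not confident enough.
    return rules_confidence < ESCALATION_THRESHOLD


def fingerprint(project: str, test_id: str, message: str) -> str:
    # Assumed normalisation: replace hex ids and numbers so that repeats of
    # the same failure with different timings map to the same fingerprint.
    normalised = re.sub(r"0x[0-9a-f]+|\d+", "#", message.lower())
    raw = "\x1f".join([project, test_id, normalised])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class CachedAnalyst:
    """One instance per LLM job, each with its own prompt version and cache."""

    def __init__(self, llm_call, prompt_version: str):
        self._llm_call = llm_call
        self._prompt_version = prompt_version
        self._cache = {}
        self.llm_calls = 0

    def analyse(self, fp: str, payload: dict):
        # Keying on the prompt version means a prompt change invalidates old entries.
        key = (self._prompt_version, fp)
        if key not in self._cache:
            self.llm_calls += 1
            self._cache[key] = self._llm_call(payload)
        return self._cache[key]
```

Because the project name is part of the fingerprint in this sketch, the same failure maps to the same cached verdict within a project, and an in-memory dict stands in for whatever store a real service would use.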

Why deterministic-first matters

We could have built this as one big "send everything to Claude and trust the answer" pipeline. We deliberately didn't — for three reasons.

First: cost. Most failures are easy. Spending an Anthropic API call on a clearly formatted connection-refused error is wasteful. The rules engine catches those for free.

Second: latency. The rules engine returns in single-digit milliseconds. An LLM call returns in seconds. Across regression suites with thousands of failures, that difference decides whether the team sees the report at their morning standup or whether it's still running at lunch.

Third — and most important — reproducibility. The same failure should produce the same verdict. A deterministic rules engine guarantees that; an LLM does not. By running rules first and only escalating uncertain cases to the LLM, we preserve reproducibility for the bulk of the work. And by caching every LLM verdict by fingerprint, we preserve it for the escalated cases too — within a project, the same failure always gets the same classification.

Governance: every prompt is scrubbed

Before any prompt leaves the box, it passes through a governance scrubber: emails, API keys, IP addresses, account numbers, JWTs, and other common PII patterns are redacted. Every Anthropic call is recorded with the model used, the prompt version, the input hash, the token count, and the redaction count. The whole audit log lives in a ring buffer accessible from the admin UI.
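A minimal sketch of the scrub-then-record idea, under stated assumptions: the redaction patterns below are simplified examples rather than the real rule set (account numbers are omitted), the `AuditLog` class and its capacity are hypothetical, and the token count is passed in by the caller because it would normally come from the API response. The ring buffer is modelled with `collections.deque` and a `maxlen`, which drops the oldest entry once it is full.

```python
import hashlib
import re
from collections import deque

# Simplified, hypothetical patterns. JWTs go first so their dots are not
# picked up by later patterns.
PATTERNS = [
    ("JWT", re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("API_KEY", re.compile(r"\b(?:sk|pk)[-_][A-Za-z0-9_-]{16,}")),
]


def scrub(text: str):
    """Return the redacted text and the number of redactions made."""
    redactions = 0
    for label, pattern in PATTERNS:
        text, count = pattern.subn(f"[REDACTED_{label}]", text)
        redactions += count
    return text, redactions


class AuditLog:
    def __init__(self, capacity: int = 1000):
        # Bounded buffer: the oldest entries are evicted when capacity is reached.
        self._entries = deque(maxlen=capacity)

    def record(self, model: str, prompt_version: str, raw_prompt: str, token_count: int) -> str:
        scrubbed, redactions = scrub(raw_prompt)
        self._entries.append({
            "model": model,
            "prompt_version": prompt_version,
            # Hash of the scrubbed prompt, so the log itself holds no raw input.
            "input_hash": hashlib.sha256(scrubbed.encode("utf-8")).hexdigest(),
            "token_count": token_count,
            "redaction_count": redactions,
        })
        return scrubbed  # only this version would be sent to the model

    def entries(self) -> list:
        return list(self._entries)
```

In this sketch the caller sends only the string returned by `record`, so the prompt that reaches the model and the entry in the log always come from the same scrubbing pass.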

This isn't a compliance theatre move. It's how we know — and how customers can verify — that the AI never saw something it shouldn't have. When an enterprise asks "does our customer data flow to Anthropic?" we can show them the scrub log for every call their tenant has ever made.

What we learned

The biggest lesson from tuning this for production was: structure the output, and validate it. The early versions of the narrative prompt asked the LLM to write a few paragraphs of explanation. The output was correct but unstructured: sometimes the "suggested fix" was the first sentence, sometimes the last, sometimes implicit. That made it impossible to render consistently in the UI.

We switched to a strict JSON output with four named fields — what_was_expected, what_happened, most_likely_causes, suggested_fix — and a tolerant parser on our side that strips any markdown fences or prose preamble the model occasionally adds. Quality of explanation went up. Render consistency became 100%. Costs went down because each section can be shorter and more pointed.
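One simple way to build such a tolerant parser is sketched below. It is an illustration under assumptions rather than the actual parser, and the function name `parse_narrative` is hypothetical. It keeps only the text between the first `{` and the last `}`, which discards markdown fences and any prose before or after the object, then checks that the four named fields are present.

```python
import json

REQUIRED_FIELDS = (
    "what_was_expected",
    "what_happened",
    "most_likely_causes",
    "suggested_fix",
)


def parse_narrative(raw: str) -> dict:
    # Slice from the first "{" to the last "}" to drop fences and preamble.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")

    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # handle malformed JSON and missing fields with one except clause.
    data = json.loads(raw[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")

    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    # Return only the four expected fields, in a fixed order for rendering.
    return {name: data[name] for name in REQUIRED_FIELDS}
```

The slicing approach would fail if the surrounding prose itself contained braces. A stricter variant could scan for the first balanced object instead, at the cost of more code.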

The second lesson was: the suggested fix has to actually distinguish test defects from product defects. Early versions would say "file a bug" for every failure. Our prompt now explicitly asks the model to start the suggested fix with "Revise the test step to..." if the test's expectation looks wrong, or "File a bug because..." if the product is wrong. That single distinction — whose-fault-is-this — is what makes the panel actionable instead of decorative.
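Because the two opening phrases are fixed, downstream code can read the whose-fault distinction straight from the text. The sketch below is illustrative only: the function name `fix_owner` and the "unknown" fallback are assumptions, and only the two prefixes come from the prompt described above.

```python
TEST_DEFECT_PREFIX = "Revise the test step to"
PRODUCT_DEFECT_PREFIX = "File a bug because"


def fix_owner(suggested_fix: str) -> str:
    """Return "test", "product", or "unknown" based on the opening phrase."""
    text = suggested_fix.lstrip()
    if text.startswith(TEST_DEFECT_PREFIX):
        return "test"
    if text.startswith(PRODUCT_DEFECT_PREFIX):
        return "product"
    # Neither prefix: the output did not follow the prompt's instruction.
    return "unknown"
```

An "unknown" result is a useful validation signal in its own right, since it means the model ignored the required opening and the output may need a retry or a manual look.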

If you want to see this for yourself, run TestForge against an application with at least one intentionally broken test. The panel will tell you which kind of broken, in plain English, with the actual values from the failure quoted back to you.