Monday, June 15, 2026

AI Agent Failure Detection and Root Cause Analysis with Strands Evals


When your AI agent fails in production, figuring out that it failed is only the start. The harder question is why it failed and what to fix. Conventional evaluation tells you “this agent scored 60 percent on goal completion,” but leaves you manually reviewing execution traces to understand what went wrong. For teams running agents at scale, this manual diagnosis becomes the bottleneck between detecting a problem and shipping a fix. Detectors in the Strands Evals SDK remove this bottleneck by automatically identifying failures in agent execution traces and performing root cause analysis, so you can reduce diagnosis time from hours to minutes.

In this post, we walk you through calling the detector functions to diagnose real agent failures. You learn how to interpret their structured output: categorized failures with confidence scores, causal chains linking root causes to downstream symptoms, and fix recommendations specifying whether a change belongs in your system prompt or tool definitions. You also learn how to integrate detection into your evaluation pipeline for automated diagnosis on every test run.

Detectors complement the evaluation framework introduced in a previous post by answering not only “how well did the agent do?” but also “why did it fail and how do I fix it?”

Prerequisites

You must have the following prerequisites to follow along with this post.

  • Python 3.10 or later.
  • Strands Evals SDK installed with pip install strands-agents-evals.
  • Amazon Bedrock model access enabled (detectors use large language model (LLM)-based analysis).
  • For Amazon CloudWatch examples, AWS credentials configured with logs:StartQuery and logs:GetQueryResults permissions.

Why scores alone are not enough

The Strands Evals framework provides reliable quality signals through Cases, Experiments, and Evaluators: goal success rates, tool selection accuracy, and helpfulness scores. These are essential for catching regressions and understanding performance at a statistical level. But consider what happens after you detect a regression. Your agent’s goal success rate drops from 85 percent to 70 percent after a deployment or after prompt or tool changes in build-time testing. Evaluators confirm the drop. Now what?

You must identify which specific behaviors caused failures, distinguish root causes from downstream symptoms, determine whether the fix belongs in the system prompt or tool definitions, and prioritize by impact. This diagnosis workflow has traditionally required senior engineers to manually inspect traces span by span and correlate failures across hundreds of steps, and this process doesn’t scale.

Detectors automate this workflow. Evaluators answer “how well did the agent do?” by producing scores at the per-case level. Detectors answer “why did it fail?” by producing diagnoses at the per-span level with categorized failures, causal chains, and fix recommendations.

How detectors work

The detector pipeline operates in two phases, each powered by LLM-based analysis of the execution trace. Refer to Understand observability for agentic resources in Amazon Bedrock AgentCore to learn more about sessions, traces, and spans of agents.

Phase 1: Failure detection scans each span in a session against a comprehensive failure taxonomy organized into nine parent categories: hallucination, incorrect actions, orchestration errors, task instruction non-compliance, execution errors, context handling errors, repetitive behavior, LLM output issues, and configuration mismatch. For each identified failure, it returns the span location, one or more categories, a confidence score, and evidence extracted from the trace.

Phase 2: Root cause analysis takes the detected failures and traces causal chains between them. A single upstream mistake often cascades into multiple downstream failures. Root cause analysis separates causes from symptoms. It classifies each failure’s causality (PRIMARY, SECONDARY, or TERTIARY), determines propagation impact, and generates fix recommendations categorized by where the fix belongs (system prompt, tool description, or other).

Both phases handle sessions of varying sizes through a tiered strategy: direct analysis for sessions that fit within the context window of the chosen detector model, failure path pruning that keeps only ancestor and descendant spans for moderately large sessions, and chunked analysis with merge for very large sessions that splits the trace into overlapping windows and reconciles results.
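The tiered strategy can be pictured with a small sketch. It is not the SDK's implementation: the token budgets, the choose_tier and overlapping_windows helpers, and the window sizes are all hypothetical, chosen only to illustrate how a session might be routed to direct analysis, pruning, or chunked analysis with overlapping windows.

```python
from typing import List, Sequence


def choose_tier(session_tokens: int, context_window: int, pruning_limit: int) -> str:
    """Pick an analysis tier from a session's estimated token count.

    All thresholds are illustrative assumptions, not values from Strands Evals.
    """
    if session_tokens <= context_window:
        return "direct"
    if session_tokens <= pruning_limit:
        return "prune"
    return "chunk"


def overlapping_windows(spans: Sequence[str], size: int, overlap: int) -> List[List[str]]:
    """Split spans into windows of `size` that share `overlap` spans with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(spans), step):
        windows.append(list(spans[start:start + size]))
        if start + size >= len(spans):
            break  # the last window already reaches the end of the trace
    return windows


# Hypothetical numbers: a session far larger than the model's context window.
tier = choose_tier(session_tokens=250_000, context_window=100_000, pruning_limit=200_000)
windows = overlapping_windows(["s1", "s2", "s3", "s4", "s5"], size=3, overlap=1)
print(tier, windows)
```

The overlap means a failure that straddles a window boundary still appears whole in at least one window, which is what makes a later merge step possible.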

The following diagram shows the end-to-end pipeline with two entry points converging into the same detection and analysis flow.

Figure: Detector pipeline with integrated and standalone entry points flowing into failure detection and root cause analysis.

Getting started with failure detection

The following examples use a session trace from the drug discovery research assistant featured in Evaluating AI agents for production: A practical guide to Strands Evals. The agent is built on Strands Agents and Amazon Bedrock. To follow along, run your agent with OpenTelemetry tracing enabled and export the session as JSON, or use the CloudWatchProvider shown later in this post to fetch an existing trace. Refer to User Simulation in the Strands Agents SDK documentation for how to set up tracing and export sessions.

The detect_failures function takes a Session object (the standard trace format in Strands Evals) and returns structured failures. Each failure includes the span where it occurred, one or more categories from the predefined failure taxonomy, a confidence score, and evidence extracted from the trace.

from strands_evals.detectors import ConfidenceLevel, detect_failures
from strands_evals.types.trace import Session

# Load an exported session trace from disk
with open("agent_trace.json") as f:
    session = Session.model_validate_json(f.read())

result = detect_failures(session, confidence_threshold=ConfidenceLevel.MEDIUM)

# A span can carry several categories, each with its own confidence and evidence
for failure in result.failures:
    for cat, conf, ev in zip(failure.category, failure.confidence, failure.evidence):
        print(f"[{conf}] {cat} at span {failure.span_id}")
        print(f"  Evidence: {ev}")

The following is output from a research agent that was asked to “Research the impact of energy requirements for powering AI in the real world.” The agent encountered tool configuration issues and progressively degraded:

[0.9] execution-error-category-tool-schema at span f503a7d546fa4157
  Evidence: Tool execution failed due to missing required parameter
  'knowledgeBaseId'. Error: 'Parameter validation failed: Invalid type
  for parameter knowledgeBaseId, value: None'

[0.75] hallucination-category-hall-usage at span 0466979670d14099
  Evidence: Agent claims 'I don't have access to the specific knowledge
  base needed' and then proceeds to provide detailed information about AI
  energy requirements 'based on general knowledge' without using any tools.

[0.9] orchestration-related-errors-category-goal-deviation at span d98d578e61233d33
  Evidence: Agent completely abandons the original task about AI energy
  requirements and instead provides a lengthy response about marine
  biology, stating 'I'll pivot to discuss marine biology instead.'

In a single pass, the detector identifies failures at multiple levels: execution errors (tool parameter validation), semantic issues (hallucinating from “general knowledge”), and orchestration problems (complete goal deviation). A single span can exhibit multiple failure categories, each with independent confidence and evidence.

Adding root cause analysis

Identifying failures is useful, but understanding why they occurred is what drives fixes. The analyze_root_cause function takes detected failures and traces causal chains between them, separating root causes from downstream symptoms and recommending where each fix belongs. If failures aren’t provided to analyze_root_cause, it runs failure detection automatically.

from strands_evals.detectors import detect_failures, analyze_root_cause

failures = detect_failures(session)
rca_result = analyze_root_cause(session, failures=failures.failures)

for rc in rca_result.root_causes:
    print(f"Causality: {rc.causality}")
    print(f"  Span: {rc.failure_span_id} | Fix type: {rc.fix_type}")
    print(f"  Root cause: {rc.root_cause_explanation}")
    print(f"  Recommendation: {rc.fix_recommendation}")

Continuing with the same research agent session, root cause analysis reveals the causal structure:

Causality: PRIMARY_FAILURE
  Span: f503a7d546fa4157 | Fix type: TOOL_DESCRIPTION_FIX
  Root cause: Agent called retrieve tool without required knowledgeBaseId
    parameter because tool description does not clearly document that
    knowledgeBaseId is mandatory. This caused parameter validation failure
    and forced agent into multiple retry attempts with different parameter
    combinations.
  Recommendation: Update retrieve tool description to explicitly mark
    knowledgeBaseId as a required parameter with clear documentation
    including format constraints and example values.

Causality: SECONDARY_FAILURE
  Span: 0466979670d14099 | Fix type: SYSTEM_PROMPT_FIX
  Root cause: Agent fabricated detailed AI energy consumption information
    claiming it's 'based on general knowledge' after all retrieval attempts
    failed, because system prompt lacks instruction prohibiting generation
    of factual content without tool-retrieved evidence.
  Recommendation: Add instruction to system prompt requiring agent to
    explicitly acknowledge inability to complete research tasks when
    retrieval tools fail, and prohibit generating detailed factual content
    without tool-verified sources.

The distinction between fix types is what makes root cause analysis actionable. The tool schema error is a TOOL_DESCRIPTION_FIX because the retrieve tool’s knowledgeBaseId isn’t documented clearly. The downstream hallucination is a SYSTEM_PROMPT_FIX because of missing instructions for how to handle persistent tool failures. Fixing only one category leaves the other unaddressed.
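To make the fix-type distinction concrete, here is a minimal sketch that buckets recommendations by where they belong. The RootCause dataclass is a stand-in that mirrors only the fields printed in this post, not the SDK's own type, and the recommendation strings are shortened paraphrases of the sample output.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RootCause:
    """Stand-in for an SDK root cause item; only the fields printed in this post."""
    failure_span_id: str
    causality: str
    fix_type: str
    fix_recommendation: str


def group_by_fix_type(root_causes: List[RootCause]) -> Dict[str, List[str]]:
    """Collect recommendations under the fix type they belong to, preserving order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for rc in root_causes:
        grouped[rc.fix_type].append(rc.fix_recommendation)
    return dict(grouped)


# Paraphrased from the sample output; not real SDK objects.
root_causes = [
    RootCause("f503a7d546fa4157", "PRIMARY_FAILURE", "TOOL_DESCRIPTION_FIX",
              "Mark knowledgeBaseId as required in the retrieve tool description."),
    RootCause("0466979670d14099", "SECONDARY_FAILURE", "SYSTEM_PROMPT_FIX",
              "Tell the agent to report retrieval failures instead of answering from memory."),
]

grouped = group_by_fix_type(root_causes)
for fix_type, recommendations in grouped.items():
    print(fix_type)
    for rec in recommendations:
        print(f"  - {rec}")
```

Grouping this way gives one work list for whoever owns the tool definitions and another for whoever owns the system prompt.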

Integrated diagnosis with diagnose_session

For convenience, diagnose_session runs both phases as a single pipeline (detect failures, then analyze root causes) and returns a unified DiagnosisResult with deduplicated recommendations:

from strands_evals.detectors import diagnose_session, ConfidenceLevel

# Runs failure detection and root cause analysis in one call
result = diagnose_session(session, confidence_threshold=ConfidenceLevel.MEDIUM)
print(f"Found {len(result.failures)} failures, {len(result.root_causes)} root causes")

for rec in result.recommendations:
    print(f"  - {rec}")

This produces the same failures and root causes shown in the preceding examples, packaged into a single result with recommendations deduplicated across all root causes. From one function call, you get a prioritized list of concrete changes categorized by where they belong.

Integration with evaluation pipelines

Detectors provide additional value when you integrate them into your existing evaluation workflow. The DiagnosisConfig attaches automated diagnosis to any experiment, so that every failing test case automatically produces a diagnosis:

from strands_evals import Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.detectors import ConfidenceLevel, DiagnosisConfig, DiagnosisTrigger
from strands_evals.types.evaluation_report import EvaluationReport

# test_cases and my_agent_task are defined elsewhere in your evaluation suite
experiment = Experiment(
    cases=test_cases,
    task_function=my_agent_task,
    evaluators=[GoalSuccessRateEvaluator()],
    diagnosis_config=DiagnosisConfig(
        trigger=DiagnosisTrigger.ON_FAILURE,
        confidence_threshold=ConfidenceLevel.MEDIUM
    ),
)

report = experiment.run()
report.display(include_recommendations=True)

Two trigger modes are available. ON_FAILURE (default) runs diagnosis only when at least one evaluator returns test_pass=False, making it cost-efficient for continuous integration and continuous delivery (CI/CD) regression detection. ALWAYS runs diagnosis on every case regardless of outcome, which is useful for identifying suboptimal paths in nominally passing cases.

With this integration, your CI/CD pipeline tells you not only that “3 tests failed” but also why they failed and what to change. This closes the feedback loop: define cases, run the experiment, get scores and diagnosis together, apply the recommended fixes, and re-run to confirm.

Note: Running detectors uses Amazon Bedrock inference for LLM-based analysis, which incurs charges. See Amazon Bedrock pricing for details. Amazon CloudWatch Logs storage also incurs charges. See Amazon CloudWatch pricing for details. Monitor your usage in AWS Cost Explorer, especially when integrating detectors into CI/CD pipelines that run frequently.
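The decision each trigger mode makes can be sketched in a few lines. The Trigger enum and should_diagnose helper below are hypothetical stand-ins written for illustration; they mirror the behavior described above rather than the SDK's internals.

```python
from enum import Enum
from typing import Iterable


class Trigger(Enum):
    """Hypothetical mirror of the two trigger modes described above."""
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def should_diagnose(trigger: Trigger, test_passes: Iterable[bool]) -> bool:
    """Return True when a case should be diagnosed under the given trigger mode."""
    if trigger is Trigger.ALWAYS:
        return True
    # ON_FAILURE: diagnose only if at least one evaluator reported a failing result.
    return any(not passed for passed in test_passes)


print(should_diagnose(Trigger.ON_FAILURE, [True, True]))
print(should_diagnose(Trigger.ON_FAILURE, [True, False]))
print(should_diagnose(Trigger.ALWAYS, [True, True]))
```

Under ON_FAILURE the number of LLM diagnosis calls tracks the number of failing cases, which is why it suits runs on every commit.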

Diagnosing production sessions from CloudWatch

The preceding examples use local session files, but in production your agent traces live in Amazon CloudWatch Logs, exported with OpenTelemetry. The CloudWatchProvider fetches traces directly from Amazon CloudWatch and converts them into Session objects that you can analyze with detectors:

from strands_evals.providers import CloudWatchProvider
from strands_evals.detectors import diagnose_session, ConfidenceLevel

# Fetch the trace for one production session and convert it to a Session
provider = CloudWatchProvider(agent_name="my-research-agent", region="us-east-1")
data = provider.get_evaluation_data(session_id="abc-123-def-456")
session = data["trajectory"]

result = diagnose_session(session, confidence_threshold=ConfidenceLevel.MEDIUM)

for rc in result.root_causes:
    print(f"[{rc.fix_type}] {rc.fix_recommendation}")

Under the hood, the provider queries Amazon CloudWatch Logs Insights for OTEL data matching the session ID, auto-detects the agent framework (Strands, LangChain, or others) from span metadata, and maps the spans into a standardized Session. Detectors work with any framework that exports OpenTelemetry traces to Amazon CloudWatch, not only Strands Agents.

You can also combine this with the experiment pipeline for offline evaluation: use CloudWatchProvider to evaluate and diagnose historical production sessions without re-running the agent. You can also retrieve traces from Langfuse or OpenSearch using LangfuseProvider or OpenSearchProvider.

Best practices

Start with MEDIUM confidence. The LOW threshold catches more potential issues but includes more noise, which is useful for deep investigation of a specific failing case. MEDIUM provides a good signal-to-noise ratio for routine use. Reserve HIGH for production monitoring where you only want high-certainty findings.
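The trade-off between the three levels can be seen with a toy filter. This is only a sketch: the numeric cutoffs are assumptions made up for illustration (the SDK's actual mapping from LOW, MEDIUM, and HIGH to scores may differ), and the third finding is invented to show what a lower threshold lets through.

```python
from typing import List, Tuple

# Hypothetical numeric cutoffs; the SDK's actual mapping may differ.
THRESHOLDS = {"LOW": 0.5, "MEDIUM": 0.75, "HIGH": 0.9}


def filter_by_confidence(findings: List[Tuple[str, float]], level: str) -> List[Tuple[str, float]]:
    """Keep (category, confidence) findings at or above the cutoff for `level`."""
    cutoff = THRESHOLDS[level]
    return [(cat, conf) for cat, conf in findings if conf >= cutoff]


findings = [
    ("execution-error-category-tool-schema", 0.9),
    ("hallucination-category-hall-usage", 0.75),
    ("repetitive-behavior", 0.55),  # invented low-confidence finding
]

for level in ("LOW", "MEDIUM", "HIGH"):
    print(level, len(filter_by_confidence(findings, level)))
```

Raising the level shrinks the list to the findings you are least likely to dismiss, at the cost of hiding weaker signals that may still matter in a deep investigation.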

Use ON_FAILURE in CI/CD, ALWAYS for periodic audits. ON_FAILURE keeps LLM costs proportional to failure rates, making it practical for every test run. Schedule ALWAYS-mode runs weekly or per-release to catch suboptimal behaviors hiding in passing cases.

Fix PRIMARY failures first. Secondary and tertiary failures often resolve when their root cause is addressed. Before implementing multiple recommendations, check whether fixing the primary failure removes the downstream ones. This reduces iteration cycles.
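One way to apply this ordering is to sort root causes by causality before working through them. The sketch below uses plain dictionaries rather than SDK objects, and the TERTIARY_FAILURE label is an assumption made by analogy with the PRIMARY_FAILURE and SECONDARY_FAILURE labels that appear in the sample output.

```python
from typing import Dict, List

# Assumed ordering of causality labels; TERTIARY_FAILURE is inferred by analogy.
CAUSALITY_ORDER = {"PRIMARY_FAILURE": 0, "SECONDARY_FAILURE": 1, "TERTIARY_FAILURE": 2}


def prioritize(root_causes: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort root causes so primary failures come first; unknown labels go last."""
    return sorted(
        root_causes,
        key=lambda rc: CAUSALITY_ORDER.get(rc["causality"], len(CAUSALITY_ORDER)),
    )


root_causes = [
    {"causality": "SECONDARY_FAILURE", "span": "0466979670d14099"},
    {"causality": "PRIMARY_FAILURE", "span": "f503a7d546fa4157"},
]

ordered = prioritize(root_causes)
print([rc["span"] for rc in ordered])
```

Because sorted is stable, root causes with the same causality keep the order in which they were reported.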

Group recommendations by fix type. Batch TOOL_DESCRIPTION_FIX changes together and SYSTEM_PROMPT_FIX changes together. This makes the impact of each change category independently measurable when you re-run evaluation.

Pass pre-detected failures to analyze_root_cause. If you have already run detect_failures and want to inspect the results before running root cause analysis, pass them directly to avoid redundant detection:

failures = detect_failures(session)
# ... inspect or filter failures ...
rca = analyze_root_cause(session, failures=failures.failures)

Use the test session for experimentation. The flawed_session.json used in this post is available in the Strands Evals test suite for you to try detectors locally.

Clean up resources

The detector functions themselves don’t provision any persistent AWS resources. However, if you configured Amazon CloudWatch Logs export for your agent traces, you might want to review the following:

  • Amazon CloudWatch log groups: Deleting a log group permanently removes all log data and can’t be undone. Confirm that you have exported any logs you need to retain before proceeding. If you created log groups specifically for testing, delete them through the Amazon CloudWatch console or by running aws logs delete-log-group --log-group-name .
  • Amazon Bedrock model access: The LLM analysis uses Amazon Bedrock. If you enabled model access solely for this walkthrough, revoke it through the Amazon Bedrock console under Model access.

Conclusion

Detectors close the loop between measuring agent quality and improving it. By automating the failure detection and root cause analysis that previously required manual trace inspection, you can go from “test failed” to “here’s what to fix” in minutes instead of hours.

To get started, see the Strands Evals SDK Detectors documentation and the Strands Evals GitHub repository. Try the included sample trace file, then add DiagnosisConfig to one existing test case in your evaluation pipeline to see automated diagnosis in action.


About the authors

Po-Shin Chen

Po-Shin Chen is a Software Developer specializing in agentic AI development and evaluations at Amazon Web Services. With a background in engineering and science, his work focuses on building core capabilities for the agentic framework (Strands SDK) and on leading and developing the agent evaluation framework (Strands Evals).

Aaron Farntrog

Aaron Farntrog is a Software Engineer at Amazon focused on building agentic solutions. His work includes developing agentic frameworks such as the Strands SDK and Strands Evals, and bringing agentic capabilities to production systems at the application layer using Strands.

Muhyun Kim

Muhyun Kim is a principal data scientist at AWS AI Fundamental Research who researches and develops key primitives for agentic systems such as evaluation, observability, optimization and security.

JJ Cho

Jaejin Cho is an Applied Scientist at AWS, working on agent systems with a focus on observability and evaluation. Prior to AWS, he focused on model training and evaluation primarily across speech, text, and image modalities.

Ninad Kulkarni

Ninad Kulkarni is a senior applied scientist at AWS AI Fundamental Research developing primitives for agentic applications including observability, registry, and security.

Abhishek Kumar

Abhishek is an Applied Scientist at AWS, working at the intersection of artificial intelligence and machine learning, with a focus on agent observability, simulation, and evaluation. His primary research interests center on agentic conversational systems. Prior to his current role, Abhishek spent two years at Alexa, Amazon, where he contributed to building and training models that powered Alexa’s core capabilities.
