When your AI agent fails in production, knowing that it failed is only the start. The harder question is why it failed and what to fix. Traditional evaluation tells you “this agent scored 60 percent on goal completion,” but leaves you manually reviewing execution traces to understand what went wrong. For teams running agents at scale, this manual diagnosis becomes the bottleneck between detecting a problem and shipping a fix. Detectors in the Strands Evals SDK remove this bottleneck by automatically identifying failures in agent execution traces and performing root cause analysis, so you can reduce diagnosis time from hours to minutes.
In this post, we walk you through calling the detector functions to diagnose real agent failures. You learn how to interpret their structured output: categorized failures with confidence scores, causal chains linking root causes to downstream symptoms, and fix recommendations specifying whether a change belongs in your system prompt or tool definitions. You also learn how to integrate detection into your evaluation pipeline for automated diagnosis on every test run.
Detectors complement the evaluation framework introduced in a previous post by answering not only “how well did the agent do?” but also “why did it fail and how do I fix it?”
Prerequisites
You must have the following prerequisites to follow along with this post.
- Python 3.10 or later.
- Strands Evals SDK installed with pip install strands-agents-evals.
- Amazon Bedrock model access enabled (detectors use large language model (LLM)-based analysis).
- For Amazon CloudWatch examples, AWS credentials configured with logs:StartQuery and logs:GetQueryResults permissions.
Why scores alone are not enough
The Strands Evals framework provides reliable quality signals through Cases, Experiments, and Evaluators: goal success rates, tool selection accuracy, and helpfulness scores. These are essential for catching regressions and understanding performance at a statistical level. But consider what happens after you detect a regression. Your agent’s goal success rate drops from 85 percent to 70 percent after a deployment, or after prompt or tool changes in build-time testing. Evaluators confirm the drop. Now what?
You must identify which specific behaviors caused failures, distinguish root causes from downstream symptoms, determine whether the fix belongs in the system prompt or tool definitions, and prioritize by impact. This diagnosis workflow has traditionally required senior engineers to manually inspect traces span by span and correlate failures across hundreds of steps, a process that doesn’t scale.
Detectors automate this workflow. Evaluators answer “how well did the agent do?” by producing scores at the per-case level. Detectors answer “why did it fail?” by producing diagnoses at the per-span level with categorized failures, causal chains, and fix recommendations.
How detectors work
The detector pipeline operates in two phases, each powered by LLM-based analysis of the execution trace. Refer to Understand observability for agentic resources in Amazon Bedrock AgentCore to learn more about sessions, traces, and spans of agents.
Phase 1: Failure detection scans each span in a session against a comprehensive failure taxonomy organized into nine parent categories: hallucination, incorrect actions, orchestration errors, task instruction non-compliance, execution errors, context handling errors, repetitive behavior, LLM output issues, and configuration mismatch. For each identified failure, it returns the span location, one or more categories, a confidence score, and evidence extracted from the trace.
Phase 2: Root cause analysis takes the detected failures and traces causal chains between them. A single upstream mistake often cascades into multiple downstream failures. Root cause analysis separates causes from symptoms. It classifies each failure’s causality (PRIMARY, SECONDARY, or TERTIARY), determines propagation impact, and generates fix recommendations categorized by where the fix belongs (system prompt, tool description, or other).
Both phases handle sessions of varying sizes through a tiered strategy: direct analysis for sessions that fit within the context window of the selected detector model, failure path pruning that keeps only ancestor and descendant spans for moderately large sessions, and chunked analysis with merge for very large sessions, which splits the trace into overlapping windows and reconciles the results.
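The tiered strategy can be pictured with a minimal sketch. This is not the SDK’s implementation: the Span type, the choose_strategy and overlapping_windows helpers, the two token limits, and the idea of windowing by span count are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    span_id: str
    tokens: int  # estimated token count of this span's content (hypothetical field)


def choose_strategy(spans: List[Span], context_limit: int, pruning_limit: int) -> str:
    """Pick an analysis tier from the total estimated size of the session.

    Both limits are illustrative assumptions, not SDK parameters.
    """
    total = sum(span.tokens for span in spans)
    if total <= context_limit:
        return "direct"   # whole session fits in one model call
    if total <= pruning_limit:
        return "prune"    # keep only spans on the failure path
    return "chunk"        # split into overlapping windows, then merge


def overlapping_windows(spans: List[Span], size: int, overlap: int) -> List[List[Span]]:
    """Split spans into windows of `size` that share `overlap` spans with the next window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(spans), step):
        windows.append(spans[start:start + size])
        if start + size >= len(spans):
            break  # the last window already reaches the end of the trace
    return windows


# Usage with a synthetic ten-span session.
session = [Span(f"s{i}", tokens=100) for i in range(10)]
tier = choose_strategy(session, context_limit=500, pruning_limit=800)
windows = overlapping_windows(session, size=4, overlap=1)
```

The overlap means a failure that straddles a window boundary is seen whole in at least one window, which is why a merge step is needed afterward to reconcile findings reported twice.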
The following diagram shows the end-to-end pipeline with two entry points converging into the same detection and analysis flow.
Figure: Detector pipeline with integrated and standalone entry points flowing into failure detection and root cause analysis.
Getting began with failure detection
The following examples use a session trace from the drug discovery research assistant featured in Evaluating AI agents for production: A practical guide to Strands Evals. The agent is built on Strands Agents and Amazon Bedrock. To follow along, run your agent with OpenTelemetry tracing enabled and export the session as JSON, or use the CloudWatchProvider shown later in this post to fetch an existing trace. Refer to User Simulation in the Strands Agents SDK documentation for how to set up tracing and export sessions.
The detect_failures function takes a Session object (the standard trace format in Strands Evals) and returns structured failures. Each failure includes the span where it occurred, one or more categories from the predefined failure taxonomy, a confidence score, and evidence extracted from the trace.
The following is output from a research agent that was asked to “Research the impact of energy requirements for powering AI in the real world.” The agent encountered tool configuration issues and progressively degraded:
In a single pass, the detector identifies failures at multiple levels: execution errors (tool parameter validation), semantic issues (hallucinating from “general knowledge”), and orchestration problems (complete goal deviation). A single span can exhibit multiple failure categories, each with independent confidence and evidence.
Adding root cause analysis
Identifying failures is useful, but understanding why they occurred is what drives fixes. The analyze_root_cause function takes detected failures and traces causal chains between them, separating root causes from downstream symptoms and recommending where each fix belongs. If failures aren’t provided to analyze_root_cause, it runs failure detection automatically.
Continuing with the same research agent session, root cause analysis reveals the causal structure:
The distinction between fix types is what makes root cause analysis actionable. The tool schema error is a TOOL_DESCRIPTION_FIX because the retrieve tool’s knowledgeBaseId isn’t documented clearly. The downstream hallucination is a SYSTEM_PROMPT_FIX because instructions for handling persistent tool failures are missing. Fixing only one category leaves the other unaddressed.
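One way to picture a causal chain is as a set of failures that each point to the failure that caused them. The sketch below is an illustration only: in the SDK the causality labels come from LLM-based analysis, whereas here they are derived from chain depth, which is an assumption. The FailureNode type, the causality_labels helper, and the three-step example chain (including the link from hallucination to goal deviation) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FailureNode:
    failure_id: str
    caused_by: Optional[str] = None  # id of the upstream failure; None marks a root cause


def causality_labels(nodes: List[FailureNode]) -> Dict[str, str]:
    """Label each failure by how many causal hops separate it from a root cause.

    Assumption for illustration: depth 0 is PRIMARY, depth 1 is SECONDARY,
    and anything deeper is TERTIARY.
    """
    by_id = {node.failure_id: node for node in nodes}
    labels = {}
    for node in nodes:
        depth = 0
        current = node
        seen = {node.failure_id}
        while current.caused_by is not None and current.caused_by in by_id:
            current = by_id[current.caused_by]
            if current.failure_id in seen:
                break  # defensive guard against a cyclic chain
            seen.add(current.failure_id)
            depth += 1
        labels[node.failure_id] = ("PRIMARY", "SECONDARY", "TERTIARY")[min(depth, 2)]
    return labels


# Hypothetical chain loosely modeled on the research agent session.
chain = [
    FailureNode("tool_schema_error"),
    FailureNode("hallucination", caused_by="tool_schema_error"),
    FailureNode("goal_deviation", caused_by="hallucination"),
]
labels = causality_labels(chain)
```

Reading the chain from the root outward shows why fixing the tool schema error is the first candidate: the two later failures only exist downstream of it.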
Integrated diagnosis with diagnose_session
For convenience, diagnose_session runs both phases as a single pipeline (detect failures, then analyze root causes) and returns a unified DiagnosisResult with deduplicated recommendations:
This produces the same failures and root causes shown in the preceding examples, packaged into a single result with recommendations deduplicated across all root causes. From one function call, you get a prioritized list of concrete changes categorized by where they belong.
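Deduplicating recommendations across root causes can be sketched as follows. This is a simplified illustration, not how DiagnosisResult is built: the Recommendation type and the dedupe helper are hypothetical, and matching on normalized text is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Recommendation:
    fix_type: str  # for example SYSTEM_PROMPT_FIX or TOOL_DESCRIPTION_FIX
    text: str


def dedupe(recommendations: List[Recommendation]) -> List[Recommendation]:
    """Drop repeated recommendations while keeping first-seen order.

    Two recommendations count as duplicates when they share a fix type and
    their text matches after lowercasing and collapsing whitespace.
    """
    seen = set()
    unique = []
    for rec in recommendations:
        key = (rec.fix_type, " ".join(rec.text.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Two root causes that happen to produce the same system prompt advice.
raw = [
    Recommendation("TOOL_DESCRIPTION_FIX", "Document the knowledgeBaseId parameter."),
    Recommendation("SYSTEM_PROMPT_FIX", "Tell the agent how to handle persistent tool failures."),
    Recommendation("SYSTEM_PROMPT_FIX", "tell the agent how to handle  persistent tool failures."),
]
unique = dedupe(raw)
```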
Integration with evaluation pipelines
Detectors provide additional value when you integrate them into your existing evaluation workflow. The DiagnosisConfig attaches automated diagnosis to any experiment, so that every failing test case automatically produces a diagnosis:
Two trigger modes are available. ON_FAILURE (default) runs diagnosis only when at least one evaluator returns test_pass=False, making it cost-efficient for continuous integration and continuous delivery (CI/CD) regression detection. ALWAYS runs diagnosis on every case regardless of outcome, which is useful for identifying suboptimal paths in nominally passing cases.
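The decision each trigger mode makes can be written as a small predicate. The Trigger enum and should_diagnose function below are hypothetical names used for illustration; the SDK defines its own types for this, and only the ON_FAILURE and ALWAYS semantics are taken from the description above.

```python
from enum import Enum
from typing import Iterable


class Trigger(Enum):
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def should_diagnose(trigger: Trigger, test_pass_results: Iterable[bool]) -> bool:
    """Return True when a case should be diagnosed under the given trigger mode.

    `test_pass_results` holds one test_pass value per evaluator for the case.
    """
    if trigger is Trigger.ALWAYS:
        return True
    # ON_FAILURE: diagnose only if at least one evaluator did not pass.
    return any(not passed for passed in test_pass_results)


# A case where one of two evaluators failed.
run_diagnosis = should_diagnose(Trigger.ON_FAILURE, [True, False])
```

Because ON_FAILURE skips every fully passing case, the number of diagnosis calls tracks the number of failing cases rather than the size of the test suite.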
With this integration, your CI/CD pipeline tells you not only that “3 tests failed” but also why they failed and what to change. This closes the feedback loop: define cases, run the experiment, get scores and diagnosis together, apply the recommended fixes, and re-run to confirm.
Note: Running detectors uses Amazon Bedrock inference for LLM-based analysis, which incurs charges. See Amazon Bedrock pricing for details. Amazon CloudWatch Logs storage also incurs charges. See Amazon CloudWatch pricing for details. Monitor your usage in AWS Cost Explorer, especially when integrating detectors into CI/CD pipelines that run frequently.
Diagnosing production sessions from CloudWatch
The preceding examples use local session files, but in production your agent traces live in Amazon CloudWatch Logs, exported with OpenTelemetry. The CloudWatchProvider fetches traces directly from Amazon CloudWatch and converts them into Session objects that you can analyze with detectors:
Under the hood, the provider queries Amazon CloudWatch Logs Insights for OTEL records matching the session ID, auto-detects the agent framework (Strands, LangChain, or others) from span metadata, and maps the spans into a standardized Session. Detectors work with any framework that exports OpenTelemetry traces to Amazon CloudWatch, not only Strands Agents.
You can also combine this with the experiment pipeline for offline evaluation: use CloudWatchProvider to evaluate and diagnose historical production sessions without re-running the agent. Traces stored in Langfuse or OpenSearch can be retrieved the same way using LangfuseProvider or OpenSearchProvider.
Best practices
Start with MEDIUM confidence. The LOW threshold catches more potential issues but includes more noise, which is useful for deep investigation of a specific failing case. MEDIUM provides a balanced signal-to-noise ratio for routine use. Reserve HIGH for production monitoring where you only want high-certainty findings.
Use ON_FAILURE in CI/CD, ALWAYS for periodic audits. ON_FAILURE keeps LLM costs proportional to failure rates, making it practical for every test run. Schedule ALWAYS-mode runs weekly or per-release to catch suboptimal behaviors hiding in passing cases.
Fix PRIMARY failures first. Secondary and tertiary failures often resolve when their root cause is addressed. Before implementing multiple recommendations, check whether fixing the primary failure removes the downstream ones. This reduces iteration cycles.
Group recommendations by fix type. Batch TOOL_DESCRIPTION_FIX changes together and SYSTEM_PROMPT_FIX changes together. This makes the impact of each change category independently measurable when you re-run evaluation.
Pass pre-detected failures to analyze_root_cause. If you have already run detect_failures and want to inspect the results before running root cause analysis, pass them directly to avoid redundant detection:
Use the test session for experimentation. The flawed_session.json used in this post is available in the Strands Evals test suite so that you can try detectors locally.
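Several of these practices (a confidence threshold, PRIMARY failures first, grouping by fix type) can be combined into one triage step over diagnosis output. The following is a minimal sketch under stated assumptions: the Finding type and triage function are hypothetical and do not mirror the SDK’s result objects, and confidence is modeled as a LOW, MEDIUM, or HIGH label rather than a numeric score.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Orderings used for filtering and sorting (illustrative assumptions).
CONFIDENCE_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
CAUSALITY_RANK = {"PRIMARY": 0, "SECONDARY": 1, "TERTIARY": 2}


@dataclass
class Finding:
    span_id: str
    confidence: str      # LOW, MEDIUM, or HIGH
    causality: str       # PRIMARY, SECONDARY, or TERTIARY
    fix_type: str        # for example SYSTEM_PROMPT_FIX or TOOL_DESCRIPTION_FIX
    recommendation: str


def triage(findings: List[Finding], min_confidence: str = "MEDIUM") -> Dict[str, List[Finding]]:
    """Filter by confidence, order by causality, then group by fix type."""
    floor = CONFIDENCE_RANK[min_confidence]
    kept = [f for f in findings if CONFIDENCE_RANK[f.confidence] >= floor]
    # sort() is stable, so findings with equal causality keep their original order.
    kept.sort(key=lambda f: CAUSALITY_RANK[f.causality])
    grouped: Dict[str, List[Finding]] = defaultdict(list)
    for finding in kept:
        grouped[finding.fix_type].append(finding)
    return dict(grouped)


findings = [
    Finding("span-3", "HIGH", "SECONDARY", "SYSTEM_PROMPT_FIX", "Add guidance for tool failures."),
    Finding("span-1", "HIGH", "PRIMARY", "TOOL_DESCRIPTION_FIX", "Document knowledgeBaseId."),
    Finding("span-5", "LOW", "TERTIARY", "SYSTEM_PROMPT_FIX", "Restate the research goal."),
    Finding("span-4", "MEDIUM", "PRIMARY", "SYSTEM_PROMPT_FIX", "Forbid answering from general knowledge."),
]
plan = triage(findings)
```

Each group in the returned dictionary corresponds to one batch of changes, so you can apply one batch, re-run evaluation, and measure its effect before applying the next.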
Clean up resources
The detector functions themselves don’t provision any persistent AWS resources. However, if you configured Amazon CloudWatch Logs export for your agent traces, you might want to review the following:
- Amazon CloudWatch log groups: Deleting a log group permanently removes all log data and can’t be undone. Confirm that you have exported any logs you need to retain before proceeding. If you created log groups specifically for testing, delete them through the Amazon CloudWatch console or by running aws logs delete-log-group --log-group-name.
- Amazon Bedrock model access: The LLM analysis uses Amazon Bedrock. If you enabled model access solely for this walkthrough, revoke it through the Amazon Bedrock console under Model access.
Conclusion
Detectors close the loop between measuring agent quality and improving it. By automating the failure detection and root cause analysis that previously required manual trace inspection, you can go from “test failed” to “here’s what to fix” in minutes instead of hours.
To get started, see the Strands Evals SDK Detectors documentation and the Strands Evals GitHub repository. Try the included sample trace file, then add DiagnosisConfig to one existing test case in your evaluation pipeline to see automated diagnosis in action.
About the authors
