Wednesday, January 14, 2026

Accelerate Enterprise AI Development using Weights & Biases and Amazon Bedrock AgentCore


This post is co-written with Thomas Capelle and Ray Strickland from Weights & Biases (W&B).

Generative artificial intelligence (AI) adoption is accelerating across enterprises, evolving from simple foundation model interactions to sophisticated agentic workflows. As organizations transition from proof of concept to production deployment, they require robust tools for development, evaluation, and monitoring of AI applications at scale.

In this post, we demonstrate how to use foundation models (FMs) from Amazon Bedrock and the newly launched Amazon Bedrock AgentCore alongside W&B Weave to help build, evaluate, and monitor enterprise AI solutions. We cover the complete development lifecycle, from tracking individual FM calls to monitoring complex agent workflows in production.

Overview of W&B Weave

Weights & Biases (W&B) is an AI developer platform that provides comprehensive tools for training models, fine-tuning, and leveraging foundation models for enterprises of all sizes across various industries.

W&B Weave provides a unified suite of developer tools to support every stage of your agentic AI workflows. It enables:

  • Tracing & monitoring: Track large language model (LLM) calls and application logic to debug and analyze production systems.
  • Systematic iteration: Refine and iterate on prompts, datasets, and models.
  • Experimentation: Experiment with different models and prompts in the LLM Playground.
  • Evaluation: Use custom or pre-built scorers alongside our comparison tools to systematically assess and improve application performance. Collect user and expert feedback for real-life testing and evaluation.
  • Guardrails: Help protect your application with safeguards for content moderation, prompt safety, and more. Use custom or third-party guardrails (including Amazon Bedrock Guardrails) or W&B Weave's native guardrails.

W&B Weave can be fully managed by Weights & Biases in a multi-tenant or single-tenant environment, or it can be deployed directly in a customer's Amazon Virtual Private Cloud (VPC). In addition, W&B Weave's integration into the W&B Development Platform gives organizations a seamlessly integrated experience between the model training/fine-tuning workflow and the agentic AI workflow.

To get started, subscribe to the Weights & Biases AI Development Platform through AWS Marketplace. Individuals and academic teams can subscribe to W&B at no additional cost.

Tracking Amazon Bedrock FMs with the W&B Weave SDK

W&B Weave integrates seamlessly with Amazon Bedrock through Python and TypeScript SDKs. After installing the library and patching your Bedrock client, W&B Weave automatically tracks the LLM calls:

!pip install weave
import weave
import boto3
import json
from weave.integrations.bedrock.bedrock_sdk import patch_client

weave.init("my_bedrock_app")

# Create and patch the Bedrock client
client = boto3.client("bedrock-runtime")
patch_client(client)

# Use the client as usual
response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }),
    contentType="application/json",
    accept="application/json"
)
response_dict = json.loads(response.get('body').read())
print(response_dict["content"][0]["text"])

This integration automatically versions experiments and tracks configurations, providing full visibility into your Amazon Bedrock applications without modifying core logic.

Experimenting with Amazon Bedrock FMs in the W&B Weave Playground

The W&B Weave Playground accelerates prompt engineering with an intuitive interface for testing and comparing Bedrock models. Key features include:

  • Direct prompt editing and message retrying
  • Side-by-side model comparison
  • Access from trace views for quick iteration

To begin, add your AWS credentials in the Playground settings, select your preferred Amazon Bedrock FMs, and start experimenting. The interface enables rapid iteration on prompts while maintaining full traceability of experiments.

Evaluating Amazon Bedrock FMs with W&B Weave Evaluations

W&B Weave Evaluations provides dedicated tools for evaluating generative AI models effectively. By using W&B Weave Evaluations alongside Amazon Bedrock, users can efficiently evaluate these models, analyze outputs, and visualize performance across key metrics. Users can apply built-in scorers from W&B Weave, third-party or custom scorers, and human or expert feedback as well. This combination allows for a deeper understanding of the tradeoffs between models, such as differences in cost, accuracy, speed, and output quality.

W&B Weave has a first-class way to track evaluations with its Model and Evaluation classes. To set up an evaluation job, customers can:

  • Define a dataset, or a list of dictionaries, with a set of examples to be evaluated
  • Create a list of scoring functions. Each function should take the model output and, optionally, other inputs from your examples, and return a dictionary with the scores
  • Define an Amazon Bedrock model by using the Model class
  • Evaluate this model by calling Evaluation

Here's an example of setting up an evaluation job:

import weave
from weave import Evaluation
import asyncio

# Collect your examples
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
    {"question": "What is the square root of 64?", "expected": "8"},
]

# Define any custom scoring function
@weave.op()
def match_score1(expected: str, output: dict) -> dict:
    # Here is where you would define the logic to score the model output
    return {'match': expected == output['generated_text']}

@weave.op()
def function_to_evaluate(question: str):
    # here is where you would add your LLM call and return the output
    return {'generated_text': 'Paris'}

# Score your examples using the scoring functions
evaluation = Evaluation(
    dataset=examples, scorers=[match_score1]
)

# Start tracking the evaluation
weave.init('intro-example')
# Run the evaluation
asyncio.run(evaluation.evaluate(function_to_evaluate))

The evaluation dashboard visualizes performance metrics, enabling informed decisions about model selection and configuration. For detailed guidance, see our earlier post on evaluating LLM summarization with Amazon Bedrock and Weave.

Enhancing Amazon Bedrock AgentCore observability with W&B Weave

Amazon Bedrock AgentCore is a complete set of services for deploying and operating highly capable agents more securely at enterprise scale. It provides more secure runtime environments, workflow execution tools, and operational controls that work with popular frameworks like Strands Agents, CrewAI, LangGraph, and LlamaIndex, as well as many LLM models, whether from Amazon Bedrock or external sources.

AgentCore includes built-in observability through Amazon CloudWatch dashboards that track key metrics like token usage, latency, session duration, and error rates. It also traces workflow steps, showing which tools were invoked and how the model responded, providing essential visibility for debugging and quality assurance in production.

When working with AgentCore and W&B Weave together, teams can rely on AgentCore's built-in operational monitoring and security foundations while also using W&B Weave if it aligns with their existing development workflows. Organizations already invested in the W&B ecosystem may choose to incorporate W&B Weave's visualization tools alongside AgentCore's native capabilities. This approach gives teams the flexibility to use the observability solution that best fits their established processes and preferences when developing complex agents that chain multiple tools and reasoning steps.

There are two main approaches to adding W&B Weave observability to your AgentCore agents: using the native W&B Weave SDK, or integrating through OpenTelemetry.

Native W&B Weave SDK

The simplest approach is to use W&B Weave's @weave.op decorator to automatically track function calls. Initialize W&B Weave with your project name and wrap the functions you want to track:

import os
from typing import Any, Dict

import weave
from strands import Agent  # Agent type from the Strands Agents SDK

os.environ["WANDB_API_KEY"] = "your_api_key"
weave.init("your_project_name")

@weave.op()
def word_count_op(text: str) -> int:
    return len(text.split())

@weave.op()
def run_agent(agent: Agent, user_message: str) -> Dict[str, Any]:
    result = agent(user_message)
    return {"message": result.message, "model": agent.model.config["model_id"]}

Since AgentCore runs as a Docker container, add W&B Weave to your dependencies (for example, uv add weave) to include it in your container image.
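For a pip-based image, the dependency can also be pinned in a requirements file. The fragment below is an illustrative sketch only; the base image, file names, and entry point are assumptions, not an AgentCore-mandated layout:

```dockerfile
# Illustrative Dockerfile fragment (base image and paths are assumptions)
FROM public.ecr.aws/docker/library/python:3.12-slim
WORKDIR /app
# requirements.txt should list weave alongside your agent dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "agent.py"]
```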

OpenTelemetry Integration

For teams already using OpenTelemetry or wanting vendor-neutral instrumentation, W&B Weave supports the OpenTelemetry Protocol (OTLP) directly:

import base64
import json

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

auth_b64 = base64.b64encode(f"api:{WANDB_API_KEY}".encode()).decode()
exporter = OTLPSpanExporter(
    endpoint="https://trace.wandb.ai/otel/v1/traces",
    headers={"Authorization": f"Basic {auth_b64}", "project_id": WEAVE_PROJECT}
)

# Register the exporter and obtain a tracer
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans to track execution
with tracer.start_as_current_span("invoke_agent") as span:
    span.set_attribute("input.value", json.dumps({"prompt": user_message}))
    result = agent(user_message)
    span.set_attribute("output.value", json.dumps({"message": result.message}))

This approach maintains compatibility with AgentCore's existing OpenTelemetry infrastructure while routing traces to W&B Weave for visualization.

When using both AgentCore and W&B Weave together, teams have several options for observability. AgentCore's CloudWatch integration monitors system health, resource utilization, and error rates while providing tracing for agent reasoning and tool selection. W&B Weave offers visualization capabilities that present execution data in formats familiar to teams already using the W&B ecosystem. Both solutions provide visibility into how agents process information and make decisions, allowing organizations to choose the observability approach that best aligns with their existing workflows and preferences.

This dual-layer approach means users can:

  • Monitor production service level agreements (SLAs) through CloudWatch alerts
  • Debug complex agent behaviors in W&B Weave's trace explorer
  • Optimize token usage and latency with detailed execution breakdowns
  • Compare agent performance across different prompts and configurations

The integration requires minimal code changes, preserves your existing AgentCore deployment, and scales with your agent complexity. Whether you're building simple tool-calling agents or orchestrating multi-step workflows, this observability stack provides the insights needed to iterate quickly and deploy confidently.

For implementation details and complete code examples, refer to our earlier post.

Conclusion

In this post, we demonstrated how to build and optimize enterprise-grade agentic AI solutions by combining Amazon Bedrock's FMs and AgentCore with W&B Weave's comprehensive observability toolkit. We explored how W&B Weave can enhance every stage of the LLM development lifecycle, from initial experimentation in the Playground, to systematic evaluation of model performance, and finally to production monitoring of complex agent workflows.

The integration between Amazon Bedrock and W&B Weave provides several key capabilities:

  • Automated tracking of Amazon Bedrock FM calls with minimal code changes using the W&B Weave SDK
  • Rapid experimentation through the W&B Weave Playground's intuitive interface for testing prompts and comparing models
  • Systematic evaluation with custom scoring functions to assess different Amazon Bedrock models
  • Comprehensive observability for AgentCore deployments, with CloudWatch metrics providing robust operational monitoring supplemented by detailed execution traces

To get began:

  • Request a free trial or subscribe to the Weights & Biases AI Development Platform through AWS Marketplace
  • Install the W&B Weave SDK and follow our code examples to start tracking your Bedrock FM calls
  • Experiment with different models in the W&B Weave Playground by adding your AWS credentials and testing various Amazon Bedrock FMs
  • Set up evaluations using the W&B Weave Evaluation framework to systematically compare model performance on your use cases
  • Enhance your AgentCore agents by adding W&B Weave observability using either the native SDK or the OpenTelemetry integration

Start with a simple integration to track your Amazon Bedrock calls, then gradually adopt more advanced features as your AI applications grow in complexity. The combination of Amazon Bedrock and W&B Weave's comprehensive development tools provides the foundation needed to build, evaluate, and maintain production-ready AI solutions at scale.


About the authors

James Yi is a Senior AI/ML Partner Solutions Architect at AWS. He spearheads AWS's strategic partnerships in emerging technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint go-to-market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Ray Strickland is a Senior Partner Solutions Architect at AWS specializing in AI/ML, agentic AI, and intelligent document processing. He enables partners to deploy scalable generative AI solutions using AWS best practices and drives innovation through strategic partner enablement programs. Ray collaborates across multiple AWS teams to accelerate AI adoption and has extensive experience in partner evaluation and enablement.

Thomas Capelle is a Machine Learning Engineer at Weights & Biases. He is responsible for keeping the www.github.com/wandb/examples repository live and up to date. He also builds content on MLOps, applications of W&B across industries, and fun deep learning in general. Previously he used deep learning to solve short-term forecasting for solar energy. He has a background in urban planning, combinatorial optimization, transportation economics, and applied math.

Scott Juang is the Director of Alliances at Weights & Biases. Prior to W&B, he led a number of strategic alliances at AWS and Cloudera. Scott studied materials engineering and has a passion for renewable energy.
