Tuesday, June 9, 2026

LangSmith vs. Langfuse vs. Arize Compared


Your AI agent works fine in testing. Then you ship it, and something breaks. A tool call loops forever, as if it never learns. A retrieval step returns garbage and costs spike. You have no idea why.

That’s the agent observability problem. And if you’re building with LLMs, you need to solve it before production, not after. This post breaks down three of the most-used observability tools: LangSmith, Langfuse, and Arize. We’ll set each one up, trace the same agent, and compare what you actually get.

What’s Agent Observability?

Traditional application monitoring tracks requests, errors, and latency, but that’s not enough for agents.

An agent may call multiple tools in sequence, with each LLM step having its own prompt, token usage, latency, and potential failure point. A single failed retrieval or tool call can lead to an incorrect final response.

Agent observability captures the full execution graph: every step, decision, LLM input and output, tool call, arguments, results, token usage, latency, and evaluation score. Without this visibility, debugging agent behavior becomes guesswork.
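
To make the idea of an execution graph concrete, here is a minimal sketch of a trace modeled as a tree of spans. It is not how any of the three tools stores data; the `Span` class, its field names, and the sample values are all hypothetical, chosen only to mirror the kinds of fields listed above.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    """One step in an agent run (hypothetical schema, for illustration only)."""
    name: str
    kind: str  # "agent", "llm", or "tool"
    latency_ms: float = 0.0
    tokens: int = 0
    error: Optional[str] = None
    children: list["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and all of its descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


def total_tokens(root: Span) -> int:
    """Sum token usage over every span in the tree."""
    return sum(span.tokens for span in root.walk())


def first_error(root: Span) -> Optional[Span]:
    """Return the first span (depth-first) that recorded an error, if any."""
    return next((span for span in root.walk() if span.error), None)


# Made-up run: the LLM picks a tool, the tool fails, the LLM answers anyway.
run = Span("agent_run", "agent", latency_ms=2300, children=[
    Span("plan", "llm", latency_ms=800, tokens=350),
    Span("get_order_status", "tool", latency_ms=40, error="timeout"),
    Span("answer", "llm", latency_ms=1400, tokens=420),
])

print(total_tokens(run))      # 770
print(first_error(run).name)  # get_order_status
```

With a structure like this, "why was the answer wrong?" becomes a tree walk rather than guesswork, which is roughly what the trace views in the tools below give you out of the box.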

Setting Up the Test Agent

We will use a very simple LangChain agent to compare them. The agent receives a question from the user, retrieves relevant context, and uses tools to produce an answer.

First, install the required libraries (the code below imports from langchain, langchain-openai, and python-dotenv), then create the test agent.

Let’s look at the base agent with two tools (search_docs and get_order_status). It will act as our baseline for comparing the three observability tools.

"""
Base agent used across all three observability demos.

Set the OPENAI_API_KEY env var, then call build_agent() from any demo file.
"""

import os

from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

load_dotenv()


@tool
def search_docs(query: str) -> str:
    """Search internal docs for relevant information."""
    # Simulated retrieval; swap in your actual vector store.
    docs = {
        "refund": (
            "Refunds are processed within 5-7 business days. "
            "Items must be returned within 30 days."
        ),
        "shipping": (
            "Standard shipping takes 3-5 business days. "
            "Express is 1-2 days."
        ),
        "account": (
            "You can reset your password via the login page. "
            "Contact support for account issues."
        ),
    }

    # Return the first doc whose keyword appears in the query.
    for keyword, content in docs.items():
        if keyword in query.lower():
            return content

    return f"Found general docs related to: {query}"


@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order by ID."""
    # Simulated order lookup
    statuses = {
        "ORD-001": "Shipped - expected delivery 2026-05-30",
        "ORD-002": "Processing - not yet shipped",
        "ORD-003": "Delivered on 2026-05-25",
    }

    return statuses.get(
        order_id,
        f"Order {order_id} not found in the system.",
    )


def build_agent() -> AgentExecutor:
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0,
        api_key=os.environ["OPENAI_API_KEY"],
    )

    tools = [search_docs, get_order_status]

    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a helpful customer support assistant. "
                "Use tools when needed.",
            ),
            ("user", "{input}"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ]
    )

    agent = create_openai_tools_agent(llm, tools, prompt)

    return AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=False,
    )


TEST_QUESTIONS = [
    "What are the refund policies?",
    "What is the status of order ORD-002?",
    "How long does shipping take?",
]


if __name__ == "__main__":
    executor = build_agent()

    for question in TEST_QUESTIONS:
        print(f"\nQ: {question}")

        # The key must match the "{input}" placeholder in the prompt.
        result = executor.invoke({"input": question})

        print(f"A: {result['output']}")

This gives us a test agent that can be used with each of the tools. The first tool we will explore is LangSmith.

LangSmith: Native LangChain Tracing

LangSmith is built by the LangChain team. If you are using LangChain, integration is quick and easy.

"""
LangSmith observability demo.

Setup:

pip install langsmith

Set LANGCHAIN_API_KEY in your .env file.

How it works:

LangSmith hooks into LangChain's callback system via env vars, so no code
changes are needed beyond the two os.environ lines below.
"""

import os

from dotenv import load_dotenv

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()

# Enable LangSmith tracing. These two vars are all you need.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "agent-observability-demo"

# LANGCHAIN_API_KEY must be set in your .env or environment.


def run_with_metadata(
    executor,
    question: str,
    user_id: str = "demo-user",
):
    """Run the agent and attach per-run metadata via config."""
    return executor.invoke(
        {"input": question},
        config={
            "metadata": {
                "user_id": user_id,
                "source": "langsmith_demo",
            },
            # Optional: tag runs for filtering in the dashboard.
            "tags": ["observability-blog", "demo"],
        },
    )


def main():
    print("=== LangSmith Demo ===")
    print("Traces will appear at: https://smith.langchain.com")
    print(f"Project: {os.environ['LANGCHAIN_PROJECT']}\n")

    executor = build_agent()

    for question in TEST_QUESTIONS:
        print(f"Q: {question}")

        result = run_with_metadata(executor, question)

        print(f"A: {result['output']}\n")

    print("Done. Open LangSmith to inspect the full trace tree for each run.")


if __name__ == "__main__":
    main()

LangSmith automatically hooks into LangChain’s callback system, so you don’t need decorators or wrappers to see each run appear in your project dashboard.

What you’ll see on the dashboard:

LangSmith’s trace view shows the full agent execution tree, from the initial call to tool use, LLM responses, and final output. Each node includes inputs, outputs, and latency.

You can tag runs, add metadata, filter by result, save runs as datasets, and run evaluations. This is useful when improving prompts or retrieval logic.

The prompt playground is another strong feature. You can open any trace, edit the prompt inline, and rerun it to debug poor LLM performance.

LangSmith’s limitations appear at scale. The free tier has caps, and integration takes more effort if you are not using LangChain, though OpenTelemetry is supported.

Langfuse: Open Source and Framework-Agnostic

Langfuse is the open-source alternative here. You can either host it on your own server or use their cloud service. It also integrates with a wide range of frameworks, including LangChain, LlamaIndex, and the raw OpenAI APIs.

# Read this docstring for the dependencies and their setup
"""
Langfuse observability demo.

Setup:

pip install langfuse

Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY in your .env file.

LANGFUSE_HOST defaults to https://cloud.langfuse.com; override for self-hosted.

Note: the imports below follow the Langfuse Python SDK v2 layout
(langfuse.callback); check the Langfuse docs if you are on a newer SDK version.

Key differences from LangSmith:

- Callback handler is passed per-invoke for more explicit control.
- Native session grouping for multi-turn conversations.
- You can score any trace after the fact via the Langfuse client.
"""

import os

from dotenv import load_dotenv
from langfuse import Langfuse
from langfuse.callback import CallbackHandler

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()


def build_handler(
    session_id: str,
    user_id: str = "demo-user",
) -> CallbackHandler:
    return CallbackHandler(
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
        session_id=session_id,
        user_id=user_id,
        metadata={"source": "langfuse_demo"},
        tags=["observability-blog", "demo"],
    )


def score_trace(
    trace_id: str,
    score: float,
    comment: str = "",
):
    """Add a correctness score to a trace after reviewing the output."""
    lf = Langfuse(
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    )

    lf.score(
        trace_id=trace_id,
        name="correctness",
        value=score,
        comment=comment,
    )

    lf.flush()

    print(f"Scored trace {trace_id}: {score}")


def run_single_session(
    executor,
    session_id: str,
):
    """Run all test questions in one session so they're linked in the UI."""
    handler = build_handler(session_id=session_id)
    trace_ids = []

    for question in TEST_QUESTIONS:
        print(f"Q: {question}")

        result = executor.invoke(
            {"input": question},
            config={"callbacks": [handler]},
        )

        print(f"A: {result['output']}\n")

        # handler.get_trace_id() returns the trace ID for the last run.
        trace_ids.append(handler.get_trace_id())

    # Flush ensures traces are sent before the process exits.
    # This is essential in batch jobs.
    handler.flush()

    return trace_ids


def main():
    print("=== Langfuse Demo ===")
    print(f"Dashboard: {os.getenv('LANGFUSE_HOST', 'https://cloud.langfuse.com')}\n")

    executor = build_agent()
    session_id = "demo-session-001"

    trace_ids = run_single_session(executor, session_id)

    # Example: programmatically score the first trace.
    if trace_ids and trace_ids[0]:
        print("\nScoring first trace as an example:")
        score_trace(trace_ids[0], score=0.9, comment="Answer was accurate")

    print(f"\nDone. Find all runs under session '{session_id}' in your Langfuse dashboard.")


if __name__ == "__main__":
    main()

You pass the callback handler on every run, which is a little more explicit than LangSmith’s approach but gives you more flexibility, since you can assign user IDs, session IDs, and custom metadata at invocation time.

Evaluation Workflow

Langfuse has a good evaluation workflow as well; you can add scores after the trace has completed.

from langfuse import Langfuse

lf = Langfuse()

# Score a specific trace by ID.
lf.score(
    trace_id="trace-abc123",
    name="correctness",
    value=0.9,
    comment="Answer was accurate but slightly verbose",
)

This works alongside human review: your team scores responses, and the scores roll up into aggregated evaluation metrics over time.
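
As a rough illustration of what aggregating scores over time can look like, the sketch below averages scores per name and per day using only the standard library. The record layout (`name`, `value`, `day`) and the sample numbers are assumptions made for this example, not the shape Langfuse returns.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical score records, e.g. exported from your observability tool.
scores = [
    {"name": "correctness", "value": 0.9, "day": "2026-06-01"},
    {"name": "correctness", "value": 0.7, "day": "2026-06-01"},
    {"name": "correctness", "value": 0.5, "day": "2026-06-02"},
    {"name": "helpfulness", "value": 1.0, "day": "2026-06-02"},
]


def daily_averages(records):
    """Return {(name, day): mean value}, rounded to 3 decimals."""
    buckets = defaultdict(list)
    for record in records:
        buckets[(record["name"], record["day"])].append(record["value"])
    return {key: round(mean(values), 3) for key, values in buckets.items()}


for (name, day), avg in sorted(daily_averages(scores).items()):
    print(f"{day} {name}: {avg}")
```

A falling daily average for a score such as correctness is often the earliest signal that a prompt or retrieval change made things worse.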

You can also group traces into sessions, which makes it easy to follow an agent’s conversation across multiple turns. All the traces in a single user session are linked in the UI, so you can follow a whole conversation in one place.

Arize: Production-Grade ML Observability

Originally developed as a platform for monitoring traditional machine learning models, Arize can now observe both language models and agents. Its original focus, helping teams run models in production at scale, has remained intact.

Using OpenInference

Arize uses the OpenInference standard as its tracing schema and integrates with OpenTelemetry for instrumentation. Configuring Arize is more involved than it is for the other two tools.

# Read this docstring for the dependencies and their setup
"""
Arize observability demo.

Setup:

pip install arize-otel openinference-instrumentation-langchain

Set ARIZE_SPACE_ID and ARIZE_API_KEY in your .env file.

Key differences from the others:

- Uses OpenTelemetry under the hood, so it integrates with existing OTel stacks.
- Instrumentation is global like LangSmith, not per-invoke like Langfuse.
- Best-in-class production monitoring: drift detection, cohort analysis, alerting.
- Phoenix (arize-phoenix) is the free local sibling for development use.
"""

import os

from arize.otel import register
from dotenv import load_dotenv
from openinference.instrumentation.langchain import LangChainInstrumentor

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()


def setup_arize_tracing():
    """Register Arize as the OTel tracer provider and instrument LangChain globally."""
    tracer_provider = register(
        space_id=os.environ["ARIZE_SPACE_ID"],
        api_key=os.environ["ARIZE_API_KEY"],
        project_name="agent-observability-demo",
    )

    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

    return tracer_provider


def run_with_attributes(
    executor,
    question: str,
    user_segment: str = "standard",
):
    """Run the agent and attach span attributes for cohort analysis in Arize."""
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    # Wrap the agent call in a parent span that carries the custom attributes.
    with tracer.start_as_current_span("agent_run") as span:
        span.set_attribute("user.segment", user_segment)
        span.set_attribute("query.text", question)
        span.set_attribute("demo.source", "arize_demo")

        result = executor.invoke({"input": question})

        span.set_attribute("response.text", result["output"])

        return result


def main():
    print("=== Arize Demo ===")
    print("Traces will appear at: https://app.arize.com")
    print("Project: agent-observability-demo\n")

    setup_arize_tracing()

    executor = build_agent()

    # Simulate two user segments to demonstrate cohort analysis in Arize.
    segments = ["premium", "standard", "standard"]

    for question, segment in zip(TEST_QUESTIONS, segments):
        print(f"Q: {question} [segment={segment}]")

        result = run_with_attributes(
            executor,
            question,
            user_segment=segment,
        )

        print(f"A: {result['output']}\n")

    print("Done. In Arize, use the cohort filter to compare premium vs standard responses.")
    print("Set up monitors on the Arize dashboard to alert on response quality drift.")


if __name__ == "__main__":
    main()

The instrumentation is global, like LangSmith’s, but it becomes part of OpenTelemetry’s overall tracing framework. As a result, Arize can plug into your organization’s existing observability stack (e.g., Jaeger, Grafana, etc.) regardless of the agent framework you use.
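
Cohort analysis, in general, comes down to grouping spans by an attribute and comparing a metric across the groups. A minimal standard-library sketch follows; the span dictionaries, the `user.segment` attribute name, and the latency numbers are all invented for illustration and do not reflect Arize’s data model or API.

```python
from collections import defaultdict
from statistics import median

# Invented spans; in practice these would come from your tracing backend.
spans = [
    {"attributes": {"user.segment": "premium"}, "latency_ms": 900},
    {"attributes": {"user.segment": "standard"}, "latency_ms": 1500},
    {"attributes": {"user.segment": "standard"}, "latency_ms": 2100},
    {"attributes": {"user.segment": "premium"}, "latency_ms": 1100},
]


def median_latency_by(spans, attribute):
    """Group spans by one attribute and return the median latency per group."""
    groups = defaultdict(list)
    for span in spans:
        cohort = span["attributes"].get(attribute, "unknown")
        groups[cohort].append(span["latency_ms"])
    return {cohort: median(values) for cohort, values in groups.items()}


print(median_latency_by(spans, "user.segment"))
```

The same grouping works for any attribute you attach to a span, which is why it pays to set attributes such as user segment or request source at invocation time.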

Which Should You Pick for Agent Observability?

To be completely honest, there is no single right tool for all use cases; it all depends on where you are in the development cycle and what your team needs.

| Feature | LangSmith | Langfuse | Arize |
| Setup complexity | Minimal (2 env vars) | Low (callback handler) | Most boilerplate |
| Framework support | LangChain-native; others via OTel | Any framework | Any framework via OTel |
| Self-hosting | Limited | First-class (Docker Compose) | Phoenix only (local dev) |
| Trace visualization | Excellent tree view | Good, session-linked | Good, OTel-standard |
| Evaluation / scoring | Dataset + playground | Session-level human scores | Rubric-based evals |
| Production monitoring | Basic | Basic | Drift, alerting, cohorts |
| Multi-turn / sessions | Thread-level | Native session grouping | Trace-level only |
| Open source | Proprietary | Fully open source | Phoenix is OSS; platform isn’t |
| Free tier | Limited traces/month | Generous (self-host = unlimited) | Limited |
| Best for | LangChain dev & iteration | Data ownership + any framework | Production-scale monitoring |

  • Use LangSmith if you are building with LangChain and want the fastest setup for prompt debugging and iteration.
  • Use Langfuse if you need self-hosting, stronger data ownership, multi-framework support, or session-level tracking for conversational agents.
  • Use Arize when your agent is moving into production and you need monitoring, drift detection, cohorts, and alerts.

Conclusion

Agent observability is one of those things you only regret skipping after something goes wrong in production. Tracing an agent run after the fact, without any instrumentation, is like debugging a distributed system with print statements.

All three tools covered here are production ready. They each have a free path in. And they each take under 30 minutes to integrate with a LangChain agent. There’s no good reason to ship an unobservable agent anymore.

Pick the tool that fits your current stage. Add scoring early, even informally. And when your agent starts doing something weird at 2am, you’ll be glad you did.

Data Science Trainee at Analytics Vidhya
