HALO: Debug AI Agent Traces Locally Without a Cloud Subscription

July 4, 2026


How to Debug AI Agent Traces Locally with HALO

Install the HALO CLI globally via npm after verifying the package publisher.
Initialize a HALO project with halo init and configure the halo.config.cjs file.
Start the local HALO server with halo serve to activate the trace collector and dashboard.
Instrument your Node.js AI agent using @halo/trace-sdk with startTrace, span, and endTrace calls.
Generate a corpus of 10–20 traces by running varied queries through the instrumented agent.
Analyze failure clusters and RLM insights in the React-based dashboard at localhost:4280.
Export critical trace snapshots as portable JSON files for team review without cloud infrastructure.
Iterate on agent prompts or tool logic based on systemic patterns, then re-run traces to validate fixes.

As multi-step AI agents built on tool-calling, ReAct, and plan-and-execute patterns proliferate, they produce nested execution traces spanning LLM calls, tool invocations, context assembly, and response synthesis. This tutorial walks through installing HALO, instrumenting a Node.js AI agent with its trace SDK, generating a corpus of 10–20 varied traces, and analyzing the results in its React-based dashboard.


Why AI Agent Debugging Is Broken

Important: HALO is a conceptual reference implementation described for educational purposes. Before installing, verify that the packages @halo/cli and @halo/trace-sdk are published on the npm registry by running npm view @halo/cli and npm view @halo/trace-sdk. Do not install unverified packages: a package claiming these names from an unknown publisher may pose a supply-chain risk. Verify the publisher and checksum at npmjs.com before proceeding.

As multi-step AI agents built on tool-calling, ReAct, and plan-and-execute patterns proliferate, they produce nested execution traces spanning LLM calls, tool invocations, context assembly, and response synthesis. Trying to debug these traces with console.log or standard APM tools is like diagnosing a distributed system failure with unstructured print statements and no aggregation layer. The traces are too deep, too interconnected, and too dependent on probabilistic LLM outputs for traditional approaches to surface useful insights.

Cloud-based observability platforms such as LangSmith, Langfuse Cloud, and Arize Phoenix (hosted) address some of these challenges. They provide structured trace visualization and analysis capabilities. But they carry per-seat SaaS fees (free tiers exist with limited trace quotas; check each vendor’s pricing page for current limits), may require sending sensitive prompt data off-premises, and can introduce vendor lock-in that complicates architectural decisions.

HALO offers a different path: a locally run trace analysis engine (source and license available at github.com/your-org/halo) that lets developers debug AI agent traces without a cloud subscription. HALO’s trace analysis approach, which the project calls Recursive Language Model (RLM) analysis (a HALO-specific term rather than a standardized industry concept), is designed to detect systemic agent failure patterns that single-pass LLM analysis and simple log viewers miss.

What Is HALO and How Does It Work?

RLM-Based Trace Analysis Explained

HALO’s trace analysis approach, which the project calls Recursive Language Model (RLM) analysis (a HALO-specific term describing its internal architecture), differs from single-pass LLM analysis. Where a standard LLM processes an entire trace as a flat input and produces a single analytical output, the RLM engine decomposes the problem recursively. HALO takes an agent trace, breaks it into sub-spans representing discrete execution steps, evaluates each layer independently for failure signals, and then correlates patterns across those layers and across multiple trace runs.
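The recursive decomposition described above can be illustrated with a small sketch. This is not HALO’s engine: the span shape (name, status, children) and the collectFailureSignals helper are assumptions made purely for illustration. The function evaluates each span on its own, then descends into its sub-spans until a configurable depth limit is reached.

```javascript
// Hypothetical span shape: { name, status, children: [...] }. Not HALO's schema.
function collectFailureSignals(span, maxDepth, depth = 1, path = []) {
  const here = [...path, span.name];
  const signals = [];

  // Evaluate this layer independently of its children.
  if (span.status === 'empty' || span.status === 'error') {
    signals.push({ path: here.join(' > '), status: span.status, depth });
  }

  // Recurse into sub-spans until the depth limit is reached.
  if (depth < maxDepth) {
    for (const child of span.children ?? []) {
      signals.push(...collectFailureSignals(child, maxDepth, depth + 1, here));
    }
  }
  return signals;
}

const exampleTrace = {
  name: 'research-agent',
  status: 'ok',
  children: [
    { name: 'llm-call-initial', status: 'ok', children: [] },
    {
      name: 'tool-call-web-search',
      status: 'empty',
      children: [{ name: 'http-fetch', status: 'error', children: [] }],
    },
  ],
};

console.log(collectFailureSignals(exampleTrace, 3));
```

With a depth limit of 3 the sketch reports both the empty tool call and the failing sub-span beneath it; with a limit of 2 it stops at the tool call.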

This recursive decomposition lets HALO detect systemic agent failures. Consider three examples: recurring hallucination loops, where an agent repeatedly generates fabricated tool parameters; silent tool-call failures, where a tool returns empty results without triggering error handling; and degrading context windows, where accumulated conversation history causes progressive quality loss. A standard LLM analyzing a single trace in isolation might flag an individual error, but it cannot identify that the same failure pattern recurs across, say, half of all runs and shares a common root cause. HALO’s RLM engine performs exactly this kind of cross-trace correlation.
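To make the idea of cross-trace correlation concrete, here is a minimal sketch under stated assumptions. It does not reproduce HALO’s algorithm: the input shape (one entry per trace with a list of failure signature strings), the signature names, and the minShare threshold are all hypothetical. It counts in how many traces each signature appears and keeps those that recur across a meaningful share of the corpus.

```javascript
// Hypothetical input: one entry per trace, listing the failure signatures found in it.
function correlateAcrossTraces(traces, minShare = 0.3) {
  const counts = new Map();
  for (const trace of traces) {
    // Count each signature once per trace, however often it repeats inside it.
    for (const signature of new Set(trace.signatures)) {
      counts.set(signature, (counts.get(signature) ?? 0) + 1);
    }
  }
  return [...counts]
    .map(([signature, traceCount]) => ({
      signature,
      traceCount,
      share: traceCount / traces.length,
    }))
    .filter((entry) => entry.share >= minShare)
    .sort((a, b) => b.traceCount - a.traceCount);
}

const corpus = [
  { id: 't1', signatures: ['web_search:empty', 'web_search:empty'] },
  { id: 't2', signatures: [] },
  { id: 't3', signatures: ['web_search:empty'] },
  { id: 't4', signatures: ['llm:timeout'] },
];

console.log(correlateAcrossTraces(corpus));
```

In this toy corpus the empty-search signature appears in two of four traces and is reported as systemic, while the single timeout falls below the threshold and is treated as an isolated incident.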


HALO’s Core Components

HALO consists of four tightly integrated components. The Trace Collector runs on localhost as an OpenTelemetry-compatible ingestion layer, accepting spans and traces from instrumented applications using standard protocols. The RLM Analysis Engine processes collected traces locally, entirely on the developer’s machine; verify with lsof -i -n -P | grep halo that no external connections are established before processing sensitive data. A React-based Dashboard lets developers explore traces, failure clusters, and the root-cause suggestions the RLM engine generates. Finally, the Export and Replay system lets developers snapshot traces into portable formats so teammates can share traces without cloud infrastructure. (The replay feature is not covered in this tutorial; see the HALO documentation for details.)

Prerequisites and Environment Setup

Before installing HALO, ensure the development environment meets these requirements:

Node.js v18.12.0 LTS or higher with npm v9+ or pnpm
A working AI agent project, or willingness to use the sample agent provided below
Basic familiarity with OpenTelemetry concepts (spans, traces)
Docker (optional, for containerized HALO deployment)
OPENAI_API_KEY environment variable set and valid (see below)
8GB or more of RAM recommended for responsive RLM analysis, since the engine runs locally. The halo-rlm-base model runs CPU-only on x86 and ARM architectures. Check the HALO repository for current model file size and disk space requirements.
Internet access for npm install steps and OpenAI API calls

Set your OpenAI API key before running any agent code. Use a .env file with a library like dotenv, or export the variable in your shell session. Never hardcode keys in source files or pass them as inline shell arguments (which can expose them in shell history):

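# Set the API key for the current shell session
export OPENAI_API_KEY=sk-your-key-here

# Confirm Node.js is v18.12.0 or higher
node --version

# Confirm both packages exist on the npm registry and check the publisher
npm view @halo/cli
npm view @halo/trace-sdk

# Install the CLI globally, skipping install scripts
npm install -g --ignore-scripts @halo/cli@1.0.0

# Confirm the CLI is on the PATH
halo --version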
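Before running any agent code, it can help to confirm that the API key is actually visible to Node.js without printing it. The following is a small sketch; the describeApiKey helper is a hypothetical name, not part of HALO or the OpenAI SDK. It reports only a short prefix and the key length.

```javascript
// Hypothetical helper: checks that OPENAI_API_KEY is set without logging its value.
function describeApiKey(env = process.env) {
  const key = env.OPENAI_API_KEY;
  if (!key || key.trim() === '') {
    return { ok: false, message: 'OPENAI_API_KEY is not set' };
  }
  // Show only a short prefix so the full key never lands in logs.
  return {
    ok: true,
    message: `OPENAI_API_KEY is set (${key.slice(0, 3)}..., ${key.length} chars)`,
  };
}

console.log(describeApiKey().message);
```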

Installing and Configuring HALO Locally

Install via npm

With the CLI installed globally, initialize a new HALO project in the working directory. The halo init command creates the project directory and scaffolds the required configuration and storage structure.


halo init --project my-agent-debugger
cd my-agent-debugger

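Next, create a package.json for your project and install local dependencies:

# Create a default package.json
npm init -y

# Mark the project as ESM so the agent code can use import syntax
npm pkg set type="module"

# Install pinned versions of the OpenAI SDK and the HALO trace SDK
npm install --ignore-scripts openai@4.28.0 @halo/trace-sdk@1.0.0

Your package.json should include:

{
  "type": "module",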
  "dependencies": {
    "openai": "4.28.0",
    "@halo/trace-sdk": "1.0.0"
  }
}

The halo init command also generates a halo.config.cjs file with the key configuration options. Note the .cjs extension: because the project uses "type": "module" for ESM agent code, the HALO config file uses the CommonJS .cjs extension to avoid module system conflicts:


// halo.config.cjs
const path = require('path');

module.exports = {
  server: {
    port: 4280,               // dashboard
    traceCollectorPort: 4281, // trace ingestion
  },
  storage: {
    path: path.resolve(__dirname, './halo-traces'),
    maxRetentionDays: 30,
  },
  rlm: {
    model: 'halo-rlm-base',
    analysisDepth: 3,
    crossTraceCorrelation: true,
    minTracesForSystemicAnalysis: 10,
  },
  dashboard: {
    enabled: true,
    openOnStart: false,
  },
};

The rlm.analysisDepth option controls how many recursive decomposition layers the engine applies. Each additional level above 3 roughly doubles analysis time on consumer hardware with 8GB of RAM. The minTracesForSystemicAnalysis threshold defines the minimum trace corpus (10 traces here) needed before cross-trace correlation activates.
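The doubling rule of thumb and the corpus threshold can be written out as a short worked example. Both helper names are hypothetical, and the numbers only restate the approximation given above; actual analysis time depends on hardware and trace size.

```javascript
// Rule of thumb from the text: each level above 3 roughly doubles analysis time.
function relativeAnalysisCost(analysisDepth, baselineDepth = 3) {
  return 2 ** Math.max(0, analysisDepth - baselineDepth);
}

// Cross-trace correlation only activates once the corpus reaches the threshold.
function systemicAnalysisReady(traceCount, minTraces = 10) {
  return traceCount >= minTraces;
}

for (const depth of [3, 4, 5, 6]) {
  console.log(`analysisDepth ${depth}: ~${relativeAnalysisCost(depth)}x the depth-3 analysis time`);
}
console.log(systemicAnalysisReady(8));  // false
console.log(systemicAnalysisReady(12)); // true
```

Under this approximation, raising analysisDepth from 3 to 5 costs about four times the baseline, which is worth weighing before increasing the depth on a laptop.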

Start the HALO Local Server


halo serve

Both the trace collector and the dashboard run on localhost. As noted above, verify with lsof -i -n -P | grep halo that no external connections are established before processing sensitive data.
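Reading lsof output by eye is error-prone, so the check above can be partly automated. The sketch below parses text in the usual lsof -i -n -P layout and lists connections whose remote end is not a loopback address. The sample output is invented for illustration, findExternalConnections is a hypothetical helper, and the process may appear under a different command name (for example node), in which case the grep pattern needs adjusting.

```javascript
// Loopback hosts as they appear in lsof -n output (numeric addresses).
const LOOPBACK = /^(127\.\d+\.\d+\.\d+|\[::1\]|localhost)$/;

// Returns the connections ("local->remote") whose remote host is not loopback.
function findExternalConnections(lsofOutput) {
  return lsofOutput
    .split('\n')
    .filter((line) => line.includes('->'))
    .map((line) => {
      const remote = line.split('->')[1].trim().split(/\s+/)[0]; // e.g. "93.184.216.34:443"
      const host = remote.slice(0, remote.lastIndexOf(':'));
      return { host, line: line.trim() };
    })
    .filter(({ host }) => !LOOPBACK.test(host));
}

// Invented sample in the usual lsof column layout.
const sampleOutput = [
  'COMMAND  PID USER  FD  TYPE DEVICE SIZE/OFF NODE NAME',
  'halo    4321 dev   21u IPv4 0x1234      0t0  TCP 127.0.0.1:4280 (LISTEN)',
  'halo    4321 dev   22u IPv4 0x1235      0t0  TCP 127.0.0.1:4281->127.0.0.1:51544 (ESTABLISHED)',
  'halo    4321 dev   23u IPv4 0x1236      0t0  TCP 192.168.1.20:51600->93.184.216.34:443 (ESTABLISHED)',
].join('\n');

console.log(findExternalConnections(sampleOutput).map((c) => c.host));
```

An empty result means no non-loopback connection was visible at the moment lsof ran; it is a point-in-time check, not proof that none will be opened later.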

Instrumenting a Node.js AI Agent for Trace Collection

Building a Sample AI Agent

The following sample agent demonstrates a typical multi-step flow: accepting a user query, calling an LLM via the OpenAI SDK, optionally invoking a tool (web search), and returning a synthesized answer. This creates the kind of nested traces HALO is designed to analyze. Note the deliberate intermittent failure in the search tool, which returns empty results roughly half the time to generate interesting trace data.


// agent.js
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Tool schema advertised to the model
export const tools = [
  {
    type: 'function',
    function: {
      name: 'web_search',
      description: 'Search the web for current information',
      parameters: {
        type: 'object',
        properties: { query: { type: 'string' } },
        required: ['query'],
      },
    },
  },
];

// Stubbed search tool: returns empty results roughly half the time
// to produce intermittent failures in the traces.
export function webSearch(query) {
  if (Math.random() > 0.5) {
    return { results: [] };
  }
  return { results: [{ title: `Result for: ${query}`, snippet: 'Sample data.' }] };
}

export async function runAgent(userQuery) {
  const messages = [
    { role: 'system', content: 'You are a research assistant. Use tools when needed.' },
    { role: 'user', content: userQuery },
  ];

  let response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages,
    tools,
    tool_choice: 'auto',
  });

  let message = response.choices[0].message;
  let retries = 0;

  // Keep answering tool calls until the model stops asking or the limit is hit
  while (message.tool_calls && retries < 3) {
    messages.push(message);
    for (const toolCall of message.tool_calls) {
      const args = JSON.parse(toolCall.function.arguments);
      const result = webSearch(args.query);
      messages.push({
        role: 'tool',
        tool_call_id: toolCall.id,
        content: JSON.stringify(result),
      });
    }

    response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages,
      tools,
      tool_choice: 'auto',
    });
    message = response.choices[0].message;
    retries++;
  }

  return message.content;
}
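Because the stub above relies on Math.random(), every batch run produces a different mix of empty and non-empty results. When repeatable traces are useful (for example, to compare runs before and after a prompt change), the random source can be injected. The following is a sketch of that variation, not part of the sample agent: makeWebSearch and the seeded generator are illustrative additions.

```javascript
// Small seeded generator (mulberry32) so runs are repeatable; any seeded PRNG would do.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Builds a search stub whose failure rate and random source are configurable.
function makeWebSearch(random = Math.random, emptyRate = 0.5) {
  return function webSearch(query) {
    if (random() < emptyRate) {
      return { results: [] }; // simulated silent failure
    }
    return { results: [{ title: `Result for: ${query}`, snippet: 'Sample data.' }] };
  };
}

// Same seed, same sequence of empty and non-empty results on every run.
const seededSearch = makeWebSearch(mulberry32(42));
const outcomes = ['a', 'b', 'c', 'd'].map((q) => seededSearch(q).results.length);
console.log(outcomes);
```

Passing Math.random (the default) restores the original non-deterministic behavior.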

Adding HALO’s Trace SDK

The @halo/trace-sdk package provides three key functions for instrumentation. halo.startTrace() initiates a new trace context tied to a single agent execution. halo.span() wraps individual execution steps (LLM calls, tool invocations) with metadata collection. halo.endTrace() finalizes the trace and sends it to the local collector.

A fourth function, halo.annotate(), attaches contextual data to the currently active span. It is context-bound: calls to halo.annotate() are only valid inside a halo.span() callback. Calling it outside a span context has no effect.
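As a side note on what "context-bound" can mean in practice, the sketch below shows one common way to implement this behavior in Node.js, using AsyncLocalStorage from node:async_hooks. It is an assumption about how such an API could work, not HALO’s implementation; the span and annotate functions here are local stand-ins with simplified signatures.

```javascript
import { AsyncLocalStorage } from 'node:async_hooks';

const activeSpan = new AsyncLocalStorage();
const recordedSpans = [];

// Runs fn with a fresh span record as the active context and returns fn's result.
function span(name, fn) {
  const record = { name, annotations: {} };
  recordedSpans.push(record);
  return activeSpan.run(record, fn);
}

// Attaches data to the active span; does nothing when no span is active.
function annotate(data) {
  const record = activeSpan.getStore();
  if (!record) return false;
  Object.assign(record.annotations, data);
  return true;
}

annotate({ ignored: true }); // outside any span: dropped

span('tool-call-web-search', async () => {
  await new Promise((resolve) => setTimeout(resolve, 5));
  annotate({ status: 'empty' }); // still bound to this span after the await
}).then(() => console.log(recordedSpans));
```

The design point is that annotate() needs no span argument: the active span travels with the asynchronous call chain, which is why a call made outside any span callback has nothing to attach to.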


// agent-traced.js
import { halo } from '@halo/trace-sdk';
import OpenAI from 'openai';
import { tools, webSearch } from './agent.js';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function runAgentTraced(userQuery) {
  const trace = halo.startTrace({ name: 'research-agent', input: userQuery });
  const messages = [
    { role: 'system', content: 'You are a research assistant. Use tools when needed.' },
    { role: 'user', content: userQuery },
  ];

  // Upper bound on the messages sent with each follow-up request
  const MAX_MESSAGES = 40;

  let finalOutput = null;

  try {
    let response = await halo.span(trace, {
      name: 'llm-call-initial',
      type: 'llm',
      metadata: { model: 'gpt-4o-mini' },
    }, async () => {
      const res = await openai.chat.completions.create({
        model: 'gpt-4o-mini', messages, tools, tool_choice: 'auto',
      });
      halo.annotate({ tokenCount: res.usage?.total_tokens });
      return res;
    });

    let message = response.choices[0].message;
    let retries = 0;
    // Maximum number of tool-call rounds before giving up
    const MAX_LOOP_ITERATIONS = 3;

    while (message.tool_calls && retries < MAX_LOOP_ITERATIONS) {
      messages.push(message);

      for (const toolCall of message.tool_calls) {
        let args;
        try {
          args = JSON.parse(toolCall.function.arguments);
        } catch (parseErr) {
          // halo.annotate() has no effect outside a span callback, so the parse
          // failure is surfaced through the thrown error and recorded by endTrace().
          throw new Error(`Failed to parse tool arguments: ${parseErr.message}`);
        }

        const result = await halo.span(trace, {
          name: 'tool-call-web-search',
          type: 'tool',
          metadata: { query: args.query, attempt: retries + 1 },
        }, async () => {
          const res = webSearch(args.query);
          halo.annotate({
            resultCount: res.results.length,
            status: res.results.length > 0 ? 'success' : 'empty',
          });
          return res;
        });

        messages.push({
          role: 'tool',
          tool_call_id: toolCall.id,
          content: JSON.stringify(result),
        });
      }

      // Keep only the most recent messages to bound the request size
      const trimmedMessages = messages.slice(-MAX_MESSAGES);

      response = await halo.span(trace, {
        name: 'llm-call-retry',
        type: 'llm',
        metadata: { model: 'gpt-4o-mini', retryAttempt: retries + 1 },
      }, async () => openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: trimmedMessages,
        tools,
        tool_choice: 'auto',
      }));

      message = response.choices[0].message;
      retries++;
    }

    if (message.content == null) {
      console.warn('[runAgentTraced] Final message content is null; the model may have stopped on tool_calls');
    }
    finalOutput = message.content ?? '';

  } catch (err) {
    // Close the trace on failure so it does not stay open in the corpus
    halo.endTrace(trace, { output: null, error: err.message });
    throw err;
  }

  halo.endTrace(trace, { output: finalOutput });
  return finalOutput;
}

Each halo.span() block captures timing, metadata, and annotations. The halo.annotate() calls inside spans attach contextual data such as token counts and tool response status, which the RLM engine uses during failure pattern analysis. The agent logic is wrapped in try/catch so that halo.endTrace() is always called, even when errors occur, preventing dangling open traces from corrupting the trace corpus.
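The "always end the trace" pattern can be factored into a reusable wrapper. The sketch below is one way to do it under stated assumptions: withTrace is a hypothetical helper, and tracer stands for any object exposing startTrace and endTrace with the shapes used in this article. A stub tracer is used here so the example runs without the SDK.

```javascript
// Runs `work` inside a trace and guarantees endTrace is called exactly once.
async function withTrace(tracer, options, work) {
  const trace = tracer.startTrace(options);
  let result;
  try {
    result = { output: await work(trace) };
  } catch (err) {
    tracer.endTrace(trace, { output: null, error: err.message });
    throw err;
  }
  tracer.endTrace(trace, result);
  return result.output;
}

// Stub tracer for demonstration only; it just records what endTrace received.
const calls = [];
const stubTracer = {
  startTrace: (options) => ({ ...options, id: calls.length + 1 }),
  endTrace: (trace, result) => calls.push({ id: trace.id, ...result }),
};

withTrace(stubTracer, { name: 'ok-run' }, async () => 'answer')
  .then(() => withTrace(stubTracer, { name: 'bad-run' }, async () => {
    throw new Error('boom');
  }))
  .catch(() => {}) // the error is rethrown to the caller; swallowed here for the demo
  .then(() => console.log(calls));
```

Keeping the success-path endTrace call outside the try block means a failure inside endTrace itself is not reported a second time by the catch branch.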

Generating Sample Traces

HALO’s systemic analysis requires multiple trace runs to identify cross-trace patterns. Single-trace inspection reveals individual errors; the RLM engine’s strength lies in correlating failure signals across a corpus. Generate 10 to 20 traces with varied inputs; 10 is the minimum for systemic analysis with the configuration above.

Cost note: The batch script below runs 12 queries, each potentially involving multiple LLM calls (up to 4 per query: one initial call plus up to three follow-up calls). This can result in up to 48 OpenAI API calls per batch run. Verify your OpenAI account has sufficient credits before proceeding.
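The upper bound in the cost note follows directly from the loop limit in the traced agent, and is easy to recompute when the number of queries or the iteration limit changes. The helper name below is hypothetical.

```javascript
// One initial LLM call plus at most one follow-up call per loop iteration.
function maxLlmCalls(queryCount, maxLoopIterations) {
  return queryCount * (1 + maxLoopIterations);
}

console.log(maxLlmCalls(12, 3)); // 48
console.log(maxLlmCalls(20, 3)); // 80
```

This is a worst case: queries the model answers without calling the tool use a single call each.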

Create a runner script to handle ESM imports cleanly:


// run-query.mjs
import { runAgentTraced } from './agent-traced.js';

// The query is passed as the first command-line argument
const query = process.argv[2];
if (!query) {
  console.error('Usage: node run-query.mjs "<query>"');
  process.exit(1);
}

runAgentTraced(query)
  .then(() => console.log('Done:', query))
  .catch((err) => {
    console.error('Agent run failed for query:', query);
    console.error(err);
    process.exit(1);
  });

Then use the batch script to generate traces:

#!/usr/bin/env bash
# batch-run.sh: run each query through the traced agent, one process per query

set -euo pipefail

QUERIES=(
  "latest AI research"
  "weather in Tokyo"
  "Node.js best frameworks"
  "quantum computing news"
  "React vs Vue comparison"
  "stock market trends"
  "machine learning tutorials"
  "open source LLMs"
  "WebAssembly use cases"
  "serverless architecture patterns"
  "Rust vs Go performance"
  "API design principles"
)

FAILED=0

for query in "${QUERIES[@]}"; do
  # Keep going after a failed query so one error does not abort the batch
  if ! node run-query.mjs "$query"; then
    echo "[ERROR] Query failed: $query" >&2
    FAILED=$((FAILED + 1))
  fi
done

if [ "$FAILED" -gt 0 ]; then
  echo "[WARN] $FAILED query/queries failed. Trace corpus may be incomplete." >&2
  exit 1
fi

chmod +x batch-run.sh
./batch-run.sh

Analyzing Traces in the HALO Dashboard

Navigating the React-Based UI

Open localhost:4280 in a browser to access the dashboard. The interface presents four primary views. The Trace List displays all collected traces with timestamps, duration, and status indicators. Select any trace to open the Trace Waterfall, which renders a hierarchical visualization of spans within that trace, revealing latency distribution and nesting depth. Failure Clusters groups traces sharing common failure signatures as identified by the RLM engine. The RLM Insights view presents the engine’s systemic analysis, including root-cause hypotheses and suggested fixes.

Interpreting RLM Analysis Results

With a corpus of 10 or more traces from the sample agent, the RLM engine surfaces a pattern: given the 50% random failure rate in the sample code, roughly half the traces will exhibit a “tool retry loop” where the agent calls the search tool up to three times with near-identical queries before exhausting the retry limit. The exact fraction varies by run. The RLM engine reports this as a systemic pattern rather than as isolated incidents, producing a root-cause hypothesis such as “Agent prompt does not instruct query reformulation on empty tool responses.”
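What a "tool retry loop" looks like in trace data can be sketched with a simple heuristic. This is an illustration only, not the RLM engine’s logic: the span shape mirrors the metadata and annotations used in the instrumented agent, while hasToolRetryLoop, normalizeQuery, and the minRepeats threshold are hypothetical. The heuristic flags a trace when the same normalized query produced empty results more than once.

```javascript
// Lowercase, strip punctuation, and collapse whitespace so trivially
// different queries compare equal.
function normalizeQuery(query) {
  return query
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical span shape: { type, metadata: { query }, annotations: { status } }.
function hasToolRetryLoop(spans, minRepeats = 2) {
  const emptyCounts = new Map();
  for (const s of spans) {
    if (s.type !== 'tool' || s.annotations?.status !== 'empty') continue;
    const key = normalizeQuery(s.metadata.query);
    emptyCounts.set(key, (emptyCounts.get(key) ?? 0) + 1);
  }
  return [...emptyCounts.values()].some((count) => count >= minRepeats);
}

const loopingTrace = [
  { type: 'tool', metadata: { query: 'Weather in Tokyo' }, annotations: { status: 'empty' } },
  { type: 'llm', metadata: {}, annotations: {} },
  { type: 'tool', metadata: { query: 'weather in tokyo?' }, annotations: { status: 'empty' } },
];

console.log(hasToolRetryLoop(loopingTrace)); // true
```

A trace in which the agent reformulates its query after an empty result would not be flagged, which is the behavior the root-cause hypothesis above points toward.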

A standard LLM-based analyzer processing individual traces would flag each empty tool response as a standalone error. It would not identify the cross-trace pattern showing that the agent never reformulates its query between retries. This distinction between per-trace error reporting and systemic failure detection is where HALO’s RLM architecture provides its primary value.


Exporting and Sharing Trace Snapshots

Trace snapshots let teammates share and review traces without cloud infrastructure.



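# Export a single trace to a portable JSON snapshot
halo export --trace-id <trace-id> --output ./snapshots/retry-loop-example.json

# On a teammate's machine: import the snapshot into a local HALO project
halo import --file ./snapshots/retry-loop-example.json

# Start the local server to review the imported trace in the dashboard
halo serve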

HALO vs. Cloud-Based Alternatives

Feature	HALO (Local)	LangSmith	Langfuse Cloud	Arize Phoenix (Hosted)
Runs locally	✅	❌	❌	❌
No subscription required	✅	❌*	❌*	❌*
Data stays on-premises	✅	❌	❌	❌
HALO RLM engine	✅	❌	❌	❌
OpenTelemetry compatible	✅	✅	✅	✅
Team collaboration	Export/Import	Cloud-native	Cloud-native	Cloud-native
Production-scale monitoring	❌ (dev-focused)	✅	✅	✅

* Free tiers are available for LangSmith, Langfuse Cloud, and Arize Phoenix; feature limits apply. Verify current pricing and limits on each vendor’s website.

HALO targets local development and debugging, not production monitoring at scale. If you need multi-user dashboards or process thousands of traces per day, use a cloud platform. For teams iterating on agent architectures, HALO works well as complementary tooling: use it for local iteration and systemic pattern detection during development, then pair it with a cloud platform for production observability when scale demands it.

Implementation Checklist

☐ Node.js v18.12.0+ installed
☐ OPENAI_API_KEY environment variable set
☐ HALO CLI installed globally (npm i -g --ignore-scripts @halo/cli@1.0.0), publisher verified on npmjs.com
☐ package.json created with "type": "module" and dependencies installed
☐ halo.config.cjs configured with storage path and port
☐ HALO local server running (halo serve)
☐ AI agent instrumented with @halo/trace-sdk
☐ Minimum 10 trace runs generated for systemic analysis
☐ Dashboard reviewed for failure clusters and RLM insights
☐ Critical trace snapshots exported for team review
☐ Failure patterns addressed in agent prompt or tool logic
☐ Traces re-run to validate fixes

What Comes Next

Three extensions are worth exploring from here. First, integrate HALO into CI pipelines for automated trace regression testing: run new agent code against a fixed set of queries and let HALO flag novel failure patterns before they reach production. Second, explore HALO’s plugin API for custom analysis rules tailored to your specific agent architecture. Third, pair HALO with a cloud platform for production monitoring to get full lifecycle observability. The HALO GitHub repository (github.com/your-org/halo) and documentation are intended to cover each of these paths in detail; the repository is not yet publicly accessible, so check the URL for current status. Readers should independently verify HALO’s network behavior before processing sensitive data.
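The core of the CI regression idea is a comparison between the failure signatures seen on a known-good baseline and those seen on the current run. The sketch below shows that comparison in isolation. How signatures would be obtained from HALO is not specified here; the signature strings and both helper names are hypothetical.

```javascript
// Returns signatures present in the current run but absent from the baseline.
function findNovelSignatures(baseline, current) {
  const known = new Set(baseline);
  return [...new Set(current)].filter((signature) => !known.has(signature));
}

// Maps the comparison to a process exit code for a CI step.
function regressionExitCode(novelSignatures) {
  return novelSignatures.length > 0 ? 1 : 0;
}

const novel = findNovelSignatures(
  ['web_search:empty'],
  ['web_search:empty', 'llm:context-overflow', 'llm:context-overflow'],
);
console.log(novel);                     // [ 'llm:context-overflow' ]
console.log(regressionExitCode(novel)); // 1

// In a CI script, the result would typically be applied as:
//   process.exitCode = regressionExitCode(novel);
```

Known failure patterns stay visible in the baseline without failing the build, so the pipeline only blocks on patterns that the new agent code introduced.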