How to Build a Production-Ready Multi-Agent Incident Response System Using OpenAI Swarm and Tool-Augmented Agents

January 3, 2026

In this tutorial, we build an advanced yet practical multi-agent system using OpenAI Swarm that runs in Colab. We demonstrate how to orchestrate specialized agents, such as a triage agent, an SRE agent, a communications agent, and a critic, to collaboratively handle a real-world production incident scenario. By structuring agent handoffs, integrating lightweight tools for knowledge retrieval and decision ranking, and keeping the implementation clean and modular, we show how Swarm lets us design controllable, agentic workflows without heavy frameworks or complex infrastructure.

!pip -q install -U openai
!pip -q install -U "git+https://github.com/openai/swarm.git"


import os


def load_openai_key():
    # Prefer the key stored in Colab secrets when running inside Colab.
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    except Exception:
        key = None
    # Otherwise fall back to a hidden interactive prompt.
    if not key:
        import getpass
        key = getpass.getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY not provided")
    return key


os.environ["OPENAI_API_KEY"] = load_openai_key()

We set up the environment and securely load the OpenAI API key so the notebook can run safely in Google Colab. The key is fetched from Colab secrets when available, and we fall back to a hidden prompt otherwise. This keeps authentication simple and reusable across sessions.
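
The same fallback order can be expressed as a small, testable helper. The following is a minimal sketch rather than part of the notebook: `first_available_key`, `broken_source`, and the `sk-demo` placeholder are hypothetical names used only for illustration. Each source is a zero-argument callable, and a source that raises (for example because `google.colab` is not importable) is simply skipped.

```python
from typing import Callable, Iterable, Optional


def first_available_key(sources: Iterable[Callable[[], Optional[str]]]) -> str:
    """Return the first non-empty key produced by the given sources, in order."""
    for source in sources:
        try:
            key = source()
        except Exception:
            # A failing source (e.g. a missing module) counts as "no key here".
            key = None
        if key and key.strip():
            return key.strip()
    raise RuntimeError("OPENAI_API_KEY not provided")


def broken_source() -> Optional[str]:
    # Stands in for the Colab lookup when the notebook runs outside Colab.
    raise ImportError("google.colab is not installed")


# The first source fails, the second is empty, so the third one wins.
resolved = first_available_key([broken_source, lambda: "", lambda: "  sk-demo  "])
print(resolved)
```

Ordering the sources explicitly makes it easy to add another lookup, such as an environment variable, without touching the fallback logic.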

import json
import re
from typing import List, Dict
from swarm import Swarm, Agent


client = Swarm()

We import the core Python utilities and initialize the Swarm client that orchestrates all agent interactions. This snippet establishes the runtime backbone that allows agents to communicate, hand off tasks, and execute tool calls. It serves as the entry point for the multi-agent workflow.

KB_DOCS = [
   {
       "id": "kb-incident-001",
       "title": "API Latency Incident Playbook",
       "text": "If p95 latency spikes, validate deploys, dependencies, and error rates. Rollback, cache, rate-limit, scale. Compare p50 vs p99 and inspect upstream timeouts."
   },
   {
       "id": "kb-risk-001",
       "title": "Risk Communication Guidelines",
       "text": "Updates must include impact, scope, mitigation, owner, and next update. Avoid blame and separate internal vs external messaging."
   },
   {
       "id": "kb-ops-001",
       "title": "On-call Handoff Template",
       "text": "Include summary, timeline, current status, mitigations, open questions, next actions, and owners."
   },
]


def _normalize(s: str) -> List[str]:
    # Lowercase, replace punctuation with spaces, and split into tokens.
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()


def search_kb(query: str, top_k: int = 3) -> str:
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        # Score = number of distinct tokens shared by the query and the document.
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    # Keep the matching documents; if nothing matches, return the top-ranked one anyway.
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)

We define a lightweight internal knowledge base and implement a retrieval function to surface relevant context during agent reasoning. The function scores each document by the number of distinct tokens it shares with the query and returns the best matches as JSON, falling back to the top-ranked document when nothing overlaps. With this simple token-based matching, agents can ground their responses in predefined operational documents. This demonstrates how Swarm can be augmented with domain-specific memory without external dependencies.
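
To see how token-overlap scoring behaves, the following self-contained sketch applies the same matching idea to two toy documents. The `tokenize` and `overlap_score` helpers and the shortened document texts are illustrative assumptions, not code from the notebook.

```python
import re


def tokenize(text: str) -> set:
    # Lowercase, turn punctuation into spaces, and keep the distinct tokens.
    return set(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def overlap_score(query: str, doc_text: str) -> int:
    # Count the distinct tokens that appear in both the query and the document.
    return len(tokenize(query) & tokenize(doc_text))


docs = {
    "latency": "If p95 latency spikes, validate deploys and inspect upstream timeouts.",
    "comms": "Updates must include impact, scope, mitigation, owner, and next update.",
}

query = "p95 latency spike after deploy, upstream timeouts rising"
scores = {name: overlap_score(query, text) for name, text in docs.items()}
best = max(scores, key=scores.get)
print(scores, best)  # {'latency': 4, 'comms': 0} latency
```

The latency document matches on `p95`, `latency`, `upstream`, and `timeouts`. Note that `spike` does not match `spikes` and `deploy` does not match `deploys`: exact token matching has no stemming, which is acceptable for a small playbook set but worth keeping in mind as the knowledge base grows.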

def estimate_mitigation_impact(options_json: str) -> str:
    # Expects a JSON list of objects with "option", "confidence", and "risk" keys.
    try:
        options = json.loads(options_json)
    except Exception as e:
        return json.dumps({"error": str(e)})
    ranking = []
    for o in options:
        conf = float(o.get("confidence", 0.5))
        risk = o.get("risk", "medium")
        # Higher risk subtracts more from the confidence; unknown levels count as medium.
        penalty = {"low": 0.1, "medium": 0.25, "high": 0.45}.get(risk, 0.25)
        ranking.append({
            "option": o.get("option"),
            "confidence": conf,
            "risk": risk,
            "score": round(conf - penalty, 3)
        })
    ranking.sort(key=lambda x: x["score"], reverse=True)
    return json.dumps(ranking, indent=2)

We introduce a structured tool that evaluates and ranks mitigation strategies based on confidence and risk. Each option's score is its confidence minus a fixed penalty for its risk level (0.1 for low, 0.25 for medium, 0.45 for high), and the options are returned sorted from highest to lowest score. This allows agents to move beyond free-form reasoning and produce semi-quantitative decisions. We show how tools can enforce consistency and decision discipline in agent outputs.
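
As a worked example of the scoring rule, the sketch below ranks three made-up mitigation options. The option names and confidence values are invented for illustration, and `score_option` is a hypothetical helper that applies the same confidence-minus-penalty arithmetic as the tool.

```python
RISK_PENALTY = {"low": 0.1, "medium": 0.25, "high": 0.45}


def score_option(confidence: float, risk: str) -> float:
    # Confidence minus the risk penalty; unknown risk levels are treated as medium.
    return round(confidence - RISK_PENALTY.get(risk, 0.25), 3)


# Illustrative inputs only; in the notebook the SRE agent supplies these as JSON.
options = [
    {"option": "failover database", "confidence": 0.7, "risk": "high"},
    {"option": "rollback deploy", "confidence": 0.8, "risk": "low"},
    {"option": "scale out API tier", "confidence": 0.6, "risk": "medium"},
]

ranked = sorted(
    options,
    key=lambda o: score_option(o["confidence"], o["risk"]),
    reverse=True,
)
order = [o["option"] for o in ranked]
print(order)  # ['rollback deploy', 'scale out API tier', 'failover database']
```

The rollback scores about 0.70, scaling out about 0.35, and the failover about 0.25, so a fairly confident but high-risk option still ranks below a less confident medium-risk one.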

def handoff_to_sre():
   return sre_agent


def handoff_to_comms():
   return comms_agent


def handoff_to_handoff_writer():
   return handoff_writer_agent


def handoff_to_critic():
   return critic_agent

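We define explicit handoff functions that enable one agent to transfer control to another. In Swarm, a function that returns an Agent signals a handoff to that agent. The functions refer to agents that are defined later in the notebook, which works because Python resolves those names only when a function is called. This snippet illustrates how we model delegation and specialization within Swarm. It makes agent-to-agent routing transparent and easy to extend.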
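
The handoff convention can be illustrated without the library. The following is a toy model under stated assumptions, not Swarm's actual implementation: `ToyAgent`, `run_tool`, and `ping` are hypothetical names, and the only behavior modeled is "if a tool returns an agent, that agent becomes the active one".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyAgent:
    name: str
    functions: List[Callable] = field(default_factory=list)


def run_tool(active: ToyAgent, tool_name: str) -> ToyAgent:
    """Call one of the active agent's tools and return whichever agent is active afterwards."""
    tool = next(f for f in active.functions if f.__name__ == tool_name)
    result = tool()
    # A tool that returns an agent hands control over; any other result keeps the current agent.
    return result if isinstance(result, ToyAgent) else active


sre = ToyAgent("SRE")


def handoff_to_sre() -> ToyAgent:
    return sre


def ping() -> str:
    return "pong"


triage = ToyAgent("Triage", [handoff_to_sre, ping])

after_ping = run_tool(triage, "ping")
after_handoff = run_tool(triage, "handoff_to_sre")
print(after_ping.name, after_handoff.name)  # Triage SRE
```

Because routing is just a function call, adding a new specialist amounts to defining one more agent and one more handoff function.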

triage_agent = Agent(
    name="Triage",
    model="gpt-4o-mini",
    instructions="""
Decide which agent should handle the request.
Use SRE for incident response.
Use Comms for customer or executive messaging.
Use HandoffWriter for on-call notes.
Use Critic for review or improvement.
""",
    functions=[search_kb, handoff_to_sre, handoff_to_comms, handoff_to_handoff_writer, handoff_to_critic]
)


sre_agent = Agent(
    name="SRE",
    model="gpt-4o-mini",
    instructions="""
Produce a structured incident response with triage steps,
ranked mitigations, ranked hypotheses, and a 30-minute plan.
""",
    functions=[search_kb, estimate_mitigation_impact]
)


comms_agent = Agent(
    name="Comms",
    model="gpt-4o-mini",
    instructions="""
Produce an external customer update and an internal technical update.
""",
    functions=[search_kb]
)


handoff_writer_agent = Agent(
    name="HandoffWriter",
    model="gpt-4o-mini",
    instructions="""
Produce a clean on-call handoff document with standard headings.
""",
    functions=[search_kb]
)


critic_agent = Agent(
    name="Critic",
    model="gpt-4o-mini",
    instructions="""
Critique the previous answer, then produce a refined final version and a checklist.
"""
)

We configure multiple specialized agents, each with a clearly scoped responsibility and instruction set. By separating triage, incident response, communications, handoff writing, and critique, we demonstrate a clean division of labor.

def run_pipeline(user_request: str):
    messages = [{"role": "user", "content": user_request}]
    # Stage 1: triage routes the request to a specialist, which drafts the answer.
    r1 = client.run(agent=triage_agent, messages=messages, max_turns=8)
    # run() returns only the newly generated messages, so prepend the original
    # request to give the critic the full conversation.
    messages2 = messages + r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    # Stage 2: the critic reviews and refines the draft.
    r2 = client.run(agent=critic_agent, messages=messages2, max_turns=4)
    return r2.messages[-1]["content"]


request = """
Production p95 latency jumped from 250ms to 2.5s after a deploy.
Errors slightly increased, DB CPU stable, upstream timeouts rising.
Provide a 30-minute action plan and a customer update.
"""


print(run_pipeline(request))

We assemble the full orchestration pipeline that executes triage, specialist reasoning, and critical refinement in sequence. This snippet shows how we run the end-to-end workflow with a single function call. It ties together all agents and tools into a coherent, production-style agentic system.
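
When debugging a run, it helps to see which agents and tools were involved. The sketch below assumes that each assistant message in the returned history carries a `sender` key with the agent name and that each tool result carries a `tool_name` key; verify this against the messages your installed Swarm version actually returns. `trace_agents` is a hypothetical helper, and the `sample` list is hand-built for illustration rather than real model output.

```python
from typing import Dict, List


def trace_agents(messages: List[Dict]) -> List[str]:
    """Summarize a message history as an ordered list of agent and tool steps."""
    steps = []
    for m in messages:
        if m.get("role") == "assistant" and m.get("sender"):
            # Assumed key: "sender" holds the name of the agent that replied.
            steps.append(f"agent:{m['sender']}")
        elif m.get("role") == "tool":
            # Assumed key: "tool_name" holds the name of the function that ran.
            steps.append(f"tool:{m.get('tool_name', 'unknown')}")
    return steps


# Hand-built history for illustration only.
sample = [
    {"role": "user", "content": "p95 latency jumped after a deploy"},
    {"role": "assistant", "sender": "Triage", "content": None},
    {"role": "tool", "tool_name": "handoff_to_sre", "content": "handed off"},
    {"role": "assistant", "sender": "SRE", "content": "Plan: ..."},
]

steps = trace_agents(sample)
print(steps)  # ['agent:Triage', 'tool:handoff_to_sre', 'agent:SRE']
```

Applied to the messages returned by the first stage of the pipeline, a trace like this would show whether triage actually routed the request to the expected specialist.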

In conclusion, we established a clear pattern for designing agent-oriented systems with OpenAI Swarm that emphasizes clarity, separation of responsibilities, and iterative refinement. We showed how to route tasks intelligently, enrich agent reasoning with local tools, and improve output quality through a critic loop, all while maintaining a simple, Colab-friendly setup. This approach allows us to scale from experimentation to real operational use cases, making Swarm a strong foundation for building reliable, production-grade agentic AI workflows.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.