Sunday, June 21, 2026

Cisco AI Introduces FAPO: Pipeline-Aware Prompt Optimization With Step-Level Failure Attribution and Claude Code Orchestration


Getting prompts right remains the hardest part of shipping reliable LLM applications. Small wording changes can swing accuracy by 20 percent. What works on a few examples often breaks at scale. When a multi-step pipeline returns a wrong answer, finding the failing step means inspecting intermediate outputs by hand.

Cisco AI released FAPO to address that bottleneck. FAPO stands for Fully Automated Prompt Optimization. It is a Claude Code-driven system that optimizes LLM pipelines from baseline prompts to a target accuracy. You supply a dataset and an initial prompt. FAPO then evaluates, classifies failures, proposes variants, validates them, and iterates. The whole loop is orchestrated by Claude Code agents. The project ships as open source under Apache 2.0 and also supports Codex as the optimization agent.

In Cisco’s reported evaluation, FAPO beat GEPA, a state-of-the-art prompt optimizer, on 15 of 18 model-benchmark comparisons. On the two benchmarks where FAPO escalated to pipeline changes, the mean gain over GEPA reached +33.8pp.

TL;DR

  • FAPO is a Claude Code-driven system that autonomously optimizes multi-step LLM pipelines from baseline prompts to a target accuracy, open source under Apache 2.0.
  • It escalates through three levels (prompt, parameter, then chain structure), using step-level failure attribution to decide what to change next.
  • In Cisco’s evaluation, FAPO beat GEPA on 15 of 18 model-benchmark comparisons, with a +14.1pp mean gain.
  • On HoVer and IFBench, where it escalated to pipeline changes, FAPO won all six pairs at a +33.8pp mean gain; AIME was the only benchmark where GEPA led, within sampling noise.
  • Guardrails against overfitting include training-split-only inspection, immutable variant files, and an independent reviewer on every proposal.

What Is FAPO

FAPO is a multi-tenant evaluation and optimization framework. A tenant is a self-contained optimization project. Each tenant directory holds one task’s prompts, dataset, chain definition, scorer, and config. Tenants stay isolated, so unrelated tasks can be optimized side by side without interference.

The core engine is named hephaestus and is domain-agnostic. It handles evaluation, chain execution, and scoring. Chains are LangGraph state graphs that process each test case. Out of the box, FAPO supports three providers: OpenAI, Baseten, and SageMaker.

The one input you must bring is a dataset: paired inputs and expected outputs that define success. FAPO splits it into a validation set and a held-out test set. The validation set drives iteration; the test set is used only for a final one-shot evaluation. From a task description, Claude can scaffold the rest: the initial prompt, the chain, and the scorer.

How the Optimization Loop Works

Once the pieces exist, FAPO runs a closed loop until the target accuracy is reached. Each cycle runs six stages:

  1. Evaluate: run the chain on the dataset, collecting per-case scores and step-level outputs.
  2. Attribute: classify failures by root cause using rule-based heuristics plus LLM analysis.
  3. Propose: generate a variant targeting the dominant failure cluster.
  4. Review: an independent agent validates the proposal for scope compliance and data leakage.
  5. Compare: accept the variant only if it improves on the previous best, otherwise reject it.
  6. Iterate: continue until the target accuracy is reached or the optimization budget is exhausted.
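The cycle can be sketched as a small accept-if-better loop. This is an illustrative outline only, not FAPO's code: the Variant class, the optimize function, and the four callables are hypothetical stand-ins for the evaluation, attribution, proposal, and review steps.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Variant:
    """Hypothetical container for one prompt variant."""
    name: str
    prompt: str


def optimize(
    baseline: Variant,
    evaluate: Callable[[Variant], float],
    attribute: Callable[[Variant], str],
    propose: Callable[[Variant, str], Variant],
    review: Callable[[Variant], bool],
    target: float,
    budget: int,
) -> tuple[Variant, float]:
    best, best_score = baseline, evaluate(baseline)   # score the baseline
    for _ in range(budget):                           # stop when the budget runs out
        if best_score >= target:                      # ...or when the target is met
            break
        cluster = attribute(best)                     # dominant failure cluster
        candidate = propose(best, cluster)            # new variant aimed at that cluster
        if not review(candidate):                     # independent review gate
            continue
        score = evaluate(candidate)
        if score > best_score:                        # keep only strict improvements
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins: each "!" appended to the prompt is worth 10 points.
def toy_evaluate(v: Variant) -> float:
    return 50.0 + 10.0 * v.prompt.count("!")


def toy_attribute(v: Variant) -> str:
    return "format"


def toy_propose(v: Variant, cluster: str) -> Variant:
    return Variant(v.name + "+", v.prompt + "!")


best, score = optimize(
    Variant("v0", "Answer"), toy_evaluate, toy_attribute, toy_propose,
    review=lambda v: True, target=90.0, budget=10,
)
print(best.name, score)
```

With a reviewer that rejects every proposal, the same call returns the baseline unchanged, which is the behavior the review gate is meant to guarantee.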

The system works at three escalating levels. Prompt edits are the lowest cost and are tried first. Parameter changes adjust config values such as retrieval_k or temperature. Structural changes alter chain topology, such as adding a self-reflection node or switching to a ReAct pattern. FAPO exhausts one level before escalating to the next.
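A minimal sketch of that cheapest-first ordering follows. It is a simplification under stated assumptions: the escalate function, the candidate change descriptions, and the retrieval_k value of 8 are hypothetical, and a real optimizer would keep improving within a level rather than stop at the first accepted change.

```python
from typing import Callable, Optional

# Levels in order of increasing cost, as described in the text.
LEVELS = ("prompt", "parameter", "structural")


def escalate(
    candidates: dict[str, list[str]],
    try_change: Callable[[str, str], bool],
) -> Optional[tuple[str, str]]:
    """Return the first accepted (level, change).

    Every candidate at one level is tried before the next level is touched.
    """
    for level in LEVELS:
        for change in candidates.get(level, []):
            if try_change(level, change):
                return level, change
    return None


# Hypothetical candidate changes for one pipeline.
candidates = {
    "prompt": ["add output-format instruction", "add a worked example"],
    "parameter": ["retrieval_k=8"],
    "structural": ["add self-reflection node"],
}

attempts: list[tuple[str, str]] = []


def accept_only_parameter(level: str, change: str) -> bool:
    # Toy acceptance rule: pretend only the parameter change helps.
    attempts.append((level, change))
    return level == "parameter"


result = escalate(candidates, accept_only_parameter)
print(result, len(attempts))
```

In this toy run both prompt edits are attempted and rejected before the parameter change is tried, and the structural change is never reached.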

Step attribution sorts failures into four classes. Retrieval failures return empty or irrelevant content. Cascading failures begin when an early step produces empty output. Format failures hide the correct answer inside text the scorer cannot parse. Reasoning failures occur when good inputs still produce a wrong conclusion. Format and reasoning issues are prompt-addressable. Retrieval and cascade issues are structural-addressable.
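The rule-based half of that classification might look like the sketch below. The StepTrace fields, the "retrieval" and "llm" kind labels, and the specific rules are assumptions made for illustration; judging whether retrieved content is irrelevant would need the LLM analysis the article mentions and is not attempted here.

```python
from dataclasses import dataclass


@dataclass
class StepTrace:
    """Hypothetical record of one step's output for a failed case."""
    name: str
    kind: str      # "retrieval" or "llm" (labels assumed for this sketch)
    output: str


# Which optimization level can address each class, per the description above.
ADDRESSED_BY = {
    "format": "prompt",
    "reasoning": "prompt",
    "retrieval": "structural",
    "cascade": "structural",
}


def classify_failure(steps: list[StepTrace], expected: str) -> str:
    """Classify an already-failed case; steps must be non-empty, last step is the answer."""
    *early, final = steps
    for step in early:
        if not step.output.strip():
            # An empty early step starves everything downstream.
            return "retrieval" if step.kind == "retrieval" else "cascade"
    answer = final.output.strip().lower()
    target = expected.strip().lower()
    if target in answer and answer != target:
        # The right answer is present but wrapped in extra text.
        return "format"
    return "reasoning"


trace = [
    StepTrace("retrieve", "retrieval", "Paris is the capital of France."),
    StepTrace("answer", "llm", "The answer is Paris."),
]
label = classify_failure(trace, "Paris")
print(label, ADDRESSED_BY[label])
```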

Guardrails keep the optimizer from overfitting. It inspects only training-split cases, while validation and test expose aggregate scores only. Every variant is a new immutable file, never edited in place. An independent reviewer checks each proposal before it runs.
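One simple way to get the "new file, never edited in place" property is exclusive-create mode, which refuses to overwrite an existing file. The sketch below shows the idea with Python's standard library; the write_variant helper and the v001.txt naming scheme are hypothetical, not FAPO's actual layout.

```python
import tempfile
from pathlib import Path


def write_variant(variants_dir: Path, name: str, prompt: str) -> Path:
    """Write a prompt variant to a new file, refusing to overwrite an existing one."""
    path = variants_dir / f"{name}.txt"
    # Mode "x" creates the file and raises FileExistsError if it already exists.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(prompt)
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = write_variant(root, "v001", "Answer concisely.")
    try:
        write_variant(root, "v001", "Answer in one word.")
        overwritten = True
    except FileExistsError:
        overwritten = False
    # A changed prompt has to be written under a new name instead.
    second = write_variant(root, "v002", "Answer in one word.")
    kept = first.read_text(encoding="utf-8")
    names = sorted(p.name for p in root.iterdir())

print(overwritten, kept, names)
```

Because every accepted or rejected idea lands in its own file, the history of a run can be diffed and audited after the fact.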

The Benchmark Case: FAPO vs. GEPA

The Cisco team evaluated FAPO against GEPA (Generalized Evolutionary Prompt Architecture), a state-of-the-art prompt optimization method. GEPA uses evolutionary search with genetic operators to optimize prompts for multi-step pipelines. Both systems started from identical baseline pipelines and prompts. FAPO could escalate to structural changes when attribution found bottlenecks. GEPA was restricted to prompt-level optimization.

The comparison spanned six benchmarks and three task models: GPT-4.1-mini, GPT-5.4-mini, and Gemma 3-12B. Claude Opus 4.6 served as both FAPO’s orchestrator and GEPA’s reflector. Scores below are averaged across the three task models.

Benchmark        Baseline   GEPA   FAPO   Gain vs. GEPA
HoVer            35.9       48.5   83.8   +35.3pp
IFBench          35.7       48.5   80.7   +32.2pp
LiveBench-Math   51.0       52.6   62.0   +9.4pp
HotpotQA         50.9       61.8   68.3   +6.5pp
Papillon         73.6       90.7   94.9   +4.2pp
AIME             16.7       16.0   12.9   -3.1pp

FAPO won 15 of 18 model-benchmark comparisons, with a mean gain of +14.1pp over GEPA. On HoVer and IFBench, where FAPO escalated to pipeline changes, it won all six model-benchmark pairs. The mean gain there was +33.8pp. On the four benchmarks without structural changes, FAPO still won 9 of 12 through prompt optimization alone. AIME was the only benchmark where GEPA led, by 3.1pp. That gap is smaller than the standard deviation across stochastic trials.
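The two mean gains can be checked directly from the averaged GEPA and FAPO scores in the table. The short script below only redoes that arithmetic; the variable names are ours.

```python
# (GEPA, FAPO) scores per benchmark, averaged across the three task models.
scores = {
    "HoVer": (48.5, 83.8),
    "IFBench": (48.5, 80.7),
    "LiveBench-Math": (52.6, 62.0),
    "HotpotQA": (61.8, 68.3),
    "Papillon": (90.7, 94.9),
    "AIME": (16.0, 12.9),
}

# Per-benchmark gain in percentage points, rounded as in the table.
gains = {name: round(fapo - gepa, 1) for name, (gepa, fapo) in scores.items()}

mean_all = sum(gains.values()) / len(gains)

# The two benchmarks where FAPO escalated to pipeline changes.
structural = ("HoVer", "IFBench")
mean_structural = sum(gains[name] for name in structural) / len(structural)

print(f"mean gain, all six benchmarks: {mean_all:.2f}pp")
print(f"mean gain, HoVer and IFBench:  {mean_structural:.2f}pp")
```

The results, about 14.08pp and 33.75pp, match the reported +14.1pp and +33.8pp after rounding to one decimal place.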

A capability comparison shows the design difference reported by Cisco. Every row below reflects the source description of the two systems.

Capability                               GEPA                                         FAPO
Optimization levels                      Prompt text only                             Prompt → parameter → structural
Can change chain structure               No                                           Yes, when attribution finds bottlenecks
How it is driven                         Evolutionary search with genetic operators   Claude Code or Codex agent loop
Result across 18 model-benchmark pairs   Reference                                    Wins 15 of 18; +14.1pp mean

Where It Fits: Use Cases

FAPO targets multi-step LLM pipelines, not single prompts. A few concrete examples:

  • Multi-hop question answering: A chain retrieves documents, extracts facts, reasons over evidence, and formats an answer. In Cisco’s documented walkthrough, a multi-hop QA chain rose from 39.3% to 70.3% validation exact match across two iterations. Attribution then flagged the remaining failures as retrieval-limited, signaling a structural fix. Separately, on the HotpotQA benchmark, FAPO reached 68.3% test accuracy versus GEPA’s 61.8%.
  • Instruction following: On IFBench, format-constraint failures pushed FAPO to escalate beyond prompts, reaching 80.7% test accuracy.
  • Classification: A software-name-to-category task can be scaffolded by Claude Code, then optimized to exact-match targets.
  • ReAct agents: An MCP workflow extension optimizes a tool-calling ReAct agent using trajectory scoring and LLM-as-Judge scoring.

Getting Started

The fastest path is to let Claude Code create the tenant files. From the repo, describe your task in plain English, then add a JSONL dataset. Each line is one test case with case_id, task_type, context, expected, and metadata:

{"case_id": "1", "task_type": "qa", "context": {"question": "What is the capital of France?"}, "expected": {"answer": "Paris"}, "metadata": {}}
{"case_id": "2", "task_type": "qa", "context": {"question": "What is 2 + 2?"}, "expected": {"answer": "4"}, "metadata": {}}

A scorer compares the chain output to the expected answer. It implements validate_case to catch bad data early and score_case to return a composite score:

from hephaestus.scoring.scorer import Scorer as BaseScorer

class Scorer(BaseScorer):
    def validate_case(self, case, scoring_profile):
        # Fail fast on malformed cases before any evaluation runs.
        assert "answer" in case.expected, "Missing 'answer' in expected"

    def score_case(self, case, output_text, scoring_profile):
        # Normalize whitespace and case so trivial differences are not counted as misses.
        expected = case.expected["answer"].strip().lower()
        predicted = output_text.strip().lower()
        # Exact match on a 0-100 scale.
        em = 100.0 if predicted == expected else 0.0
        return {"composite_score": em, "score_breakdown": {"exact_match": em}}

Verify the setup with a baseline evaluation:

export OPENAI_API_KEY="sk-..."
python -m hephaestus.cli eval --config tenants/my_project/configs/eval.json

Then invoke the optimization agent with a tenant, config, and success criteria such as composite_score >= 90. Claude Code produces a scope contract, then iterates autonomously. Every prompt variant, config, and per-variant analysis is written to disk, so each run stays auditable. A local read-only UI called FAPO Explorer browses the artifacts afterward.

Strengths and Weaknesses

Strengths

  • Pipeline-aware scoring attributes failures to the step that caused them, not just the final output.
  • Three-level escalation handles failures that prompts alone cannot fix.
  • Guardrails against overfitting: training-split-only inspection, immutable variants, and an independent reviewer.
  • Open source under Apache 2.0, with both Claude Code and Codex supported.

Weaknesses

  • Optimization quality is bounded by the dataset’s quality and coverage, which you must supply.
  • The project is recent, so independent production track records are still limited.
  • The default loop depends on agentic coding tools (Claude Code or Codex) rather than a standalone optimizer.
