Wednesday, July 1, 2026

Persistent Latent Memory for Multi-Hop LLM Agents: How a 6G Handover Paper Closes the Agent Cold-Start


A funny-but-real tour of ILCP-for-agents: a β-VAE compressor, an Xn-style transport, a gated MLP projector, and the unreasonably convenient realization that I had already solved this exact problem for 6G handovers. The agent-side V1 is the wiring; the receipts in this post are the 6G paper’s receipts, properly labelled, because honest writing is the whole point of this series.

This is Part 4, the capstone, of the “Production-Grade Agentic Inference” series. Each part eliminated one kind of redundant work from an agentic LLM pipeline. Part 1 killed redundant prefill (don’t read the same doc twice). Part 2 killed redundant waiting (don’t queue fifty agents single-file). Part 3 killed redundant CPU round-trips (don’t bounce every retrieval off the GPU). Part 4 (this post, and the final one) kills redundant context rebuilds: the agent equivalent of throwing away your hidden state every time the conversation hands off to a new specialist.

Key Takeaways

The problem: in a multi-hop agent pipeline, every time control shifts from agent A to agent B, the receiver throws A’s hidden state away and rebuilds the context from a prompt string. That is structurally the same “post-handover cold start” a user equipment (UE) suffers when it moves between two base stations (from source to target), where the target base station re-initialises the per-user recurrent state from scratch.

The fix: compress the sender’s recurrent state into a tiny latent payload, transport it across the hand-off, and let the receiver use it as a soft-prompt prefix instead of re-prefilling everything from text. It is the same “compute once, fan out shared state” lesson the series has been hammering since Part 1, applied across reasoning hops instead of within one pipeline.

The ‘un’usual receipts: the underlying method is Inductive Latent Context Persistence (ILCP), from a peer-reviewed paper I recently co-authored, accepted at AI4NextG @ ICML 2026. On the Vienna 4G/5G drive-test, ILCP eliminates ping-pong handovers entirely (0.0% vs a 6.5% no-transfer baseline and a 22.6% Transformer baseline), recovers post-handover accuracy by +5.1 pp average / +13.3 pp peak, and runs end to end at 7.7 ms p99 per handover decision on the same GTX 1080 as the rest of this series.

The honest part: these numbers are 6G radio handover numbers, not LLM-agent numbers. The agent-side V1 in this post (ilcp-for-agents) is the wiring (a β-VAE compressor, an in-process transport, a gated MLP projector, and a Qwen2.5-7B harness), and its agent-side benchmarks are explicitly future work. I am refusing to launder RAN receipts as LLM receipts even where the temptation is high.

The kicker: the telecom thread that ran through Parts 1–3 as analogy is, in Part 4, my published research solving the same problem in two different industries. The series closes the loop.

TL;DR: Multi-hop LLM agents today hand off context as a string. Agent A finishes reasoning, summarises it into prompt text, and Agent B reads that string from scratch: the KV cache, attention pattern, and any partial computation Agent A built up are all discarded. This is the agent version of the post-handover cold start that 5G/6G base stations suffer when a UE (mobile device) moves between two base stations: the target base station re-initialises the per-UE recurrent state and has to rebuild it from scratch. We solved that problem with a method called Inductive Latent Context Persistence (ILCP): a β-VAE compresses a 128-dim GRU state into a 128-byte latent payload, the payload travels over the standard 3GPP Xn interface, and a gated MLP then projects it into the target base station’s state space at handover. On the Vienna 4G/5G drive-test, ILCP eliminates ping-pong handovers (0.0% vs 6.5% no-transfer baseline), recovers post-handover next-cell accuracy by +5.1 pp average / +13.3 pp peak in the 50–250 ms post-handover window, and runs at 7.7 ms p99 per decision on a single GTX 1080. This part maps that same protocol onto LLM agent hand-offs: ilcp-for-agents learns to compress a pooled hidden summary, transport it across the hand-off, and project it back into a receiver-side soft-prompt prefix. V1 is the wiring (PyTorch, Qwen2.5-7B-Instruct, β-VAE, gated MLP, in-process transport, toy exact-match metric). The contribution here is the architectural transfer, not the numbers.

Github Repo: https://github.com/AnubhabBanerjee/ILCP-for-Brokers

(Quick confession before we start: I came at this whole series from a 5G/6G RAN engineering background. The telecom angles in Parts 1, 2, and 3 were the analogies. This part stops being an analogy. The mechanism I am mapping onto LLM agents is the same mechanism, written by the same co-authors, that closes the post-handover cold start in 6G radio access networks. That is the series capstone. There is a whole section on the side-by-side (section 7), but it is also why this post exists in the shape it does.)

Architecture mental model: keep this open while you read.

Agent A context → masked-mean-pool → β-VAE encoder → z (32-dim latent) → in-process TransportPayload → β-VAE decoder + gated MLP → K memory tokens → torch.cat onto Agent B question embeds → greedy decode

Everything below is just commentary on one piece of that line.

Compress, transport and project

1. A confession: I solved this problem before I knew I had it

In Part 3, we went to slightly absurd lengths to keep our tensors exactly where they belong: on the silicon. By writing a custom CUDA kernel for Top-K retrieval, we killed the redundant CPU round-trips that drag down agentic RAG. The philosophy was absolute: once the GPU computes a rich, high-dimensional state, you don’t move it, and you certainly don’t destroy it. And yet, the moment that highly-optimized retriever finishes its job and passes the baton to the next specialist in your pipeline, standard frameworks force you to do exactly that. We guard our tensor state with our lives inside a single node, only to voluntarily throw it in the trash the moment we cross a reasoning hop.

Let me dramatize the agent hand-off the way every multi-hop pipeline does it today.

You: “Agent A, read this 50-page report, create a summary and hand it off to Agent B for fact-checking.”

Agent A: “Sure. Loading model. Reading the report. Pooling the context. Building attention over paragraph 47. Forming my opinions. ✅”

GPU works for 30 seconds.

Agent A: “Done. Here is a 200-token summary I’m very proud of.”

You: “Great. Forwarding to Agent B.”

Agent A: “Wait, how exactly are you forwarding?”

You: “…as a string? In the prompt?”

Agent A: “Right. So you are sending Agent B my final string. Not my hidden state. Not my computed attention over the 50 pages I just read. Not the fact that paragraph 47 was unusually load-bearing. Not the calibrated confidence I built up. Just the string.”

You: “That’s how it works, yes.”

Agent A: “Cool. Cool cool cool. Have fun, B. 👋”

Agent B: “Hi, I’m a wonderful, stateless newborn. Loading model. Reading Agent A’s string from scratch. Building context. Pooling. Forming opinions. ✅”

GPU spends another 30 seconds doing essentially the same work Agent A just finished doing.

You: “…is there a way to skip the second context build-up and attention calculation?”

Agent B: “What second read?”

That, right there, is what every “multi-agent swarm” I have ever seen actually looks like under the hood. Each hand-off is a string-shaped throat that the sender’s internal state cannot squeeze through. The receiver gets the output text and rebuilds context from text, which is the most expensive thing a transformer usually does, and the thing this series has spent three parts trying to convince you to stop doing within one pipeline run.

The funny thing is that I have written about this problem before, just not for LLM agents.

In 2026 my co-author and I published a paper called “Inductive Latent Context Persistence: Closing the Post-Handover Cold Start in 6G Radio Access Networks.” The setting is a mobile phone (also called user equipment or UE) moving between 5G/6G base stations (also called gNBs). At every handover, the per-UE recurrent state held at the source gNB is discarded and the target gNB re-initialises its own per-UE hidden state. The target-side prediction model then has to rebuild that state from the few post-handover radio measurements it has just received, while the UE is already moving. The paper calls this the post-handover cold start. Ring any bells?

Now read this paragraph from the paper’s contributions, lightly de-jargoned: “We treat the per-user recurrent state as portable network context. To address the practical issue that the standard inter-cell message has a small size budget, we show that a 128-byte differential update is sufficient to preserve the predictive quality of a 128-dimensional GRU state across the handover boundary. Our proposed ILCP protocol compresses the hidden state with a β-variational autoencoder, transports it on the standard 3GPP Xn interface, and projects it onto the target gNB’s state space at the moment of handover via a learned, gated MLP.”

If you swap “source gNB” for “agent A,” “target gNB” for “agent B,” and “radio measurements” for “tokens of the next sub-task,” you have the architecture this whole post is about. Same paper, same author, just a different application domain.

The contribution of this post is not the method; the method is already in the paper. The contribution of this post is the mapping: taking ILCP and wiring it up for multi-hop LLM agents, end to end, in the small PyTorch repo linked above. The receipts you are about to see in section 5 are the paper’s receipts, in 6G handover units, properly labelled.


2. Why does the agent cold-start exist at all? (a one-minute crash course)

Skip this section if you have already wired up a multi-hop agentic pipeline. For everyone else, here is the short version.

A multi-hop agent is not one model answering one question. It is several specialised models taking turns at a task. A router decides intent, a planner decomposes the task, a retriever fetches grounded context, a reasoner does the actual thinking, a safety checker sniffs the output, and a finaliser writes the response. Each of these is its own forward pass, often its own model, sometimes its own process.

Between any two of those forward passes, control hands off. And here is the dirty secret that most agent frameworks paper over with friendly diagrams: at every hand-off, the receiver gets text. Not a hidden state, not a KV cache, not even a vector: text. The router’s intent classification turns into a token string. The planner’s task plan turns into a JSON blob. The retriever’s selected chunks turn into a concatenated context window. The reasoner’s chain of thought turns into a “thought:” block. Every hand-off is a tokeniser round-trip.

That sounds fine until you count. A standard agentic pipeline with four hops over a long shared context will tokenise and re-prefill the same source material four times. Three of those reads add nothing. The router already knew, the planner already knew, the reasoner already knew. The fourth read is doing it for the safety checker’s benefit, and the safety checker is going to do it again for the finaliser.

If that sounds like the SwarmKV problem from Part 1, it is, but rotated 90 degrees. Part 1 was about N agents reading the same document within one pipeline run. This part is about N agents reading each other’s accumulated context across separate reasoning hops. Different shape, same underlying tax: the receiver is throwing away expensive computed state and rebuilding it from scratch from a prompt string. That is the post-handover cold start, dressed up in a Python wrapper.


3. The “just send a latent across the hand-off” lightbulb (and why it is harder than it sounds)

The pitch is simple:

  1. While Agent A is still alive, pool its working context into a single fixed-size summary vector s_A.
  2. Compress s_A with a β-VAE into a tiny latent z of, say, 32 floats. That is 128 bytes at fp32.
  3. Hand z across the boundary as the only thing that crosses the hop.
  4. At Agent B, decode z and project it through a gated MLP into K memory vectors in B’s own embedding space.
  5. Concatenate those K vectors in front of the question token embeddings and let B greedy-decode. B never re-reads A’s context as text.

That is “compute once, hand off the compressed state”: the same lesson as Parts 1–3, applied across hops instead of within one pipeline. The only reason it takes more than a 30-line PyTorch script is that three boring problems immediately break the naive approach.

Problem A: What do you actually carry across the hand-off?

The most obvious answer is “the whole KV cache.” Agent A built it; just hand it to Agent B. However, the KV cache is a per-context object that depends on the model, the tokenizer, the quantization, the RoPE configuration, the attention implementation, the layer count, the head count, the GQA ratio, the n_ctx, and the exact GGUF / safetensors hash. Hand a KV cache from a process running model X to a process running model Y and you have shipped a binary blob the receiver cannot interpret. The on-disk KV roadmap item in SwarmKV’s V1 limitations exists precisely to make these invariants explicit; until then, every “share the KV cache between agents” idea has to negotiate a vocabulary of seven matching fields before the bytes mean anything.

The ILCP V1 makes the boring choice on purpose: do not carry the KV cache, carry a learned summary of what the KV cache was trying to represent. A single pooled hidden-state vector is model-version-fragile but not catastrophically so: it is just (hidden_size,) floats, and if Agent A and Agent B share the same base model it is unambiguous what those floats mean. From src/brokers/qwen_encoder.py:

@staticmethod
def masked_mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """
    Approved V1 pooling: average final-layer token vectors with padding masked to zero mass.

    Dividing by the raw token count (not L2-normalizing) preserves magnitude cues about confidence
    and saturation that a unit-norm pool would erase before the VAE bottleneck.
    """
    if last_hidden.dim() != 3:
        raise ValueError("last_hidden must be (batch, seq, dim).")
    if attention_mask.dim() != 2:
        raise ValueError("attention_mask must be (batch, seq).")
    masks = attention_mask.unsqueeze(-1).to(dtype=last_hidden.dtype, device=last_hidden.device)
    summed = (last_hidden * masks).sum(dim=1)
    lengths = masks.sum(dim=1).clamp(min=1.0)
    return summed / lengths

We take a masked mean over the final-layer hidden states. The comment is honest about the design choice: we do not L2-normalise on the way out, because magnitude carries useful signal about confidence and saturation that a unit-norm pool would erase. The downstream β-VAE bottleneck is the thing that decides what to keep and what to drop, not the pooling layer.
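To make the magnitude argument concrete, here is a minimal, dependency-free sketch that uses plain Python lists in place of tensors. The helper names `masked_mean_pool` and `l2_normalize` and the toy vectors are illustrative assumptions, not the repo's API. Two toy sequences point in the same direction at different scales: the masked mean keeps them apart, while L2-normalising afterwards would make them indistinguishable.

```python
import math

def masked_mean_pool(hidden, mask):
    """Average the token vectors of one sequence over positions where mask == 1.

    hidden: list of token vectors, shape (seq, dim); mask: list of 0/1 flags, shape (seq,).
    Illustrative single-sequence stand-in for a batched tensor implementation.
    """
    dim = len(hidden[0])
    summed = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:
            for i, x in enumerate(vec):
                summed[i] += x
    # Clamp the length to at least 1 so an all-padding row does not divide by zero.
    length = max(sum(mask), 1)
    return [s / length for s in summed]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (the step the pooling deliberately skips)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Two toy sequences: same direction, different magnitude; the last position is padding.
quiet = [[1.0, 2.0], [1.0, 2.0], [9.0, 9.0]]
loud = [[4.0, 8.0], [4.0, 8.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled_quiet = masked_mean_pool(quiet, mask)  # padding row is ignored
pooled_loud = masked_mean_pool(loud, mask)
```

After pooling, the two summaries differ by a factor of four; after `l2_normalize` they coincide, which is the information loss the docstring warns about.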

Problem B: How small can the payload get and still be useful?

A pooled hidden vector from a typical 7B model is on the order of (4096,) floats: sixteen kilobytes at fp32. That is fine for in-process toy demos, but the moment the transport becomes a real network message, sixteen kilobytes per hand-off is not free. The radio paper’s whole argument is that a 128-byte differential update is sufficient to preserve the predictive quality of a 128-dimensional GRU state, because the GRU state lives on a low-dimensional task-relevant manifold and a β-VAE can find that manifold during joint training with the downstream loss.

The agent-side V1 makes the same architectural choice. From src/compressor/beta_vae.py:

class BetaVAE(nn.Module):
    """
    Fully-connected β-VAE operating on pooled LM hidden states.

    Why fully-connected instead of convolutions: the input is already a single vector per sample
    (masked mean pool over time), so conv layers would add parameter overhead without exploiting
    spatial locality the way a CUDA kernel would on grids.
    """

    def __init__(
        self,
        input_dim: int,
        latent_dim: int,
        hidden_dim: int,
        beta: float = 1.0,
    ) -> None:

The two encoder heads return μ and logvar separately, because tying them would constrain curvature; the reparameterisation trick keeps sampling differentiable; the closed-form KL divergence is the standard analytic expectation under a diagonal Gaussian encoder. None of this is new VAE work: Higgins et al. (2017) and Kipf & Welling (2016) did the heavy lifting a decade ago, and the comments in the code are explicit that none of it pretends to be. The contribution is what you attach the VAE to, not the VAE itself.
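For readers who have not met these pieces before, here is a small sketch of the two textbook ingredients named above, written in plain Python so it runs without PyTorch. The function names and toy numbers are assumptions made for illustration and are not taken from the repo. It shows the reparameterised sample z = μ + σ·ε and the closed-form KL term against a standard normal prior, weighted by β in the usual β-VAE objective.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps per dimension, with eps ~ N(0, 1) and sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def beta_vae_loss(recon_error, mu, logvar, beta=1.0):
    """Reconstruction error plus the beta-weighted KL penalty."""
    return recon_error + beta * kl_to_standard_normal(mu, logvar)

rng = random.Random(0)   # seeded so the sketch is repeatable
mu = [0.5, -1.0]         # toy encoder means (hypothetical values)
logvar = [0.0, -2.0]     # toy encoder log-variances (hypothetical values)

z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
loss = beta_vae_loss(recon_error=1.0, mu=mu, logvar=logvar, beta=4.0)
```

The KL term is zero exactly when the encoder outputs the prior (μ = 0, logvar = 0) and grows as the posterior moves away from it; β controls how strongly that pressure is applied.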

There is a small helper in the same file that exists purely for honest reporting:

def latent_payload_bytes(latent_dim: int, dtype: torch.dtype) -> int:
    """
    Report transferable payload size in bytes for README receipts (not assumed 128-byte telecom payload).

    Element size follows torch.dtype element alignment; this is the on-wire analog for in-process transport.
    """
    # torch.finfo / element_size gives byte width for floating dtypes used in z tensors.
    if not dtype.is_floating_point:
        raise ValueError("latent_payload_bytes expects a floating dtype for z.")
    width = torch.tensor([], dtype=dtype).element_size()
    return int(latent_dim) * int(width)

The docstring says “not assumed 128-byte telecom payload.” The agent-side code refuses to assume the 6G paper’s 128-byte number: it computes the agent-side payload from the actual latent dim and dtype of the run you actually ran, so the README receipt for the agent side will tell its own truth. That single helper is the entire “don’t launder RAN numbers as LLM numbers” policy, in seven lines.
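The size arithmetic behind this section is easy to check by hand. The sketch below is a torch-free approximation of the same idea: `payload_bytes` and the `FLOAT_WIDTH` table are hypothetical names introduced here, and the dimensions are the illustrative ones quoted in the text (a 32-float latent versus a 4096-float pooled vector).

```python
import struct

# Byte width of common float formats, taken from the standard-size struct codes
# (e = IEEE half, f = IEEE single, d = IEEE double).
FLOAT_WIDTH = {
    "fp16": struct.calcsize("<e"),
    "fp32": struct.calcsize("<f"),
    "fp64": struct.calcsize("<d"),
}

def payload_bytes(num_floats, dtype_name):
    """Raw payload size in bytes: element count times element width."""
    return num_floats * FLOAT_WIDTH[dtype_name]

latent_fp32 = payload_bytes(32, "fp32")    # 32-float latent at fp32
latent_fp16 = payload_bytes(32, "fp16")    # same latent at half precision
pooled_fp32 = payload_bytes(4096, "fp32")  # uncompressed 4096-float summary
compression_ratio = pooled_fp32 // latent_fp32
```

Under these assumed sizes the latent is 128 bytes at fp32 and the uncompressed summary is 16384 bytes, a 128-fold reduction; changing the latent dim or dtype changes the receipt, which is the point of measuring rather than assuming.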

Problem C: How does the receiver actually use the latent?

Here is where the radio paper and the LLM mapping diverge a little, because the receiver is not the same kind of object in the two domains. In the radio paper, the receiver is a target base station running its own GRU + heterogeneous graph transformer, and the projection block lands the decoded latent back into the target’s recurrent state space. In the agent-side V1, the receiver is a frozen Qwen2.5-7B-Instruct decoder, and the projection block has to land the decoded latent into something the decoder can attend over.

The cleanest thing a frozen decoder can attend over is its own token embedding space. So the projector lifts the latent into K memory vectors that live in the same vector space as real token embeddings, and the receiver concatenates them in front of the question. From src/projector/gated_mlp.py:

class GatedLatentToMemoryProjector(nn.Module):
    """
    Map z ∈ R^{latent_dim} to memory ∈ R^{K × model_hidden}.

    The gate uses a SiLU nonlinearity (smooth ReLU) so gradients don't die on negative pre-activations
    the way they would with a plain sigmoid gate saturating at initialization.
    """

The trunk is a small MLP; the gate head and the value head split off it; the gate is sigmoided, the value is left raw, and an elementwise product squashes out spurious latent directions before the final reshape into (batch, K, model_hidden). Two paths, one product, one reshape. The comment about applying the gate in fp32 even when the LM runs in bf16 is a Pascal-era footnote: the GTX 1080 used as the canonical reference for this series does not have a full-speed bf16 ALU, so gating in fp32 is just polite to the hardware.
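The "two paths, one product, one reshape" description can be traced on toy numbers. The following sketch skips the learned trunk and heads entirely and starts from hypothetical gate logits and values, so it only illustrates the gating and reshape arithmetic; `gated_project` and the sizes K = 2, D = 3 are assumptions for the example, not the repo's projector.

```python
import math

def sigmoid(x):
    """Logistic function mapping a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_project(gate_logits, values, num_memory_tokens, model_hidden):
    """Elementwise sigmoid(gate) * value, then reshape the flat (K * D,) result into K rows of D."""
    expected = num_memory_tokens * model_hidden
    if len(gate_logits) != expected or len(values) != expected:
        raise ValueError("expected K * D gate logits and K * D values")
    flat = [sigmoid(g) * v for g, v in zip(gate_logits, values)]
    return [flat[k * model_hidden:(k + 1) * model_hidden] for k in range(num_memory_tokens)]

# Toy sizes: K = 2 memory tokens, D = 3 hidden dims (hypothetical).
gate_logits = [10.0, -10.0, 0.0, 10.0, -10.0, 0.0]  # open, closed, half-open gates
values = [2.0, 2.0, 2.0, -4.0, -4.0, -4.0]
memory = gated_project(gate_logits, values, num_memory_tokens=2, model_hidden=3)
```

A strongly positive gate logit passes the value through almost unchanged, a strongly negative one suppresses it to nearly zero, and a logit of zero halves it, which is how the product can mute directions the receiver should not see.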

The actual concatenation happens inside the harness. From src/brokers/harness.py:

prefix = torch.cat([memory_tokens.unsqueeze(0), q_embeds], dim=1)

That is the whole act. Memory tokens go in front; question token embeds follow. The receiver decodes from this prefix using inputs_embeds instead of input_ids, and the rest of the generation loop is a standard greedy decode that reuses KV cache entries the way any production decoder would after the first wide forward pass.
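The shape bookkeeping of that one line is worth spelling out. The sketch below mimics concatenation along the sequence axis with nested Python lists for a single sample; `concat_prefix` and the sizes K = 4, Q = 3, D = 8 are hypothetical, chosen only to make the shapes visible.

```python
def concat_prefix(memory_tokens, question_embeds):
    """Concatenate along the sequence axis: (K, D) followed by (Q, D) gives (K + Q, D)."""
    if memory_tokens and question_embeds and len(memory_tokens[0]) != len(question_embeds[0]):
        raise ValueError("memory and question embeddings must share the hidden size D")
    return memory_tokens + question_embeds

K, Q, D = 4, 3, 8  # hypothetical sizes for illustration
memory_tokens = [[0.1] * D for _ in range(K)]    # stand-in for projected memory vectors
question_embeds = [[0.2] * D for _ in range(Q)]  # stand-in for embedded question tokens

prefix = concat_prefix(memory_tokens, question_embeds)
# Every prefix position is real input (no padding), so the attention mask is all ones.
attention_mask = [1] * len(prefix)
```

The memory rows occupy the first K positions and the question rows follow, so the decoder attends over K + Q positions on its first forward pass regardless of how long Agent A's original context was.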

Those three problems (what to carry, how small to make it, how to inject it on the other side) are the entire substance of “send a latent across the hand-off.” Section 4 walks through how the pieces glue together end to end.


4. The compress → transport → project pipeline (the actually-cool part)

The whole agent-side pipeline is six steps, with the wiring living in four files. Picture it as a horizontal pipe.

Step 0:  Assemble compressor + projector + transport at engine init   (load_ilcp_modules)
Step 1:  Encode Agent A's context to a pooled summary s_A             (QwenContextEncoder.pooled_embedding_for_text)
Step 2:  Compress s_A to a latent z via the β-VAE encoder mean μ      (BetaVAE.encode)
Step 3:  Pack z into a TransportPayload (CPU staged bytes)            (InProcessTransport.pack)
Step 4:  Unpack on the receiver and project z to K memory tokens      (InProcessTransport.unpack + GatedLatentToMemoryProjector)
Step 5:  Concat memory in front of question embeds and greedy-decode  (greedy_generate_from_memory_prefix)

Let’s walk through each one with the real code. Snippets are deliberately short; the full files are tiny and worth looking into.

Step 1: Pool Agent A’s context

This is the boring step that makes the rest of the pipeline possible. Agent A reads its working context with the same Qwen2.5-7B-Instruct that Agent B will run, but instead of decoding tokens out, we ask the model to give us the final-layer hidden states and pool them into a single vector. From src/brokers/qwen_encoder.py:

@torch.inference_mode()
def encode_contexts(
    self,
    texts: list[str],
    max_length: int = 512,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Return pooled embeddings (batch, hidden) and the attention mask used for auditing shapes.

    torch.inference_mode() disables version counter bookkeeping entirely vs no_grad for slightly lower overhead
    when sweeping thousands of contexts for compressor dataset construction on a budget GPU.
    """
    batch = self.tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    batch = {k: v.to(self.device) for k, v in batch.items()}
    outputs = self.lm(**batch, output_hidden_states=True, use_cache=False)
    last_hidden = outputs.hidden_states[-1]
    pooled = self.masked_mean_pool(last_hidden, batch["attention_mask"])
    return pooled, batch["attention_mask"]

One forward pass with output_hidden_states=True, take the last layer’s hidden states, apply the masked mean pool from section 3, and you are out. The pooled tensor is (batch, hidden_size), which for Qwen2.5-7B is (1, 3584) per sample. That is the s_A vector the paper’s section 4 talks about, except that for a phone moving between cell towers, the same vector is built by a 128-dim GRU over radio measurements. Different sensors, same role.

Step 2: Compress to a latent z

From src/brokers/harness.py:

@torch.inference_mode()
def ilcp_memory_from_context(self, context: str, device: torch.device) -> torch.Tensor:
    """
    Compress→transport→project pipeline returning receiver memory tensor (K, D) on `device`.

    Using the VAE encoder mean μ (not a stochastic sample) stabilizes multi-trial latency and quality
    comparisons the way a deployed system would freeze stochasticity after calibration.
    """
    s_a = self.encode_sender_summary(context).to(device)
    mu, _logvar = self.vae.encode(s_a.unsqueeze(0))
    z = mu.squeeze(0)
    payload = self.transport.pack(z)
    z_b = self.transport.unpack(payload, device=device)
    mem = self.projector(z_b.unsqueeze(0)).squeeze(0)
    return mem

That is the whole compress→transport→project hop. The docstring makes a quiet but important production choice explicit: we use the encoder mean μ at inference, not a stochastic reparameterised sample. The radio paper does the same thing in deployment; the stochasticity is only useful during training. Freezing it after calibration is standard practice for shipped VAE-bottleneck systems, and the comment says so.

Inside the β-VAE, encode returns mean and logvar; we drop logvar because we are not sampling here. From src/compressor/beta_vae.py:

def encode(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Map a batch of summary embeddings to diagonal-Gaussian parameters.

    Returning logvar instead of std avoids a sqrt during training and improves numerical stability
    when variances become tiny (avoids division blow-ups in KL closed form).
    """
    # Flatten optional middle dimensions so the encoder always sees (batch, input_dim).
    h = self._encoder_body(x)
    mu = self._enc_mu(h)
    logvar = self._enc_logvar(h)
    return mu, logvar

Standard β-VAE encoder. The two separate heads for μ and logvar exist because tying them would constrain the curvature of the latent geometry, which the comment lays out in one sentence so the next reader does not have to wonder.

Step 3: Pack into a TransportPayload

The transport boundary is deliberately explicit. Even though Agent A and Agent B share the same Python process in V1, the hand-off is wrapped in a serialise/deserialise step so that future versions can swap in a real network (gRPC, shared-memory ring buffers, a wire protocol) without rewriting the call sites. From src/transport/in_process.py:

@dataclass(frozen=True)
class TransportPayload:
    """
    Immutable container for the latent tensor plus minimal metadata for audit logs.

    Freezing the dataclass prevents accidental in-place mutation that would desync byte-size receipts
    between the sending and receiving agent in multi-threaded harnesses.
    """

    latent: torch.Tensor
    dtype_name: str
    latent_dim: int

    def byte_length(self) -> int:
        """
        Compute exact serialized byte length for README tables (torch.save uses pickle; we use raw bytes).

        Raw contiguous bytes mirror the "payload over Xn" story more honestly than pickling full tensors.
        """
        return int(self.latent.numel() * self.latent.element_size())

byte_length() is the agent-side analog of the radio paper’s “128 B payload over Xn.” It computes the actual transferable byte count from the actual tensor, with no spec-sheet assumptions. The radio paper says 128 bytes because the 3GPP Xn HANDOVER REQUEST has a strict optional-IE size budget. The agent side does not have that constraint yet, but it is wired to measure whatever size it lands on, so when it does get a real network transport, the receipt will be honest.

The pack step itself is small and deliberate:

def pack(self, z: torch.Tensor) -> TransportPayload:
    """
    Detach from autograd, move to staging device, and record dtype for round-trip fidelity.

    Detaching breaks the graph on purpose: transport is a hand-off boundary where sender gradients stop.
    """
    z_staged = z.detach().to(self.device).contiguous()
    return TransportPayload(
        latent=z_staged,
        dtype_name=str(z_staged.dtype),
        latent_dim=int(z_staged.shape[-1]),
    )

detach() breaks the autograd graph on purpose. The hand-off boundary is a logical boundary, not just a memory move. Sender gradients stop at this line; the receiver builds its own graph from the unpacked latent if it wants to. The radio paper imposes the same boundary at the Xn interface, for the same reason: the source and target base stations are different processes on different machines, and a cross-process autograd graph is a bad idea regardless of industry.
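To show what "raw bytes on the wire" could look like once the transport stops being in-process, here is a hedged sketch using only the standard library. It is not the repo's transport: `pack_latent`, `unpack_latent`, and the little-endian fp32 layout are assumptions made for illustration, and V1 as described keeps the tensor in memory.

```python
import struct

def pack_latent(z):
    """Serialise a list of floats as raw little-endian fp32 bytes (4 bytes per element)."""
    return struct.pack(f"<{len(z)}f", *z)

def unpack_latent(payload):
    """Inverse of pack_latent: rebuild the float list from the raw byte string."""
    count = len(payload) // 4
    return list(struct.unpack(f"<{count}f", payload))

# 32 toy latent values, chosen to be exactly representable in fp32 so the
# round trip can be compared for equality.
z = [i / 8.0 for i in range(32)]

payload = pack_latent(z)      # what would cross the hand-off boundary
z_b = unpack_latent(payload)  # what the receiver would reconstruct
```

The payload carries no graph, no gradients, and no pickle framing, only the 128 bytes of the latent itself, so its length is the honest on-wire size for this assumed format.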

Step 4: Unpack and project to K memory tokens

def unpack(self, payload: TransportPayload, device: torch.device | str) -> torch.Tensor:
    """
    Materialize z on the receiver device before the projector lifts it into memory embeddings.

    clone() prevents aliasing if the same payload object were accidentally reused across two agents.
    """
    return payload.latent.clone().to(device, non_blocking=True)

The receiver materialises z on its own device, clone()s defensively so two receivers reading the same payload object cannot stomp on each other, then hands it to the gated MLP. The gated MLP’s forward, from src/projector/gated_mlp.py:

def forward(self, z: torch.Tensor) -> torch.Tensor:
    """
    Return shape (batch, K, D) suitable for torch.cat along the sequence dimension with token embeds.

    Applying the gate in float32 even when the LM runs in bf16 can reduce numerical noise on consumer GPUs
    that lack full-speed bf16 ALUs (Pascal-era hardware note for GTX 1080 baselines).
    """
    # Ensure z is rank-2 so batch matmul paths stay vectorized on wide tensor cores when available.
    if z.dim() != 2:
        raise ValueError("GatedLatentToMemoryProjector expects z shaped (batch, latent_dim).")
    h = self._trunk(z)
    gate = torch.sigmoid(self._gate_head(h))
    value = self._value_head(h)
    # Elementwise product suppresses spurious directions before reshaping into memory tokens.
    mem_flat = gate * value
    return mem_flat.view(z.size(0), self.num_memory_tokens, self.model_hidden)

Trunk → gate head and value head split → elementwise product → reshape to (batch, K, D). The output D is the LM’s hidden size, so each of the K memory tokens already lives in the model’s embedding space and can be concatenated alongside real token embeddings with no separate space-matching layer.

Step 5: Concat in front of the question and greedy-decode

Now the receiver actually answers. From src/brokers/harness.py:

@torch.inference_mode()
def greedy_generate_from_memory_prefix(
    lm: nn.Module,
    tokenizer,
    embed_layer: nn.Module,
    memory_tokens: torch.Tensor,
    question_prompt: str,
    max_new_tokens: int,
    device: torch.device,
) -> str:
    """
    Greedy decoding starting from concatenated soft prompts + question token embeddings.

    The loop alternates between a wide prefix forward (first step) and thin single-token steps that
    reuse KV cache entries the way a production decoder would after a hand-off frame arrives.
    """
    q_batch = tokenizer(question_prompt, return_tensors="pt", truncation=True, max_length=512)
    q_ids = q_batch["input_ids"].to(device)
    q_embeds = embed_layer(q_ids)
    prefix = torch.cat([memory_tokens.unsqueeze(0), q_embeds], dim=1)
    attention_mask = torch.ones(prefix.shape[:2], device=device, dtype=torch.long)
    past = None
    embed_step = prefix
    generated: list[int] = []
    for _ in range(max_new_tokens):
        out = lm(
            inputs_embeds=embed_step,
            attention_mask=attention_mask,
            past_key_values=past,
            use_cache=True,
            return_dict=True,
        )
        past = out.past_key_values
        logits = out.logits[:, -1, :]
        next_id = int(logits.argmax(dim=-1).item())
        generated.append(next_id)
        next_emb = embed_layer(torch.tensor([[next_id]], device=device, dtype=torch.long))
        embed_step = next_emb
        add = torch.ones((1, 1), device=device, dtype=torch.long)
        attention_mask = torch.cat([attention_mask, add], dim=1)
    return tokenizer.decode(generated, skip_special_tokens=True)

One wide prefix forward over [memory_tokens, question_embeds], then standard one-token-at-a-time greedy decode reusing past_key_values. The receiver never tokenises Agent A's original context. It doesn't have to. The information that would have come from reading that context is already present in the K memory tokens, projected directly into the receiver's embedding space.
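The control flow can be looked at separately from the model. The sketch below replays the same loop against a stub step function so the bookkeeping is visible: one wide first step over the prefix, then single-token steps, with the attention mask growing by one each time. The names greedy_loop, step_fn and stub_step, and the optional eos_id early stop, are assumptions for illustration; the harness excerpt always runs max_new_tokens steps.

```python
from typing import Callable, List, Optional, Tuple

# step_fn(input_width, cache_len, mask_len) -> (next_token_id, new_cache_len)
StepFn = Callable[[int, int, int], Tuple[int, int]]


def greedy_loop(step_fn: StepFn, prefix_len: int, max_new_tokens: int,
                eos_id: Optional[int] = None) -> List[int]:
    """Hypothetical skeleton of a KV-cached greedy decode."""
    cache_len = 0              # positions already stored in the KV cache
    input_width = prefix_len   # first step feeds the whole [memory, question] prefix
    mask_len = prefix_len
    generated: List[int] = []
    for _ in range(max_new_tokens):
        next_id, cache_len = step_fn(input_width, cache_len, mask_len)
        if eos_id is not None and next_id == eos_id:
            break
        generated.append(next_id)
        input_width = 1        # every later step feeds only the new token
        mask_len += 1          # mask covers cached positions plus the new one
    return generated


calls = []


def stub_step(input_width: int, cache_len: int, mask_len: int) -> Tuple[int, int]:
    """Stand-in for the LM: records shapes and emits 7, 8, 9, ... as token ids."""
    assert cache_len + input_width == mask_len  # invariant the real mask must satisfy
    calls.append((input_width, cache_len, mask_len))
    return 7 + len(calls) - 1, cache_len + input_width


out = greedy_loop(stub_step, prefix_len=12, max_new_tokens=4, eos_id=9)
```

The invariant inside stub_step is the thing to keep intact in a real decoder: at every step the attention mask length must equal the cached positions plus the positions being fed.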

For comparison, the cold baseline that V1 measures against is the standard agent flow: the receiver gets the full context plus the question as input_ids and answers from text:

def _format_agent_prompt(context: str, question: str) -> str:
    """
    Build the cold-start string that forces the receiver LM to re-read the entire Agent A context.

    Keeping the instruction delimiter style stable across branches isolates the ILCP effect from prompt drift.
    """
    return (
        "You are Agent B. Read the context carefully, then answer the question with a short span.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    )

Cold path: read the whole passage, do the whole prefill, answer. ILCP path: get a latent, project it, decode from the prefix. Same task, two contracts, one of them does the second read and one of them doesn't. That's the V1.
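A back-of-the-envelope way to see the difference between the two contracts is to count the sequence positions the receiver's first forward pass has to process. The sketch below does only that. The function name prefill_positions and every size in it are illustrative assumptions, not measurements from the repo. It ignores the instruction wrapper tokens and says nothing about answer quality.

```python
def prefill_positions(question_tokens: int, context_tokens: int = 0,
                      memory_tokens: int = 0) -> int:
    """Number of sequence positions the receiver's first forward pass must process."""
    return context_tokens + memory_tokens + question_tokens


# Illustrative sizes only: an 1800-token context, a 24-token question, K = 8 memory tokens.
cold = prefill_positions(question_tokens=24, context_tokens=1800)
ilcp = prefill_positions(question_tokens=24, memory_tokens=8)
saved_fraction = 1 - ilcp / cold
```

With these made-up sizes the cold path prefills 1824 positions and the ILCP path 32. Whether the shorter prefix still carries enough information to answer well is exactly what the missing agent-side benchmark would have to show.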


5. The receipts (from the telecom paper, NOT from LLM agents)

This is the section where every previous part of this series put a benchmark table. Part 4 is the part where I have to be honest about which receipts I'm actually allowed to put in front of you.

Quick note on methodology before anyone reaches for the rocks: every number in this section is from "Inductive Latent Context Persistence: Closing the Post-Handover Cold Start in 6G Radio Access Networks" (Banerjee & Awan, Nokia Munich, accepted at AI4NextG @ ICML 2026; preprint arXiv:2605.00593v2, June 2026). The paper evaluates ILCP on the Vienna 4G/5G drive-test, a multi-cell, multi-tier urban radio trace with dense cell overlap, 31 handover events in the held-out test split, per-step measurements at 100 Hz, with 1000-bootstrap 95% confidence intervals on every reported value. The inference hardware is one NVIDIA GTX 1080 (8 GB), Intel i7-8700K, 16 GB RAM, which is, conveniently, the same canonical reference GPU as the other parts of this series.

Method | Acc@t=0 (%) | HOF (%) | Ping-pong (%) | Ovr
ZK-HGT (no-transfer baseline) | 87.1 (74–97) | 12.9 (3–26) | 6.5 (0–16) | 75.6
GAT-Temporal | 22.6 (10–39) | 77.4 (61–90) | 61.3 (45–77) | 19.3
Transformer-Temporal | 77.4 (61–90) | 22.6 (10–39) | 22.6 (10–39) | 66.9
LSTM | 12.9 (0–26) | 87.1 (74–97) | 83.9 (71–94) | 6.3
3GPP A3/A5 (rule) | 100.0 (100–100) | 0.0 (0–0) | 3.2 (0–10) | 72.4
ILCP (ours) | 83.9 (71–94) | 16.1 (3–32) | 0.0 (0–0) | 74.1

These are 6G handover metrics, in 6G handover units, on a 6G handover trace. Acc@t=0 is "did the model pick the correct next serving cell at the moment of handover?" HOF is the fraction of handover events where the predicted next cell is incorrect. Ping-pong is the fraction of handovers reversed within a short window. It is operationally the most painful failure mode, because every reversed handover is wasted control-plane signalling on a network that already has a thousand other things to do. The ZK-HGT baseline (Zero-Knowledge HGT) is the otherwise identical architecture without ILCP (same heterogeneous graph backbone, same GRU, same candidate scorer) but with the per-user recurrent state re-initialised at every handover. ZK-HGT is the clean ablation that isolates the effect of cross-handover state persistence; ILCP is ZK-HGT plus a transferred latent.
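For readers who want to see how percentages like these arise from only 31 events, here is a small sketch. It computes the three rates from per-event boolean flags and attaches a percentile bootstrap interval. The event lists are synthetic, built only to match the ILCP row's point estimates, and the percentile bootstrap is an assumption; this post does not describe the paper's exact resampling procedure.

```python
import random
from typing import List, Tuple


def rate_pct(flags: List[bool]) -> float:
    """Percentage of events whose flag is True."""
    return 100.0 * sum(flags) / len(flags)


def bootstrap_ci(flags: List[bool], n_boot: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap 95% interval for rate_pct (resample events with replacement)."""
    rng = random.Random(seed)
    n = len(flags)
    stats = sorted(
        rate_pct([flags[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Synthetic events matching the ILCP row: 26 of 31 correct, 0 ping-pongs.
correct = [True] * 26 + [False] * 5
ping_pong = [False] * 31

acc = rate_pct(correct)                    # Acc@t=0
hof = rate_pct([not c for c in correct])   # HOF is the complement of Acc@t=0
pp = rate_pct(ping_pong)
lo, hi = bootstrap_ci(correct)
```

With 31 events every percentage moves in steps of roughly 3.2 points (one event), which is why the intervals in the table are wide and why a one-event difference between two rows should not be over-read.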

The single most operationally important result is the 0.0% ping-pong rate for ILCP versus 6.5% for ZK-HGT and 22.6% for the Transformer baseline. In dense future deployments with overlapping small cells, ping-pongs are exactly what destroys mobility quality of service. The paper's section 5.1 calls out that the A3/A5 rule reaches 100% accuracy on the clean trace only because the handover labels in the trace were themselves generated by an A3/A5-like rule, so that comparison is a sanity check, not a real win. Read the paper for the careful discussion of why the unperturbed A3/A5 row should be treated as a label-leakage artefact and not a number to chase.


6. "OK, but how is this different from prefix caching / RadixAttention / RAG memory?"

Reasonable question, and worth answering directly, because the inference-infra world has a lot of overlapping primitives and an HPC reader will ask this in the first comment.

  • vLLM prefix caching / TGI session caches. Excellent if your shared prefix is request-scoped or session-scoped. They cache the KV state inside one serving runtime so a follow-up request from the same session doesn't re-prefill the same prefix. They don't survive a hand-off where Agent B is a different model, a different process, a different machine, or even just a different llama_context, because the KV blob is tied to the local context. ILCP-for-agents is explicitly cross-process portable by construction, because the transported object is a learned summary in a portable latent space, not a KV blob in the engine's private memory format.
  • SwarmKV (Part 1 of this series). Closest cousin in the same author's body of work, with a critical difference: SwarmKV fans the same KV cache out to N branches that all share one document inside one pipeline run. ILCP-for-agents goes the other direction: it persists state across hops where Agent B's task is different from Agent A's. SwarmKV is "compute once, fan out within a run." ILCP-for-agents is "compress once, transfer across hops." Together they cover both axes of the redundant-recomputation problem.
  • SGLang RadixAttention. Tree-shaped prefix sharing inside a serving runtime. Beautiful for many requests with shared prefixes, but again scoped to the runtime. Not designed to hand a portable, model-version-tolerant summary to a different process running a different specialised agent.
  • Retrieval-augmented generation (RAG) memory. Stores chunks of text in a vector DB and retrieves them at query time. Useful, but the unit of transfer is text, which means the receiver still has to tokenize and prefill. ILCP transfers a learned latent that the receiver consumes via inputs_embeds, skipping the tokeniser and the text-side prefill of the persisted content entirely.

One-line intuition: prefix caching is a serving-runtime trick for one user's repeating prompt; RAG is a text database that the model still has to read every time; SwarmKV is a within-run KV fan-out; ILCP is the cross-hop architectural primitive that all of those aren't, lifted from a peer-reviewed 6G paper because that industry happened to need it sooner. Different problems, complementary primitives, frequently co-deployable in the same building.


7. Plot twist — this isn’t a plot twist (the telecom anchor, stated plainly)

This is the section where, in Parts 1–3, I confessed that the GPU work was secretly a telecom problem in disguise. In Part 4 it isn't a confession anymore: the codebase is the disguise being lifted off the paper.

For readers with no 3GPP background, here is the one-paragraph decoder ring. In a 5G or 6G mobile network, a phone (the UE, short for user equipment) is served by one base station (the gNB) at a time. When the phone moves, it eventually gets handed over to a new gNB. To do that handover well, the network needs to predict which gNB the phone should be served by next. Modern learned approaches do that prediction with a graph neural network (GNN) over the local topology and a recurrent module (often a GRU) over the phone's recent radio measurements. The phone's evolving hidden state (its "is it walking? is it on a tram? is it about to lose line-of-sight?" context) lives in that GRU. At handover, that GRU state is thrown away, and the target gNB has to re-initialise the per-UE recurrent state and rebuild it from the few measurements it has just received. The paper calls this the post-handover cold start. ILCP is the protocol that fixes it: compress the source-side GRU state with a β-VAE, transport the latent over the standard 3GPP Xn interface (the inter-base-station message bus), and project it back into the target gNB's state space at handover via a learned gated MLP. The whole thing fits in a 128-byte differential update piggy-backed on the existing HANDOVER REQUEST message.
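The 128-byte figure is easy to sanity-check: a 32-dimensional latent stored as FP32 takes 32 × 4 bytes. The sketch below packs and unpacks such a vector with Python's struct module. It is not the paper's wire format: pack_latent, unpack_latent and the little-endian byte order are assumptions, and it ships the raw latent rather than the differential encoding the paragraph mentions.

```python
import struct
from typing import List

LATENT_DIM = 32                     # latent size quoted for the paper's β-VAE
PAYLOAD_FORMAT = f"<{LATENT_DIM}f"  # little-endian FP32; byte order is an assumption


def pack_latent(z: List[float]) -> bytes:
    """Serialise a latent vector into a fixed-size FP32 payload."""
    if len(z) != LATENT_DIM:
        raise ValueError(f"expected {LATENT_DIM} values, got {len(z)}")
    return struct.pack(PAYLOAD_FORMAT, *z)


def unpack_latent(payload: bytes) -> List[float]:
    """Inverse of pack_latent; values come back rounded to FP32 precision."""
    return list(struct.unpack(PAYLOAD_FORMAT, payload))


z = [i / 8.0 for i in range(LATENT_DIM)]  # multiples of 1/8 are exact in FP32
payload = pack_latent(z)
restored = unpack_latent(payload)
```

A fixed-size payload is what lets the latent ride inside an existing message: the receiver knows how many bytes to read before it knows anything else about the sender.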

Read that paragraph and the architecture from sections 3 and 4 of this post side by side. Tell me with a straight face these are different problems.

6G NR handover (at the gNB) | ILCP-for-agents (at the LLM)
UE measurement and mobility history at the source gNB | Agent A's working context (pooled hidden summary)
128-dim GRU recurrent state h_u at the source gNB | 4096-dim pooled hidden vector s_A at the sender
β-VAE compressor encodes h_u to a 32-dim latent | β-VAE compressor encodes s_A to a latent z
128-byte FP32 payload over the 3GPP Xn HANDOVER REQUEST | TransportPayload bytes over in-process transport (network-protocol-agnostic in V1)
Target gNB gated MLP projects the latent into target state space | Gated MLP projects z into K memory tokens in the LM embedding space
Combined via h_new = LayerNorm(decoded_h + γ ⊙ MLP([decoded_h, x_new])) | Combined via torch.cat([memory_tokens, q_embeds]) feeding inputs_embeds
Reduces post-HO cold-start gap (paper: peak +13.3 pp, avg +5.1 pp) | Reduces hand-off cold-start (agent-side measurement is roadmap, not yet shipped)
Eliminates ping-pong handovers in the test split (paper: 0.0% vs 6.5%) | Should reduce "Agent B doesn't quite get what A meant" follow-up correction loops (agent-side measurement is roadmap)

The left column is published, peer-reviewed, measured on the Vienna 4G/5G drive-test, and accepted at the AI4NextG workshop at ICML 2026. The right column is the V1 wiring in ilcp-for-agents plus an honest "yet" next to every measurement claim. That mapping is the entire reason this post exists.


8. Honest caveats (because the comments are coming)

If you came here to find what's wrong with this project, congratulations on making it this far. To help you, from the LIMITATIONS section of the README and the inline code comments:

  1. No agent-side numerical receipts in V1. This is the single biggest caveat and it deserves to be at the top. The agent-side ilcp-for-agents V1 ships wiring and a toy data/toy_handoff.json with five examples evaluated under a strict exact-match metric. There is no agent-side three-trial benchmark campaign, no p99 latency table, no bytes-per-payload sweep, no quality study against a real held-out QA dataset. Every numerical claim in this post comes from the 6G paper, labelled accordingly, and the agent-side numbers are explicitly on the roadmap. I am not laundering RAN receipts as LLM receipts. If you came here expecting "the V1 agent benchmark," it does not exist yet, and pretending otherwise would betray the entire honest-receipts thesis of this series.
  2. Lossy state. V1 moves a pooled hidden summary, not full activations or a KV tensor. The β-VAE bottleneck is on purpose, but it is a bottleneck. There is a real risk of dropped detail that matters for the receiver's task, and the right way to discover what is being dropped is to run the agent-side benchmarks the V1 has not yet shipped. The README says this plainly.
  3. Toy metric. Exact match against five hand-written examples is a wiring check, not an open-domain QA claim. It catches "did the model produce a string at all" and "did the harness wire the prefix correctly." It does not catch "is the answer good." Replacing the toy JSON with a real held-out task is roadmap item #1.
  4. In-process transport in V1. The transport boundary is logically explicit (the TransportPayload is a deliberate pack/unpack step) but the V1 wire is torch.Tensor.detach().to('cpu'), not a real network call. A real gRPC or shared-memory ring buffer is roadmap, behind the same stable interface so call sites don't change.
  5. Pascal + bitsandbytes. The README is explicit that NF4 4-bit loading via bitsandbytes may be unavailable or unstable on some sm_61 Pascal stacks. The fall-back is torch_dtype=torch.float16 on GPU or torch.float32 on CPU, or set ILCP_MODEL_ID to a smaller instruct model. Whichever path actually ran on your receipts run should be disclosed; the README puts that disclosure in the receipts narrative file, not in the published metric.
  6. Frozen receiver. V1 treats the receiver LM as frozen and adapts only the projector. That is the cheapest possible bet, and it is the right starting point, but a more capable mapping might allow light receiver-side adaptation. The radio paper does the analog of receiver-side adaptation via the gated combination h_new = LayerNorm(decoded_h + γ ⊙ MLP([decoded_h, x_new])), where x_new is the target gNB's own freshly observed context. That equation is the exact shape an LLM-side receiver adaptation should probably take, and it is on the roadmap.
  7. Sender pooling is a single-vector summary. The masked-mean-pool over the final-layer hidden states is the official V1 pooling. It is not the only choice (last-token pooling, attention-pooled summaries, or a small learned pooler all exist) and the V1 explicitly does not claim that masked-mean is optimal. It claims it is reproducible and easy to audit. Section 3 of the radio paper makes the analogous "single fixed-size summary per hand-off" choice for the same reason: it is the simplest contract that can be measured.
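Since caveat 7 leans on masked-mean pooling being easy to audit, here is what that operation amounts to, written as a pure-Python sketch over nested lists. The function name masked_mean_pool and the toy sizes are assumptions; the repo's sender presumably does the same thing on torch tensors over final-layer hidden states.

```python
from typing import List


def masked_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the hidden vectors at positions where mask == 1, ignoring padding."""
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must cover the same positions")
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no positions")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three real tokens followed by one padding position (toy D = 2).
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
summary = masked_mean_pool(hidden, mask)
```

The padding row is deliberately extreme: if it leaked into the average the summary would be visibly wrong, which makes this a cheap check to run against any real pooling implementation.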

Everything on this list is on the roadmap. None of it changes the architectural claim. The point of putting it in writing is that you should not have to dig for it, and the moment a benchmark blog post hides its caveats is the moment its numbers stop being trustworthy. (Part 1 said exactly this in its own honest-caveats section. Part 4 inherits the policy verbatim.)


9. The V1 ceiling, and the series capstone

This is the final part. There is no Part 5 to defer the hard problems to. So instead of pointing forward, this section points backwards across the whole series and asks what we actually shipped.

The thesis was simple and the four-part shape held:

  • Part 1: Redundant prefill. SwarmKV: run prefill once, fan the KV cache out to many branches that share the same document. The fix was "compute once, copy the bytes." It saved 48.69% end-to-end and 98.09% of the second branch's activation latency on the canonical GTX 1080.
  • Part 2: Redundant waiting. Kube-TimeSlice-Profiler: when many agents share one GPU via Kubernetes CUDA time-slicing, the median lies and the p99 tells the truth. The fix was not "never share" (sharing is how a swarm of agents affords its silicon); it was "measure the tail and stop trusting pod phase." The same canonical GPU, the same Qwen-class model, a small honest tool that turns "the GPU feels slow" into a degradation factor with three decimals.
  • Part 3: Redundant CPU round-trips. CUDA-TopK-Retrieval: the agentic RAG retrieval hop wants to stay on the GPU. The fix was a 343-line CUDA kernel that keeps similarity + Top-K on the device, up to 8.57× faster end to end on the same canonical GPU at the K values that actually matter, with the K=100 ceiling documented honestly.
  • Part 4: Redundant context rebuilds. ILCP-for-agents: agent hand-offs today drop and rebuild context. The fix is a compress → transport → project protocol lifted, almost line for line, from a peer-reviewed 6G handover paper. V1 ships the wiring; the agent-side measurements are explicit future work. The transferable thing is the architecture, not the radio numbers.

Four parts, four kinds of redundant work, one underlying lesson: refusing to recompute beats every clever algorithm. Bin packing, paged attention, speculative decoding, MoE routing, mixture-of-depths: all genuinely impressive. None of them saves you anything compared to just not doing the same work twice. Most modern systems work hard. The well-engineered ones work less.

And one more lesson, the one I didn't appreciate until I sat down to write Part 4: good infrastructure ideas migrate across industries before they migrate across teams. Cell broadcast, HARQ soft-combining, OFDMA time slicing, MIMO beam codebooks, post-handover state transfer: all four of the bottlenecks this series tackled were ones the radio access network ran into first, often twenty years before the LLM crowd had a name for them. Part 1 was SIB broadcast in transformer costume. Part 2 was TDMA in a Kubernetes ConfigMap. Part 3 was UE-side beam selection through a CUDA kernel. Part 4 was Xn-side state persistence through PyTorch. Different decade, different protocol stack, same problem shape.

That was the series. I am not pretending it is finished engineering (every part has a V2 worth writing, every repo has a roadmap longer than its README) but the four-part thesis is on the page, and the same canonical GTX 1080 is in the receipts at the bottom of every post. Production-minded systems engineering, validated on a budget GPU. That was always the point.


10. Wrap

If you build agentic LLM infrastructure for a living: please go look at how your multi-hop pipeline hands off between specialised agents. If the hop is a string concatenation followed by a tokenise-and-re-prefill on the receiver, you are paying the cold-start tax. The fix does not require new transformer tricks; it requires accepting that the right unit of transfer between agents is a learned summary, not a prompt string.

If you build telecom systems for a living: the 6G handover paper beneath Part 4 is the only peer-reviewed receipt in this whole project, and it sits beneath all four parts as an honest reminder that the radio world has been solving these architecture problems for twenty years while the LLM world was still arguing about prompt templates. Come on over. The compute is good, the deadlines are softer than yours, and the cold-start problem feels familiar.

If you are a beginner who has been reading this series and got all the way to Part 4: congratulations, you now understand more about why agentic AI inference is hard than 80% of the people building it for a living. You also understand why "the bottleneck this month is X" is rarely a new problem; it is almost always an old problem rotated into a new vocabulary. Go find the old problem, because somebody has probably already solved it so you don't have to.


Disclaimer: The illustrations in this article were generated using AI (Claude Opus 4.8). They are illustrative, not photographic, and any labels visible inside the images are stylized rather than authoritative; refer to the article body and the code itself for exact function names, metric values, and architecture details.
