Building a Guardrailed LLM Trading Risk-Manager Agent for AAPL

TL;DR

Many LLM trading strategies fail because it is difficult to predict tomorrow’s returns: a task no model can do reliably [Chan, 2024].

Building on that insight, this strategy asks DeepSeek to do something different: act as a risk manager that decides how much to invest each day, not which direction the market will move.

A monthly walk-forward loop builds a fresh LLM policy table. Guardrail thresholds are reoptimized monthly, too.

Hard guardrails (a volatility stop, a drawdown stop, and a forced re-entry mechanism) sit on top of the LLM to prevent catastrophic losses.

Everything is verified out-of-sample (OOS) from January 2023 onward.

Improvements to the strategy are presented as further research.

Prerequisites

To get the most out of this blog, familiarity with a few foundational concepts will help. On the Python side, you should be comfortable working with pandas DataFrames and numpy arrays for time-series manipulation, and with making API calls via the requests library. If you want to brush up, Python for Trading: A Step-By-Step Guide covers the essentials. For the backtesting methodology, understanding what out-of-sample evaluation means and how performance metrics such as Sharpe, Sortino, Calmar, and maximum drawdown are computed is assumed throughout; What Is Backtesting & How to Backtest a Trading Strategy Using Python is a solid refresher if needed.

On the strategy side, this blog sits at the intersection of three ideas: market regimes, LLM-assisted decision-making, and risk-aware position sizing, each covered in detail in the steps that follow. For market regimes and how to discretize continuous price signals into labeled states, Market Regime using Hidden Markov Model provides the conceptual grounding. The walk-forward loop that runs through Steps 7 and 8 is explained from first principles in Walk-Forward Optimization (WFO): A Framework for More Reliable Backtesting. For the position-sizing layer (volatility targeting and the Kelly intuition), Position Sizing Strategies and Techniques in Trading covers the mechanics. Finally, this blog extends the DeepSeek workflow introduced in AI Forex Backtesting with LLM Regime Labels: DeepSeek vs KMeans in Python; if the idea of passing compact numeric summaries to an LLM and parsing strict JSON policy tables is new to you, read that one first.

A common reason LLM-powered trading strategies fail is that they ask the wrong question. You might ask the model: ‘Will the stock go up or down tomorrow?’ That is a next-day prediction task, and no model, whether statistical, machine-learned, or language-based, has a reliable edge on it for a liquid large-cap stock like AAPL.

So what is the right question? Research has highlighted the key distinction between predicting returns and predicting volatility [Chan, 2024]. Here is a reframe that could potentially work better: instead of asking the LLM to predict direction, ask it to assess risk. Specifically: ‘Given what I know about how this stock has behaved in each market state over the past three years, does today’s state look like one where I should be fully invested, or should I pull back?’

That is what this strategy does. DeepSeek reads a table of historical statistics (mean return, standard deviation, and a Sharpe-like score per market state) and outputs a policy: for each state, how much should we invest? The answer is always between 50% and 100% long. No shorts. No leverage. Just a calibrated, regime-aware exposure.

On top of the LLM layer, hard guardrails provide a safety net: if realized volatility spikes (with acceleration confirmation) or the strategy’s equity drawdown exceeds a threshold (with trend-break confirmation), it automatically de-risks, with a built-in re-entry mechanism that prevents it from being locked flat forever.

By the end of this post, you’ll understand every line of the implementation: from feature engineering to the LLM prompt to the walk-forward backtest loop.

Ready?

Let’s build it!

Every good backtest starts with a single cell you can edit to change the entire experiment. Think of it as the strategy’s control panel. Here we define three layers of settings: data parameters, cost assumptions, and the exposure mapping that translates the LLM’s output into an actual position size.

The three exposure constants (LONG_FULL_EXP, LONG_HALF_EXP, and FLAT_EXP) are the bridge between the LLM’s qualitative judgment and the portfolio’s actual risk. Notice that FLAT_EXP is 0.5, not 0.0. This is intentional: the LLM being cautious means ‘reduce exposure’, not ‘exit entirely’. It keeps the strategy participating in the market’s positive drift even when the model is uncertain.

Why FLAT = 0.5? In a bull market like AAPL 2023–2026, going fully to cash on every cautious signal would have cost roughly 10 percentage points of CAGR. The 0.5 floor preserves participation while still communicating the LLM’s risk-off intent.

The MAX_FLAT_DAYS and REENTRY_SIZE settings address a subtle bug that can silently destroy a strategy: the guardrail deadlock. We will return to this when we cover the guardrail logic in Step 5.
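As a rough illustration, a settings cell along these lines would hold the three layers described above. The exposure levels, TARGET_VOL, MAX_LEVERAGE, MAX_FLAT_DAYS, and REENTRY_SIZE use the values quoted in this post; OOS_START and COST_BPS are hypothetical names, and the cost value is an assumption.

```python
# Sketch of a settings cell. Names or values marked "assumed" are not from the post.

# --- Data parameters ---
SYMBOL = "AAPL"
TRAIN_YEARS = 3              # rolling training window, in years
OOS_START = "2023-01-01"     # assumed name; OOS evaluation starts January 2023

# --- Cost assumptions ---
COST_BPS = 1.0               # assumed name and value: cost per unit of turnover, in bps

# --- Exposure mapping (fraction of capital invested) ---
LONG_FULL_EXP = 1.0
LONG_HALF_EXP = 0.8
FLAT_EXP = 0.5               # cautious means "reduce", not "exit"

# --- Sizing ---
TARGET_VOL = 0.25            # annualized volatility target
MAX_LEVERAGE = 1.0           # never lever up

# --- Guardrail re-entry ---
MAX_FLAT_DAYS = 10           # flat days before the re-entry gate opens
REENTRY_SIZE = 1.0           # multiplier on the agent's proposed position

EXPOSURE_LEVELS = (FLAT_EXP, LONG_HALF_EXP, LONG_FULL_EXP)
```

Keeping every tunable in one place makes it easy to confirm at a glance that no exposure level exceeds the leverage cap.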

In algorithmic trading: no data, no strategy. But it is not enough to download data: you have to download the right data. For a stock like AAPL that has split several times since 1990, this matters enormously.

The auto_adjust=True flag is non-negotiable. Without it, the split days produce enormous fake return spikes that poison every downstream calculation: rolling volatility, z-scores, trend scores, everything. We also add a sanity check immediately after feature engineering.
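A minimal sketch of such a sanity check is shown here on synthetic prices. The function name and the 50% threshold are assumptions chosen for illustration; the idea is simply to fail loudly if any daily return is implausibly large.

```python
import pandas as pd

def check_no_split_artifacts(close: pd.Series, max_abs_ret: float = 0.5) -> pd.Series:
    """Return daily returns; raise if any move looks like an unadjusted split."""
    ret = close.pct_change().dropna()
    worst = ret.abs().max()
    if worst >= max_abs_ret:
        raise ValueError(f"suspicious daily return {worst:.2%}: prices may be unadjusted")
    return ret

# Synthetic illustration: an unadjusted 4-for-1 split looks like a -75% day.
adjusted = pd.Series([100.0, 101.0, 102.0, 103.0])
unadjusted = pd.Series([400.0, 404.0, 102.0, 103.0])

clean_ret = check_no_split_artifacts(adjusted)   # passes silently
```

Calling the same function on `unadjusted` raises a `ValueError`, which is the behaviour you want before any rolling statistic is computed.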

This kind of defensive check takes two lines and has saved more than one backtest from producing completely wrong results.

Raw prices tell you where the market is. Features tell you what mood it is in. We compute seven signals from OHLCV data, all with a single goal: to describe the current market state in a compact, interpretable way.
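The exact seven signals are not listed here, so the following is only a sketch built from the quantities this post mentions elsewhere (realized vol, short-term vol5, a 20-day trend score, a z-score, 63-day momentum, and the 50-day moving average). The column names and the 20-day windows for volatility and the z-score are assumptions.

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a hypothetical feature set derived from the Close column."""
    out = df.copy()
    close = out["Close"]
    out["ret"] = close.pct_change()                                   # daily return
    out["vol20"] = out["ret"].rolling(20).std() * np.sqrt(252)        # realized vol (assumed window)
    out["vol5"] = out["ret"].rolling(5).std() * np.sqrt(252)          # short-term vol
    out["trend20"] = close.pct_change(20)                             # 20-day trend score
    out["z20"] = (close - close.rolling(20).mean()) / close.rolling(20).std()
    out["mom63"] = close.pct_change(63)                               # 63-day momentum
    out["ma50"] = close.rolling(50).mean()                            # 50-day moving average
    return out

# Synthetic prices so the sketch runs without a data download.
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Close": 100 * np.exp(np.cumsum(rng.normal(0.0005, 0.015, 300)))},
    index=pd.bdate_range("2020-01-01", periods=300),
)
feats = add_features(prices).dropna()
```

The longest lookback (63 days) determines how many warm-up rows are lost to `dropna()`.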

We invite you to consider more features and technical indicators such as those provided by the ta-lib library. You can find an installation guide here.

Continuous features are hard to reason about and produce too many combinations. Instead, we bucket each signal into two or three categories, then combine them into a single state string. The result is 12 distinct states: small enough for the LLM to reason about and for each state to have hundreds of historical examples.

A few design decisions are worth noting. The vol threshold uses a rolling 252-day median rather than a fixed number: this makes it adaptive across years of data. What was ‘high volatility’ in the low-vol 2017 environment would be ‘normal’ in the post-COVID environment. Using the rolling median handles this automatically.

The resulting state strings are intentionally human-readable. When we pass them to DeepSeek, the model can use its world knowledge about what ‘trending upward in a calm, overbought market’ historically implies for risk, which is exactly the kind of qualitative judgment we want from it.
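One way to arrive at 12 states is 2 trend buckets × 2 vol buckets × 3 z-score buckets. The sketch below assumes that split; the label names (apart from VOL_HIGH, which appears later in this post) and the ±1 z-score cut-off are illustrative assumptions, while the rolling 252-day median threshold follows the description above.

```python
import numpy as np
import pandas as pd

def make_states(trend: pd.Series, vol: pd.Series, z: pd.Series,
                z_cut: float = 1.0, median_window: int = 252) -> pd.Series:
    """Bucket three signals into a 'TREND|VOL|Z' state string (hypothetical labels)."""
    vol_median = vol.rolling(median_window, min_periods=median_window).median()
    trend_lbl = np.where(trend.to_numpy() > 0, "TREND_UP", "TREND_DOWN")
    vol_lbl = np.where(vol.to_numpy() > vol_median.to_numpy(), "VOL_HIGH", "VOL_LOW")
    z_arr = z.to_numpy()
    z_lbl = np.select([z_arr > z_cut, z_arr < -z_cut], ["Z_HIGH", "Z_LOW"], default="Z_MID")
    labels = [f"{t}|{v}|{zz}" for t, v, zz in zip(trend_lbl, vol_lbl, z_lbl)]
    state = pd.Series(labels, index=trend.index, dtype=object)
    # No state until the rolling median has a full window of history.
    return state.where(vol_median.notna())

# Synthetic inputs for illustration only.
rng = np.random.default_rng(1)
idx = pd.bdate_range("2020-01-01", periods=400)
trend = pd.Series(rng.normal(0, 0.05, 400), index=idx)
vol = pd.Series(rng.uniform(0.1, 0.4, 400), index=idx)
z = pd.Series(rng.normal(0, 1.2, 400), index=idx)
states = make_states(trend, vol, z)
```

Because the volatility bucket compares against its own rolling median, the same absolute vol level can be labelled differently in different years.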

This is the heart of the strategy. Once a month, we ask DeepSeek to read the historical statistics for each state and output a policy table: for each state, should we be fully invested, partially invested, or pull back?

The key design decision is how we frame the task. You could try a strategy where you ask the LLM to pick a direction (LONG / SHORT / FLAT). However, the best framing is to ask it to act as a risk manager, not a forecaster.

Computing the State Statistics

First, we compute the statistics the LLM will reason about. This function uses a one-day lag to avoid lookahead bias: the state at the close of day t-1 is used to predict the return on day t.

The sharpe_like column is the most important number in the table. A strongly positive value means that historically, when the market was in this state, AAPL tended to have good risk/reward the next day. Strongly negative means the opposite. Near zero means the evidence is inconclusive.
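The lagged statistics could be computed roughly as follows. The function name and the annualization of sharpe_like by √252 are assumptions; the one-day shift of the state is the part taken from the description above.

```python
import numpy as np
import pandas as pd

def state_stats(state: pd.Series, ret: pd.Series) -> pd.DataFrame:
    """Per-state stats of the NEXT day's return (state at t-1 paired with ret at t)."""
    df = pd.DataFrame({"state_lag": state.shift(1), "ret": ret}).dropna()
    stats = df.groupby("state_lag")["ret"].agg(n="count", mean="mean", std="std")
    # Assumed definition: annualized mean/std ratio with a zero risk-free rate.
    stats["sharpe_like"] = stats["mean"] / stats["std"] * np.sqrt(252)
    return stats

# Tiny hand-checkable example.
demo_state = pd.Series(["A", "A", "B", "B", "A"])
demo_ret = pd.Series([0.00, 0.01, 0.03, -0.02, 0.02])
stats = state_stats(demo_state, demo_ret)
```

In the toy data, state A is followed by returns of 0.01 and 0.03, and state B by -0.02 and 0.02, so A gets a positive score and B a score of zero.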

The Prompt: Framing the LLM as a Risk Manager

The system prompt is where the strategy’s philosophy lives. Read it carefully: every sentence was chosen deliberately.

Three principles guide this prompt design:

1. Reframe the task explicitly

‘Your job is NOT to predict tomorrow’s price direction’: we state what the model should NOT do before saying what it should. Language models are sensitive to task framing, and without this line the model tends to drift back into direction-forecasting mode.

2. Default to action, penalize inaction

‘Default bias: LONG. Most states should be LONG.’ This counteracts the model’s natural tendency to be conservative when uncertain. In a long-only strategy on a stock with positive long-run drift, staying flat has a real cost: you miss the market’s positive expected return.

3. Strict JSON output with retry

Asking for strict JSON and providing the exact schema prevents the model from adding prose around the output that breaks parsing. We also added a retry mechanism and raised max_tokens to 2000: earlier versions with 900 tokens would truncate the response mid-JSON, causing silent parse failures.
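The third principle can be sketched without touching the network by injecting the LLM call as a plain function. Everything named here (the `{"policy": {...}}` schema, the action labels, `parse_policy`, `get_policy`) is a hypothetical stand-in rather than the notebook's actual code; the point is that a truncated reply fails validation loudly and triggers a retry instead of passing silently.

```python
import json
from typing import Callable, Dict, Iterable

ALLOWED_ACTIONS = {"LONG_FULL", "LONG_HALF", "FLAT"}  # hypothetical action labels

def parse_policy(text: str, states: Iterable[str]) -> Dict[str, str]:
    """Parse strict JSON and check every state has a valid action."""
    states = list(states)
    policy = json.loads(text)["policy"]          # assumed schema: {"policy": {state: action}}
    for s in states:
        if policy.get(s) not in ALLOWED_ACTIONS:
            raise ValueError(f"missing or invalid action for state {s!r}")
    return {s: policy[s] for s in states}

def get_policy(call_llm: Callable[[], str], states: Iterable[str],
               max_retries: int = 3) -> Dict[str, str]:
    """Call the model until a valid policy comes back, up to max_retries times."""
    states = list(states)
    last_err = None
    for _ in range(max_retries):
        try:
            return parse_policy(call_llm(), states)
        except (KeyError, TypeError, AttributeError, ValueError) as err:
            last_err = err                        # JSONDecodeError is a ValueError
    raise RuntimeError(f"no valid policy after {max_retries} attempts") from last_err

# Fake model: first reply is truncated mid-JSON, second is complete.
replies = iter([
    '{"policy": {"UP|CALM": "LONG_FU',
    '{"policy": {"UP|CALM": "LONG_FULL", "DOWN|STRESSED": "FLAT"}}',
])
policy = get_policy(lambda: next(replies), ["UP|CALM", "DOWN|STRESSED"])
```

In a real run, `call_llm` would wrap the HTTP request to the model with max_tokens set high enough to avoid truncation in the first place.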

Pro tip: Always cache LLM responses. The cache_key includes the month, symbol, model name, and a version tag (‘longonly-v1’). Re-running the backtest makes no additional API calls once the cache is warm. The cache file is a plain JSON dictionary you can inspect directly to see exactly what the LLM decided each month.
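A minimal sketch of such a cache is shown below. The key components come from the tip above, but the key format, the function names, and the file name are assumptions; the fetch function is a stand-in for the real API call.

```python
import json
import os
import tempfile

def make_cache_key(month: str, symbol: str, model: str, version: str = "longonly-v1") -> str:
    """Hypothetical key format combining the four components named in the post."""
    return f"{symbol}|{month}|{model}|{version}"

def cached_policy(cache_path: str, key: str, fetch):
    """Return the cached policy for key, calling fetch() only on a cache miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if key not in cache:
        cache[key] = fetch()                      # the only place an API call would happen
        with open(cache_path, "w") as f:
            json.dump(cache, f, indent=2)         # plain JSON, easy to inspect by hand
    return cache[key]

# Stand-in for the LLM call, counting how often it is invoked.
calls = []
def fake_fetch():
    calls.append(1)
    return {"UP|CALM": "LONG_FULL"}

cache_path = os.path.join(tempfile.mkdtemp(), "policy_cache.json")
key = make_cache_key("2023-01", "AAPL", "deepseek-chat")
first = cached_policy(cache_path, key, fake_fetch)
second = cached_policy(cache_path, key, fake_fetch)   # served from disk
```

Bumping the version tag whenever the prompt changes is what keeps stale policies from being reused.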

The LLM outputs a qualitative judgment (LONG full / LONG half / FLAT). We need to convert that into a number: the fraction of capital to invest. This happens in two steps: first map the action to a base exposure, then scale by volatility targeting.

Exposure Mapping

This mapping is the strategy’s personality. The three exposure levels (1.0 / 0.8 / 0.5) define how aggressively the LLM’s risk signals translate into position changes. These are tunable parameters in the settings cell: if the LLM is too cautious (calling FLAT frequently), raising FLAT_EXP to 0.6 keeps more capital at work without changing the prompt.
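In code, the mapping can be as small as a dictionary lookup. The action labels and the fallback to FLAT_EXP for an unrecognized action are assumptions made for this sketch.

```python
LONG_FULL_EXP, LONG_HALF_EXP, FLAT_EXP = 1.0, 0.8, 0.5

# Hypothetical action labels mapped to the exposure levels quoted in the post.
ACTION_TO_EXPOSURE = {
    "LONG_FULL": LONG_FULL_EXP,
    "LONG_HALF": LONG_HALF_EXP,
    "FLAT": FLAT_EXP,
}

def base_exposure(action: str) -> float:
    """Map an LLM action to a base exposure; unknown actions fall back to FLAT_EXP (assumed)."""
    return ACTION_TO_EXPOSURE.get(action, FLAT_EXP)
```

For example, `base_exposure("LONG_HALF")` returns 0.8, and a malformed label degrades to the cautious 0.5 rather than raising mid-backtest.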

Volatility Targeting

Volatility targeting keeps the strategy’s risk contribution roughly constant regardless of market regime. When AAPL’s realized vol is 0.30 (stressed) and TARGET_VOL is 0.25, the scale factor is 0.25/0.30 ≈ 0.83: the strategy automatically holds slightly less. When vol is 0.15 (calm), the scale is 0.25/0.15 ≈ 1.67, but MAX_LEVERAGE=1.0 caps it so we never apply leverage.
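The arithmetic above fits in a few lines. The function name is hypothetical, and raising on non-positive volatility is an assumption about how to handle bad input.

```python
def vol_scale(realized_vol: float, target_vol: float = 0.25, max_leverage: float = 1.0) -> float:
    """Scale factor target_vol / realized_vol, capped at max_leverage."""
    if realized_vol <= 0:
        raise ValueError("realized_vol must be positive")
    return min(target_vol / realized_vol, max_leverage)

stressed = vol_scale(0.30)   # 0.25 / 0.30, about 0.83
calm = vol_scale(0.15)       # 0.25 / 0.15 is about 1.67, capped at 1.0
```

With the cap at 1.0, volatility targeting can only ever reduce the LLM's base exposure, never amplify it.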

Momentum Tilt

A final overlay tilts position size by +/-15% based on 63-day momentum, before applying the MAX_LEVERAGE cap.

When the 63-day cumulative return is positive (uptrend), the position scales up by 15%. When negative (downtrend), it scales down by 15%. This gives the strategy a momentum bias without changing the LLM’s qualitative judgment.
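A sketch of the tilt, with the cap applied last as described. The function name is hypothetical, and leaving the position unchanged when momentum is exactly zero is an assumption, since that case is not specified.

```python
def tilted_position(base: float, mom63: float, tilt: float = 0.15,
                    max_leverage: float = 1.0) -> float:
    """Scale base by (1 +/- tilt) depending on the sign of 63-day momentum, then cap."""
    if mom63 > 0:
        factor = 1.0 + tilt
    elif mom63 < 0:
        factor = 1.0 - tilt
    else:
        factor = 1.0              # assumed: no tilt at exactly zero momentum
    return min(base * factor, max_leverage)

up = tilted_position(0.8, 0.10)       # 0.8 * 1.15 = 0.92
down = tilted_position(0.8, -0.10)    # 0.8 * 0.85 = 0.68
capped = tilted_position(1.0, 0.10)   # 1.15 capped to 1.0
```

Note the asymmetry the cap introduces: a full position cannot be tilted up, but it can still be tilted down.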

The final position is assembled in build_agent_positions, which feeds the continuous state_lag and vol_lag so the first day of each month is not artificially forced flat.

Execution timing: pos[t] is the position held during day t, decided at the close of t-1. It earns ret[t] directly, with no extra shift. This is a single-lag execution model: observe state at close t-1, set position, hold through close t, collect return. No lookahead.
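The single-lag timing can be illustrated with a simplified stand-in for build_agent_positions. This sketch is not the notebook's function: it omits the momentum tilt for brevity, and the state labels, action labels, and function name are hypothetical.

```python
import pandas as pd

def build_positions_sketch(state: pd.Series, vol: pd.Series, policy: dict, exposure: dict,
                           target_vol: float = 0.25, max_leverage: float = 1.0) -> pd.Series:
    """pos[t] uses only information from the close of t-1 (state_lag, vol_lag)."""
    state_lag = state.shift(1)
    vol_lag = vol.shift(1)
    base = state_lag.map(policy).map(exposure)               # action, then base exposure
    scale = (target_vol / vol_lag).clip(upper=max_leverage)  # vol targeting with cap
    return (base * scale).fillna(0.0)                        # no position before first lag exists

idx = pd.bdate_range("2023-01-02", periods=4)
state = pd.Series(["CALM", "CALM", "STRESSED", "CALM"], index=idx)
vol = pd.Series([0.20, 0.25, 0.50, 0.30], index=idx)
ret = pd.Series([0.01, 0.02, -0.03, 0.01], index=idx)

policy = {"CALM": "LONG_FULL", "STRESSED": "FLAT"}
exposure = {"LONG_FULL": 1.0, "FLAT": 0.5}

pos = build_positions_sketch(state, vol, policy, exposure)
strat_ret = pos * ret          # pos[t] earns ret[t] directly, no extra shift
```

On the last day the position is 0.5 × (0.25 / 0.50) = 0.25, because it reacts to the stressed state and high vol observed at the previous close. Shifting over the full series, rather than month by month, is what keeps the first day of each month from losing its lagged inputs.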

The LLM makes decisions based on historical statistics. But markets can behave in ways that have no historical precedent: flash crashes, surprise macro announcements, sudden liquidity events. Hard guardrails provide a last line of defense that operates independently of the LLM’s judgment.

Three triggers can override the LLM’s position:

Volatility stop: if realized vol exceeds vol_stop AND short-term vol (vol5) exceeds 80% of the vol-stop threshold (acceleration condition), cut to zero. Requiring both conditions prevents false exits on single-day vol spikes that quickly revert.

Drawdown stop: if the strategy’s equity has fallen more than dd_limit from its peak AND the trend is broken (63-day momentum is negative AND price is below the 50-day moving average), cut to zero. The trend-confirmation filter prevents exits during bull-market pullbacks where the drawdown is temporary.

Cooldown: stay flat for cooldown_days after either stop fires.

Warning zone: if the drawdown exceeds 60% of dd_limit (a softer threshold), halve the current position rather than exiting entirely. This creates a three-tier response: warning (halve) at 60% of the limit, hard stop (exit plus cooldown) at 100%, and forced re-entry after MAX_FLAT_DAYS.
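The trigger conditions above can be written as one pure decision function. This is a sketch under stated assumptions: the function name and return labels are hypothetical, drawdown is passed as a positive fraction from peak, and cooldown and re-entry bookkeeping are left out.

```python
def guardrail_action(vol20: float, vol5: float, drawdown: float, mom63: float,
                     price: float, ma50: float, vol_stop: float, dd_limit: float,
                     accel_frac: float = 0.8, warn_frac: float = 0.6) -> str:
    """Return 'EXIT', 'HALVE', or 'HOLD' for one day (cooldown handled elsewhere)."""
    # Volatility stop: level breach plus short-term acceleration.
    vol_hit = vol20 > vol_stop and vol5 > accel_frac * vol_stop
    # Drawdown stop: limit breach plus a confirmed trend break.
    trend_broken = mom63 < 0 and price < ma50
    dd_hit = drawdown > dd_limit and trend_broken
    if vol_hit or dd_hit:
        return "EXIT"
    # Warning zone: softer threshold at 60% of the limit.
    if drawdown > warn_frac * dd_limit:
        return "HALVE"
    return "HOLD"

a_vol = guardrail_action(0.50, 0.45, 0.00, 0.1, 110, 100, vol_stop=0.45, dd_limit=0.15)
a_noaccel = guardrail_action(0.50, 0.30, 0.00, 0.1, 110, 100, vol_stop=0.45, dd_limit=0.15)
a_dd = guardrail_action(0.20, 0.20, 0.20, -0.1, 90, 100, vol_stop=0.45, dd_limit=0.15)
a_warn = guardrail_action(0.20, 0.20, 0.10, 0.1, 110, 100, vol_stop=0.45, dd_limit=0.15)
```

One consequence of writing it this way is that a drawdown beyond dd_limit without a trend break falls through to the warning tier and halves the position.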

The Deadlock Problem and its Fix

There is a subtle bug that exists in many drawdown-stop implementations, and it silently destroyed this strategy’s performance for over 650 consecutive trading days until we caught and fixed it. Here is what happens:

The drawdown stop fires → position goes to zero → equity freezes (zero position means zero return) → peak stays above frozen equity → drawdown check fires again next day → still zero position → equity is still frozen → loop repeats forever.

A plausible fix is a re-entry counter that tracks how many consecutive days the strategy has been flat.
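The deadlock and the counter can both be reproduced in a small simulation. This sketch strips the drawdown stop down to its core (no trend filter, no cooldown, no warning zone) and assumes the peak is not reset on re-entry; the function name is hypothetical. Passing `max_flat_days=None` disables the counter to show the bug.

```python
def run_dd_stop(returns, proposed, dd_limit=0.10, max_flat_days=10, reentry_size=1.0):
    """Simulate a bare drawdown stop with an optional forced re-entry counter."""
    equity, peak, flat_days = 1.0, 1.0, 0
    positions = []
    for r, p in zip(returns, proposed):
        drawdown = 1.0 - equity / peak
        if drawdown > dd_limit:
            if max_flat_days is not None and flat_days >= max_flat_days:
                pos = reentry_size * p        # gate opens: come back in
                flat_days = 0
            else:
                pos = 0.0                     # stop holds the strategy flat
                flat_days += 1
        else:
            pos = p
            flat_days = 0
        equity *= 1.0 + pos * r
        peak = max(peak, equity)
        positions.append(pos)
    return positions, equity

# One -20% day, then 40 days of +1%: a recovery the deadlocked version never sees.
returns = [-0.20] + [0.01] * 40
proposed = [1.0] * len(returns)

pos_dead, eq_dead = run_dd_stop(returns, proposed, max_flat_days=None)
pos_fix, eq_fix = run_dd_stop(returns, proposed, max_flat_days=10)
```

Without the counter the strategy trades only on the first day and its equity stays frozen at 0.80. With the counter it re-enters after every ten flat days and starts accruing the recovery.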

After MAX_FLAT_DAYS=10 (roughly two trading weeks), the re-entry gate opens and the strategy comes back at full size (REENTRY_SIZE=1.0 × the agent’s proposed position). Once re-entered, equity starts accruing again and either recovers above the drawdown threshold or triggers another MAX_FLAT_DAYS wait. Either way, the deadlock is broken.

Monthly Guardrail Re-optimisation

The guardrail thresholds (vol_stop, dd_limit, max_delta, cooldown_days) are grid-searched on the training window every month, matching the LLM policy cadence. This means the guardrails adapt to new market conditions at the same frequency as the LLM risk assessment, at the cost of more grid-search computation per run (~36 combinations each month).
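A grid of about 36 combinations could look like the sketch below. The four parameter names come from the text, but the candidate values, the 3 × 3 × 2 × 2 layout, and the scoring function are assumptions; in practice `score_fn` would run the guardrailed backtest on the training window and return a metric such as Sharpe.

```python
from itertools import product

# Hypothetical candidate values: 3 * 3 * 2 * 2 = 36 combinations.
GRID = {
    "vol_stop": [0.35, 0.45, 0.55],
    "dd_limit": [0.10, 0.15, 0.20],
    "max_delta": [0.25, 0.50],
    "cooldown_days": [3, 5],
}

def grid_search(score_fn, grid=GRID):
    """Evaluate every combination and return (best_params, best_score, n_evaluated)."""
    keys = list(grid)
    best_params, best_score, n = None, float("-inf"), 0
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        n += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n

# Toy objective standing in for a training-window backtest metric.
def toy_score(p):
    return (-abs(p["vol_stop"] - 0.45) - abs(p["dd_limit"] - 0.15)
            - 0.01 * p["max_delta"] - 0.01 * p["cooldown_days"])

best_params, best_score, n_combos = grid_search(toy_score)
```

Because the search only ever sees the training window, the chosen thresholds are applied to the following month without having seen it.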

Why monthly? Keeping both the LLM policy and the guardrail thresholds on the same monthly cadence ensures the risk gates always reflect the same training window as the policy they are protecting.

Walk-forward backtesting is the closest thing we have to honest out-of-sample testing without actually waiting for the future. The idea is simple: at each point in time, the model only knows what it would have known at that time. No hindsight. No future data leaking into past decisions.

Our loop runs monthly: for each OOS month from January 2023 onward, it trains on the previous three years of data, builds the LLM policy, and trades the next month. Crucially, equity and positions are carried continuously: there is no reset to 1.0 at the start of each month.
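The window logic of the loop can be sketched as a generator that yields, for each OOS month, the training dates and the test dates. The function name is hypothetical and the policy-building and trading steps are omitted; only the date slicing is shown.

```python
import pandas as pd

def walk_forward_windows(index: pd.DatetimeIndex, oos_start: str = "2023-01-01",
                         train_years: int = 3):
    """Yield (month, train_index, test_index) with training strictly before the test month."""
    months = pd.period_range(start=pd.Timestamp(oos_start), end=index.max(), freq="M")
    for m in months:
        test_start, test_end = m.start_time, m.end_time
        train_start = test_start - pd.DateOffset(years=train_years)
        train_idx = index[(index >= train_start) & (index < test_start)]
        test_idx = index[(index >= test_start) & (index <= test_end)]
        if len(train_idx) and len(test_idx):
            yield m, train_idx, test_idx

# Synthetic business-day calendar covering three OOS months.
idx = pd.bdate_range("2019-01-01", "2023-03-31")
windows = list(walk_forward_windows(idx))
```

Each training window ends the day before its test month begins, which is what keeps future data out of past decisions. Carrying equity and positions across months happens outside this generator, in whatever loop consumes it.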

After the loop, we stitch the monthly blocks into one continuous OOS series and run both the agent-only and guardrailed equity curves.

Once the walk-forward loop completes, we compute a standard set of risk-adjusted performance metrics for all three curves and display them side by side. Having the buy-and-hold benchmark in the same table is essential: it is the most honest test of whether the strategy adds value.
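For reference, the metrics named in this post can be computed from a daily return series roughly as follows. This is a sketch using common textbook definitions, which may differ in detail from the notebook: a zero risk-free rate, 252 trading days per year, and downside deviation measured against zero are all assumptions.

```python
import numpy as np
import pandas as pd

def max_drawdown(ret: pd.Series) -> float:
    """Most negative peak-to-trough decline of the equity curve (starting at 1.0)."""
    equity = np.concatenate([[1.0], np.cumprod(1.0 + ret.to_numpy())])
    peak = np.maximum.accumulate(equity)
    return float((equity / peak - 1.0).min())

def perf_metrics(ret: pd.Series, periods: int = 252) -> dict:
    """CAGR, annualized vol, Sharpe, Sortino, max drawdown, and Calmar."""
    years = len(ret) / periods
    cagr = float((1.0 + ret).prod() ** (1.0 / years) - 1.0)
    vol = float(ret.std() * np.sqrt(periods))
    sharpe = float(ret.mean() / ret.std() * np.sqrt(periods))
    downside = float(np.sqrt((np.minimum(ret, 0.0) ** 2).mean()) * np.sqrt(periods))
    sortino = float(ret.mean() * periods / downside) if downside > 0 else float("nan")
    mdd = max_drawdown(ret)
    calmar = cagr / abs(mdd) if mdd < 0 else float("nan")
    return {"cagr": cagr, "vol": vol, "sharpe": sharpe,
            "sortino": sortino, "max_dd": mdd, "calmar": calmar}

# Synthetic two-year return series for illustration.
rng = np.random.default_rng(2)
demo_ret = pd.Series(rng.normal(0.0005, 0.01, 504))
metrics = perf_metrics(demo_ret)
small_dd = max_drawdown(pd.Series([0.10, -0.20, 0.05]))   # peak 1.10, trough 0.88
```

Running the same function over the agent-only, guardrailed, and buy-and-hold return series gives the side-by-side table.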

A few things to notice about these numbers:

CAGR vs Sharpe tell different stories

The strategy’s CAGR (12.5% and 12.8% depending on configuration) looks poor against buy-and-hold’s 27.1%. But that comparison can be somewhat misleading because the strategies take completely different amounts of risk. The agents run at 14.1% and 13.9% annualized vol vs AAPL’s 25%. On a risk-adjusted basis (Sharpe ratio), the gap is much smaller.

The guardrails cost return but reduce drawdown

Adding guardrails reduces CAGR slightly: the extra caution has a cost. But maximum drawdown drops from -22% to -18%, and during stress periods the guardrails prevented larger losses. The agent+guardrails version is probably the right choice for a risk-managed portfolio; agent-only is the right baseline to measure how much the guardrails helped.

The honest takeaway

With the current state features and LLM prompt, the strategy does not beat buy-and-hold on CAGR. The value the LLM provides is risk modulation, not alpha generation. If your goal is to participate in AAPL’s upside while cutting the worst drawdowns, the strategy can potentially achieve that. If your goal is to outperform a passive index, you need better predictive features, which is exactly the direction to develop next.

Figure: AAPL OOS equity curves (January 2023 – present), all rebased to 1.0.

Reading the equity curves

Three observations stand out from the chart. First, all three curves rise through 2023–2024, confirming the strategy participates in AAPL’s uptrend. Second, the Agent + Guardrails curve consistently stays close to the Agent-only line during rallies but detaches downward more slowly during corrections; that divergence is the guardrail doing its job: trimming exposure before losses compound. Third, Buy & Hold finishes highest because AAPL’s 2023–2026 uptrend was unusually strong with shallow corrections; in a choppier or bear market, the lower volatility and smaller drawdown of the guardrailed strategy would translate into a clear risk-adjusted advantage (Calmar ratio: 0.91 vs. 0.88 for Buy & Hold).

The strategy we have just built is a working foundation. As previously mentioned, we are here not to give you the best strategy, but to give you insights, tips, and the reasoning behind them, so you can think about improving your own strategy or even improve ours through further research.

Signal Quality

Multi-horizon momentum. Add 5-day and 63-day cumulative returns alongside the current 20-day trend score. The three horizons together give the LLM context about whether the stock is in an early, mid, or late-stage trend.

Earnings blackout. Force the position to FLAT_EXP in the two trading days surrounding each AAPL quarterly earnings release. Earnings dates are publicly known in advance, so this introduces no lookahead bias and reduces exposure to the strategy’s worst gap-risk events.

Macro regime filter. Add a binary feature for whether the S&P 500 is above or below its 200-day moving average (you can even optimize that window). This gives the LLM a market-wide context that is especially useful during broad corrections where single-stock features lag.

Volume z-score. Add today’s volume relative to its 20-day average as a state dimension. Unusually low volume often precedes a reversal; unusually high volume confirms momentum.

Position Sizing

Kelly-fraction hints. Compute a fractional Kelly size per state from the training statistics and pass it to the LLM as a sizing hint. Use 25% of full Kelly to avoid overbetting on noisy estimates, and verify it improves OOS Sharpe before keeping it.

Regime-conditional leverage cap. During VOL_HIGH states, tighten MAX_LEVERAGE to 0.6 regardless of the vol-targeting output. This prevents rebuilding a full position into a still-stressed market after a guardrail re-entry.

Continuous size output. Ask the LLM to output a continuous exposure in [0, 1] rather than three discrete levels. This produces finer-grained position modulation and smoother equity curves, at the cost of tighter output validation.

Guardrail Robustness

Relative drawdown limit. Replace the absolute DD stop with one that compares strategy equity to buy-and-hold over the same window. This prevents false exits during bull markets where an absolute drawdown is still a relative outperformance.

Noise-filtered vol stop. Require three consecutive days of elevated volatility before the stop fires, rather than reacting to a single day. This removes most spurious triggers from one- or two-day spikes that quickly revert.

Separate cooldowns. Give the vol stop and the drawdown stop their own cooldown counters (for example, 3 days and 15 days respectively). Vol spikes normalise quickly; drawdowns can persist for weeks.

Guardrail optimisation frequency. Test annual, semi-annual, and quarterly reoptimisation cadences to find the right balance between adaptability and overfitting. For slow-moving regimes, annual reoptimisation may be more robust than the current monthly default.

LLM Prompting

Comparative ranking prompt. Ask the LLM to rank all states by risk/reward and assign exposure by rank tier (top 50% get full, bottom 25% get FLAT). This prevents the degenerate case where the model assigns LONG to every state because each looks acceptable in isolation.

Regime classification first. Ask the LLM to classify the overall regime (BULL, CHOP, or STRESS) before setting per-state exposure. This anchors all decisions to a coherent macro view and produces a loggable regime label you can plot over the OOS period.

Multi-model comparison. Run the same prompt through a second model (GPT-4o or Claude) and compare OOS policies. Systematic agreement between models would be stronger evidence of signal quality than results from a single model.

Evaluation

Rolling Sharpe chart. Plot the rolling one-month Sharpe over the OOS period. A single aggregate Sharpe number hides whether the edge is persistent or concentrated in one lucky window.

Vol-targeted benchmark. Add a vol-targeted buy-and-hold (always long, same vol-scaling as the strategy) to the metrics table. If the LLM strategy cannot beat this benchmark, the vol-targeting module is doing all the work.

Transaction cost sensitivity sweep. Report results at 0.5, 1.0, and 2.0 bps total cost. A strategy that only works at best-case IBKR pricing is fragile; one that holds up at 2.0 bps has a more durable edge.

Q: Why use an LLM for this at all? Couldn't a simple rule do the same thing?

Yes, and we built exactly that comparison into the notebook (POLICY_MODE='rule'). Run both and compare the metrics. If the rule beats the LLM, the LLM is not adding value and you should drop it. The point of including both is intellectual honesty: the LLM should only stay in the strategy if it demonstrably improves something. In our tests, the LLM’s benefit was in smoother decision-making across borderline states: cases where the statistics are ambiguous and a rule-based threshold would flip noisily.

Q: What is the guardrail deadlock and why does it matter?

When a drawdown stop fires and the position goes to zero, equity freezes. A frozen equity means the drawdown never mathematically improves. So the drawdown check fires again the next day, and the next, locking the strategy flat indefinitely. The MAX_FLAT_DAYS re-entry counter fixes this by forcing a re-entry after MAX_FLAT_DAYS consecutive flat days, regardless of the drawdown level.

Q: Can I use this on other stocks?

Yes, with two adjustments. First, change SYMBOL and re-download data. Second, the LLM policy cache is keyed by symbol, so you will need fresh API calls for the new ticker. The architecture works for any liquid single stock or ETF with a long price history. For assets with less history than AAPL, reduce TRAIN_YEARS to 2, or the state statistics will be too sparse.

Q: How much does the DeepSeek API cost to run this?

The walk-forward loop makes roughly 40 API calls (one per OOS month) with max_tokens=2000 each. DeepSeek-chat is among the most cost-efficient frontier models: the full OOS run costs under $0.10 at current pricing. The policy cache means re-runs are free once the cache is warm.

Q: What is the next step to improve performance?

The single highest-impact improvement is better state features. The current three-feature state space (trend, vol, z-score) is coarse. Try using ta-lib technical indicators to improve the signal quality meaningfully.

What we built here is a framework for thinking about LLM-assisted risk management, not a magic alpha-generating machine. The honest summary is this: the LLM earns its place in this strategy not by predicting the market, but by making nuanced distinctions between market states that a simple rule would treat with a blunt threshold.

The architecture has three layers, and each does something distinct. Feature engineering and state bucketing translate raw price action into something a language model can reason about. The LLM prompt, framed as a risk manager rather than a forecaster, sets a monthly exposure policy across those states. And the hard guardrails, with their re-entry mechanism, ensure that no single bad patch of markets can permanently sideline the strategy.

The results show a strategy that participates meaningfully in AAPL’s upside while cutting maximum drawdown by 45% (from -33% to -18%) versus buy-and-hold. That is a legitimate and useful property for a real portfolio. Is it the final word? Absolutely not. The state features are coarse, the LLM prompt is version 1, and the guardrail thresholds could be smarter. But the framework is sound, the walk-forward methodology is honest, and every design decision is traceable.

The best way to learn from this is to run it, break it, and rebuild it. Change the prompt. Add a momentum feature. Swap AAPL for SPY. See what happens. The notebook is built to be modified: everything flows from the settings cell at the top.

To explore the basics of Quant Trading, check our Learning Track: Quantitative Trading for Beginners.

For LLM usage in trading, explore the Trading Using LLM: Concepts and Strategies track, which provides practical hands-on insights into implementing LLM models for trading.

If you’re a serious learner, you can take the Executive Programme in Algorithmic Trading (EPAT), which covers statistical modeling, machine learning, and advanced trading strategies with Python.

[Chan, 2024] E. P. Chan, “Machine Learning in Trading,” YouTube, 2024. Available: https://www.youtube.com/watch?v=VzF-tvz3DAk&t=411s

Note:

The strategy idea originated from the author

The blog content was created with the assistance of an AI large language model and

The blog content was curated/edited by the author.

Disclaimer: This blog is for educational and illustrative purposes only. Trading in financial markets involves substantial risk of loss. The code and concepts discussed here are not financial advice. Always exercise caution and thoroughly understand any automated trading system before deploying it in a live environment.