believe v133: FUSION-DUAL on E-mini S&P 500 Futures
A three-head ensemble combining an order-flow sequence model, a gradient-boosted 5-minute bar model, and an MBP-10 microstructure classifier, validated under purged K-fold with embargo and deployed through marketable-limit OCA brackets on CME.
Abstract
We present believe v133, a production quantitative strategy on E-mini
S&P 500 futures (ES) traded on CME through Interactive Brokers. The architecture is
a three-head ensemble we call FUSION-DUAL: a tick-level order-flow
sequence model (Tick ML), a gradient-boosted tree model on 52 five-minute bar features
(XGB 5m), and a LightGBM classifier on 17 MBP-10 order-book features
(F2_dom_v133). Each head runs at qty=1 with its own
marketable-limit entry and exchange-side OCA bracket.
F2_dom_v133 achieves AUC 0.826 ± 0.015 under purged K-fold cross-validation with a 10-minute embargo, over 1.45 M triple-barrier-labelled samples drawn from 24.3 M MBP-10 snapshots between 2026-01-27 and 2026-04-15. Against 78 days of real live-captured tick data, the full stack achieves 70.6% hit rate and profit factor 1.35 with F2_dom enabled, versus 67.7% hit rate and near-breakeven profit factor with F2_dom disabled.
Thesis. The edge is not in the forecast. It is in the combination of an honest label, a non-leaky validation protocol, a microstructure view of the order book, and an execution path that refuses to pay the 1-tick market-order tax. Strip any one of those out and the apparent edge evaporates.
Data & Coverage
The training store contains two years of ES tick history (2024-04 to
2026-04) and three months of MBP-10 order-book snapshots
(2026-01-27 to 2026-04-15). The tick history is layered by
fidelity:
| Range | Source | Level | Primary use |
|---|---|---|---|
| 2024-04 → 2026-02 | Databento OHLCV-1s | Synthetic 1s bars | Long-horizon feature support |
| 2026-02 → 2026-04 | Databento MBP-10 | Real trades + quotes | Short-horizon training |
| 2026-01-28 → live | Sierra Chart capture | Native tick feed | Deterministic replay fold |
Short-horizon models are trained exclusively on real ticks; long-horizon features (e.g.
20-day realised volatility, 6-month regime baselines) may span the synthetic range but
are masked across contract rolls and the daily CME maintenance window
(17:00-18:00 ET).
The MBP-10 store draws from three independent sources — Databento historical, IB live depth from our production VM, and an external L2 feed — reconciled daily at matched timestamps. Discrepancies above tolerance are flagged before any training consumes the data.
The training store uses content-addressed Parquet shards keyed by
{instrument}/{feed}/{date}/{shard}.parquet. Every model card records the
exact shard hashes it consumed, so each published model can be reproduced byte-for-byte
from the hash list and the training recipe.
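The reproducibility check this implies can be sketched in a few lines. The function and variable names below are illustrative, not the production tooling; the only assumption is a SHA-256 content address over raw shard bytes.

```python
import hashlib


def shard_digest(payload: bytes) -> str:
    """Content address for a shard: SHA-256 over the raw Parquet bytes."""
    return hashlib.sha256(payload).hexdigest()


def missing_shards(card_hashes, store):
    """Return the card hashes absent from the store. An empty list means
    the model can be rebuilt byte-for-byte from the recorded shards."""
    return [h for h in card_hashes if h not in store]


# toy store keyed by digest, as a stand-in for the Parquet tree
store = {shard_digest(b): b for b in (b"shard-a", b"shard-b")}
card = [shard_digest(b"shard-a"), shard_digest(b"shard-b")]
assert missing_shards(card, store) == []
```

Because the address is derived from the bytes themselves, any silent mutation of a shard changes its digest and immediately fails the model-card check.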
Triple-Barrier Labelling with Embargo
We avoid next-bar-direction labels, which are the single most common source of phantom edge in retail ML-trading literature. A next-bar label bears almost no relationship to what the deployed strategy actually experiences: a take-profit, a stop-loss, or a time-based exit.
For every candidate entry at time t we place three barriers:
- an upper barrier at entry + h_tp ticks (take-profit),
- a lower barrier at entry − h_sl ticks (stop-loss),
- a vertical barrier at t + Δt (time-out).
The label is the first barrier touched: a López de Prado triple-barrier label. Production choices for v133 are h_tp = 12 ticks, h_sl = 8 ticks, and Δt on the order of a few minutes for the short-horizon heads, the same geometry the live execution layer applies.
The embargo is the pair of time windows around each test fold during which training samples are dropped. We use a 10-minute embargo; samples whose label window overlaps either the test fold or the embargo band are purged entirely. The purpose is to remove information leakage through autocorrelated features and overlapping labels, which would otherwise inflate validation AUC by 5-15 points in our own A/B measurements.
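A minimal sketch of the labelling rule, using the v133 barrier geometry (h_tp = 12, h_sl = 8, ES tick = 0.25). Expressing the vertical barrier in bars rather than wall-clock Δt is our simplification, and the helper name is ours, not the production code:

```python
import numpy as np


def triple_barrier_label(prices, t, h_tp=12, h_sl=8, max_bars=60, tick=0.25):
    """First-barrier-touched label: +1 upper (take-profit), -1 lower
    (stop-loss), 0 vertical (time-out). h_tp/h_sl are in ticks; the
    vertical barrier is expressed here in bars as a stand-in for dt."""
    entry = prices[t]
    upper = entry + h_tp * tick
    lower = entry - h_sl * tick
    for p in prices[t + 1 : t + 1 + max_bars]:
        if p >= upper:
            return +1
        if p <= lower:
            return -1
    return 0


path = np.array([100.00, 100.50, 101.75, 103.00])
assert triple_barrier_label(path, 0) == +1  # +12 ticks touched first
```

Note that the label window extends max_bars past t; it is exactly this overlap between consecutive label windows that the purge-and-embargo step exists to neutralise.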
Feature Engineering
The three heads work from different representations of the tape. We describe each head’s feature space separately, because combining them into a single “feature set” would hide the fact that the heads are deliberately uncorrelated in their inputs.
4.1 Tick ML features (order flow)
Signed trade size, local order-flow imbalance, VPIN
(Volume-synchronised PIN), Kyle’s λ (price impact per unit volume), microprice
drift, realised-volatility terms on decaying windows, time-of-day encoding. These are
streamed per tick; the sequence model consumes them as an ordered vector of recent
ticks.
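Two of these streamed features reduce to one-liners; the normalisation choices below are illustrative, not the production definitions:

```python
def microprice(bid_px, bid_sz, ask_px, ask_sz):
    """Size-weighted mid-price: leans toward the heavier side of the book,
    which is what makes its drift usable as a short-horizon signal."""
    return (bid_px * ask_sz + ask_px * bid_sz) / (bid_sz + ask_sz)


def order_flow_imbalance(signed_sizes):
    """Net signed trade volume over a window, normalised to [-1, 1]:
    buyer-initiated prints positive, seller-initiated negative."""
    gross = sum(abs(s) for s in signed_sizes)
    return sum(signed_sizes) / max(gross, 1)


# a 3x heavier ask pulls the microprice below the plain mid (4999.875)
assert abs(microprice(4999.75, 10, 5000.00, 30) - 4999.8125) < 1e-9
```

The sequence model consumes vectors of such values per tick, ordered in time, rather than any bar aggregation.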
4.2 XGB 5m features (bar level)
52 features computed on each closed 5-minute OHLCV bar, grouped into five families:
| Family | Count | Examples |
|---|---|---|
| Trend strength | 12 | Signed returns across multiple lookbacks; rolling slope; breakout strength; directional persistence measures |
| Range dynamics | 10 | True range, range-of-ranges, high-low expansion, inside-bar context, compressed-range flags |
| Relative position | 10 | Position in the day, position in the session, distance to prior-day high/low, opening-range anchors |
| Session context | 10 | Minute-of-RTH, ETH indicator, pre-/post-lunch masks, economic-release proximity flags |
| Volatility regime | 10 | Realised vol at multiple horizons, vol-of-vol, dispersion around the mean, regime indicator encodings |
4.3 F2_dom features (MBP-10)
Seventeen features computed directly from the live 10-level order book. No lagged prices, no technical indicators on price, no feature that could double-count information already in the tape:
| Feature | Description |
|---|---|
| book_imb | Aggregate L1-L10 bid size minus ask size, normalised |
| tob_ratio | Top-of-book size ratio (L1 bid / L1 ask) |
| top3_imb | Imbalance across the top three price levels only |
| bid_grad_1..10 | Size gradient down the bid stack |
| ask_grad_1..10 | Size gradient up the ask stack |
| imb_std | Rolling 30-second std of book_imb |
| mid_mom | Microprice drift over the last N book updates |
| spread_ticks | Bid-ask spread in integer ticks |
| depth_ratio | Ratio of L1 size to L2-L10 aggregate size |
| queue_age | Time since last L1 price-level touch |
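Several of the tabled features reduce to a few lines over a single snapshot. The sketch below computes five of them; the exact normalisations in production may differ:

```python
def book_features(bid_px, bid_sz, ask_px, ask_sz, tick=0.25):
    """A handful of F2_dom-style features from one MBP-10 snapshot.
    bid_sz/ask_sz are sizes at levels L1..L10, best level first.
    Normalisations are illustrative, not the production definitions."""
    tb, ta = sum(bid_sz), sum(ask_sz)
    return {
        "book_imb": (tb - ta) / (tb + ta),            # aggregate L1-L10
        "tob_ratio": bid_sz[0] / ask_sz[0],           # L1 bid / L1 ask
        "top3_imb": (sum(bid_sz[:3]) - sum(ask_sz[:3]))
                    / (sum(bid_sz[:3]) + sum(ask_sz[:3])),
        "spread_ticks": round((ask_px[0] - bid_px[0]) / tick),
        "depth_ratio": (bid_sz[0] + ask_sz[0])        # L1 vs rest of stack
                       / max(sum(bid_sz[1:]) + sum(ask_sz[1:]), 1),
    }


bids = [5000.00 - 0.25 * i for i in range(10)]
asks = [5000.25 + 0.25 * i for i in range(10)]
f = book_features(bids, [20] + [10] * 9, asks, [10] * 10)
assert f["tob_ratio"] == 2.0 and f["spread_ticks"] == 1
```

Every input is observable in the live book at decision time, which is what keeps this head free of double-counted tape information.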
Model Architectures
Order-flow sequence model (Tick ML)
Rob’s LSTM-style network: a compact sequence model that ingests a rolling window of order-flow features and emits a short-horizon directional probability. Trained on triple-barrier labels with a 10-minute embargo. Horizon: seconds. Features: ~12. Labels: triple-barrier. Calibration: isotonic.
Bar-level gradient boosting (XGB 5m)
John’s XGBoost (not an LSTM): gradient-boosted trees on 52 bar-level features per closed 5-minute OHLCV bar. Walk-forward out-of-sample validation with a one-month held-out test fold and a 1-week embargo around training. Horizon: 5-minute bars. Features: 52. Algorithm: XGBoost. Retrain: monthly.
MBP-10 microstructure classifier (F2_dom)
LightGBM on the 17 MBP-10 book features above: a binary-direction classifier trained on 1.45 M triple-barrier-labelled samples. AUC 0.826 ± 0.015 under purged K-fold with a 10-minute embargo. Horizon: sub-minute. Features: 17. Algorithm: LightGBM. Retrain: weekly.
All three heads are calibrated out of fold using isotonic regression fit on the held-out predictions. A raw classifier score is not a probability; isotonic calibration gives the downstream gating logic a meaningful number to threshold.
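In production a library routine would do this fit; the self-contained pool-adjacent-violators sketch below shows the underlying idea: map raw out-of-fold scores to fitted probabilities that are non-decreasing in the score.

```python
import numpy as np


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators sketch of isotonic calibration. Fitted
    values are non-decreasing in the raw score; fit on held-out
    (out-of-fold) predictions only, never on training scores."""
    order = np.argsort(scores)
    x = np.asarray(scores)[order]
    vals = list(np.asarray(labels, float)[order])
    wts = [1.0] * len(vals)
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:          # violation: pool the two blocks
            w = wts[i] + wts[i + 1]
            vals[i] = (vals[i] * wts[i] + vals[i + 1] * wts[i + 1]) / w
            wts[i] = w
            del vals[i + 1], wts[i + 1]
            i = max(i - 1, 0)              # a merge can expose a new violation
        else:
            i += 1
    fitted = np.repeat(vals, [int(w) for w in wts])
    return x, fitted


x, fitted = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
assert list(fitted) == [0.0, 0.5, 0.5, 1.0]
```

The pooled value over each block is the empirical hit rate at that score range, which is exactly the "meaningful number" the gating logic thresholds.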
FUSION-DUAL: how the heads combine
FUSION-DUAL is not a meta-model that scores the heads. We considered and rejected that design: a god-model over three already-noisy signals is an obvious overfitting magnet, and its calibration window shrinks to almost nothing.
Instead, each head runs a simple, independent gate and emits its own signal to the execution layer:
- Each head fires at qty=1 through its own marketable-limit entry.
- Each head attaches its own exchange-side OCA bracket at order submission.
- Heads do not veto each other; two heads firing the same direction simply result in two independent qty=1 brackets.
- The ensemble effect is in the joint distribution of their decisions across days, not in a central scoring function.
Why this beats a stacked meta-model. A stacked model needs a large, clean, non-leaky out-of-fold prediction set to learn over. The only clean OOF set we have is also our live fold, which we refuse to burn. Independent heads avoid the problem.
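The independent-gate design can be sketched as a pure function per head. The thresholds below are illustrative placeholders, not the production values:

```python
def head_signal(p_up, long_thr=0.62, short_thr=0.38):
    """Per-head gate on the calibrated probability. Each head decides
    alone; thresholds here are illustrative, not production settings."""
    if p_up >= long_thr:
        return +1    # submit one qty=1 long with its own OCA bracket
    if p_up <= short_thr:
        return -1    # submit one qty=1 short with its own OCA bracket
    return 0         # stand aside


# no vetoes: agreeing heads simply stack independent qty=1 brackets
probs = {"tick_ml": 0.70, "xgb_5m": 0.55, "f2_dom": 0.66}
orders = [(h, head_signal(p)) for h, p in probs.items() if head_signal(p)]
assert orders == [("tick_ml", 1), ("f2_dom", 1)]
```

No head's output feeds another head's input, so there is no stacked model to overfit and no shared calibration window to burn.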
Execution Model
Execution is a first-class citizen, not an afterthought. The believe execution layer follows three rules:
- Marketable limit orders, not market orders. The limit is placed at the far touch plus a small cushion. When the book does not support the fill, the order does not fire and the trade is skipped.
- Exchange-side OCA brackets. Every entry is submitted with take-profit and stop-loss legs as a single One-Cancels-All group matched by the exchange. The client never relies on its own polling to exit a position.
- BE+2 lock on MFE ≥ 4 ticks. Once maximum favourable excursion reaches four ticks, the stop is amended to entry plus two ticks. Mechanical, non-adaptive.
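The BE+2 rule is mechanical enough to state as a few lines. The helper name is ours; the constants (4-tick trigger, 2-tick lock, ES tick = 0.25) come from the text:

```python
def amend_stop(entry, stop, mfe_ticks, side=+1, tick=0.25,
               lock_trigger=4, lock_offset=2):
    """BE+2 rule: once maximum favourable excursion reaches 4 ticks,
    amend the stop to entry +2 ticks (entry -2 for shorts). One-way:
    the amendment only ever tightens, never loosens, a stop."""
    if mfe_ticks < lock_trigger:
        return stop
    locked = entry + side * lock_offset * tick
    return max(stop, locked) if side > 0 else min(stop, locked)


# long from 5000.00, initial stop 8 ticks below at 4998.00
assert amend_stop(5000.00, 4998.00, mfe_ticks=3) == 4998.00  # not yet
assert amend_stop(5000.00, 4998.00, mfe_ticks=4) == 5000.50  # BE+2 lock
```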
Production sizing is qty=1 per head. Pyramid tiers exist in the codebase
and have been validated in simulation, but are disabled in v133 while the MBP-10 regime
is still being characterised against live fills.
The backtester prices the execution layer honestly: commission and exchange fees on every simulated fill, a 1-tick penalty on any market-order fallback, and random 1-2 tick slippage on stop fills. Marketable-limit entries pay zero entry slippage when the book supports the fill. This is important — it is the single most decisive choice separating a strategy that ships from one that only works in a spreadsheet.
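The cost model above can be sketched per round trip. The ES tick value of $12.50 is standard; the commission figure and function shape are our illustrative assumptions, not the production fee schedule:

```python
import random


def fill_pnl(raw_ticks, stop_exit, market_fallback,
             tick_value=12.50, commission=2.50, rng=random.Random(7)):
    """Honest-cost sketch: commission on every fill, a 1-tick penalty
    on market-order fallbacks, random 1-2 tick slippage on stop fills.
    Marketable-limit entries the book supports pay no entry slippage."""
    ticks = raw_ticks
    if market_fallback:
        ticks -= 1                  # the 1-tick market-order tax
    if stop_exit:
        ticks -= rng.randint(1, 2)  # stop-fill slippage
    return ticks * tick_value - 2 * commission  # entry + exit commission


# clean marketable-limit round trip at +12 ticks
assert fill_pnl(12, stop_exit=False, market_fallback=False) == 145.0
```

Pricing every fallback and stop fill this way is what keeps backtest profit factors comparable to live ones.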
Validation Protocol
The validation pipeline is the same for all three heads, parameterised only by retrain cadence. The common stack is:
- Purged K-fold (K=5) with time-ordered splits; no shuffling.
- 10-minute embargo on either side of each test fold.
- Triple-barrier labels with matching h_tp, h_sl, Δt for training and backtest.
- Isotonic calibration fit on the out-of-fold predictions.
- Classification metric: AUC with fold-to-fold standard deviation reported.
- Strategy metrics: hit rate and profit factor on the full backtest, commission and slippage priced.
- Daily backtest-vs-live parity job replays each live session against the captured tick stream and reconciles fills.
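The purge-and-embargo split can be sketched as a generator. The function name and the bar-time representation are ours; the 10-minute (600 s) embargo matches the text:

```python
import numpy as np


def purged_kfold(ts, label_end, n_splits=5, embargo_s=600):
    """Time-ordered K-fold with purging and a symmetric embargo.
    ts[i] is the sample time; label_end[i] is when its label window
    closes. A sample trains only if its whole label window clears the
    test fold plus the embargo band on either side."""
    ts, label_end = np.asarray(ts), np.asarray(label_end)
    order = np.argsort(ts)                    # time-ordered, never shuffled
    for test in np.array_split(order, n_splits):
        lo = ts[test].min() - embargo_s
        hi = ts[test].max() + embargo_s
        keep = (label_end < lo) | (ts > hi)   # window fully outside band
        yield np.where(keep)[0], test
```

Samples whose label window merely touches the embargo band are dropped entirely, which is the purge step that removes leakage through overlapping labels.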
Each retrain generates a model card recording training window, out-of-fold AUC with confidence interval, calibration diagnostics, feature-importance snapshot (top 16 gain-ranked), and the exact Parquet shard hashes consumed. A model cannot reach production without a model card.
Statistical Results
9.1 F2_dom_v133 classification
| Fold | Samples | AUC | Notes |
|---|---|---|---|
| 1 | ~290k | 0.841 | High-spread volatility window |
| 2 | ~290k | 0.819 | Quiet regime |
| 3 | ~290k | 0.832 | Mixed regime |
| 4 | ~290k | 0.821 | Event-heavy (CPI, NFP) |
| 5 | ~290k | 0.817 | Most recent, closest to live |
| Mean ± SD | ~1.45M | 0.826 ± 0.015 | Purged K=5, 10-min embargo |
9.2 Triple-barrier label distribution
| Outcome | Share | Interpretation |
|---|---|---|
| Upper barrier (+12t) | ~41% | Take-profit realised |
| Lower barrier (-8t) | ~44% | Stop-loss realised |
| Vertical barrier | ~15% | Time-out; neither touched |
9.3 Full-stack F2_dom ablation (retained from v131)
| Configuration | Hit rate | Profit factor | Note |
|---|---|---|---|
| Stack, F2_dom disabled | 67.7% | ~1.02 | Near-breakeven pre-commission |
| Stack, F2_dom enabled | 70.6% | 1.35 | Walk-forward consistent |
9.4 Adaptive regime gate — 79-day RTH comparison (v133)
Full believe stack, RTH only, 2026-01-29 through 2026-04-17 (79 trading days). Same tape, same models, same order logic. The only difference between the two rows is whether the v133 adaptive regime gate is on or off.
| Configuration | Trades | Win rate | Profit factor | Net (BT $) | Max DD (BT $) |
|---|---|---|---|---|---|
| No gate (v131 behaviour) | 9,361 | 75.9% | 1.63 | +131,198 | 11,372 |
| Adaptive gate on (v133) | 9,136 | 76.1% | 1.66 | +134,164 | 7,938 |
The adaptive gate suppresses 225 trades (−2.4%) concentrated in
HIGH_VOL_CHOP and LOW_VOL_QUIET
after two-consecutive-losing-day states. Net BT P&L rises by $2,966; peak drawdown
falls by $3,434 (−30.2%). The design goal was a cleaner equity curve at matched
profit tier — the table is the proof.
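The gate's trigger condition can be sketched directly from the description above. The streak logic and the non-gated regime label are our assumptions beyond what the text states:

```python
def gate_blocks(regime, daily_pnl,
                blocked=("HIGH_VOL_CHOP", "LOW_VOL_QUIET")):
    """v133 adaptive-gate sketch: after two consecutive losing days,
    suppress entries in the two chop/quiet regimes. A non-losing day
    resets the streak. Details beyond the text are assumptions."""
    streak = len(daily_pnl) >= 2 and daily_pnl[-1] < 0 and daily_pnl[-2] < 0
    return streak and regime in blocked


assert gate_blocks("HIGH_VOL_CHOP", [-120.0, -80.0])      # blocked
assert not gate_blocks("HIGH_VOL_CHOP", [-120.0, +40.0])  # streak broken
assert not gate_blocks("TREND_UP", [-120.0, -80.0])       # hypothetical regime, not gated
```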
9.5 Determinism proof
Two independent runs of the v133 backtester were executed on the same captured tape,
same configuration, same seed scope. The backtester is deterministic across all signal
paths; the only non-deterministic element is the stop-fill model, which applies a
random 1-2 tick penalty at trigger time (--stop-slippage 2).
| Quantity | Run A | Run B | Delta |
|---|---|---|---|
| Trade count | identical | identical | 0 |
| Win rate | identical | identical | 0.0 pt |
| Commission | identical | identical | $0.00 |
| Net P&L | baseline | baseline ± $112 | 0.086% |
A $112 variance over $130,398 of net backtest P&L (0.086%) is entirely attributable to the stop-slippage RNG and bounds the noise floor of every comparison on the site. In other words, any BT delta smaller than ~0.1% is within stop-slippage noise and cannot be called a signal.
Deliberately not published on this page: daily P&L, cumulative equity curve of the live account, Sharpe ratios on sub-quarter windows, or single-fold “best” metrics. Those are either noise, cherry-picks, or both. The figures above are the honest answer to “what does this system do?”
Limits & Known Failure Modes
Every serious quant paper should include a section on what the system does badly. Ours includes:
- Microstructure decay. If the top three F2_dom features degrade together, exchange-level liquidity has reorganised and the model’s prior is stale.
- Overfitting to the 78-day capture fold. It is our highest-fidelity fold and also the easiest to overfit. We hold out the most recent week on principle.
- Event gaps. FOMC-style discrete shocks are regions of low coverage; the model is more confident than it should be in the minutes either side. Size is explicitly attenuated.
- Retrain-cadence drag. The weekly F2_dom retrain lags regime shifts by up to a week. Between retrains, the head can continue firing a stale prior; the monitoring stack is tuned to catch this within one session.
- Bridge single point of failure. One Python process on one Azure VM is the only thing that submits orders. We accept the concentration and compensate with a supervisor, a watchdog, and an explicit refuse-to-deploy policy during CME maintenance.
Current Version: believe v133
Shipped 2026-04-17. Relative to v131, v133 introduces the adaptive regime
gate (2-day rolling P&L auto-toggle that blocks HIGH_VOL_CHOP and
LOW_VOL_QUIET after two consecutive losing days), cuts the session mask down to
RTH + ETH_EUROPE only, and formalises the determinism proof
for the backtester (0.086% P&L variance, entirely from stop-slippage RNG). The bridge
also picks up a DOM-pipeline reliability fix (41a4b67) that
removes a silent-swallow exception path on live book subscription errors. Historical
lineage: v131 introduced the F2_dom microstructure head relative to v125.
| Component | Status in v133 | Next |
|---|---|---|
| F2_dom | AUC 0.826 ± 0.015 | Expand to 20+ features, roll feature-set audit |
| XGB 5m | Walk-forward retrain monthly | Explore gradient-boosting alternative libraries |
| Tick ML | Monthly retrain, isotonic calibrated | Add attention-layer variant as candidate head |
| Execution | Marketable limit + OCA bracket | Re-enable pyramid tiers after regime audit |
| Adaptive gate | New in v133. 2-day rolling P&L toggle, blocks HIGH_VOL_CHOP + LOW_VOL_QUIET on 2-loss streak | Per-regime half-size mode as an alternative to hard block |
| Session mask | RTH + ETH_EUROPE enabled. ETH_PRE / ETH_POST / ETH_ASIA disabled (negative BT P&L) | Re-audit quarterly against forward tape |
| Data corpus | Tick 2026-01-29 → 2026-04-17 (952k ticks). DOM 2026-01-27 → 2026-04-17 (25M snaps, Databento + IB live, $125.69 spend). Separated into dedicated LFS repos. | Continuous append on the rolling 6-month retrain window |
| Backtester | Deterministic replay. Two-run variance 0.086% from stop-slippage RNG | Faster parallel fold evaluation |
| DOM bridge | Silent-swallow bug fixed (41a4b67); live-DOM errors now visible + alarmed | Add structured exception codes per subscription path |
| Monitoring | log.bhf.capital live, feature-drift alarms, parity job | Publish read-only dashboard to investors |
This site is informational. Nothing on it is an offer, a solicitation, or investment advice. Past walk-forward and backtest results do not guarantee future live performance. Commission, exchange fees, slippage, and regime changes can materially affect results.