believe v133: FUSION-DUAL on E-mini S&P 500 Futures
A three-head ensemble combining an order-flow sequence model, a gradient-boosted 5-minute bar model, and an MBP-10 microstructure classifier, validated under purged K-fold with embargo and deployed through marketable-limit OCA brackets on CME.
Abstract
We present believe v133, a production quantitative strategy on E-mini
S&P 500 futures (ES) traded on CME through Interactive Brokers. The architecture is
a three-head ensemble we call FUSION-DUAL: a tick-level order-flow
sequence model (Tick ML), a gradient-boosted tree model on 52 five-minute bar features
(XGB 5m), and a LightGBM classifier on 17 MBP-10 order-book features
(F2_dom_v133). Each head runs at qty=1 with its own
marketable-limit entry and exchange-side OCA bracket.
F2_dom_v133 achieves AUC 0.826 ± 0.015 under purged K-fold cross-validation with a 10-minute embargo, over 1.45 M triple-barrier-labelled samples drawn from 24.3 M MBP-10 snapshots between 2026-01-27 and 2026-04-15. Against 78 days of real live-captured tick data, the full stack achieves 70.6% hit rate and profit factor 1.35 with F2_dom enabled, versus 67.7% hit rate and near-breakeven profit factor with F2_dom disabled.
Thesis. The edge is not in the forecast. It is in the combination of an honest label, a non-leaky validation protocol, a microstructure view of the order book, and an execution path that refuses to pay the 1-tick market-order tax. Strip any one of those out and the apparent edge evaporates.
Data & Coverage
The training store contains two years of ES tick history (2024-04 to
2026-04) and three months of MBP-10 order-book snapshots
(2026-01-27 to 2026-04-15). The tick history is layered by
fidelity:
| Range | Source | Level | Primary use |
|---|---|---|---|
| 2024-04 → 2026-02 | Databento OHLCV-1s | Synthetic 1s bars | Long-horizon feature support |
| 2026-02 → 2026-04 | Databento MBP-10 | Real trades + quotes | Short-horizon training |
| 2026-01-28 → live | Sierra Chart capture | Native tick feed | Deterministic replay fold |
Short-horizon models are trained exclusively on real ticks; long-horizon features (e.g.
20-day realised volatility, 6-month regime baselines) may span the synthetic range but
are masked across contract rolls and the daily CME maintenance window
(17:00-18:00 ET).
The MBP-10 store draws from three independent sources — Databento historical, IB live depth from our production VM, and an external L2 feed — reconciled daily at matched timestamps. Discrepancies above tolerance are flagged before any training consumes the data.
The training store uses content-addressed Parquet shards keyed by
{instrument}/{feed}/{date}/{shard}.parquet. Every model card records the
exact shard hashes it consumed, so each published model can be reproduced byte-for-byte
from the hash list and the training recipe.
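The reproducibility check this implies can be sketched in a few lines. The function and variable names below are illustrative, not the production tooling; the only assumption is a SHA-256 content address over raw shard bytes.

```python
import hashlib


def shard_digest(payload: bytes) -> str:
    """Content address for a shard: SHA-256 over the raw Parquet bytes."""
    return hashlib.sha256(payload).hexdigest()


def missing_shards(card_hashes, store):
    """Return the card hashes absent from the store. An empty list means
    the model can be rebuilt byte-for-byte from the recorded shards."""
    return [h for h in card_hashes if h not in store]


# toy store keyed by digest, as a stand-in for the Parquet tree
store = {shard_digest(b): b for b in (b"shard-a", b"shard-b")}
card = [shard_digest(b"shard-a"), shard_digest(b"shard-b")]
assert missing_shards(card, store) == []
```

Because the address is derived from the bytes themselves, any silent mutation of a shard changes its digest and immediately fails the model-card check.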
Triple-Barrier Labelling with Embargo
We avoid next-bar-direction labels, which are the single most common source of phantom edge in retail ML-trading literature. A next-bar label bears almost no relationship to what the deployed strategy actually experiences: a take-profit, a stop-loss, or a time-based exit.
For every candidate entry at time t we place three barriers:
- an upper barrier at entry + h_tp ticks (take-profit),
- a lower barrier at entry − h_sl ticks (stop-loss),
- a vertical barrier at t + Δt (time-out).
The label is the first barrier touched: a López de Prado triple-barrier label. Production choices for v133 are h_tp = 12 ticks, h_sl = 8 ticks, and Δt on the order of a few minutes for the short-horizon heads, the same geometry the live execution layer applies.
The embargo is the pair of time windows around each test fold during which training samples are dropped. We use a 10-minute embargo; samples whose label window overlaps either the test fold or the embargo band are purged entirely. The purpose is to remove information leakage through autocorrelated features and overlapping labels, which would otherwise inflate validation AUC by 5-15 points in our own A/B measurements.
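A minimal sketch of the labelling rule, using the v133 barrier geometry (h_tp = 12, h_sl = 8, ES tick = 0.25). Expressing the vertical barrier in bars rather than wall-clock Δt is our simplification, and the helper name is ours, not the production code:

```python
import numpy as np


def triple_barrier_label(prices, t, h_tp=12, h_sl=8, max_bars=60, tick=0.25):
    """First-barrier-touched label: +1 upper (take-profit), -1 lower
    (stop-loss), 0 vertical (time-out). h_tp/h_sl are in ticks; the
    vertical barrier is expressed here in bars as a stand-in for dt."""
    entry = prices[t]
    upper = entry + h_tp * tick
    lower = entry - h_sl * tick
    for p in prices[t + 1 : t + 1 + max_bars]:
        if p >= upper:
            return +1
        if p <= lower:
            return -1
    return 0


path = np.array([100.00, 100.50, 101.75, 103.00])
assert triple_barrier_label(path, 0) == +1  # +12 ticks touched first
```

Note that the label window extends max_bars past t; it is exactly this overlap between consecutive label windows that the purge-and-embargo step exists to neutralise.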
Feature Engineering
The three heads work from different representations of the tape. We describe each head’s feature space separately, because combining them into a single “feature set” would hide the fact that the heads are deliberately uncorrelated in their inputs.
4.1 Tick ML features (order flow)
Signed trade size, local order-flow imbalance, VPIN
(Volume-synchronised PIN), Kyle’s λ (price impact per unit volume), microprice
drift, realised-volatility terms on decaying windows, time-of-day encoding. These are
streamed per tick; the sequence model consumes them as an ordered vector of recent
ticks.
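Two of these streamed features reduce to one-liners; the normalisation choices below are illustrative, not the production definitions:

```python
def microprice(bid_px, bid_sz, ask_px, ask_sz):
    """Size-weighted mid-price: leans toward the heavier side of the book,
    which is what makes its drift usable as a short-horizon signal."""
    return (bid_px * ask_sz + ask_px * bid_sz) / (bid_sz + ask_sz)


def order_flow_imbalance(signed_sizes):
    """Net signed trade volume over a window, normalised to [-1, 1]:
    buyer-initiated prints positive, seller-initiated negative."""
    gross = sum(abs(s) for s in signed_sizes)
    return sum(signed_sizes) / max(gross, 1)


# a 3x heavier ask pulls the microprice below the plain mid (4999.875)
assert abs(microprice(4999.75, 10, 5000.00, 30) - 4999.8125) < 1e-9
```

The sequence model consumes vectors of such values per tick, ordered in time, rather than any bar aggregation.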
4.2 XGB 5m features (bar level)
52 features computed on each closed 5-minute OHLCV bar, grouped into five families:
| Family | Count | Examples |
|---|---|---|
| Trend strength | 12 | Signed returns across multiple lookbacks; rolling slope; breakout strength; directional persistence measures |
| Range dynamics | 10 | True range, range-of-ranges, high-low expansion, inside-bar context, compressed-range flags |
| Relative position | 10 | Position in the day, position in the session, distance to prior-day high/low, opening-range anchors |
| Session context | 10 | Minute-of-RTH, ETH indicator, pre-/post-lunch masks, economic-release proximity flags |
| Volatility regime | 10 | Realised vol at multiple horizons, vol-of-vol, dispersion around the mean, regime indicator encodings |
4.3 F2_dom features (MBP-10)
Seventeen features computed directly from the live 10-level order book. No lagged prices, no technical indicators on price, no feature that could double-count information already in the tape:
| Feature | Description |
|---|---|
| book_imb | Aggregate L1-L10 bid size minus ask size, normalised |
| tob_ratio | Top-of-book size ratio (L1 bid / L1 ask) |
| top3_imb | Imbalance across the top three price levels only |
| bid_grad_1..10 | Size gradient down the bid stack |
| ask_grad_1..10 | Size gradient up the ask stack |
| imb_std | Rolling 30-second std of book_imb |
| mid_mom | Microprice drift over the last N book updates |
| spread_ticks | Bid-ask spread in integer ticks |
| depth_ratio | Ratio of L1 size to L2-L10 aggregate size |
| queue_age | Time since last L1 price-level touch |
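Several of the tabled features reduce to a few lines over a single snapshot. The sketch below computes five of them; the exact normalisations in production may differ:

```python
def book_features(bid_px, bid_sz, ask_px, ask_sz, tick=0.25):
    """A handful of F2_dom-style features from one MBP-10 snapshot.
    bid_sz/ask_sz are sizes at levels L1..L10, best level first.
    Normalisations are illustrative, not the production definitions."""
    tb, ta = sum(bid_sz), sum(ask_sz)
    return {
        "book_imb": (tb - ta) / (tb + ta),            # aggregate L1-L10
        "tob_ratio": bid_sz[0] / ask_sz[0],           # L1 bid / L1 ask
        "top3_imb": (sum(bid_sz[:3]) - sum(ask_sz[:3]))
                    / (sum(bid_sz[:3]) + sum(ask_sz[:3])),
        "spread_ticks": round((ask_px[0] - bid_px[0]) / tick),
        "depth_ratio": (bid_sz[0] + ask_sz[0])        # L1 vs rest of stack
                       / max(sum(bid_sz[1:]) + sum(ask_sz[1:]), 1),
    }


bids = [5000.00 - 0.25 * i for i in range(10)]
asks = [5000.25 + 0.25 * i for i in range(10)]
f = book_features(bids, [20] + [10] * 9, asks, [10] * 10)
assert f["tob_ratio"] == 2.0 and f["spread_ticks"] == 1
```

Every input is observable in the live book at decision time, which is what keeps this head free of double-counted tape information.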
Model Architectures
Order-flow sequence model (Tick ML)
Rob’s LSTM-style network: a compact sequence model that ingests a rolling window of order-flow features and emits a short-horizon directional probability. Trained on triple-barrier labels with a 10-minute embargo. Horizon: seconds. Features: ~12. Labels: triple-barrier. Calibration: isotonic.
Bar-level gradient boosting (XGB 5m)
John’s XGBoost (not an LSTM): gradient-boosted trees on 52 bar-level features per closed 5-minute OHLCV bar. Walk-forward out-of-sample validation with a one-month held-out test fold and a 1-week embargo around training. Horizon: 5-minute bars. Features: 52. Algorithm: XGBoost. Retrain: monthly.
MBP-10 microstructure classifier (F2_dom)
LightGBM on the 17 MBP-10 book features above: a binary-direction classifier trained on 1.45 M triple-barrier-labelled samples. AUC 0.826 ± 0.015 under purged K-fold with a 10-minute embargo. Horizon: sub-minute. Features: 17. Algorithm: LightGBM. Retrain: weekly.
All three heads are calibrated out of fold using isotonic regression fit on the held-out predictions. A raw classifier score is not a probability; isotonic calibration gives the downstream gating logic a meaningful number to threshold.
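In production a library routine would do this fit; the self-contained pool-adjacent-violators sketch below shows the underlying idea: map raw out-of-fold scores to fitted probabilities that are non-decreasing in the score.

```python
import numpy as np


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators sketch of isotonic calibration. Fitted
    values are non-decreasing in the raw score; fit on held-out
    (out-of-fold) predictions only, never on training scores."""
    order = np.argsort(scores)
    x = np.asarray(scores)[order]
    vals = list(np.asarray(labels, float)[order])
    wts = [1.0] * len(vals)
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:          # violation: pool the two blocks
            w = wts[i] + wts[i + 1]
            vals[i] = (vals[i] * wts[i] + vals[i + 1] * wts[i + 1]) / w
            wts[i] = w
            del vals[i + 1], wts[i + 1]
            i = max(i - 1, 0)              # a merge can expose a new violation
        else:
            i += 1
    fitted = np.repeat(vals, [int(w) for w in wts])
    return x, fitted


x, fitted = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
assert list(fitted) == [0.0, 0.5, 0.5, 1.0]
```

The pooled value over each block is the empirical hit rate at that score range, which is exactly the "meaningful number" the gating logic thresholds.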
FUSION-DUAL: how the heads combine
FUSION-DUAL is not a meta-model that scores the heads. We considered and rejected that design: a god-model over three already-noisy signals is an obvious overfitting magnet, and its calibration window shrinks to almost nothing.
Instead, each head runs a simple, independent gate and emits its own signal to the execution layer:
- Each head fires at qty=1 through its own marketable-limit entry.
- Each head attaches its own exchange-side OCA bracket at order submission.
- Heads do not veto each other; two heads firing the same direction simply result in two independent qty=1 brackets.
- The ensemble effect is in the joint distribution of their decisions across days, not in a central scoring function.
Why this beats a stacked meta-model. A stacked model needs a large, clean, non-leaky out-of-fold prediction set to learn over. The only clean OOF set we have is also our live fold, which we refuse to burn. Independent heads avoid the problem.
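The independent-gate design can be sketched as a pure function per head. The thresholds below are illustrative placeholders, not the production values:

```python
def head_signal(p_up, long_thr=0.62, short_thr=0.38):
    """Per-head gate on the calibrated probability. Each head decides
    alone; thresholds here are illustrative, not production settings."""
    if p_up >= long_thr:
        return +1    # submit one qty=1 long with its own OCA bracket
    if p_up <= short_thr:
        return -1    # submit one qty=1 short with its own OCA bracket
    return 0         # stand aside


# no vetoes: agreeing heads simply stack independent qty=1 brackets
probs = {"tick_ml": 0.70, "xgb_5m": 0.55, "f2_dom": 0.66}
orders = [(h, head_signal(p)) for h, p in probs.items() if head_signal(p)]
assert orders == [("tick_ml", 1), ("f2_dom", 1)]
```

No head's output feeds another head's input, so there is no stacked model to overfit and no shared calibration window to burn.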
Execution Model
Execution is a first-class citizen, not an afterthought. The believe execution layer follows three rules:
- Marketable limit orders, not market orders. The limit is placed at the far touch plus a small cushion. When the book does not support the fill, the order does not fire and the trade is skipped.
- Exchange-side OCA brackets. Every entry is submitted with take-profit and stop-loss legs as a single One-Cancels-All group matched by the exchange. The client never relies on its own polling to exit a position.
- BE+2 lock on MFE ≥ 4 ticks. Once maximum favourable excursion reaches four ticks, the stop is amended to entry plus two ticks. Mechanical, non-adaptive.
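The BE+2 rule is mechanical enough to state as a few lines. The helper name is ours; the constants (4-tick trigger, 2-tick lock, ES tick = 0.25) come from the text:

```python
def amend_stop(entry, stop, mfe_ticks, side=+1, tick=0.25,
               lock_trigger=4, lock_offset=2):
    """BE+2 rule: once maximum favourable excursion reaches 4 ticks,
    amend the stop to entry +2 ticks (entry -2 for shorts). One-way:
    the amendment only ever tightens, never loosens, a stop."""
    if mfe_ticks < lock_trigger:
        return stop
    locked = entry + side * lock_offset * tick
    return max(stop, locked) if side > 0 else min(stop, locked)


# long from 5000.00, initial stop 8 ticks below at 4998.00
assert amend_stop(5000.00, 4998.00, mfe_ticks=3) == 4998.00  # not yet
assert amend_stop(5000.00, 4998.00, mfe_ticks=4) == 5000.50  # BE+2 lock
```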
Production sizing is qty=1 per head. Pyramid tiers exist in the codebase
and have been validated in simulation, but are disabled in v133 while the MBP-10 regime
is still being characterised against live fills.
The backtester prices the execution layer honestly: commission and exchange fees on every simulated fill, a 1-tick penalty on any market-order fallback, and random 1-2 tick slippage on stop fills. Marketable-limit entries pay zero entry slippage when the book supports the fill. This is important — it is the single most decisive choice separating a strategy that ships from one that only works in a spreadsheet.
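The cost model above can be sketched per round trip. The ES tick value of $12.50 is standard; the commission figure and function shape are our illustrative assumptions, not the production fee schedule:

```python
import random


def fill_pnl(raw_ticks, stop_exit, market_fallback,
             tick_value=12.50, commission=2.50, rng=random.Random(7)):
    """Honest-cost sketch: commission on every fill, a 1-tick penalty
    on market-order fallbacks, random 1-2 tick slippage on stop fills.
    Marketable-limit entries the book supports pay no entry slippage."""
    ticks = raw_ticks
    if market_fallback:
        ticks -= 1                  # the 1-tick market-order tax
    if stop_exit:
        ticks -= rng.randint(1, 2)  # stop-fill slippage
    return ticks * tick_value - 2 * commission  # entry + exit commission


# clean marketable-limit round trip at +12 ticks
assert fill_pnl(12, stop_exit=False, market_fallback=False) == 145.0
```

Pricing every fallback and stop fill this way is what keeps backtest profit factors comparable to live ones.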
Validation Protocol
The validation pipeline is the same for all three heads, parameterised only by retrain cadence. The common stack is:
- Purged K-fold (K=5) with time-ordered splits; no shuffling.
- 10-minute embargo on either side of each test fold.
- Triple-barrier labels with matching h_tp, h_sl, Δt for training and backtest.
- Isotonic calibration fit on the out-of-fold predictions.
- Classification metric: AUC with fold-to-fold standard deviation reported.
- Strategy metrics: hit rate and profit factor on the full backtest, commission and slippage priced.
- Daily backtest-vs-live parity job replays each live session against the captured tick stream and reconciles fills.
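The purge-and-embargo split can be sketched as a generator. The function name and the bar-time representation are ours; the 10-minute (600 s) embargo matches the text:

```python
import numpy as np


def purged_kfold(ts, label_end, n_splits=5, embargo_s=600):
    """Time-ordered K-fold with purging and a symmetric embargo.
    ts[i] is the sample time; label_end[i] is when its label window
    closes. A sample trains only if its whole label window clears the
    test fold plus the embargo band on either side."""
    ts, label_end = np.asarray(ts), np.asarray(label_end)
    order = np.argsort(ts)                    # time-ordered, never shuffled
    for test in np.array_split(order, n_splits):
        lo = ts[test].min() - embargo_s
        hi = ts[test].max() + embargo_s
        keep = (label_end < lo) | (ts > hi)   # window fully outside band
        yield np.where(keep)[0], test
```

Samples whose label window merely touches the embargo band are dropped entirely, which is the purge step that removes leakage through overlapping labels.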
Each retrain generates a model card recording training window, out-of-fold AUC with confidence interval, calibration diagnostics, feature-importance snapshot (top 16 gain-ranked), and the exact Parquet shard hashes consumed. A model cannot reach production without a model card.
Statistical Results
9.1 F2_dom_v133 classification
| Fold | Samples | AUC | Notes |
|---|---|---|---|
| 1 | ~290k | 0.841 | High-spread volatility window |
| 2 | ~290k | 0.819 | Quiet regime |
| 3 | ~290k | 0.832 | Mixed regime |
| 4 | ~290k | 0.821 | Event-heavy (CPI, NFP) |
| 5 | ~290k | 0.817 | Most recent, closest to live |
| Mean ± SD | ~1.45M | 0.826 ± 0.015 | Purged K=5, 10-min embargo |
9.2 Triple-barrier label distribution
| Outcome | Share | Interpretation |
|---|---|---|
| Upper barrier (+12t) | ~41% | Take-profit realised |
| Lower barrier (-8t) | ~44% | Stop-loss realised |
| Vertical barrier | ~15% | Time-out; neither touched |
9.3 Full-stack F2_dom ablation (retained from v131)
| Configuration | Hit rate | Profit factor | Note |
|---|---|---|---|
| Stack, F2_dom disabled | 67.7% | ~1.02 | Near-breakeven pre-commission |
| Stack, F2_dom enabled | 70.6% | 1.35 | Walk-forward consistent |
9.4 Adaptive regime gate — 79-day RTH comparison (v133)
Full believe stack, RTH only, 2026-01-29 through 2026-04-17 (79 trading days). Same tape, same models, same order logic. The only difference between the two rows is whether the v133 adaptive regime gate is on or off.
| Configuration | Trades | Win rate | Profit factor | Net (BT $) | Max DD (BT $) |
|---|---|---|---|---|---|
| No gate (v131 behaviour) | 9,361 | 75.9% | 1.63 | +131,198 | 11,372 |
| Adaptive gate on (v133) | 9,136 | 76.1% | 1.66 | +134,164 | 7,938 |
The adaptive gate suppresses 225 trades (−2.4%) concentrated in
HIGH_VOL_CHOP and LOW_VOL_QUIET
after two-consecutive-losing-day states. Net BT P&L rises by $2,966; peak drawdown
falls by $3,434 (−30.2%). The design goal was a cleaner equity curve at matched
profit tier — the table is the proof.
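The gate's trigger condition can be sketched directly from the description above. The streak logic and the non-gated regime label are our assumptions beyond what the text states:

```python
def gate_blocks(regime, daily_pnl,
                blocked=("HIGH_VOL_CHOP", "LOW_VOL_QUIET")):
    """v133 adaptive-gate sketch: after two consecutive losing days,
    suppress entries in the two chop/quiet regimes. A non-losing day
    resets the streak. Details beyond the text are assumptions."""
    streak = len(daily_pnl) >= 2 and daily_pnl[-1] < 0 and daily_pnl[-2] < 0
    return streak and regime in blocked


assert gate_blocks("HIGH_VOL_CHOP", [-120.0, -80.0])      # blocked
assert not gate_blocks("HIGH_VOL_CHOP", [-120.0, +40.0])  # streak broken
assert not gate_blocks("TREND_UP", [-120.0, -80.0])       # hypothetical regime, not gated
```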
9.5 Determinism proof
Two independent runs of the v133 backtester were executed on the same captured tape,
same configuration, same seed scope. The backtester is deterministic across all signal
paths; the only non-deterministic element is the stop-fill model, which applies a
random 1-2 tick penalty at trigger time (--stop-slippage 2).
| Quantity | Run A | Run B | Delta |
|---|---|---|---|
| Trade count | identical | identical | 0 |
| Win rate | identical | identical | 0.0 pt |
| Commission | identical | identical | $0.00 |
| Net P&L | baseline | baseline ± $112 | 0.086% |
A $112 variance over $130,398 of net backtest P&L (0.086%) is entirely attributable to the stop-slippage RNG and bounds the noise floor of every comparison on the site. In other words, any BT delta smaller than ~0.1% is within stop-slippage noise and cannot be called a signal.
Deliberately not published on this page: daily P&L, cumulative equity curve of the live account, Sharpe ratios on sub-quarter windows, or single-fold “best” metrics. Those are either noise, cherry-picks, or both. The figures above are the honest answer to “what does this system do?”
Limits & Known Failure Modes
Every serious quant paper should include a section on what the system does badly. Ours includes:
- Microstructure decay. If the top three F2_dom features degrade together, exchange-level liquidity has reorganised and the model’s prior is stale.
- Overfitting to the 78-day capture fold. It is our highest-fidelity fold and also the easiest to overfit. We hold out the most recent week on principle.
- Event gaps. FOMC-style discrete shocks are regions of low coverage; the model is more confident than it should be in the minutes either side. Size is explicitly attenuated.
- Retrain-cadence drag. The weekly F2_dom retrain lags regime shifts by up to a week. Between retrains, the head can continue firing a stale prior; the monitoring stack is tuned to catch this within one session.
- Bridge single point of failure. One Python process on one Azure VM is the only thing that submits orders. We accept the concentration and compensate with a supervisor, a watchdog, and an explicit refuse-to-deploy policy during CME maintenance.
Current Version: believe v133
Shipped 2026-04-17. Relative to v131, v133 introduces the adaptive regime
gate (2-day rolling P&L auto-toggle that blocks HIGH_VOL_CHOP and
LOW_VOL_QUIET after two consecutive losing days), cuts the session mask down to
RTH + ETH_EUROPE only, and formalises the determinism proof
for the backtester (0.086% P&L variance, entirely from stop-slippage RNG). The bridge
also picks up a DOM-pipeline reliability fix (41a4b67) that
removes a silent-swallow exception path on live book subscription errors. Historical
lineage: v131 introduced the F2_dom microstructure head relative to v125.
| Component | Status in v133 | Next |
|---|---|---|
| F2_dom | AUC 0.826 ± 0.015 | Expand to 20+ features, roll feature-set audit |
| XGB 5m | Walk-forward retrain monthly | Explore gradient-boosting alternative libraries |
| Tick ML | Monthly retrain, isotonic calibrated | Add attention-layer variant as candidate head |
| Execution | Marketable limit + OCA bracket | Re-enable pyramid tiers after regime audit |
| Adaptive gate | New in v133. 2-day rolling P&L toggle, blocks HIGH_VOL_CHOP + LOW_VOL_QUIET on 2-loss streak | Per-regime half-size mode as an alternative to hard block |
| Session mask | RTH + ETH_EUROPE enabled. ETH_PRE / ETH_POST / ETH_ASIA disabled (negative BT P&L) | Re-audit quarterly against forward tape |
| Data corpus | Tick 2026-01-29 → 2026-04-17 (952k ticks). DOM 2026-01-27 → 2026-04-17 (25M snaps, Databento + IB live, $125.69 spend). Separated into dedicated LFS repos. | Continuous append on the rolling 6-month retrain window |
| Backtester | Deterministic replay. Two-run variance 0.086% from stop-slippage RNG | Faster parallel fold evaluation |
| DOM bridge | Silent-swallow bug fixed (41a4b67); live-DOM errors now visible + alarmed | Add structured exception codes per subscription path |
| Monitoring | log.bhf.capital live, feature-drift alarms, parity job | Publish read-only dashboard to investors |
This site is informational. Nothing on it is an offer, a solicitation, or investment advice. Past walk-forward and backtest results do not guarantee future live performance. Commission, exchange fees, slippage, and regime changes can materially affect results.