
1. The FUSION-DUAL design

believe is a three-head ensemble on a single instrument. Each head observes a different representation of the same market and emits an independent trading decision. We call the overall architecture FUSION-DUAL because the two original heads (Rob’s tick ensemble and John’s 5-minute XGBoost) were joined in v133 by a third, F2_dom, which reads the live MBP-10 order book.

No head is a supervisor of any other. They run side by side with independent signal gates and independent OCA brackets. This is deliberate: a single fragile “confidence gate” across all three would re-introduce the overfitting we designed the ensemble to escape.
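The side-by-side arrangement can be sketched as three independent heads, each with its own gate. All names, signals, and thresholds below are illustrative assumptions, not the production interface:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Head:
    """One ensemble member with its own signal gate and its own bracket.
    Names and thresholds here are illustrative, not production values."""
    name: str
    predict: Callable[[dict], float]  # market snapshot -> raw signal
    threshold: float                  # independent signal gate

    def decide(self, snapshot: dict) -> Optional[str]:
        s = self.predict(snapshot)
        if s > self.threshold:
            return "LONG"
        if s < -self.threshold:
            return "SHORT"
        return None  # gate closed: this head simply stays flat

# Three heads run side by side; no head reads or vetoes another's output.
heads = [
    Head("rob_tick", lambda snap: snap["tick_sig"], 0.6),
    Head("john_5m", lambda snap: snap["xgb_sig"], 0.55),
    Head("F2_dom", lambda snap: snap["book_sig"], 0.7),
]
snapshot = {"tick_sig": 0.8, "xgb_sig": 0.1, "book_sig": -0.9}
decisions = {h.name: h.decide(snapshot) for h in heads}
```

Note that a head returning `None` is not overruled or rescued by the others; it just does not trade.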

2. Triple-barrier labelling

We label every training sample using the triple-barrier method (López de Prado, Advances in Financial Machine Learning, 2018). For each candidate entry at time t, three barriers are placed:

- an upper barrier at the take-profit price,
- a lower barrier at the stop-loss price,
- a vertical barrier at the maximum holding time.

The label is the first barrier touched. This is what a real trader sees: a take-profit, a stop-loss, or a time-out. It is not a naive “direction of the next candle” label, which is the single most common source of phantom edge in the literature.

Why it matters. A model trained on next-bar direction learns nothing about the path between now and the decision point. Triple-barrier forces the model to learn whether the entry survives its own stop — exactly the question execution actually asks.
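The labelling rule above can be sketched as follows. The tick size, barrier widths, and `max_hold` horizon are illustrative assumptions; `triple_barrier_label` is not the production labeller:

```python
import numpy as np

def triple_barrier_label(prices, entry_idx, tp_ticks, sl_ticks, max_hold, tick=0.25):
    """Label one candidate long entry: +1 take-profit, -1 stop, 0 time-out.
    `prices` is a 1-D array of prices after the entry; widths are in ticks."""
    entry = prices[entry_idx]
    upper = entry + tp_ticks * tick  # take-profit barrier
    lower = entry - sl_ticks * tick  # stop-loss barrier
    horizon = min(entry_idx + max_hold, len(prices) - 1)  # vertical barrier
    for i in range(entry_idx + 1, horizon + 1):
        if prices[i] >= upper:
            return 1   # take-profit touched first
        if prices[i] <= lower:
            return -1  # stop touched first
    return 0           # vertical barrier: time-out

prices = np.array([100.0, 100.25, 100.5, 100.75, 101.0])
label = triple_barrier_label(prices, 0, tp_ticks=3, sl_ticks=2, max_hold=10)  # -> 1
```

The label is path-dependent by construction: the same terminal price gives a different label if the path touched the stop first.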

3. Purged K-fold CV with embargo

Time-series K-fold is broken by default: overlapping labels and autocorrelated features leak information across fold boundaries, inflating validation AUC by 5-15 points in our own A/B studies. We use purged K-fold with embargo: training samples whose label windows overlap the validation fold are purged, and a buffer of samples immediately after the validation window is embargoed as well.
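The purge-and-embargo logic can be sketched as follows; the `label_span` and `embargo` widths are illustrative assumptions, not the production values:

```python
import numpy as np

def purged_kfold_indices(n, n_folds=5, label_span=50, embargo=100):
    """Yield (train_idx, val_idx) pairs with purging and embargo.
    A sample's label window is assumed to cover [i, i + label_span); any
    training sample whose window overlaps the validation fold is purged,
    and `embargo` samples after the fold are dropped as well."""
    fold_edges = np.linspace(0, n, n_folds + 1, dtype=int)
    for k in range(n_folds):
        v0, v1 = fold_edges[k], fold_edges[k + 1]
        val_idx = np.arange(v0, v1)
        train_mask = np.ones(n, dtype=bool)
        # purge: label windows that overlap [v0, v1)
        train_mask[max(0, v0 - label_span):v1] = False
        # embargo: samples immediately after the validation window
        train_mask[v1:min(n, v1 + embargo)] = False
        yield np.where(train_mask)[0], val_idx
```

Without the purge, a training sample just before the fold boundary carries a label resolved inside the validation window, which is exactly the leak that inflates AUC.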

F2_dom_v133 reports AUC 0.826 ± 0.015 across five purged folds on 1.45M labelled samples. The ± is the fold-to-fold standard deviation, not a bootstrap confidence interval — it is the honest answer to “how stable is this metric”.

4. F2_dom feature set

Seventeen features, all derived from the live MBP-10 snapshot. No lagged prices, no technical indicators on price, no feature that could double-count information already in the tape:

Feature         Description
book_imb        Aggregate L1-L10 bid size minus L1-L10 ask size, normalised
tob_ratio       Top-of-book size ratio (L1 bid / L1 ask)
top3_imb        Imbalance across the top three price levels only
bid_grad_1..10  Size gradient down the bid stack
ask_grad_1..10  Size gradient up the ask stack
imb_std         Rolling 30-second std of book_imb
mid_mom         Microprice drift over the last N book updates
spread_ticks    Bid-ask spread in integer ticks
depth_ratio     Ratio of L1 size to L2-L10 aggregate
queue_age       Time since last price-level touch

The feature names above are the canonical 10 of the 17 — the rest are decomposed gradients (bid/ask per level) we group as bid_grad_*, ask_grad_*. Full list ships with each model card in the repo.
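A few of the named features can be computed from a single snapshot as below. The exact formulas (in particular the normalisation of book_imb and the numerator of depth_ratio) are assumptions consistent with the descriptions above, not the production definitions:

```python
import numpy as np

def book_features(bid_px, ask_px, bid_sz, ask_sz, tick=0.25):
    """A handful of the named features from one MBP-10 snapshot.
    Inputs are length-10 arrays, level 1 first; formulas are illustrative."""
    feats = {}
    total = bid_sz.sum() + ask_sz.sum()
    feats["book_imb"] = (bid_sz.sum() - ask_sz.sum()) / total
    feats["tob_ratio"] = bid_sz[0] / ask_sz[0]
    top3 = bid_sz[:3].sum() + ask_sz[:3].sum()
    feats["top3_imb"] = (bid_sz[:3].sum() - ask_sz[:3].sum()) / top3
    feats["spread_ticks"] = int(round((ask_px[0] - bid_px[0]) / tick))
    feats["depth_ratio"] = (bid_sz[0] + ask_sz[0]) / (bid_sz[1:].sum() + ask_sz[1:].sum())
    return feats

bid_px = 5000.00 - 0.25 * np.arange(10)
ask_px = 5000.25 + 0.25 * np.arange(10)
f = book_features(bid_px, ask_px, np.full(10, 20.0), np.full(10, 10.0))
```

Every input is a field of the current book snapshot; nothing is a lagged price or a price-derived indicator, which is the constraint the paragraph above states.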

5. Calibration

Raw classifier output is not a probability. We fit isotonic regression on out-of-fold predictions so that when F2_dom says 0.70, it really does fire roughly 70% of the time in the labelled direction. Calibration happens on the same purged folds as validation — we never calibrate on the training fold and then report an AUC as if the calibration were free.
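Isotonic regression is a monotone fit from raw score to empirical hit rate. A from-scratch pool-adjacent-violators sketch (a stand-in for a library calibrator such as scikit-learn's IsotonicRegression, not the production code):

```python
import numpy as np

def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: nondecreasing map score -> hit rate.
    Returns (sorted scores, calibrated values at those scores)."""
    order = np.argsort(scores)
    x, y = scores[order], labels[order].astype(float)
    merged = [[y[0], 1.0]]  # blocks of [label_sum, count]
    for yi in y[1:]:
        merged.append([yi, 1.0])
        # pool while a block mean would decrease left to right
        while len(merged) > 1 and merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]:
            s, c = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += c
    fitted = np.concatenate([np.full(int(c), s / c) for s, c in merged])
    return x, fitted

def isotonic_predict(x_knots, y_knots, scores):
    """Interpolate the fitted step/knot values at new raw scores."""
    return np.interp(scores, x_knots, y_knots)
```

The key discipline is in the inputs: `scores` must be out-of-fold predictions, so the monotone map is learned on data the classifier never trained on.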

6. Execution model

A perfect signal with sloppy execution is a bad signal. believe executes with three rules:

Baseline sizing is qty=1 per model. Pyramid tiers exist in code but are disabled in production v133 while the MBP-10 regime is still being measured. No martingales. No averaging into losers.

7. Commission and slippage in the backtest

The believe backtester does not reward you for being optimistic. Every simulated fill pays commission; every market-order fallback pays a 1-tick penalty; stop fills apply a random 1-2 tick slippage from the trigger. Marketable-limit entries pay zero entry slippage when the book supports the fill — and when it does not, the order simply does not fire and the trade is skipped.

This is the single most important discipline on the page. Most “profitable” ES strategies die the moment you price commission and slippage honestly.
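The cost rules above translate into a small per-fill friction model. The dollar figures below (commission per side, ES tick value) are illustrative assumptions, not the production config:

```python
import random

COMMISSION = 1.24   # assumed per-side, per-contract commission in $
TICK_VALUE = 12.50  # $ per tick for ES

def fill_cost(kind, qty=1, rng=random):
    """Friction charged to one simulated fill, following the stated rules."""
    cost = COMMISSION * qty  # every fill pays commission
    if kind == "market_fallback":
        cost += 1 * TICK_VALUE * qty                   # flat 1-tick penalty
    elif kind == "stop":
        cost += rng.randint(1, 2) * TICK_VALUE * qty   # random 1-2 tick slippage
    elif kind == "marketable_limit":
        pass  # zero entry slippage when the book supports the fill;
              # otherwise the order does not fire and no trade is recorded
    return cost
```

Running a candidate strategy through this model before looking at gross P&L is the discipline the paragraph above describes.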

8. Adaptive regime gate (v133)

v133 introduces a single, deliberately simple top-level gate: a 2-day rolling P&L auto-toggle that blocks two of the six regime buckets when the stack is in a losing streak.

Regime labels are produced by a lightweight realised-vol × range-expansion classifier on 30-minute windows: six buckets (HIGH_VOL_TREND, HIGH_VOL_CHOP, MID_VOL_TREND, MID_VOL_BALANCED, LOW_VOL_TREND, LOW_VOL_QUIET). The two blocked buckets are the ones whose per-regime P&L contribution was most negatively correlated with drawdown excursions in the 79-day backtest.
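The toggle itself is small enough to sketch. Which two buckets are blocked, and the zero threshold on rolling P&L, are illustrative assumptions here (the source selects them from backtest correlation, not by name):

```python
from collections import deque

class RegimeGate:
    """2-day rolling P&L auto-toggle: when the rolling sum is negative,
    trades in the blockable regime buckets are suppressed.
    Bucket names and threshold are assumptions for illustration."""

    def __init__(self, blockable=("HIGH_VOL_CHOP", "LOW_VOL_QUIET"), window_days=2):
        self.daily_pnl = deque(maxlen=window_days)
        self.blockable = set(blockable)

    def end_of_day(self, pnl):
        self.daily_pnl.append(pnl)

    def allows(self, regime):
        full = len(self.daily_pnl) == self.daily_pnl.maxlen
        losing = full and sum(self.daily_pnl) < 0
        return not (losing and regime in self.blockable)
```

Because the gate only ever blocks, a head that would not have traded anyway is unaffected; the gate cannot create a trade, only veto one.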

What the gate buys you. On a 79-day RTH backtest (Jan 29 – Apr 17 2026), the gate takes the stack from 9,361 trades / 75.9% WR / PF 1.63 / MaxDD $11,372 to 9,136 trades / 76.1% WR / PF 1.66 / MaxDD $7,938. Fewer trades, slightly higher hit rate, same profit tier, ~30% less peak drawdown. Quantified on Performance.

9. Session selection (v133)

Previous versions ran the full 24-hour CME clock. v133 disables the three sessions whose per-session P&L was negative on the 79-day tape:

10. Retrain cadence

Each model ships with a retrain schedule:

11. What we explicitly do not do