1. The FUSION-DUAL design
believe is a three-head ensemble on a single instrument. Each head observes a different representation of the same market and emits an independent trading decision. We call the overall architecture FUSION-DUAL because the two original heads (Rob’s tick ensemble and John’s 5-minute XGBoost) were joined in v133 by a third, F2_dom, which reads the live MBP-10 order book.
- Tick ML (Rob). Sequence model on the tick stream. Inputs: signed trade size, local imbalance, VPIN, Kyle’s λ, microprice drift, realised-vol terms, time-of-day. Short-horizon directional probability.
- XGB 5m (John). Gradient-boosted trees on 52 bar-level features built from closed 5-minute OHLCV bars: trend strength, relative position, session context, volatility regime, range dynamics.
- F2_dom (v133). LightGBM classifier on 17 features extracted from the MBP-10 order book: book imbalance, top-of-book ratio, top-three imbalance, bid gradients L1-L10, ask gradients L1-L10, rolling imbalance std, midprice momentum.
No head is a supervisor of any other. They run side by side with independent signal gates and independent OCA brackets. This is deliberate: a single fragile “confidence gate” across all three would re-introduce the overfitting we designed the ensemble to escape.
2. Triple-barrier labelling
We label every training sample using the triple-barrier method
(López de Prado, Advances in Financial Machine Learning, 2018). For each
candidate entry at time t, three barriers are placed:
- an upper barrier at +h ticks above entry,
- a lower barrier at -h ticks below entry,
- a vertical barrier at t + Δt.
The label is the first barrier touched. This is what a real trader sees: a take-profit, a stop-loss, or a time-out. It is not a naive “direction of the next candle” label, which is the single most common source of phantom edge in the literature.
Why it matters. A model trained on next-bar direction learns nothing about the path between now and the decision point. Triple-barrier forces the model to learn whether the entry survives its own stop — exactly the question execution actually asks.
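The barrier race can be sketched in a few lines. This is a minimal illustration, not the production labeller; `h_ticks`, `tick_size`, and `max_bars` are illustrative parameter names:

```python
import numpy as np

def triple_barrier_label(prices, entry_idx, h_ticks, tick_size, max_bars):
    """Label one entry: +1 if the upper barrier is hit first, -1 if the
    lower barrier is hit first, 0 if the vertical barrier (time-out) wins.

    prices    : 1-D array of prices, entry included
    entry_idx : index of the entry price
    h_ticks   : symmetric barrier distance in ticks
    max_bars  : vertical barrier (Δt) expressed in bars/updates
    """
    entry = prices[entry_idx]
    upper = entry + h_ticks * tick_size
    lower = entry - h_ticks * tick_size
    end = min(entry_idx + max_bars, len(prices) - 1)
    for i in range(entry_idx + 1, end + 1):
        if prices[i] >= upper:
            return 1          # take-profit barrier touched first
        if prices[i] <= lower:
            return -1         # stop barrier touched first
    return 0                  # vertical barrier: timed out
```

The label is path-dependent by construction: a price series that ends up at the same place but dips through the lower barrier on the way gets labelled -1, not +1.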
3. Purged K-fold CV with embargo
Time-series K-fold is broken by default. Overlapping labels and autocorrelated features leak information across fold boundaries, inflating validation AUC by 5-15 points in our own A/B studies. We use purged K-fold with embargo:
- Purge. Any training sample whose label window overlaps the test fold is removed.
- Embargo. A 10-minute band on either side of the test fold is removed from training, to prevent leakage through residual autocorrelation.
- Five folds sliding forward in time, no shuffling. Early folds can use less training data than late folds — we accept that, because the alternative is a lie.
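The purge and embargo steps for a single fold reduce to two boolean masks. A sketch, with label windows and fold bounds expressed on one shared time axis (the exact representation in our pipeline may differ):

```python
import numpy as np

def purged_fold_train_idx(label_start, label_end, test_lo, test_hi, embargo):
    """Training indices for one purged fold with embargo.

    label_start/label_end : per-sample label-window bounds (same time units)
    test_lo/test_hi       : time bounds of the test fold
    embargo               : width of the embargo band on either side
    """
    label_start = np.asarray(label_start)
    label_end = np.asarray(label_end)
    # Purge: drop any sample whose label window overlaps the test fold.
    overlaps = (label_start <= test_hi) & (label_end >= test_lo)
    # Embargo: also drop samples starting inside a band around the fold,
    # to kill leakage through residual autocorrelation.
    in_embargo = (label_start >= test_lo - embargo) & (label_start <= test_hi + embargo)
    keep = ~(overlaps | in_embargo)
    return np.where(keep)[0]
```

Samples whose label windows merely sit near the fold survive; anything that overlaps it, or starts inside the embargo band, is gone from training.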
F2_dom_v133 reports AUC 0.826 ± 0.015 across five purged folds on 1.45M labelled samples. The ± is the fold-to-fold standard deviation, not a bootstrap confidence interval — it is the honest answer to “how stable is this metric”.
4. F2_dom feature set
Seventeen features, all derived from the live MBP-10 snapshot. No lagged prices, no technical indicators on price, no feature that could double-count information already in the tape:
| Feature | Description |
|---|---|
| book_imb | Aggregate L1-L10 bid size minus L1-L10 ask size, normalised |
| tob_ratio | Top-of-book size ratio (L1 bid / L1 ask) |
| top3_imb | Imbalance across the top three price levels only |
| bid_grad_1..10 | Size gradient down the bid stack |
| ask_grad_1..10 | Size gradient up the ask stack |
| imb_std | Rolling 30-second std of book_imb |
| mid_mom | Microprice drift over the last N book updates |
| spread_ticks | Bid-ask spread in integer ticks |
| depth_ratio | Ratio of L1 size to L2-L10 aggregate |
| queue_age | Time since last price-level touch |
The feature names above are the canonical 10 of the 17 — the rest are decomposed
gradients (bid/ask per level) we group as bid_grad_*,
ask_grad_*. Full list ships with each model card in the repo.
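A few of these features can be computed directly from the ten-level size arrays. The normalisations below are illustrative choices, not the exact ones shipped in the model card:

```python
import numpy as np

def book_features(bid_px, bid_sz, ask_px, ask_sz, tick_size=0.25):
    """Illustrative subset of the MBP-10 features.

    bid_sz/ask_sz : length-10 size arrays, level 1 (top of book) first.
    """
    bid_sz = np.asarray(bid_sz, float)
    ask_sz = np.asarray(ask_sz, float)
    total = bid_sz.sum() + ask_sz.sum()
    return {
        # Aggregate L1-L10 imbalance, normalised to [-1, 1]
        "book_imb": (bid_sz.sum() - ask_sz.sum()) / total,
        # Top-of-book size ratio
        "tob_ratio": bid_sz[0] / ask_sz[0],
        # Imbalance over the top three levels only
        "top3_imb": (bid_sz[:3].sum() - ask_sz[:3].sum())
                    / (bid_sz[:3].sum() + ask_sz[:3].sum()),
        # Spread in integer ticks
        "spread_ticks": round((ask_px[0] - bid_px[0]) / tick_size),
        # L1 size vs the rest of the stack
        "depth_ratio": (bid_sz[0] + ask_sz[0]) / (bid_sz[1:].sum() + ask_sz[1:].sum()),
    }
```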
5. Calibration
Raw classifier output is not a probability. We fit isotonic regression on out-of-fold predictions so that when F2_dom says 0.70, it really does fire roughly 70% of the time in the labelled direction. Calibration happens on the same purged folds as validation — we never calibrate on the training fold and then report an AUC like the calibration was free.
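The isotonic fit itself is the pool-adjacent-violators algorithm. A self-contained sketch of PAV on out-of-fold (score, outcome) pairs; the production pipeline could equally call a library implementation such as scikit-learn's `IsotonicRegression`:

```python
import numpy as np

def pav_calibrate(scores, labels):
    """Monotone fit of P(label=1 | score) via pool-adjacent-violators.

    scores : out-of-fold classifier scores
    labels : 0/1 realised outcomes
    Returns (sorted_scores, calibrated_probs), one prob per sample.
    """
    order = np.argsort(scores)
    s = np.asarray(scores, float)[order]
    y = np.asarray(labels, float)[order]
    vals = list(y)          # block means, one block per sample initially
    wts = [1.0] * len(y)    # block weights (sample counts)
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:            # monotonicity violated: pool
            w = wts[i] + wts[i + 1]
            v = (vals[i] * wts[i] + vals[i + 1] * wts[i + 1]) / w
            vals[i:i + 2] = [v]
            wts[i:i + 2] = [w]
            i = max(i - 1, 0)                # re-check the previous block
        else:
            i += 1
    # Expand block means back to per-sample calibrated probabilities.
    probs = np.repeat(vals, [int(w) for w in wts])
    return s, probs
```

The output is a step function that never decreases in the score, which is exactly the "0.70 means roughly 70%" property the section asks for.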
6. Execution model
A perfect signal with sloppy execution is a bad signal. believe executes with three rules:
- Marketable limit orders, not market orders. The limit is priced at the far touch plus a small cushion. Fills are achieved without paying the 1-tick market-order tax embedded in most retail backtests.
- Exchange-side OCA brackets. Every entry is submitted with a take-profit and a stop. The CME matches them as a One-Cancels-All group; the client never relies on its own PnL polling to close a trade.
- Break-even + 2-tick lock on MFE ≥ 4 ticks. When maximum favourable excursion reaches 4 ticks, the stop is amended to entry + 2 ticks. Small, mechanical, non-adaptive.
Baseline sizing is qty=1 per model. Pyramid tiers exist in code
but are disabled in production v133 while the MBP-10 regime is still being measured.
No martingales. No averaging into losers.
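The break-even lock is small enough to state completely in code. A sketch, assuming a signed `side` convention (+1 long, -1 short) that is ours, not necessarily the live client's:

```python
def amended_stop(entry, side, mfe_ticks, current_stop, tick_size=0.25,
                 trigger_ticks=4, lock_ticks=2):
    """Break-even + 2-tick lock: once MFE >= trigger_ticks, move the stop
    to entry +/- lock_ticks in the trade's favour. Never loosens a stop."""
    if mfe_ticks < trigger_ticks:
        return current_stop
    locked = entry + side * lock_ticks * tick_size
    if side > 0:
        return max(current_stop, locked)   # long: stop only moves up
    return min(current_stop, locked)       # short: stop only moves down
```

Mechanical and non-adaptive: the amendment is a pure function of entry, side, and realised MFE, with no model input.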
7. Commission and slippage in the backtest
The believe backtester does not reward you for being optimistic. Every simulated fill pays commission; every market-order fallback pays a 1-tick penalty; stop fills apply a random 1-2 tick slippage from the trigger. Marketable-limit entries pay zero entry slippage when the book supports the fill — and when it does not, the order simply does not fire and the trade is skipped.
This is the single most important discipline on the page. Most “profitable” ES strategies die the moment you price commission and slippage honestly.
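The cost model reduces to a per-fill charge. A sketch with placeholder numbers (the $2.25 commission and $12.50 ES tick value here are illustrative, not the live config):

```python
import random

def fill_cost(order_type, qty=1, tick_value=12.50, commission=2.25, rng=None):
    """Pessimistic per-fill cost: commission is always paid, a market-order
    fallback pays 1 tick, and stop fills slip a random 1-2 ticks."""
    if rng is None:
        rng = random.Random()
    cost = commission * qty
    if order_type == "market":
        cost += 1 * tick_value * qty                   # 1-tick market-order tax
    elif order_type == "stop":
        cost += rng.randint(1, 2) * tick_value * qty   # random 1-2 tick slippage
    # Marketable-limit entries pay zero entry slippage; if the book cannot
    # support the fill, the simulator skips the trade rather than filling it.
    return cost
```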
8. Adaptive regime gate (v133)
v133 introduces a single, deliberately simple top-level gate: a 2-day rolling P&L auto-toggle that blocks two of the six regime buckets when the stack is in a losing streak.
- Each completed trading session contributes a signed P&L to a 2-day rolling window.
- After two consecutive losing days, the gate blocks HIGH_VOL_CHOP and LOW_VOL_QUIET regime labels from firing entries. TRENDING and BALANCED regimes continue to trade.
- After two consecutive winning days, the gate unblocks both regimes and the stack returns to its full regime surface.
- The gate is a pure function of already-realised P&L. No lookahead, no parameter fit to future windows — the toggle reads a value that already happened.
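The whole toggle fits in a dozen lines. A sketch of the state machine over a signed daily P&L series; function names are ours, not the production API:

```python
BLOCKABLE = {"HIGH_VOL_CHOP", "LOW_VOL_QUIET"}

def gate_blocked(daily_pnl):
    """Replay the daily P&L history: two consecutive losers arm the block,
    two consecutive winners disarm it. Pure function of realised history."""
    blocked = False
    for yesterday, today in zip(daily_pnl[:-1], daily_pnl[1:]):
        if yesterday < 0 and today < 0:
            blocked = True
        elif yesterday > 0 and today > 0:
            blocked = False
    return blocked

def entry_allowed(regime, daily_pnl):
    """An entry fires unless the gate is armed and the regime is blockable."""
    return not (gate_blocked(daily_pnl) and regime in BLOCKABLE)
```

Because the gate only ever reads completed sessions, there is no way for it to peek at the window it is protecting.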
Regime labels are produced by a lightweight realised-vol × range-expansion classifier
on 30-minute windows: six buckets (HIGH_VOL_TREND,
HIGH_VOL_CHOP, MID_VOL_TREND,
MID_VOL_BALANCED, LOW_VOL_TREND,
LOW_VOL_QUIET). The two blocked buckets are the ones whose
per-regime P&L contribution was most negatively correlated with drawdown excursions
in the 79-day backtest.
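The six-bucket labeller is a pair of threshold comparisons on the 30-minute window. A sketch; all threshold values below are illustrative placeholders, not the fitted cut-points:

```python
def regime_bucket(realised_vol, range_expansion,
                  vol_lo=0.10, vol_hi=0.25, trend_cut=1.5):
    """Six-bucket realised-vol x range-expansion label for one 30-min window.

    realised_vol    : annualised (or otherwise scaled) realised volatility
    range_expansion : current range relative to a trailing baseline
    """
    if realised_vol >= vol_hi:
        return "HIGH_VOL_TREND" if range_expansion >= trend_cut else "HIGH_VOL_CHOP"
    if realised_vol >= vol_lo:
        return "MID_VOL_TREND" if range_expansion >= trend_cut else "MID_VOL_BALANCED"
    return "LOW_VOL_TREND" if range_expansion >= trend_cut else "LOW_VOL_QUIET"
```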
What the gate buys you. On a 79-day RTH backtest (Jan 29 – Apr 17 2026), the gate takes the stack from 9,361 trades / 75.9% WR / PF 1.63 / MaxDD $11,372 to 9,136 trades / 76.1% WR / PF 1.66 / MaxDD $7,938. Fewer trades, slightly higher hit rate, same profit tier, ~30% less peak drawdown. Quantified on Performance.
9. Session selection (v133)
Previous versions ran the full 24-hour CME clock. v133 disables the three sessions whose per-session P&L was negative on the 79-day tape:
- RTH — 9:30–16:00 ET. Enabled. Carries the stack.
- ETH_EUROPE — London/Frankfurt overlap. Enabled. Modestly positive contribution.
- ETH_PRE, ETH_POST, ETH_ASIA — Disabled. All three produced negative aggregate P&L on the backtest window and were removed from the live session mask. See Architecture for the per-session breakdown.
10. Retrain cadence
Each model ships with a retrain schedule:
- F2_dom — weekly retrain on rolling 6-month window. Feature importance tracked across 16 core features each retrain; drift alarms trigger a model hold if any feature’s importance rank shifts by more than two positions.
- XGB 5m — monthly walk-forward retrain on the trailing 12 months. Test fold is the most recent month, with 1-week embargo.
- Tick ML — monthly retrain with the same embargoed walk-forward scheme, plus a held-out month each quarter that never enters training.
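The drift alarm on F2_dom's importance ranks is a rank comparison between consecutive retrains. A sketch, assuming importances arrive as feature-to-score dicts (the real pipeline's representation may differ):

```python
def drift_hold(prev_importance, new_importance, max_rank_shift=2):
    """Hold the retrained model if any tracked feature's importance rank
    moved by more than max_rank_shift positions between retrains.

    prev_importance/new_importance : {feature_name: importance_score}
    """
    def ranks(imp):
        ordered = sorted(imp, key=imp.get, reverse=True)  # rank 0 = most important
        return {feat: rank for rank, feat in enumerate(ordered)}

    prev, new = ranks(prev_importance), ranks(new_importance)
    return any(abs(prev[f] - new[f]) > max_rank_shift
               for f in prev if f in new)
```

A small reshuffle among neighbours passes; a feature that jumps from the bottom of the table to the top trips the hold, which is the regime-change signature the alarm is after.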
11. What we explicitly do not do
- We do not use a single “god model” that scores a vote of the other models. It is an obvious overfitting magnet.
- We do not re-weight models based on recent P&L. The v133 adaptive gate toggles regime eligibility, not model weights — no model’s score is ever scaled by how recently it won.
- We do not trade outside ES. Cross-instrument diversification is often a band-aid for a model that does not work.
- We do not hide behind Sharpe ratios on in-sample windows.