1. Headline numbers
Past results do not guarantee future performance. The figures on this page are walk-forward validation on F2_dom and a 79-day backtest on the v133 stack with the adaptive gate on, RTH only. Nothing on this page is a live account statement.
2. F2_dom walk-forward AUC
The microstructure head is the most heavily validated component of the stack. The table below reports AUC across five purged folds with a 10-minute embargo, on 1.45M labelled samples drawn from the full MBP-10 history:
| Fold | Samples | AUC | Notes |
|---|---|---|---|
| 1 (earliest) | ~290k | 0.841 | Highest spread vol in window |
| 2 | ~290k | 0.819 | Quiet regime, feature importance shifts |
| 3 | ~290k | 0.832 | Mixed regime |
| 4 | ~290k | 0.821 | Event-heavy (CPI, NFP) |
| 5 (latest) | ~290k | 0.817 | Most recent, closest to live |
| Mean ± SD | ~1.45M | 0.826 ± 0.010 | Purged K=5, 10-min embargo |
Reading this honestly: fold 5 is the most recent and the closest to live. It also has the lowest AUC in the set (0.817). We treat that figure, not the mean, as the realistic upper bound for deployed performance.
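As a rough illustration of what "purged folds with an embargo" can mean, here is a minimal sketch of a time-ordered split. It is not the production splitter. The function name `purged_kfold_indices`, the per-sample `start_times` and `end_times` arrays (in seconds), and the exact purge rule are assumptions made for this example; only the five folds and the 10-minute (600-second) embargo come from the text above.

```python
import numpy as np

def purged_kfold_indices(start_times, end_times, n_splits=5, embargo_seconds=600.0):
    """Yield (train_idx, test_idx) for contiguous, time-ordered folds.

    start_times[i] is when sample i's features are observed and
    end_times[i] is when its label is resolved, both in seconds.
    start_times is assumed to be sorted ascending (hypothetical layout).
    """
    start_times = np.asarray(start_times, dtype=float)
    end_times = np.asarray(end_times, dtype=float)
    n = len(start_times)
    for test_idx in np.array_split(np.arange(n), n_splits):
        test_start = start_times[test_idx[0]]
        test_end = end_times[test_idx].max()
        # Purge: drop training samples whose label window overlaps the test window.
        overlaps = (start_times <= test_end) & (end_times >= test_start)
        # Embargo: drop training samples that start shortly after the test window.
        embargoed = (start_times > test_end) & (start_times <= test_end + embargo_seconds)
        train_mask = ~(overlaps | embargoed)
        train_mask[test_idx] = False
        yield np.flatnonzero(train_mask), test_idx
```

Every training sample that survives either resolves its label before the test window opens or starts more than `embargo_seconds` after it closes, which is the property the purge and embargo are meant to enforce.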
3. Full-stack backtest — adaptive gate vs no gate (v133)
The full believe stack (tick ML + XGB 5m + F2_dom) is run against the 79-day live tick capture from 2026-01-29 through 2026-04-17, RTH only, with commission, marketable-limit entry slippage rules, and random 1-2 tick stop slippage applied. Same tape, same models, same order logic: the only difference between rows is whether the v133 adaptive regime gate is on or off.
| Configuration | Trades | Hit Rate | Profit Factor | Net (BT $) | Max DD (BT $) |
|---|---|---|---|---|---|
| No gate (v131 behaviour) | 9,361 | 75.9% | 1.63 | +131,198 | 11,372 |
| Adaptive gate on (v133) | 9,136 | 76.1% | 1.66 | +134,164 | 7,938 |
| Delta | −225 | +0.2 pt | +0.03 | +$2,966 | −$3,434 (−30.2%) |
Reading this table. The adaptive gate suppresses ~2.4% of trades but does so in the exact regimes where the stack was losing money. Net P&L goes up, peak drawdown falls ~30%, and the equity curve is visibly cleaner through the two worst weeks of the window. Numbers are backtest dollars on a 1-lot ES stack, not live account P&L.
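The Delta row and the percentages quoted above can be re-derived from the two configuration rows. The sketch below does that arithmetic; the `BacktestRow` class and `delta_row` helper are names invented for this example, and the inputs are copied from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BacktestRow:
    trades: int
    hit_rate_pct: float
    profit_factor: float
    net_usd: int
    max_dd_usd: int

# Values copied from the table above (backtest dollars, 1-lot ES).
no_gate = BacktestRow(9361, 75.9, 1.63, 131_198, 11_372)
gate_on = BacktestRow(9136, 76.1, 1.66, 134_164, 7_938)

def delta_row(base, new):
    """Differences of `new` relative to `base`, plus two relative changes."""
    return {
        "trades": new.trades - base.trades,
        "trades_suppressed_pct": 100 * (base.trades - new.trades) / base.trades,
        "hit_rate_pt": round(new.hit_rate_pct - base.hit_rate_pct, 1),
        "profit_factor": round(new.profit_factor - base.profit_factor, 2),
        "net_usd": new.net_usd - base.net_usd,
        "max_dd_usd": new.max_dd_usd - base.max_dd_usd,
        "max_dd_change_pct": 100 * (new.max_dd_usd - base.max_dd_usd) / base.max_dd_usd,
    }

delta = delta_row(no_gate, gate_on)
```

Rounded to one decimal, `trades_suppressed_pct` is 2.4 and `max_dd_change_pct` is -30.2, matching the figures in the text.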
3b. F2_dom ablation (retained from v131)
For context, this is the historical F2_dom ablation, measured on an earlier 78-day tape. The microstructure head remains the largest single driver of the stack’s edge.
| Configuration | Hit Rate | Profit Factor | Result |
|---|---|---|---|
| Stack, F2_dom disabled | 67.7% | 1.02 | Near-breakeven pre-commission |
| Stack, F2_dom enabled | 70.6% | 1.35 | Walk-forward consistent |
The lift from the F2_dom head is real, small in hit-rate terms (+2.9 points), and consistent across folds; profit factor moves from 1.02 to 1.35. This is what a microstructure signal is supposed to do: tilt the edge, not replace it.
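For readers less familiar with the columns used in these tables, the sketch below computes hit rate, profit factor, and maximum drawdown from a list of per-trade P&L values. It assumes the conventional definitions (share of winning trades, gross profit over gross loss, largest peak-to-trough fall in cumulative P&L); the page does not state its exact formulas, and the function names are illustrative.

```python
def hit_rate(pnl):
    """Share of trades with strictly positive P&L."""
    return sum(1 for p in pnl if p > 0) / len(pnl)

def profit_factor(pnl):
    """Gross profit divided by gross loss (infinite if there are no losses)."""
    gross_profit = sum(p for p in pnl if p > 0)
    gross_loss = -sum(p for p in pnl if p < 0)
    return float("inf") if gross_loss == 0 else gross_profit / gross_loss

def max_drawdown(pnl):
    """Largest peak-to-trough decline of the cumulative P&L curve."""
    equity = peak = worst = 0.0
    for p in pnl:
        equity += p
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst
```

With a toy series such as `[100, -50, 200, -100, 50]`, three of five trades win, gross profit is 350 against a gross loss of 150, and the deepest dip below a prior equity peak is 100.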
4. Triple-barrier label distribution
Before trusting any classification metric, you should see the label distribution. Below is the distribution of the three barrier outcomes across the 1.45M F2_dom training samples:
| Outcome | Share | Interpretation |
|---|---|---|
| Upper barrier touched (+12 ticks) | ~41% | Take-profit realised |
| Lower barrier touched (−8 ticks) | ~44% | Stop-loss realised |
| Vertical barrier (time-out) | ~15% | Neither touched; exit at expiry |
The label set is close to balanced and not dominated by time-outs, which is a precondition for the AUC number above to be meaningful.
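To make the three outcomes concrete, here is a minimal sketch of first-touch triple-barrier labelling for a long entry, using the +12 and −8 tick barriers from the table. The path representation (price moves in ticks relative to entry, one value per time step), the `horizon` parameter, and the function names are assumptions for illustration; the page does not specify the vertical-barrier length or how the production labels are built.

```python
import numpy as np

def triple_barrier_label(path_ticks, upper=12, lower=8, horizon=None):
    """Return +1 (upper barrier first), -1 (lower barrier first) or 0 (time-out).

    path_ticks: price relative to entry, in ticks, one value per time step.
    horizon: optional number of steps before the vertical barrier (hypothetical).
    """
    path = np.asarray(path_ticks, dtype=float)
    if horizon is not None:
        path = path[:horizon]
    up_hits = np.flatnonzero(path >= upper)
    down_hits = np.flatnonzero(path <= -lower)
    first_up = up_hits[0] if up_hits.size else np.inf
    first_down = down_hits[0] if down_hits.size else np.inf
    if first_up == np.inf and first_down == np.inf:
        return 0  # neither barrier touched before expiry
    return 1 if first_up < first_down else -1

def label_shares(labels):
    """Fraction of samples ending at each of the three outcomes."""
    labels = np.asarray(labels)
    return {k: float(np.mean(labels == k)) for k in (1, -1, 0)}
```

Running `label_shares` over the full label set is the kind of check that produces a distribution table like the one above.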
5. Feature importance (F2_dom v133)
Importance is gain-based, averaged over the five purged folds. We track the top 16 each retrain and alert on large rank shifts. The table below shows the top 10 from the v133 snapshot; absolute gain values are withheld because they are retrain-specific and not decision-useful off the training host.
| Rank | Feature | Family |
|---|---|---|
| 1 | book_imb | Aggregate imbalance |
| 2 | tob_ratio | Top of book |
| 3 | top3_imb | Near-touch imbalance |
| 4 | mid_mom | Microprice drift |
| 5 | imb_std | Rolling imbalance vol |
| 6 | bid_grad_2 | Bid gradient L2 |
| 7 | ask_grad_2 | Ask gradient L2 |
| 8 | spread_ticks | Spread |
| 9 | depth_ratio | Depth skew |
| 10 | queue_age | Price-level staleness |
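A ranking like the one above can be produced by averaging each feature's gain across folds and sorting. The sketch below shows one way to do that. The helper name `rank_by_mean_gain` and the input layout (one `{feature: gain}` dict per fold) are assumptions, and a feature missing from a fold is treated as zero gain, which may not match the production pipeline.

```python
from statistics import mean

def rank_by_mean_gain(per_fold_gain):
    """Map each feature to its rank (1 = highest mean gain across folds).

    per_fold_gain: list of {feature: gain} dicts, one per fold.
    """
    features = set().union(*per_fold_gain)
    # A feature absent from a fold counts as zero gain (an assumption).
    avg = {f: mean(fold.get(f, 0.0) for fold in per_fold_gain) for f in features}
    # Sort by descending mean gain; break ties alphabetically for stability.
    ordered = sorted(avg, key=lambda f: (-avg[f], f))
    return {f: rank for rank, f in enumerate(ordered, start=1)}
```

Publishing only the resulting ranks, as this page does, keeps the table comparable between retrains even when the absolute gain scale changes.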
6. What we deliberately do not publish
- Daily P&L. A marketing site is not a trade blotter. Daily dollar moves say nothing useful about long-run edge and everything useless about noise.
- Cumulative equity curves for the live account. Same reason. They invite cherry-picking start dates.
- Sharpe ratios on short windows. Sharpe on < 1 year of intraday trading is mostly a random variable.
- Single-fold “best” metrics. If we only showed fold 1 we could claim 0.841. We show the mean and the standard deviation.
7. What we monitor day to day
- Feature drift. Any core feature’s importance rank shifting by more than two positions triggers a retrain review.
- Bracket fill rate. Measured as the share of entries that receive a marketable-limit fill. A decline in fill rate is read as a market-regime shift, not as a broker issue, until both are checked.
- Walk-forward AUC drift. Each weekly retrain produces a new fold-5 AUC. We care about the slope, not the level.
- Backtest vs live parity. Every live session is replayed the next day through the backtester against the captured tick stream. A trade that should have fired but did not is treated with the same seriousness as a trade that fired but should not have.
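Two of the checks in this list reduce to simple arithmetic and can be sketched as follows. The function names and input shapes are illustrative assumptions, not the production monitoring code. The two-position threshold comes from the feature-drift bullet, and the slope is an ordinary least-squares fit over consecutive weekly fold-5 AUC values.

```python
import numpy as np

def rank_shift_alerts(prev_ranks, new_ranks, max_shift=2):
    """Features whose importance rank moved by more than max_shift positions.

    Features missing from new_ranks are ignored here; a real monitor
    would probably flag those as well.
    """
    return sorted(
        f for f in prev_ranks
        if f in new_ranks and abs(new_ranks[f] - prev_ranks[f]) > max_shift
    )

def auc_slope(weekly_auc):
    """Least-squares slope of AUC per retrain (AUC change per week)."""
    y = np.asarray(weekly_auc, dtype=float)
    x = np.arange(len(y))
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)
```

A persistently negative slope is the signal of interest even while the level itself still looks acceptable.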
Nothing on this page is an offer, solicitation, or investment advice. Past walk-forward and backtest results do not guarantee future live performance. Commission, exchange fees, slippage, and regime changes can materially affect results. BHF Capital is an informational brand of Rare Bird Holdings LLC.