
1. The backtest is a contract with your future self

If a backtest looks better than live, one of two things is true: the backtest is cheating, or live has a bug. There is no third option. We treat the backtester as a contract with our future self — if it promises a fill, the broker has to deliver that fill, and if it does not, we find out why the same day.

The discipline that makes this work is boring: deterministic replay against the captured tick stream, commission on every simulated trade, slippage priced on every market-order fallback, and an explicit ban on in-sample label overlap. Nothing exotic. Nothing optimised for a headline number.
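The ban on in-sample label overlap can be made concrete with purged validation splits. The sketch below is illustrative, not the production code: it assumes each sample's label is resolved over a fixed forward horizon, and drops any training sample whose label window touches the test range on either side.

```python
def purged_split(n_samples, label_horizon, test_start, test_end):
    """Train/test index split that purges any training sample whose
    forward label window overlaps the test range (guards against leakage).
    Illustrative sketch; assumes sample i's label spans [i, i + label_horizon)."""
    test = list(range(test_start, test_end))
    train = [
        i for i in range(n_samples)
        # keep i only if its label window ends before the test range starts,
        # or starts after every test label window has resolved
        if i + label_horizon <= test_start or i >= test_end + label_horizon
    ]
    return train, test
```

Note the embargo is symmetric: samples just after the test window are also purged, because their features could encode information from the test labels' resolution period.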

“The backtest should be the pessimistic twin of live. If it is ever the optimistic one, the model is already lying to you.”

2. Commission is not an afterthought

A high-frequency-ish strategy on ES is a game of pennies. Commission and exchange fees are the pennies you pay before the market gets a turn. We price them into every simulated fill at exactly the schedule we pay live. There is no “gross of fees” column anywhere in the believe codebase.
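The arithmetic is trivial, which is exactly why it is easy to wave away. A minimal sketch of a net-only P&L calculation for an ES round trip; the commission figure is a hypothetical stand-in, not the actual schedule:

```python
POINT_VALUE = 50.0           # ES: $50 per index point per contract
COMMISSION_PER_SIDE = 2.25   # hypothetical all-in commission + fees, per contract per side

def net_pnl(entry_price, exit_price, qty, side=1):
    """Round-trip P&L net of commission. side=+1 for long, -1 for short.
    There is deliberately no 'gross' return value."""
    gross = (exit_price - entry_price) * side * qty * POINT_VALUE
    fees = 2 * COMMISSION_PER_SIDE * qty   # charged on entry and exit
    return gross - fees
```

Two ticks of edge on one contract is $25.00 gross; at this hypothetical schedule it is $20.50 net, an 18% haircut before slippage gets a turn.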

If a strategy only works gross of commission, it does not work.

3. Slippage is a regime variable

Slippage is not a constant. It is a function of the order book, news flow, time of day, and whether there is anyone on the other side. Our backtester applies a 1-tick penalty on any market-order fallback and a random 1-2 tick slippage on stop fills, both modelled as worst-reasonable cases. Marketable limits pay zero entry slippage when the book supports the fill — and when it does not, the trade simply does not exist in the simulated P&L.
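The fill rules above can be sketched in a few lines. This is a simplified illustration of the stated policy, not the backtester itself; the function names are hypothetical:

```python
import random

TICK = 0.25  # ES tick size in index points

def fill_price(order_type, ref_price, side, rng=random):
    """Worst-reasonable-case fill model per the text. side=+1 buy, -1 sell,
    so slippage is always adverse. A marketable limit either fills at its
    price or does not exist in the simulated P&L (handled by the caller)."""
    if order_type == "limit":
        return ref_price                           # zero slippage when the book supports it
    if order_type == "market":
        return ref_price + side * TICK             # fixed 1-tick penalty on fallback
    if order_type == "stop":
        return ref_price + side * rng.choice([1, 2]) * TICK  # random 1-2 ticks
    raise ValueError(f"unknown order type: {order_type}")
```

The asymmetry is deliberate: the simulator never lets an order be filled more kindly than live plausibly would.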

The difference between modelling slippage honestly and modelling it charitably is the difference between a strategy that ships and one that quietly gets retired.

4. Regime detection is a haircut, not a strategy

Retail quant Twitter is full of “regime-switching” strategies that look perfect in backtest because they quietly assume the regime label is known in advance. We treat regime as a haircut on size, not a toggle on direction. If the microstructure model’s rolling AUC falls below a threshold on the latest week of live ticks, the head trades at lower confidence weight — it does not flip.
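The haircut-not-toggle idea reduces to a monotone weight on size. A minimal sketch, with illustrative AUC thresholds that are assumptions rather than the live values:

```python
def confidence_weight(rolling_auc, floor_auc=0.53, full_auc=0.58):
    """Map the model's rolling live AUC to a position-size multiplier.
    Degrading AUC shrinks size toward a floor; it never flips direction.
    Thresholds and the floor weight are illustrative, not production values."""
    if rolling_auc >= full_auc:
        return 1.0
    if rolling_auc <= floor_auc:
        return 0.25   # minimum weight: still trading, still same-direction
    # linear ramp between the floor and full confidence
    frac = (rolling_auc - floor_auc) / (full_auc - floor_auc)
    return 0.25 + 0.75 * frac
```

Because the output multiplies size and never touches sign, a mislabelled regime costs basis points, not a reversed book.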

Flipping direction based on regime is a recipe for trading last week’s tape with this week’s money.

5. One instrument, three processes, no shortcuts

We do not trade CL, NQ, gold, bitcoin, or anything else. We trade ES. Three independent decision processes on ES are strictly harder to overfit than one strategy sprinkled across ten instruments. Adding instruments to smooth a P&L curve is the quantitative equivalent of putting a hat on a bald patch — the pattern shows up the moment the hat moves.

When ES stops working, we intend to find out why. We are not interested in hiding from the question.

6. Infrastructure is the strategy

believe runs on two kinds of machines, separated on purpose: research machines that train models and never touch the broker, and an execution bridge that talks to the broker and never trains.

The separation is structural. A bug in the training code cannot reach into the broker. A bug in the bridge cannot silently corrupt the training store. The two machines talk only through a tightly scoped, signed data path.

Watchdogs are CME-maintenance-aware. The daily 17:00-18:00 ET maintenance window is the single most common source of false-positive outage alerts. The watchdog knows the schedule and does not page the on-call for a scheduled quiet period.
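The maintenance-aware gate is a few lines of clock arithmetic. A minimal sketch, assuming the watchdog's only scheduled quiet period is the daily CME break (holiday schedules would need the same treatment):

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")

def in_maintenance_window(now: datetime) -> bool:
    """True during the daily 17:00-18:00 ET CME maintenance break."""
    local = now.astimezone(ET)
    return time(17, 0) <= local.time() < time(18, 0)

def should_page(feed_silent: bool, now: datetime) -> bool:
    """Page the on-call for a silent feed only outside the scheduled quiet period."""
    return feed_silent and not in_maintenance_window(now)
```

Doing the comparison in exchange-local time, rather than UTC offsets, keeps the window correct across daylight-saving transitions.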

7. Retraining is a feature, not a panic response

Models decay. That is the cost of trading a non-stationary process. Retraining is a scheduled event with a published cadence, not something we do when the P&L hurts. Panic retraining is the number one way to turn a bad week into a worse quarter.

Each retrain generates a model card containing: training window, feature-importance snapshot, out-of-fold AUC with confidence interval, calibration diagnostics, and the exact shard hashes of the data consumed. A model cannot reach production without a model card.
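The model-card contents listed above map naturally onto a frozen record with a promotion gate. A hypothetical sketch; the field names and the gating rule are illustrative, not the production schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelCard:
    train_window: tuple        # (start, end) of the training window
    feature_importance: dict   # feature name -> importance snapshot
    oof_auc: float             # out-of-fold AUC
    auc_ci: tuple              # (low, high) confidence interval on the AUC
    calibration: dict          # calibration diagnostics
    shard_hashes: tuple        # exact hashes of the data shards consumed

def promotable(card: ModelCard) -> bool:
    """Illustrative gate: the card must pin its data lineage, and the
    AUC confidence interval's lower bound must clear coin-flip."""
    return bool(card.shard_hashes) and card.auc_ci[0] > 0.5
```

Freezing the dataclass and pinning shard hashes makes every production model reproducible from the card alone: same shards, same window, same model.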

8. What we think the counter-trade is

We assume we are wrong in specific, knowable ways, and we try to notice those ways before they hurt.

9. What “believe” actually means

The codename is a deliberate inversion. We do not believe the forecast. We believe the process: purged validation, honest commission, marketable-limit execution, separated infrastructure, scheduled retrains, deterministic replay. If the forecast is right we make money; if the forecast is wrong the execution path ensures we survive to train again.

“We do not need the model to be clever. We need the model to be honest, and the infrastructure around it to be boring.”