1. Headline coverage
Coverage window: Apr 2024 to Apr 2026 for historical ticks; the v133 continuous capture runs 2026-01-29 to 2026-04-17 (952k ticks); MBP-10 runs 2026-01-27 to 2026-04-17 (25M snapshots, Databento + IB live, $125.69 total vendor spend to date). All timestamps are exchange time (UTC, CME convention), verified against our own wall-clock capture before being written to the training store. Tick and DOM stores are separated into dedicated git repositories with Git LFS tracking for the binary shards.
2. Tick stack
The tick history is layered. Older ticks are synthetic 1-second OHLCV from Databento; recent ticks are real event-level trades and quotes; live ticks come from our own Sierra Chart capture running continuously since 2026-01-28.
| Range | Source | Level | Notes |
|---|---|---|---|
| 2024-04 → 2026-02 | Databento OHLCV-1s | Synthetic 1s bars | Used for long-horizon feature windows (VPIN, RV, regime stats) |
| 2026-02 → 2026-04 | Databento MBP-10 | Real ticks (trade + quote) | Training and backtest fuel for short-horizon models |
| 2026-01-29 → live | Sierra Chart capture | Real ticks, native feed | Live capture on our own VM; forms the deterministic replay fold (952k ticks in v133) |
Synthetic OHLCV-1s is not a substitute for real ticks. We do not train short-horizon models on it. It exists so that long-horizon features (e.g. 20-day realised volatility, 6-month regime baselines) are well-supported on the day we start a backtest.
3. MBP-10 order book
The corpus holds 25 million MBP-10 snapshots spanning 2026-01-27 to 2026-04-17, covering ES on CME GLBX.MDP3. Each snapshot is a point-in-time state of ten bid levels and ten ask levels, with price and size at each level. Total vendor spend (Databento) to support the corpus is $125.69 through 2026-04-17; IB live depth is captured at no incremental vendor cost.
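As a rough illustration of the snapshot shape described above, the following Python sketch models one ten-level book state. The class name, field names, and the nanosecond timestamp unit are hypothetical and are not taken from the Databento schema or from our store layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

DEPTH = 10  # ten levels per side, as described in the text

# (price, size) pair; this tuple layout is an illustrative assumption
Level = Tuple[float, int]


@dataclass(frozen=True)
class BookSnapshot:
    """Hypothetical container for one point-in-time MBP-10 state."""
    ts_ns: int          # event timestamp; nanoseconds is an assumed unit
    bids: List[Level]   # best bid first, prices descending
    asks: List[Level]   # best ask first, prices ascending

    def __post_init__(self) -> None:
        # Reject snapshots that do not carry exactly ten levels per side.
        if len(self.bids) != DEPTH or len(self.asks) != DEPTH:
            raise ValueError(f"expected {DEPTH} levels per side")

    @property
    def spread(self) -> float:
        """Best ask price minus best bid price."""
        return self.asks[0][0] - self.bids[0][0]

    @property
    def top_imbalance(self) -> float:
        """(bid size - ask size) / (bid size + ask size) at the best level."""
        b, a = self.bids[0][1], self.asks[0][1]
        return (b - a) / (b + a) if (b + a) else 0.0


# Usage: a synthetic book on a 0.25 tick grid (made-up prices and sizes).
snap = BookSnapshot(
    ts_ns=0,
    bids=[(5000.00 - 0.25 * i, 10 + i) for i in range(DEPTH)],
    asks=[(5000.25 + 0.25 * i, 20 + i) for i in range(DEPTH)],
)
print(snap.spread)  # 0.25
```

Validating the level count at construction time keeps malformed snapshots out of anything downstream that indexes levels by position.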
Sources are layered and cross-checked, not a single vendor:
- Databento MBP-10 — the primary historical store. Clean, de-duplicated, event-time aligned.
- IB live depth — our own capture from the IB Gateway on the production VM, used as a real-time ground truth for the deployed model.
- External L2 feed (John’s) — a third-party Level-2 snapshot stream used as an independent consistency check against Databento and IB.
Three-source reconciliation. We run a daily job that compares bid/ask/size at matched timestamps across the three feeds. Discrepancies above a narrow tolerance are flagged before training sees them.
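A minimal sketch of what such a reconciliation pass might look like, assuming each feed has already been reduced to top-of-book quotes keyed by a shared integer timestamp. The feed names, the `TopOfBook` type, and both tolerance values are hypothetical; the actual thresholds are not stated here.

```python
from typing import Dict, List, NamedTuple, Tuple


class TopOfBook(NamedTuple):
    bid: float
    ask: float
    bid_size: int
    ask_size: int


# Hypothetical tolerances, chosen only for illustration.
PRICE_TOL = 0.0  # prices must agree exactly
SIZE_TOL = 5     # contracts


def reconcile(feeds: Dict[str, Dict[int, TopOfBook]]) -> List[Tuple[int, str, str]]:
    """Return (timestamp, feed_a, feed_b) for each pair of feeds that
    disagrees at a timestamp present in every feed."""
    names = sorted(feeds)
    # Only timestamps observed by all feeds can be compared.
    common = set.intersection(*(set(feeds[n]) for n in names))
    flagged: List[Tuple[int, str, str]] = []
    for ts in sorted(common):
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                qa, qb = feeds[a][ts], feeds[b][ts]
                price_off = max(abs(qa.bid - qb.bid), abs(qa.ask - qb.ask))
                size_off = max(abs(qa.bid_size - qb.bid_size),
                               abs(qa.ask_size - qb.ask_size))
                if price_off > PRICE_TOL or size_off > SIZE_TOL:
                    flagged.append((ts, a, b))
    return flagged
```

Comparing every pair, rather than each feed against a single reference, has the side effect of showing which feed is the odd one out: a feed that disagrees with both of the others appears in two flagged pairs for the same timestamp.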
4. 5-minute bar history
The XGB 5m head is trained on closed 5-minute OHLCV bars resampled from the tick store, not pulled from a bar vendor. This is deliberate: closing a bar from our own ticks guarantees that backtest bars match live bars exactly. It also forces us to handle the CME maintenance gap in one place, rather than inheriting whatever a vendor happened to do.
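The resampling idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production resampler: ticks are taken to be `(epoch_seconds, price, size)` tuples sorted by time, bars are aligned to epoch multiples of 300 seconds, and each bar is labelled by its start time. Maintenance-gap handling is not shown.

```python
from typing import Dict, Iterable, List, Tuple

BAR_SECONDS = 300  # 5 minutes

Tick = Tuple[int, float, int]  # (epoch_seconds, price, size); assumed layout


def resample_5m(ticks: Iterable[Tick], now_s: int) -> List[Dict]:
    """Aggregate time-sorted ticks into 5-minute OHLCV bars.

    Only bars whose window has fully elapsed at `now_s` are returned,
    so a still-forming bar is never emitted.
    """
    bars: Dict[int, Dict] = {}
    for ts, price, size in ticks:
        start = ts - ts % BAR_SECONDS  # floor to the bar boundary
        bar = bars.get(start)
        if bar is None:
            bars[start] = {"start": start, "open": price, "high": price,
                           "low": price, "close": price, "volume": size}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last tick seen so far in the window
            bar["volume"] += size
    return [b for s, b in sorted(bars.items()) if s + BAR_SECONDS <= now_s]
```

Running the same function over the historical store and over the live capture is what would make backtest and live bars identical by construction; the `now_s` cutoff is the piece that enforces the closed-bar rule.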
The 52-feature set is decomposed into five families: trend, range, relative position, session context, and volatility regime. Feature importance is tracked each retrain; any single feature accounting for more than 35% of gain triggers a manual review before the model ships.
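The 35% rule reduces to a share-of-total check. The sketch below assumes the per-feature gains are already available as a plain dictionary (with XGBoost, `Booster.get_score(importance_type="total_gain")` is one possible source, though which gain variant the review uses is an assumption here). The function name and the feature names in the usage lines are hypothetical.

```python
from typing import Dict, List

MAX_GAIN_SHARE = 0.35  # threshold quoted in the text


def dominant_features(gain: Dict[str, float],
                      limit: float = MAX_GAIN_SHARE) -> List[str]:
    """Return the features whose share of total gain exceeds `limit`."""
    total = sum(gain.values())
    if total <= 0:
        return []  # nothing to normalise against
    return sorted(name for name, g in gain.items() if g / total > limit)


# Usage with made-up gains: one feature holds 50% of the total.
flagged = dominant_features({"trend_a": 50.0, "range_b": 30.0, "vol_c": 20.0})
if flagged:
    print("manual review required for:", flagged)
```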
5. Data quality and gaps
- CME maintenance window. 17:00-18:00 ET daily. Data is absent by design; the backtester and the live bridge both mask the window so no model trains or trades on phantom ticks.
- Holiday sessions. Partial RTH days are included; we do not exclude them from training, and we do not pretend they are full sessions in the metrics.
- Roll dates. Contract rolls are handled through a canonical front-month series with back-adjusted price only for long-horizon feature windows. Short-horizon features never cross a roll boundary.
- Outages. Capture outages on our side are recorded and excluded from the deterministic replay fold. We do not silently smooth over them.
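The maintenance-window mask from the first bullet can be sketched as a simple wall-clock filter. This sketch assumes timestamps have already been converted to Eastern Time upstream (for example with `zoneinfo.ZoneInfo("America/New_York")`), and it treats the window as half-open, 17:00 inclusive to 18:00 exclusive; both are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, time
from typing import Iterable, List, Tuple

MAINT_START = time(17, 0)  # 17:00 ET
MAINT_END = time(18, 0)    # 18:00 ET, treated as exclusive (assumption)

Tick = Tuple[datetime, float]  # (ET wall-clock timestamp, price); assumed layout


def in_maintenance(ts_et: datetime) -> bool:
    """True if an ET wall-clock timestamp falls inside the daily window."""
    return MAINT_START <= ts_et.time() < MAINT_END


def mask_ticks(ticks: Iterable[Tick]) -> List[Tick]:
    """Drop any tick stamped inside the maintenance window."""
    return [t for t in ticks if not in_maintenance(t[0])]
```

Keeping the predicate in one function that both the backtester and the live bridge import is one way to ensure the two paths mask exactly the same interval.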
6. Training store and retention
The training store is a tiered layout of Parquet shards on the Linux training cluster.
Each shard is content-addressed: {instrument}/{feed}/{date}/{shard}.parquet.
Every F2_dom training run records the exact shard hashes it consumed; reproducing a model
card from scratch requires nothing more than the hash list and the training recipe.
We retain every shard the system has ever trained on. Data is never overwritten.
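Recording and re-checking shard hashes might look like the following sketch. SHA-256 is an assumed choice (the hash algorithm is not named above), and the function names and manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def sha256_file(path: Path, chunk: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large shards are never fully in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(shards: Iterable[Path]) -> Dict[str, str]:
    """Map each shard path to its digest; this is what a run would record."""
    return {str(p): sha256_file(p) for p in sorted(shards)}


def verify_manifest(manifest: Dict[str, str]) -> List[str]:
    """Return the paths whose current contents no longer match the manifest."""
    return [p for p, digest in manifest.items()
            if sha256_file(Path(p)) != digest]
```

An empty list from `verify_manifest` would indicate that the shards on disk are byte-identical to what the training run consumed, which is the precondition for reproducing a model card from the hash list and the recipe.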
7. Live tick capture since 2026-01-28
A dedicated Sierra Chart instance on the Azure Windows VM captures every tick we observe into a compressed append-only file. This is the most important fold in the backtester: it is the only set of ticks where we can run a deterministic replay of what the live system saw, alongside the trade journal from IB, and prove 1:1 parity between backtest and live.
Each completed day of v133 capture adds another row to the walk-forward fold. The 79-day figure on the home page refers to this capture, which now spans 2026-01-29 → 2026-04-17 and holds 952k event-level ticks.
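The 79-day figure is consistent with an inclusive calendar-day count over that span, as the short check below shows. Reading the figure as calendar days rather than trading sessions is an assumption, and the helper name is hypothetical.

```python
from datetime import date


def inclusive_days(first: date, last: date) -> int:
    """Number of calendar days from `first` to `last`, counting both ends."""
    return (last - first).days + 1


span = inclusive_days(date(2026, 1, 29), date(2026, 4, 17))
print(span)  # 79
```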