
1. Headline coverage

952k live ticks (v133)
25M MBP-10 snapshots
79 days of live tick capture
ES · CME GLBX.MDP3

Coverage window: Apr 2024 to Apr 2026 for historical ticks; the v133 continuous capture runs 2026-01-29 to 2026-04-17 (952k ticks); MBP-10 runs 2026-01-27 to 2026-04-17 (25M snapshots, Databento + IB live, $125.69 total vendor spend to date). All timestamps are exchange time (UTC, CME convention), verified against our own wall-clock capture before being written to the training store. Tick and DOM stores are separated into dedicated git repositories with Git LFS tracking for the binary shards.
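The pre-write timestamp check can be sketched as follows. This is an illustrative sketch, not the production code: the function name, the two-second tolerance, and the record shape are all assumptions; the only facts taken from the text are that timestamps are exchange time (UTC, CME convention) and are verified against our own wall clock before being written to the training store.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance for exchange-time vs wall-clock disagreement;
# the real threshold is not stated in the document.
MAX_SKEW = timedelta(seconds=2)

def accept_tick(exchange_ts: datetime, capture_ts: datetime) -> bool:
    """Accept a tick only when the exchange timestamp (UTC, CME
    convention) agrees with our local capture clock within MAX_SKEW."""
    if exchange_ts.tzinfo is None or capture_ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware (UTC)")
    return abs(capture_ts - exchange_ts) <= MAX_SKEW

ex = datetime(2026, 4, 17, 13, 30, 0, tzinfo=timezone.utc)
ok = accept_tick(ex, ex + timedelta(milliseconds=150))  # small network delay
bad = accept_tick(ex, ex + timedelta(seconds=5))        # clock drift or stale data
```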

2. Tick stack

The tick history is layered. Older ticks are synthetic 1-second OHLCV from Databento; recent ticks are real event-level trades and quotes; live ticks come from our own Sierra Chart capture running continuously since 2026-01-28.

Range | Source | Level | Notes
2024-04 → 2026-02 | Databento OHLCV-1s | Synthetic 1s bars | Used for long-horizon feature windows (VPIN, RV, regime stats)
2026-02 → 2026-04 | Databento MBP-10 | Real ticks (trade + quote) | Training and backtest fuel for short-horizon models
2026-01-29 → live | Sierra Chart capture | Real ticks, native feed | Live capture on our own VM; forms the deterministic replay fold (952k ticks in v133)

Synthetic OHLCV-1s is not a substitute for real ticks. We do not train short-horizon models on it. It exists so that long-horizon features (e.g. 20-day realised volatility, 6-month regime baselines) are well-supported on the day we start a backtest.
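As a concrete illustration of a long-horizon feature the 1s layer supports, here is a minimal sketch of a 20-day realised-volatility feature built from squared 1-second log returns. Column names and the exact estimator are assumptions for illustration; the production feature code is not shown in this document.

```python
import numpy as np
import pandas as pd

def realized_vol_20d(close_1s: pd.Series) -> pd.Series:
    """close_1s: 1-second close prices indexed by UTC timestamp.
    Daily realised variance = sum of squared 1s log returns;
    the feature is the 20-day rolling realised volatility."""
    log_ret = np.log(close_1s).diff()
    daily_rv = (log_ret ** 2).groupby(log_ret.index.date).sum()
    return np.sqrt(daily_rv.rolling(20).mean())

# Synthetic demo data: 22 days of fake 1s closes (not real ES prices).
idx = pd.date_range("2026-01-02", periods=22 * 86400, freq="s", tz="UTC")
rng = np.random.default_rng(0)
prices = pd.Series(5000 + np.cumsum(rng.normal(0, 0.25, len(idx))), index=idx)
rv20 = realized_vol_20d(prices)  # NaN until 20 full days are available
```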

3. MBP-10 order book

25 million MBP-10 snapshots, spanning 2026-01-27 to 2026-04-17, covering ES on CME GLBX.MDP3. Each snapshot is a point-in-time state of ten levels of the bid stack and ten levels of the ask stack, with size and price at each level. Total vendor spend (Databento) to support the corpus is $125.69 through Apr 17 2026; IB live depth is captured at no incremental vendor cost.
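The snapshot shape described above (ten bid levels and ten ask levels, each with price and size, at a point in time) can be sketched as a small record type. This is a sketch for exposition only; field names are assumptions, and the stored schema is whatever the vendor/capture layer emits, not this class.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Mbp10Snapshot:
    ts_event_ns: int                     # exchange timestamp, UTC, in ns (assumed unit)
    bids: Tuple[Tuple[float, int], ...]  # 10 x (price, size), best bid first
    asks: Tuple[Tuple[float, int], ...]  # 10 x (price, size), best ask first

    def __post_init__(self):
        if len(self.bids) != 10 or len(self.asks) != 10:
            raise ValueError("MBP-10 requires exactly ten levels per side")

    @property
    def spread(self) -> float:
        return self.asks[0][0] - self.bids[0][0]

# Illustrative ES-style book: 0.25-point tick, made-up sizes.
snap = Mbp10Snapshot(
    ts_event_ns=1_776_000_000_000_000_000,
    bids=tuple((5000.00 - 0.25 * i, 10 + i) for i in range(10)),
    asks=tuple((5000.25 + 0.25 * i, 10 + i) for i in range(10)),
)
```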

Sources are layered and cross-checked rather than taken from a single vendor.

Three-source reconciliation. We run a daily job that compares bid/ask/size at matched timestamps across the three feeds. Discrepancies above a narrow tolerance are flagged before training sees them.
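The reconciliation step can be sketched as below. This is a hedged illustration, not the production job: the feed names, record layout, and tolerances (one ES tick on price, exact match on size) are assumptions; the document only states that bid/ask/size are compared at matched timestamps and that discrepancies above a narrow tolerance are flagged.

```python
PRICE_TOL = 0.25  # assumed tolerance: one ES tick
SIZE_TOL = 0      # assumed tolerance: sizes must match exactly

def reconcile(rows_by_feed: dict) -> list:
    """rows_by_feed: feed name -> {ts: (bid, ask, bid_size, ask_size)}.
    Returns (ts, field, feed_a, feed_b) tuples for flagged mismatches
    at timestamps present in every feed."""
    feeds = sorted(rows_by_feed)
    common_ts = set.intersection(*(set(rows_by_feed[f]) for f in feeds))
    flags = []
    for ts in sorted(common_ts):
        for a, b in zip(feeds, feeds[1:]):  # compare adjacent feed pairs
            ra, rb = rows_by_feed[a][ts], rows_by_feed[b][ts]
            if abs(ra[0] - rb[0]) > PRICE_TOL:
                flags.append((ts, "bid", a, b))
            if abs(ra[1] - rb[1]) > PRICE_TOL:
                flags.append((ts, "ask", a, b))
            if abs(ra[2] - rb[2]) > SIZE_TOL or abs(ra[3] - rb[3]) > SIZE_TOL:
                flags.append((ts, "size", a, b))
    return flags

# Toy data with one deliberate ask discrepancy (feed names are illustrative).
feeds = {
    "databento": {1: (5000.00, 5000.25, 12, 9)},
    "ib":        {1: (5000.00, 5000.25, 12, 9)},
    "sierra":    {1: (5000.00, 5000.75, 12, 9)},  # ask off by two ticks
}
issues = reconcile(feeds)
```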

4. 5-minute bar history

The XGB 5m head is trained on closed 5-minute OHLCV bars resampled from the tick store, not pulled from a bar vendor. This is deliberate: closing a bar from our own ticks guarantees that backtest bars match live bars exactly. It also forces us to handle the CME maintenance gap in one place, rather than inheriting whatever a vendor happened to do.
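A minimal sketch of that resampling step, with the maintenance gap handled in one place. The 21:00 UTC maintenance hour and the column names are assumptions for illustration (the exact window shifts with US daylight saving); the production resampler is not shown in this document.

```python
import pandas as pd

def ticks_to_5m(ticks: pd.DataFrame) -> pd.DataFrame:
    """ticks: columns ['price', 'size'], indexed by UTC timestamp.
    Returns closed 5-minute OHLCV bars with the assumed CME maintenance
    hour (21:00-22:00 UTC here) dropped."""
    bars = ticks["price"].resample("5min", label="left", closed="left").ohlc()
    bars["volume"] = ticks["size"].resample("5min", label="left", closed="left").sum()
    in_maint = bars.index.hour == 21          # assumed maintenance window
    return bars[~in_maint].dropna(subset=["open"])  # keep only bars with trades

# Toy ticks straddling the assumed maintenance hour.
idx = pd.to_datetime(["2026-04-16 20:57:01", "2026-04-16 20:58:30",
                      "2026-04-16 21:15:00", "2026-04-16 22:01:10"], utc=True)
ticks = pd.DataFrame({"price": [5000.0, 5001.0, 5000.5, 5002.0],
                      "size": [3, 2, 1, 5]}, index=idx)
bars = ticks_to_5m(ticks)  # the 21:15 tick falls in the dropped hour
```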

The 52-feature set is decomposed into five families: trend, range, relative position, session context, and volatility regime. Feature importance is tracked each retrain; any single feature accounting for more than 35% of gain triggers a manual review before the model ships.
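The retrain-time gate described above reduces to a simple check over per-feature gain shares. The gain numbers and feature names below are made up for illustration; in production the gains would come from the trained booster (e.g. XGBoost's per-feature gain importance), and the only figure taken from the text is the 35% threshold.

```python
GAIN_SHARE_LIMIT = 0.35  # from the text: >35% of gain triggers manual review

def importance_gate(gain_by_feature: dict) -> list:
    """Return features whose share of total gain exceeds the limit;
    a non-empty result holds the model for manual review."""
    total = sum(gain_by_feature.values())
    return [f for f, g in gain_by_feature.items() if g / total > GAIN_SHARE_LIMIT]

# Hypothetical gains spanning the five feature families.
gains = {"trend_ema_slope": 40.0, "range_atr_14": 30.0,
         "vol_regime": 120.0, "session_phase": 10.0}
flagged = importance_gate(gains)  # vol_regime holds 60% of total gain
```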

5. Data quality and gaps

6. Training store and retention

The training store is a tiered layout of Parquet shards on the Linux training cluster, laid out as {instrument}/{feed}/{date}/{shard}.parquet, with every shard content-hashed. Every F2_dom training run records the exact shard hashes it consumed; reproducing a model card from scratch requires nothing more than the hash list and the training recipe.
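Recording the consumed shard hashes can be sketched as a small manifest writer. This is an illustrative sketch only: the hash algorithm (SHA-256 here), the JSON manifest format, and the helper names are assumptions; the document states only that each run records the exact shard hashes it consumed.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def shard_sha256(path: Path) -> str:
    """Content hash of one Parquet shard, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(shards, out_path: Path) -> dict:
    """Record path -> hash for every shard a training run consumed."""
    manifest = {str(p): shard_sha256(p) for p in sorted(shards)}
    out_path.write_text(json.dumps(manifest, indent=2))
    return manifest

# Demo against a throwaway file in the assumed {instrument}/{feed}/{date} layout.
tmp = Path(tempfile.mkdtemp())
shard = tmp / "ES" / "mbp10" / "2026-04-17" / "000.parquet"
shard.parent.mkdir(parents=True)
shard.write_bytes(b"fake shard bytes")
manifest = write_manifest([shard], tmp / "manifest.json")
```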

We retain every shard the system has ever trained on. Data is never overwritten.

7. Live tick capture since 2026-01-28

A dedicated Sierra Chart instance on the Azure Windows VM captures every tick we observe into a compressed append-only file. This is the most important fold in the backtester: it is the only set of ticks where we can run a deterministic replay of what the live system saw, alongside the trade journal from IB, and prove 1:1 parity between backtest and live.
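The parity claim reduces to a set comparison between replayed fills and the IB trade journal. The record shape below (timestamp, side, price, size) and the function name are assumptions for illustration; the document states only that the deterministic replay is checked 1:1 against the live trade journal.

```python
def parity_report(live_fills, replay_fills):
    """Each fill: (ts_ns, side, price, size) tuple.
    Returns (fills only seen live, fills only seen in replay);
    1:1 parity means both lists are empty."""
    live_set, replay_set = set(live_fills), set(replay_fills)
    only_live = sorted(live_set - replay_set)    # live fills the replay missed
    only_replay = sorted(replay_set - live_set)  # replayed fills that never happened live
    return only_live, only_replay

# Toy journal: two fills, identical in both sources.
live = [(1, "B", 5000.25, 1), (2, "S", 5001.00, 1)]
replay = [(1, "B", 5000.25, 1), (2, "S", 5001.00, 1)]
missing_live, missing_replay = parity_report(live, replay)
```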

Every day closed out of v133 becomes another row of the walk-forward fold. The 79-day figure on the home page is this capture, which now spans 2026-01-29 → 2026-04-17 at 952k event-level ticks.