Top-of-Book Microstructure Signals for Short-Horizon Midprice Direction

This project tests whether a small, interpretable set of top-of-book state and recent trade-flow variables contains out-of-sample predictive information about short-horizon BTCUSDT midprice direction.

The study is deliberately narrow:

one instrument: Binance USDT-M perpetual BTCUSDT
top-of-book quotes and aggregate trades only
event-time labels at 10, 20, and 50 distinct quote-state updates
chronological train / validation / test splits
simple baselines and multinomial logistic regression
explicit leakage, split-boundary, and timestamp-ordering audits

The purpose is not to claim a tradable strategy. It is to test whether a weak statistical signal survives an honest out-of-sample protocol, and to document where it breaks.

Current Status

The current canonical experiment is the v2_float64_features 7-day checkpoint:

symbol: BTCUSDT
sample: 2024-03-01 to 2024-03-07 UTC
artifact tag: v2_float64_features
quote-event rows after feature/label construction: about 223 million
model-eligible final test rows: 28,715,045
mid_return_5 and realized_vol_20 are computed and stored as float64
the log-return-derived feature precision audit now passes
compact prediction/probability summary diagnostics are included under outputs/reports
plots and a current result manifest are available under outputs/reports
selected plots are discussed in docs/current_results.md
frozen final-test evaluation is complete for this checkpoint

The research protocol targets 30 clean calendar days, or at least 14 if 30 are not available. This 7-day v2 run should therefore be treated as a checkpoint, not the final generalization claim. A longer 14-day replication remains pending.

Research Question

At each distinct top-of-book quote-state event, can current book state and recent trade flow predict whether the midprice moves down, stays unchanged, or moves up after a short event-time horizon?

For horizon h, the label is:

y_h = sign(midprice[t + h] - midprice[t])

The primary task is ternary classification over:

-1 = down
 0 = unchanged
 1 = up

The non-zero subset is also reported as a diagnostic, but it is not the headline task.
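
A minimal sketch of the label definition above, assuming midprices are held in a NumPy array indexed by distinct quote-state event. The function name event_time_labels is illustrative and not the repository's API. The last h events have no future midprice, so the sketch leaves them as NaN instead of assigning a label.

```python
import numpy as np

def event_time_labels(mid, h):
    """Return sign(mid[t + h] - mid[t]); the last h entries are NaN."""
    if h < 1:
        raise ValueError("h must be a positive number of events")
    mid = np.asarray(mid, dtype=np.float64)
    labels = np.full(mid.shape[0], np.nan)
    if h < mid.shape[0]:
        # -1 = down, 0 = unchanged, 1 = up
        labels[:-h] = np.sign(mid[h:] - mid[:-h])
    return labels

# Toy midprice path over six distinct quote-state events.
example_mid = np.array([100.0, 100.0, 100.5, 100.5, 100.0, 100.5])
example_labels = event_time_labels(example_mid, h=2)
```

Rows with NaN labels would be excluded from the model-eligible set for that horizon.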

Data and Event Construction

Raw data comes from Binance public USDT-M futures daily files:

bookTicker: top-of-book quotes
aggTrades: aggregate trades

The unit of analysis is a distinct quote-state event. Consecutive duplicate quote states with identical bid, ask, bid size, and ask size are collapsed. Event time is therefore the index of distinct top-of-book states, not raw message count.
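
The collapsing step can be sketched as follows, assuming the quotes are already sorted in arrival order in a pandas DataFrame. The column names (bid_price, ask_price, bid_size, ask_size) and the helper name are assumptions for illustration; the actual event_builder_opt.py may differ.

```python
import numpy as np
import pandas as pd

# Assumed column names for the four top-of-book state fields.
STATE_COLS = ["bid_price", "ask_price", "bid_size", "ask_size"]

def collapse_duplicate_states(quotes):
    """Keep a row only if its book state differs from the previous row."""
    state = quotes[STATE_COLS]
    # The first row compares against NaN and is therefore always kept.
    changed = state.ne(state.shift()).any(axis=1)
    events = quotes.loc[changed].reset_index(drop=True)
    # Event time is the index of distinct states, not raw message count.
    events["event_index"] = np.arange(len(events))
    return events

raw_quotes = pd.DataFrame({
    "bid_price": [100.0, 100.0, 100.0, 100.0],
    "ask_price": [100.5, 100.5, 100.5, 100.5],
    "bid_size": [1.0, 1.0, 2.0, 1.0],
    "ask_size": [3.0, 3.0, 3.0, 3.0],
})
events = collapse_duplicate_states(raw_quotes)
```

Only consecutive duplicates are collapsed: a state that reappears after an intervening change counts as a new event, as in the last row of the example.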

Quote rows are rejected if:

bid price exceeds ask price
bid or ask price is non-positive
displayed bid or ask size is non-positive
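
As an illustration, the three rejection rules can be written as one boolean mask. This is a sketch with assumed column names, not the code in quality_checks.py. Note that a locked book (bid equal to ask) passes, because the rule only rejects a bid that exceeds the ask.

```python
import pandas as pd

def valid_quote_mask(quotes):
    """True for rows that pass all three quote sanity rules."""
    return (
        (quotes["bid_price"] <= quotes["ask_price"])  # bid must not exceed ask
        & (quotes["bid_price"] > 0)
        & (quotes["ask_price"] > 0)                   # prices strictly positive
        & (quotes["bid_size"] > 0)
        & (quotes["ask_size"] > 0)                    # sizes strictly positive
    )

sample_quotes = pd.DataFrame({
    "bid_price": [100.0, 101.0, 100.0, 0.0, 100.0],
    "ask_price": [100.5, 100.5, 100.0, 1.0, 100.5],
    "bid_size": [1.0, 1.0, 1.0, 1.0, 0.0],
    "ask_size": [1.0, 1.0, 1.0, 1.0, 1.0],
})
sample_mask = valid_quote_mask(sample_quotes)
```

In this sketch, rows with missing values also fail the mask, because comparisons against NaN evaluate to False.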

The data quality report also checks quote/trade timestamp gaps and invalid records by UTC day before event construction.

Features

The main feature set is intentionally small and interpretable.

Book-state features:

relative spread
queue imbalance
log bid size
log ask size
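
The README does not spell out the formulas for these features. The sketch below uses common conventions as assumptions: spread divided by midprice for relative spread, and (bid size - ask size) / (bid size + ask size) for queue imbalance. The repository's feature builder may define them differently.

```python
import numpy as np

def book_state_features(bid, ask, bid_size, ask_size):
    """Book-state features under assumed, conventional definitions."""
    bid = np.asarray(bid, dtype=np.float64)
    ask = np.asarray(ask, dtype=np.float64)
    bid_size = np.asarray(bid_size, dtype=np.float64)
    ask_size = np.asarray(ask_size, dtype=np.float64)
    mid = 0.5 * (bid + ask)
    return {
        "relative_spread": (ask - bid) / mid,
        # Ranges from -1 (all size on the ask) to 1 (all size on the bid).
        "queue_imbalance": (bid_size - ask_size) / (bid_size + ask_size),
        "log_bid_size": np.log(bid_size),
        "log_ask_size": np.log(ask_size),
    }

book_example = book_state_features([99.0], [101.0], [3.0], [1.0])
```

The quote rejection rules guarantee positive prices and sizes, so the divisions and logarithms are well defined on accepted rows.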

Short-memory quote-state features:

change in bid size
change in ask size
5-event log midprice return
20-event realized volatility proxy
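
A hedged sketch of the two log-return-derived features, computed in float64. The exact definition of the realized volatility proxy is not given here, so the sketch assumes the square root of the sum of squared one-event log returns over the last 20 events. The function name is illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def short_memory_features(mid, ret_window=5, vol_window=20):
    """Return (mid_return, realized_vol) as float64 arrays with NaN warm-up."""
    log_mid = np.log(np.asarray(mid, dtype=np.float64))
    n = log_mid.shape[0]

    # k-event log return: log(mid[t]) - log(mid[t - k]).
    mid_return = np.full(n, np.nan)
    if n > ret_window:
        mid_return[ret_window:] = log_mid[ret_window:] - log_mid[:-ret_window]

    # Assumed proxy: sqrt of summed squared one-event log returns,
    # using only the vol_window returns that end at event t.
    realized_vol = np.full(n, np.nan)
    if n > vol_window:
        squared = np.diff(log_mid) ** 2
        windows = sliding_window_view(squared, vol_window)
        realized_vol[vol_window:] = np.sqrt(windows.sum(axis=1))
    return mid_return, realized_vol

# Toy path with a constant one-event log return of 0.01.
toy_mid = np.exp(0.01 * np.arange(30))
toy_ret5, toy_rv20 = short_memory_features(toy_mid)
```

Both features use only the current and past midprices. When the midprice does not change for 20 consecutive events, the proxy is exactly zero.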

Recent trade-flow features over the previous one second:

trade intensity
signed trade count imbalance
signed trade volume imbalance

Trade-flow features use trades in the interval:

(quote_timestamp - 1000 ms, quote_timestamp)

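Both endpoints are open. In particular, trades whose timestamp exactly equals the quote-event timestamp are excluded, because their ordering relative to the quote update cannot be established from the timestamp alone.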
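
The window lookup can be sketched with np.searchsorted on sorted millisecond timestamps. Several details are assumptions for illustration: trade intensity is taken to be the trade count in the window, the imbalances are normalized to [-1, 1] and set to 0 when the window is empty, and a trade is signed as seller-initiated when the aggTrades buyer-is-maker flag is true.

```python
import numpy as np

def trade_flow_features(quote_ts, trade_ts, trade_qty, is_buyer_maker,
                        window_ms=1000):
    """Trade-flow features over the open interval (t - window_ms, t)."""
    quote_ts = np.asarray(quote_ts, dtype=np.int64)
    trade_ts = np.asarray(trade_ts, dtype=np.int64)  # must be sorted ascending
    qty = np.asarray(trade_qty, dtype=np.float64)
    # Assumed sign convention: buyer is maker means the seller was the aggressor.
    sign = np.where(np.asarray(is_buyer_maker, dtype=bool), -1.0, 1.0)

    # First trade strictly after t - window_ms.
    lo = np.searchsorted(trade_ts, quote_ts - window_ms, side="right")
    # First trade at or after t, so equal-timestamp trades are excluded.
    hi = np.searchsorted(trade_ts, quote_ts, side="left")

    def window_sum(values):
        prefix = np.concatenate(([0.0], np.cumsum(values)))
        return prefix[hi] - prefix[lo]

    count = (hi - lo).astype(np.float64)
    volume = window_sum(qty)
    with np.errstate(divide="ignore", invalid="ignore"):
        count_imb = np.where(count > 0, window_sum(sign) / count, 0.0)
        volume_imb = np.where(volume > 0, window_sum(sign * qty) / volume, 0.0)
    return {
        "trade_intensity": count,
        "signed_count_imbalance": count_imb,
        "signed_volume_imbalance": volume_imb,
    }

# Trades at 0 ms and 1000 ms fall on the open endpoints and are excluded.
flow = trade_flow_features(
    quote_ts=[1000],
    trade_ts=[0, 500, 1000, 1000],
    trade_qty=[1.0, 2.0, 3.0, 4.0],
    is_buyer_maker=[False, True, False, False],
)
```

In the example only the trade at 500 ms is counted, and it is seller-initiated under the assumed sign convention.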

Splits and Leakage Controls

The split is chronological and timestamp-based:

train: first 60% of the sample window
validation: next 20%
test: final 20%
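
A minimal sketch of a timestamp-based 60/20/20 assignment. It assumes the sample window runs from the first to the last event timestamp, which may differ from the calendar window used by split_config.py. The cutoffs are points in time, so the splits need not contain 60/20/20 of the rows.

```python
import numpy as np

def chronological_split(timestamps, train_frac=0.6, val_frac=0.2):
    """Assign each event to train / validation / test by timestamp."""
    ts = np.asarray(timestamps, dtype=np.int64)
    start, end = ts.min(), ts.max()
    span = float(end - start)
    train_end = start + train_frac * span
    val_end = start + (train_frac + val_frac) * span
    return np.where(
        ts < train_end, "train",
        np.where(ts < val_end, "validation", "test"),
    )

split_example = chronological_split(np.arange(10))
```

Because the assignment depends only on time, no event in validation or test precedes any event in train.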

Important controls:

no random split
no shuffling
scalers fit on train only through an sklearn pipeline
features use only current and past quote/trade information
labels use future midprices only for the target
the last 50 labeled observations of train and validation are marked as boundary drops
validation is used for threshold selection and diagnostics
final test is evaluated once for final reporting
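
Two of these controls can be sketched together: the boundary drops and the train-only scaler. The last 50 rows of train carry labels that look up to 50 events ahead, into the validation period, which is why they are dropped. The helper name and the synthetic data below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def boundary_drop_mask(split, max_horizon=50):
    """True for the last max_horizon rows of train and of validation."""
    split = np.asarray(split)
    drop = np.zeros(split.shape[0], dtype=bool)
    for name in ("train", "validation"):
        idx = np.flatnonzero(split == name)
        drop[idx[-max_horizon:]] = True
    return drop

# Synthetic stand-in data in chronological order (not real quotes).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = np.digitize(X[:, 0] + 0.5 * rng.normal(size=400), [-0.5, 0.5]) - 1
split = np.array(["train"] * 240 + ["validation"] * 80 + ["test"] * 80)

drop = boundary_drop_mask(split)
keep_train = (split == "train") & ~drop
keep_val = (split == "validation") & ~drop

# The scaler lives inside the pipeline, so it is fit on train rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X[keep_train], y[keep_train])
val_pred = model.predict(X[keep_val])
```

Calling predict on validation or test rows reuses the train-fitted scaling statistics and never refits them.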

The audit checks:

label alignment against manually recomputed event-time labels
split-boundary target spillover
quote-feature timing
float64 recomputation of log-return-derived features
equal-timestamp trade exclusion
target drift across train / validation / test
daily test stability
coefficient sign interpretation
confusion matrix behavior under class imbalance
spread-relative label magnitude
protocol-defined regime performance by spread, realized volatility, and trade intensity

Note on the realized-volatility regime: the train median is zero, so the high-volatility bucket should be interpreted as a nonzero-volatility / more active regime rather than a balanced half-sample split.
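
The effect of a zero train median can be seen in a small sketch. It assumes the high bucket is defined as strictly above the train median, which is an assumption about run_regime_analysis.py, and uses made-up values.

```python
import numpy as np

def regime_labels(train_values, values):
    """Bucket values as high/low using a cutoff estimated on train only."""
    cutoff = float(np.median(train_values))
    labels = np.where(np.asarray(values) > cutoff, "high", "low")
    return cutoff, labels

# Made-up train volatilities: most events have exactly zero realized vol.
train_vol = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2])
vol_cutoff, vol_regime = regime_labels(train_vol, train_vol)
high_share = float(np.mean(vol_regime == "high"))
```

Here the cutoff is 0.0 and only a quarter of the rows land in the high bucket, so the two buckets are far from equal in size.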

Audit summary:

critical_audit_passed: true
leakage_evidence: no_evidence_found
main_caveat: pass
log_feature_precision_status: pass

Models

The main comparison uses:

majority_baseline: predicts the train majority class
queue_imbalance_logistic: multinomial logistic regression using queue imbalance only
full_logistic: multinomial logistic regression using all 11 features
full_logistic_thresholded: secondary validation-selected probability threshold rule

The primary model is full_logistic. The thresholded model is a secondary operating point selected on validation macro F1 before final test evaluation.
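
The exact form of the threshold rule is not described here. As one plausible sketch, the rule below predicts a direction whenever the larger of the down/up probabilities reaches a threshold tau, and picks tau on validation macro F1. The rule, the function names, and the grid are assumptions, not the repository's implementation.

```python
import numpy as np
from sklearn.metrics import f1_score

CLASSES = np.array([-1, 0, 1])  # probability columns are assumed in this order

def thresholded_predict(proba, tau):
    """Predict down/up when its probability reaches tau, else unchanged."""
    proba = np.asarray(proba, dtype=np.float64)
    p_down, p_up = proba[:, 0], proba[:, 2]
    direction = np.where(p_up >= p_down, 1, -1)
    return np.where(np.maximum(p_down, p_up) >= tau, direction, 0)

def select_threshold(proba_val, y_val, grid):
    """Return the grid value with the best validation macro F1."""
    scores = [
        f1_score(y_val, thresholded_predict(proba_val, tau),
                 average="macro", labels=CLASSES, zero_division=0)
        for tau in grid
    ]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])

# Toy validation probabilities where argmax would always say "unchanged".
proba_val = np.array([[0.3, 0.6, 0.1], [0.1, 0.6, 0.3], [0.1, 0.8, 0.1]])
y_val = np.array([-1, 1, 0])
best_tau, best_score = select_threshold(proba_val, y_val, grid=[0.25, 0.5])
```

The selected threshold would then be frozen and applied unchanged to the final test split.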

Frozen Final Test Results

Final test metrics for the current 7-day v2 frozen experiment:

Horizon	Model	Accuracy	Macro F1	Balanced Accuracy
10	majority baseline	0.798	0.296	0.333
10	queue-imbalance logistic	0.798	0.296	0.333
10	full logistic	0.821	0.454	0.423
10	full logistic, thresholded	0.770	0.584	0.600
20	majority baseline	0.686	0.271	0.333
20	queue-imbalance logistic	0.686	0.271	0.333
20	full logistic	0.724	0.435	0.423
20	full logistic, thresholded	0.705	0.628	0.647
50	majority baseline	0.468	0.213	0.333
50	queue-imbalance logistic	0.468	0.213	0.333
50	full logistic	0.556	0.443	0.456
50	full logistic, thresholded	0.657	0.660	0.664

The full-feature logistic model beats both baselines on final-test macro F1 and balanced accuracy at all three horizons. The default argmax decision rule remains conservative and predicts unchanged very often. The validation-selected thresholded rule sacrifices some unchanged-class accuracy but recovers substantially more down/up recall: overall accuracy falls at horizons 10 and 20 and rises at horizon 50, while macro F1 and balanced accuracy improve at every horizon.
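
The majority-baseline rows follow directly from the metric definitions. A classifier that always predicts one of three classes has recall 1 on that class and 0 on the others, so balanced accuracy is 1/3. If the predicted class has share p, its F1 is 2p / (1 + p) and macro F1 is one third of that. The sketch below checks this with plain NumPy metrics on synthetic labels; the helper names are illustrative.

```python
import numpy as np

def balanced_accuracy_and_macro_f1(y_true, y_pred, classes=(-1, 0, 1)):
    """Mean per-class recall and mean per-class F1 over the given classes."""
    recalls, f1s = [], []
    for c in classes:
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(recalls)), float(np.mean(f1s))

def majority_macro_f1(p):
    """Closed form for an always-majority predictor over three classes."""
    return (2 * p / (1 + p)) / 3

# Synthetic labels with a 0.798 unchanged share, as at horizon 10.
y_true = np.array([0] * 798 + [-1] * 101 + [1] * 101)
y_pred = np.zeros_like(y_true)
bal_acc, macro_f1 = balanced_accuracy_and_macro_f1(y_true, y_pred)

closed_form = {h: round(majority_macro_f1(acc), 3)
               for h, acc in [(10, 0.798), (20, 0.686), (50, 0.468)]}
```

The closed form gives 0.296, 0.271, and 0.213 for the three horizons, matching the majority-baseline macro F1 column.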

Non-zero subset performance for the thresholded model:

Horizon	Non-zero Accuracy	Non-zero Macro F1	Non-zero Balanced Accuracy
10	0.476	0.418	0.477
20	0.591	0.485	0.591
50	0.681	0.530	0.681

Interpretation

The result supports a narrow statistical claim:

In this frozen 7-day BTCUSDT sample, simple top-of-book state and recent trade-flow features contain out-of-sample information about short-horizon event-time midprice direction, relative to majority and queue-imbalance-only baselines.

The result does not support a trading claim. The project does not model:

fees
latency
queue position
fill probability
market impact
adverse selection
executable PnL
cross-venue or cross-asset generalization

The audit also found strong target drift: validation and test contain materially more directional labels than train, especially at longer horizons. This makes random-split style evaluation inappropriate and reinforces the need for replication on a fresh, longer sample.
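
Target drift of this kind can be summarized as class shares per split. The sketch below uses pandas and made-up labels; it is not the code in target_diagnostics.py, and the numbers are not the project's.

```python
import pandas as pd

def class_shares_by_split(labels, split):
    """Row-normalized table of label shares for each split."""
    return pd.crosstab(
        pd.Series(split, name="split"),
        pd.Series(labels, name="label"),
        normalize="index",
    )

# Made-up labels: the later split has more directional (non-zero) labels.
toy_split = ["train"] * 5 + ["test"] * 5
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, -1, 1]
shares = class_shares_by_split(toy_labels, toy_split)
directional_share = 1.0 - shares[0]
```

Comparing directional_share across splits shows at a glance whether the label distribution shifted between train and the later periods.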

Limitations

The current frozen experiment is a 7-day v2 checkpoint. The final protocol target is 30 clean days, or at least 14 if 30 are not available, so the result should not be read as a full-sample generalization claim.
The study covers one instrument and one venue: Binance USDT-M BTCUSDT.
The model is deliberately simple and linear. This is useful for auditability and interpretation, but it does not test more flexible nonlinear learners.
Validation and test contain materially more directional labels than train. The target drift is reported explicitly and motivates replication on a longer sample.
Regime-performance tables are included, but the current regime conclusions are still based on the 7-day checkpoint sample.

Repository Structure

src/
  artifact_naming.py             # tagged artifact filename helpers
  data_loader.py                 # Binance download/load/parsing helpers
  quality_checks.py              # raw quote/trade quality checks
  event_builder_opt.py           # memory-lean distinct quote-event construction
  feature_builder_opt_float_64.py # v2 float64 log-return-derived feature builder
  split_config.py                # chronological split and boundary drops
  modeling/
    majority_baseline.py
    logistic_models.py
  evaluation/
    metrics.py
    model_comparison.py
    probability_diagnostics.py
    target_diagnostics.py

scripts/
  load_data.py
  quality_report.py
  build_events.py
  build_features_labels_opt.py
  build_splits.py
  run_majority_baseline.py
  run_queue_imbalance_logistic.py
  run_full_logistic.py
  run_probability_diagnostics.py
  run_final_test_evaluation.py
  run_regime_analysis.py
  run_research_audit.py
  save_compact_test_predictions.py
  make_result_plots.py
  write_current_result_manifest.py
  smoke_test_metrics.py

docs/
  microstructure_signal_memo.md
  current_results.md

outputs/
  reports/
  results/
  logs/

Setup

The pinned conda environment is named quant_env:

conda activate quant_env

For a fresh environment, create the conda environment from environment.yml:

conda env create -f environment.yml
conda activate quant_env

Alternatively, create a virtual environment and install pinned pip dependencies:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Run the lightweight metric smoke test:

python scripts/smoke_test_metrics.py

1-Day Pipeline Smoke Run

This run exercises the full pipeline shape on a smaller date range. It is for installation and pipeline validation only; do not compare its metrics to the frozen 7-day research result.

If raw zip files are already present locally, omit --download.

START=2024-03-01
END=2024-03-01
TAG=v2_float64_features

python scripts/load_data.py --start "$START" --end "$END" --symbol BTCUSDT --download
python scripts/quality_report.py --start "$START" --end "$END" --symbol BTCUSDT
python scripts/build_events.py --start "$START" --end "$END" --symbol BTCUSDT
python scripts/build_features_labels_opt.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG"
python scripts/build_splits.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG"
python scripts/target_diagnostics.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG"
python scripts/run_majority_baseline.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG"
python scripts/run_queue_imbalance_logistic.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --max-iter 50
python scripts/run_full_logistic.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --solver saga --max-iter 50 --n-jobs -1
python scripts/run_probability_diagnostics.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --solver saga --max-iter 50 --n-jobs -1
python scripts/run_final_test_evaluation.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --solver saga --max-iter 50 --n-jobs -1
python scripts/run_regime_analysis.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --splits test

Reproducing the Current Pipeline

The current frozen run uses:

START=2024-03-01
END=2024-03-07
SYMBOL=BTCUSDT
TAG=v2_float64_features

The full 7-day run is compute-heavy. It processes hundreds of millions of quote-event rows and can produce tens of GB of local data/output artifacts. Before running the full pipeline, use the 1-day smoke run above to check that the local environment, data paths, and scripts work.

Download and parse local Binance files:

python scripts/load_data.py --start "$START" --end "$END" --symbol "$SYMBOL" --download

If the raw zip files are already present, omit --download:

python scripts/load_data.py --start "$START" --end "$END" --symbol "$SYMBOL"

Build the data quality report:

python scripts/quality_report.py --start "$START" --end "$END" --symbol "$SYMBOL"

Build distinct quote events:

python scripts/build_events.py --start "$START" --end "$END" --symbol "$SYMBOL"

Build labels and float64 log-return-derived features:

python scripts/build_features_labels_opt.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"

Build chronological train / validation / test splits:

python scripts/build_splits.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"

Freeze dataset metadata and target diagnostics:

python scripts/target_diagnostics.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"

Run validation models and diagnostics:

python scripts/run_majority_baseline.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
python scripts/run_queue_imbalance_logistic.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --max-iter 300
python scripts/run_full_logistic.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --solver saga --max-iter 300 --n-jobs -1
python scripts/compare_validation_results.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
python scripts/run_probability_diagnostics.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --solver saga --max-iter 300 --n-jobs -1

Run final test evaluation only after validation diagnostics are frozen:

python scripts/run_final_test_evaluation.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --solver saga --max-iter 300 --n-jobs -1

Run protocol regime analysis using train-median cutoffs applied unchanged to validation/test:

python scripts/run_regime_analysis.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"

Run the research audit:

python scripts/run_research_audit.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --experiment-tag "$TAG"

Save compact test-only prediction/probability diagnostics:

python scripts/save_compact_test_predictions.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"

Create compact plots and the current result manifest:

python scripts/make_result_plots.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
python scripts/write_current_result_manifest.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"

Important: after run_final_test_evaluation.py, do not change features, thresholds, labels, or model selection based on this test result. Any further changes should be a new experiment on a fresh validation/test protocol.

Key Output Files

Current frozen result files:

data/processed/feature_table_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07.parquet
data/processed/model_dataset_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07.parquet
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_aggregate.csv
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_nonzero_subset.csv
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_daily_blocks.csv
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_regime_aggregate.csv
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_regime_nonzero_subset.csv
outputs/results/compact_test_predictions_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07.parquet
outputs/reports/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_full_vs_baselines_deltas.csv
outputs/reports/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_frozen_thresholds.csv
outputs/reports/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_regime_thresholds.csv
outputs/reports/compact_test_predictions_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_summary.csv
outputs/reports/current_result_manifest.md
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_target_drift.png
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_horizon_performance.png
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_thresholded_vs_argmax_recall.png
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_regime_performance.png
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_nonzero_performance.png
outputs/reports/research_audit/BTCUSDT_2024-03-01_to_2024-03-07_v2_float64_features/audit_summary.json
outputs/reports/research_audit/BTCUSDT_2024-03-01_to_2024-03-07_v2_float64_features/final_audit_conclusion.csv
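
Most of the file names above share one pattern: a prefix, the artifact tag, the symbol, the date range, and a suffix. The sketch below reproduces that pattern with a hypothetical helper. It is inferred from the listed names and is not the API of src/artifact_naming.py.

```python
def tagged_name(prefix, tag, symbol, start, end, suffix):
    """Build a tagged artifact file name (hypothetical helper)."""
    return f"{prefix}_{tag}_{symbol}_{start}_to_{end}{suffix}"

feature_table_name = tagged_name(
    "feature_table", "v2_float64_features", "BTCUSDT",
    "2024-03-01", "2024-03-07", ".parquet",
)
aggregate_name = tagged_name(
    "final_test", "v2_float64_features", "BTCUSDT",
    "2024-03-01", "2024-03-07", "_aggregate.csv",
)
```

The research_audit directory uses a different order (symbol, date range, then tag), so it is not covered by this sketch.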

Future Work

Planned extensions:

Add prediction-conditioned diagnostics using the compact probability summaries.
Add an incremental-learning model as a secondary extension using the same features, labels, splits, and validation-only model selection discipline.
Run a fresh 14-day clean-sample replication after the static and incremental pipelines are fixed.
Update the README, memo, and result review after the replication run.
