This project tests whether a small, interpretable set of top-of-book state and recent trade-flow variables contains out-of-sample predictive information about short-horizon BTCUSDT midprice direction.
The study is deliberately narrow:
- one instrument: Binance USDT-M perpetual BTCUSDT
- top-of-book quotes and aggregate trades only
- event-time labels at 10, 20, and 50 distinct quote-state updates
- chronological train / validation / test splits
- simple baselines and multinomial logistic regression
- explicit leakage, split-boundary, and timestamp-ordering audits
The purpose is not to claim a tradable strategy. The purpose is to test whether weak statistical signal survives an honest out-of-sample protocol, and to document where it breaks.
The current canonical experiment is the v2_float64_features 7-day checkpoint:
- symbol: BTCUSDT
- sample: 2024-03-01 to 2024-03-07 UTC
- artifact tag: v2_float64_features
- quote-event rows after feature/label construction: about 223 million
- model-eligible final test rows: 28,715,045
- mid_return_5 and realized_vol_20 are computed and stored as float64
- the log-return-derived feature precision audit now passes
- compact prediction/probability summary diagnostics are included under outputs/reports
- plots and a current result manifest are available under outputs/reports
- selected plots are discussed in docs/current_results.md
- frozen final-test evaluation is complete for this checkpoint
The research protocol targets 30 clean calendar days, or at least 14 if 30 are not available. This 7-day v2 run should therefore be treated as a checkpoint, not the final generalization claim. A longer 14-day replication remains pending.
At each distinct top-of-book quote-state event, can current book state and recent trade flow predict whether the midprice moves down, stays unchanged, or moves up after a short event-time horizon?
For horizon h, the label is:
y_h = sign(midprice[t + h] - midprice[t])
The primary task is ternary classification over:
-1 = down
0 = unchanged
1 = up
The non-zero subset is also reported as a diagnostic, but it is not the headline task.
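The label definition above can be sketched in a few lines. This is a minimal pure-Python illustration, not the project's label builder; the function names `sign` and `direction_labels` are hypothetical, and the input is assumed to be the midprice series indexed by distinct quote-state event.

```python
from typing import List, Optional


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def direction_labels(midprices: List[float], horizon: int) -> List[Optional[int]]:
    """Label event t with sign(mid[t + h] - mid[t]).

    Events whose horizon runs past the end of the series get None,
    because no future midprice exists to define the label.
    """
    n = len(midprices)
    labels: List[Optional[int]] = []
    for t in range(n):
        if t + horizon < n:
            labels.append(sign(midprices[t + horizon] - midprices[t]))
        else:
            labels.append(None)
    return labels


# Toy midprice path in event time (illustrative values only).
mids = [100.0, 100.5, 100.0, 100.5, 101.0, 100.0]
labels_h2 = direction_labels(mids, horizon=2)
```

The trailing `None` entries are the rows that cannot be labeled at that horizon, which is also why the final rows of each split need special handling.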
Raw data comes from Binance public USDT-M futures daily files:
- bookTicker: top-of-book quotes
- aggTrades: aggregate trades
The unit of analysis is a distinct quote-state event. Consecutive duplicate quote states with identical bid, ask, bid size, and ask size are collapsed. Event time is therefore the index of distinct top-of-book states, not raw message count.
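The collapsing rule can be illustrated as follows. This is a sketch under assumptions: the `Quote` record and the function name `collapse_duplicate_states` are hypothetical, and the real event builder is described as memory-lean, so it presumably does not operate on Python lists.

```python
from typing import List, NamedTuple


class Quote(NamedTuple):
    ts_ms: int        # quote timestamp in milliseconds
    bid: float
    ask: float
    bid_size: float
    ask_size: float


def collapse_duplicate_states(quotes: List[Quote]) -> List[Quote]:
    """Keep a quote only if its book state differs from the previous kept one.

    The book state is (bid, ask, bid_size, ask_size); the timestamp is ignored
    when comparing, so repeated messages with an unchanged book are dropped.
    """
    events: List[Quote] = []
    for q in quotes:
        if not events or q[1:] != events[-1][1:]:
            events.append(q)
    return events


raw = [
    Quote(1, 100.0, 100.1, 2.0, 3.0),
    Quote(2, 100.0, 100.1, 2.0, 3.0),  # duplicate state, dropped
    Quote(3, 100.0, 100.1, 2.5, 3.0),  # bid size changed, kept
    Quote(4, 100.0, 100.1, 2.0, 3.0),  # returns to an earlier state, still kept
]
events = collapse_duplicate_states(raw)
```

Only consecutive duplicates are removed: a state that reappears after an intervening change counts as a new event.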
Quote rows are rejected if:
- bid price exceeds ask price
- bid or ask price is non-positive
- displayed bid or ask size is non-positive
The data quality report also checks quote/trade timestamp gaps and invalid records by UTC day before event construction.
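The three rejection rules translate directly into a row-level predicate. The sketch below is illustrative only; `is_valid_quote` is a hypothetical name and not a function from quality_checks.py.

```python
def is_valid_quote(bid: float, ask: float, bid_size: float, ask_size: float) -> bool:
    """Return False for any quote row that one of the rejection rules would drop."""
    if bid <= 0 or ask <= 0:            # non-positive price
        return False
    if bid_size <= 0 or ask_size <= 0:  # non-positive displayed size
        return False
    if bid > ask:                       # crossed book
        return False
    return True
```

As written, the rules reject a crossed book (bid above ask) but not a locked one (bid equal to ask), so the predicate keeps locked quotes.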
The main feature set is intentionally small and interpretable.
Book-state features:
- relative spread
- queue imbalance
- log bid size
- log ask size
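The exact formulas for these features are not spelled out here, so the sketch below uses common conventions as assumptions: relative spread as (ask - bid) divided by the midprice, queue imbalance as (bid size - ask size) divided by their sum, and natural logarithms for the size features. The function name and dictionary keys are hypothetical.

```python
import math
from typing import Dict


def book_state_features(bid: float, ask: float, bid_size: float, ask_size: float) -> Dict[str, float]:
    """Compute four book-state features from one top-of-book snapshot.

    Assumed definitions (not confirmed by the project source):
      relative_spread = (ask - bid) / mid
      queue_imbalance = (bid_size - ask_size) / (bid_size + ask_size)
    """
    mid = 0.5 * (bid + ask)
    return {
        "relative_spread": (ask - bid) / mid,
        "queue_imbalance": (bid_size - ask_size) / (bid_size + ask_size),
        "log_bid_size": math.log(bid_size),
        "log_ask_size": math.log(ask_size),
    }


example = book_state_features(bid=99.0, ask=101.0, bid_size=3.0, ask_size=1.0)
```

Under these definitions queue imbalance lies in (-1, 1) for rows with positive sizes on both sides, and the logarithms are always defined because non-positive sizes are rejected upstream.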
Short-memory quote-state features:
- change in bid size
- change in ask size
- 5-event log midprice return
- 20-event realized volatility proxy
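The two log-return-derived features can be sketched as below. The definitions are assumptions: the lagged return is taken as log(mid[t] / mid[t - lag]), and the volatility proxy as the square root of the sum of squared one-event log returns over the trailing window. The function names are hypothetical. Python floats are IEEE 754 double precision, so this sketch computes in float64 throughout.

```python
import math
from typing import List, Optional


def log_return(mids: List[float], t: int, lag: int) -> Optional[float]:
    """Backward-looking log midprice return over `lag` events, or None if too early."""
    if t - lag < 0:
        return None
    return math.log(mids[t] / mids[t - lag])


def realized_vol(mids: List[float], t: int, window: int) -> Optional[float]:
    """Assumed proxy: sqrt of summed squared one-event log returns over `window` events."""
    if t - window < 0:
        return None
    squared = (math.log(mids[i] / mids[i - 1]) ** 2 for i in range(t - window + 1, t + 1))
    return math.sqrt(sum(squared))


mids = [100.0, 100.0, 110.0]
r2 = log_return(mids, t=2, lag=2)       # uses mids[0] and mids[2] only
v2 = realized_vol(mids, t=2, window=2)  # uses returns ending at events 1 and 2
flat_vol = realized_vol([100.0, 100.0, 100.0], t=2, window=2)
```

Both functions read only indices at or before `t`. The last line shows that the proxy is exactly zero whenever the midprice does not move over the window, which can happen often when many events change sizes but not prices.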
Recent trade-flow features over the previous one second:
- trade intensity
- signed trade count imbalance
- signed trade volume imbalance
Trade-flow features use trades in the interval:
(quote_timestamp - 1000 ms, quote_timestamp)
Trades exactly equal to the quote-event timestamp are excluded.
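The open-interval window can be implemented with two binary searches over sorted trade timestamps. In the sketch below the feature definitions are assumptions: intensity is the trade count in the window, and each imbalance is the signed sum divided by the unsigned total, with zero returned for an empty window. Trade signs are taken as given (+1 for buyer-initiated, -1 for seller-initiated), and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, Sequence


def trade_flow_features(
    trade_ts_ms: Sequence[int],   # sorted ascending
    trade_sign: Sequence[int],    # +1 buy-initiated, -1 sell-initiated
    trade_qty: Sequence[float],
    quote_ts_ms: int,
    window_ms: int = 1000,
) -> Dict[str, float]:
    """Summarize trades with quote_ts - window < ts < quote_ts (both ends excluded)."""
    lo = bisect_right(trade_ts_ms, quote_ts_ms - window_ms)  # first ts > window start
    hi = bisect_left(trade_ts_ms, quote_ts_ms)               # first ts >= quote ts
    signs = trade_sign[lo:hi]
    qtys = trade_qty[lo:hi]
    n = len(signs)
    total_qty = sum(qtys)
    signed_qty = sum(s * q for s, q in zip(signs, qtys))
    return {
        "trade_intensity": float(n),
        "count_imbalance": sum(signs) / n if n else 0.0,
        "volume_imbalance": signed_qty / total_qty if total_qty else 0.0,
    }


ts = [0, 500, 999, 1000, 1500]
sg = [1, 1, -1, 1, -1]
qt = [1.0, 2.0, 1.0, 5.0, 1.0]
flow = trade_flow_features(ts, sg, qt, quote_ts_ms=1000)
```

For the quote at 1000 ms, the trades at 0 ms and 1000 ms fall on the excluded endpoints, so only the trades at 500 ms and 999 ms contribute.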
The split is chronological and timestamp-based:
- train: first 60% of the sample window
- validation: next 20%
- test: final 20%
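A timestamp-based split of this kind can be sketched as a pure function of the event timestamp and the sample window. The half-open boundary convention and the name `assign_split` are assumptions; integer arithmetic is used so the cutoffs are exact.

```python
def assign_split(ts_ms: int, start_ms: int, end_ms: int,
                 train_pct: int = 60, val_pct: int = 20) -> str:
    """Assign an event to train / validation / test by where its timestamp falls.

    Cutoffs are fractions of the sample window's duration, not of the row count,
    so busy periods contribute more rows to whichever split they fall in.
    """
    span = end_ms - start_ms
    train_end = start_ms + span * train_pct // 100
    val_end = start_ms + span * (train_pct + val_pct) // 100
    if ts_ms < train_end:
        return "train"
    if ts_ms < val_end:
        return "validation"
    return "test"


# Toy window of 1000 ms: cutoffs land at 600 ms and 800 ms.
splits = [assign_split(t, 0, 1000) for t in (0, 599, 600, 799, 800, 999)]
```

Because the assignment depends only on time, no shuffling is involved and every test event is later than every validation and train event.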
Important controls:
- no random split
- no shuffling
- scalers fit on train only through an sklearn pipeline
- features use only current and past quote/trade information
- labels use future midprices only for the target
- the last 50 labeled observations of train and validation are marked as boundary drops
- validation is used for threshold selection and diagnostics
- final test is evaluated once for final reporting
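Two of these controls lend themselves to a short illustration: the boundary drops and train-only scaling. The sketch below is a plain-Python stand-in with hypothetical names; the project itself is described as using an sklearn pipeline, which this does not reproduce. The standardizer uses the population standard deviation, and falls back to 1.0 for a constant feature as an assumption.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def boundary_drop_mask(splits: List[str], max_horizon: int) -> List[bool]:
    """Flag the last `max_horizon` rows of train and validation.

    Labels on those rows look up to `max_horizon` events ahead, so their
    targets would be computed from midprices that belong to the next split.
    """
    drop = [False] * len(splits)
    for name in ("train", "validation"):
        idx = [i for i, s in enumerate(splits) if s == name]
        for i in idx[max(0, len(idx) - max_horizon):]:
            drop[i] = True
    return drop


def fit_standardizer(train_values: List[float]) -> Tuple[float, float]:
    """Estimate mean and scale from training rows only."""
    mu = fmean(train_values)
    sd = pstdev(train_values) or 1.0  # assumed fallback for constant features
    return mu, sd


def apply_standardizer(values: List[float], mu: float, sd: float) -> List[float]:
    """Apply train-fitted statistics unchanged to any split."""
    return [(v - mu) / sd for v in values]


splits = ["train"] * 5 + ["validation"] * 3 + ["test"] * 2
mask = boundary_drop_mask(splits, max_horizon=2)
mu, sd = fit_standardizer([1.0, 2.0, 3.0])
scaled_test = apply_standardizer([2.0, 5.0], mu, sd)
```

With a maximum horizon of 50, the same mask logic would flag the final 50 rows of train and of validation while leaving test untouched.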
The audit checks:
- label alignment against manually recomputed event-time labels
- split-boundary target spillover
- quote-feature timing
- float64 recomputation of log-return-derived features
- equal-timestamp trade exclusion
- target drift across train / validation / test
- daily test stability
- coefficient sign interpretation
- confusion matrix behavior under class imbalance
- spread-relative label magnitude
- protocol-defined regime performance by spread, realized volatility, and trade intensity
- For realized volatility, the train median is zero, so the high-volatility bucket should be interpreted as a nonzero-volatility / more active regime rather than a balanced half-sample split.
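The realized-volatility caveat is easier to see with a small example. The sketch below assumes a regime rule of "high if strictly above the train median", which is an assumption about the bucketing convention; the function name and the toy values are hypothetical.

```python
from statistics import median
from typing import List, Tuple


def regime_buckets(train_values: List[float], values: List[float]) -> Tuple[float, List[str]]:
    """Bucket values as high/low using a cutoff estimated on train only."""
    cutoff = median(train_values)
    return cutoff, ["high" if v > cutoff else "low" for v in values]


# Toy data: most training rows have zero realized volatility.
train_vol = [0.0, 0.0, 0.0, 0.1, 0.2]
test_vol = [0.0, 0.05, 0.0, 0.3]
cutoff, buckets = regime_buckets(train_vol, test_vol)
```

When the train median is zero, the "high" bucket is simply every row with nonzero volatility, so the two buckets need not be close to equal in size.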
Audit summary:
critical_audit_passed: true
leakage_evidence: no_evidence_found
main_caveat: pass
log_feature_precision_status: pass
The main comparison uses:
- majority_baseline: predicts the train majority class
- queue_imbalance_logistic: multinomial logistic regression using queue imbalance only
- full_logistic: multinomial logistic regression using all 11 features
- full_logistic_thresholded: secondary validation-selected probability threshold rule
The primary model is full_logistic. The thresholded model is a secondary operating
point selected on validation macro F1 before final test evaluation.
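The exact form of the threshold rule is not given here, so the following is only one plausible shape, stated as an assumption: predict a direction when the larger of the down/up probabilities reaches a threshold, otherwise predict unchanged, and pick the threshold that maximizes macro F1 on validation. All function names and the toy probabilities are hypothetical.

```python
from typing import List, Sequence, Tuple

CLASSES = (-1, 0, 1)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over the three classes."""
    scores = []
    for c in CLASSES:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def thresholded_predict(probs: Sequence[Tuple[float, float, float]], tau: float) -> List[int]:
    """Rows are (p_down, p_unchanged, p_up). Assumed rule, not the project's."""
    preds = []
    for p_down, _p_unchanged, p_up in probs:
        if max(p_down, p_up) >= tau:
            preds.append(-1 if p_down > p_up else 1)
        else:
            preds.append(0)
    return preds


def select_threshold(val_probs, val_true, grid: Sequence[float]) -> float:
    """Pick the grid value with the best validation macro F1 (first wins on ties)."""
    return max(grid, key=lambda tau: macro_f1(val_true, thresholded_predict(val_probs, tau)))


val_probs = [(0.3, 0.6, 0.1), (0.1, 0.6, 0.3), (0.05, 0.9, 0.05), (0.25, 0.6, 0.15)]
val_true = [-1, 1, 0, 0]
best_tau = select_threshold(val_probs, val_true, grid=[0.2, 0.3, 0.4, 0.5])
```

In this toy example the argmax rule would predict unchanged on every row, while the selected threshold recovers both directional rows. The threshold is then held fixed before the test set is evaluated.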
Final test metrics for the current 7-day v2 frozen experiment:
| Horizon | Model | Accuracy | Macro F1 | Balanced Accuracy |
|---|---|---|---|---|
| 10 | majority baseline | 0.798 | 0.296 | 0.333 |
| 10 | queue-imbalance logistic | 0.798 | 0.296 | 0.333 |
| 10 | full logistic | 0.821 | 0.454 | 0.423 |
| 10 | full logistic, thresholded | 0.770 | 0.584 | 0.600 |
| 20 | majority baseline | 0.686 | 0.271 | 0.333 |
| 20 | queue-imbalance logistic | 0.686 | 0.271 | 0.333 |
| 20 | full logistic | 0.724 | 0.435 | 0.423 |
| 20 | full logistic, thresholded | 0.705 | 0.628 | 0.647 |
| 50 | majority baseline | 0.468 | 0.213 | 0.333 |
| 50 | queue-imbalance logistic | 0.468 | 0.213 | 0.333 |
| 50 | full logistic | 0.556 | 0.443 | 0.456 |
| 50 | full logistic, thresholded | 0.657 | 0.660 | 0.664 |
The full-feature logistic model beats both baselines on final-test macro F1 and
balanced accuracy at all three horizons. The default argmax decision rule remains
conservative and predicts unchanged very often. The validation-selected thresholded
rule sacrifices some unchanged-class accuracy but recovers substantially more down/up
recall.
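The baseline rows in the table can be sanity-checked by hand. For a model that always predicts the majority class, recall is 1 for that class and 0 for the others, so balanced accuracy is 1/3; its precision equals the majority share, so its F1 is 2s / (1 + s) and macro F1 is one third of that. The sketch below assumes all three classes occur in the test set; the function name is hypothetical.

```python
from typing import Tuple


def constant_predictor_metrics(majority_share: float, n_classes: int = 3) -> Tuple[float, float, float]:
    """Accuracy, macro F1, and balanced accuracy for an always-majority predictor."""
    accuracy = majority_share
    # Majority class: precision = share, recall = 1. Other classes: F1 = 0.
    f1_majority = 2.0 * majority_share / (1.0 + majority_share)
    macro_f1 = f1_majority / n_classes
    balanced_accuracy = 1.0 / n_classes
    return accuracy, macro_f1, balanced_accuracy


# Majority-class accuracies taken from the table above.
checks = {h: constant_predictor_metrics(acc) for h, acc in {10: 0.798, 20: 0.686, 50: 0.468}.items()}
rounded = {h: tuple(round(x, 3) for x in m) for h, m in checks.items()}
```

The rounded values reproduce the majority-baseline rows at all three horizons. The queue-imbalance-only rows are identical, which is consistent with that model predicting the majority class on essentially every test row.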
Non-zero subset performance for the thresholded model:
| Horizon | Non-zero Accuracy | Non-zero Macro F1 | Non-zero Balanced Accuracy |
|---|---|---|---|
| 10 | 0.476 | 0.418 | 0.477 |
| 20 | 0.591 | 0.485 | 0.591 |
| 50 | 0.681 | 0.530 | 0.681 |
The result supports a narrow statistical claim:
In this frozen 7-day BTCUSDT sample, simple top-of-book state and recent trade-flow features contain out-of-sample information about short-horizon event-time midprice direction, relative to majority and queue-imbalance-only baselines.
The result does not support a trading claim. The project does not model:
- fees
- latency
- queue position
- fill probability
- market impact
- adverse selection
- executable PnL
- cross-venue or cross-asset generalization
The audit also found strong target drift: validation and test contain materially more directional labels than train, especially at longer horizons. This makes random-split style evaluation inappropriate and reinforces the need for replication on a fresh, longer sample.
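A target-drift check of this kind amounts to comparing class shares across splits. The sketch below is illustrative; the function name is hypothetical and the label lists are toy values, not project results.

```python
from collections import Counter
from typing import Dict, List


def class_shares(labels: List[int]) -> Dict[int, float]:
    """Share of down (-1), unchanged (0), and up (1) labels in one split."""
    counts = Counter(labels)
    n = len(labels)
    return {c: (counts.get(c, 0) / n if n else 0.0) for c in (-1, 0, 1)}


# Toy labels only: a mostly unchanged train split and a more directional test split.
train_shares = class_shares([0, 0, 0, 0, 0, 0, 1, -1])
test_shares = class_shares([0, 0, 1, 1, -1, -1, 0, 1])
directional_gap = (1 - test_shares[0]) - (1 - train_shares[0])
```

A large positive gap in directional share between train and later splits is the pattern described above, and it is one reason a random split would give a misleading picture.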
- The current frozen experiment is a 7-day v2 checkpoint. The protocol targets 30 clean days, or at least 14 if 30 are not available, so the result should not be read as a full-sample generalization claim.
- The study covers one instrument and one venue: Binance USDT-M BTCUSDT.
- The model is deliberately simple and linear. This is useful for auditability and interpretation, but it does not test more flexible nonlinear learners.
- Validation and test contain materially more directional labels than train. The target drift is reported explicitly and motivates replication on a longer sample.
- Regime-performance tables are included, but the current regime conclusions are still based on the 7-day checkpoint sample.
src/
    artifact_naming.py               # tagged artifact filename helpers
    data_loader.py                   # Binance download/load/parsing helpers
    quality_checks.py                # raw quote/trade quality checks
    event_builder_opt.py             # memory-lean distinct quote-event construction
    feature_builder_opt_float_64.py  # v2 float64 log-return-derived feature builder
    split_config.py                  # chronological split and boundary drops
    modeling/
        majority_baseline.py
        logistic_models.py
    evaluation/
        metrics.py
        model_comparison.py
        probability_diagnostics.py
        target_diagnostics.py
scripts/
    load_data.py
    quality_report.py
    build_events.py
    build_features_labels_opt.py
    build_splits.py
    run_majority_baseline.py
    run_queue_imbalance_logistic.py
    run_full_logistic.py
    run_probability_diagnostics.py
    run_final_test_evaluation.py
    run_regime_analysis.py
    run_research_audit.py
    save_compact_test_predictions.py
    make_result_plots.py
    write_current_result_manifest.py
    smoke_test_metrics.py
docs/
    microstructure_signal_memo.md
    current_results.md
outputs/
    reports/
    results/
    logs/
The pinned conda environment is named quant_env:
conda activate quant_env
For a fresh environment, create it from environment.yml:
conda env create -f environment.yml
conda activate quant_env
Alternatively, create a virtual environment and install the pinned pip dependencies from requirements.txt:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
Run the lightweight metric smoke test:
python scripts/smoke_test_metrics.py
The 1-day smoke run below exercises the full pipeline shape on a smaller date range. It is for installation and pipeline validation only; do not compare its metrics to the frozen 7-day research result.
If raw zip files are already present locally, omit --download.
START=2024-03-01
END=2024-03-01
TAG=v2_float64_features
python scripts/load_data.py --start "$START" --end "$END" --symbol BTCUSDT --download
python scripts/quality_report.py --start "$START" --end "$END" --symbol BTCUSDT
python scripts/build_events.py --start "$START" --end "$END" --symbol BTCUSDT
python scripts/build_features_labels_opt.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG"
python scripts/build_splits.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG"
python scripts/target_diagnostics.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG"
python scripts/run_majority_baseline.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG"
python scripts/run_queue_imbalance_logistic.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --max-iter 50
python scripts/run_full_logistic.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --solver saga --max-iter 50 --n-jobs -1
python scripts/run_probability_diagnostics.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --solver saga --max-iter 50 --n-jobs -1
python scripts/run_final_test_evaluation.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --solver saga --max-iter 50 --n-jobs -1
python scripts/run_regime_analysis.py --start "$START" --end "$END" --symbol BTCUSDT --artifact-tag "$TAG" --splits test
The current frozen run uses:
START=2024-03-01
END=2024-03-07
SYMBOL=BTCUSDT
TAG=v2_float64_features
The full 7-day run is compute-heavy. It processes hundreds of millions of quote-event rows and can produce tens of GB of local data/output artifacts. Before running the full pipeline, use the 1-day smoke run above to check that the local environment, data paths, and scripts work.
Download and parse local Binance files:
python scripts/load_data.py --start "$START" --end "$END" --symbol "$SYMBOL" --download
If the raw zip files are already present, omit --download:
python scripts/load_data.py --start "$START" --end "$END" --symbol "$SYMBOL"
Build the data quality report:
python scripts/quality_report.py --start "$START" --end "$END" --symbol "$SYMBOL"
Build distinct quote events:
python scripts/build_events.py --start "$START" --end "$END" --symbol "$SYMBOL"
Build labels and float64 log-return-derived features:
python scripts/build_features_labels_opt.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
Build chronological train / validation / test splits:
python scripts/build_splits.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
Freeze dataset metadata and target diagnostics:
python scripts/target_diagnostics.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
Run validation models and diagnostics:
python scripts/run_majority_baseline.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
python scripts/run_queue_imbalance_logistic.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --max-iter 300
python scripts/run_full_logistic.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --solver saga --max-iter 300 --n-jobs -1
python scripts/compare_validation_results.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
python scripts/run_probability_diagnostics.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --solver saga --max-iter 300 --n-jobs -1
Run final test evaluation only after validation diagnostics are frozen:
python scripts/run_final_test_evaluation.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --solver saga --max-iter 300 --n-jobs -1
Run protocol regime analysis using train-median cutoffs applied unchanged to validation/test:
python scripts/run_regime_analysis.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
Run the research audit:
python scripts/run_research_audit.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG" --experiment-tag "$TAG"
Save compact test-only prediction/probability diagnostics:
python scripts/save_compact_test_predictions.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
Create compact plots and the current result manifest:
python scripts/make_result_plots.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
python scripts/write_current_result_manifest.py --start "$START" --end "$END" --symbol "$SYMBOL" --artifact-tag "$TAG"
Important: after run_final_test_evaluation.py, do not change features, thresholds, labels, or model selection based on this test result. Any further changes should be a new experiment on a fresh validation/test protocol.
Current frozen result files:
data/processed/feature_table_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07.parquet
data/processed/model_dataset_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07.parquet
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_aggregate.csv
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_nonzero_subset.csv
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_daily_blocks.csv
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_regime_aggregate.csv
outputs/results/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_regime_nonzero_subset.csv
outputs/results/compact_test_predictions_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07.parquet
outputs/reports/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_full_vs_baselines_deltas.csv
outputs/reports/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_frozen_thresholds.csv
outputs/reports/final_test_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_regime_thresholds.csv
outputs/reports/compact_test_predictions_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_summary.csv
outputs/reports/current_result_manifest.md
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_target_drift.png
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_horizon_performance.png
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_thresholded_vs_argmax_recall.png
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_regime_performance.png
outputs/reports/plots/plot_v2_float64_features_BTCUSDT_2024-03-01_to_2024-03-07_nonzero_performance.png
outputs/reports/research_audit/BTCUSDT_2024-03-01_to_2024-03-07_v2_float64_features/audit_summary.json
outputs/reports/research_audit/BTCUSDT_2024-03-01_to_2024-03-07_v2_float64_features/final_audit_conclusion.csv
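Most of the filenames listed above follow one visible pattern: a stem, the artifact tag, the symbol, the date range, and an optional suffix. The helper below is a hypothetical sketch inferred from that pattern; it is not the interface of artifact_naming.py.

```python
def tagged_artifact_name(stem: str, tag: str, symbol: str, start: str, end: str,
                         suffix: str = "", ext: str = "parquet") -> str:
    """Build `<stem>_<tag>_<symbol>_<start>_to_<end>[_<suffix>].<ext>`."""
    base = f"{stem}_{tag}_{symbol}_{start}_to_{end}"
    if suffix:
        base += f"_{suffix}"
    return f"{base}.{ext}"


feature_table = tagged_artifact_name(
    "feature_table", "v2_float64_features", "BTCUSDT", "2024-03-01", "2024-03-07"
)
aggregate_csv = tagged_artifact_name(
    "final_test", "v2_float64_features", "BTCUSDT", "2024-03-01", "2024-03-07",
    suffix="aggregate", ext="csv",
)
```

The research_audit directory uses a different ordering (symbol, dates, then tag) and current_result_manifest.md is untagged, so this sketch does not cover those paths.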
Planned extensions:
- Add prediction-conditioned diagnostics using the compact probability summaries.
- Add an incremental-learning model as a secondary extension using the same features, labels, splits, and validation-only model selection discipline.
- Run a fresh 14-day clean-sample replication after the static and incremental pipelines are fixed.
- Update the README, memo, and result review after the replication run.