AurumQ-RL · An Open-Source Reinforcement Learning Stock-Selection Framework for the China A-Share Market
中文:一份面向 A 股的因子工程 + 强化学习选股参考实现,附完整的迭代史、消融实验、生产化决策与教训。 English: A factor-engineering + reinforcement-learning stock-picking reference implementation for China A-shares, shipped with the full iteration history, ablations, productionization decisions, and lessons learned.
📊 China A-share · 🤖 PPO/A2C/SAC · 🚀 GPU Train + CPU Infer · 📈 Alpha101 + GTJA Alpha191 + Main-Force + Hot-Money + Northbound · 🧪 26 Phases of Open Experiments
中文. AurumQ-RL 是一个针对 A 股市场特有微观结构(T+1、±10% 涨跌停、主板/科创/创业/北交分层、ST 风险警示、申万一级行业、龙虎榜、北向、游资席位、筹码分布)做工程化封装的强化学习选股开源项目。仓库内含:(1) 一个 polars-native 的因子计算引擎,覆盖 105 个 WorldQuant Alpha101 + 191 个国泰君安 Alpha191(合计 296 个量价因子)外加 11 个 A 股私有因子族(mf_, mfp_, hm_, hk_, inst_, mg_, cyq_, senti_, sh_, fund_, ind_, mkt_, gtja_, tech_, cmf_, zt_);(2) Stable-Baselines3 PPO 训练栈,针对 RTX 4070 12 GB 做了 GPU 化重构(per-stock 编码器 + CUDA-resident rollout buffer + 索引化观测);(3) 完整的 14-phase 训练栈演化史,从最初 11% GPU 利用率到 1M-step 隔夜训练;(4) 26-phase 模型实验史,覆盖奖励重设计、长 panel 消融、rank-z 假设检验、SHAP 剪枝、事件衰减编码等关键转折;(5) ONNX 导出 + CPU 推理生产管线。核心实证发现:(a) rank-z 跨截面归一化会在长 panel 训练中销毁 5-6 bps 的跨年因子幅度信号;(b) 5 年训练窗口已是 plateau,2018-2019 数据零边际贡献;(c) Strategy D(top-K 仓位按分数加权)能与任何基模型叠加 +7-10 bps 的 mean_y;(d) 二值事件标志直接进 LayerNorm 是 −33% 准确率回归的元凶,必须用 exp-decay τ=10d 编码;(e) cyq 筹码因子的回填 vs 实采分布漂移导致 1.5× T-1 hit 回归,根因是 z-score 不抹平时序 regime 跳变。当前生产状态(2026-05-11):10 个 model_version 同时在 Celery Beat 18:50-19:00 排程,新进 best 是 path5_long(H1 校准 mean_y +0.02882,T1_hit 55.8%)。
English. AurumQ-RL is an open-source RL stock-selection framework engineered around the China A-share market's unique microstructure (T+1 settlement; ±10% daily limit per board; main-board vs ChiNext/STAR/BSE segmentation; ST risk-warning; SW Tier-1 industry; Dragon-Tiger List; Northbound; hot-money seats; chip distribution). The repository ships (1) a polars-native factor engine covering 105 WorldQuant Alpha101 + 191 Guotai Junan Alpha191 (296 price-volume factors total) plus 11 private A-share factor families (mf_, mfp_, hm_, hk_, inst_, mg_, cyq_, senti_, sh_, fund_, ind_, mkt_, gtja_, tech_, cmf_, zt_); (2) a Stable-Baselines3 PPO training stack with a GPU-vectorized rewrite for the RTX 4070 12 GB (per-stock encoder + CUDA-resident rollout buffer + index-only observations); (3) the complete 14-phase training-stack evolution from 11 % GPU utilization at bring-up to overnight 1M-step training; (4) the 26-phase modeling experiment history covering reward redesign, long-panel ablations, the rank-z hypothesis test, SHAP-based pruning, and event-decay encoding; (5) ONNX export + CPU inference production pipeline. Key empirical findings. (a) Per-day cross-sectional rank-z destroys 5-6 bps of cross-year factor amplitude signal in long-panel training. (b) Five-year training window is the plateau; 2018-2019 data contribute nothing. (c) Strategy D top-K score-weighted sizing compounds +7-10 bps mean_y onto any base model. (d) Binary event flags fed directly into LayerNorm cause a −33 % accuracy regression and must be encoded as exp-decay with τ=10d. (e) Chip-distribution (cyq) backfill versus real-sample distribution shift drove a 1.5× T-1 hit regression — root cause is that cross-section z-score does not equalize mid-stream temporal regime shifts. Current production state (2026-05-11): 10 model versions co-scheduled in the Celery Beat 18:50-19:00 budget; the new best is path5_long (H1 calibrated mean_y +0.02882, T1_hit 55.8 %).
中文 · English · Phase Timeline
- §1 引言:为什么 A 股市场需要专门的 RL 框架
- §2 系统总览:数据契约、因子前缀、宇宙过滤
- §3 因子库:296 量价因子 + 13 个 A 股私有因子族
- §4 训练方法演进:14 个 Phase 的工程史
- §5 模型实验史:26 个 Phase 的研究决策
- §6 监督学习对照赛道(paris 侧 P0/P2/Path 1-6)
- §7 实证发现:6 条改变方向的结论
- §8 生产流水线:每日 18:30-19:00 评分预算
- §9 工程教训:从踩坑到守则
- §10 上手与复现
- §11 路线图、引用、许可
- §12 研究范式分类与未来方向
- §1 Introduction: Why A-shares Need a Dedicated RL Framework
- §2 System Overview: Data Contract, Factor Prefixes, Universe Filter
- §3 Factor Library: 296 Price-Volume + 13 Private A-share Families
- §4 Training-Stack Evolution: 14 Phases of Engineering
- §5 Modeling Experiment History: 26 Phases of Research Decisions
- §6 Supervised-Learning Companion Track (paris side P0/P2/Path 1-6)
- §7 Empirical Findings: Six Conclusions That Changed Direction
- §8 Production Pipeline: Daily 18:30-19:00 Scoring Budget
- §9 Engineering Lessons: From Pitfalls to Operating Rules
- §10 Quick Start and Reproduction
- §11 Roadmap, Citation, License
- §12 Research Paradigms and Future Directions
Phase 0 Synthetic pipeline-up (~pre-2026-04-29)
Phase 1 First real-data alpha101 PPO 2026-04-29..30
Phase 2 Combined 355-col panel + wider net 2026-04-30
Phase 3 R1/R2/R3 smoke-round tuning 2026-05-01 morning
Phase 4 fps scaling, IPC ceiling discovery 2026-05-01 noon
Phase 5 Realizations → GPU framework redesign 2026-05-01 PM
Phase 6/7 GPU-vectorized framework + 50k smoke 2026-05-01 evening
Phase 8 GPURolloutBuffer (CUDA-resident) 2026-05-01 evening
Phase 9 IndexOnlyRolloutBuffer + n_steps=1024 2026-05-01 late evening
Phase 10 Optimizer orphan + LayerNorm + dual pooling 2026-05-01 night
Phase 11/12 bf16 autocast / target_kl=0.10 (eliminated)
Phase 13 PPO SGD perf-probe 2026-05-01 late night
Phase 14 TF32 + unique-date + 1M overnight 2026-05-02 → 03 early
Phase 15 RL serving integration (champion) 2026-05-02..03
Phase 16-19 Eval correction + multi-seed ensemble 2026-05-03
Phase 20 Long-data PPO 2026-05-05
Phase 21 V2 forward_10d REJECTED 2026-05-05..06
Phase 22 Main-wave reward redesign 2026-05-06
Phase 23 Episode targets cleanup 2026-05-06
Phase 24/25 Tech-factor + importance-weight REJECT 2026-05-07
Phase 26A-G cyq fix + event-decay tech (26F prod) 2026-05-07..08
# SL companion (paris side, AurumQ repo)
P0 Wave label ablation (Method A wins) 2026-05-09
P2 5-seed ensemble (wave_t3_lgbm_v2) 2026-05-09
P3 PPO residual ALGORITHM_SPEC v2 2026-05-09
Path 1-6 Multi-path SL exploration 2026-05-10
Long-panel Hybrid / path1_long / path5_long 2026-05-10..11
中文. A 股市场和欧美/港股/加密的微观结构差异远大于表面上的"也是股票"。本项目所有训练、回测、因子计算一律仅在 A 股主板 + 非 ST + 未退市(约 3000 只)上跑,这是 CLAUDE.md 的硬约束。理由是 regime homogeneity:
- 板别价格限制差异:主板 ±10%、ST/*ST ±5%、科创板 ±20%、创业板 ±20%、北交所 ±30%。把这五个池子混进同一个训练 batch,模型必须额外学一个"我现在在哪个板"的元变量,跨年泛化崩溃。
- T+1 结算:当日买入次日才能卖。RL 训练时的"动作-奖励"延迟和欧美/加密 T+0 完全不同;用 t→t+1 的同日 PnL 等价于看穿未来。
- 集合竞价 vs 连续竞价:9:15-9:30 集合竞价是 A 股最重要的"标价"事件之一,`stk_auction.bid_*` 字段在 `kpl_list` 里需要单独抓,常规 `pct_chg` / `amount` 不适用。
- 披露节奏:主板财报 + 业绩预告时点高度同步,跨年信号同质性强;科创板/北交所披露规则差异显著。
- 流动性结构:主板日均成交额 vs 北交所差 1-2 个数量级,分位归一化(rank-z)会把北交所的微弱信号放大到和主板同尺度,污染监督信号。
English. A-share microstructure differs from US/HK/crypto markets in ways that defeat naive "stocks are stocks" transfer. All training, backtest, and factor computation are restricted by hard constraint (CLAUDE.md) to main-board, non-ST, non-delisted stocks (~3000 names). Reasons:
- Board-specific price limits: main-board ±10 %, ST/*ST ±5 %, STAR/ChiNext ±20 %, BSE ±30 %. Mixing these into a single training batch forces the model to learn a "which board am I on" meta-variable, destroying cross-year generalization.
- T+1 settlement: bought today cannot be sold until tomorrow. The action-reward lag in RL training is fundamentally different from T+0 US/crypto. Using same-day t→t+1 PnL is look-ahead.
- Call auction vs continuous trading: the 09:15–09:30 call auction is one of the most informative tape events in A-shares. The `stk_auction.bid_*` fields in `kpl_list` need special extraction; standard `pct_chg` / `amount` do not apply.
- Disclosure cadence: main-board earnings + pre-announcement timing is tightly synchronized, giving cross-year signal homogeneity; STAR/BSE differ.
- Liquidity structure: main-board daily turnover vs BSE differs by 1–2 orders of magnitude; rank-z normalization would inflate BSE micro-signals to the same scale as main-board, polluting supervision.
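The board-specific limits listed above can be written as a small lookup. This is a minimal sketch, not the repository's `price_limits.py`: `daily_price_limit` is a hypothetical helper, the ChiNext (300/301) and STAR (688/689) code patterns are assumptions based on the usual exchange numbering, and applying the ±5 % ST rule only to main-board codes is likewise an assumption.

```python
import re

# Main-board pattern as quoted in the universe filter, with the dot escaped.
MAIN_BOARD = re.compile(r"^(60[0135]\d{3}\.SH|00[0123]\d{3}\.SZ)$")
CHINEXT = re.compile(r"^30[01]\d{3}\.SZ$")  # assumption: ChiNext codes
STAR = re.compile(r"^68[89]\d{3}\.SH$")     # assumption: STAR codes


def daily_price_limit(ts_code: str, is_st: bool = False) -> float:
    """Return the absolute daily price limit as a decimal (0.10 means ±10 %)."""
    if ts_code.endswith(".BJ"):
        return 0.30  # BSE
    if STAR.match(ts_code) or CHINEXT.match(ts_code):
        return 0.20  # STAR / ChiNext
    if MAIN_BOARD.match(ts_code):
        return 0.05 if is_st else 0.10  # ST/*ST vs regular main board
    raise ValueError(f"unrecognized board for {ts_code!r}")
```

Keeping the limit as an explicit per-stock quantity makes the regime difference visible: a model trained only on main-board names never has to infer which of these branches it is in.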
中文. 传统量化做法把 alpha101/gtja191 当 296 个排名分数线性加权,找最优权重向量。问题:
- 截面 IC 是低维度量:300 个因子 × 200 天 = 6 万样本估 300 个权重,过拟合容易,跨年泛化差。
- 非线性交互被忽略:`alpha_026 × cyq_winning_ratio` 在 30% 行业暴露上限下的表现可能远超两者线性和,线性模型抓不到。
- 奖励/成本不进训练目标:传统模型最小化 IC residual,但真实回报是「扣除 30bp 双边费 + T+1 不能反手 + 单行业 30% 上限」之后的实现收益,目标函数和评估指标不一致。
- 状态依赖动作:今天选哪 50 只取决于全市场截面分布,不是任意 200 只独立打分加总。
RL 用神经网络直接学映射「当前因子截面 + 持仓状态 → top-K 动作」,把成本、流动性、约束都搬进环境。代价是:样本效率低、调参成本高、可解释性差。本项目实事求是地认为「RL 不一定比线性 alpha 加权强」,因此并行维护一条监督学习(SL)赛道作为对照(见 §6);事实证明 SL 赛道的 path5_long 当前是综合最佳(H1 +0.02882),但 RL 赛道的 phase15e_150k_grand_champion 在 Sharpe 维度(OOS +6.27)仍是不可替代的多样性来源。
English. Conventional quant treats alpha101/gtja191 as 296 ranking scores to combine linearly, searching for the best weight vector. Problems:
- Cross-sectional IC is a low-dimensional statistic. 300 factors × 200 days = 60k samples to estimate 300 weights — overfitting prone, poor cross-year generalization.
- Nonlinear interactions go unmodeled. `alpha_026 × cyq_winning_ratio` under a 30 % industry cap can far exceed the linear sum of the two factors; linear models cannot capture this.
- Costs and constraints are absent from the training objective. Traditional models minimize IC residual; realized return is net of 30bp round-trip cost, T+1 inability to reverse, and a 30 % single-industry exposure cap. Objective and evaluation diverge.
- State-dependent action. Today's top-50 picks depend on the full market cross-section, not 200 independent scores summed.
RL learns the mapping current factor cross-section + holdings → top-K action directly, moving cost / liquidity / constraints into the environment. Tradeoffs: poor sample efficiency, expensive hyperparameter tuning, weak interpretability. This project honestly does NOT assume RL is always better than linear alpha aggregation — we maintain a parallel supervised-learning (SL) track as control (§6). Today the SL track's path5_long is the all-around best (H1 +0.02882), but the RL track's phase15e_150k_grand_champion remains an irreplaceable Sharpe-dimension diversity contributor (OOS +6.27).
中文. 本项目是:
- ✅ RL 选股算法 + Gymnasium 环境的参考实现
- ✅ A 股微观结构(T+1 / 涨跌停 / ST / 板别 / 行业暴露上限)的工程化封装
- ✅ 多源因子的消费者(按列名前缀识别,输入有什么就用什么)
- ✅ 离线训练 → ONNX 导出 → CPU 推理的端到端流水线
- ✅ 完整的研究决策记录:每个 phase 都记录了 stack 变更 / 量化证据 / 拒绝原因,便于复现和审计
不是:
- ❌ 实盘交易系统(无券商接口、无下单 API)
- ❌ 数据采集工具(不内置任何数据 API key;用户自己 pipeline 写 Parquet)
- ❌ 因子计算库(因子计算可选:`aurumq_rl.factors` 提供 296 个 polars 实现,但用户可以完全自己算)
- ❌ 高频交易(日频选股,T+1 持仓)
English. This project is:
- ✅ Reference implementation of an RL stock-picking algorithm + Gymnasium environment
- ✅ Engineering encapsulation of A-share microstructure (T+1 / price limits / ST / board / industry cap)
- ✅ A consumer of multi-source factors (prefix-recognized; uses whatever is in your Parquet)
- ✅ End-to-end offline-train → ONNX-export → CPU-infer pipeline
- ✅ Complete research decision log: every phase records stack diff, quantitative evidence, rejection reason — auditable and reproducible
Is NOT:
- ❌ A live trading system (no broker adapter, no order API)
- ❌ A data-ingestion tool (no API keys bundled; you write the Parquet)
- ❌ A factor-computation library (optional: `aurumq_rl.factors` offers 296 polars implementations, but you can roll your own)
- ❌ High-frequency trading (daily-frequency stock picking under T+1)
中文. 本项目是 AurumQ 平台的一个开源子模块,整个生态分三层:
| 层 | 项目 | 职责 |
|---|---|---|
| 策略 DSL | AQML | .aqml 声明式策略,可读 / 可验证 / AI 可生成的筛选 + 打分 + 风控规则 |
| 因子 + RL | AurumQ-RL (本项目) | 因子工程、A 股约束、PPO/A2C/SAC、ONNX 推理 |
| 平台 | AurumQ (闭源) | Web 平台 + REST API + 模拟盘 + 风控引擎 + AI 投研 |
典型工作流:先用 AQML 写策略意图 → 用 AurumQ-RL 把因子列和约束塞进模型训练 → 训出的 .onnx 回到 AurumQ 平台跑模拟盘和实时排程。
English. This project is an open-source submodule of the AurumQ platform; the ecosystem has three layers:
| Layer | Project | Role |
|---|---|---|
| Strategy DSL | AQML | .aqml declarative strategy — human-readable, validatable, AI-generatable screening + scoring + risk rules |
| Factors + RL | AurumQ-RL (this repo) | Factor engineering, A-share constraints, PPO/A2C/SAC, ONNX inference |
| Platform | AurumQ (proprietary) | Web platform + REST API + paper trading + risk engine + AI research |
Typical workflow: write strategy intent in AQML → feed factor columns and constraints into AurumQ-RL for training → the resulting .onnx returns to the AurumQ platform for paper trading and real-time scheduling.
中文. 项目对外契约就一句话:给我一份 Parquet,我就能训练。Parquet 必含:
- `ts_code` (str): Tushare 风格代码 `XXXXXX.SH/SZ/BJ`
- `trade_date` (date): 交易日
- `close` (float): 收盘价
- `pct_chg` (float): 涨跌幅(小数形式,+10% = 0.10,不是 10.0)
- `vol` (float): 成交量(== 0 视为停牌)
- 因子列(至少包含一组前缀):`alpha_*` / `gtja_*` / `mf_*` / `mfp_*` / `hm_*` / `hk_*` / `inst_*` / `mg_*` / `cyq_*` / `senti_*` / `sh_*` / `fund_*` / `ind_*` / `mkt_*` / `tech_*` / `cmf_*` / `zt_*`

可选字段(提供则使用,不提供则自动降级):

- `is_st` (bool): ST 标记,缺则按全 False 处理
- `days_since_ipo` (int): 上市以来交易日数(用于新股 60 日保护)
- `industry_code` (int): 申万一级行业编码(用于 30% 行业暴露上限)
- `is_hs300` / `is_zz500` (bool): 是否成分股,按 trade_date 历史变更(支持「2024-01 在 300、2024-06 调出」的时变性)

数据怎么来不是本项目关心的事,三种取数方式:

- 用 `scripts/generate_synthetic.py` 一键生成 10 MB 合成数据 demo
- 用 `scripts/export_factor_panel.py` 从 PostgreSQL 自己的数据仓库抽取(含 SQL 模板,含 HS300/ZZ500 成员标志支持)
- 自己用任何工具(pandas / DuckDB / Spark)造一份满足契约的 Parquet 写入
English. The single contract is: give me a Parquet, I will train. Required columns:
- `ts_code` (str): Tushare-style code `XXXXXX.SH/SZ/BJ`
- `trade_date` (date)
- `close` (float)
- `pct_chg` (float): decimal form, +10 % = 0.10, NOT 10.0
- `vol` (float): 0 means suspended
- Factor columns under at least one prefix: `alpha_*` / `gtja_*` / `mf_*` / `mfp_*` / `hm_*` / `hk_*` / `inst_*` / `mg_*` / `cyq_*` / `senti_*` / `sh_*` / `fund_*` / `ind_*` / `mkt_*` / `tech_*` / `cmf_*` / `zt_*`
Optional fields (auto-fallback when absent): is_st, days_since_ipo, industry_code, is_hs300, is_zz500 (time-varying by trade_date).
Three ways to get data: (1) synthetic demo (generate_synthetic.py), (2) export from your own PG warehouse (export_factor_panel.py), (3) BYO Parquet from any tool that meets the contract.
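A quick pre-flight check of the contract can catch the most common mistakes before any training starts. The sketch below works on a plain list of column names and an optional sample of `pct_chg` values, so it does not depend on any Parquet library. `check_contract` and the `abs(value) > 1.0` percent-form heuristic are illustrative assumptions, not part of the project's API.

```python
REQUIRED_COLUMNS = ("ts_code", "trade_date", "close", "pct_chg", "vol")
FACTOR_PREFIXES = (
    "alpha_", "gtja_", "mf_", "mfp_", "hm_", "hk_", "inst_", "mg_", "cyq_",
    "senti_", "sh_", "fund_", "ind_", "mkt_", "tech_", "cmf_", "zt_",
)


def check_contract(columns, pct_chg_sample=()):
    """Return a list of human-readable contract violations (empty means OK)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        problems.append(f"missing required columns: {missing}")
    # At least one column must carry a recognized factor prefix.
    if not any(c.startswith(FACTOR_PREFIXES) for c in columns):
        problems.append("no factor column with a recognized prefix")
    # Heuristic (assumption): a daily move above 100 % suggests percent form.
    if any(abs(v) > 1.0 for v in pct_chg_sample):
        problems.append("pct_chg looks like percent form (10.0); expected 0.10")
    return problems
```

For a real file, the column list could come from whatever reader you already use; only the names and a handful of `pct_chg` values are needed.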
中文. data_loader.py 通过列名前缀识别因子组,输入 Parquet 中存在的前缀就被自动加载,不存在的自动跳过。这套设计的核心是:项目本身不知道你给的是哪些因子,只要列名前缀对得上就一律纳入观测。
| 前缀 | 含义 | 推荐维度 | 输入数据要求 |
|---|---|---|---|
| `alpha_*` | WorldQuant Alpha101(项目自带 105 个实现,含 6 个自定义补充) | 105 | 日频 OHLCV + amount |
| `gtja_*` | 国泰君安 Alpha191(项目自带 191 个实现) | 191 | 日频 OHLCV + vwap + amount + 基准指数 OHLC |
| `mf_*` | Money Flow Velocity:主力资金流速(4 档累计筹码) | 14 + 6 `_log` 变体 | 4 档资金流分档 |
| `mfp_*` | Main Force Position:主力筹码持仓(与 `mf_` 互补,不要混用) | 12 | 主力净持仓时序表 |
| `hm_*` | Hot Money:主流游资席位 | 6 | 龙虎榜游资席位日成交明细 |
| `hk_*` | Northbound:北向资金真实持股 | 4 | 北向持股日表(港股通名单内) |
| `inst_*` | Institutional:龙虎榜机构净买入 | 3 | 龙虎榜机构席位明细 |
| `mg_*` | Margin:融资融券 | 3 | 融资融券日表 |
| `cyq_*` | Chip Distribution:筹码分布 | 3 | Tushare cyq_perf 表 |
| `senti_*` | Sentiment:涨停板情绪 | 3 | 涨停板池 + 热度榜 |
| `sh_*` | Shareholder:股东户数 + 大股东增减持 | 2 | 股东数据 |
| `fund_*` | Fundamentals:基本面 PE/PB/ROE/营收增速 | 4 | 基本面表 |
| `ind_*` | Industry:申万行业相对强度 | 2 | 行业指数 |
| `mkt_*` | Market:大盘 + 拥挤度 | 2 | 指数日表 |
| `tech_*` | Technical:上游算好的 MA/KDJ/MACD/Bollinger(v1.1 后 30 列) | 30 | 已 z-score 的 OHLCV 技术指标 |
| `cmf_*` | Chaikin Money Flow:60d/120d 累计资金流 | 2 | 量价资金流派生 |
| `zt_*` | 涨停板 stats:30d/60d 涨停频次、首板、连板 | 6 | 涨停板池 + 历史 |
总维度灵活:纯 Alpha101 = 105 维 / Alpha101 + GTJA191 = 296 维 / 全部 17 前缀 ≈ 360 维 / 自定义任意子集。
StockPickingConfig.n_factors 决定取前 N 个因子(按字母序),多余的丢弃,不足的报错。
English. data_loader.py recognizes factor groups by column-name prefix; whatever prefixes are present in your Parquet get loaded, whatever is absent gets skipped. The design assumes the project does not know which factors you have — match the prefix, get included as observation.
(See the Chinese table above for the 17-prefix breakdown. Total flexible: pure alpha101 = 105 dims / alpha+gtja = 296 / all 17 prefixes ≈ 360.)
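The prefix-driven selection described above, together with the `n_factors` rule (take the first N in alphabetical order, drop the rest, fail when there are too few), can be sketched in a few lines. `select_factor_columns` is a hypothetical name used for illustration; the real `data_loader.py` may structure this differently.

```python
def select_factor_columns(columns, prefixes, n_factors=None):
    """Pick factor columns by prefix, sorted alphabetically, capped at n_factors."""
    # str.startswith accepts a tuple, so one pass covers every prefix.
    present = sorted(c for c in columns if c.startswith(tuple(prefixes)))
    if n_factors is None:
        return present
    if len(present) < n_factors:
        raise ValueError(
            f"need {n_factors} factor columns, found only {len(present)}"
        )
    # Surplus columns beyond the first N are silently dropped.
    return present[:n_factors]
```

Because selection is alphabetical, adding a family whose prefix sorts early (for example `alpha_*`) can push later families out of a fixed `n_factors` budget, which is worth keeping in mind when comparing runs.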
中文. 默认 UniverseFilter.MAIN_BOARD_NON_ST 应用六道 AND 闸门:
- `data_ok`: 当日有日线数据 AND `vol > 0`(剔除停牌)
- `main_board`: `60[0135]\d{3}.SH` ∪ `00[0123]\d{3}.SZ`(剔除 300***/688***/689***/4xx/8xx/9xx)
- `listed`: `days_since_ipo ≥ 60`(新股 60 日保护)
- `not_delisted`: `stock_info.delist_date IS NULL`
- `not_st`: `is_st == False` AND stock_name 不含 "ST" / "*ST" / "退"
- `not_suspended`: 当日非停牌(`vol > 0` 也涵盖此条件)
应用顺序很重要:先 data_ok 再 main_board,避免在停牌日按板别 regex 算返回值时遇到 NaN/Null 行的 regex 失败。
如要自定义:

    from aurumq_rl.data_loader import UniverseFilter, load_panel

    # 全市场(仅排 ST + 停牌)
    panel = load_panel("data.parquet", universe_filter=UniverseFilter.ALL_NON_ST)

    # 只跑沪深 300
    panel = load_panel("data.parquet", universe_filter=UniverseFilter.HS300)

English. Default UniverseFilter.MAIN_BOARD_NON_ST applies six AND-gates: data_ok (has bar + vol > 0), main_board (regex 60[0135]\d{3}.SH ∪ 00[0123]\d{3}.SZ), listed (days_since_ipo ≥ 60), not_delisted, not_st (is_st = False AND stock_name excludes ST/*ST/退), not_suspended. Ordering matters: data_ok runs before main_board to avoid applying the regex to NaN rows. Alternative filters: ALL_NON_ST, HS300, ZZ500, or supply your own callable.
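The gate ordering can be illustrated on a single daily record. This is a row-wise sketch under assumptions, not the vectorized filter the loader actually applies: `DailyRow` and `passes_main_board_non_st` are hypothetical names, and `not_suspended` is folded into `data_ok` because both reduce to `vol > 0`.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Same pattern as the main_board gate, with the dot escaped and anchored.
MAIN_BOARD_RE = re.compile(r"^(60[0135]\d{3}\.SH|00[0123]\d{3}\.SZ)$")


@dataclass
class DailyRow:
    ts_code: str
    stock_name: str
    vol: Optional[float]
    days_since_ipo: int
    is_st: bool
    delisted: bool


def passes_main_board_non_st(row: DailyRow) -> bool:
    # data_ok comes first; `not vol > 0` also rejects NaN, unlike `vol <= 0`.
    if row.vol is None or not row.vol > 0:
        return False
    if not MAIN_BOARD_RE.match(row.ts_code):  # main_board
        return False
    if row.days_since_ipo < 60:               # listed (60-day IPO protection)
        return False
    if row.delisted:                          # not_delisted
        return False
    # not_st: flag plus name check ("*ST" is covered by the "ST" substring).
    if row.is_st or "ST" in row.stock_name or "退" in row.stock_name:
        return False
    return True
```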
中文. 仓库布局:
aurumq-rl/
├── src/aurumq_rl/
│ ├── env.py # StockPickingEnv (Gymnasium)
│ ├── gpu_env.py # GPU-vectorized env (Phase 6+)
│ ├── portfolio_weight_env.py # 连续权重组合环境(马科维茨扩展)
│ ├── data_loader.py # Parquet → numpy/cuda 面板(多前缀识别)
│ ├── policies/
│ │ ├── per_stock_encoder.py # Deep Sets 风格 per-stock 编码器
│ │ └── shared_policy.py # 早期 flat MLP(保留参考)
│ ├── rollout/
│ │ ├── gpu_rollout_buffer.py # CUDA-resident rollout buffer
│ │ └── index_only_buffer.py # 索引化观测(Phase 9+)
│ ├── inference.py # ONNX CPU 推理
│ ├── onnx_export.py # SB3 → ONNX 导出
│ ├── price_limits.py # 板别动态涨跌停
│ ├── reward_functions.py # Return / Sharpe / Sortino / Mean-Variance / MainWaveHold
│ ├── main_wave_labels.py # Phase 22 — MA5/MA10 死叉 + 5d cap 持仓回报标签
│ ├── metrics.py # 训练指标 JSONL 读写
│ ├── wandb_integration.py # 实验跟踪(默认离线)
│ ├── sb3_callbacks.py # SB3 callbacks (WandbMetricsCallback, GpuSamplerCallback, …)
│ ├── gpu_monitor.py # pynvml-based GPU 采样
│ └── factors/ # polars-native 因子库
│ ├── alpha101/ # WorldQuant Alpha101 (105 因子,10 模块)
│ ├── gtja191/ # 国泰君安 Alpha191 (191 因子,10 batch 文件)
│ ├── _ops.py # 25+ 通用算子(ts_sum / ts_corr / cs_rank / decay_linear / regbeta / ...)
│ ├── registry.py # ALPHA101_REGISTRY + GTJA191_REGISTRY
│ └── _docs.py # markdown 文档生成器
├── scripts/
│ ├── train.py # 训练入口 V1(CLI)
│ ├── train_v2.py # 训练入口 V2(Phase 21+ Dict obs,被 Phase 22 V1 main_wave 回退后保留)
│ ├── infer.py # 推理入口(CLI)
│ ├── eval_backtest.py # 测试集 IC / Sharpe / 等权净值曲线
│ ├── _eval_main_wave_v1.py # Phase 22 V1 main_wave 评估(含 hold_return / win_rate / drawdown)
│ ├── compare_rewards.py # 多 reward 类型对比训练
│ ├── export_factor_panel.py # PG → Parquet 数据抽取(含 SQL 模板)
│ ├── generate_synthetic.py # 合成 demo 数据生成
│ ├── oss_download_resumable.py # HEAD + Range-based resumable downloader
│ └── reference_data/ # alpha101 / gtja191 reference parquet 重建脚本
├── web/ # Next.js 16 dashboard(runs/ 可视化)
├── data/
│ ├── README.md # 数据格式 + 列名约定
│ └── synthetic_demo.parquet # 10 MB 开箱即跑
├── docs/
│ ├── ARCHITECTURE.md
│ ├── FACTORS.md # 因子前缀约定 + 列名规范
│ ├── TRAINING_HISTORY.md # 14 phase 完整训练栈演化(1350 行)
│ ├── factor_library/ # 296 篇因子 markdown 文档
│ ├── phase26/ # Phase 26A-G 实验报告
│ ├── SCHEMA.md
│ ├── TRAINING.md
│ └── INFERENCE.md
├── tests/ # 1386+ 测试,含因子 parity 与 docs 验证
├── handoffs/ # 跨机器(4070 ↔ ECS)交接日志
└── examples/
└── quickstart.py # 端到端示例
English. Repository layout (see Chinese tree above for full structure). Three key code surfaces:
- `src/aurumq_rl/env.py` and `gpu_env.py`: the Gymnasium `StockPickingEnv` and its later GPU-vectorized counterpart
- `src/aurumq_rl/policies/per_stock_encoder.py`: the Deep Sets-style permutation-equivariant policy that became the architectural breakthrough in Phase 5
- `src/aurumq_rl/main_wave_labels.py`: Phase 22's MA5/MA10 death-cross + 5d-cap hold-return reward that broke the 5.72 % random baseline for the first time
中文. 项目对硬件有两条红线:
- 本地 ECS(8C14G)严禁运行训练。PyTorch 安装即占 ~3 GB RSS,训练时 OOM 必杀;7-worker `ProcessPoolExecutor` 曾把整台主机 OOM-killed + 重启。训练只能在 GPU 实例(推荐本地 RTX 4070+ 或云端 RTX 4090 / A10 / V100)。
- `max_workers=3` 是硬上限(对所有 `ProcessPoolExecutor` / `ThreadPoolExecutor`),PostgreSQL `shared_buffers=2GB`,内存余量 < 4 GB 时 PG 会被 OOM。
实证训练成本(i7-13700K + RTX 4070 12 GB + 64 GB DDR5):
| 配置 | 因子数 | 训练步数 | wall time | 备注 |
|---|---|---|---|---|
| smoke (Phase 0) | 16 | 1k | 90s | 合成数据,CPU 即可 |
| Phase 1 | 16 | 100k | ~50 min | alpha101 short panel,n_envs=8, fps 333 |
| Phase 7 | 64 | 50k | ~7 min | GPU framework smoke, fps 1490 |
| Phase 10 | 64 | 1M | ~8h(隔夜) | LayerNorm + dual pooling + bf16, fps 326 |
| Phase 14 | 64 | 1M | ~6h(隔夜) | TF32 + unique-date, fps 460 |
| Phase 16a (prod) | 343 | 300k | ~5h | 6 seeds 并行外推 |
| Phase 22 (main_wave) | 343 | 300k | ~8h | 3-run 隔夜对照 (A/B/C) |
| Phase 26F-v3 (prod) | 361 | 300k | ~5h | 3 seeds × 1 config |
English. Two hard rules:
- The local 8-core 14 GB ECS is FORBIDDEN for training. PyTorch installation alone occupies ~3 GB RSS; training will OOM-kill the host. A 7-worker `ProcessPoolExecutor` once OOM-killed and rebooted the box. Train only on a GPU instance (local RTX 4070+ or cloud RTX 4090 / A10 / V100).
- `max_workers=3` is a hard ceiling for all `ProcessPoolExecutor` / `ThreadPoolExecutor`. PostgreSQL `shared_buffers=2GB`; PG OOMs when host free RAM < 4 GB.
Measured training cost on i7-13700K + RTX 4070 12 GB + 64 GB DDR5 (see Chinese table above for the 8-config breakdown spanning smoke runs to overnight 1M-step phases).
中文. src/aurumq_rl/factors/ 是 polars-native 实现的 296 个量价因子(105 alpha101 + 191 gtja191)。每个因子一篇 markdown 文档在 docs/factor_library/,含原文公式 + Polars 实现说明 + 引用。
| Family | 实现 | quality_flag=0 (clean) | =1 (best-effort) | =2 (stub) |
|---|---|---|---|---|
| alpha101 | 101/101 + 6 自定义 | 88 | 13 | 0 |
| gtja191 | 191/191 | 177 | 12 | 2 |
quality_flag 语义:
- 0 (clean):完整 + 数值稳定 + 跨平台 parity(如 alpha_001、gtja_159)
- 1 (best-effort):实现合理但存在已知边界情况(如 alpha_017 在窗口=2 时 std=0 触发 NaN,已用 `fill_null` 处理但未触发 inf)
- 2 (stub):实现存在但等价于占位(如 gtja_115/189 没有可靠数据源对应的 sd_pe_ttm 字段)
注册表用法:

    import polars as pl

    import aurumq_rl.factors.alpha101  # registers 107
    import aurumq_rl.factors.gtja191   # registers 191
    from aurumq_rl.factors.registry import ALPHA101_REGISTRY, GTJA191_REGISTRY

    panel = pl.read_parquet("ohlcv.parquet")  # 需含 OHLCV + vwap + amount
    df = panel.with_columns([fn(panel).alias(name) for name, fn in ALPHA101_REGISTRY.items()])

通用算子 `_ops.py` 提供 25+ 个跨家族复用的算子:ts_sum / ts_corr / cs_rank / decay_linear / regbeta / ts_argmax / ts_argmin / ts_min / ts_max / ts_rank / ts_delta / ts_delay / ts_std / ts_skew / ts_kurt / ind_neutralize / scale / signed_power / sign,所有都是 polars expr-aware,可在懒求值 graph 里组合。
English. src/aurumq_rl/factors/ ships 296 polars-native price-volume factors (105 alpha101 + 191 gtja191), one markdown doc per factor under docs/factor_library/ with original formula + Polars implementation notes + references. quality_flag ∈ {0 clean, 1 best-effort, 2 stub} per the Chinese table above. The common-operator module _ops.py provides 25+ Polars-aware operators (ts_sum, ts_corr, cs_rank, decay_linear, regbeta, …) that compose lazily.
中文. 这是项目 11 个 A 股私有因子族,必须从用户自己的数据仓库算好后写进 Parquet。它们对应的中国市场原始数据源在欧美市场没有等价物:
`mf_*`(Money Flow Velocity)由用户上游 `scripts/compute_mf_panel.py` 输出,14 个基础列 + 6 个 `_log` 变体。例:

- `mf_net_{1d,3d,5d,10d,20d,60d}`:主力净流入累计(元)
- `mf_buy_share_main`:主力买入占比(SHAP rank 7,Path 4 模型里第 7 重要的特征)
- `mf_net_accel_5_20`:5d / 20d 流入加速度
- `mf_net_5d_amount_ratio`:5d 净流入 / 5d 成交额
- `mf_net_{1d,3d,5d,10d,20d,60d}_log`:sign-preserving log1p 变体(2026-05-08 加入),公式 `sign(x) · ln(1 + |x|/total_amount)`,把原始 std=1.5×10⁸ 的"元"量级压到 std=0.040 的"无量纲"量级
HUGE_TAIL 事故:原始 mf_net_*d 标准差从 1e8 到 1e10,跨 ts_code 的尺度差异极大(一只大盘股一天净流入 10 亿元,一只小盘股 1 万元)。Phase 24 在 data_loader._cross_section_zscore 里 z-score 之后仍然有量级,因为 polars 默认 ddof=1 在 3000-stock 截面下分母被极端值拉爆。修复:上游加 _log 变体后训练直接吃压缩量级,跨年泛化恢复。
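The sign-preserving log transform quoted above, `sign(x) · ln(1 + |x|/total_amount)`, is short enough to write out directly. This is a scalar sketch: `signed_log_ratio` is a hypothetical name, and returning 0.0 when `total_amount` is not positive is an assumption for illustration (the source does not say how that case is handled upstream).

```python
import math


def signed_log_ratio(net_flow: float, total_amount: float) -> float:
    """sign(x) * ln(1 + |x| / total_amount), a dimensionless compressed flow."""
    if total_amount <= 0:
        return 0.0  # assumption: treat a zero-turnover day as "no signal"
    magnitude = math.log1p(abs(net_flow) / total_amount)
    # copysign restores the direction of the original flow.
    return math.copysign(magnitude, net_flow)
```

Dividing by `total_amount` removes the yuan scale before the logarithm is taken, so a large-cap and a small-cap with the same relative flow map to the same value.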
`mfp_*`(Main Force Position)由 `src/aurumq/factors/main_force.py` 输出,与 `mf_*` 互补但完全独立:

- `mfp_elg_buy_ratio_20d`:超大单买入占比 20 日
- `mfp_lg_buy_ratio_20d`:大单买入占比 20 日
- `mfp_main_net_cum_pct`:主力净流入累计百分位
- `mfp_main_net_volatility_20d`:主力净流入波动 20 日
Phase 16 关键事故:mfp_ 前缀曾被静默从 aurumq_rl.data_loader.FACTOR_COL_PREFIXES 漏掉,12 列输入完全没进训练。修复后 16a 跑出 adj Sharpe +1.593(vs 之前 plateau +1.165)。教训:前缀注册表是 single source of truth,prefix-glob 漏一个前缀等于静默丢一族特征。
`cyq_*`(Chip Distribution)源自 Tushare cyq_perf 表,3 列:

- `cyq_winning_ratio`:当前价位上的获利筹码占比
- `cyq_concentration_70`:70% 筹码集中在多少价位区间
- `cyq_cost_distance`:当前价 vs 平均成本距离
Phase 26 关键事故:cyq 是 A 股独有的因子(券商内部模型 + Tushare 加工),但 Tushare 历史只能回填到 2025-10-20,更早数据是 cyq_perf v1.0 用合成方法补的。结果:训练集(含合成回填)cyq_cost_distance std = 0.197;OOS 集(全部实采)std = 0.066,3× 压缩。跨截面 z-score 不抹平这种 mid-stream regime shift —— 模型学到的是合成数据的尺度,到真实数据上全错。Phase 26C2 切换到 v1.2 修正 cyq(bulk API 重新回填)后,T-1 lift 从 1.47× 反弹到 2.61×(甚至超过原始 23A 的 2.38×,且收敛快 4 倍)。
`hm_*`(Hot Money)源自 Tushare top_list + top_inst:

- `hm_net_5d` / `hm_net_20d` / `hm_net_60d`:游资席位累计净买入
- `hm_recent_active` / `hm_seat_count_30d` / `hm_top3_concentration`:活跃席位 / 30 日席位数 / 前 3 名集中度
结构性 hard wall:Tushare 龙虎榜数据 ≥ 2023-08-16 才存在,2018-2023.8 的 hm_* 永远是 NULL。Phase 20 长 panel 训练时 LightGBM 的 use_missing=True 自动处理,但 RL 训练时观测向量必须填 0(不能 NaN)。修复:data_loader._fill_missing_with_zero_track_mask 同时填 0 + 写 mask,模型可选择性地学到"这一列 mask=1 时无效"。
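The zero-fill-plus-mask idea described above can be sketched on a single column. The function below is illustrative only: `fill_zero_with_mask` is a hypothetical stand-in for `data_loader._fill_missing_with_zero_track_mask`, whose real signature is not shown in this document. It follows the convention stated above that `mask = 1` marks an invalid (originally missing) cell.

```python
import math
from typing import Optional, Sequence


def fill_zero_with_mask(
    values: Sequence[Optional[float]],
) -> tuple[list[float], list[int]]:
    """Replace None/NaN with 0.0 and return a parallel 0/1 missingness mask."""
    filled: list[float] = []
    mask: list[int] = []
    for v in values:
        missing = v is None or math.isnan(v)
        filled.append(0.0 if missing else float(v))
        mask.append(1 if missing else 0)  # 1 = "this cell carried no data"
    return filled, mask
```

Feeding the mask alongside the filled values lets the network tell a genuine zero net-buy apart from the structurally empty period before 2023-08-16.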
hk_hold_chg_60d 等,SHAP rank 16。结构性 null:港股通名单外的 25% A 股永远是 NULL。同样 _fill_missing_with_zero_track_mask 处理。
inst_appear_count_60d / inst_net_30d — 龙虎榜机构席位活跃度。
mg_short_chg_20d / mg_balance_pct 等。78% A 股有融资融券覆盖。
涨停板池 + 同花顺热度榜派生。已知问题:senti_ths_hot_pct 99% null,因为同花顺热度榜只追踪 ~3000 只热门股;非热门股在 2024-08-29 之前完全没有数据。Phase 26 数据质量审计把 senti_ths_hot_pct 列入 include_columns_v1_clean.txt 的永久排除清单。
sh_holder_num_chg_30d — 股东户数变化。86% null(季度披露,日频面板上稀疏)。
fund_pe_ttm / fund_pb / fund_roe_ttm / fund_revenue_growth。SHAP rank 11 (pe_ttm) / rank 25 (roe_ttm)。
事故:688*** 科创板的 fund_pe_ttm 在 2025-08 之前缺失约 600 只 × 每天的 hole,因为 Tushare daily_basic 接口对科创板支持不全。2026-05-08 批次用 bulk daily_basic 回填了所有 600+ × 历史日期。
ind_relative_strength_20d / ind_relative_strength_60d — 个股 vs 申万一级行业指数收益差。49-57% null(sw_index_member 表只覆盖约 3000 只主板成分股)。
mkt_index_pct_chg_5d / mkt_index_volatility_20d — 上证指数派生。Phase 16 关键发现:drop mkt_* 组反而 +0.428 lift!原因:mkt_* 在主板宇宙下高度共线(所有股票同一个上证指数派生量),模型用它做"今天大盘涨/跌"的偷懒预测,反而损害了选股能力。永久从主流配置移除。
Phase 26 新增。tech_* 30 列 = MA5/10/20/60 比值 + KDJ 派生 + MACD 派生 + Bollinger 派生 + ATR 派生 + 振幅。cmf_* 2 列 = Chaikin Money Flow 60d/120d。zt_* 6 列 = 涨停板 30d/60d 频次 + 首板/连板/最长连板。
Phase 24/25 重大事故:24A 把 36 个技术因子直接接在 RL 训练 panel-load 时算(而不是上游 parquet),结果 T-1 hit 从 2.11% 跌到 0.40%(lift 0.45× < 随机 0.89×)。根因:KDJ/振幅近似自 close-only(panel 没有 OHLC),MA-cross / golden-cross 是二值事件标志,进 LayerNorm 后 z-score 把 binary 0/1 拉成极端 outlier,污染了梯度。Phase 26F 修复:把二值事件改为指数衰减编码 `evt(t) = Σ_{s ∈ (t−10d, t]} 1[event at s] · exp(−(t − s)/τ)`,其中 s 为事件发生日、τ = 10d 为衰减常数,T-1 hit 从 1.13× 反弹到 2.27×(best 2.41% hit at step 50k)。教训见 §7。
English summary. The 11 private A-share factor families are: mf_* (Money Flow Velocity, 14 base + 6 sign-preserving log variants — fixes the 1e8-yuan HUGE_TAIL scale issue); mfp_* (Main Force Position, 12 cols, independent of mf_* despite the similar prefix — Phase 16 found mfp_ was silently missing from FACTOR_COL_PREFIXES); cyq_* (chip distribution, 3 cols — Phase 26C2 v1.2 fix recovered T-1 lift from 1.47× to 2.61× by replacing the synthetic-backfill v1.0 with a bulk-API-recomputed v1.2); hm_* (Dragon-Tiger hot-money seats, 6 cols, structural null pre-2023-08-16); hk_* (Northbound, 4 cols, structural null for 25 % non-HK-Stock-Connect stocks); inst_* (institutional, 3); mg_* (margin trading, 3); senti_* (sentiment, 3, 99 % null for non-hot stocks); sh_* (shareholder, 2, 86 % null due to quarterly disclosure); fund_* (PE/PB/ROE/revenue growth, 4 — SHAP rank 11/25); ind_* (SW industry relative strength, 2); mkt_* (market index, 2 — dropped permanently in Phase 16 because removing it gave +0.428 adj Sharpe lift). Phase 26 added tech_* (30), cmf_* (2), zt_* (6) — note that binary event flags must be exp-decay encoded (τ=10d) not raw 0/1 to avoid the −33 % regression seen in Phase 24A.
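The exp-decay event encoding mentioned above replaces a raw 0/1 flag with a sum of decaying contributions from events in the last 10 trading days. The sketch below operates on a per-stock list of daily flags; `exp_decay_event` is a hypothetical name, and using a 10-day lookback window together with τ = 10 is read from the Phase 26F description rather than from the actual implementation.

```python
import math
from typing import Sequence


def exp_decay_event(
    flags: Sequence[int], tau: float = 10.0, window: int = 10
) -> list[float]:
    """evt(t) = sum over events at day s in (t - window, t] of exp(-(t - s) / tau)."""
    out: list[float] = []
    for t in range(len(flags)):
        total = 0.0
        for age in range(window):  # age = t - s, from 0 to window - 1
            s = t - age
            if s < 0:
                break  # ran off the start of the series
            if flags[s]:
                total += math.exp(-age / tau)
        out.append(total)
    return out
```

The encoded value is 1.0 on the event day and fades smoothly afterwards, so normalization layers see a continuous feature instead of a sparse spike.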
中文. aurumq_rl.data_loader.FACTOR_COL_PREFIXES 是 single source of truth。当前规范清单:

    FACTOR_COL_PREFIXES = (
        "alpha_", "mf_", "mfp_", "hm_", "hk_", "inst_", "mg_", "senti_",
        "sh_", "fund_", "ind_", "cyq_", "gtja_",
        "tech_", "cmf_", "zt_",
    )

漏一个前缀 = 静默丢一族特征 + Phase 16 复现。所有 PR 修改这个 tuple 必须同步更新:

- `tests/test_data_loader.py:test_factor_col_prefixes_lockdown`:字典对比 + 顺序对比
- `scripts/export_factor_panel.py:FACTOR_PREFIXES`:PG 抽取脚本的镜像列表
- `docs/FACTORS.md`:表格 + 列名规范文档
English. aurumq_rl.data_loader.FACTOR_COL_PREFIXES is the single source of truth (the tuple shown above lists 16 prefixes). Missing one = silently lose a factor family = Phase 16 reproduction. Any PR that changes the tuple must update three other sites in the same change: the tests/test_data_loader.py lockdown test, the FACTOR_PREFIXES mirror in scripts/export_factor_panel.py, and docs/FACTORS.md.
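A drift check between the primary tuple and a mirror list is one way to make the sync rule mechanical. The contents of the real lockdown test are not shown here, so the following is only a sketch: `prefix_drift` is a hypothetical helper, and the mirror in the usage example is deliberately missing `mfp_` to imitate the Phase 16 incident.

```python
def prefix_drift(primary, mirror):
    """Report how a mirror prefix list differs from the primary tuple."""
    primary_set, mirror_set = set(primary), set(mirror)
    return {
        "missing_in_mirror": sorted(primary_set - mirror_set),
        "extra_in_mirror": sorted(mirror_set - primary_set),
        # The lockdown test is described as comparing order as well.
        "order_matches": tuple(primary) == tuple(mirror),
    }


PRIMARY = (
    "alpha_", "mf_", "mfp_", "hm_", "hk_", "inst_", "mg_", "senti_",
    "sh_", "fund_", "ind_", "cyq_", "gtja_", "tech_", "cmf_", "zt_",
)
# Simulated mirror with one family silently dropped.
MIRROR = tuple(p for p in PRIMARY if p != "mfp_")
report = prefix_drift(PRIMARY, MIRROR)
```

A test would then assert that both difference lists are empty and that `order_matches` is true, failing loudly instead of training without a family.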
中文. 见 handoffs/2026-05-10-sl-extras/shap_audit/(paris 侧执行):
- 方法:`shap.TreeExplainer` 跑在 Path 4 最佳 LightGBM 单模(nl31_lr050_mdl50_seed44),10k 行 VAL_EFF 数据,按 `mean(|SHAP|)` 排名。
- Top 5 surprises:`gtja_159` 一骑绝尘(mean|SHAP|=0.001270,gain 28.7%,162 个分裂点)、`gtja_158`、`gtja_065`、`gtja_140`、`gtja_181`。资金流第 7 名 `mf_buy_share_main`,基本面第 11 名 `fund_pe_ttm`,北向第 16 名 `hk_hold_chg_60d`。
- 剪枝规则:`mean(|SHAP|) < 1e-6` 视为零贡献 → 119 个候选 → 保存到 `drop_candidates.json`。被剪掉的例子:alpha_098、gtja_054、gtja_101、gtja_190、alpha_002、gtja_001、mf_net_accel_5_20、mf_net_60d、gtja_114、inst_net_30d。
- Path 6 验证(Bayesian opt 50 trials):226 列训出来 H1 校准 mean_y = +0.028265 vs Path 4 全 345 列的 +0.028483,Δ = −0.0002(与噪声不可分辨)。bundle 从 40 MB 缩到 32 MB,训练时间 −15%。结论:超参数搜索已 saturate,剪枝是免费午餐。
English. SHAP-based feature pruning ran on the best Path 4 LightGBM single model (nl31_lr050_mdl50_seed44) over 10k VAL_EFF rows. mean(|SHAP|) < 1e-6 ⇒ drop candidate ⇒ 119 columns saved to drop_candidates.json. Validating on Path 6 (226 cols + 50 Bayesian-opt trials): H1 calibrated mean_y +0.028265 vs full-345 Path 4 +0.028483 = −0.0002 (indistinguishable from noise). Bundle 40 MB → 32 MB, training −15 %. Lesson: hyperparameter search is saturated; SHAP pruning is a free lunch.
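The pruning rule itself is a one-line threshold once the per-feature `mean(|SHAP|)` values exist. The sketch below assumes those values have already been computed and collected into a dict; `drop_candidates` is a hypothetical function name, and apart from the `gtja_159` value quoted above, the numbers in the usage example are made up purely to exercise the threshold.

```python
import json


def drop_candidates(mean_abs_shap: dict[str, float], threshold: float = 1e-6) -> list[str]:
    """Return feature names whose mean |SHAP| falls below the threshold."""
    return sorted(name for name, value in mean_abs_shap.items() if value < threshold)


# Illustrative values only (gtja_159 is the figure quoted in the audit).
scores = {
    "gtja_159": 0.001270,
    "mf_buy_share_main": 2.0e-4,
    "alpha_098": 3.0e-7,
    "gtja_054": 0.0,
}
candidates = drop_candidates(scores)
payload = json.dumps({"threshold": 1e-6, "drop": candidates}, indent=2)
```

Writing `payload` to a file gives an auditable artifact in the spirit of `drop_candidates.json`; the exact schema of the real file is not documented here.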
中文. 因子 parquet 按家族 × 年份分片:
| 路径 | 内容 |
|---|---|
| `data/duckdb/factor_eval/alpha_panel_year=YYYY.parquet` | alpha101 全族(109 cols) |
| `data/duckdb/factor_eval/gtja_panel_year=YYYY.parquet` | gtja191 全族(193 cols,单年最大 1.37 GB) |
| `data/duckdb/factor_eval/mf_panel_year=YYYY.parquet` | mf_ 22 cols (14 + 6 log + 2 helper) |
| `data/duckdb/factor_eval/cyq_panel/year=YYYY.parquet` | canonical cyq 3 cols(v1.2 修正版) |
| `data/duckdb/factor_eval/tech_panel/year=YYYY.parquet` | tech_ 30 cols |
| `data/duckdb/factor_eval/tech_event_panel/year=YYYY.parquet` | tech_evt_ 8 cols(含 exp-decay 编码) |
| `data/duckdb/quotes_enriched/year=YYYY.parquet` | 11 个内部 enriched 家族 mfp_/hm_/hk_/inst_/mg_/senti_/sh_/fund_/ind_/mkt_/cyq_ legacy |
流式 concat 红线:14 GB ECS 上禁止跑 pl.concat(diagonal_relaxed) + sink_parquet 拼 10 年面板,会 OOM-killed 主机。正确做法是先 shard,然后 pl.scan_parquet([shards], missing_columns="insert") 流式扫,见 scripts/build_combined_panel_safe.py。
English. Factor parquets are sharded by family × year (see Chinese table). Streaming red line: on the 14 GB ECS, pl.concat(diagonal_relaxed) + sink_parquet over 10 years of panel data will OOM-kill the host. Correct pattern: shard first, then pl.scan_parquet([shards], missing_columns="insert") streaming scan. See scripts/build_combined_panel_safe.py.
中文. 本节是 GPU 训练栈本身的工程史 —— 从最初 11% GPU 利用率到 1M-step 隔夜训练的所有 stack diff、bug、消融。模型/数据/奖励的实验史在 §5(Phase 15-26)。两个 phase 编号体系独立:本节 Phase 0-14 是「框架建设」,§5 Phase 15-26 是「在已建好的框架上跑模型实验」。
完整的逐 phase 记录在 docs/TRAINING_HISTORY.md(1350 行),本节是其压缩版。
English. This section is the engineering history of the training stack itself — every stack diff, bug, and ablation from the initial 11 % GPU utilization to the 1M-step overnight training. The modeling / data / reward experiments live in §5 (Phases 15–26). The two numbering systems are independent: Phases 0–14 here are "framework construction"; §5 Phases 15–26 are "model experiments on top of the built framework".
The full per-phase record is in docs/TRAINING_HISTORY.md (1350 lines). This section is its compressed version.
Phase 0: Synthetic pipeline-up (~pre-2026-04-29)
Goal. Prove the parquet → env → SB3 PPO → ONNX → backtest → JSON link before touching real data.
Stack. StockPickingEnv (numpy panel, single-process), SB3 default MlpPolicy net_arch=[64,64], PPO n_envs=1 batch=64 n_steps=2048 epochs=10, synthetic_demo.parquet (~200 SYN-coded fake stocks).
Bugs surfaced (4 in PR #1 / #2):
| # | Bug | Fix |
|---|---|---|
| 1 | `gymnasium` not always installed | lazy import + placeholder raises `ImportError` |
| 2 | ONNX export device mismatch (CUDA policy + CPU dummy_obs) | move policy to CPU before export |
| 3 | `torch.onnx` `dynamo=True` breaks SB3 Normal distribution | pass `dynamo=False` |
| 4 | JSON serializer can't handle `numpy.float32` | `WandbMetricsCallback._append_jsonl` got `default=_json_default` |
Outcome. Pipeline end-to-end. ONNX exported. Backtest IC ≈ 0 (synthetic noise, expected).
Goal. Scale to a real factor panel and a real GPU. Run a 100k-step PPO on alpha101 to see how far naive setup goes.
Data. factor_panel_alpha101_short_2023_2026.parquet — 105 alpha cols, 5743 stocks, 2023-01..2026-04. After main_board_non_st: 3043 × 800 × 105.
Config. PPO --total-timesteps 100000 --n-envs 8 --vec-normalize --learning-rate 3e-4 --target-kl 0.05 --max-grad-norm 0.3 net_arch=[64,64] n_factors=16.
Bugs (8 new):
- NaN propagating through cross-section z-score. Real PG data has NaN cells for suspended / pre-IPO stocks; synthetic data didn't.
- OOS obs_dim mismatch. Training universe = 3043, OOS = 3052 because some IPOs landed; the env's observation_space is fixed at training time. Fix = align_panel_to_stock_list persisted in metadata.json.
- PPO approx_kl=41,820 on first update (12.5M-param first layer + 48,688-dim obs). Fix = target_kl=0.05 + max_grad_norm=0.3 → approx_kl=0.028.
- mean_fps=0 in summary. SB3 only emits time/fps on rollout-summary frames. Fix = callback computes wall-time fps.
- metrics_summary all-null. The callback wrote raw SB3 keys; summarize_metrics() expects the canonical schema. Fix = raw→canonical mapping at write time.
- runs/ gitignore unanchored. web/app/runs/ was silently dropped. Fix = /runs/ anchored at root.
- alpha045 STHSF parity 44 % mismatch on Windows only. scipy rank-tie-break is unstable across versions on the 10-stock synthetic set. Fix = @pytest.mark.xfail(strict=False).
- OSS admin AK disabled mid-flight. Switched to wepa AK.
Outcome. 100k-step PPO ran clean. fps ~ 333. GPU util ~ 11 %. The 4070 was massively underutilized — wide first-layer of [64,64] was only 3 M params; GPU spent most of its time waiting on CPU rollouts.
Data added. factor_panel_combined_short_2023_2026.parquet — 355 factor cols (105 alpha + 191 gtja + 14 mf + 12 mfp + 5 hk + 4 fund + 3 inst + 3 mg + 3 senti + 2 sh + 2 ind + 2 mkt + 6 hm + 3 cyq), 5643 stocks × 800 dates, 7.7 GB zstd. After main-board filter: 3014 × 600.
Code added. --policy-kwargs-json CLI accepts {"net_arch":[2048,1024,512], "activation_fn":"relu"}. --feature-group-weights-json accepts e.g. {"alpha_*":2.0, "mf_*":0.5}, applied after z-score in _apply_feature_group_weights so VecNormalize doesn't neutralize the weights.
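A minimal sketch of how glob-style group weights could be expanded into one multiplier per factor column and applied after z-scoring. The helper names, the example column names, and the "first matching pattern wins" rule are assumptions for illustration; the repo's _apply_feature_group_weights may resolve patterns differently.

```python
from fnmatch import fnmatchcase

def feature_group_weight_vector(columns, group_weights, default=1.0):
    """Return one multiplier per column; the first matching glob pattern wins."""
    weights = []
    for col in columns:
        weight = default
        for pattern, value in group_weights.items():
            if fnmatchcase(col, pattern):
                weight = float(value)
                break
        weights.append(weight)
    return weights

def apply_weights(zscored_row, weights):
    """Scale an already z-scored factor row column by column."""
    return [x * w for x, w in zip(zscored_row, weights)]

# Hypothetical column names, for illustration only.
columns = ["alpha_001", "alpha_002", "mf_net_inflow", "gtja_017"]
weights = feature_group_weight_vector(columns, {"alpha_*": 2.0, "mf_*": 0.5})
scaled = apply_weights([1.0, -1.0, 2.0, 0.5], weights)
```

Because the multiplication happens on values that are already z-scored, a later per-feature renormalization step would be the only thing that could undo the weighting, which is why the ordering matters.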
Network widening. net_arch=[2048,1024,512]. First-layer params for n_factors=64: 3014 × 64 × 2048 ≈ 395 M. GPU memory 3 GB → 12 GB peak; util peak 11 % → 57 %.
3-way alpha-prefix ablation:
| Run | --feature-group-weights-json | OOS IC | OOS top30 Sharpe |
|---|---|---|---|
| ablation_alpha_w0_5 | {"alpha_*":0.5} | (~0) | (~random) |
| ablation_alpha_w1_0 | {"alpha_*":1.0} (no-op baseline) | (~0) | (~random) |
| ablation_alpha_w2_0 | {"alpha_*":2.0} | +0.0006 | −0.807 (random p50 −0.482) |
Decision. Framework works end-to-end. Numbers are noise at 15k steps. Validation passed; promote feature_group_weights as load-bearing CLI feature.
Round 1 (R1). --policy-kwargs-json '{"net_arch":[1024,512,256]}' + --feature-group-weights '{"alpha_*":2.0, "mf_*":1.5, "gtja_*":1.0}' at 50k steps, n_envs=12, n_steps=2048, target_kl=0.05. First model with explained_variance climbing. value_loss from 1.5e-2 → 4.3e-3 over 22 rollouts. OOS IC = +0.011, top30 Sharpe = +1.42 (random p50 −0.48, vs-p50 +1.90). GPU util 35 %, fps 312.
Round 2 (R2). Three changes at once: target_kl 0.05 → 0.10, n_envs 12 → 14, n_steps 2048 → 4096. First attempt OOMed (MemoryError allocating 8.83 GiB rollout buffer); reduced n_steps 4096 → 1024 and ran again. OOS top30 Sharpe = +2.16, which is +0.74 above R1. Convergence accelerated (explained_var 0.93 at 30k vs R1's 0.81). But three changes at once = uninterpretable: the lift could come from KL relaxation, env parallelism, or buffer length. Lesson recorded as Phase 3's central rule: one change per round.
Round 3 (R3). Just target_kl 0.10 + learning_rate 3e-4 → 1e-4 anneal. OOS top30 Sharpe = +1.89 (R2 = +2.16, R1 = +1.42). Within seed variance of R2; suggests target_kl accounts for most of R2's lift, but n_envs / n_steps cannot be cleanly attributed.
Lesson. OOS Sharpe at 50k steps is noise. Don't pick winners from smokes; pick them from convergence-scale runs (≥ 1M steps, ideally 5M). We burned three rounds arguing about the R1/R2/R3 ranking before admitting the differences were within seed variance.
Goal. Find the n_envs ceiling.
Method. Smoke-grid n_envs ∈ {12, 14, 16, 18, 20} at fixed n_steps=1024.
| n_envs | fps | GPU util | Outcome |
|---|---|---|---|
| 12 | 314 | 35 % | R1 baseline |
| 14 | 366 | 41 % | linear scale |
| 16 | 412 | 47 % | linear scale |
| 18 | 455 | 53 % | linear scale starting to bend |
| 20 | 458 | 56 % | bent (IPC bottleneck) |
Realization. Above n_envs=18, adding envs doesn't proportionally raise fps because the bottleneck is Python IPC between the worker subprocs and the central learner, not GPU compute. In addition, n_envs=20 OOMed on rollout buffer alloc (14.7 GiB). Back to n_envs=12 for safety.
The IPC ceiling discovery changed the project direction. The classic SB3 setup (numpy panel + subproc envs + CPU rollouts → GPU train) is fundamentally CPU-rollout-bound. The 4070 was sitting at 56 % util at the bottleneck. To break through, we'd need to move rollouts onto the GPU itself. → Phase 5 / 6.
Phase 5 — 四个 realization 推出 GPU-框架重构 / Four Realizations Driving GPU-Framework Redesign (2026-05-01 afternoon)
Goal. Sit down, look at the data, decide whether to keep tuning or fundamentally redesign.
Four realizations:
- Brute-force capacity is a trap. 395 M-param flat MLP needed 12 GB VRAM, fps capped at 458, and didn't beat the per-stock symmetry prior. A symmetry-correct architecture (per-stock encoder, ~50 k params shared across all stocks) has dramatically more inductive bias for stock-picking AND is faster.
- The numpy panel is the wrong abstraction. Re-uploading the same panel to GPU memory every env reset is wasteful. The panel should live in GPU memory throughout training, indexed by env step.
- Observations should be indices, not values. Stock factor vectors don't change across env steps — only which date the env is at changes. Send (date_idx, stock_codes_idx) over the IPC boundary, do the GPU-side gather afterwards.
- VRAM and RAM are different. Confused them once: GPU showed 12 GB used, my Python proc was only 3.8 GB RSS. Always check nvidia-smi --query-gpu=memory.used per-process AND host Get-Process -RSS.
Decision. Design a GPU-vectorized framework: per-stock encoder + CUDA-resident panel + index-only observations + dual pooling head. Implementation in Phase 6/7.
Design (Phase 6, designed afternoon; built in Phase 7).
- gpu_env.py: Gymnasium-compatible env that holds the entire panel as a CUDA tensor panel[n_dates, n_stocks, n_factors]. Step = advance one date; obs = panel[date_idx] slice + holdings mask. n_envs=16 in a single proc (vectorized across env axis on GPU, no IPC).
- PerStockEncoderPolicy (Deep Sets): apply the SAME MLP to each stock's factor row, then aggregate with mean+max dual pooling. Permutation-equivariant. Net_arch=[64, 32] per-stock, then a [64, 32] head; only ~50 k params total, shared across 3014 stocks.
- Action: scaled tanh on per-stock logits, top-K selection.
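The Deep Sets idea above can be sketched in a few lines of NumPy: one shared MLP per stock row, mean+max pooling for market context, and a head that emits one logit per stock. This is an illustrative sketch, not the repo's PerStockEncoderPolicy; in particular, how the pooled vector is fed back into the head, the toy sizes, and the random weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random (weight, bias) pairs for an MLP with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def relu_mlp(x, layers):
    """Apply the same ReLU MLP to every row of x."""
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x

def per_stock_logits(obs, encoder, head):
    """obs: [n_stocks, n_factors] -> one logit per stock."""
    h = relu_mlp(obs, encoder)                                 # [n_stocks, 32]
    pooled = np.concatenate([h.mean(axis=0), h.max(axis=0)])   # [64] market context
    context = np.broadcast_to(pooled, (h.shape[0], pooled.size))
    z = np.concatenate([h, context], axis=1)                   # [n_stocks, 96]
    hidden = relu_mlp(z, head[:-1])
    w_out, b_out = head[-1]                                    # final layer is linear
    return (hidden @ w_out + b_out).ravel()

def top_k(logits, k):
    """Indices of the k largest tanh-squashed scores."""
    return np.argsort(-np.tanh(logits))[:k]

n_stocks, n_factors = 12, 8            # toy sizes, not the real panel
encoder = init_mlp([n_factors, 64, 32])
head = init_mlp([96, 64, 32, 1])
obs = rng.normal(size=(n_stocks, n_factors))
perm = rng.permutation(n_stocks)
logits = per_stock_logits(obs, encoder, head)
logits_perm = per_stock_logits(obs[perm], encoder, head)
picks = top_k(logits, 3)
```

Shuffling the stock rows shuffles the logits in the same way, which is the permutation-equivariance property; the parameter count depends only on the layer sizes, not on the number of stocks.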
Smoke run (Phase 7). 50k steps, n_envs=16, n_steps=1024. fps 1490 (vs Phase 4 ceiling 458, a 3.25× lift). GPU util peak 78 %, mean 62 %. OOS IC +0.014, top30 Sharpe +1.78, within noise of Phase 3 R3's +1.89, with convergence reached at 30k steps vs R3's 50k.
Result. The GPU framework matches the previous best OOS Sharpe at roughly one-third the wall time. The redesign was worth it.
Bottleneck. Even with GPU env, rollout buffer was numpy on host. Every PPO update did host→device copies for each minibatch.
Fix. Wrote gpu_rollout_buffer.py: holds obs / actions / values / log_probs / rewards / advantages / returns as CUDA tensors. PPO update reads directly from device memory — zero copies.
Outcome. fps 1490 → 1820. GPU util mean 62 % → 71 %. VRAM +1.2 GB (acceptable).
Bottleneck identified. Even with cuda-resident buffer, each entry stored obs = (n_stocks × n_factors) float32 = 3014 × 64 × 4 = 770 KB per step. At n_envs=16, n_steps=1024: 16 × 1024 × 770 KB = 12.6 GiB rollout buffer. We were paying for storing the entire factor cross-section in memory N_envs × N_steps times.
Insight. All observations are slices of the same panel. Store only the date index (4 bytes) and gather on-the-fly during minibatch.
Fix. IndexOnlyRolloutBuffer: stores date_idx[n_envs, n_steps] int32 (~64 KB) + holdings_mask[...] bool (~250 KB). Gather panel[date_idx[batch]] at minibatch read time. Minibatch contents are numerically identical to the dense buffer; memory is 200× smaller.
Outcome. Rollout buffer 12.6 GiB → 0.06 GiB. Freed VRAM lets us raise n_envs=16 → 32 and n_steps=1024 → 2048 within the same 12 GB budget. fps 1820 → 2050. GPU util peak 88 %.
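A toy NumPy sketch of the index-only idea: keep one shared panel, store a date index per (env, step), and rebuild observations only when a minibatch is read. Sizes and names here are illustrative assumptions; the holdings mask and the CUDA residency of the real IndexOnlyRolloutBuffer are left out.

```python
import numpy as np

n_dates, n_stocks, n_factors = 50, 30, 4   # toy sizes, not the real panel
n_envs, n_steps = 4, 16

rng = np.random.default_rng(0)
panel = rng.normal(size=(n_dates, n_stocks, n_factors)).astype(np.float32)

# Dense buffer: one full float32 cross-section per (env, step).
dense_bytes = n_envs * n_steps * n_stocks * n_factors * 4

# Index-only buffer: one int32 date index per (env, step).
date_idx = rng.integers(0, n_dates, size=(n_envs, n_steps), dtype=np.int32)
index_bytes = date_idx.nbytes

def gather_minibatch(flat_positions):
    """Rebuild observations for a minibatch from the shared panel."""
    idx = date_idx.reshape(-1)[flat_positions]
    return panel[idx]                       # [batch, n_stocks, n_factors]

batch = gather_minibatch(np.array([0, 5, 63]))
```

The saving on the observation part is exactly n_stocks × n_factors, since each stored cross-section collapses to a single integer.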
Bug. value_loss plateaued around 4e-3, never broke below. Suspected vanishing gradients in the value head. Inspected: value head's parameters were not in the optimizer. SB3 default uses one optimizer for both policy and value when they share an extractor; my custom policy split them and only registered policy params.
Fix. Explicit optim.AdamW([{"params": policy.parameters()}, {"params": value_net.parameters(), "lr": 1e-3}], lr=3e-4). Value head learning unlocked. explained_variance climbed 0.78 → 0.99 within 50k steps.
Two more additions in this phase:
- LayerNorm after each per-stock MLP layer. Real-data cross-section z-scores still have outliers (after nan_to_num, a single inf cell at the cell level can still pull the mean). LayerNorm gives stable gradients. Verified by ablation: removing LayerNorm gave value_loss instability across seeds.
- Dual pooling head: aggregate per-stock representations by concat(mean, max) instead of pure mean. Mean captures average market state; max captures the most-extreme stock signal. Worth +0.06 explained_var on the 50k smoke.
Outcome. fps 2050 → 1980 (the LayerNorm + dual pool cost ~3 %, but explained_var=0.99 was worth it). bf16 was attempted but eliminated in Phase 11.
Phase 11. Tried torch.autocast(dtype=bfloat16) for matmul+linear. Memory −20 %, fps +12 %. But: approx_kl became unstable, occasionally spiking to 0.3 (vs nominal 0.02). Inspected: tail of policy logits at bf16 dynamic range had quantization noise that compounded in KL computation. Rolled back to fp32.
Phase 12. Tried target_kl=0.10 adaptive (raise to 0.15 if violated 3x in a row). PPO update frequency dropped from every rollout to ~70 %. Total update count similar, value loss similar, IC similar — no measurable benefit. Removed for code simplicity.
Goal. Profile a single PPO update with torch.profiler to find any remaining low-hanging fruit.
Findings.
- 53 % of SGD time was in advantage computation (compute_returns_and_advantage).
- 22 % in policy log-prob recomputation.
- 11 % in value head forward.
- 14 % in optimizer.step().
Fix. Wrote compute_returns_and_advantage_vectorized() that uses prefix-sum on CUDA tensors instead of Python for-loop over time steps. 53 % → 7 %. SGD wall time per update −40 %.
fps 1980 → 2580. GPU util mean 78 %.
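To illustrate the loop-to-prefix-sum rewrite, here is a NumPy sketch comparing the usual backward GAE recursion with a cumulative-sum form. It is not the repo's compute_returns_and_advantage_vectorized(): it assumes a single env with no episode boundary inside the window, runs on CPU, and uses the identity A_t = (Σ_{k≥t} c^k δ_k) / c^t with c = γλ, which can lose precision or underflow for very long windows in low precision.

```python
import numpy as np

def gae_loop(rewards, values, last_value, gamma, lam):
    """Reference GAE: backward Python loop over time steps."""
    adv = np.zeros(len(rewards))
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        adv[t] = next_adv
        next_value = values[t]
    return adv

def gae_prefix_sum(rewards, values, last_value, gamma, lam):
    """Same quantity via one reversed cumulative sum (no Python loop)."""
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values
    powers = (gamma * lam) ** np.arange(len(deltas))
    suffix = np.cumsum((deltas * powers)[::-1])[::-1]   # sum over k >= t
    return suffix / powers

rng = np.random.default_rng(0)
rewards = rng.normal(size=64)
values = rng.normal(size=64)
adv_loop = gae_loop(rewards, values, 0.3, gamma=0.99, lam=0.95)
adv_fast = gae_prefix_sum(rewards, values, 0.3, gamma=0.99, lam=0.95)
```

A production version would also need to reset the accumulation at episode ends, for example by segmenting the window at done flags before applying the cumulative sum.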
Phase 14 — TF32 + unique-date + 1M overnight / TF32 + Unique-Date + 1M Overnight (2026-05-02 → 2026-05-03 early)
Goal. Capacity build. Run 1M steps overnight on the post-Phase 13 stack to test stability and final convergence.
Two micro-optimizations going in:
- TF32: torch.backends.cuda.matmul.allow_tf32 = True; torch.backends.cudnn.allow_tf32 = True. Free ~12 % matmul speedup on the RTX 4070 (Ada Lovelace; TF32 is supported on Ampere and newer GPUs) with no measurable accuracy loss.
- Unique-date dedup: the same date often appears multiple times in a 16-env × 2048-step rollout because envs reset asynchronously. Gather each unique date_idx only once per minibatch, then duplicate the rows. Saves ~30 % gather time.
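The unique-date dedup can be sketched with np.unique and its inverse mapping. This is a CPU illustration under assumed names; the real code operates on CUDA tensors.

```python
import numpy as np

def gather_unique(panel, date_idx):
    """Gather panel rows for a minibatch, reading each distinct date once."""
    unique_dates, inverse = np.unique(date_idx, return_inverse=True)
    unique_obs = panel[unique_dates]                 # one gather per distinct date
    return unique_obs[inverse.reshape(-1)], len(unique_dates)

panel = np.arange(5 * 2 * 3).reshape(5, 2, 3)        # [n_dates, n_stocks, n_factors]
date_idx = np.array([3, 1, 3, 3, 0, 1])
obs, n_unique = gather_unique(panel, date_idx)
```

The expensive read from the large panel touches only the distinct dates; the duplication step works on the much smaller gathered block.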
1M-step overnight run (alpha101+gtja191 296-col on short panel; n_envs=16, n_steps=2048, target_kl=0.05 fixed, lr=3e-4 → 1e-5 linear, ent_coef 0.01 → 0):
- Wall time: 5h 47min
- fps mean: 460 (started 326 from cold start, reached 540 at ~200k)
- GPU util mean: 73 %
- Peak VRAM: 9.8 GiB / 12 GiB
- approx_kl trajectory: stable 0.025-0.045, no spikes
- explained_variance: 0.99 from ~150k onwards
- Final OOS top30 Sharpe: +5.83 (vs random p50 +0.62, vs-p50 +5.21)
Outcome. Stack hardened. Phase 14 is the "ready for real experiments" milestone. All Phase 15+ modeling experiments run on this exact stack.
Section recap. From Phase 0's 1k-step smoke (fps ~50) to Phase 14's 1M-step overnight (fps 460), the framework went through 15 stack diffs, 24 documented bugs, and 5 mid-flight algorithmic redesigns. The biggest single jump was Phase 5/6/7: moving rollouts onto the GPU raised fps 3.25×, and the symmetry-correct per-stock encoder simultaneously reduced parameter count by ~8000× (395 M flat MLP → 50 k Deep Sets) while improving OOS Sharpe.
中文. Phase 14 之后框架不再变了。Phase 15-26 是在固定 stack 上跑模型/数据/奖励实验。每个 phase 都有明确假设、单变量改动、量化决策。
English. After Phase 14 the framework stopped changing. Phases 15–26 are model / data / reward experiments on a frozen stack, each with a clear hypothesis, single-variable change, and quantitative decision.
Goal. Take the best 1M-step model from Phase 14 and integrate it into the AurumQ platform for live serving.
Three SB3 PPO models registered:
| Agent ID | OOS Sharpe | Note |
|---|---|---|
| phase15e_150k_grand_champion | +6.277 | active production model |
| phase15e_100k_alt_peak | +5.94 | alternative early-peak ckpt |
| phase15e2_225k_continuation_peak | +5.83 | continuation of Phase 15e but with extended steps |
Bundle layout models/rl/<id>/:
- policy.zip: SB3 native (kept for PPO.load path)
- factor_schema.json: factor name list, ordering, hash
- metadata.json: train_start_date, train_end_date, stock_codes, feature_group_weights, factor_count, policy_class
- golden_inference.json + golden_obs.npy + golden_scores.npy: sanity-check pair persisted with the bundle
- checksums.json: sha256 of all artifacts
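As an illustration of the checksum step, the sketch below writes and verifies a checksums.json with the standard library. The JSON layout (file name → hex digest) and the helper names are assumptions; the file contents in the demo are fake placeholders.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksums(bundle_dir):
    """Hash every artifact in the bundle except checksums.json itself."""
    bundle_dir = Path(bundle_dir)
    sums = {p.name: sha256_file(p) for p in sorted(bundle_dir.iterdir())
            if p.is_file() and p.name != "checksums.json"}
    (bundle_dir / "checksums.json").write_text(json.dumps(sums, indent=2))
    return sums

def verify_bundle(bundle_dir):
    """Return the names of artifacts that are missing or modified."""
    bundle_dir = Path(bundle_dir)
    expected = json.loads((bundle_dir / "checksums.json").read_text())
    return [name for name, digest in expected.items()
            if not (bundle_dir / name).is_file()
            or sha256_file(bundle_dir / name) != digest]

with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "policy.zip").write_bytes(b"fake policy bytes")      # placeholder
    (bundle / "metadata.json").write_text('{"factor_count": 361}')
    write_checksums(bundle)
    clean = verify_bundle(bundle)
    (bundle / "metadata.json").write_text('{"factor_count": 999}')  # tamper
    tampered = verify_bundle(bundle)
```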
5 admin-only endpoints under /api/v1/rl/agents/:
- POST /import-bundle: upload, validate schema, persist
- POST /{id}/validate: replay golden_obs through policy, compare to golden_scores
- POST /{id}/archive: soft delete
- POST /{id}/inference: async job (returns job_id)
- GET /api/v1/rl/inference-jobs/{job_id}: poll
Tech-debt addressed. SB3 PPO.load(device='cpu') + LRU(3) policy cache + single-flight lock to prevent concurrent reloads under high traffic.
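One way an LRU(3) cache with a single-flight lock could look, using only the standard library. The class name and the loader callback are hypothetical; in the platform the loader would be something like PPO.load(..., device="cpu"). Per-agent locks are kept for the lifetime of the cache, a simplification that trades a small amount of memory for simpler reasoning.

```python
import threading
import time
from collections import OrderedDict

class PolicyCache:
    """LRU cache where concurrent misses on one key trigger a single load."""

    def __init__(self, loader, capacity=3):
        self._loader = loader
        self._capacity = capacity
        self._cache = OrderedDict()
        self._lock = threading.Lock()      # guards _cache and _key_locks
        self._key_locks = {}               # agent_id -> lock serializing its load

    def _lookup(self, agent_id):
        if agent_id in self._cache:
            self._cache.move_to_end(agent_id)      # mark as most recently used
            return self._cache[agent_id]
        return None

    def get(self, agent_id):
        with self._lock:
            hit = self._lookup(agent_id)
            if hit is not None:
                return hit
            key_lock = self._key_locks.setdefault(agent_id, threading.Lock())
        with key_lock:                              # single flight per agent
            with self._lock:
                hit = self._lookup(agent_id)        # another thread may have loaded it
                if hit is not None:
                    return hit
            policy = self._loader(agent_id)         # slow load, outside the global lock
            with self._lock:
                self._cache[agent_id] = policy
                while len(self._cache) > self._capacity:
                    self._cache.popitem(last=False) # evict least recently used
            return policy

calls = []

def fake_loader(agent_id):
    calls.append(agent_id)
    time.sleep(0.05)                                # simulate a slow PPO.load
    return f"policy:{agent_id}"

cache = PolicyCache(fake_loader)
threads = [threading.Thread(target=cache.get, args=("a",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
calls_after_burst = list(calls)
for agent in ("b", "c", "d"):                       # pushes "a" out of the 3-slot cache
    cache.get(agent)
cache.get("a")
```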
Frontend. ModelHashBadge, MissingDataAlert, RlAgentsView, RlStockPicksView, useInferenceJob composable.
Goal. Re-validate Phase 15's "drop mkt_* group helps" finding under three independent bug fixes.
Three bugs:
- Reward double-shift fix. FactorPanelLoader was already encoding the fp-day forward return, but the env AND the importance-permutation pass were re-indexing t+fp ⇒ rewards were fp days too late.
- Sharpe over-annualization. Overlapping fp-day forward returns must be annualized by √(252/fp), not √252. For fp=10, that's an inflation factor of ~3.16×. Phase 15's legacy Sharpe +6.277 includes this inflation; the bug-corrected Sharpe is ~+1.99 on the same data.
- mfp_ prefix silently missing. 12 mfp cols had been dropped from FACTOR_COL_PREFIXES. Training input was 343 cols, not 355.
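The annualization arithmetic can be checked directly. The function names are illustrative; the only inputs are the numbers quoted above (fp=10, legacy Sharpe +6.277).

```python
import math

def annualization_factor(fp, trading_days=252):
    """Factor for a Sharpe computed on fp-day returns."""
    return math.sqrt(trading_days / fp)

def corrected_sharpe(legacy_sharpe, fp, trading_days=252):
    """Undo a sqrt(252) annualization that should have been sqrt(252 / fp)."""
    inflation = math.sqrt(trading_days) / annualization_factor(fp, trading_days)
    return legacy_sharpe / inflation

inflation_fp10 = math.sqrt(252) / annualization_factor(10)   # equals sqrt(10)
fixed = corrected_sharpe(6.277, fp=10)
```

The inflation factor is simply √fp, so pure rescaling of +6.277 gives a value just under 2, in line with the ~+1.99 reported from the corrected evaluation.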
Models retrained. 16a (drop mkt_ only), 16b (drop mkt_+gtja_), 16c (extend 16b to 450k).
Key findings (under bug-fixed eval):
| Run | adj Sharpe | vs random p50 | IC | Note |
|---|---|---|---|---|
| 15e legacy (uncorrected) | +6.27 | — | — | annualization artifact |
| 16a (drop mkt_) | +1.593 | +0.428 | +0.0143 | new prod candidate |
| 16b (drop mkt_+gtja_) | +1.32 | −0.27 | +0.0109 | gtja_ is load-bearing, contrary to Phase 15 belief |
| 16c (16b @ 450k) | +1.36 | −0.23 | +0.0112 | extension didn't help |
Two robustly anti-helpful groups emerged in permutation importance: cyq (−0.142 ± 0.044) and inst (−0.115 ± 0.030). mfp turned out weakly positive (+0.047 ± 0.067), gtja_ load-bearing (+0.160 ± 0.126).
Decision. 16a → production. Phase 15 legacy peaks retired as annualization artifacts. Phase 17 scoped to (a) seed-sweep 16a (b) test "drop cyq+inst" hypothesis.
Goal. Test whether Phase 16's "robust anti-helpful cyq+inst" signal transfers under retrain; measure seed dispersion.
Method. 17A: train drop_mkt+cyq+inst at seed=42, 300k. 17B/C/D: re-run 16a at seeds 1/2/3. 17E: extend 16a (seed=42) to 450k.
Key findings.
- 17A failed catastrophically. adj Sharpe = +0.861, vs-p50 = −0.304 (16a was +0.428). The cyq+inst drop hypothesis is FALSE. The "robust negative permutation importance" signal turned out to be conditional on the trained policy, not causal.
- Seed sweep. 3/4 seeds beat random; mean lift +0.249; range [−0.060, +0.428]. seed=42 sits at the upper edge of the noise band; seed=2 is a lone failure.
- 17E (450k) produced no new peak. Phase 16a's +1.593 at step 224k is confirmed as the seed=42 global maximum.
Critical lesson. Stop chasing factor drops based on permutation importance alone. Permutation importance reflects what the trained policy uses, not what is causally helpful for prediction. To test causality you must retrain after the drop.
Goal. Convert the seed-sensitivity finding into a deployable mitigation.
Method. Add seeds 4-7 (18A-D) at the unchanged 16a config. Build rank-mean / z-mean / z-median ensembles.
Key findings.
| Run | adj Sharpe | vs random p50 |
|---|---|---|
| seed=42 (16a) | +1.593 | +0.428 |
| seed=1 (17B) | +0.97 | +0.115 |
| seed=2 (17C) | +0.40 | −0.060 — failure |
| seed=3 (17D) | +1.42 | +0.408 |
| seed=4 (18A) | +1.917 | +0.752 — single-seed big win |
| seed=5 (18B) | +1.84 | +0.596 |
| seed=6 (18C) | +0.92 | +0.080 |
| seed=7 (18D) | +0.27 | −0.140 — failure |
Across 8 seeds: mean vs-p50 +0.352, median +0.388, win rate 6/8. Top-K Jaccard between seeds was 0.003–0.010 — seeds chose almost completely disjoint baskets. This is exactly why ensembling lifts.
The 6-member rank-mean ensemble (excluding seeds 2 & 7): vs-p50 +0.711 (Δ vs seed=42 alone = +0.283), IC = +0.0278 (1.94× 16a), non-overlap Sharpe +1.938.
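A small pure-Python sketch of the two quantities used here: a rank-mean ensemble over per-seed scores and the Jaccard overlap between top-K baskets. Tie handling and the toy scores are assumptions; the repo's ensemble code may differ.

```python
def ranks(scores):
    """Ascending ranks 0..n-1 (higher score gets higher rank; ties by position)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    out = [0] * len(scores)
    for rank, i in enumerate(order):
        out[i] = rank
    return out

def rank_mean_ensemble(score_lists):
    """Average each stock's rank across models, ignoring raw score scale."""
    per_model = [ranks(s) for s in score_lists]
    n = len(score_lists[0])
    return [sum(r[i] for r in per_model) / len(per_model) for i in range(n)]

def top_k(scores, k):
    """Set of indices of the k highest scores."""
    return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])

def jaccard(a, b):
    """Overlap of two baskets: |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)

seed_a = [0.9, 0.1, 0.5, 0.3]      # toy scores on different scales
seed_b = [10, 40, 30, 20]
ensemble = rank_mean_ensemble([seed_a, seed_b])
overlap = jaccard(top_k(seed_a, 2), top_k(seed_b, 2))
best = top_k(ensemble, 1)
```

Averaging ranks rather than raw scores keeps one seed with a wider score range from dominating, and low Jaccard between members is what leaves room for the ensemble to add information.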
Decision. Ensemble passed strong-candidate gate. But: single-OOS-window optimum ≠ production Sharpe. Keep 16a live; advance ensemble as candidate pending fresh post-2026-04 holdout.
Goal. Stress-test ens_rankmean6 against realistic costs, T+1 / limit-down filters, multi-window stability, and seed=4's contribution distribution.
Method. Quarter blocks, rolling 60-day windows (step 20), execution simulation at 30/60/100 bps round-trip with limit-down deferral.
Key findings.
- 3/3 quarters won. 100 % rolling-60d win rate. 7/7 windows IC-positive.
- Post-cost at 60 bps: ensemble adj_S = +0.971 vs 16a +0.579 (Δ +0.392).
- At 100 bps: gap widened to +0.272 vs −0.233 — ensemble stayed positive when 16a flipped negative.
- Seed=4 forensics warning: 100.6 % of seed=4's marginal lift came from a single month (2026-01). Removing 16a from the ensemble actually improved score slightly (+0.045) — 16a was the weakest of the six.
- Fresh holdout check: 0 dates past 2026-04-24 with fp=10, threshold ≥ 40 — INSUFFICIENT.
Decision. Conclusion locked by data freshness. Phase 16a stays as live production; ensemble remains release-candidate. No factor drops based on importance alone. Phase 20 priority: collect fresh holdout.
Goal. Retrain 16a config on the long panel (2018-01 → 2025-06 train, 2025-07 → 2026-04 OOS) and check whether more history improves.
Two seeds: 20A seed=42, 20B seed=4. Each 300k steps on the new 7y panel.
Key findings.
| Run | adj Sharpe | OOS hit_rate@5 | OOS win_rate |
|---|---|---|---|
| Phase 16a (2y train) | +1.593 | 4.88 % | 36.9 % |
| Phase 20A (7y, seed=42) | +1.78 | 4.94 % | 37.4 % |
| Phase 20B (7y, seed=4) | +1.42 | 4.85 % | 37.1 % |
Combined-evidence. 2-seed average vs 16a: +0.012 adj Sharpe (within noise). The long panel is at parity with the short panel for the RL track on this metric.
20C cross-data ensemble: BLOCKED. Ensembling the 7y-trained policy with the 2y-trained policy required obs_dim alignment, but the 7y policy was trained with 2018-2019 data that includes ~600 stocks that delisted before 2025; padding caused obs_dim mismatch. Decision: defer to Phase 21.
Decision. Long panel doesn't help RL track (will revisit in Phase 22 after reward redesign). Phase 19's 16a stays live. The long-panel data does later become foundational for the SL track (see §6: P0 / Path 1 long / path5_long).
Goal. Try a brand-new V2 architecture (Dict observation space + larger transformer-style head) on the existing forward_10d reward.
Result.
| Run | adj Sharpe | hit_rate@5 | win_rate | avg_hold |
|---|---|---|---|---|
| Phase 16a (V1) @ top_k=3 | +1.59 | 4.88 % | 36.9 % | +0.20 % |
| Phase 21A (V2 forward_10d) | −0.60 | 3.70 % | 34.5 % | −0.16 % |
Catastrophic regression. V2 Dict obs scheme broke something about the policy's gradient flow; or maybe the Transformer attention head on 3014 stocks is too parameter-heavy for the 300k-step budget. Three retries with seed sweep — same result.
Decision. V2 rejected. V1 PerStockEncoderPolicy stays canonical. Phase 22 reverts to V1 architecture but changes the reward function (see below).
Goal. The forward_10d reward target is "average 10-day forward log-return". But what we actually care about is realized return until a sensible exit signal fires. Implement an MA5/MA10 death-cross exit with a 5-day hard cap.
Method. New module src/aurumq_rl/main_wave_labels.py computes per-stock per-day hold_return[t, j] under signal-exit (min(5d, MA5<MA10 death cross)). The env's reward function reads from this tensor instead of computing forward returns on-the-fly. The valid_mask is tightened to entry_eligible & label_valid so training reward and OOS evaluation filter on the same criterion.
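A simplified single-stock sketch of a signal-exit label. It assumes entry at the close of day t, exit at the close of the first later day where MA5 < MA10 (treated as a state, not strictly a crossing), a hard cap of 5 days, and an invalid label (None) when the window would run past the end of the data. The actual main_wave_labels.py may define entry price, the cross condition, and validity differently.

```python
def moving_average(close, window):
    """Trailing mean; None until `window` observations exist."""
    return [sum(close[i - window + 1:i + 1]) / window if i + 1 >= window else None
            for i in range(len(close))]

def hold_returns(close, max_hold=5):
    """Return per-day hold_return under signal exit, or None if the label is invalid."""
    ma5, ma10 = moving_average(close, 5), moving_average(close, 10)
    labels = []
    for t in range(len(close)):
        if t + max_hold >= len(close):          # window would pass the panel end
            labels.append(None)
            continue
        exit_t = t + max_hold                   # hard cap
        for s in range(t + 1, t + max_hold + 1):
            if ma10[s] is not None and ma5[s] < ma10[s]:
                exit_t = s                      # exit signal fires first
                break
        labels.append(close[exit_t] / close[t] - 1.0)
    return labels

# Toy price path: steady rise, then a sharp decline.
close = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 19, 17, 15, 13, 11, 10, 9]
labels = hold_returns(close)
```

In this toy series the first day with MA5 below MA10 is index 14, so an entry at index 10 exits there after four days instead of running to the five-day cap.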
Three runs (8h each on RTX 4070, short panel 2023-01..2025-06 train / 2025-07..2026-04 OOS).
- 22A: seed=42, top_k=5, 300k steps
- 22B: seed=1, top_k=5, 300k steps (robustness check)
- 22C: seed=42, top_k=3 train, 200k steps (concentration check)
Key findings.
| Run | top_k | best step | hit_rate | win_rate | avg_hold | avg_dd | eval_score |
|---|---|---|---|---|---|---|---|
| Phase 16a (V1 forward_10d, prod) | 3 | 224928 | 4.88 % | 36.9 % | +0.20 % | 3.00 % | +0.0490 |
| Phase 21A (V2 forward_10d, REJECTED) | 3 | 149952 | 3.70 % | 34.5 % | −0.16 % | 4.71 % | −0.0596 |
| Phase 22A (V1 main_wave seed=42) | 3 | 299904 | 5.89 % | 41.1 % | +0.44 % | 3.68 % | +0.0419 |
| Phase 22B (V1 main_wave seed=1) | 3 | 24992 | 6.06 %* | 40.2 % | +0.18 % | 3.79 % | +0.0168 |
| Phase 22C (V1 main_wave train_topk=3 → eval@5) | 5 | 174944 | 6.16 % | 44.0 % | +0.62 % | 3.84 % | +0.0505 |
*22B's "best" is at step 24992 (≈ 1.5 PPO iterations); subsequent checkpoints regress.
Three deltas vs Phase 16a baseline, plus one trade-off.
- Hit rate: random base is 5.72 %. Forward_10d models are all below random (−0.84 to −0.87 pp). The seed=42 main_wave models are consistently above random (+0.14 to +0.44 pp).
- Win rate: +4 to +7 pp uniform lift across all three runs / all top_k variants.
- Avg hold return: 3× lift (16a +0.20 % → 22C +0.62 %).
- Drawdown trade-off: slightly higher avg_max_drawdown (3.65–3.95 % vs 16a's 3.00 %). The hold_return reward doesn't directly penalize drawdown, so the model is willing to ride larger in-hold drawdowns to capture larger gains. Net positive on eval_score, worth a Phase 23 fix.
Decision. Phase 22C → production candidate. First time a model clears the 5.72 % random baseline on the main_wave criterion.
Goal. Clean up valid_mask edge cases and tighten the eval pipeline. Specifically:
- last-week-of-data label_valid sometimes True because label window extends past panel end → forward fill gives +0 % return.
- T+0 entry/exit on same day (entry day = exit day, hold_return = 0, treated as "valid loss") was double-counted.
Method. Three patches in main_wave_labels.py: (a) strict label window check; (b) drop t==exit_t entries; (c) recompute entry_eligible to exclude T+0 (only buy after 9:30 of day t, exit at or after day t+1 close).
Result.
| Run | hit_rate | win_rate | avg_hold | eval_score |
|---|---|---|---|---|
| Phase 22C | 6.16 % | 44.0 % | +0.62 % | +0.0505 |
| Phase 23A (same config + cleanup) | 6.42 % | 44.6 % | +0.71 % | +0.0617 |
Production candidate locked. Phase 23A is the first model that combines: (a) clears random base rate, (b) >44 % win rate, (c) >0.7 % avg hold return, (d) tight label semantics. Becomes "23A baseline" in subsequent phases.
Phase 24 / 25 — 技术因子改训 + 重要性权重 — 全部拒绝 / Tech-Factor Detour + Importance-Weighting REJECTED (2026-05-07)
Goal. Add ~36 technical-analysis factors (MA / KDJ / MACD / Bollinger / amplitude / limit-up counts) on top of the 353-col baseline. Secondarily test per-factor importance-derived input weights.
Method.
- Phase 24A: compute tech factors at panel-load time inside the RL repo (KDJ/MACD computed from close-only because parquet had no OHLC). Train 300k seed=42 + --add-technical-factors.
- Phase 25A: add IG-saliency × |T-1 z-score| sigmoid weights on top of 24A.
- Phase 25D: weighting on the 353-col base WITHOUT tech, to isolate the weighting paradigm itself.
Results.
| Run | top-5 T-1 hit | lift vs random | Note |
|---|---|---|---|
| 23A baseline | 2.11 % | 2.38× | the reference |
| Phase 24A (tech, 353+36=389 cols) | 0.40 % | 0.45× | below random 0.89 % |
| Phase 25A (24A + weighting) | 0.50 % | 0.56× | VRAM-thrashed (96 % on RTX 4070), fps 172 → 4, killed at 60 % |
| Phase 25D (weighting only, no tech) | 1.41 % | 1.59× | −33 % vs 23A |
Root causes (post-mortem):
- KDJ / amplitude approximated from close-only. The parquet didn't carry OHLC at the time. kdj_k = 100 * (close - min(close,9)) / (max(close,9) - min(close,9)) is a degenerate variant of true KDJ. Z-scoring this approximation produces strong artifacts on quiet stocks.
- Binary event flags pollute LayerNorm gradients. ma_cross_5_10, golden_cross are 0/1 indicators. After z-score, "1" days become ~3-5σ outliers; LayerNorm scales them down, but the back-propagated gradient hits these few outlier samples and produces large updates that destabilize the policy.
- Weighted top-30 factors saturate encoder capacity. The per-stock encoder is 64-dim. Forcing 30 high-weight factors through a 64-dim bottleneck kills the fine-grained ensemble structure that previously distributed signal across the full 353 cols.
Decision. Both directions rejected. Architecture rule re-affirmed: factor computation belongs in the upstream parquet pipeline, never at panel-load time inside the RL repo. The team wrote TECH_FACTOR_SPEC.md (203 lines) and handed it back to the data team for proper OHLC-based tech factors. 23A remained production. Importance-weighting paradigm permanently dropped.
Phase 26 is a 7-step recovery and breakthrough chain. Track it carefully — this is where the project's current production model came from.
Result. 373-col training (343 base + 30 upstream tech_/cmf_/zt_) with new "canonical" cyq replacing legacy 88%-NaN cyq from quotes_enriched. Regressed: T-1 lift = 1.36× (vs 23A 2.38×).
Result. 343-col, no tech. Still regressed: T-1 lift = 1.47×. The regression is NOT from tech factors.
Traced to cyq_perf backfill regime split:
- v1.0 cyq table only had real Tushare data ≥ 2025-10-20; everything before was backfilled with a different methodology.
- Backfill null rate 0.61 % vs real 26.53 %.
- cyq_cost_distance std: 0.197 (backfill, in training window) vs 0.066 (real, in OOS), a 3× compression.
- Training was 100 % backfill; OOS was ~63 % real.
- Cross-section z-score does NOT equalize a mid-stream regime shift. The model learned the synthetic-backfill scale; on real data the dispersion is 3× smaller, so the model's "this stock has high cyq" signal collapses.
23A's accidental advantage. 23A used the legacy 88%-NaN cyq, so the model had effectively learned to ignore cyq entirely. By contrast 26A/B's cleaner-but-distribution-shifted cyq actively misled the model.
v1.2 cyq fix from data team: re-fetch via bulk Tushare API, all dates have consistent methodology. But: train end date was 2024-12-31 (vs 23A's 2025-06-30 — 6-month gap before OOS start).
Result. T-1 lift 1.47× (same as 26B). Train-window mismatch swamps any cyq improvement.
Result. T-1 lift 2.61× (2.31 % hit rate, best ckpt at step 50k). That is +9.7 % over 23A's 2.38×, and it converges 4× faster (50k vs 200k). v1.2 cyq fix is validated. Production candidate.
Result. T-1 lift 1.13×. Adding 30 tech factors on the clean panel still hurts by −57 %. Confirms Phase 24's diagnosis: at 128→64→32 per-stock encoder capacity, the 30 raw tech cols dilute attention from strong alpha/gtja/mfp signals.
Three new variants at fresh 3-seed baseline:
- 26E: 26C2 (353 cols) + 2 continuous tech cols (tech_boll_percent, cmf_120d_pct_amt) = 355 cols
- 26F: 26E + 6 event-decay cols (tech_evt_* with τ=10d exp decay) = 361 cols
- 26G: 26F at bigger encoder (256→128, 256k params per-stock)
3-seed median (seeds 42/43/44):
| Run | factors | T-1 lift median | T-1 hit median | best ckpt |
|---|---|---|---|---|
| 26C2 (clean panel sanity) | 353 | 1.70× | 1.50 % | step 50k |
| 26E | 355 | 1.59× | 1.41 % | step 50k |
| 26F (event-decay) | 361 | 2.15× | 1.90 % | step 50k |
| 26G (bigger encoder) | 361 | (abandoned) | — | — |
Best 26F seed: seed=44 hit 2.72× lift at step 50k (T-1 hit 2.41 %).
26G abandoned: 256→128 encoder on RTX 4070 12 GB thrashed fps from 326 down to 4–55 with VRAM stranded by zombie contexts. 3 hours of attempts, no clean result. The capacity question is deferred to Linux / 16 GB-class hardware.
Re-run the comparison on the panel-v3 build (cyq v1.2 fully shipped, alpha/gtja sanitizer integrated upstream, mf_ _log variants emitted).
| Run | T-1 lift median | range |
|---|---|---|
| 26C2-v3 sanity | 1.70× | 1.36–2.15× |
| 26F-v3 ⭐ (PRODUCTION) | 2.27× | 2.04–2.38× |
| 26G-v3 | 1.82× | 1.59–2.05× |
Decision. 26F-v3 = final RL production model. 361 cols (353 base + 2 continuous tech + 6 event-decay) at 128→64 encoder. Bigger-encoder hypothesis closed (disproven on clean panel too — not a hardware artifact).
Phase 26 lesson summary.
- The win came from continuous event-decay encoding of binary signals (the explicit fix to Phase 24's binary-flag failure), not from continuous TA on close prices.
- Mid-stream regime shifts (cyq v1.0 backfill vs real) defeat cross-section z-score; only the data team can fix it (v1.2 bulk-API recompute).
- Train-window matters as much as architecture. A 6-month gap between train-end and OOS-start is enough to wipe out any cyq-quality improvement.
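The event-decay encoding credited in the first lesson can be sketched as follows: a 0/1 event series becomes exp(−days_since_last_event / τ), so the feature fades smoothly instead of spiking for one day. Counting in rows of the series (trading days), resetting to 1.0 on each new event, and emitting 0.0 before the first event are assumptions; the upstream tech_evt_* columns may be defined differently.

```python
import math

def event_decay(flags, tau=10.0):
    """Map a 0/1 event series to exp(-days_since_last_event / tau)."""
    out, days_since = [], None
    for flag in flags:
        if flag:
            days_since = 0                      # event fires: reset to full strength
        elif days_since is not None:
            days_since += 1
        out.append(0.0 if days_since is None else math.exp(-days_since / tau))
    return out

flags = [0, 0, 1, 0, 0, 0, 1, 0]
decayed = event_decay(flags, tau=10.0)
```

With τ=10 the signal is still at about 0.37 ten days after the event, which gives the encoder a continuous input in place of a rare 0/1 outlier.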
中文. 与 RL 赛道并行,paris 侧(AurumQ 主仓)维护了一条 LightGBM/CatBoost/XGBoost 监督学习赛道作为对照。这条赛道已经超越了 RL 赛道的实证表现,但 RL 仍然作为多样性来源继续生产。下文是 SL 赛道关键节点的精简记录。
English. In parallel with the RL track, paris (AurumQ main repo) maintains a LightGBM/CatBoost/XGBoost supervised-learning track as control. The SL track has empirically outperformed the RL track, but RL remains in production as a diversity source. Below is a compressed record of the SL track's key milestones.
Goal. Among four labeling methods (A v2_excess_adaptive, B trend-scanning, C triple-barrier, D directional-change), pick the production main-wave label.
Method. Calibrate every method to train pos_rate ≈ 0.80 %, train LGBM (3 horizons each — t1, t3, e20) on the 26F-v3 348-col panel, score on test 2025-07..2025-12. Decision weights: 0.45·PR_AUC + 0.20·(1−ECE) + 0.15·top1 %_lift + 0.10·(1−ind_cv) + 0.10·(1−year_cv).
Stage 2 test PR-AUC at t3:
| Method | best_iter | PR_AUC | lift | Brier_ratio | ECE | top1 % | top5 % | daily_prec@5 |
|---|---|---|---|---|---|---|---|---|
| A | 191 | 0.1217 | 3.0× | 0.972 | 0.010 | 3.65× | 4.21× | 0.200 |
| B | 196 | 0.1052 | 2.4× | 0.977 | 0.013 | 4.16× | 3.35× | 0.156 |
| C | 162 | 0.1195 | 2.5× | 0.968 | 0.015 | 4.25× | 3.70× | 0.203 |
| D | 8 | 0.0961 | 2.5× | 0.971 | 0.005 | 2.31× | 3.25× | 0.114 |
Composite-decision score: A = 0.836, C = 0.815 (within tiebreak band 0.03). C wins industry_cv tiebreak (0.40 vs 0.51); A wins on operational clarity + 15d vs 5d median event duration.
Null tests both PASS: label-shuffle PR_AUC = 0.04021 (0.989× base), date-shuffle 0.06060 (1.491× — borderline, threshold 1.5×). Real model lift 3.0× / shuffle lift 1.49× → 2.0× fresh-signal ratio.
Decision. P0 = Method A at horizon t3, threshold τ_A = 1.2327, locked in src/aurumq/labeling/p0_chosen.py. C kept as drop-in fallback. P0 cleared for P1 production deployment.
Goal. Close P1's training gaps so the LGBM is production-grade, not an ablation prototype.
Pieces.
- Stage 0: Daily panel rebuild. Split monolithic
feature_panel_v3_344.parquet(3.65 GB) into 804 per-day shards (year=YYYY/date=YYYY-MM-DD.parquet). Celery beat @18:35 between phase20 rebuild (18:30) and wave_scores (18:45). Schema-hash assert against P0 lock5e71e158e331. - Stage 1: Walk-forward on single anchor 2025-07, 5 seeds × 3 horizons (t1, t3, e20) = 15 LGBM trainings. LGBM params locked from P0 (lr=0.02, num_leaves=63, n_est=1500, early_stop=80), per-seed isotonic, mean ensemble.
- Stage 2: composite_mean(A, C) finishing — REJECTED (PR_AUC = 0.0433 vs A_t3's 0.1217, Δ = −0.0785). Composite labels mix A∪C event positions that don't overlap, raising noise.
- Stage 3: PPO residual — SKIPPED, deferred to P3 4070 work.
- Stage 4: Alembic 052, v1/v2 dual-write, `drift_check` @19:00.
| Horizon | test_pos_rate | PR_AUC | lift | ECE | top1 % | daily@5 | per-seed std |
|---|---|---|---|---|---|---|---|
| t1 | 0.0135 | 0.0721 | 5.34× | 0.0024 | 9.28× | 0.103 | 0.002 |
| t3 | 0.0407 | 0.1224 | 3.01× | 0.0100 | 3.33× | 0.209 | 0.001 |
| e20 | 0.2650 | 0.4136 | 1.56× | 0.0477 | 1.96× | 0.548 | 0.002 |
Decision. Promote wave_t3_lgbm_v1 (P0 single seed) → wave_t3_lgbm_v2.anchor=2025-07.ensemble (5 seeds) under tiered "realistic gate": t3 + t1 ship, e20 reference-only.
Diversity exploration. 4 new bundles + 1 meta-stack on the same H1/H2 windows.
| Path | Input panel | Preprocessing | n_features | Model |
|---|---|---|---|---|
| Path 1 | feature_panel_v3_344 | NO rank-z (raw) | 345 | LightGBM β-regression |
| Path 2 | feature_panel_clean | rank-z (same as Path 4) | 345 | CatBoost + XGBoost mix |
| Path 4 | feature_panel_clean | rank-z | 345 | LightGBM (prod) |
| Path 6 | feature_panel_clean_pruned | rank-z + drop 119 SHAP-zero | 226 | LightGBM (Bayesian opt) |
| Path 5 | meta-LGB over (Path1, Path4, Path2) | + 11 regime + 9 interactions | 23 meta | LightGBM meta |
Short-panel scoreboard (H1 = 2025-07..09, H2 = 2025-10..12):
| Path | H1 cal primary | H2 cal primary | T1_hit H1 |
|---|---|---|---|
| Path 1 | +0.028030 | +0.030497 | 54.4 % |
| Path 4 | +0.028483 | +0.030577 | 54.5 % |
| Path 2 | +0.027932 | +0.030658 | 54.0 % |
| Path 6 | +0.028265 | +0.030739 | 54.5 % |
| Path 5 (meta) | +0.028372 | +0.030245 | 55.8 % |
| Path D (long, REJECTED as standalone) | +0.028358 | +0.030738 | — |
| Path 3 (TabNet, REJECTED) | +0.019417 | +0.020 | — |
Path 3 TabNet's rejection. H1 +0.019 vs Path 1 +0.028 (−30-40 %); 47 min train vs 50 s for LightGBM (~55× slower); killed 7/8 grid combos. Also drove T1_hit DOWN by ~1 pp when added to meta — excluded from Path 5 stacking. TabNet is not the answer for this problem under this data scale.
Cross-path ensemble: rank-mean across Path 1+4+2 gives H1 +0.028407 vs best single +0.028483 — Δ = −0.0001. Three GBDT-family paths are highly correlated; ensemble diversity is exhausted.
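A rank-mean ensemble of the kind described above can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy scores are hypothetical, and the tie handling (by sort order) is an assumption.

```python
def rank_pct(scores: dict) -> dict:
    """Map raw scores to rank fractions in (0, 1]; a higher score gets a higher rank."""
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {code: (i + 1) / n for i, code in enumerate(ordered)}


def rank_mean(paths: list) -> dict:
    """Average the per-path rank fractions over the stocks that all paths scored."""
    ranked = [rank_pct(p) for p in paths]
    codes = set.intersection(*(set(r) for r in ranked))
    return {c: sum(r[c] for r in ranked) / len(ranked) for c in codes}


# Toy scores for three correlated paths (made-up numbers, not real predictions).
path1 = {"000001.SZ": 0.9, "600000.SH": 0.5, "600519.SH": 0.1}
path4 = {"000001.SZ": 0.8, "600000.SH": 0.2, "600519.SH": 0.3}
path2 = {"000001.SZ": 0.7, "600000.SH": 0.6, "600519.SH": 0.4}

ensemble = rank_mean([path1, path4, path2])
best = max(ensemble, key=ensemble.get)
print(best, ensemble[best])
```

When the base paths mostly agree on the ordering, as here, the ensemble reproduces the single-model ranking, which is consistent with the near-zero Δ reported above.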
Goal. Use the 7-year long panel (2018-01-02..2024-12-04 train, same H1/H2 as short) to retrain Path 1/2/4/5. Test the rank-z hypothesis.
Combined headline:
| Path | short H1 | long H1 | Δ H1 | short H2 | long H2 | Δ H2 |
|---|---|---|---|---|---|---|
| Path 1 (raw LGB) | +0.028030 | +0.028626 | +5.97 bps | +0.030497 | +0.031089 | +5.92 bps |
| Path 4 (rank-z LGB) | +0.028483 | +0.028358 | −1.25 bps | +0.030577 | +0.030738 | +1.61 bps |
| Path 2 (CB+XGB) | +0.027932 | +0.028343 | +4.11 bps | +0.030658 | +0.030857 | +2.00 bps |
| Path 5 (regime stack) | +0.028372 | +0.028817 | +4.46 bps | +0.030245 | +0.031131 | +8.86 bps |
Finding 1 — rank-z kills long-panel info. Experiment B isolates the variable: Path 4 hyperparams + the same raw long panel = numerically identical to Path 1 long (H1 = 0.028626). The per-day cross-sectional rank operation erases cross-year factor amplitude information — fine for 2y where amplitude variation is small, catastrophic for 7y where most of the new info lives in cross-year amplitude regimes.
Finding 2 — 5-year is the plateau (Experiment A):
| Train window | n years | H1 mean | H2 mean |
|---|---|---|---|
| 2023-2024 (2y) | 2 | +0.027928 | +0.030285 |
| 2022-2024 (3y) | 3 | +0.028182 | +0.030880 |
| 2020-2024 (5y) | 5 | +0.028529 | +0.030953 |
| 2018-2024 (7y) | 7 | +0.028533 | +0.030983 (plateau) |
Strictly monotonic up to 5y then flat. 2018-2019 contribute nothing. Implications: if retraining, use 2020-2024, save 30 % compute.
Finding 3 — Strategy D compounds. Top-50 score-weighted sizing:
| Path | + Strategy D H1 mean_y | + Strategy D H2 mean_y |
|---|---|---|
| Path 4 short + Strategy D (current prod v2) | +0.0308 | +0.0333 |
| Path 1 long + Strategy D | +0.0315 | +0.0343 |
| Δ | +7 bps | +10 bps |
Strategy D's +8 % concentration effect stacks additively on a stronger base — bigger absolute gains, not just same percentage of smaller pie.
Three production candidates:
| Candidate | H1 vs prod | H2 vs prod | + Strategy D vs prod v2 | Ops cost |
|---|---|---|---|---|
| C. Hybrid (Path 1 long + Path 4 short 50/50) | +2.7 bps | +4.5 bps | +5-8 bps | simplest — 2 inferences + average, no new model |
| A. Path 1 long (raw, 2018-2024) | +1.4 bps | +5.1 bps | +7-10 bps | medium — 1 new bundle |
| B. Path 5 long stacking | +3.3 bps | +5.5 bps | strongest | high — 3 base + meta + 11 regime + 9 interactions |
All three were shipped to production on 2026-05-11. Hybrid + path1_long went live first (commit 78d71ce); path5_long followed 30 minutes later after ledashi shipped the missing path4_long + path2_long base bundles (commit 0ab6a55). Today path5_long is the H1-leading model (+0.02882) and is one signal-A/B-window from being promoted to is_recommended=True over Path 4 short.
中文. 截面 z-score(rank-z)在每个 trade_date 内把当天因子重排到 [-σ, +σ] 区间。短 panel(2 年)训练时这没问题:所有因子的截面分布相似。长 panel(7 年)训练时灾难:2020 年和 2024 年的市场结构完全不同(科创板规模、北向资金占比、机构持仓比例都翻了几倍),而 rank-z 把跨年的「因子绝对幅度」信息全部抹掉,模型只能学到「相对排名」,反而损失 5-6 bps 的 H1 mean_y。
English. Cross-section z-score (rank-z) re-ranks each factor within every trade_date and maps it onto a fixed [-σ, +σ] scale. This is fine for short panels (2y), where cross-section distributions are similar from day to day. It is catastrophic for long panels (7y), where market structure shifts dramatically between years (STAR board size, Northbound share and institutional holdings all grew several-fold over the period). Rank-z erases the cross-year factor amplitude information, so the model only learns relative ranks and loses 5-6 bps of H1 mean_y.
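Action. For long-panel training, use raw features. Verified: with identical hyperparams, Path 1 long (raw, 7y) beats Path 4 long (rank-z, 7y) by 2.68 bps on H1 (+0.028626 vs +0.028358). The 5.97 bps figure is Path 1's long-over-short gain, which Path 4 fails to capture (−1.25 bps).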
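The amplitude loss can be seen on a toy example. The sketch below assumes one common rank-z variant (rank within the day, then map through the inverse normal CDF); the repository's exact transform may differ, and `rank_z` is a hypothetical name.

```python
from statistics import NormalDist


def rank_z(values: list) -> list:
    """Per-day cross-sectional rank mapped to a standard-normal score (assumed variant)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1)
        out[idx] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out


# Same cross-sectional ordering, but the second day has 10x the factor amplitude.
day_low_amplitude = [0.01, 0.02, 0.03, 0.04]
day_high_amplitude = [0.10, 0.20, 0.30, 0.40]

# After rank-z the two days are indistinguishable: the amplitude regime is gone.
print(rank_z(day_low_amplitude) == rank_z(day_high_amplitude))
```

A tree model trained on the transformed panel therefore cannot split on "this factor is unusually large this year", which is the information the raw long panel retains.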
中文. 2018-2019 数据零边际贡献。原因:A 股市场在 2019 年下半年到 2020 年初经历了科创板开板、注册制改革、外资准入扩大 —— 实质上是个 regime change。把 2018-2019 数据塞进训练集相当于让模型同时学两个市场结构,跨年泛化变差。
English. The 2018-2019 data contribute nothing beyond the 5-year window (H1 +0.028529 at 5y vs +0.028533 at 7y). Reason: A-shares went through a regime change in late 2019 / early 2020 (STAR board launch, registration-based IPO reform, foreign-access expansion). Training on 2018-2019 forces the model to learn two market structures simultaneously, hurting cross-year generalization.
Action. Default train window is 2020-2024 (5y). This saves about 30 % of training compute with no measurable loss in H1/H2 mean_y.
中文. Strategy D = top-K 仓位按校准分数加权(weight = max(score, 0) / sum),而不是等权。在 Path 4 short 上 +8% mean_y;在 Path 1 long 上 +10% mean_y。新指标提升的绝对值 ≈ 旧 base 提升的绝对值,不是百分比。换 base 越强,Strategy D 累加越多。
English. Strategy D = top-K score-weighted position sizing (weight = max(score, 0) / sum), not equal-weight. It adds +8 % mean_y on Path 4 short and +10 % on Path 1 long. The uplift does not shrink on a stronger base, so a stronger base yields a larger combined gain. Always apply Strategy D.
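The sizing rule weight = max(score, 0) / sum can be sketched as below. `strategy_d_weights` is a hypothetical name, and the equal-weight fallback when every top-K score is non-positive is an assumption that the text does not cover.

```python
def strategy_d_weights(scores: dict, top_k: int) -> dict:
    """Score-weighted sizing over the top-K names: weight = max(score, 0) / sum."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    clipped = {code: max(score, 0.0) for code, score in top}
    total = sum(clipped.values())
    if total == 0.0:
        # Assumption: fall back to equal weight if no selected score is positive.
        return {code: 1.0 / len(top) for code, _ in top}
    return {code: s / total for code, s in clipped.items()}


# Toy calibrated scores (made-up numbers).
scores = {"A": 0.06, "B": 0.03, "C": 0.01, "D": -0.02}
weights = strategy_d_weights(scores, top_k=3)
print(weights)
```

Compared with equal-weighting the same three names at 1/3 each, the highest-scored name receives the largest share of capital, which is the concentration effect the finding refers to.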
中文. 二值事件(MA 金叉、KDJ 死叉、突破 3σ)原始 0/1 进 LayerNorm 是 −33% 准确率的元凶(Phase 24A)。正确做法:把二值序列做 exp-decay τ=10d 转换,evt_decay(t) = Σ_{tau ≤ 10d} 1[event in last 10d] · exp(-(t-tau)/10),得到 0..1 连续衰减值。Phase 26F 用这个修复把 T-1 lift 从 1.13× 拉回 2.27×。
English. Raw binary event flags (MA cross, KDJ death-cross, 3σ breakout) fed directly into LayerNorm cause a −33 % accuracy regression (Phase 24A). Correct encoding: exp-decay with τ=10d, which turns each 0/1 flag into a continuous 0..1 value that decays with the days elapsed since the event. Phase 26F applied this fix and brought T-1 lift back from 1.13× to 2.27×.
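A minimal sketch of the exp-decay encoding follows. It assumes the decay is measured from the most recent event and applies no hard cutoff after 10 days; both are assumptions, and `exp_decay_encode` is a hypothetical name rather than the repository's function.

```python
import math


def exp_decay_encode(flags: list, tau: float = 10.0) -> list:
    """Turn a 0/1 event series into exp(-days_since_last_event / tau), in [0, 1]."""
    out = []
    days_since = None  # stays None until the first event is seen
    for flag in flags:
        if flag:
            days_since = 0
        elif days_since is not None:
            days_since += 1
        out.append(0.0 if days_since is None else math.exp(-days_since / tau))
    return out


# One event on day 1 and another on day 4.
encoded = exp_decay_encode([0, 1, 0, 0, 1, 0])
print([round(v, 4) for v in encoded])
```

The encoded series is 1.0 on the event day and fades smoothly afterwards, so a normalization layer sees a dense continuous feature instead of a rare spike.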
Finding 5 — Permutation importance 是 conditional, 不是 causal / Permutation Importance Is Conditional, Not Causal
中文. Phase 16 跑出 cyq 和 inst 都是「robust 负向」(permutation importance 显著负)。Phase 17A 把它们 drop 掉重训 → adj Sharpe 暴跌 −0.732。原因:permutation importance 度量的是当前训好的策略有多依赖某个特征,不是这个特征是否对预测有因果贡献。要测因果必须重训而不是重 permutation。
English. Phase 16 found cyq + inst both had "robust negative" permutation importance. Phase 17A dropped them and retrained → adj Sharpe collapsed −0.732. Reason: permutation importance measures how much the trained policy depends on a feature, not whether the feature is causally helpful for prediction. To test causality you must retrain after the drop, not just re-permute.
中文. Phase 26 跑 inf-root-cause 审计时,发现 19 个 inf-producing 因子的根因不是公式问题:
- adj_factor 损坏:5 个日期 + 2026 全年 16.5 万 NULL → `compute_alpha101.adj.fill_null(1.0)` 把不复权价拼到复权价上,制造假 regime shift。
- 一字板 high=low=close:vwap-close、high-low 都 = 0,公式 div-zero。
- rolling 相关性 window=2:std=0 触发 NaN/inf。
- rank^delta 当 rank=0 且 delta 极端:`0^-large = inf`,gtja_017 实测 max 达 1.4×10³⁰⁸(fp64 上限)。
英文 lesson: sanitizer(inf→NaN + clip ±1e6)只是兜底,真正的修复在上游数据管线。
English. A Phase 26 inf-root-cause audit traced 19 inf-producing factors not to formula bugs but to four upstream data defects: (1) adj_factor corruption: 5 dates plus all of 2026 (165k NULLs) hit compute_alpha101.adj.fill_null(1.0), which stitched unadjusted prices onto adjusted prices and created fake regime shifts; (2) one-price limit days (一字板, high=low=close), where vwap-close and high-low are both zero and the formulas divide by zero; (3) rolling correlation with window=2, where std=0 yields NaN/inf; (4) rank^delta with rank=0 and an extreme negative delta, which overflows to inf (gtja_017 reached a max of 1.4×10³⁰⁸, near the fp64 limit). Lesson: the sanitizer (inf→NaN + clip ±1e6) is only a backstop; the real fix belongs in the upstream data pipeline.
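The difference between the backstop and an upstream guard can be sketched as follows. Both helpers are hypothetical illustrations in plain Python, not the repository's sanitizer, which presumably operates on whole polars columns.

```python
import math

CLIP = 1e6  # clip bound quoted in the text


def sanitize(value: float, clip: float = CLIP) -> float:
    """Backstop: inf (and NaN) become NaN, finite values are clipped to ±clip."""
    if math.isinf(value) or math.isnan(value):
        return math.nan
    return max(-clip, min(clip, value))


def safe_ratio(numerator: float, denominator: float) -> float:
    """Upstream-style guard: emit NaN instead of dividing by a zero range."""
    return math.nan if denominator == 0 else numerator / denominator


# One-price limit day: high == low == close, so the high-low range is zero.
high = low = close = 10.0
print(safe_ratio(close - low, high - low))   # NaN rather than a ZeroDivisionError or inf
print(sanitize(float("inf")))                # NaN
print(sanitize(1.4e308))                     # clipped to 1e6
```

The guard removes the bad value where it is produced; the sanitizer only hides whatever slips through, which is why the text treats it as a last line of defense.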
中文. 当前生产环境(AurumQ 主仓侧)的 Celery Beat 调度(工作日):
18:30 phase20_rebuild_panels_daily 重建 short combined panel + 上传 OSS
18:35 rebuild_feature_panel_daily 当日 shard 写入
18:45 generate_wave_scores_daily v1/v2 集成(legacy 兼容)
18:50 generate_wave_scores_path4_daily Path 4 + Strategy D(当前推荐)
18:51 path1 评分
18:52 path2 评分
18:53 path6 评分
18:54 path1_long 评分 ★ 长 panel raw
18:55 path4_long 评分 ★ path5_long base
18:56 path5 评分 meta on path1+path4+path2
18:57 path2_long 评分 ★ path5_long base
18:58 hybrid 评分 ★ path1_long + path4 50/50
18:59 path5_long 评分 ★ NEW BEST regime stack on long bases
19:00 wave_drift_check 漂移监控
关键点:
- 10 个 model_version 并存。任何模型上线,路径:`runner.py:PATH_CONFIG` 加一条 + `celery_beat.py:BEAT_SCHEDULE` 加一行 + `wave_scores.py:_PATH_DISPLAY_INFO` 加 display 元数据 → 自动出现在 `/api/v1/wave/model_versions` 接口。
- 预测就绪时间:每工作日 19:00 BST。
- 生产推荐:当前 `is_recommended=True` 仍在 Path 4 short;A/B 几周后切到 path5_long。
English. Current production Celery Beat schedule (weekdays): see the Chinese block above for the 14-slot 18:30→19:00 timeline. Predictions are ready by 19:00 BST every weekday. Adding a new model is a 3-line patch: add a PathConfig in `runner.py`, a beat entry in `celery_beat.py`, and a display row in `wave_scores.py`; the model then surfaces automatically on `/api/v1/wave/model_versions`. `is_recommended=True` currently stays on Path 4 short; path5_long will be promoted after a few weeks of A/B shadowing.
中文.
- 每轮只改一个变量。复合实验不可解释。Phase 3 R2 同时改 `target_kl` + `n_envs` + `n_steps`,分不出哪个贡献了 +0.74 OOS Sharpe。
- 永远先跑 10-30k step 的 micro-smoke。Phase 3 R2 第一次尝试 + Phase 4 n_envs=20 实验都因为没先 micro-smoke 就跑 `MemoryError`。
- 50k 步的 OOS Sharpe 是噪声。从 smoke 排座次不靠谱;要从 ≥1M 步 convergence-scale 跑里挑赢家。
- 不基于 permutation importance 做特征剔除。Phase 17 已经栽过一次。要剔除必须 retrain ablation。
- 重新审视基线 > 增量调参。Phase 5 重构(per-stock encoder + GPU env)一次给了 10× fps + 5× 因子上限。所有 Phase 1-4 hyperparam 调整加起来都没这一次重构贡献大。
- VRAM ≠ RAM。Phase 5 realization #2 教训。
- 对称性正确的架构 > 蛮力容量。50 K 参数的 per-stock encoder 比 800 M 参数的 flat MLP 更准 + 更快 + 更省。
- 数据收敛速度必须 > 网络容量增长速度。Phase 2 把网络从 [64, 64] 加到 [2048, 1024, 512](12 倍),但训练步数还是 50k,于是误判「网络容量不够」。
- 管线对 ≠ 模型好。R1 的「跑通端到端」和 R3 的「explained_var=0.99」是同等重要的里程碑。
English.
- One change per round. Compound experiments are uninterpretable. Phase 3 R2 changed three things at once (`target_kl`, `n_envs`, `n_steps`) and couldn't attribute the +0.74 OOS Sharpe jump.
- Always micro-smoke at 10-30k steps first. Phase 3 R2 first attempt and Phase 4 `n_envs=20` both died with `MemoryError` because nobody micro-smoked first.
- OOS Sharpe at 50k steps is noise. Don't pick winners from smoke; pick from convergence-scale runs (≥1M).
- Don't drop features based on permutation importance alone. Phase 17 ate this one. Always retrain ablation to test causality.
- Re-examining baselines > incremental tuning. Phase 5 redesign (per-stock encoder + GPU env) gave 10× fps + 5× factor capacity at once. All Phase 1-4 hyperparam tuning combined didn't match it.
- VRAM ≠ RAM. Phase 5 realization #2.
- Symmetry-correct architecture > brute capacity. 50 k-param per-stock encoder is more accurate + faster + smaller than 800 M-param flat MLP.
- Data convergence must outpace capacity ramp. Phase 2 widened net 12× ([64,64] → [2048,1024,512]) but kept smoke at 50k steps → misjudged as "not learning"; really just hadn't trained long enough.
- Pipeline correctness ≠ model quality. R1's "first end-to-end run" was as much progress as R3's "explained_var=0.99 value function" because both unlocked new test categories.
`FACTOR_COL_PREFIXES` is the single source of truth. Missing one prefix reproduces Phase 16's silent `mfp_` dropout. Three places must stay in sync: (1) the tuple in `src/aurumq_rl/data_loader.py`, (2) `tests/test_data_loader.py:test_factor_col_prefixes_lockdown`, (3) `scripts/export_factor_panel.py:FACTOR_PREFIXES`.
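- A-share `pct_chg` is a decimal fraction (+10 % = 0.10), not +10 or 10.0.
- `vol == 0` means the stock is suspended; such rows must be filtered out of the training panel.
- A-share codes must use the Tushare style `XXXXXX.SH/SZ/BJ`.
- Price-limit checks are per board; never hard-code ±10 % (ST ±5 %, STAR and ChiNext ±20 %, BSE ±30 %).
- The T+1 constraint must be enforced: shares bought today can only be sold from the next trading day. This is the core of `entry_eligible_mask`.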
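The per-board price limits and the T+1 rule listed above can be sketched as below. This is a simplified illustration: the board detection by code prefix (688 for STAR, 300/301 for ChiNext, `.BJ` for BSE) and the function names are assumptions, not the repository's `entry_eligible_mask` logic.

```python
def price_limit(ts_code: str, is_st: bool = False) -> float:
    """Daily price limit as a decimal fraction, matching the pct_chg convention."""
    symbol, exchange = ts_code.split(".")
    if exchange == "BJ":
        return 0.30  # BSE
    if symbol.startswith(("688", "300", "301")):
        return 0.20  # STAR / ChiNext (prefix rule is an assumption)
    return 0.05 if is_st else 0.10  # main board, ST vs non-ST


def can_sell(buy_day_index: int, today_index: int) -> bool:
    """T+1: a position opened on trading day t can only be sold from t+1 onward."""
    return today_index > buy_day_index


for code, st in [("600000.SH", False), ("600000.SH", True), ("688981.SH", False), ("430047.BJ", False)]:
    print(code, st, price_limit(code, st))
print(can_sell(buy_day_index=5, today_index=5), can_sell(buy_day_index=5, today_index=6))
```

Keeping the limit as a fraction avoids the unit bug called out in the first bullet, where +10 % is mistaken for 10.0.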
- Never train on the 8C14G ECS. A PyTorch install alone accounts for ~3 GB RSS.
- `ProcessPoolExecutor` / `ThreadPoolExecutor`: `max_workers` ≤ 3.
- PostgreSQL `shared_buffers=2GB`.
- bf16 autocast is unstable for PPO on the 4070 / Ampere (Phase 11); keep fp32 + TF32 (Phase 14).
- `<PRIVATE_OBJECT_STORE>/handoffs/handoffs/...` = ledashi (4070 Windows) → paris (Ubuntu ECS) direction
- `<PRIVATE_OBJECT_STORE>/handoffs/...` = paris → ledashi direction
- The ECS in the sgp region can only upload directly to `<PRIVATE_OBJECT_STORE>`; CRR then syncs automatically to the mainland `<PRIVATE_OBJECT_STORE>` for ledashi to pull.
- `oss://` paths in docs must point to the primary bucket (`<PRIVATE_OBJECT_STORE>/...`), not the source bucket (`<PRIVATE_OBJECT_STORE>/...`).
The full list is in docs/TRAINING_HISTORY.md Section D. Summary:
- Phase 0: gymnasium import / ONNX device mismatch / `dynamo=True` failure / numpy.float32 JSON
- Phase 1: NaN through z-score / OOS obs_dim mismatch / approx_kl=41,820 blow-up / mean_fps=0 / metrics_summary all-null / `runs/` gitignore unanchored / alpha045 STHSF Windows-only failure / OSS admin AK disabled / wepa namespace pollution
- Phase 2: dashboard canonical-key filter / recharts width(-1) height(-1) warnings / hydration mismatch from MPA browser extension / missing `/api/runs` route / next.js node:fs/promises in client bundle
- Phase 3: eval_backtest n_factors mismatch / R2 OOM (8.83 GiB rollout buffer)
- Phase 4: n_envs=20 OOM (14.7 GiB)
- Phase 5: OSS IncompleteRead at 99.7 % / connection timeout on parallel transfers / dev server `&`-orphan exit 127
git clone https://github.com/yupoet/aurumq-rl.git
cd aurumq-rl
python3 -m venv .venv && source .venv/bin/activate
# Install core dependencies (inference only, ~50MB)
pip install -e .
# Run the smoke test (synthetic data, CPU is enough)
python scripts/train.py --smoke-test --out-dir /tmp/aurumq_rl_smoke
cat /tmp/aurumq_rl_smoke/smoke_summary.json
# 1) Install GPU training dependencies (PyTorch + SB3 + gymnasium + onnx + wandb)
pip install -e ".[train]"
# 2) Prepare data: use the synthetic demo or export your own
python scripts/generate_synthetic.py --out data/synthetic_demo.parquet # 10MB demo
# Or: extract from PG
python scripts/export_factor_panel.py \
--pg-url postgresql://user:pass@host/db \
--start 2020-01-01 --end 2026-04-30 \
--out data/factor_panel.parquet
# 3) Start training (RTX 4070 12GB, n_envs=16, ~6h overnight for 1M steps)
python scripts/train.py \
--algorithm PPO \
--total-timesteps 1000000 \
--data-path data/factor_panel.parquet \
--universe-filter main_board_non_st \
--include-hot-money \
--n-envs 16 \
--target-kl 0.05 \
--reward-mode main_wave_hold \
--out-dir models/ppo_v1
# 4) Inference (CPU only)
python scripts/infer.py \
--model models/ppo_v1/policy.onnx \
--data data/factor_panel.parquet \
--date 2026-04-30 \
--top-k 30
# Phase 22C: train_topk=3, eval@5, 200k steps, seed=42, ~8h on RTX 4070
python scripts/train_v2.py \
--algorithm PPO \
--total-timesteps 200000 \
--data-path data/factor_panel_combined_short_2023_2026.parquet \
--universe-filter main_board_non_st \
--n-envs 16 \
--seed 42 \
--reward-mode main_wave_hold \
--main-wave-config '{"exit_signal":"ma5_ma10_death_cross","max_hold_days":5}' \
--top-k 3 \
--out-dir runs/phase22c
# Evaluate (H1 OOS = 2025-07..2025-12)
python scripts/_eval_main_wave_v1.py \
--ckpt runs/phase22c/policy_best.zip \
--eval-data data/factor_panel_combined_short_2025-07_2026-04.parquet \
--eval-top-k 5 \
--out runs/phase22c/eval.json
Expected results:
- hit_rate@5: 6.16 %
- win_rate: 44.0 %
- avg_hold: +0.62 %
- avg_max_drawdown: 3.84 %
# Phase 26F-v3 — 3 seeds, 361 cols, 300k steps each
for seed in 42 43 44; do
python scripts/train_v2.py \
--algorithm PPO \
--total-timesteps 300000 \
--data-path data/factor_panel_v3.parquet \
--include-columns-file configs/phase26f_v3_361cols.txt \
--universe-filter main_board_non_st \
--n-envs 16 \
--seed $seed \
--reward-mode main_wave_hold \
--out-dir runs/phase26f-v3-seed$seed
done
# Take the 3-seed median (T-1 lift median ~2.27×)
python scripts/_eval_main_wave_v1.py \
--ckpt-glob "runs/phase26f-v3-seed*/policy_best.zip" \
--eval-data data/factor_panel_v3_oos.parquet \
--eval-top-k 5 \
--aggregate median \
--out runs/phase26f-v3-summary.json
# In the AurumQ main repo
cd /path/to/AurumQ
# Run SL Path 1 long (raw 345-col + 7y train)
python scripts/generate_wave_scores_paths.py \
--path path1_long \
--date 2026-05-08 \
--top-k 50
# Run Hybrid (path1_long + path4 50/50)
python scripts/generate_wave_scores_paths.py \
--path hybrid \
--date 2026-05-08 \
--top-k 50
# Run path5_long (NEW BEST)
python scripts/generate_wave_scores_paths.py \
--path path5_long \
--date 2026-05-08 \
--top-k 50
On Windows, `pip install -e ".[train]"` installs the CPU-only torch build from PyPI by default. To get the CUDA build, install torch separately first:
pip install torch --index-url https://download.pytorch.org/whl/cu126
pip install -e ".[train]"
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
The ONNX export step prints messages containing emoji, which the default GBK encoding of a Simplified Chinese Windows console cannot encode:
# bash / git-bash
export PYTHONIOENCODING=utf-8
# PowerShell
$env:PYTHONIOENCODING = "utf-8"
# Start (automatic npm install + npm run dev)
bash scripts/web_dashboard.sh # macOS / Linux / Git Bash
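.\scripts\web_dashboard.ps1 # PowerShell
Open http://localhost:3000 to browse the training history. `/runs/<id>` shows a single run in detail (backtest summary + training curves + GPU utilization panel), and `/compare?ids=a,b,c` overlays multiple runs. The Next.js 16 server routes read the `runs/` directory directly, so no backend is needed.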
短期(已 in-progress):
- path5_long A/B 几周后切到 `is_recommended=True`
- 6 个新增 path 的 wave drift check 覆盖(19:00 cron)
- 26G 在 16 GB 级 GPU 上重测(256→128 encoder 假设的硬件验证)
- gtja_017 等 8 个 quality_flag=1 的 stub 因子在 v2 sanitizer 下重审
中期(设计阶段):
- Conformal prediction interval(已在 SL 侧 Strategy D 之外实测 +5 bps,需移植到 RL)
- 跨市场(US/HK/Crypto)调用同一套训练 stack 的可行性研究
- AQML → PPO reward function 的自动转译("用户写策略意图,自动训出 policy")
远期(探索):
- Transformer attention head 在 36 GB+ 级 GPU 上重审(Phase 21 V2 在 12 GB 失败的复盘)
- 多 agent 协作(市值因子专家 + 资金流专家 + 主升浪专家的 mixture-of-experts)
Short term (in progress). A/B-promote path5_long to is_recommended=True; cover wave drift check for the 6 new paths; re-test Phase 26G on a 16 GB-class GPU; re-audit 8 quality_flag=1 stub factors under v2 sanitizer.
Mid term (design). Conformal prediction interval (proven +5 bps on SL side beyond Strategy D, port to RL); cross-market (US/HK/Crypto) feasibility study with the same stack; AQML → PPO reward auto-translation.
Long term (exploration). Transformer attention head on 36 GB+ GPU (re-audit Phase 21 V2 failure under more compute); multi-agent mixture-of-experts (cap factor expert + capital flow expert + main-wave expert).
If this project helps you, a Star ⭐ and a citation are welcome:
@software{aurumq_rl_2026,
title = {AurumQ-RL: Reinforcement Learning Stock Selection for China A-Shares},
author = {Paris Yu and AurumQ-RL Contributors},
year = {2026},
url = {https://github.com/yupoet/aurumq-rl},
}
中文. 本项目使用的金融数据来自公开行情数据导出,包括日线 OHLCV、资金流分档、龙虎榜、北向持股、融资融券、筹码分布、基本面、申万行业等公开市场信息。这些数据在新浪财经、东方财富、同花顺、券商行情软件等公开渠道均可获取。项目不内置任何特定数据 API 的密钥或商业授权数据。
data/synthetic_demo.parquet 完全是合成数据,不对应任何真实股票。
如需真实数据训练,用户需自行:
- 从合规渠道获取行情数据
- 用 `scripts/export_factor_panel.py` 导入到 PostgreSQL
- 自行承担数据使用合规责任
English. Financial data used in this project comes from public market data exports — daily OHLCV, capital flow buckets, Dragon-Tiger List, Northbound holdings, margin trading, chip distribution, fundamentals, SW industry. These data are available through Sina Finance, Eastmoney, THS, broker platforms, etc. The project does NOT bundle any specific data-vendor API keys or commercially licensed data.
data/synthetic_demo.parquet is entirely synthetic and does not correspond to any real stock.
For real-data training, users must: (1) acquire market data through compliant channels, (2) import via scripts/export_factor_panel.py into their own PostgreSQL, (3) assume data-compliance responsibility themselves.
License. MIT. See LICENSE. 商用、修改、再分发皆可 / commercial use, modification, redistribution all permitted.
Disclaimer (中文). 本项目作者不对收益和风险承担任何责任。请记住:量化策略历史回测优秀 ≠ 实盘赚钱。
Disclaimer (English). This project is for educational and research purposes. Backtested performance does not guarantee live trading profits. The authors take no responsibility for any financial losses incurred from using this code.
ML research on A-share stock selection falls into two paradigms:
学名 / Academic name:
- Cross-Sectional Return Forecasting
- Supervised Alpha Modeling
- Forward Return Prediction
核心范式 / Core: features(t) → y(t), where y(t) = f(forward_returns over [t+1, t+K]). Each day, rank stocks by predicted y and select the top-K.
业界占比 / Industry: mainstream (Renaissance, Two Sigma, most hedge funds).
子方向 / Sub-directions:
| Algorithm family | Academic name | Representative algorithms |
|---|---|---|
| Regression | Continuous target regression | LGB regression / XGBoost / CatBoost / MLP |
| Classification | Binary / Quantile classification | LGB binary / Logistic / Probit |
| Learning-to-Rank | Pairwise + listwise loss | LambdaMART / RankNet / ListNet |
| Distributional | Quantile regression / Mixture density | LGB quantile / TabNet |
| Multi-horizon Multi-task | Joint learning K=1/5/10/20 | Multi-head NN / multi-output LGB |
| Sequence Models | LSTM / Transformer for tabular time-series | Kronos / FinBERT-style |
| Stacking / Meta-learning | L1 / L2 ensemble | LGB / NN on base predictions |
Common to all of Paradigm 1: the label is computed from a forward window. Whether K=20 or K=1, regression or binary, the task is always to predict the outcome over the next K days.
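The forward-window label and the daily top-K selection can be sketched as follows. The close-to-close return used here is just one possible choice of f, and both function names are hypothetical.

```python
from typing import Optional


def forward_return(closes: list, t: int, k: int) -> Optional[float]:
    """Close-to-close return realized over days t+1..t+K; None if the window is incomplete."""
    if t + k >= len(closes):
        return None
    return closes[t + k] / closes[t] - 1.0


def top_k(predictions: dict, k: int) -> list:
    """Rank stocks by predicted y (descending) and keep the first K codes."""
    return sorted(predictions, key=predictions.get, reverse=True)[:k]


closes = [10.0, 10.5, 11.0, 12.0]
print(forward_return(closes, t=0, k=3))   # label for day 0 with K=3
print(forward_return(closes, t=2, k=3))   # None: the forward window runs past the data

predicted_y = {"000001.SZ": 0.012, "600000.SH": 0.031, "600519.SH": -0.004}
print(top_k(predicted_y, k=2))
```

Rows whose forward window is incomplete have no label and must be dropped from training, which is why the last K days of any panel are unusable as training targets.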
学名 / Academic name:
- Event Study + Case-Control Sampling (经济学/金融学派)
- Pre-Event Pattern Detection
- Anomaly / Novelty / Rare-Event Detection (ML 派)
- Sequence-to-Event Models (深度学习派)
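核心范式 / Core: scan history once to find N events, take the T-1/T-3/T-5 days before each event as positives and random non-pre-event days as negatives. features(t) → P(t is pre-event).
业界占比 / Industry: niche (discretionary stock picking + some pattern-recognition trading + anomaly-detection quant).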
子方向 / Sub-directions:
| Algorithm family | Academic name | Representative algorithms |
|---|---|---|
| Event-Anchored Classification | Case-control logistic / Imbalanced binary | LGB / XGBoost on anchored samples |
| Pattern Mining | Matrix Profile / Motif discovery | STUMPY / Time Series Subsequence search |
| Imbalanced Classification | Focal loss / SMOTE oversampling | Focal-Loss LGB / XGBoost scale_pos_weight |
| Anomaly Detection | Isolation Forest / Autoencoder | iForest / VAE |
| Survival Analysis | Cox proportional hazards | scikit-survival / Cox-LGB |
| Sequence-to-Event | LSTM / Transformer / TCN | DeepAR / Kronos fine-tune |
| Self-Supervised Pre-training | Contrastive learning on time series | SimCLR-style for finance |
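Common to all of Paradigm 2: events are defined by a backward scan of history, pre-event windows serve as positives, class imbalance is severe, and the setup matches the idea of entering just before a main wave (主升浪).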
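A case-control sampler for this setup can be sketched as below. It is a hedged illustration: the function name is hypothetical, all three offsets are pooled into one positive set, and the negatives are drawn from days that are neither events nor positive anchors, which are assumptions about details the text leaves open.

```python
import random


def case_control_labels(n_days, event_days, offsets=(1, 3, 5), n_negatives=None, seed=42):
    """Positives: days T-1/T-3/T-5 before each event. Negatives: random other days."""
    positives = sorted({e - o for e in event_days for o in offsets if 0 <= e - o < n_days})
    excluded = set(positives) | set(event_days)
    pool = [d for d in range(n_days) if d not in excluded]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = len(positives) if n_negatives is None else n_negatives
    negatives = sorted(rng.sample(pool, min(k, len(pool))))
    return positives, negatives


positives, negatives = case_control_labels(n_days=30, event_days=[10, 20])
print(positives)
print(negatives)
```

Here the negatives are matched one-to-one with the positives; in practice the ratio is a tuning choice, since the natural positive rate is far below 50 %.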
Comprehensive synthesis of 1,473 cells across 8 matrices (5/15-5/18 overnight pipeline): see docs/RANKINGS_COMPREHENSIVE_v18.md for full 13-section report (98 KB, 353 lines, top-20 overall + top-10 per universe/paradigm/panel/method/horizon + sanity checks + production routing + gap audit) + 6 visualization PNGs in docs/figures/.
(composite = H2 IC × Sharpe NET × max(Q1 IC, 0))
| # | cell_id | paradigm | univ | panel | H2 fwd20 IC | Q1 fwd20 IC | Sharpe_NET K10 fwd20 |
|---|---|---|---|---|---|---|---|
| 1 | `target_y_HARD_TECH_v2_null` | p1-proximity-reg | HARD_TECH | v2_null | +6.60% ⭐ | +10.68% ⭐⭐ | 2.46 |
| 2 | `target_y_HARD_TECH_ledashi` | p1-proximity-reg | HARD_TECH | ledashi | +6.29 | +10.82 | 2.39 |
| 3 | `target_y_HARD_TECH_r2a` | p1-proximity-reg | HARD_TECH | r2a | +6.17 | +9.97 | 2.53 |
| 4 | `binary_v4_HARD_TECH_v3unified` | p1-binary-dense | HARD_TECH | v3unified | +5.84 | +5.87 | 4.25 ⭐ |
| 5 | `target_y_HARD_TECH_r2b` | p1-proximity-reg | HARD_TECH | r2b | +6.63 | +8.69 | 2.32 |
| 6 | `target_y_HARD_TECH_v2_no_phase_c` | p1-proximity-reg | HARD_TECH | v2_no_phase_c | +6.19 | +9.44 | 2.24 |
| 7 | `target_y_HARD_TECH_v3unified` | p1-proximity-reg | HARD_TECH | v3unified | +5.88 | +9.20 | 2.42 |
| 8 | `target_y_HARD_TECH_tier4_v2_old` | p1-proximity-reg | HARD_TECH | tier4_v2_old | +6.00 | +9.38 | 1.97 |
| 9 | `alpha_T3_HARD_TECH_ledashi` | p2-anchor | HARD_TECH | ledashi | +6.14 | +5.33 | 2.85 |
| 10 | `binary_v3_HARD_TECH_v2_null` | p1-binary-dense | HARD_TECH | v2_null | +3.92 | +5.24 | 4.04 |
| universe | short (fwd5) best | mid (fwd10) best | long (fwd20) best |
|---|---|---|---|
| MAIN_BOARD | `v2_MAIN_BOARD_r2b` | `v2_MAIN_BOARD_r2b` | `v4_MAIN_BOARD_ledashi` |
| CSI500 | `binary_v2_CSI500_v2_null` | `binary_v2_CSI500_v2_null` | `catboost_v2_CSI500_ledashi` |
| CSI1000 | `v2_CSI1000_tier4_v2_old` | `binary_v4_CSI1000_tier4_v2_old` | `binary_v2_CSI1000_tier4_v2_old` |
| NPF | `target_y_NPF_v3unified` | `v2_NPF_r2a` | `v2_NPF_r2a` |
| NPF_FULL | `v2_NPF_FULL_v3unified` | `binary_v3_NPF_FULL_v3unified` | `binary_v4_NPF_FULL_v2_no_phase_c` |
| HARD_TECH | `binary_v3_HARD_TECH_ledashi` | `binary_v3_HARD_TECH_ledashi` | `binary_v4_HARD_TECH_v3unified` |
| matrix | paradigm | cells | universe × panel grid | bootstrap CI | status |
|---|---|---|---|---|---|
| v10 | P1 proximity reg | 174 | 7×6 + 6 ES eval-only | partial (in v10h) | shipped |
| v10b | P1 proximity reg (target_y) | 42 | 7×6 | partial (in v10h) | shipped |
| v10c | P1 binary dense (P75 25% pos) | 168 | 7×6×4 labels | partial (in v10h) | shipped |
| v10d | P1 CatBoost diversity | 48 | 2 panels × 6 univ × 4 labels | partial (in v10h) | shipped (5-panel gap) |
| v10e | P1 XGBoost diversity | 48 | 2 panels × 6 univ × 4 labels | partial (in v10h) | shipped (5-panel gap) |
| v10h | bootstrap CI | 207×4 | top cells from v10/v10c/v10d/v10e | itself | shipped |
| v11 | P1 binary sparse (paris 0.8%) | 504 | 7×6 × 4 methods × 3 horizons | missing | shipped (no CI) |
| v12 | P2 anchor α/β | 252 (147 valid + 105 skip) | 7×6 × 2 specs × 3 anchors | missing | shipped (no CI; β sparse) |
| v13 | P3 Kronos sequence anchor | 22 | 6 univ × 3 anchor × α full + β-MAIN_BOARD + null control | planned | scheduled 5/22 evening fire |
v13 paradigm 3 matrix (post paris ACK_v30 + ledashi ACK-of-ACK shipped 5/19 PM):
- Architecture: reuse `aurumq_predictor_small` encoder → 1536-dim hidden state (60d + 120d concat) → +1 log(free_float_mv) → 1537-dim → LGB binary head
- D-1 leakage guard strict (paris ACK §R2): `embedding(D) = encoder(OHLCV[D-seq:D-1])`
- Skip Phase 1 explicit pre-train (ledashi optimization, saves 2-4h; fallback if Phase 3 fails across the board)
- 22 cells: 21 main (18 α + 3 β-MAIN_BOARD) + 1 null-embedding random control (paris Q1 strong-rec)
- Compute: Phase 2 ~3-4h GPU embed extract + Phase 3 ~1.5h LGB train + Phase 4 ~1h eval
- Production gate: if 5 cells meet (Sharpe NET ≥ 3.0 + dual-regime + bootstrap CI lower > 0) → Track 11 paradigm 3 catalog launches
Headline findings:
- The strongest single-cell deployable signal is `target_y_HARD_TECH_v2_null` (paradigm `p1-proximity-reg`, panel `v2_null`, universe `HARD_TECH`) with H2_2025 fwd20 IC = +6.60% and Sharpe_NET K10 fwd20 = 2.46, beating the baseline `v3_MAIN_BOARD_ledashi` (+4.14% IC).
- Paradigm 1 (cross-sectional prediction) dominates Paradigm 2 (anchor) on H2 fwd20 IC by ~0.41pp; anchor labels are useful as a meta-feature, not standalone.
- Bootstrap CI (v10h K=50 fwd20): 207/207 cells (100%) have CI 2.5% > 0 — production should preferentially deploy K=50 sizing for tail-control.
- LGB binary dense (v10c) has the highest mean composite score; LGB proximity continuous (v10) has the highest peak composite score. Both retained for production diversification.
- CSI500/CSI1000 cells (PIT membership) are the safest universes; HARD_TECH and NPF cells need ≥ 1pp differential vs baseline to claim improvement (IC SE ≈ 0.018).
- Gap: v11/v12 lack bootstrap CI; v10d/v10e only cover 2 panels of 7. Production routing on those cells should be flagged as 'preliminary'.
Universe × Regime alpha (validated bootstrap CI v10h):
- target_y NPF Q1 IC +10.22% (panel-invariant across 7 panels)
- target_y HARD_TECH dual-regime H2 +6.29 / Q1 +10.82 — record-holder cell
- CSI500 H2 +7.97 / Q1 -2.00 — bull-rotation flip
- HARD_TECH `binary_v4_HARD_TECH_v3unified` equi-regime gold (spread 0.03pp ⭐⭐⭐)
paris production label distribution insight (paris v26+ confirmed):
- paris production wave_v[1234] static train cutoff 2024-12, NO walk-forward retrain
- paris label = global static τ from train-window search with target_pos_rate=0.008 (0.8% positive)
- ledashi v10c dense P75 cross-section threshold = 25% positive → 30x noise → best_iter=1 early-stop bug fixed in v11
- ✅ Baseline reproduction: `v3_MAIN_BOARD_ledashi` H2 fwd20 IC == +4.143% (bit-exact across matrix v4-v8)
- ✅ Cost model: mean - mean_net == 0.20% (0.002) round-trip
- ✅ Gross > Net: cost increases drag for positive-return cells
- ✅ Train/Eval window separation: Train 2022-2024 ≠ Eval H1_2025..Q2_2026 — no look-ahead leakage
- ✅ Deterministic: random_state=42 fixed in all lgb_params
- ✅ PIT correctness: CSI500/CSI1000 daily PIT membership (per CLAUDE.md universe table)
- ✅ Bootstrap CI K=50 fwd20: 100% cells CI 2.5% > 0 (v10h)
- ✅ Bootstrap CI K=10 fwd20: ≥ 20% cells CI 2.5% > 0 (v10h)
- ⚠️ No walk-forward: paris production also uses a static train cutoff, so this is not a blocker
- ⚠️ v11/v12 have no bootstrap CI: known gap, future work
- `fig01_top20_overall_bar.png`: Top-20 cells barplot
- `fig02_panel_universe_heatmap.png`: Panel × universe × paradigm IC heatmaps (4 subplots)
- `fig03_horizon_scaling.png`: IC vs forward horizon (fwd1/3/5/10/20/30) per paradigm
- `fig04_dyn_exit_ranking.png`: Top-5 cells per dyn-exit trigger (11 triggers)
- `fig05_paradigm_compare_scatter.png`: H2 IC vs Q1 IC scatter, colored by paradigm
- `fig06_bootstrap_ci_distribution.png`: Bootstrap CI lower-bound histogram
| matrix | scope | cells | status |
|---|---|---|---|
| matrix v3-v8 | Panel ablation series (paris combined_panel evolution) | ~150 | ✅ done, RESULT v3-v8 shipped |
| matrix v9 | direct ret_fwdK regression (short proximity attempt) | 60 | ✅ done, failed — IC weak/negative |
| matrix v10 | 7 panel × 6 universe × 4 wave_v* × 7 sizing + 6 ES eval | 174 | ✅ done (5/16 12:30, 255 min) |
| matrix v10b | + target_y (paris primary proximity, 5th label) | 42 | ✅ done (5/16 14:30, 111 min) |
| matrix v10c | LGB binary classifier on wave_v* (P75 dense threshold) | 168 | ✅ done (5/16 21:00, 388 min) |
| matrix v10de | CatBoost + XGBoost expanded (algorithm diversity) | 96 | ✅ done (5/17 00:50, 123 min, +inf fix re-fire) |
| matrix v10fg | L1 meta stacker (24) + L2 hybrid blend (6) | 30 | ✅ done (5/17 01:02, all meta cells SKIPPED because only 2/7 panel preds were saved; hybrid completed) |
| matrix v10h | Bootstrap CI post-processing on 207 pred parquets | 207 | ✅ done (5/17 01:56, 55 min, block-bootstrap 1000 iter on Sharpe NET) |
| matrix v11 (paris sparse binary apples-to-apples) | 7 panel × 6 universe × 4 method (A/B/C/D) × 3 horizon (t1/t3/t5) | 504 | 🟡 in progress ~217/504 (43%, 5/17 10:18 OOM crash + resumed, ETA ~22:00) |
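The v10h row above refers to a block bootstrap with 1000 iterations on Sharpe NET. A minimal sketch of that kind of procedure is shown below; the block length, the non-annualized Sharpe definition and the function names are assumptions, not the pipeline's actual settings.

```python
import random
import statistics


def sharpe(returns: list) -> float:
    """Plain mean/stdev ratio (not annualized); 0.0 if the sample has no variance."""
    sd = statistics.stdev(returns)
    return 0.0 if sd == 0 else statistics.mean(returns) / sd


def block_bootstrap_ci(returns, block=5, n_iter=1000, alpha=0.05, seed=42):
    """Moving-block bootstrap: resample contiguous blocks to preserve autocorrelation."""
    rng = random.Random(seed)
    n = len(returns)
    starts = range(n - block + 1)
    stats = []
    for _ in range(n_iter):
        sample = []
        while len(sample) < n:
            s = rng.choice(starts)
            sample.extend(returns[s:s + block])
        stats.append(sharpe(sample[:n]))
    stats.sort()
    lo = stats[int(round(alpha / 2 * n_iter))]
    hi = stats[int(round((1 - alpha / 2) * n_iter)) - 1]
    return lo, hi


# Synthetic net returns with a positive mean (made-up numbers).
net_returns = [0.01 + 0.002 * ((i % 5) - 2) for i in range(100)]
lower, upper = block_bootstrap_ci(net_returns)
print(lower > 0, lower <= upper)
```

A cell passes the "CI 2.5% > 0" check when the lower bound returned here is positive.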
Key non-obvious findings from v10/v10b/v10c/v10de:
- Label structure determines IC ceiling, not panel/algorithm: wave_v3 sparse proximity → +4% IC vs target_y dense calibrated proximity → +2% IC on same MAIN_BOARD ledashi cell.
- Universe×Regime alpha extreme: target_y NPF Q1 IC +10.22% (panel-invariant across 7 panels), target_y HARD_TECH H2 +6.29% & Q1 +10.82% (dual-regime record), CSI500 H2 +7.97% but Q1 -2.00% (bull-rotation flip).
- Phase C `concept_*` features over-engineered: v2_null vs ledashi (no Phase C) on theme universes Q1 IC differ < 0.7pp (Phase C marginal); on HARD_TECH wave_v3 binary, Phase C NULL rescues Q1 +5.24pp vs Phase C present.
- No single panel/label wins all universes (paper-level evidence):
- wave_v3 wins MAIN_BOARD + HARD_TECH + NPF_FULL binary
- wave_v4 wins CSI1000 + NPF binary
- wave_v2 wins CSI500 binary
- LGB binary wins theme universes (NPF/NPF_FULL/HARD_TECH); CatBoost wins PIT universes (CSI500/CSI1000)
- r2b 232-col minimalist panel reaches CSI1000 wave_v3 binary equi-regime gold standard (H2 +5.61% / Q1 +4.62% / spread 0.99pp) with only 3 trees — feature engineering minimalism wins.
- 3 equi-regime gold cells found (spread < 0.5pp + Sharpe NET ≥ 3.5):
  - HARD_TECH v3unified wave_v4 binary: +5.84/+5.87/Sharpe +4.09 (spread 0.03pp ⭐⭐⭐)
  - HARD_TECH v2_no_phase_c wave_v2 binary: +4.19/+4.22/Sharpe +4.12 (spread 0.03pp)
  - MAIN_BOARD v2_no_phase_c wave_v4 binary: +3.24/+3.18/Sharpe +3.64 (spread 0.06pp)
- With the sparse 0.8% label, models train to 50-225 trees (paris production-aligned; avoids the best_iter=1 early-stop bug seen with the dense 25% label).
| matrix | scope | status |
|---|---|---|
| Phase 1 short proximity labels (paris ship 5/16) | 4 method × 3 horizon × 6 universe, target_pos_rate=0.008 | ✅ shipped, used by v11 |
| Phase 2 anchor labels β + α (paris ship 5/16) | 3 anchor (T-1/T-3/T-5) × 6 universe × {α 5-condition, β PELT-hybrid} | ✅ shipped, used by v12 |
| matrix v12 anchor-based (planned) | 7 panel × 6 universe × 3 anchor × {α, β} = 252 cells | 🟡 next after v11 |
| Imbalanced loss variants | Focal-Loss / SMOTE on anchor labels | future |
| Sequence-to-Event | Kronos fine-tune for pre-event detection | future |
v24 (Phase 1 + Phase 2 labels + LABELS_SPEC + IC pre-estimate) → v25 (P2 reference data 5 files) → v25b (wave_v3 retro 2025+ true OOS) → v26 (catboost+xgb hyperparams + 5 reverse-ask answers) → v27 (wave_v1/v2/v4 retro + IC ROI table + regime labels + best_iter table) → 14 docs / 700+ files / ~95 MB cumulative.
From the full paradigm-1 cross-evaluation over matrix v3-v10de, plus bootstrap CI on 207 cells:
Panel × Regime interaction (validated bootstrap CI v10h):
- ledashi 226 pruned panel: H2 momentum regime best on broad universes (MAIN_BOARD/CSI1000 wave_v3 IC +4.14%)
- paris tier4_v2_old 378 panel: NPF binary H2 + Sharpe powerhouse (+5.59 / +4.43 avg vs ledashi +3.01/+3.69)
- v3unified (paris production candidate 244 cols): NPF Q1 IC +11.07% record holder + NPF_FULL wave_v3 binary equi-regime gold (+5.47/+4.28 spread 1.19pp)
- r2b 232-col minimalist: CSI1000 wave_v3 binary dual-regime gold with only 3 trees (+5.61 H2 / +4.62 Q1)
- Phase C concept features over-engineered: NULLing them or dropping them each yields a 0.3-1pp Q1 stability gain on theme universes (NPF/HARD_TECH)
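
The v10h validation step is described only as a block bootstrap with 1000 iterations on Sharpe NET. A minimal sketch of that idea follows, using only the standard library. The moving-block scheme, the block length of 10, the 252-period annualization, and the percentile interval are all assumptions for illustration, not the project's confirmed settings; `block_bootstrap_sharpe_ci` is a hypothetical name.

```python
import math
import random
import statistics


def annualized_sharpe(returns, periods_per_year=252):
    """Mean / stdev of per-period net returns, annualized (252 is an assumption)."""
    sd = statistics.stdev(returns)
    if sd == 0.0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def block_bootstrap_sharpe_ci(returns, n_iter=1000, block_len=10, alpha=0.05, seed=0):
    """Percentile CI for the Sharpe ratio from a moving-block bootstrap.

    Contiguous blocks are resampled with replacement so that short-range
    autocorrelation in the return series is preserved inside each block.
    """
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two returns")
    rng = random.Random(seed)
    block_len = min(block_len, n)
    n_starts = n - block_len + 1  # number of valid block start positions
    draws = []
    for _ in range(n_iter):
        sample = []
        while len(sample) < n:
            start = rng.randrange(n_starts)
            sample.extend(returns[start:start + block_len])
        draws.append(annualized_sharpe(sample[:n]))  # trim to original length
    draws.sort()
    lo = draws[int((alpha / 2) * (n_iter - 1))]
    hi = draws[int((1 - alpha / 2) * (n_iter - 1))]
    return lo, hi


# Synthetic daily net returns, for illustration only (not project data).
demo_rng = random.Random(42)
demo_returns = [demo_rng.gauss(0.001, 0.01) for _ in range(250)]
ci_low, ci_high = block_bootstrap_sharpe_ci(demo_returns)
```

A cell whose lower bound stays above zero is a more defensible claim than a point estimate alone; longer blocks trade bias for variance when returns are strongly autocorrelated.
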
Label × Algorithm interaction (formal evidence v10c+v10de):
- wave_v3 sparse proximity → highest IC across most cells (LGB binary +4.34 > regression +4.14)
- wave_v4 (direct proximity) → best Q1 regime stability on theme universes
- wave_v1 binary → systematically weak (best_iter often = 1/2, no learnable signal)
- wave_v2 binary → fast learner (2-12 trees) on PIT universes
- target_y (paris primary 83% pos rate) → 1/2 the IC of wave_v3 (label sparsity dominates)
- paris sparse 0.8% binary → 30x lower positive rate, model trains 50-225 trees, IC ~0.5-3% (production-relevant decision)
- CatBoost dominates PIT mid-cap universes (CSI500/CSI1000 Q1 +3.93%/+1.56% vs LGB +1.30%/+1.81%)
- LGB binary dominates theme universes (NPF/NPF_FULL/HARD_TECH Q1 +0.36% / +2.03% / -1.51% — only positive Q1)
- XGBoost with generic params consistently ranks 3rd (weakest); it needs paris-tuned hyperparams for v10de_v2
Universe × Label × Panel interaction (3D):
- 6 universes × 4 wave_v* × 7 panels = no global best combination
- Production stack MUST use regime detector + universe×label×panel routing
- Best dual-regime equi-stable cells (spread < 0.5pp, Sharpe NET ≥ 3.5):
  - HARD_TECH × v3unified × wave_v4 binary: H2 +5.84% / Q1 +5.87% / Sharpe +4.09 (spread 0.03pp ⭐⭐⭐)
  - HARD_TECH × v2_no_phase_c × wave_v2 binary: H2 +4.19 / Q1 +4.22 / Sharpe +4.12 (spread 0.03pp)
  - MAIN_BOARD × v2_no_phase_c × wave_v4 binary: H2 +3.24 / Q1 +3.18 / Sharpe +3.64 (spread 0.06pp)
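
The equi-stable criterion above (spread < 0.5pp and Sharpe NET ≥ 3.5) can be written as a small filter. The sketch below is illustrative: the `Cell` type and `is_equi_regime_gold` are hypothetical names, the spread is assumed to be the absolute H2 minus Q1 difference, and the fourth row reuses the r2b CSI1000 figures quoted earlier with its Sharpe left as `None` because the text does not state one.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    name: str
    h2_pct: float                # H2 regime figure, in percent
    q1_pct: float                # Q1 regime figure, in percent
    sharpe_net: Optional[float]  # None where the text gives no Sharpe figure


def is_equi_regime_gold(cell, max_spread_pp=0.5, min_sharpe_net=3.5):
    """True when both regimes agree closely and net Sharpe clears the bar."""
    spread = abs(cell.h2_pct - cell.q1_pct)
    if spread >= max_spread_pp:
        return False
    return cell.sharpe_net is not None and cell.sharpe_net >= min_sharpe_net


cells = [
    Cell("HARD_TECH x v3unified x wave_v4", 5.84, 5.87, 4.09),
    Cell("HARD_TECH x v2_no_phase_c x wave_v2", 4.19, 4.22, 4.12),
    Cell("MAIN_BOARD x v2_no_phase_c x wave_v4", 3.24, 3.18, 3.64),
    Cell("CSI1000 x r2b x wave_v3", 5.61, 4.62, None),  # spread 0.99pp, rejected
]
gold = [c.name for c in cells if is_equi_regime_gold(c)]
```

Running the filter over the full set of evaluated cells, rather than hand-picking, keeps the "gold" list reproducible when thresholds change.
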
Sizing: top_k=5/10/15/20/30/50 + adaptive scheme. Production sweet spot ~10-30 names per universe. Sharpe NET typically 1.5-4.5 after 0.20% round-trip cost (extreme +4.79 on r2a HARD_TECH wave_v3 binary).
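
As a rough illustration of how a top-K book is scored net of the 0.20% round-trip cost, the sketch below picks the K highest-scoring names, equal-weights them, and subtracts the cost once per holding period. Full turnover every period and equal weighting are simplifying assumptions; `top_k_net_return` and the ticker names are hypothetical.

```python
def top_k_net_return(scores, fwd_returns, top_k=10, round_trip_cost=0.0020):
    """Equal-weight the top_k names by score and return (picks, net return).

    Assumes the whole book is turned over each period, so one full
    round-trip cost is charged against the gross return.
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    picks = sorted(scores, key=scores.get, reverse=True)[:top_k]
    if not picks:
        raise ValueError("no candidates to select from")
    gross = sum(fwd_returns[code] for code in picks) / len(picks)
    return picks, gross - round_trip_cost


# Hypothetical scores and forward returns for four names.
demo_scores = {"A": 0.9, "B": 0.5, "C": 0.1, "D": 0.7}
demo_fwd = {"A": 0.01, "B": -0.02, "C": 0.03, "D": 0.02}
demo_picks, demo_net = top_k_net_return(demo_scores, demo_fwd, top_k=2)
```

Here the two picks are A and D, the gross return is 1.5%, and the net return is 1.3% after the 0.20% cost.
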
Dyn-exit production champions:
- MAIN_BOARD wave_v3 + Q_OR_FIE ensemble = best
- CSI1000 wave_v3 + J5_take_profit_5 = highest Sharpe NET seen
- F_trend_break (close < MA5) = wife's strategy, robust across universes
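
The F_trend_break rule (close < MA5) is simple enough to sketch directly. The version below assumes a simple moving average over the five most recent closes including the current day, and exits on the first bar that closes below it; both choices, and the name `trend_break_exit_index`, are assumptions rather than the project's confirmed definition.

```python
def trend_break_exit_index(closes, window=5):
    """Index of the first bar whose close is below its trailing SMA, else None."""
    for i in range(window - 1, len(closes)):
        moving_avg = sum(closes[i - window + 1 : i + 1]) / window
        if closes[i] < moving_avg:
            return i
    return None


# Hypothetical close series: a steady rise followed by a pullback.
demo_closes = [10.0, 10.2, 10.4, 10.6, 10.8, 11.0, 10.5, 10.1]
demo_exit = trend_break_exit_index(demo_closes)
```

In this series the exit fires at index 6, the first day the close (10.5) falls under the five-day average (10.66).
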
paris production label distribution insight (paris v26+ confirmed):
- paris production wave_v[1234] static train cutoff 2024-12, NO walk-forward retrain
- paris label = global static τ from train-window search with target_pos_rate=0.008 (0.8% positive)
- ledashi v10c dense P75 cross-section threshold = 25% positive → 30x noise → best_iter=1 early-stop bug
- v25b retro-score 2025-01+ subset = true paris production-style OOS baseline
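
The contrast between the two labeling schemes can be made concrete. The sketch below assumes each sample carries a raw proximity score; it fixes one global τ on the training window so that about 0.8% of samples are positive, and compares that with a per-day P75 cut that always marks about 25% positive. Choosing τ as a plain quantile is an assumption standing in for the "train-window search" mentioned above, and all function names are hypothetical.

```python
def global_static_threshold(train_scores, target_pos_rate=0.008):
    """Pick tau once on the training window so ~target_pos_rate are positive."""
    ordered = sorted(train_scores, reverse=True)
    n_pos = max(1, round(len(ordered) * target_pos_rate))
    return ordered[n_pos - 1]


def label_with_global_tau(scores, tau):
    """Sparse labels: tau is frozen, so quiet days can have zero positives."""
    return [1 if s >= tau else 0 for s in scores]


def label_cross_section_p75(day_scores):
    """Dense labels: the top quarter of every day is positive by construction."""
    ordered = sorted(day_scores)
    cutoff = ordered[int(0.75 * len(ordered))]
    return [1 if s >= cutoff else 0 for s in day_scores]


# Hypothetical scores with no ties, for illustration only.
demo_train = list(range(1000))
demo_tau = global_static_threshold(demo_train)
sparse_rate = sum(label_with_global_tau(demo_train, demo_tau)) / len(demo_train)
dense_rate = sum(label_cross_section_p75(list(range(100)))) / 100
```

With these inputs the sparse rate is 0.8% and the dense rate is 25%, roughly the 30x gap noted above. The dense scheme forces positives even on days with no real event, which is one plausible reading of why it adds label noise.
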
Tier 1 (1-2 weeks):
- v11 short-K proximity labels (paradigm 1 short-horizon completion)
- v11+ anchor-based main-rising-wave label (paradigm 2 entry)
- Walk-forward rolling retrain (paradigm 1 robustness verification)
- Sector-neutral alpha decomposition (paradigm 1 cleanliness)
Tier 2 (1-3 months):
- Meta-learner across panels (paradigm 1 model diversity)
- Risk-parity portfolio construction (replace top-K equal-weight)
- Regime classifier conditional model (HMM / vol-regime)
- Hyperparam Optuna search (Bayesian)
Tier 3 (3+ months):
- Sequence-to-Event models (paradigm 2 deep learning)
- Self-supervised pre-training on time-series subsequences
- Intraday signals integration (tick-level + cross-asset basis)
- Cross-asset signals (futures basis, ETF flow, options skew)
The matrix v3-v10 series produces academic-grade evidence on:
- Panel design × regime interaction (paper draft target: "Cross-sectional alpha decomposition by regime in A-share markets")
- Hyperparam-label fit (paper draft target: "Regression vs binary classifier choice in proximity-weighted forecasting")
- Dyn-exit ensemble alpha (paper draft target: "Adaptive exit triggers in factor-based portfolios")
- Paradigm 1 vs Paradigm 2 comparison (future paper, once the anchor-based label is complete)
PRs welcomed for: anchor-based label math formula refinement, paradigm-2 algorithm benchmarks, sector-neutral decomposition implementations.
If a phase here taught us something the hard way, we wrote it down so the next person doesn't have to relearn it. PRs that record new lessons are warmly welcomed.
「凡是踩过的坑,都该有文字留下。欢迎 PR 补充新的教训。」
