Commit ae64c51

Author: Hermes PR Integrator
feat(server): expose qwen pre-norm hidden for MTP handoff
Promote a default-off slice from the conflicted Luce-Org#153/Luce-Org#154 native MTP stack. The Qwen35 graph can now optionally mark and return the final hidden state before output norm for future MTP handoff work while leaving default runtime behavior unchanged.

Refresh docs/auto-integration.md with the latest PR containment, conflict probes, Codex delegation outcome, and validation notes.
1 parent 35f2958 commit ae64c51

3 files changed

Lines changed: 22 additions & 5 deletions

docs/auto-integration.md

Lines changed: 10 additions & 4 deletions
@@ -4,14 +4,14 @@ Repository: `Luce-Org/lucebox-hub`
 Integration branch: `auto-integration`
 Writable remote: `easel`
 Upstream remote: `origin` / `Luce-Org`
-Last refresh: `2026-06-01T13:30:51-04:00`
+Last refresh: `2026-06-01T13:54:22-04:00`
 Current base: `origin/main` `8305b6c2`
-Previous integration tip: `easel/auto-integration` `e221024b`
-Current integration source tip before this refresh: `e221024b`
+Previous integration tip: `easel/auto-integration` `35f29582`
+Current integration source tip before this refresh: `35f29582`

 This branch is maintained as a reproducible patch stack over `origin/main`. This unattended run started from a clean primary checkout on `auto-integration`, verified GitHub/Claude/Codex auth using the real user credential home, fetched `origin` and `easel` separately, fetched current non-draft PR heads, and checked exact PR-head containment against the stack tip.

-The current stack contains 29 exact current open non-draft PR heads plus draft #329, which was already integrated before it became draft. No open non-draft PR head advanced since the prior pushed refresh. Six current non-draft PRs remain non-ancestor/selective-port candidates: #305, #237, #221, #154, #153, and #135. Fresh direct-merge probes reconfirmed conflicts for all six remaining candidates. This run ran a tmux-driven Codex read-only pass for #237; it reconfirmed that the only tiny safe PR237 slice is `server/src/common/gguf_metadata.h`, which is already present in the current stack, while the Qwen-specific native MTP runtime remains coupled to current backend/loader/target-graph reconciliation and needs populated-dependency build plus CUDA runtime validation. Existing selective salvage still covers #305's `DFLASH_EXPERT_BUDGET_PCT`, Qwen35MoE gallocr/full-chunk FFN work, and PR305 persistent prefill `StepGraph` reuse slice; #237's common MTP helper scaffold; and #135's diagnostic/control-plane multi-request scheduler scaffolds plus cache-reset seed fix and committed-boundary bookkeeping. The remaining live runtime paths are blocked on broad current-layout reconciliation and runtime validation.
+The current stack contains 29 exact current open non-draft PR heads plus draft #329, which was already integrated before it became draft. No open non-draft PR head advanced since the prior pushed refresh. Six current non-draft PRs remain non-ancestor/selective-port candidates: #305, #237, #221, #154, #153, and #135. Fresh direct-merge probes reconfirmed conflicts for all six remaining candidates. This run ran a tmux-driven Codex pass for the #153/#154 native MTP pair and promoted one default-off current-layout slice: Qwen35 graph inputs/outputs can now expose the final hidden state before output norm (`expose_pre_norm_hidden` / `pre_norm_hidden`) for future MTP handoff work, without enabling native MTP runtime behavior. Codex rejected the broader #153/#154 native MTP loader/graph/tests as old-layout and still coupled to current MoE/backend/CUDA validation. Existing selective salvage still covers #305's `DFLASH_EXPERT_BUDGET_PCT`, Qwen35MoE gallocr/full-chunk FFN work, and PR305 persistent prefill `StepGraph` reuse slice; #237's common MTP helper scaffold; #153/#154's pre-norm hidden exposure; and #135's diagnostic/control-plane multi-request scheduler scaffolds plus cache-reset seed fix and committed-boundary bookkeeping. The remaining live runtime paths are blocked on broad current-layout reconciliation and runtime validation.

 ## Included in the current stack

@@ -54,6 +54,12 @@ Closed, upstreamed, or no-longer-open PRs still represented by the stack/base in

 This run performed (latest first):

+- `date -Is` -> `2026-06-01T13:45:21-04:00` / `2026-06-01T13:54:22-04:00` during this refresh; primary checkout was clean on `auto-integration`, auth/tooling checks succeeded using the real user credential home (`gh auth status`, `claude auth status --text`, and `codex --version`), and `origin` / `easel` were fetched separately. Current refs were `origin/main` `8305b6c2`, `easel/auto-integration` `35f29582`, and source tip `35f29582`; `origin/main` was already represented.
+- Open PR enumeration reported 35 non-draft PRs and 5 draft/excluded PRs (#329 remains draft after earlier integration). Exact-head containment after explicit PR ref fetch showed 29 current open non-draft PR heads included; remaining non-ancestor/selective-port candidates remain #305, #237, #221, #154, #153, and #135.
+- Fresh worktree direct-merge probes were run under `/tmp/luce-auto-cron-20260601-134521/`. Conflict counts remain #305 (61 status / 38 unmerged), #237 (33 / 27), #221 (88 / 25), #154 (13 / 12), #153 (10 / 10), and #135 (3 / 3).
+- Tmux-driven Codex session `luce1345-pr153154-codex` in `/tmp/luce-auto-cron-20260601-134521/probe-pr-154` completed with report `/tmp/luce-codex-pr153154-20260601-134521.txt` and `VERDICT: SAFE_SLICE` for a default-off #153/#154 pre-norm hidden handoff scaffold. The promoted current-layout slice adds `QwenGraphInputs::expose_pre_norm_hidden`, `QwenGraphOutputs::pre_norm_hidden`, and marks/returns `inpL` before `out_norm` when explicitly requested. Codex rejected the broader native MTP loader/graph/test port because it is old-layout (`dflash/`/`dflash27b`), collides with current MoE/backend fields and CMake wiring, and still needs populated-dependency CUDA runtime validation.
+- Validation for this source/manifest refresh: `git diff --check` passed and targeted conflict-marker search in changed files found none. Full CMake validation was not rerun because this checkout still lacks populated `server/deps/llama.cpp` plus the known local CUDA compiler-id `sm_52` `ptxas` failure before project compilation.
+
 - `date -Is` -> `2026-06-01T13:25:36-04:00` / `2026-06-01T13:30:51-04:00` during this refresh; primary checkout was clean on `auto-integration`, auth/tooling checks succeeded using the real user credential home (`gh auth status`, `claude auth status --text`, and `codex --version`), and `origin` / `easel` were fetched separately. Current refs were `origin/main` `8305b6c2`, `easel/auto-integration` `e221024b`, and source tip `e221024b`; `origin/main` was already represented.
 - Open PR enumeration reported 35 non-draft PRs and 5 draft/excluded PRs (#329 remains draft after earlier integration). Exact-head containment after explicit PR ref fetch showed 29 current open non-draft PR heads included; remaining non-ancestor/selective-port candidates remain #305, #237, #221, #154, #153, and #135.
 - Fresh worktree direct-merge probes were run under `/tmp/luce-auto-cron-20260601-132536/`. Conflict counts remain #305 (61 status / 38 unmerged), #237 (33 / 27), #221 (88 / 25), #154 (13 / 12), #153 (10 / 10), and #135 (3 / 3).

server/src/internal.h

Lines changed: 4 additions & 0 deletions
@@ -545,6 +545,7 @@ struct QwenGraphInputs {
     bool capture_layers; // if true, write captured layer features into cache.target_feat
     bool capture_delta_intermediate = false; // if true, populate out_delta_captures
     bool capture_moe_router = false; // if true, expose selected expert ids for MoE layers
+    bool expose_pre_norm_hidden = false; // if true, expose the final hidden before output norm
     int fa_window = 0; // sliding window for FA layers: 0 = full attention
     bool last_token_logits_only = false; // if true, only compute logits for last token (prefill optimization)
     ggml_tensor * parent_ids = nullptr; // [n_tokens] i32; tree mode when non-null
@@ -560,6 +561,9 @@ struct QwenGraphOutputs {
     // One entry per target layer. Populated only when capture_moe_router is
     // true; qwen35 dense layers and non-MoE models leave entries null.
     std::vector<ggml_tensor *> moe_selected;
+    // Final hidden state before output norm. Populated only when
+    // QwenGraphInputs::expose_pre_norm_hidden is true.
+    ggml_tensor * pre_norm_hidden = nullptr;
 };

 struct QwenLayerPrefnOutputs {
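
The opt-in contract introduced by these two fields can be sketched from the caller's side. The following is a minimal illustration only: `ToyTensor`, `ToyGraphInputs`, `ToyGraphOutputs`, `build_toy_graph`, and `has_pre_norm_hidden` are hypothetical stand-ins invented for this sketch and are not the project's types or functions. It assumes only what the diff states: the flag defaults to `false`, and the output pointer stays null unless the flag is set.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a tensor handle; not ggml_tensor.
struct ToyTensor {
    std::vector<float> data;
};

// Hypothetical mirror of the input flag added in the diff (default-off).
struct ToyGraphInputs {
    bool expose_pre_norm_hidden = false;
};

// Hypothetical mirror of the output field added in the diff.
struct ToyGraphOutputs {
    ToyTensor * logits = nullptr;
    ToyTensor * pre_norm_hidden = nullptr;  // stays null unless requested
};

// Toy builder: always returns logits, and fills pre_norm_hidden only
// when the caller explicitly asked for it.
inline ToyGraphOutputs build_toy_graph(const ToyGraphInputs & in,
                                       ToyTensor & hidden,
                                       ToyTensor & logits) {
    ToyGraphOutputs out;
    if (in.expose_pre_norm_hidden) {
        out.pre_norm_hidden = &hidden;
    }
    out.logits = &logits;
    return out;
}

// Caller-side check: the optional output must be tested for null before use.
inline bool has_pre_norm_hidden(const ToyGraphOutputs & out) {
    return out.pre_norm_hidden != nullptr;
}
```

A caller that never sets the flag sees the same outputs as before, which is why the slice can be described as default-off; a consumer that wants the extra tensor sets the flag and guards on the null check.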

server/src/qwen35/qwen35_target_graph.cpp

Lines changed: 8 additions & 1 deletion
@@ -1261,6 +1261,14 @@ QwenGraphOutputs build_qwen35_graph(
         inpL = cur;
     }

+    QwenGraphOutputs og = std::move(og_early);
+    if (in.expose_pre_norm_hidden) {
+        ggml_set_name(inpL, "pre_norm_hidden");
+        ggml_set_output(inpL);
+        ggml_build_forward_expand(gf, inpL);
+        og.pre_norm_hidden = inpL;
+    }
+
     // 2. Final norm
     ggml_tensor * out = rms_norm_mul(ctx, inpL, w.out_norm, w.rms_eps);

@@ -1281,7 +1289,6 @@ QwenGraphOutputs build_qwen35_graph(
         ggml_build_forward_expand(gf, out);
     }

-    QwenGraphOutputs og = std::move(og_early);
     og.logits = logits;
     return og;
 }
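
The ordering in this hunk (capture `inpL` first, then apply the final norm) can be illustrated with plain vectors. This is a rough sketch under stated assumptions, not the project's graph code: `toy_rms_norm_mul`, `ToyTail`, and `toy_graph_tail` are hypothetical names, and the norm formula used here, `x_i / sqrt(mean(x^2) + eps) * w_i`, is the usual weighted RMS norm and is assumed rather than taken from the project's `rms_norm_mul`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy weighted RMS norm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
// Assumes x is non-empty and w has the same length as x.
inline std::vector<float> toy_rms_norm_mul(const std::vector<float> & x,
                                           const std::vector<float> & w,
                                           float eps) {
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }
    const float mean_sq = sum_sq / static_cast<float>(x.size());
    const float scale = 1.0f / std::sqrt(mean_sq + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale * w[i];
    }
    return y;
}

// Hypothetical result of the graph tail: the normalized hidden state plus
// an optional snapshot of the hidden state taken before the norm.
struct ToyTail {
    std::vector<float> normed;
    std::vector<float> pre_norm_hidden;  // empty unless requested
};

// Toy graph tail: snapshot first (if requested), then normalize.
// The snapshot never feeds back into the norm, so `normed` is the same
// whether or not the snapshot was taken.
inline ToyTail toy_graph_tail(const std::vector<float> & hidden,
                              const std::vector<float> & w,
                              float eps,
                              bool expose_pre_norm_hidden) {
    ToyTail tail;
    if (expose_pre_norm_hidden) {
        tail.pre_norm_hidden = hidden;
    }
    tail.normed = toy_rms_norm_mul(hidden, w, eps);
    return tail;
}
```

In this toy version the exposed value is the raw hidden state and the normalized output is identical with the flag on or off, which matches the commit's claim that default runtime behavior is unchanged.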
