feat(server): expose qwen pre-norm hidden for MTP handoff

Hermes PR Integrator · commit ae64c5117758 · 2026-06-01T13:53:15.000-04:00
Promote a default-off slice from the conflicted Luce-Org#153/Luce-Org#154 native MTP stack. The Qwen35 graph can now optionally mark and return the final hidden state before output norm for future MTP handoff work while leaving default runtime behavior unchanged.

Refresh docs/auto-integration.md with the latest PR containment, conflict probes, Codex delegation outcome, and validation notes.
diff --git a/docs/auto-integration.md b/docs/auto-integration.md
@@ -4,14 +4,14 @@ Repository: `Luce-Org/lucebox-hub`
 Integration branch: `auto-integration`
 Writable remote: `easel`
 Upstream remote: `origin` / `Luce-Org`
-Last refresh: `2026-06-01T13:30:51-04:00`
+Last refresh: `2026-06-01T13:54:22-04:00`
 Current base: `origin/main` `8305b6c2`
-Previous integration tip: `easel/auto-integration` `e221024b`
-Current integration source tip before this refresh: `e221024b`
+Previous integration tip: `easel/auto-integration` `35f29582`
+Current integration source tip before this refresh: `35f29582`
 
 This branch is maintained as a reproducible patch stack over `origin/main`. This unattended run started from a clean primary checkout on `auto-integration`, verified GitHub/Claude/Codex auth using the real user credential home, fetched `origin` and `easel` separately, fetched current non-draft PR heads, and checked exact PR-head containment against the stack tip.
 
-The current stack contains 29 exact current open non-draft PR heads plus draft #329, which was already integrated before it became draft. No open non-draft PR head advanced since the prior pushed refresh. Six current non-draft PRs remain non-ancestor/selective-port candidates: #305, #237, #221, #154, #153, and #135. Fresh direct-merge probes reconfirmed conflicts for all six remaining candidates. This run ran a tmux-driven Codex read-only pass for #237; it reconfirmed that the only tiny safe PR237 slice is `server/src/common/gguf_metadata.h`, which is already present in the current stack, while the Qwen-specific native MTP runtime remains coupled to current backend/loader/target-graph reconciliation and needs populated-dependency build plus CUDA runtime validation. Existing selective salvage still covers #305's `DFLASH_EXPERT_BUDGET_PCT`, Qwen35MoE gallocr/full-chunk FFN work, and PR305 persistent prefill `StepGraph` reuse slice; #237's common MTP helper scaffold; and #135's diagnostic/control-plane multi-request scheduler scaffolds plus cache-reset seed fix and committed-boundary bookkeeping. The remaining live runtime paths are blocked on broad current-layout reconciliation and runtime validation.
+The current stack contains 29 exact current open non-draft PR heads plus draft #329, which was already integrated before it became draft. No open non-draft PR head advanced since the prior pushed refresh. Six current non-draft PRs remain non-ancestor/selective-port candidates: #305, #237, #221, #154, #153, and #135. Fresh direct-merge probes reconfirmed conflicts for all six remaining candidates. This run ran a tmux-driven Codex pass for the #153/#154 native MTP pair and promoted one default-off current-layout slice: Qwen35 graph inputs/outputs can now expose the final hidden state before output norm (`expose_pre_norm_hidden` / `pre_norm_hidden`) for future MTP handoff work, without enabling native MTP runtime behavior. Codex rejected the broader #153/#154 native MTP loader/graph/tests as old-layout and still coupled to current MoE/backend/CUDA validation. Existing selective salvage still covers #305's `DFLASH_EXPERT_BUDGET_PCT`, Qwen35MoE gallocr/full-chunk FFN work, and PR305 persistent prefill `StepGraph` reuse slice; #237's common MTP helper scaffold; #153/#154's pre-norm hidden exposure; and #135's diagnostic/control-plane multi-request scheduler scaffolds plus cache-reset seed fix and committed-boundary bookkeeping. The remaining live runtime paths are blocked on broad current-layout reconciliation and runtime validation.
 
 ## Included in the current stack
 
@@ -54,6 +54,12 @@ Closed, upstreamed, or no-longer-open PRs still represented by the stack/base in
 
 This run performed (latest first):
 
+- `date -Is` -> `2026-06-01T13:45:21-04:00` / `2026-06-01T13:54:22-04:00` during this refresh; primary checkout was clean on `auto-integration`, auth/tooling checks succeeded using the real user credential home (`gh auth status`, `claude auth status --text`, and `codex --version`), and `origin` / `easel` were fetched separately. Current refs were `origin/main` `8305b6c2`, `easel/auto-integration` `35f29582`, and source tip `35f29582`; `origin/main` was already represented.
+- Open PR enumeration reported 35 non-draft PRs and 5 draft/excluded PRs (#329 remains draft after earlier integration). Exact-head containment after explicit PR ref fetch showed 29 current open non-draft PR heads included; remaining non-ancestor/selective-port candidates remain #305, #237, #221, #154, #153, and #135.
+- Fresh worktree direct-merge probes were run under `/tmp/luce-auto-cron-20260601-134521/`. Conflict counts remain #305 (61 status / 38 unmerged), #237 (33 / 27), #221 (88 / 25), #154 (13 / 12), #153 (10 / 10), and #135 (3 / 3).
+- Tmux-driven Codex session `luce1345-pr153154-codex` in `/tmp/luce-auto-cron-20260601-134521/probe-pr-154` completed with report `/tmp/luce-codex-pr153154-20260601-134521.txt` and `VERDICT: SAFE_SLICE` for a default-off #153/#154 pre-norm hidden handoff scaffold. The promoted current-layout slice adds `QwenGraphInputs::expose_pre_norm_hidden`, `QwenGraphOutputs::pre_norm_hidden`, and marks/returns `inpL` before `out_norm` when explicitly requested. Codex rejected the broader native MTP loader/graph/test port because it is old-layout (`dflash/`/`dflash27b`), collides with current MoE/backend fields and CMake wiring, and still needs populated-dependency CUDA runtime validation.
+- Validation for this source/manifest refresh: `git diff --check` passed and targeted conflict-marker search in changed files found none. Full CMake validation was not rerun because this checkout still lacks populated `server/deps/llama.cpp` plus the known local CUDA compiler-id `sm_52` `ptxas` failure before project compilation.
+
 - `date -Is` -> `2026-06-01T13:25:36-04:00` / `2026-06-01T13:30:51-04:00` during this refresh; primary checkout was clean on `auto-integration`, auth/tooling checks succeeded using the real user credential home (`gh auth status`, `claude auth status --text`, and `codex --version`), and `origin` / `easel` were fetched separately. Current refs were `origin/main` `8305b6c2`, `easel/auto-integration` `e221024b`, and source tip `e221024b`; `origin/main` was already represented.
 - Open PR enumeration reported 35 non-draft PRs and 5 draft/excluded PRs (#329 remains draft after earlier integration). Exact-head containment after explicit PR ref fetch showed 29 current open non-draft PR heads included; remaining non-ancestor/selective-port candidates remain #305, #237, #221, #154, #153, and #135.
 - Fresh worktree direct-merge probes were run under `/tmp/luce-auto-cron-20260601-132536/`. Conflict counts remain #305 (61 status / 38 unmerged), #237 (33 / 27), #221 (88 / 25), #154 (13 / 12), #153 (10 / 10), and #135 (3 / 3).
diff --git a/server/src/internal.h b/server/src/internal.h
@@ -545,6 +545,7 @@ struct QwenGraphInputs {
     bool          capture_layers; // if true, write captured layer features into cache.target_feat
     bool          capture_delta_intermediate = false; // if true, populate out_delta_captures
     bool          capture_moe_router = false; // if true, expose selected expert ids for MoE layers
+    bool          expose_pre_norm_hidden = false; // if true, expose the final hidden before output norm
     int           fa_window = 0;  // sliding window for FA layers: 0 = full attention
     bool          last_token_logits_only = false; // if true, only compute logits for last token (prefill optimization)
     ggml_tensor * parent_ids = nullptr; // [n_tokens] i32; tree mode when non-null
@@ -560,6 +561,9 @@ struct QwenGraphOutputs {
     // One entry per target layer. Populated only when capture_moe_router is
     // true; qwen35 dense layers and non-MoE models leave entries null.
     std::vector<ggml_tensor *> moe_selected;
+    // Final hidden state before output norm. Populated only when
+    // QwenGraphInputs::expose_pre_norm_hidden is true.
+    ggml_tensor * pre_norm_hidden = nullptr;
 };
 
 struct QwenLayerPrefnOutputs {
diff --git a/server/src/qwen35/qwen35_target_graph.cpp b/server/src/qwen35/qwen35_target_graph.cpp
@@ -1261,6 +1261,14 @@ QwenGraphOutputs build_qwen35_graph(
         inpL = cur;
     }
 
+    QwenGraphOutputs og = std::move(og_early);
+    if (in.expose_pre_norm_hidden) {
+        ggml_set_name(inpL, "pre_norm_hidden");
+        ggml_set_output(inpL);
+        ggml_build_forward_expand(gf, inpL);
+        og.pre_norm_hidden = inpL;
+    }
+
     // 2. Final norm
     ggml_tensor * out = rms_norm_mul(ctx, inpL, w.out_norm, w.rms_eps);
 
@@ -1281,7 +1289,6 @@ QwenGraphOutputs build_qwen35_graph(
         ggml_build_forward_expand(gf, out);
     }
 
-    QwenGraphOutputs og = std::move(og_early);
     og.logits = logits;
     return og;
 }
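
The opt-in contract added by this commit can be illustrated with a minimal, ggml-free sketch. Everything below is a stand-in written for illustration: `FakeTensor`, `SketchInputs`, `SketchOutputs`, `build_sketch_graph`, and `mtp_handoff_ready` are hypothetical names, not part of the repository, and the real `build_qwen35_graph` operates on `ggml_tensor` nodes rather than these structs. The sketch only mirrors the ordering shown in the diff: the hidden state is captured before the final norm, and the output pointer stays null unless the flag is set.

```cpp
#include <cassert>
#include <string>

// Hypothetical placeholder for ggml_tensor; holds only what the sketch needs.
struct FakeTensor {
    std::string name;
    bool        is_output = false;
};

// Mirrors the shape of the new default-off flag on QwenGraphInputs.
struct SketchInputs {
    bool expose_pre_norm_hidden = false;
};

// Mirrors the shape of the new optional field on QwenGraphOutputs.
struct SketchOutputs {
    FakeTensor * logits          = nullptr;
    FakeTensor * pre_norm_hidden = nullptr; // stays null unless requested
};

// Assumed ordering, following the diff: build the outputs struct early,
// optionally mark the pre-norm hidden state, then continue to norm and logits.
SketchOutputs build_sketch_graph(const SketchInputs & in,
                                 FakeTensor & hidden,
                                 FakeTensor & logits) {
    SketchOutputs og;
    if (in.expose_pre_norm_hidden) {
        hidden.name      = "pre_norm_hidden"; // stands in for ggml_set_name
        hidden.is_output = true;              // stands in for ggml_set_output
        og.pre_norm_hidden = &hidden;
    }
    // The final norm and LM head would be applied here in the real graph.
    og.logits = &logits;
    return og;
}

// Hypothetical caller-side check: a future MTP consumer should treat a null
// pointer as "not requested" rather than as an error.
bool mtp_handoff_ready(const SketchOutputs & og) {
    return og.pre_norm_hidden != nullptr && og.pre_norm_hidden->is_output;
}
```

With the flag left at its default, `build_sketch_graph` returns a null `pre_norm_hidden` and leaves the hidden tensor unmarked, which corresponds to the commit's claim that default runtime behavior is unchanged. A caller that wants the handoff sets `expose_pre_norm_hidden = true` and checks the pointer before use.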
