Summary
When both specprefill_enabled and dflash_enabled are set to true for a model, the specprefill prefill-TPS boost is silently dropped. Only the dflash TG-TPS boost (block-diffusion speculative decoding) remains active.
Behaviour
| Config | pp tps | tg tps |
|---|---|---|
| specprefill only | ✅ boosted | baseline |
| dflash only | baseline | ✅ boosted |
| both enabled | ❌ no boost | ✅ boosted |
Root Cause
engine_pool._load_engine() checks for dflash first (engine_pool.py ~L627):
    if dflash_enabled and dflash_draft:
        engine = DFlashEngine(...)   # engine is now non-None
    if engine is None:               # False → BatchedEngine is never created
        engine = BatchedEngine(...)
Specprefill's draft model is loaded and wired up exclusively inside BatchedEngine.start() (batched.py ~L301):
    self._engine.engine.scheduler.set_specprefill_draft_model(draft_model, ...)
DFlashEngine has no scheduler and no specprefill concept. The specprefill kwargs passed from server.py flow into DFlashEngine.chat() → generate() → **kwargs and are silently discarded — stream_dflash_generate does not know about them.
There is also no validation in ModelSettings.__post_init__ that rejects this combination. The mtp+dflash conflict raises a ValueError; specprefill+dflash does not, so users can configure both and see no error — just a silent no-op.
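The selection order described above can be reproduced in isolation. The following is a minimal sketch using stand-in classes, not the real omlx code: `load_engine`, `specprefill_active`, and the `specprefill_wired` attribute are hypothetical names introduced only to illustrate why the specprefill wiring is never reached once the DFlash branch is taken.

```python
class DFlashEngine:
    """Stand-in: has no scheduler, so there is nothing to attach a specprefill draft to."""


class BatchedEngine:
    """Stand-in: the only engine whose start() wires up specprefill."""

    def __init__(self):
        self.specprefill_wired = False

    def start(self, specprefill_enabled):
        # In the real code this is where the draft model is handed to the scheduler.
        self.specprefill_wired = specprefill_enabled


def load_engine(dflash_enabled, dflash_draft, specprefill_enabled):
    """Mirror the control flow from the issue: dflash is checked first."""
    engine = None
    if dflash_enabled and dflash_draft:
        engine = DFlashEngine()
    if engine is None:  # skipped whenever the DFlash branch ran
        engine = BatchedEngine()
        engine.start(specprefill_enabled)
    return engine


def specprefill_active(engine):
    """True only if the engine actually had specprefill wired up."""
    return getattr(engine, "specprefill_wired", False)
```

With both flags set, `specprefill_active(load_engine(True, "draft", True))` is `False`, which matches the "both enabled" row of the table.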
Why dflash prefill can't just "use" specprefill as-is
stream_dflash_generate accepts prompt_tokens_override as a flat token list but has no prompt_token_positions / sparse-indices parameter. True specprefill requires placing selected tokens at their original positions via manual RoPE injection. Dflash handles RoPE internally and provides no hook for this, so correct sparse prefill is impossible without an API addition to dflash-mlx.
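To make the positional problem concrete, here is a small sketch of the standard RoPE angle formula (angle = position × base^(−2i/d)). The helper names `rope_angles` and `relative_offsets`, and the example positions, are illustrative assumptions and not part of omlx or dflash-mlx. It shows that a sparse selection fed as a flat list gets different rotation angles and different relative distances than the same tokens at their original positions.

```python
def rope_angles(position, head_dim=8, base=10000.0):
    """Rotation angles applied to each dimension pair at a given position."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def relative_offsets(positions):
    """Distance from each kept token to the last kept token."""
    return [positions[-1] - p for p in positions]


# Tokens that a sparse selection kept, by original index in the prompt.
original_positions = [0, 5, 9]
# What a flat token list implies if the callee assigns positions itself.
flat_positions = list(range(len(original_positions)))

# The second kept token is rotated as position 5 in true sparse prefill,
# but as position 1 when passed as a flat list.
angles_sparse = rope_angles(original_positions[1])
angles_flat = rope_angles(flat_positions[1])
```

Because attention under RoPE depends on relative offsets, the model sees the kept tokens as adjacent (`relative_offsets(flat_positions)`) rather than spread out (`relative_offsets(original_positions)`).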
Options
1. Short-term: add a warning / validation error (low effort)
Add a __post_init__ check in ModelSettings analogous to the existing mtp+dflash guard:
    if self.dflash_enabled and self.specprefill_enabled:
        raise ValueError(
            "dflash_enabled and specprefill_enabled cannot both be True; "
            "specprefill has no effect when DFlash is the active engine"
        )
This surfaces the conflict instead of silently ignoring it.
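If rejecting existing configs outright is considered too disruptive, the "warning" half of this option could look roughly like the sketch below. The `ModelSettings` class here is a hypothetical two-field stand-in, not the real settings class, and using `warnings.warn` rather than the project's logger is an assumption.

```python
import warnings
from dataclasses import dataclass


@dataclass
class ModelSettings:  # hypothetical minimal stand-in for the real class
    dflash_enabled: bool = False
    specprefill_enabled: bool = False

    def __post_init__(self):
        # Surface the conflict without breaking configs that already set both flags.
        if self.dflash_enabled and self.specprefill_enabled:
            warnings.warn(
                "specprefill_enabled has no effect when DFlash is the active engine",
                UserWarning,
                stacklevel=2,
            )
```

The trade-off is that a warning is easy to miss in server logs, whereas the `ValueError` variant fails fast at config load time.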
2. Medium-term: approximate specprefill via token pre-filtering
Before calling stream_dflash_generate, run score_tokens() / select_chunks() from omlx/patches/specprefill.py to identify important tokens, then pass only those as prompt_tokens_override. This reduces dflash's prefill work and produces a real pp-tps boost. The caveat: filtered tokens arrive at sequential positions (0, 1, 2, …) rather than their original scattered positions — no RoPE correction. This is prompt compression semantics, not true specprefill. Quality impact is moderate at high prune rates.
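A rough sketch of the pre-filtering step follows. It does not call the real `score_tokens()` / `select_chunks()` functions, whose signatures are not shown in this issue; instead `prefilter_prompt` is a hypothetical helper that takes precomputed per-token importance scores and keeps the highest-scoring fraction in original order.

```python
import math


def prefilter_prompt(tokens, scores, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by score, preserving order.

    Returns (kept_tokens, kept_positions). Only kept_tokens could be passed as
    prompt_tokens_override today; kept_positions is what gets lost.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank indices by importance, then restore prompt order for the survivors.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_positions = sorted(ranked[:keep])
    kept_tokens = [tokens[i] for i in kept_positions]
    return kept_tokens, kept_positions
```

The second return value makes the caveat explicit: the original indices are known at this point, but there is currently nowhere to send them.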
3. Longer-term: add sparse-prefill support to dflash-mlx
Extend stream_dflash_generate with a prompt_token_positions: list[int] | None parameter so omlx can pass a sparse selection with correct positional encodings. This would enable full specprefill quality inside the dflash pipeline.
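One possible shape for the input handling behind such a parameter is sketched below. `resolve_positions` is a hypothetical helper, not an existing dflash-mlx function, and the validation rules (same length as the tokens, non-negative, strictly increasing) are assumptions about what a sparse selection would need to satisfy.

```python
from typing import Optional


def resolve_positions(
    prompt_tokens: list[int],
    prompt_token_positions: Optional[list[int]] = None,
) -> list[int]:
    """Return the position to use for each prompt token.

    None keeps the current behaviour (dense positions 0..n-1); an explicit
    list lets the caller supply the original indices of a sparse selection.
    """
    if prompt_token_positions is None:
        return list(range(len(prompt_tokens)))
    if len(prompt_token_positions) != len(prompt_tokens):
        raise ValueError("prompt_token_positions must match prompt_tokens in length")
    if any(p < 0 for p in prompt_token_positions):
        raise ValueError("positions must be non-negative")
    pairs = zip(prompt_token_positions, prompt_token_positions[1:])
    if any(b <= a for a, b in pairs):
        raise ValueError("positions must be strictly increasing")
    return list(prompt_token_positions)
```

Defaulting to `None` would keep existing callers unchanged, while omlx could pass the indices produced by its chunk selection.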
Suggested immediate fix
At minimum, Option 1 should be merged quickly so users aren't confused by the silent no-op. Options 2 and 3 are follow-on improvements.