
specprefill pp-tps boost silently lost when dflash is also enabled #1233

Description

@drumtorben

Summary

When both specprefill_enabled and dflash_enabled are set to true for a model, the specprefill prefill-TPS boost is silently dropped. Only the dflash TG-TPS boost (block-diffusion speculative decoding) remains active.

Behaviour

| Config | pp tps | tg tps |
|---|---|---|
| specprefill only | ✅ boosted | baseline |
| dflash only | baseline | ✅ boosted |
| both enabled | ❌ no boost | ✅ boosted |

Root Cause

engine_pool._load_engine() checks for dflash first (engine_pool.py ~L627):

if dflash_enabled and dflash_draft:
    engine = DFlashEngine(...)   # engine is now non-None

if engine is None:               # False → BatchedEngine is never created
    engine = BatchedEngine(...)
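
The fall-through can be reproduced with stand-in classes. This is only a sketch of the control flow quoted above: the class bodies are empty stubs, and `load_engine` is a hypothetical simplification of `engine_pool._load_engine()`, not the real function.

```python
from typing import Optional


class BatchedEngine:
    """Empty stub standing in for the real BatchedEngine."""


class DFlashEngine:
    """Empty stub standing in for the real DFlashEngine."""


def load_engine(dflash_enabled: bool, dflash_draft: Optional[str]):
    """Hypothetical reduction of the selection logic described in the issue."""
    engine = None
    if dflash_enabled and dflash_draft:
        engine = DFlashEngine()      # dflash wins whenever it is configured
    if engine is None:
        engine = BatchedEngine()     # only reached when dflash is off
    return engine


# With dflash configured, BatchedEngine is never constructed,
# regardless of any specprefill setting.
chosen = load_engine(dflash_enabled=True, dflash_draft="draft-model")
print(type(chosen).__name__)
```

Note that no specprefill flag appears in the branch at all, which is why enabling it changes nothing on this path.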

Specprefill's draft model is loaded and wired up exclusively inside BatchedEngine.start() (batched.py ~L301):

self._engine.engine.scheduler.set_specprefill_draft_model(draft_model, ...)

DFlashEngine has no scheduler and no specprefill concept. The specprefill kwargs passed from server.py flow through DFlashEngine.chat() → generate() → **kwargs and are silently discarded, because stream_dflash_generate does not know about them.
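
How `**kwargs` forwarding can swallow options without any error is easy to show in isolation. The functions below are hypothetical stand-ins (none of them is the real omlx or dflash-mlx code), and `specprefill_keep_pct` is an invented option name used only for illustration.

```python
def stream_generate_stub(prompt_tokens, max_tokens=16):
    """Stand-in for a generator that has no specprefill parameters."""
    return {"prompt_len": len(prompt_tokens), "max_tokens": max_tokens}


def generate(prompt_tokens, **kwargs):
    # Only the keys this layer knows about are forwarded; the rest vanish.
    max_tokens = kwargs.get("max_tokens", 16)
    return stream_generate_stub(prompt_tokens, max_tokens=max_tokens)


def chat(prompt_tokens, **kwargs):
    return generate(prompt_tokens, **kwargs)


def generate_strict(prompt_tokens, **kwargs):
    """Variant that rejects unknown options instead of dropping them."""
    unknown = set(kwargs) - {"max_tokens"}
    if unknown:
        raise TypeError(f"unsupported options: {sorted(unknown)}")
    return stream_generate_stub(prompt_tokens, **kwargs)


# The specprefill options are accepted and have no effect on the result.
result = chat([1, 2, 3], specprefill=True, specprefill_keep_pct=0.3)
print(result)
```

A strict variant like `generate_strict` is one way such a mismatch could be surfaced at call time rather than going unnoticed.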

There is also no validation in ModelSettings.__post_init__ that rejects this combination. The mtp+dflash conflict raises a ValueError; specprefill+dflash does not, so users can configure both and see no error — just a silent no-op.

Why dflash prefill can't just "use" specprefill as-is

stream_dflash_generate accepts prompt_tokens_override as a flat token list but has no prompt_token_positions / sparse-indices parameter. True specprefill requires placing selected tokens at their original positions via manual RoPE injection. Dflash handles RoPE internally and provides no hook for this, so correct sparse prefill is impossible without an API addition to dflash-mlx.
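
Why positions matter can be seen from the textbook RoPE formulation, where the rotation angle for frequency pair i at position p is p * base^(-2i/d). The sketch below assumes that standard form with a base of 10000; the rotary variant actually used by a given model or by dflash-mlx may differ.

```python
def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> list:
    """Rotation angles (radians) for each frequency pair at one position.

    Textbook RoPE: angle_i = position * base ** (-2 * i / head_dim).
    """
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# A token selected from original position 6 ...
original = rope_angles(position=6, head_dim=4)
# ... lands at position 2 when fed as part of a flat, compacted token list.
compacted = rope_angles(position=2, head_dim=4)

# The angles differ, so the token's keys and queries are rotated as if it
# sat somewhere else in the prompt.
print(original, compacted)
```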

Options

1. Short-term: add a warning / validation error (low effort)

Add a __post_init__ check in ModelSettings analogous to the existing mtp+dflash guard:

if self.dflash_enabled and self.specprefill_enabled:
    raise ValueError(
        "dflash_enabled and specprefill_enabled cannot both be True; "
        "specprefill has no effect when DFlash is the active engine"
    )

This surfaces the conflict instead of silently ignoring it.
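
As a self-contained sketch, the guard behaves as follows. The `ModelSettings` below is a reduced stand-in with only the two relevant flags; the real class has more fields and the existing mtp+dflash check, which are omitted here.

```python
from dataclasses import dataclass


@dataclass
class ModelSettings:
    # Reduced stand-in: only the two flags involved in this conflict.
    dflash_enabled: bool = False
    specprefill_enabled: bool = False

    def __post_init__(self) -> None:
        if self.dflash_enabled and self.specprefill_enabled:
            raise ValueError(
                "dflash_enabled and specprefill_enabled cannot both be True; "
                "specprefill has no effect when DFlash is the active engine"
            )


# Each feature alone is accepted.
ModelSettings(dflash_enabled=True)
ModelSettings(specprefill_enabled=True)

# The combination fails at construction time instead of at inference time.
try:
    ModelSettings(dflash_enabled=True, specprefill_enabled=True)
except ValueError as exc:
    print(f"rejected: {exc}")
```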

2. Medium-term: approximate specprefill via token pre-filtering

Before calling stream_dflash_generate, run score_tokens() / select_chunks() from omlx/patches/specprefill.py to identify important tokens, then pass only those as prompt_tokens_override. This reduces dflash's prefill work and produces a real pp-tps boost. The caveat: filtered tokens arrive at sequential positions (0, 1, 2, …) rather than their original scattered positions — no RoPE correction. This is prompt compression semantics, not true specprefill. Quality impact is moderate at high prune rates.
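
The pre-filtering idea and its position caveat can be sketched with toy data. `select_chunks_stub` is a hypothetical stand-in written for this example; it does not reproduce the signature or scoring of the real `score_tokens()` / `select_chunks()`, and the scores are made up.

```python
def select_chunks_stub(scores, chunk_size, keep_ratio):
    """Keep the highest-scoring chunks and return token indices in prompt order."""
    n_chunks = (len(scores) + chunk_size - 1) // chunk_size
    ranked = []
    for c in range(n_chunks):
        chunk = scores[c * chunk_size:(c + 1) * chunk_size]
        ranked.append((sum(chunk) / len(chunk), c))
    n_keep = max(1, round(n_chunks * keep_ratio))
    kept_chunks = sorted(c for _, c in sorted(ranked, reverse=True)[:n_keep])
    indices = []
    for c in kept_chunks:
        indices.extend(range(c * chunk_size, min((c + 1) * chunk_size, len(scores))))
    return indices


tokens = [101, 102, 103, 104, 105, 106, 107, 108]
scores = [0.9, 0.8, 0.1, 0.1, 0.2, 0.1, 0.7, 0.6]   # made-up importance scores

kept_idx = select_chunks_stub(scores, chunk_size=2, keep_ratio=0.5)
prompt_tokens_override = [tokens[i] for i in kept_idx]

# What the engine would see: a dense list, so positions restart at 0, 1, 2, ...
seen_positions = list(range(len(prompt_tokens_override)))

print(kept_idx)         # original positions of the kept tokens
print(seen_positions)   # positions they are actually encoded at
```

The gap between `kept_idx` and `seen_positions` is the missing RoPE correction: the prompt gets shorter, but the kept tokens no longer carry their original positions.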

3. Longer-term: add sparse-prefill support to dflash-mlx

Extend stream_dflash_generate with a prompt_token_positions: list[int] | None parameter so omlx can pass a sparse selection with correct positional encodings. This would enable full specprefill quality inside the dflash pipeline.
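
If such a parameter were added, the caller would need to hand over tokens and positions that line up. The helper below is purely hypothetical (no such function exists in omlx or dflash-mlx today) and only sketches the invariants a sparse selection would plausibly have to satisfy.

```python
from typing import List, Optional


def validate_sparse_prompt(
    prompt_tokens: List[int],
    prompt_token_positions: Optional[List[int]],
) -> List[int]:
    """Return the positions to encode at, defaulting to dense 0..n-1.

    Hypothetical checks: one position per token, non-negative,
    and strictly increasing so the original prompt order is preserved.
    """
    if prompt_token_positions is None:
        return list(range(len(prompt_tokens)))
    if len(prompt_token_positions) != len(prompt_tokens):
        raise ValueError("need exactly one position per prompt token")
    previous = -1
    for pos in prompt_token_positions:
        if pos <= previous:
            raise ValueError("positions must be non-negative and strictly increasing")
        previous = pos
    return list(prompt_token_positions)


# Dense prompt: unchanged behaviour.
print(validate_sparse_prompt([101, 102, 103], None))
# Sparse prompt: kept tokens retain their original positions.
print(validate_sparse_prompt([101, 102, 107, 108], [0, 1, 6, 7]))
```

Defaulting to dense positions when the argument is `None` would keep existing callers working unchanged.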

Suggested immediate fix

At minimum, Option 1 should be merged quickly so users aren't confused by the silent no-op. Options 2 and 3 are follow-on improvements.
