specprefill pp-tps boost silently lost when dflash is also enabled

## Summary

When both `specprefill_enabled` and `dflash_enabled` are set to `true` for a model, the specprefill prefill-TPS boost is silently dropped. Only the dflash TG-TPS boost (block-diffusion speculative decoding) remains active.

## Behaviour

| Config | pp tps | tg tps |
|---|---|---|
| specprefill only | ✅ boosted | baseline |
| dflash only | baseline | ✅ boosted |
| both enabled | ❌ no boost | ✅ boosted |

## Root Cause

`engine_pool._load_engine()` checks for dflash **first** ([engine_pool.py ~L627](https://github.com/jundot/omlx/blob/main/omlx/engine_pool.py)):

```python
if dflash_enabled and dflash_draft:
    engine = DFlashEngine(...)   # engine is now non-None

if engine is None:               # False → BatchedEngine is never created
    engine = BatchedEngine(...)
```

Specprefill's draft model is loaded and wired up exclusively inside `BatchedEngine.start()` ([batched.py ~L301](https://github.com/jundot/omlx/blob/main/omlx/engine/batched.py)):

```python
self._engine.engine.scheduler.set_specprefill_draft_model(draft_model, ...)
```

`DFlashEngine` has no `scheduler` and no specprefill concept. The `specprefill` kwargs passed from `server.py` flow through `DFlashEngine.chat()` → `generate()` → `**kwargs` and are silently discarded, because `stream_dflash_generate` does not know about them.

There is also no validation in `ModelSettings.__post_init__` that rejects this combination. The mtp+dflash conflict raises a `ValueError`; specprefill+dflash does not, so users can configure both and see no error, only a silent no-op.
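To make the failure mode concrete, here is a minimal, self-contained reproduction of the pattern described above. It is only a sketch: `Settings`, `StubBatchedEngine`, `StubDFlashEngine` and `load_engine` are hypothetical stand-ins, not the real omlx classes, and the only behaviour they mirror is the selection order and the swallowed `**kwargs`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    # Hypothetical stand-in for the relevant ModelSettings fields.
    dflash_enabled: bool = False
    dflash_draft: Optional[str] = None
    specprefill_enabled: bool = False


class StubBatchedEngine:
    """Stand-in for the only engine that wires up specprefill."""

    def __init__(self, settings: Settings) -> None:
        self.specprefill_active = settings.specprefill_enabled


class StubDFlashEngine:
    """Stand-in for an engine that has no specprefill concept."""

    specprefill_active = False

    def generate(self, prompt: str, **kwargs) -> str:
        # Unknown kwargs, including any specprefill ones, are dropped here.
        return f"generated from {prompt!r}"


def load_engine(settings: Settings):
    engine = None
    if settings.dflash_enabled and settings.dflash_draft:
        engine = StubDFlashEngine()          # dflash is checked first
    if engine is None:                       # skipped once dflash has won
        engine = StubBatchedEngine(settings)
    return engine


both = load_engine(
    Settings(dflash_enabled=True, dflash_draft="draft", specprefill_enabled=True)
)
specprefill_only = load_engine(Settings(specprefill_enabled=True))
```

With both flags set, `both` is a `StubDFlashEngine` whose `specprefill_active` is `False`, and calling `both.generate("hi", specprefill=True)` succeeds without any error, which is the silent no-op reported here.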

## Why dflash prefill can't just "use" specprefill as-is

`stream_dflash_generate` accepts `prompt_tokens_override` as a flat token list but has no `prompt_token_positions` / sparse-indices parameter. True specprefill requires placing selected tokens at their **original positions** via manual RoPE injection. Dflash handles RoPE internally and provides no hook for this, so correct sparse prefill is impossible without an API addition to dflash-mlx.
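A small numeric illustration of why positions matter. This is a sketch that assumes the textbook RoPE formulation (rotation angle = position × base^(-2i/d)); the `head_dim` and `base` values are illustrative and not taken from any particular model, and `rope_angle` is a hypothetical helper, not a dflash-mlx or omlx function.

```python
def rope_angle(position: int, pair_index: int,
               head_dim: int = 64, base: float = 10000.0) -> float:
    """Rotation angle applied to one (even, odd) feature pair at a position."""
    inv_freq = base ** (-2.0 * pair_index / head_dim)
    return position * inv_freq


# Suppose sparse selection keeps the tokens at these original positions.
original_positions = [0, 1, 57, 58, 300]

# A flat token list is implicitly re-indexed 0..n-1 by whoever consumes it.
compacted_positions = list(range(len(original_positions)))

# Under RoPE, the attention score between two tokens depends on their
# relative offset, so compare the offset between the 2nd and last kept token.
true_offset = original_positions[-1] - original_positions[1]
seen_offset = compacted_positions[-1] - compacted_positions[1]

true_angle = rope_angle(true_offset, pair_index=0)
seen_angle = rope_angle(seen_offset, pair_index=0)
```

Here the two tokens are 299 positions apart in the original prompt but appear only 3 apart once flattened, so the rotation the model applies no longer matches where the tokens actually sat.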

## Options

### 1. Short-term: add a warning / validation error (low effort)

Add a `__post_init__` check in `ModelSettings` analogous to the existing mtp+dflash guard:

```python
if self.dflash_enabled and self.specprefill_enabled:
    raise ValueError(
        "dflash_enabled and specprefill_enabled cannot both be True; "
        "specprefill has no effect when DFlash is the active engine"
    )
```

This surfaces the conflict instead of silently ignoring it.
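If a hard error is considered too disruptive for existing configs, the "warning" variant from the heading could look roughly like the sketch below. `ModelSettingsSketch` is a hypothetical, trimmed-down stand-in for `ModelSettings`; whether the real class should also reset the flag, as done here, is a design choice rather than something established in this issue.

```python
import warnings
from dataclasses import dataclass


@dataclass
class ModelSettingsSketch:
    # Hypothetical stand-in carrying only the two fields involved.
    dflash_enabled: bool = False
    specprefill_enabled: bool = False

    def __post_init__(self) -> None:
        if self.dflash_enabled and self.specprefill_enabled:
            warnings.warn(
                "specprefill_enabled has no effect when dflash_enabled is "
                "True; disabling specprefill for this model",
                UserWarning,
            )
            # Make the stored config reflect what will actually run.
            self.specprefill_enabled = False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    settings = ModelSettingsSketch(dflash_enabled=True, specprefill_enabled=True)
```

After this runs, `settings.specprefill_enabled` is `False` and `caught` holds the `UserWarning`, so the conflict is visible in logs without breaking startup.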

### 2. Medium-term: approximate specprefill via token pre-filtering

Before calling `stream_dflash_generate`, run `score_tokens()` / `select_chunks()` from `omlx/patches/specprefill.py` to identify important tokens, then pass only those as `prompt_tokens_override`. This reduces dflash's prefill work and should produce a real pp-tps boost. The caveat: filtered tokens arrive at sequential positions (0, 1, 2, …) rather than their original scattered positions, with no RoPE correction. This is prompt-compression semantics, not true specprefill. Quality impact is moderate at high prune rates.
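A rough sketch of the pre-filtering step. It deliberately does not call the real `score_tokens()` / `select_chunks()`, since their signatures are not shown in this issue; `prefilter_prompt`, `keep_ratio` and `chunk_size` are hypothetical names, and the per-token scores are supplied directly.

```python
import math
from typing import List, Sequence, Tuple


def prefilter_prompt(
    tokens: Sequence[int],
    scores: Sequence[float],
    keep_ratio: float = 0.5,
    chunk_size: int = 4,
) -> Tuple[List[int], List[int]]:
    """Keep the highest-scoring chunks, preserving original token order.

    Returns (kept_tokens, kept_positions).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")

    # Split the prompt into contiguous chunks of index ranges.
    chunks = [
        range(start, min(start + chunk_size, len(tokens)))
        for start in range(0, len(tokens), chunk_size)
    ]
    n_keep = max(1, math.ceil(len(chunks) * keep_ratio))

    # Rank chunks by mean score, then restore prompt order for the winners.
    ranked = sorted(
        chunks, key=lambda c: sum(scores[i] for i in c) / len(c), reverse=True
    )
    kept = sorted(ranked[:n_keep], key=lambda c: c.start)

    positions = [i for chunk in kept for i in chunk]
    return [tokens[i] for i in positions], positions


tokens = list(range(100, 112))                 # 12 dummy token ids
scores = [0.1] * 4 + [0.9] * 4 + [0.5] * 4     # middle chunk scores highest
kept_tokens, kept_positions = prefilter_prompt(tokens, scores)
```

In this example the first chunk is dropped and `kept_tokens` is what would be handed over as `prompt_tokens_override`. `kept_positions` is computed but has nowhere to go under this option, which is exactly the limitation described above.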

### 3. Longer-term: add sparse-prefill support to dflash-mlx

Extend `stream_dflash_generate` with a `prompt_token_positions: list[int] | None` parameter so omlx can pass a sparse selection with correct positional encodings. This would enable full specprefill quality inside the dflash pipeline.
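One possible shape for the input contract of such a parameter, as a sketch only. `resolve_prompt_positions` is a hypothetical helper and not part of dflash-mlx; the rules it enforces (same length as the token list, non-negative, strictly increasing) are assumptions about what a sparse selection would need to satisfy.

```python
from typing import List, Optional, Sequence


def resolve_prompt_positions(
    prompt_tokens: Sequence[int],
    prompt_token_positions: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return the position of each prompt token, validating sparse input."""
    if prompt_token_positions is None:
        # Dense prompt: fall back to the usual 0..n-1 positions.
        return list(range(len(prompt_tokens)))

    positions = list(prompt_token_positions)
    if len(positions) != len(prompt_tokens):
        raise ValueError("one position is required per prompt token")
    if any(p < 0 for p in positions):
        raise ValueError("positions must be non-negative")
    if any(b <= a for a, b in zip(positions, positions[1:])):
        raise ValueError("positions must be strictly increasing")
    return positions


dense = resolve_prompt_positions([11, 12, 13])
sparse = resolve_prompt_positions([11, 12, 13], [0, 57, 300])
```

Defaulting to `None` keeps existing callers unchanged, and the resolved positions are what the RoPE step inside dflash would consume instead of an implicit running index.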

## Suggested immediate fix

At minimum, Option 1 should be merged quickly so users aren't confused by the silent no-op. Options 2 and 3 are follow-on improvements.
