
Ds4 engine embed #1850

Open
apetersson wants to merge 93 commits into jundot:main from apetersson:ds4-engine-embed

Conversation

@apetersson

@apetersson apetersson commented Jun 12, 2026

Contributor

Adds support for serving DS4 GGUF models through omlx by managing ds4-server as an external subprocess, rather than reimplementing the engine in MLX.

How it works

  • Discovery: GGUF files are discovered alongside MLX models, validated (magic/metadata, corrupt files rejected), and exposed under normalized model ids plus thinking-mode aliases (-think, -think-max) and -chat for no-thinking mode.
  • Lifecycle: one ds4-server process at a time (DS4 is a single-process backend), bound to localhost only. The engine pool handles admission with subprocess RSS accounting, auto-enables DS4 SSD streaming under memory pressure, restarts crashed pinned models with backoff, and integrates with TTL/eviction.
  • Proxying: byte-preserving passthrough for /v1/chat/completions, /v1/completions, /v1/responses, and /v1/messages (Anthropic), including low-latency SSE streaming.
  • Context: per-model context windows; Think-Max requests trigger a transparent restart with a larger context when needed.
  • KV disk cache: per-model cache directories managed for the server (orphaned temp-file cleanup, graceful stop so shutdown persist can finish), with hit/store/eviction metrics parsed from server logs into the admin UI.
  • Admin: global + per-model DS4 settings, model downloader catalog, live prefill/decode progress, localized UI strings.
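The discovery bullet mentions rejecting corrupt files by checking the GGUF magic. A minimal sketch of such a check follows, assuming a little-endian header; the function name `looks_like_gguf` is hypothetical and not taken from this PR, and the real parser also validates metadata, which this sketch does not.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path: Path) -> bool:
    """Cheap header check: GGUF magic followed by a non-zero uint32 version.

    Hypothetical helper for illustration only; it does not parse metadata.
    """
    try:
        with path.open("rb") as fh:
            header = fh.read(8)
    except OSError:
        return False  # unreadable or missing files are treated as invalid
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return False
    (version,) = struct.unpack("<I", header[4:8])  # assumes little-endian
    return version >= 1
```

A truncated download fails this check immediately, before any subprocess is started.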

Notes

  • DS4 support files for darwin-arm64 are vendored; the backend is disabled by default on unsupported platforms.
  • The branch is merged with current main; the full test suite passes (5842 passed, 19 skipped).
  • Includes fixes from two self-review passes (tracked as issues #2–#21 on the fork), covering lifecycle races, GGUF parser hardening, SSE proxy throughput, and KV cache management.

apetersson added 30 commits June 6, 2026 11:05
Register visible GGUF files as DS4 engine entries during model discovery.

Normalize DS4 ids, preserve display casing, and avoid MLX collisions with :ds4.

Guard DS4 loads until the process backend lands; cover top-level, nested, and mixed cases.
Add global DS4 configuration for support files, KV cache, traces, context, SSD streaming, and power.

Persist settings, create default support/KV/trace directories, and allow hidden env overrides.

Validate DS4 ranges and cover adaptive RAM-based context defaults.
Add DS4 support-file inspection for the managed backend before process launch.

Require macOS Apple Silicon, executable ds4-server, LICENSE/README, and Metal sources.

Provide deterministic copy helper for bundled support files without fetch/build fallback.
Introduce managed ds4-server launch config with localhost-only port allocation and DS4 flag construction.

Validate support files before spawn, create per-model KV/debug dirs, capture stdout/stderr, and poll /v1/models readiness.

Cover lifecycle behavior with fake subprocess tests without adding protocol proxying yet.

Refs #1
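The launch commit above describes localhost-only port allocation and polling /v1/models for readiness. A rough sketch of that pattern using only the standard library is shown here; both function names are hypothetical, and the PR's actual implementation may differ.

```python
import http.client
import socket
import time
import urllib.request


def allocate_localhost_port() -> int:
    """Ask the OS for a free TCP port on 127.0.0.1 (hypothetical helper).

    The port is released before the child binds it, so a small race window
    remains; a real launcher would retry on bind failure.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def wait_until_ready(port: int, timeout_s: float = 30.0, interval_s: float = 0.25) -> bool:
    """Poll http://127.0.0.1:<port>/v1/models until it answers 200 or time runs out."""
    url = f"http://127.0.0.1:{port}/v1/models"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # server not accepting connections yet
        time.sleep(interval_s)
    return False
```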
Generate DS4 chat, reasoner, and think-max aliases for each visible DS4 GGUF entry in /v1/models.

Resolve suffix aliases through EnginePool, including user-defined model_alias bases, while avoiding global deepseek aliases.

Add alias helper metadata for future DS4 forwarding behavior.

Refs #1
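As an illustration of the suffix aliases described above, the following sketch splits a requested model id into a base id and a mode. The suffixes come from the PR description; the function name, the mode labels, and the example model id are assumptions made for this example.

```python
from typing import Optional, Set, Tuple

# Suffixes named in the PR description; the mode labels are illustrative.
SUFFIX_MODES = {
    "-think-max": "think-max",
    "-think": "think",
    "-chat": "chat",
}


def split_alias(requested: str, known_bases: Set[str]) -> Optional[Tuple[str, Optional[str]]]:
    """Return (base_id, mode) for a requested id, or None if it is unknown.

    An exact match on a known base wins, so a base id that happens to end in
    one of the suffixes is never misread as an alias.
    """
    if requested in known_bases:
        return requested, None
    for suffix, mode in SUFFIX_MODES.items():
        if requested.endswith(suffix):
            base = requested[: -len(suffix)]
            if base in known_bases:
                return base, mode
    return None
```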
Add a DS4ProcessEngine wrapper around the managed ds4-server subprocess so DS4 GGUF entries can lazy-load and unload through EnginePool.

Wire global DS4 settings into pool construction, skip MLX teardown for external processes, and expose DS4 PID/port/RSS/log status.

Cover manual load/unload, pinned preload, TTL unload, disabled settings, and deferred protocol methods with fake-process tests.

Refs #1
Forward DS4-backed OpenAI chat completion requests through the managed localhost backend while preserving DS4 response bytes for streaming and non-streaming responses.

Apply OMLX sampling defaults before forwarding and honor DS4 chat, reasoner, and think-max aliases without exposing DS4 directly.

Track active proxied requests so DS4 subprocesses are not evicted mid-stream.

Refs #1
Forward DS4-backed OpenAI text completion requests through the managed localhost backend while preserving DS4 response bytes for streaming and non-streaming responses.

Apply OMLX sampling defaults before forwarding and reuse DS4 suffix alias request mutation for completions.

Refs #1
Forward DS4-backed OpenAI Responses API requests through the managed localhost backend while preserving DS4 response bytes for streaming and non-streaming responses.

Apply Responses-shaped sampling defaults and DS4 suffix aliases before forwarding without entering MLX validation paths.

Refs #1
Forward DS4-backed Anthropic Messages API requests through the managed localhost backend while preserving DS4 response bytes for streaming and non-streaming responses.

Apply Anthropic-shaped sampling defaults and DS4 suffix aliases before forwarding without entering MLX conversion or validation paths.

Refs #1
Parse usage from non-streaming DS4 proxy responses without changing returned bytes and record request counts, token totals, cached tokens, and proxy duration in server metrics.

Keep streaming tee metrics and DS4 phase timing out of scope for a later slice.

Refs #1
Parse usage from streamed DS4 SSE payloads while yielding the original bytes unchanged and record request counts, token totals, cached tokens, TTFT, and generation duration.

Merge usage observed across OpenAI, Responses, and Anthropic stream shapes without adding DS4 phase timing or UI changes.

Refs #1
Restart managed DS4 backends with a temporary 393216-token context when per-model think-max aliases are requested and the current context is too small.

Preserve saved DS4 settings, keep high context until unload, and reject restarts while another DS4 request is active.

Refs #1
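The restart rule in the commit above can be read as a three-way decision. The sketch below is one possible reading; the 393216 figure comes from the commit message, while the function name and return labels are hypothetical.

```python
THINK_MAX_CONTEXT_TOKENS = 393_216  # value taken from the commit message


def plan_think_max_request(current_context_tokens: int, active_requests: int) -> str:
    """Decide how to handle a think-max request (illustrative only).

    Returns "forward" when the running backend already has enough context,
    "reject" when a restart is needed but another request is in flight,
    and "restart" when the backend is idle and can be relaunched.
    """
    if current_context_tokens >= THINK_MAX_CONTEXT_TOKENS:
        return "forward"
    if active_requests > 0:
        return "reject"
    return "restart"
```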
Restart idle DS4 backends before the next proxied request when the managed subprocess has exited.

Keep pinned DS4 models available by restarting crashed pinned processes from the pool health path, while leaving unpinned crashes stopped until demand.

Refs #1
Enforce the global DS4 disk KV budget before launching managed backends by deleting oldest recursive *.kv files under the shared KV root.

Keep pruning limited to DS4 KV files and record a prune summary for diagnostics.

Refs #1
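A minimal sketch of the pruning step described above, assuming "oldest" means oldest modification time (the commit does not say which timestamp is used). The function name is hypothetical.

```python
from pathlib import Path
from typing import List


def prune_oldest_kv_files(kv_root: Path, budget_bytes: int) -> List[Path]:
    """Delete the oldest *.kv files under kv_root until the total fits the budget.

    Only files matching *.kv are considered, searched recursively.
    Returns the list of removed paths as a simple prune summary.
    """
    entries = []
    for path in kv_root.rglob("*.kv"):
        if path.is_file():
            st = path.stat()
            entries.append((st.st_mtime, st.st_size, path))
    entries.sort(key=lambda e: e[0])  # oldest first (assumption: by mtime)

    total = sum(size for _, size, _ in entries)
    removed: List[Path] = []
    for _, size, path in entries:
        if total <= budget_bytes:
            break
        path.unlink()
        total -= size
        removed.append(path)
    return removed
```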
Store managed DS4 stdout and stderr in each model debug directory while preserving in-memory recent logs for status diagnostics.

Expose the persistent DS4 log path through engine status so admin/status surfaces can point users at crash and launch logs.

Refs #1
Persist a DS4-only per-model context override and pass it into managed DS4 launches instead of mutating global DS4 settings.

Restart loaded idle DS4 backends when the override changes, reject active restarts, and cap overrides at the DS4 context maximum.

Refs #1
Include DS4 lifecycle, context, log path, formatted RSS, and recent log lines in the admin model-list API so the UI can surface managed backend state.

Refs #1
Replace the HF downloader's MLX-only toggle with MLX/DS4-GGUF OR filters and catalog DeepSeek V4 GGUF candidates from base-model searches with unverified compatibility metadata.

Refs #1
Add DS4-only admin model settings for per-model launch context and show live backend status details in the settings modal.

Refs #1
Expose DS4 backend status and launch settings in the admin global settings API/UI. Persist enabled, support path, context, KV, SSD streaming, and power controls for future managed DS4 launches. Refs #1
Wire macOS app builds to stage prebuilt DS4 support files and sign embedded DS4 binaries.

Seed the default user DS4 support directory from bundled resources at startup while preserving explicit custom dirs.

Refs #1
Parse DS4 progress log lines into active model activity snapshots so admin status can show live prefill/decode phase and token-rate details.

Extend activity metadata formatting for current/total token progress and average/chunk token throughput.

Refs #1
Add a release helper that builds ds4-server from a local ds4-apetersson checkout and stages a validated packaging/DS4Support tree for the macOS bundle.

Document the prebundle workflow and ignore generated DS4Support binaries.

Refs #1
Enable DS4 auto SSD streaming when GGUF size exceeds the current memory budget after normal eviction, while preserving explicit on/off overrides and avoiding explicit expert-cache budget flags.

Refs #1
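The auto rule in the commit above can be sketched as a small pure function. The three modes (auto/on/off) appear later in this thread; the function name and the idea of passing the post-eviction budget as a plain integer are assumptions.

```python
def resolve_ssd_streaming(mode: str, gguf_bytes: int, memory_budget_bytes: int) -> bool:
    """Return True when SSD streaming should be enabled for a launch.

    Explicit "on"/"off" always win. In "auto" mode, streaming is enabled only
    when the GGUF file is larger than the memory budget that remains after
    normal eviction (the caller is assumed to have computed that budget).
    """
    if mode == "on":
        return True
    if mode == "off":
        return False
    if mode != "auto":
        raise ValueError(f"unknown ssd_streaming mode: {mode!r}")
    return gguf_bytes > memory_budget_bytes
```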
Probe staged ds4-server binaries for required launch flags so stale prebuilt support trees fail during validation instead of at first managed launch.

Refs #1
Resolve discovered DS4 GGUF entries from original filenames or paths so source-name selections reach the normalized canonical model id.

Include DS4 GGUF file artifacts in the Models Manager local list with source-cased display names and safe file deletion support.

Refs #1
Read DS4 streaming responses with low-latency SSE flushing so clients receive events as they are generated instead of waiting on urllib3 chunk buffering.

Forward no-buffering SSE headers while preserving DS4 response bytes and streaming metrics tee behavior.

Refs #1
Raise the DS4 streaming fallback byte cap because SSE event boundaries are the primary flush mechanism; the cap only guards unusually long or unterminated events.

Refs #1
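The two streaming commits above describe flushing on SSE event boundaries with a byte cap as a fallback. One way to sketch that, assuming events end with a blank line written as "\n\n" (events terminated with "\r\n\r\n" would only be flushed by the cap or at end of stream in this simplified version), is shown below. The function name and default cap are hypothetical.

```python
from typing import Iterable, Iterator


def iter_sse_events(chunks: Iterable[bytes], max_buffer: int = 1 << 20) -> Iterator[bytes]:
    """Re-chunk a byte stream so each yielded piece ends at an SSE event boundary.

    Bytes are never modified: concatenating the output reproduces the input.
    The cap only guards against very long or unterminated events.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            idx = buf.find(b"\n\n")
            if idx == -1:
                break
            yield buf[: idx + 2]  # one complete event, terminator included
            buf = buf[idx + 2 :]
        if len(buf) >= max_buffer:
            yield buf  # fallback flush for an oversized partial event
            buf = b""
    if buf:
        yield buf  # trailing bytes at end of stream
```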
Hide unsupported MLX-only controls for DS4 models and clear them from DS4 save payloads.

Stop injecting DS4-unsupported sampling parameters into proxied OpenAI chat/completion bodies.

Refs #1
Resolve conflicts in server.py, admin routes/dashboard, and engine_pool
by adopting upstream's diffusion and mid-system cache refactors and
layering DS4 behavior on top (settings-state helper, admission memory,
profile apply). Read engine_type defensively for upstream's test fakes.
@apetersson

Contributor Author

Since this PR is more complex than usual, it would make sense to have it independently validated before merging. I personally developed and verified it on an M1 Ultra with 128GB RAM.

The true minimum hardware spec is likely 48GB or 32GB of RAM. It runs smoothly with 96GB or more without SSD streaming.

I would also love to see feedback on DeepSeek-V4-Pro from the 256GB and 512GB folks.

@apetersson apetersson marked this pull request as ready for review June 13, 2026 21:24
@apetersson

Contributor Author

I would love to receive critical reviews. Don't hold back, I can take it :) Any wishes for improvements?

I have been running this for about a week with daily hours-long workloads in a multi-model oMLX install (LLM, TTS, STT, embeddings), and it seems quite robust so far.

Any thoughts on the way the DS4 binary is vendored in?

@Kistaro

Kistaro commented Jun 14, 2026

Contributor

Have not looked at the code, but my first take: You're not fitting DeepSeek 4 Flash onto a Mac with 32GB of RAM! DeepSeek 4 isn't in a league ordinarily thought of as suitable for use on any "normal" user machine. The specialized mostly-2-bit quant used by DS4 is over eighty gigabytes; a 96GB Mac would have space for minimal context and barely any for macOS itself, but could technically use it; a 128GB Mac can make reasonable use of it as long as it is going out of its way not to do anything else at the time. See https://huggingface.co/antirez/deepseek-v4-gguf for notes on the actual quants -- and their sizes. DeepSeek 4 Pro is not fitting on my Mac any time soon! Maybe if I'd decided to pull the trigger when the 512 GB M3 Ultra Mac Studio was still available, if I was okay with about two tokens per second...

Because even the very extreme quant of DeepSeek 4 Flash is very demanding, I think trying to use oMLX as a chain loader for it is not going to be useful for most users. With my 128GB M5 Max MBP, I would not use it, because I wouldn't want the RAM overhead of having oMLX loaded at all. Like, I am specifically migrating away from VSCode because it forces my Mac to swap (and to kick out DeepSeek between turns, taking several seconds to reload it from SSD on each reply) if I try to use it at the same time as DS4. I have 128 GB of RAM and it's not enough for me to just do "business as usual" while using DS4 -- I don't want a single extra Python interpreter, if I can avoid it.

The ds4 command line is not super convenient, but writing shell scripts to make it more convenient is totally fine. (I have a shell script that sets up a split-screen thing with tmux, running DS4 and then Pi Coding Agent configured to use it, showing DS4's server only in a tiny pane so I can see its status messages but Pi gets most of the space.) DeepSeek 4 Flash (antirez quant) is plenty capable to write those scripts for you! That's how I got the ones I wanted...

@apetersson

Contributor Author

You're not fitting DeepSeek 4 Flash onto a Mac with 32GB of RAM!

True, this is meant to run on a machine with 128GB or more. It is my main local agent at 15-22 tokens/sec on an M1 Ultra with 128GB. If you have a 96GB Mac Studio it could run too, but you had better close all other programs first.

I am running it on a very busy workstation with many other programs open, all of them staying very responsive, though token generation is not ideal. I am running multiple Docker containers, Codex with computer use, IntelliJ with lots of RAM and big projects indexed, Android Studio, two backends, TTS/STT servers, and a couple dozen Chrome tabs. Under that RAM pressure I still get 13-15 tokens/sec.

What is surprisingly nice about DS4 is how aggressively it frees RAM after each request, so whenever token generation is paused, the other programs can still use that RAM (see screenshot). If you have a 128GB M5 MBP you will get even better results than me, because DS4 has more optimisations for the M5. Give it a try.

I gave DS4-pro a short try and do not recommend it right now, except maybe on a 512GB machine. With SSD streaming auto-enabled I got an unusable 0.3 tokens/sec on the worst quant I could get.

image

@fry69

fry69 commented Jun 15, 2026

Contributor

This looks very interesting to me. I am a heavy ds4-server user.

But I run modified versions of ds4-server to fix minor problems and add support for structured outputs. Specifically I run a version of ds4-server (continuously rebased against official main) that contains these PRs:

Is it possible to run such a modified ds4-server with this PR or can you think of a way how to include custom patches to ds4-server with this PR?

@apetersson

apetersson commented Jun 15, 2026

Contributor Author

Is it possible to run such a modified ds4-server with this PR or can you think of a way how to include custom patches to ds4-server with this PR?

This PR uses the binary at ~/.omlx/support/ds4/ds4-server, which it obtains from the vendored version. But you can override it with no code modification.

OMLX has a hidden/advanced override for exactly this use case. In DS4Settings:

binary_path: str | None = None  # Hidden/advanced override for ds4-server

You point OMLX at your custom binary in two ways:

  1. Environment variable (easiest):

export OMLX_DS4_BINARY_PATH=/path/to/your/custom/ds4-server

  2. Settings file (~/.omlx/settings.json):
  {
    "ds4": {
      "binary_path": "/path/to/your/custom/ds4-server"
    }
  }

When binary_path is set, OMLX skips its vendored binary entirely. No copies, no validation of bundled files. It does still probe --help to check for required CLI flags (--ssd-streaming), so your custom binary needs that flag present.
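The flag probe described above could look roughly like the following. Only `--help` and `--ssd-streaming` are taken from this comment; the function names and the plain substring match are assumptions, and the real check may be stricter.

```python
import subprocess
from typing import Iterable, List

REQUIRED_FLAGS = ("--ssd-streaming",)  # the one flag named in this comment


def missing_flags(help_text: str, required: Iterable[str] = REQUIRED_FLAGS) -> List[str]:
    """Return the required flags that do not appear in the --help output."""
    return [flag for flag in required if flag not in help_text]


def probe_binary(binary_path: str, timeout_s: float = 10.0) -> List[str]:
    """Run `<binary> --help` and report which required flags are missing.

    stdout and stderr are both searched because CLIs differ in where they
    print usage text.
    """
    proc = subprocess.run(
        [binary_path, "--help"],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    return missing_flags(proc.stdout + proc.stderr)
```

An empty list means the custom binary passes the probe; a non-empty list names the flags that a stale or patched build is missing.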

The other settings (kv_root, context_default_tokens, ssd_streaming, power, etc.) all apply normally since they're passed as CLI arguments to whatever binary you point it at.

I have not exposed the option in the GUI yet, but since DS4 is progressing so quickly and apparently @fry69 and others are running custom versions of it, it makes sense to expose it there too. Your patches also look very useful; I will have a look, as structured outputs are a frequent use case of mine.

@fry69

fry69 commented Jun 15, 2026

Contributor

The ds4-server binary is very picky about finding its runtime files (*.metal): it assumes they are in the current working directory. I guess you are packaging those? If not, how do you work around this problem?

Update:

I also start ds4-server with custom arguments such as:

./ds4-server --power 90 --ctx 500000 --kv-disk-dir ./tmp/ds4-kv --kv-disk-space-mb 131072 --port 28000

How do those get translated to the oMLX world?
The port obviously gets handled by oMLX, but how does the KV cache get handled?
Also, how do I set the power option to reduce the maximum power DS4 uses (a bit less strain on the hardware)?

Ah, I see:

The other settings (kv_root, context_default_tokens, ssd_streaming, power, etc.) all apply normally since they're passed as CLI arguments to whatever binary you point it at.

So there are new options inside oMLX for this? Great.
I am convinced, I'll try it out now.

Update2:

I can confirm that generating tokens via oMLX API -> this PR -> ds4-server works:

2026-06-15 12:11:02,718 - omlx.process_memory_enforcer - WARNING - Metal cap (122.0GB, kernel iogpu.wired_limit_mb) is below the oMLX static ceiling (126.0GB); Metal will clamp allocations to the cap and panic if a request exceeds it. Raise it with: sudo sysctl iogpu.wired_limit_mb=129024
2026-06-15 12:11:02,723 - omlx.process_memory_enforcer - INFO - Metal wired limit raised: 0.0GB -> 122.0GB (target=126.0GB, iogpu sysctl cap=122.0GB)
2026-06-15 12:11:02,723 - omlx.process_memory_enforcer - INFO - Process memory enforcer started (tier=custom, ceiling=98.0GB, interval=1.0s)
INFO:     Application startup complete.
2026-06-15 12:12:17,915 - omlx.engine_pool - INFO - Loading model: ds4f-deepseek-v4-flash-layers37
2026-06-15 12:12:17,945 - omlx.ds4_process - INFO - [DS4-1] Loading model: ds4f-deepseek-v4-flash-layers37
2026-06-15 12:12:17,980 - omlx.ds4_process - INFO - [DS4-1] stderr: ds4: Metal device Apple M5 Max, 128.00 GiB RAM
2026-06-15 12:12:17,986 - omlx.ds4_process - INFO - [DS4-1] stderr: ds4: Metal 4 tensor API enabled for Tensor kernels
2026-06-15 12:12:17,986 - omlx.ds4_process - INFO - [DS4-1] stderr: ds4: drift-patch flags hc_stable=on norm_unify=on kv_raw_f32=off rope_exp2_log2=off math_safe=off tensor_matmul=on
2026-06-15 12:12:19,633 - omlx.ds4_process - INFO - [DS4-1] stderr: ds4: Metal model views created in 3.086 ms, residency requested in 1635.572 ms, warmup 3.344 ms (mapped 93065.67 MiB from offset 5.09 MiB)
2026-06-15 12:12:19,633 - omlx.ds4_process - INFO - [DS4-1] stderr: ds4: Metal mapped mmaped model as 2 overlapping shared buffers
2026-06-15 12:12:19,633 - omlx.ds4_process - INFO - [DS4-1] stderr: ds4: metal backend initialized for graph diagnostics
2026-06-15 12:12:19,633 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:19 ds4-server: context buffers 8195.42 MiB (ctx=500000, backend=metal, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=125002)
2026-06-15 12:12:19,635 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:19 ds4-server: KV disk cache /Users/fry/GitHub/antirez/ds4/tmp/ds4-kv/ds4f-deepseek-v4-flash-layers37 (budget=131072 MiB, cross-quant=accept, min=512, cold_max=30000, continued=2048, trim=32, align=2048, hit_half_life=21600s)
2026-06-15 12:12:19,635 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:19 ds4-server: listening on http://127.0.0.1:55848
2026-06-15 12:12:19,678 - omlx.process_memory_enforcer - WARNING - ProcessMemoryEnforcer: could not resolve scheduler for engine type DS4ProcessEngine — prefill memory guard will not propagate to this engine. Verify the wrapper chain (engine._engine.engine.scheduler) still holds.
2026-06-15 12:12:19,678 - omlx.engine_pool - INFO - Loaded model: ds4f-deepseek-v4-flash-layers37 (actual: 56.81MB, estimated: 95.43GB, total: 95.43GB)
2026-06-15 12:12:19,680 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:19 ds4-server: chat ctx=0..6:6 prompt start
2026-06-15 12:12:20,033 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:20 ds4-server: chat ctx=0..6:6 prompt done 0.353s
2026-06-15 12:12:21,580 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:21 ds4-server: chat ctx=6..56:50 gen=50 THINKING decoding chunk=32.33 t/s avg=32.33 t/s 1.547s
2026-06-15 12:12:23,110 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:23 ds4-server: chat ctx=56..106:50 gen=100 THINKING decoding chunk=32.67 t/s avg=32.50 t/s 3.077s
2026-06-15 12:12:23,819 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:23 ds4-server: chat ctx=106..129:23 gen=123 decoding chunk=32.46 t/s avg=32.49 t/s 3.786s
2026-06-15 12:12:23,819 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:23 ds4-server: thinking live checkpoint remembered ctx=0..6:6 live=129 visible=131
2026-06-15 12:12:23,819 - omlx.ds4_process - INFO - [DS4-1] stderr: 0615 12:12:23 ds4-server: chat ctx=0..6:6 gen=123 finish=stop 4.139s
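The decode progress lines in the log above have a regular shape, which is what makes the live prefill/decode display possible. A sketch of parsing them with a regular expression follows; the pattern is derived only from the lines shown here, so other ds4-server versions may print something different, and the function name is hypothetical.

```python
import re
from typing import Optional

# Pattern inferred from the sample log lines in this comment.
DECODE_RE = re.compile(
    r"chat ctx=(?P<start>\d+)\.\.(?P<end>\d+):(?P<count>\d+) "
    r"gen=(?P<gen>\d+) (?P<thinking>THINKING )?decoding "
    r"chunk=(?P<chunk>[\d.]+) t/s avg=(?P<avg>[\d.]+) t/s"
)


def parse_decode_progress(line: str) -> Optional[dict]:
    """Extract token positions and throughput from a decode progress line."""
    match = DECODE_RE.search(line)
    if match is None:
        return None  # not a decode progress line (e.g. prompt start/done)
    return {
        "ctx_start": int(match["start"]),
        "ctx_end": int(match["end"]),
        "generated": int(match["gen"]),
        "thinking": match["thinking"] is not None,
        "chunk_tps": float(match["chunk"]),
        "avg_tps": float(match["avg"]),
    }
```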

I noticed one small detail: This PR likes to create a separate KV caching folder for each model inside the specified ds4-kv folder via settings. This makes it impossible for omlx-ds4-server to discover/reuse already existing caches I generated earlier with plain ds4-server. Not sure if that is a showstopper.

It does make it harder to switch between plain ds4-server and omlx-ds4-server seamlessly.

@apetersson

apetersson commented Jun 15, 2026

Contributor Author

This PR likes to create a separate KV caching folder for each model inside the specified ds4-kv folder via settings.

AFAIK, this is absolutely necessary to avoid risking output corruption (I tried different quants/abliterations). I only investigated this for a couple of minutes, so maybe you know more about whether the separation is truly needed.

  1. Metal Runtime Files

Already handled. In build_command():

"--chdir", str(self.support_dir)  # ~/.omlx/support/ds4/

The ensure_ds4_support() flow copies the vendored binary + metal shaders into ~/.omlx/support/ds4/, and ds4-server's CWD is set there via --chdir. No manual work needed.

  2. Your Custom Arguments → OMLX Settings

| Your ds4-server arg | OMLX equivalent | Web UI? |
| --- | --- | --- |
| --power 90 | ds4.power | Yes: slider + number input (1–100) |
| --ctx 500000 | ds4.context_default_tokens | Yes: number input + "Auto" button to reset to adaptive |
| --kv-disk-dir ./tmp/ds4-kv | Auto-managed at ~/.omlx/ds4-kv//. Override root via ds4.kv_root | Yes: text input for kv_root |
| --kv-disk-space-mb 131072 | ds4.kv_disk_space_mb | Yes: number input |
| --kv-cache-continued-interval-tokens 2048 | ds4.kv_cache_continued_interval_tokens | Yes: number input |
| --port 28000 | Not configurable: always random, managed internally | No |
| --ssd-streaming | ds4.ssd_streaming (auto/on/off) | Yes: dropdown |
| --kv-cache-reject-different-quant | ds4.kv_cache_reject_different_quant | No: settings.json or OMLX_DS4_KV_CACHE_REJECT_DIFFERENT_QUANT=1 env only |
| --trace | ds4.trace_enabled | No: env only (OMLX_DS4_TRACE_ENABLED=1) |
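To make the mapping above concrete, here is a sketch of how settings might be turned into a ds4-server argument list. The flag names are the ones from the command line quoted earlier in this thread; the function name, the dict-based settings shape, and the per-model directory layout (kv_root joined with the model id) are assumptions for illustration, not the PR's actual build_command().

```python
from pathlib import Path
from typing import List


def build_ds4_args(settings: dict, model_id: str, port: int) -> List[str]:
    """Translate a ds4 settings dict into CLI arguments (illustrative only).

    The KV directory is derived per model, mirroring the per-model cache
    folders discussed in this thread.
    """
    kv_dir = Path(settings["kv_root"]) / model_id
    return [
        "--power", str(settings["power"]),
        "--ctx", str(settings["context_default_tokens"]),
        "--kv-disk-dir", str(kv_dir),
        "--kv-disk-space-mb", str(settings["kv_disk_space_mb"]),
        "--port", str(port),  # chosen by the launcher, not user-configurable
    ]
```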

What the web UI controls (Dashboard → Settings → DS4 section):

  • Power (slider 1–100)
  • KV disk space (in MB)
  • SSD streaming mode (auto/on/off)
  • Context default tokens is not exposed in the web UI — that's settings.json or env only.

Quick setup via settings.json (~/.omlx/settings.json):

  {
    "ds4": {
      "power": 90,
      "context_default_tokens": 500000,
      "kv_disk_space_mb": 131072
    }
  }

Or as env vars:

OMLX_DS4_POWER=90 OMLX_DS4_CONTEXT_DEFAULT_TOKENS=500000 OMLX_DS4_KV_DISK_SPACE_MB=131072 uv run omlx serve
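A sketch of how the environment variables above might be layered over the settings.json values. The three variable names come from this comment; the function name and the assumption that environment values take precedence over the file are mine, since the thread does not state the precedence.

```python
import os
from typing import Mapping

# Environment variable -> ds4 settings field, as named in this comment.
ENV_INT_FIELDS = {
    "OMLX_DS4_POWER": "power",
    "OMLX_DS4_CONTEXT_DEFAULT_TOKENS": "context_default_tokens",
    "OMLX_DS4_KV_DISK_SPACE_MB": "kv_disk_space_mb",
}


def apply_env_overrides(ds4_settings: dict, env: Mapping[str, str] = os.environ) -> dict:
    """Return a copy of the settings with integer env overrides applied.

    Assumption: env vars win over settings.json. Unset or empty variables
    leave the file value untouched; non-integer values raise ValueError.
    """
    merged = dict(ds4_settings)
    for var, field in ENV_INT_FIELDS.items():
        raw = env.get(var)
        if raw is not None and raw.strip():
            merged[field] = int(raw)
    return merged
```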

This is the new section in settings:

image

@fry69

fry69 commented Jun 15, 2026

Contributor

I have one strong recommendation for this PR:

Do not include the ds4-server binary in this PR verbatim.
Build ds4-server from source, on the fly.
Pin the vendor repository to a commit for this build.
This already gets done for other vendor/build repos in this project, as far as I can tell.
I'd recommend to follow this path.

This will also remove the currently vendored runtime *.metal files from this PR; they can be copied from the (shallow) cloned ds4 vendor repo.
Also, this might offer a path to include custom PRs/patches when building the ds4-server binary, or to simply specify a custom repo path/URL with a custom ds4 fork at build time.

@apetersson

Contributor Author

Build ds4-server from source, on the fly.

While it does build easily on my machine, I am not 100% sure what the build-time requirements are. Would the user then need Xcode at runtime? A user without Xcode CLT / make / Apple clang would fail to build ds4-server.

The current approach avoids this but still allows a custom ds4-server override, which we should include in the UI settings.

What do you think of building ds4-server from a pinned DS4 source commit during release/CI packaging instead of on the end user's machine? That would shift the burden towards the maintainers and the CI environment, which would then absorb the DS4 build complexity. Also, a simple clone -> uv run omlx serve would no longer work.

@fry69

fry69 commented Jun 15, 2026

Contributor

I can only speak for myself, but I am not comfortable with a PR that includes an executable binary.

That is too opaque for me.

Remove committed DS4 runtime payloads from package data and source tree.

Add pinned-source build/install plumbing for app, CLI, Homebrew, and CI.
Keep existing build-script stdout phrases while using the shared pinned-source helper.
Move pinned-source DS4 builds out of serve startup and into launch validation.

Add auto-build controls, failure caching, and AC #7 docs for issue #22.
Persist DS4 binary path, auto-build, source repo, and source revision through global admin settings. Wire those values into source builds and document AC #7 for issue #22. Localize DS4 admin strings in non-English locale files.
Keep the displayed default support dir implicit so managed installs can still auto-build. Retry failed auto-builds when source overrides change. Require pinned local sources to be verifiable git checkouts.
Run first-load DS4 provisioning off the asyncio loop. Let custom source repos build their default ref while recording the resolved commit. Copy and validate the actual Metal kernel set and ignore generated vendor runtime files.
Do not fall through to runtime builds when bundled DS4 resources are present but broken. Prune stale destination Metal kernels on overwrite so custom source manifests describe the selected source tree.
Serialize default support-dir provisioning so concurrent DS4 loads share one build. Keep localized DS4 support-dir hints aligned with the custom source install guidance.
Build or fail DS4 support by default for app bundles so end users avoid runtime toolchains. Clear stale manifests when overwriting support files from manifest-less prebuilt trees.
Keep the pinned upstream DS4 Metal file list as test-only fixture data. Remove the stale production helper now that runtime support validation discovers Metal files dynamically.
Accept Hugging Face blob/resolve file URLs in the built-in downloader and download only the selected file with allow_patterns.

Expose direct DS4-GGUF file URLs as search/detail results so UI download actions avoid full-repo snapshots.
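A sketch of the URL handling described in the two commit paragraphs above: splitting a Hugging Face blob/resolve file URL into repo id, revision, and file path, so that only that file needs to be requested (for example by passing the path as an allow_patterns entry). The function name is hypothetical, and revisions that themselves contain slashes are not handled.

```python
from typing import Optional, Tuple
from urllib.parse import unquote, urlparse


def parse_hf_file_url(url: str) -> Optional[Tuple[str, str, str]]:
    """Split https://huggingface.co/<org>/<repo>/(blob|resolve)/<rev>/<path>.

    Returns (repo_id, revision, filename) or None when the URL does not
    point at a single file.
    """
    parsed = urlparse(url)
    if parsed.netloc != "huggingface.co":
        return None
    parts = [unquote(p) for p in parsed.path.split("/") if p]
    if len(parts) < 5 or parts[2] not in ("blob", "resolve"):
        return None
    repo_id = f"{parts[0]}/{parts[1]}"
    revision = parts[3]
    filename = "/".join(parts[4:])  # keeps nested paths inside the repo
    return repo_id, revision, filename
```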
Expose DS4 support source metadata in the Engine Versions payload, preferring the staged support manifest over the bundled pin. Render unlinked commits as short SHAs and cover the default/custom manifest cases.
@fry69

fry69 commented Jun 16, 2026

Contributor

Great 👍 , I read through the external PR:

I agree with the reasoning, I see a similar path as with xgrammar: Make it mostly optional.

As mentioned in the external PR, ds4-server currently only affects a small population of owners of very powerful hardware who are enthusiastic about running a 90GB+ model on their machine. They can expect a minor inconvenience (the only alternative is building ds4-server from source anyway).

If it is not already mentioned in the external PR and I missed it: The path for building and shipping this PR and the needed assets can then be roughly modeled like the mentioned xgrammar case:

When the macOS app gets built, it can get decided to build and include ds4-server + *.metal files automatically (👋 @jundot ).

Maybe it also makes sense to get a signal from @antirez on how he thinks his DwarfStar project should be shipped in other projects? Just to avoid misunderstandings and sudden rug-pull situations after shipping this feature. The macOS app has a much bigger user surface than e.g. Homebrew, so it might make sense to hold back until this is sorted out?

For Homebrew an additional flag can get added like --with-grammar -> --with-ds4, with ds4-server getting built on demand during install. That should be non-controversial in any case.

Move the DS4 support build job to the macOS 26 hosted runner so the pinned DS4 source compiles against an SDK with the required Metal residency and math APIs.
@apetersson

Contributor Author

Thanks @fry69 for the feedback. There are a few simplifications possible, and I am just attempting to fix the GH Action.
My reasoning for "not all users need it" is mostly focused on the "clone -> uv run omlx serve" path, where a small action needs to be taken to run it (and potentially install the build tooling). All other users won't mind a small overhead in the .dmg, which bundles the pinned build result like the other engines. It is lazily provisioned, though, so the ~/.omlx/support/ds4 dir only shows up when really needed.

Long-term, DS4 should run in-process through C bindings, like a library, for better metrics and to avoid tight coupling with the log format, but I avoided that for stability and short-term maintainability reasons.

I will try to run this as my local setup for a while and, if @jundot prefers, possibly find some more simplifications in the code. (Do you prefer a squashed PR or this history-preserving PR style?)
