Ds4 engine embed #1850
Register visible GGUF files as DS4 engine entries during model discovery. Normalize DS4 ids, preserve display casing, and avoid MLX collisions with :ds4. Guard DS4 loads until the process backend lands; cover top-level, nested, and mixed cases.
Add global DS4 configuration for support files, KV cache, traces, context, SSD streaming, and power. Persist settings, create default support/KV/trace directories, and allow hidden env overrides. Validate DS4 ranges and cover adaptive RAM-based context defaults.
Add DS4 support-file inspection for the managed backend before process launch. Require macOS Apple Silicon, executable ds4-server, LICENSE/README, and Metal sources. Provide deterministic copy helper for bundled support files without fetch/build fallback.
Introduce managed ds4-server launch config with localhost-only port allocation and DS4 flag construction. Validate support files before spawn, create per-model KV/debug dirs, capture stdout/stderr, and poll /v1/models readiness. Cover lifecycle behavior with fake subprocess tests without adding protocol proxying yet. Refs #1
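A minimal sketch of the localhost-only port allocation and `/v1/models` readiness polling described in this commit. The function names, timeouts, and polling interval are illustrative assumptions, not the PR's actual code.

```python
import socket
import time
import urllib.error
import urllib.request


def allocate_localhost_port() -> int:
    """Ask the OS for a free TCP port on the loopback interface only."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the kernel pick a free port
        return sock.getsockname()[1]


def wait_until_ready(port: int, timeout_s: float = 60.0, interval_s: float = 0.5) -> bool:
    """Poll http://127.0.0.1:<port>/v1/models until it answers 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    url = f"http://127.0.0.1:{port}/v1/models"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # backend is not listening yet; try again
        time.sleep(interval_s)
    return False
```

The port is released before the subprocess binds it, so another process could grab it in between; a real launcher would probably retry on bind failure.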
Generate DS4 chat, reasoner, and think-max aliases for each visible DS4 GGUF entry in /v1/models. Resolve suffix aliases through EnginePool, including user-defined model_alias bases, while avoiding global deepseek aliases. Add alias helper metadata for future DS4 forwarding behavior. Refs #1
Add a DS4ProcessEngine wrapper around the managed ds4-server subprocess so DS4 GGUF entries can lazy-load and unload through EnginePool. Wire global DS4 settings into pool construction, skip MLX teardown for external processes, and expose DS4 PID/port/RSS/log status. Cover manual load/unload, pinned preload, TTL unload, disabled settings, and deferred protocol methods with fake-process tests. Refs #1
Forward DS4-backed OpenAI chat completion requests through the managed localhost backend while preserving DS4 response bytes for streaming and non-streaming responses. Apply OMLX sampling defaults before forwarding and honor DS4 chat, reasoner, and think-max aliases without exposing DS4 directly. Track active proxied requests so DS4 subprocesses are not evicted mid-stream. Refs #1
Forward DS4-backed OpenAI text completion requests through the managed localhost backend while preserving DS4 response bytes for streaming and non-streaming responses. Apply OMLX sampling defaults before forwarding and reuse DS4 suffix alias request mutation for completions. Refs #1
Forward DS4-backed OpenAI Responses API requests through the managed localhost backend while preserving DS4 response bytes for streaming and non-streaming responses. Apply Responses-shaped sampling defaults and DS4 suffix aliases before forwarding without entering MLX validation paths. Refs #1
Forward DS4-backed Anthropic Messages API requests through the managed localhost backend while preserving DS4 response bytes for streaming and non-streaming responses. Apply Anthropic-shaped sampling defaults and DS4 suffix aliases before forwarding without entering MLX conversion or validation paths. Refs #1
Parse usage from non-streaming DS4 proxy responses without changing returned bytes and record request counts, token totals, cached tokens, and proxy duration in server metrics. Keep streaming tee metrics and DS4 phase timing out of scope for a later slice. Refs #1
Parse usage from streamed DS4 SSE payloads while yielding the original bytes unchanged and record request counts, token totals, cached tokens, TTFT, and generation duration. Merge usage observed across OpenAI, Responses, and Anthropic stream shapes without adding DS4 phase timing or UI changes. Refs #1
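One way to tee usage out of an SSE stream without touching the forwarded bytes is sketched below. The helper names and the "largest value wins" merge rule are assumptions for illustration. The field names (`prompt_tokens`/`completion_tokens` for OpenAI, `input_tokens`/`output_tokens` for Responses and Anthropic) follow those public API shapes.

```python
import json
from typing import Dict, Iterable, Iterator

# Normalize the OpenAI and Responses/Anthropic token field names to one pair.
_KEY_MAP = {
    "prompt_tokens": "input_tokens",
    "completion_tokens": "output_tokens",
    "input_tokens": "input_tokens",
    "output_tokens": "output_tokens",
}


def _merge_usage(event: object, usage: Dict[str, int]) -> None:
    """Fold any usage object found in one decoded event into the running totals."""
    if not isinstance(event, dict):
        return
    candidates = [event.get("usage")]
    for parent in ("response", "message"):  # Responses and Anthropic nest usage
        nested = event.get(parent)
        if isinstance(nested, dict):
            candidates.append(nested.get("usage"))
    for found in candidates:
        if not isinstance(found, dict):
            continue
        for key, norm in _KEY_MAP.items():
            value = found.get(key)
            if isinstance(value, int):
                usage[norm] = max(usage.get(norm, 0), value)  # assumed merge rule


def tee_usage(chunks: Iterable[bytes], usage: Dict[str, int]) -> Iterator[bytes]:
    """Yield every chunk unchanged while collecting usage from SSE data lines."""
    buffer = b""
    for chunk in chunks:
        yield chunk  # forward the original bytes first; parsing never alters them
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue
            payload = line[len(b"data:"):].strip()
            if payload == b"[DONE]":
                continue
            try:
                _merge_usage(json.loads(payload), usage)
            except ValueError:
                continue  # not JSON; metrics are best-effort
```

Because parsing happens after each chunk is yielded, a malformed event can only cost metrics, never response bytes.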
Restart managed DS4 backends with a temporary 393216-token context when per-model think-max aliases are requested and the current context is too small. Preserve saved DS4 settings, keep high context until unload, and reject restarts while another DS4 request is active. Refs #1
Restart idle DS4 backends before the next proxied request when the managed subprocess has exited. Keep pinned DS4 models available by restarting crashed pinned processes from the pool health path, while leaving unpinned crashes stopped until demand. Refs #1
Enforce the global DS4 disk KV budget before launching managed backends by deleting oldest recursive *.kv files under the shared KV root. Keep pruning limited to DS4 KV files and record a prune summary for diagnostics. Refs #1
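The pruning rule can be sketched roughly as follows. Treating "oldest" as oldest modification time is an assumption, and `prune_kv_files` is a hypothetical name rather than the helper this PR adds.

```python
from pathlib import Path
from typing import List


def prune_kv_files(kv_root: Path, budget_bytes: int) -> List[Path]:
    """Delete the oldest *.kv files under kv_root until the total fits the budget."""
    files = [p for p in kv_root.rglob("*.kv") if p.is_file()]  # DS4 KV files only
    files.sort(key=lambda p: p.stat().st_mtime)  # oldest first (assumed: by mtime)
    total = sum(p.stat().st_size for p in files)
    removed: List[Path] = []
    for path in files:
        if total <= budget_bytes:
            break
        size = path.stat().st_size
        path.unlink()
        total -= size
        removed.append(path)
    return removed  # usable as a prune summary for diagnostics
```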
Store managed DS4 stdout and stderr in each model debug directory while preserving in-memory recent logs for status diagnostics. Expose the persistent DS4 log path through engine status so admin/status surfaces can point users at crash and launch logs. Refs #1
Persist a DS4-only per-model context override and pass it into managed DS4 launches instead of mutating global DS4 settings. Restart loaded idle DS4 backends when the override changes, reject active restarts, and cap overrides at the DS4 context maximum. Refs #1
Include DS4 lifecycle, context, log path, formatted RSS, and recent log lines in the admin model-list API so the UI can surface managed backend state. Refs #1
Replace the HF downloader's MLX-only toggle with MLX/DS4-GGUF OR filters and catalog DeepSeek V4 GGUF candidates from base-model searches with unverified compatibility metadata. Refs #1
Add DS4-only admin model settings for per-model launch context and show live backend status details in the settings modal. Refs #1
Expose DS4 backend status and launch settings in the admin global settings API/UI. Persist enabled, support path, context, KV, SSD streaming, and power controls for future managed DS4 launches. Refs #1
Wire macOS app builds to stage prebuilt DS4 support files and sign embedded DS4 binaries. Seed the default user DS4 support directory from bundled resources at startup while preserving explicit custom dirs. Refs #1
Parse DS4 progress log lines into active model activity snapshots so admin status can show live prefill/decode phase and token-rate details. Extend activity metadata formatting for current/total token progress and average/chunk token throughput. Refs #1
Add a release helper that builds ds4-server from a local ds4-apetersson checkout and stages a validated packaging/DS4Support tree for the macOS bundle. Document the prebundle workflow and ignore generated DS4Support binaries. Refs #1
Enable DS4 auto SSD streaming when GGUF size exceeds the current memory budget after normal eviction, while preserving explicit on/off overrides and avoiding explicit expert-cache budget flags. Refs #1
Probe staged ds4-server binaries for required launch flags so stale prebuilt support trees fail during validation instead of at first managed launch. Refs #1
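A flag probe of this kind could look roughly like the sketch below. It assumes the check is a plain substring search over the `--help` output; the function name and timeout are hypothetical.

```python
import subprocess
from typing import Iterable, List


def missing_launch_flags(binary: str, required: Iterable[str], timeout_s: float = 10.0) -> List[str]:
    """Run `<binary> --help` and return the required flags its help text lacks."""
    result = subprocess.run(
        [binary, "--help"],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # some tools exit non-zero after printing help
    )
    help_text = result.stdout + result.stderr
    # Assumed check: substring match, so a longer flag sharing the prefix also passes.
    return [flag for flag in required if flag not in help_text]
```

Validation would then fail early when, for example, `missing_launch_flags(path, ["--ssd-streaming"])` returns a non-empty list.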
Resolve discovered DS4 GGUF entries from original filenames or paths so source-name selections reach the normalized canonical model id. Include DS4 GGUF file artifacts in the Models Manager local list with source-cased display names and safe file deletion support. Refs #1
Read DS4 streaming responses with low-latency SSE flushing so clients receive events as they are generated instead of waiting on urllib3 chunk buffering. Forward no-buffering SSE headers while preserving DS4 response bytes and streaming metrics tee behavior. Refs #1
Raise the DS4 streaming fallback byte cap because SSE event boundaries are the primary flush mechanism; the cap only guards unusually long or unterminated events. Refs #1
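The flush policy from these two commits, where event boundaries come first and a byte cap is only the fallback, can be illustrated with a small generator. The 1 MiB default and the function name are assumptions, and the sketch only recognizes `\n\n` boundaries, not `\r\n\r\n`.

```python
from typing import Iterable, Iterator


def iter_sse_events(chunks: Iterable[bytes], max_buffer: int = 1 << 20) -> Iterator[bytes]:
    """Re-chunk a byte stream so each complete SSE event is flushed immediately."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            idx = buffer.find(b"\n\n")  # a blank line terminates an SSE event
            if idx == -1:
                break
            yield buffer[: idx + 2]
            buffer = buffer[idx + 2:]
        if len(buffer) >= max_buffer:
            # Fallback only: guards unusually long or unterminated events.
            yield buffer
            buffer = b""
    if buffer:
        yield buffer  # trailing bytes after the last boundary
```

Concatenating the yielded pieces always reproduces the input, so the original response bytes are preserved.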
Hide unsupported MLX-only controls for DS4 models and clear them from DS4 save payloads. Stop injecting DS4-unsupported sampling parameters into proxied OpenAI chat/completion bodies. Refs #1
Resolve conflicts in server.py, admin routes/dashboard, and engine_pool by adopting upstream's diffusion and mid-system cache refactors and layering DS4 behavior on top (settings-state helper, admission memory, profile apply). Read engine_type defensively for upstream's test fakes.
|
Since this is a more complex PR than usual, it would make sense to have it independently validated before merging. I personally developed and verified it on an M1 Ultra with 128GB RAM. The true HW min specs for these are likely 48GB or 32GB RAM. It runs smoothly at 96GB or more without SSD streaming. I would also love to see feedback on Deepseek-V4-Pro from the 256 and 512 GB folks. |
|
Would love to receive critical reviews, don't hold back, I can take it :) Any wishes for improvements? I have been running this for about a week with daily hours-long workloads in a multi-model oMLX install (LLM, TTS, STT, embeddings) and it seems quite robust so far. Any thoughts on the way the DS4 binary is vendored in? |
|
Have not looked at the code, but my first take: You're not fitting DeepSeek 4 Flash onto a Mac with 32GB of RAM! DeepSeek 4 isn't in a league ordinarily thought of as suitable for use on any "normal" user machine. The specialized mostly-2-bit quant used by DS4 is over eighty gigabytes; a 96GB Mac would have space for minimal context and barely any for macOS itself, but could technically use it; a 128GB Mac can make reasonable use of it as long as it is going out of its way not to do anything else at the time. See https://huggingface.co/antirez/deepseek-v4-gguf for notes on the actual quants -- and their sizes. DeepSeek 4 Pro is not fitting on my Mac any time soon! Maybe if I'd decided to pull the trigger when the 512 GB M3 Ultra Mac Studio was still available, if I was okay with about two tokens per second... Because even the very extreme quant of DeepSeek 4 Flash is very demanding, I think trying to use oMLX as a chain loader for it is not going to be useful for most users. With my 128GB M5 Max MBP, I would not use it, because I wouldn't want the RAM overhead of having oMLX loaded at all. Like, I am specifically migrating away from VSCode because it forces my Mac to swap (and to kick out DeepSeek between turns, taking several seconds to reload it from SSD on each reply) if I try to use it at the same time as DS4. I have 128 GB of RAM and it's not enough for me to just do "business as usual" while using DS4 -- I don't want a single extra Python interpreter, if I can avoid it. The ds4 command line is not super convenient, but writing shell scripts to make it more convenient is totally fine. (I have a shell script that sets up a split-screen thing with tmux, running DS4 and then Pi Coding Agent configured to use it, showing DS4's server only in a tiny pane so I can see its status messages but Pi gets most of the space.) DeepSeek 4 Flash (antirez quant) is plenty capable to write those scripts for you! That's how I got the ones I wanted... |
|
This looks very interesting to me. I am a heavy But I run modified versions of
Is it possible to run such a modified |
This PR uses the binary from ~/.omlx/support/ds4/ds4-server, which it obtains from the vendored version. But you can override it, no code modification needed. OMLX has a hidden/advanced override for exactly this use case. In DS4Settings:
You point OMLX at your custom binary in two ways:
{
"ds4": {
"binary_path": "/path/to/your/custom/ds4-server"
}
}

When binary_path is set, OMLX skips its vendored binary entirely. No copies, no validation of bundled files. It does still probe --help to check for required CLI flags (--ssd-streaming), so your custom binary needs that flag present. The other settings (kv_root, context_default_tokens, ssd_streaming, power, etc.) all apply normally since they're passed as CLI arguments to whatever binary you point it at. I have not exposed the option in the GUI yet, but since DS4 is progressing so quickly and apparently @fry69 and others are running custom versions of it, it makes sense to expose it as well. Your patches also seem very useful; I will have a look, as structured outputs are a frequent use case of mine. |
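To make the settings-to-arguments idea concrete, here is a rough sketch. The flag names `--port`, `--ctx`, `--kv-disk-dir`, and `--power` appear in this thread, but which setting key feeds which flag is an assumption, and `build_ds4_command` is a hypothetical helper. `--ssd-streaming` is left out because its argument form is not shown here.

```python
import os
from typing import Any, Dict, List

# Default managed binary location mentioned in this thread.
DEFAULT_BINARY = "~/.omlx/support/ds4/ds4-server"


def build_ds4_command(ds4: Dict[str, Any], port: int) -> List[str]:
    """Build an argv list for ds4-server from a "ds4" settings mapping."""
    # A configured binary_path replaces the managed binary entirely.
    binary = ds4.get("binary_path") or DEFAULT_BINARY
    cmd = [os.path.expanduser(binary), "--port", str(port)]
    # Assumed key-to-flag mapping; the PR's real mapping may differ.
    if "context_default_tokens" in ds4:
        cmd += ["--ctx", str(ds4["context_default_tokens"])]
    if "kv_root" in ds4:
        cmd += ["--kv-disk-dir", str(ds4["kv_root"])]
    if "power" in ds4:
        cmd += ["--power", str(ds4["power"])]
    return cmd
```

Under these assumptions, `{"binary_path": "/path/to/your/custom/ds4-server", "power": 90}` with port 28000 yields `["/path/to/your/custom/ds4-server", "--port", "28000", "--power", "90"]`.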
|
The Update: I also start
./ds4-server --power 90 --ctx 500000 --kv-disk-dir ./tmp/ds4-kv --kv-disk-space-mb 131072 --port 28000
How do those get translated to Ah, I see:
So there are new options inside oMLX for this? Great. Update2: I can confirm that generating tokens via I noticed one small detail: This PR likes to create a separate KV caching folder for each model inside the specified It does make it harder to switch between plain |
|
I have one strong recommendation for this PR: Do not include the This will also remove the currently vendor runtime |
While it does build easily on my machine, I am not 100% sure what the build-time requirements are. Does the user runtime then need Xcode? A user without Xcode CLT / make / Apple clang would fail to build ds4-server. The current approach avoids this but still allows a custom ds4-server override, which we should include in the UI settings. What do you think of changing this to build ds4-server from a pinned DS4 source commit during release/CI packaging, not on the end user's machine? That would shift the burden more towards the maintainers / CI environment, which would then absorb the DS4 build complexity. Also a simple clone -> |
|
I can only speak for myself, but I am not comfortable with a PR that includes an executable binary. That is too opaque for me. |
Remove committed DS4 runtime payloads from package data and source tree. Add pinned-source build/install plumbing for app, CLI, Homebrew, and CI.
Keep existing build-script stdout phrases while using the shared pinned-source helper.
Keep the displayed default support dir implicit so managed installs can still auto-build. Retry failed auto-builds when source overrides change. Require pinned local sources to be verifiable git checkouts.
Run first-load DS4 provisioning off the asyncio loop. Let custom source repos build their default ref while recording the resolved commit. Copy and validate the actual Metal kernel set and ignore generated vendor runtime files.
Do not fall through to runtime builds when bundled DS4 resources are present but broken. Prune stale destination Metal kernels on overwrite so custom source manifests describe the selected source tree.
Serialize default support-dir provisioning so concurrent DS4 loads share one build. Keep localized DS4 support-dir hints aligned with the custom source install guidance.
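Sharing one build across concurrent loads while keeping the event loop free can be sketched with an `asyncio.Lock` and `asyncio.to_thread`. The class and method names are hypothetical; this is not the PR's implementation.

```python
import asyncio
from typing import Callable, Optional


class SupportProvisioner:
    """Run a blocking support-dir build at most once, shared by all callers."""

    def __init__(self, build: Callable[[], str]) -> None:
        self._build = build  # blocking function that returns the support dir
        self._lock = asyncio.Lock()
        self._support_dir: Optional[str] = None

    async def ensure(self) -> str:
        async with self._lock:  # later callers wait here for the first build
            if self._support_dir is None:
                # Run the build in a worker thread so the loop stays responsive.
                self._support_dir = await asyncio.to_thread(self._build)
            return self._support_dir
```

If the build raises, the cached value stays unset, so the next caller attempts the build again.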
Build or fail DS4 support by default for app bundles so end users avoid runtime toolchains. Clear stale manifests when overwriting support files from manifest-less prebuilt trees.
Keep the pinned upstream DS4 Metal file list as test-only fixture data. Remove the stale production helper now that runtime support validation discovers Metal files dynamically.
Accept Hugging Face blob/resolve file URLs in the built-in downloader and download only the selected file with allow_patterns. Expose direct DS4-GGUF file URLs as search/detail results so UI download actions avoid full-repo snapshots.
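Splitting a Hugging Face file URL into repo, revision, and filename might look like the sketch below. `HFFileRef` and `parse_hf_file_url` are hypothetical names, and revisions containing slashes (such as `refs/pr/1`) are not handled.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlparse


class HFFileRef(NamedTuple):
    repo_id: str
    revision: str
    filename: str


def parse_hf_file_url(url: str) -> Optional[HFFileRef]:
    """Parse https://huggingface.co/<owner>/<repo>/(blob|resolve)/<rev>/<path>."""
    parsed = urlparse(url)
    if parsed.netloc != "huggingface.co":
        return None
    parts = [unquote(p) for p in parsed.path.split("/") if p]
    if len(parts) < 5 or parts[2] not in ("blob", "resolve"):
        return None  # a repo page or some other URL, not a single file
    owner, repo, _, revision = parts[:4]
    return HFFileRef(f"{owner}/{repo}", revision, "/".join(parts[4:]))
```

The parsed filename can then be passed as the only entry in `allow_patterns` to `huggingface_hub.snapshot_download`, so only that file is fetched rather than the full repository.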
Expose DS4 support source metadata in the Engine Versions payload, preferring the staged support manifest over the bundled pin. Render unlinked commits as short SHAs and cover the default/custom manifest cases.
|
Great 👍, I read through the external PR: I agree with the reasoning, I see a similar path as with As mentioned in the external PR, If it is not already mentioned in the external PR and I missed it: The path for building and shipping this PR and the needed assets can then be roughly modeled like the mentioned When the macOS app gets built, it can get decided to build and include Maybe it also makes sense to get a signal from @antirez on how he thinks his DwarfStar project should get shipped in other projects? Just to avoid misunderstandings and sudden rug pull situations after shipping this feature. The macOS app has a much bigger user surface than e.g. Homebrew, so it might make sense to hold back until this gets sorted out? For Homebrew an additional flag can get added like |
Move the DS4 support build job to the macOS 26 hosted runner so the pinned DS4 source compiles against an SDK with the required Metal residency and math APIs.
|
Thanks @fry69 for the feedback. There are a few simplifications possible, and I am just attempting to fix the GH Action. Long-term, DS4 should run in-process with C bindings like a library, for better metrics and to avoid tight coupling with the log format, but I avoided that for stability and short-term maintainability reasons. I will try to run this as my local setup for a while and, if @jundot prefers, possibly find some more simplifications in the code. (Do you prefer a squashed PR or this history-preserving PR style?) |


Adds support for serving DS4 GGUF models through omlx by managing ds4-server as an external subprocess, rather than reimplementing the engine in MLX.

How it works

-think, -think-max) and -chat for no-thinking mode.
ds4-server process at a time (DS4 is a single-process backend), bound to localhost only. The engine pool handles admission with subprocess RSS accounting, auto-enables DS4 SSD streaming under memory pressure, restarts crashed pinned models with backoff, and integrates with TTL/eviction.
/v1/chat/completions, /v1/completions, /v1/responses, and /v1/messages (Anthropic), including low-latency SSE streaming.

Notes