Summary
On decode-perf-tuning (commit 5625a99), any request whose total context (prompt + generated tokens) crosses the model's 4096-token sliding-window boundary crashes during decode once it is past 4096, with "CUDA illegal memory access" / "cuda decode failed". When the prompt alone is longer than 4096 tokens, the very first decode step fails. After that, the CUDA context is poisoned and every subsequent request fails with "cuda prefill state reset failed", so the server has to be restarted.
The trigger is the per-layer CUDA-graph decode capture (enabled by default on this branch). Disabling it with DS4_CUDA_LAYER_GRAPHS=0 completely resolves the crash, at no measurable decode-speed cost on this hardware.
Environment
- Driver: 595.58.03 (nvidia-driver-595-open)
- Build:
make cuda -j20 CUDA_ARCH=sm_121 against host CUDA 13.0
Also reproduced identically in a containerized build against CUDA 13.2, so it is not specific to one CUDA toolkit version.
Startup log
ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4-server: context buffers 1739.75 MiB (ctx=65536, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16386)
ds4-server: listening on http://0.0.0.0:8000
Reproduction
Case A — prompt already longer than 4096 (crashes at gen=0)
Send a single /v1/completions request with a ~5200-token prompt:
curl -s http://localhost:8000/v1/completions \
-H 'Content-Type: application/json' \
-d '{"model":"deepseek-chat","prompt":"<~5200 tokens of text>","max_tokens":400}'
Prefill completes fine; the first decode step dies:
ds4-server: completion ctx=0..5212:5212 prefill chunk 4096/5212 (78.6%) chunk=460.56 t/s ...
ds4-server: completion ctx=0..5212:5212 prefill chunk 5212/5212 (100.0%) chunk=376.72 t/s ...
ds4-server: completion ctx=0..5212:5212 prompt done 11.856s
ds4: CUDA end commands failed: an illegal memory access was encountered
ds4: CUDA synchronize failed: an illegal memory access was encountered
ds4: Metal synchronize after graph eval failure also failed
ds4-server: completion ctx=0..5212:5212 gen=0 finish=error error="cuda decode failed" 11.875s
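For repeatable runs, the curl request above can be scripted. The following is a minimal sketch, assuming the server is reachable on localhost:8000 as in the startup log; the helper names build_payload and send are hypothetical, and the filler prompt only approximates a token count, since the actual count depends on the tokenizer.

```python
import json
import urllib.error
import urllib.request

# Assumption: the server from the startup log is reachable on localhost:8000.
URL = "http://localhost:8000/v1/completions"


def build_payload(n_words, max_tokens=400):
    """Return a JSON request body whose prompt is n_words filler words.

    Word count is only a rough proxy for token count (the tokenizer decides),
    so confirm the real prompt length from the ctx=... value in the server log.
    """
    prompt = " ".join(f"word{i}" for i in range(n_words))
    body = {"model": "deepseek-chat", "prompt": prompt, "max_tokens": max_tokens}
    return json.dumps(body).encode("utf-8")


def send(payload, timeout=600.0):
    """POST the payload and return (HTTP status, response text)."""
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Error responses still carry a JSON body worth inspecting.
        return err.code, err.read().decode("utf-8", "replace")
```

Calling send(build_payload(5200)) issues one Case A style request. Adjust the word count until the ctx= value in the server log shows a prompt longer than 4096 tokens.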
Case B — short prompt, generation crosses 4096 mid-decode
A ~4086-token prompt generating ~500 tokens decodes normally at first, then crashes mid-stream after the total context has passed 4096 (in the log below, at gen=394, ctx=4480):
ds4-server: completion ctx=4286..4336:50 gen=250 decoding chunk=12.34 t/s ...
ds4-server: completion ctx=4336..4386:50 gen=300 decoding ...
ds4: CUDA end commands failed: an illegal memory access was encountered
ds4-server: completion ctx=4436..4480:44 gen=394 finish=error error="cuda decode failed"
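To check where a run actually failed relative to the window, the completion lines can be parsed. This is a rough sketch that assumes the log format is exactly as quoted in this report; failure_point and WINDOW are hypothetical names, and the regex may need adjusting for other builds.

```python
import re

# Assumption: the 4096-token sliding window quoted in the summary.
WINDOW = 4096

# Matches the failing "completion" lines quoted in this report, e.g.
# completion ctx=4436..4480:44 gen=394 finish=error error="cuda decode failed"
ERROR_LINE = re.compile(r"completion ctx=(\d+)\.\.(\d+):\d+ gen=(\d+) finish=error")


def failure_point(log_text):
    """Return details of the first failed completion in log_text, or None."""
    for line in log_text.splitlines():
        m = ERROR_LINE.search(line)
        if m:
            _start, end, gen = (int(g) for g in m.groups())
            return {"ctx_end": end, "gen": gen, "past_window": end > WINDOW}
    return None
```

Applied to the Case B log above, this reports ctx_end=4480 and gen=394; applied to the Case A log, it reports ctx_end=5212 and gen=0.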
Aftermath — context poisoned
Every later request, even a trivial one, returns:
{"error":{"message":"cuda prefill state reset failed","type":"invalid_request_error"}}
Only a process restart recovers the server.
Requests that keep total context under ~4096 tokens never crash, which is why short-output benchmarks don't surface this.
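A supervisor script could recognize the poisoned state from the response body and trigger a restart. The sketch below only matches the exact error message quoted above; is_poisoned is a hypothetical helper, and other builds may word the error differently.

```python
import json

# The error message returned by every request after the crash, per this report.
POISON_MESSAGE = "cuda prefill state reset failed"


def is_poisoned(response_body):
    """Return True if a response body carries the post-crash error message."""
    try:
        data = json.loads(response_body)
    except json.JSONDecodeError:
        return False  # Not JSON, so not the error payload shown above.
    if not isinstance(data, dict):
        return False
    error = data.get("error")
    return isinstance(error, dict) and error.get("message") == POISON_MESSAGE
```

Since only a process restart recovers the server, a True result is a signal to restart rather than retry.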
Workaround
Setting DS4_CUDA_LAYER_GRAPHS=0 disables the per-layer graph capture and falls back to the eager decode path, as upstream does:
DS4_CUDA_LAYER_GRAPHS=0 ./ds4-server --cuda -m <gguf> --host 0.0.0.0 -c 65536 --port 8000
With this set, a 5200-token prompt and repeated >4096-context requests all complete cleanly with no poisoning.
Decode throughput is unchanged on GB10: ~13.2 tok/s single-stream with or without graph capture.
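When the server is started from a script rather than a shell, the same workaround can be applied by setting the variable in the child environment. This is a minimal sketch that reuses the flags from the command above; eager_decode_env and launch are hypothetical names, and the model path is left to the caller.

```python
import os
import subprocess


def eager_decode_env(base=None):
    """Copy an environment and set DS4_CUDA_LAYER_GRAPHS=0 in the copy."""
    env = dict(os.environ if base is None else base)
    env["DS4_CUDA_LAYER_GRAPHS"] = "0"
    return env


def launch(model_path, binary="./ds4-server"):
    """Start the server with the same flags as the command above."""
    cmd = [binary, "--cuda", "-m", model_path,
           "--host", "0.0.0.0", "-c", "65536", "--port", "8000"]
    return subprocess.Popen(cmd, env=eager_decode_env())
```

Copying the environment instead of mutating os.environ keeps the override scoped to the server process.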