
Decode crashes with "CUDA illegal memory access" once context crosses the 4096-token sliding window  #2

Description

@srinathh

Summary

On the decode-perf-tuning branch (commit 5625a99), any request whose total context (prompt + generated tokens) crosses the model's 4096-token sliding-window boundary crashes on the first decode step past 4096 with "CUDA illegal memory access" / "cuda decode failed". After that, the CUDA context is poisoned and every subsequent request fails with "cuda prefill state reset failed", so the server has to be restarted.

The trigger is the per-layer CUDA-graph decode capture (enabled by default on this branch). Disabling it with DS4_CUDA_LAYER_GRAPHS=0 completely resolves the crash, at no measurable decode-speed cost on this hardware.

Environment

  • Driver: 595.58.03 (nvidia-driver-595-open)
  • Build: make cuda -j20 CUDA_ARCH=sm_121 against host CUDA 13.0

Also reproduced identically in a containerized build against CUDA 13.2, so it is not specific to one CUDA toolkit version.

Startup log

ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4-server: context buffers 1739.75 MiB (ctx=65536, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16386)
ds4-server: listening on http://0.0.0.0:8000

Reproduction

Case A — prompt already longer than 4096 (crashes at gen=0)

Send a single /v1/completions request with a ~5200-token prompt:

curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-chat","prompt":"<~5200 tokens of text>","max_tokens":400}'

Prefill completes fine; the first decode step dies:

ds4-server: completion ctx=0..5212:5212 prefill chunk 4096/5212 (78.6%) chunk=460.56 t/s ...
ds4-server: completion ctx=0..5212:5212 prefill chunk 5212/5212 (100.0%) chunk=376.72 t/s ...
ds4-server: completion ctx=0..5212:5212 prompt done 11.856s
ds4: CUDA end commands failed: an illegal memory access was encountered
ds4: CUDA synchronize failed: an illegal memory access was encountered
ds4: Metal synchronize after graph eval failure also failed
ds4-server: completion ctx=0..5212:5212 gen=0 finish=error error="cuda decode failed" 11.875s
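
The `<~5200 tokens of text>` placeholder in the Case A request can be filled with a synthetic prompt. The sketch below is one way to do that; `make_payload` is a hypothetical helper, not part of ds4, and the assumption that one filler word maps to roughly one token is untested for this tokenizer, so check the `ctx=0..N` value in the server log and adjust the word count until N exceeds 4096.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a /v1/completions payload whose prompt repeats
# a single filler word. The words-to-tokens ratio is an assumption.
make_payload() {
  local words="$1" max_tokens="$2" prompt
  # printf reuses its format once per argument; %.0s prints each argument
  # with zero width, so this emits "lorem " exactly $words times.
  prompt="$(printf 'lorem %.0s' $(seq 1 "$words"))"
  # Drop the single trailing space, then emit the JSON body.
  printf '{"model":"deepseek-chat","prompt":"%s","max_tokens":%d}' \
    "${prompt% }" "$max_tokens"
}

# Usage (needs a running server, so it is left commented out):
# make_payload 5200 400 | curl -s http://localhost:8000/v1/completions \
#   -H 'Content-Type: application/json' -d @-
```

The filler word contains no characters that need JSON escaping, which is why the payload can be assembled with plain `printf`.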

Case B — short prompt, generation crosses 4096 mid-decode

A ~4086-token prompt generating ~500 tokens decodes fine up to ~4096 total, then crashes mid-stream:

ds4-server: completion ctx=4286..4336:50 gen=250 decoding chunk=12.34 t/s ...
ds4-server: completion ctx=4336..4386:50 gen=300 decoding ...
ds4: CUDA end commands failed: an illegal memory access was encountered
ds4-server: completion ctx=4436..4480:44 gen=394 finish=error error="cuda decode failed"

Aftermath — context poisoned

Every later request, even a trivial one, returns:

{"error":{"message":"cuda prefill state reset failed","type":"invalid_request_error"}}

Only a process restart recovers the server.
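
A monitoring script can detect the poisoned state from the response body alone. This is a minimal sketch: `is_poisoned` is a hypothetical name, and it assumes the error message text stays exactly as quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical check: reads a response body on stdin and succeeds (exit 0)
# if it contains the post-crash error message quoted in this issue.
is_poisoned() {
  grep -q 'cuda prefill state reset failed'
}

# Usage (needs a running server, so it is left commented out):
# curl -s http://localhost:8000/v1/completions \
#   -H 'Content-Type: application/json' \
#   -d '{"model":"deepseek-chat","prompt":"hi","max_tokens":1}' \
#   | is_poisoned && echo "CUDA context poisoned: restart ds4-server"
```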

Requests that keep total context under ~4096 tokens never crash, which is why short-output benchmarks don't surface this.
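
To tell whether a given benchmark request can reach the failing region at all, compare the worst-case total context against the window. The helper below is an illustrative sketch (`crosses_window` is a hypothetical name); it treats `max_tokens` as an upper bound, so a request that stops generating early may stay under the window even when this prints `yes`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "yes" if prompt_tokens + max_tokens can exceed
# the sliding window (default 4096, the value reported in this issue).
crosses_window() {
  local prompt_tokens="$1" max_tokens="$2" window="${3:-4096}"
  if [ $((prompt_tokens + max_tokens)) -gt "$window" ]; then
    echo yes
  else
    echo no
  fi
}

crosses_window 5212 400   # Case A: the prompt alone is past the window
crosses_window 4086 500   # Case B: generation crosses the window
crosses_window 1000 400   # typical short-output benchmark
```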

Workaround

Setting the environment variable DS4_CUDA_LAYER_GRAPHS=0 disables per-layer graph capture and falls back to the eager decode path, the same path upstream uses.

DS4_CUDA_LAYER_GRAPHS=0 ./ds4-server --cuda -m <gguf> --host 0.0.0.0 -c 65536 --port 8000

With this set, a 5200-token prompt and repeated >4096-context requests all complete cleanly with no poisoning.

Decode throughput is unchanged on GB10: ~13.2 tok/s single-stream either way.
