Decode crashes with "CUDA illegal memory access" once context crosses the 4096-token sliding window 

## Summary

On `decode-perf-tuning` (commit `5625a99`), any request whose total context (prompt + generated tokens) crosses the model's 4096-token sliding-window boundary fails on the first decode step past 4096 with `CUDA illegal memory access` / `cuda decode failed`. After that, the CUDA context is poisoned and every subsequent request fails with `cuda prefill state reset failed`, so the server has to be restarted.

The trigger is the per-layer CUDA-graph decode capture (enabled by default on this branch). Disabling it with **`DS4_CUDA_LAYER_GRAPHS=0`** completely resolves the crash, at no measurable decode-speed cost on this hardware.

## Environment

- **Driver:** 595.58.03 (nvidia-driver-595-open)
- **Build:** `make cuda -j20 CUDA_ARCH=sm_121` against host **CUDA 13.0**

> Also reproduced identically in a containerized build against **CUDA 13.2**, so it is not specific to one CUDA toolkit version.

## Startup log 

```
ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4-server: context buffers 1739.75 MiB (ctx=65536, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16386)
ds4-server: listening on http://0.0.0.0:8000
```

## Reproduction

### Case A — prompt already longer than 4096 (crashes at gen=0)

Send a single `/v1/completions` request with a ~5200-token prompt:

```bash
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-chat","prompt":"<~5200 tokens of text>","max_tokens":400}'
```
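The prompt placeholder above can be filled reproducibly with a small helper. This is a minimal sketch: `make_payload` is a hypothetical name, not part of ds4, and it assumes each repeated filler word maps to roughly one token, which depends on the tokenizer, so the word count may need adjusting to land near 5200 tokens.

```bash
# Hypothetical helper: build a /v1/completions JSON body whose prompt is
# N copies of a filler word. The token count only approximates N.
make_payload() {
  local words="$1" max_tokens="$2" prompt
  # printf reuses its format for every argument, so this emits N "hello " copies.
  prompt="$(printf 'hello %.0s' $(seq 1 "$words"))"
  # The filler contains no quotes or backslashes, so no JSON escaping is needed.
  printf '{"model":"deepseek-chat","prompt":"%s","max_tokens":%d}' "${prompt% }" "$max_tokens"
}

# Usage (not run here):
# make_payload 5200 400 | curl -s http://localhost:8000/v1/completions \
#   -H 'Content-Type: application/json' -d @-
```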

Prefill completes fine; the **first decode step** dies:

```
ds4-server: completion ctx=0..5212:5212 prefill chunk 4096/5212 (78.6%) chunk=460.56 t/s ...
ds4-server: completion ctx=0..5212:5212 prefill chunk 5212/5212 (100.0%) chunk=376.72 t/s ...
ds4-server: completion ctx=0..5212:5212 prompt done 11.856s
ds4: CUDA end commands failed: an illegal memory access was encountered
ds4: CUDA synchronize failed: an illegal memory access was encountered
ds4: Metal synchronize after graph eval failure also failed
ds4-server: completion ctx=0..5212:5212 gen=0 finish=error error="cuda decode failed" 11.875s
```

### Case B — short prompt, generation crosses 4096 mid-decode

A ~4086-token prompt generating ~500 tokens decodes fine up to ~4096 total, then crashes mid-stream:

```
ds4-server: completion ctx=4286..4336:50 gen=250 decoding chunk=12.34 t/s ...
ds4-server: completion ctx=4336..4386:50 gen=300 decoding ...
ds4: CUDA end commands failed: an illegal memory access was encountered
ds4-server: completion ctx=4436..4480:44 gen=394 finish=error error="cuda decode failed"
```

### Aftermath — context poisoned

Every later request, even a trivial one, returns:

```json
{"error":{"message":"cuda prefill state reset failed","type":"invalid_request_error"}}
```

Only a process restart recovers the server.
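Because the poisoned state only shows up in the error body, a supervisor script could detect it by matching that string. A minimal sketch, assuming the error message stays exactly as shown above; `is_poisoned` is a hypothetical helper, and how the restart is performed is left to the deployment.

```bash
# Hypothetical helper: exit 0 if a response body carries the
# poisoned-context error quoted above, non-zero otherwise.
is_poisoned() {
  printf '%s' "$1" | grep -q 'cuda prefill state reset failed'
}

# Usage (not run here):
# resp="$(curl -s http://localhost:8000/v1/completions \
#   -H 'Content-Type: application/json' \
#   -d '{"model":"deepseek-chat","prompt":"hi","max_tokens":1}')"
# if is_poisoned "$resp"; then echo "server needs a restart" >&2; fi
```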

Requests that keep total context **under ~4096 tokens never crash**, which is why short-output benchmarks don't surface this.
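When picking test cases, it can help to check up front whether a request is expected to cross the boundary. The sketch below is only an approximation: `crosses_window` is a hypothetical helper, the 4096 default is taken from this report, and the exact step at which the crash surfaces may differ slightly.

```bash
# Hypothetical helper: exit 0 if prompt tokens plus requested generation
# tokens would exceed the sliding window (default 4096, per this report).
crosses_window() {
  local prompt_tokens="$1" max_tokens="$2" window="${3:-4096}"
  [ $((prompt_tokens + max_tokens)) -gt "$window" ]
}

# Usage:
# crosses_window 5212 400 && echo "expected to hit the bug"
```

With the numbers from the two cases above, both `crosses_window 5212 400` and `crosses_window 4086 500` succeed, while a short benchmark such as `crosses_window 1000 200` does not.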


## Workaround 

Setting `DS4_CUDA_LAYER_GRAPHS=0` disables per-layer graph capture and falls back to the eager decode path, as upstream does.

```bash
DS4_CUDA_LAYER_GRAPHS=0 ./ds4-server --cuda -m <gguf> --host 0.0.0.0 -c 65536 --port 8000
```

With this set, a 5200-token prompt and repeated >4096-context requests all complete cleanly with no poisoning.

**Decode throughput is unchanged on GB10**: ~13.2 tok/s single-stream with or without layer graphs.
