[Bug] OpenAI WS v2 passthrough lacks downstream WebSocket keepalive for long Codex tasks #3171

## Summary

OpenAI Responses WebSocket v2 `passthrough` mode can stay silent on the downstream client connection during long Codex tasks, for example Codex `image_generation`. When Sub2API is behind an idle-sensitive proxy/load balancer/CDN, the downstream WebSocket may be closed before `response.completed`, even though the upstream OpenAI WebSocket is still alive.

This is different from the existing SSE keepalive issues and from `ctx_pool` continuation issues. The affected path is the WS v2 passthrough relay.

## Observed behavior

A Codex client using Responses WebSocket v2 through Sub2API passthrough may report errors similar to:

```text
stream disconnected before completion: websocket closed by server before response.completed
```

This is easiest to reproduce with long-running image generation or other turns where the upstream can be busy for a while without producing downstream business frames.

## Root cause

The passthrough relay forwards client and upstream frames, but it does not actively keep the downstream WebSocket alive when no downstream business frames are written for an extended period.

If an intermediate proxy has an idle timeout, it can close the client-facing WS connection during the long upstream turn. Sub2API then sees a graceful client-side EOF/close, and all it can still do is drain the upstream for usage/accounting; the client never receives `response.completed`.

## Expected behavior

For WS v2 passthrough mode, Sub2API should optionally send WebSocket Ping control frames to the downstream client after the first downstream business frame, when the downstream side has been idle for a configured interval.

Suggested behavior:

- Default interval around 20 seconds.
- Configurable timeout for waiting on Pong, around 5 seconds.
- `0` interval disables the behavior.
- If the downstream connection does not support active Ping, log/trace this and keep existing relay behavior.
- If Ping/Pong fails, treat it as a graceful client disconnect and preserve the existing upstream drain/usage capture behavior.
- Do not start downstream keepalive before the first downstream business write, so the relay does not keep never-started/failed handshakes alive artificially.

## Validation from a downstream deployment

After adding downstream WS Ping/Pong keepalive, long Codex `image_generation` requests behind a proxy showed repeated successful downstream keepalive traces during the same long session, for example:

```text
stage=downstream_ping_ok direction=downstream_keepalive graceful=true wrote_downstream=true
```

The same deployment continued to drain upstream usage correctly when the client connection closed.

## Proposed fix

I can open a PR with a minimal implementation that:

- Adds `Ping(ctx)` support to the passthrough downstream frame connection.
- Starts a downstream keepalive goroutine after the first downstream business write.
- Adds config keys:
  - `gateway.openai_ws.passthrough_downstream_ping_interval_seconds`
  - `gateway.openai_ws.passthrough_downstream_ping_timeout_seconds`
- Adds relay tests for:
  - ping starts only after the first downstream write;
  - successful Ping/Pong during idle periods;
  - Ping failure is handled as graceful client disconnect while preserving upstream drain;
  - idle timeout behavior is avoided while pings succeed.

## Related issues checked

Related but not exact duplicates:

- #2121: SSE keepalive before upstream response headers.
- #2031: image generation HTTP/Cloudflare timeout discussion.
- #1807: Codex WS disconnect symptom, but no downstream WS Ping/Pong root cause.
- #1769 / #2139: ctx_pool/continuation behavior, not passthrough downstream keepalive.

