[Bug] OpenAI WS v2 passthrough lacks downstream WebSocket keepalive for long Codex tasks #3171

@WesleyZiwen

Summary

OpenAI Responses WebSocket v2 passthrough mode can stay silent on the downstream client connection during long Codex tasks, for example Codex image_generation. When Sub2API is behind an idle-sensitive proxy/load balancer/CDN, the downstream WebSocket may be closed before response.completed, even though the upstream OpenAI WebSocket is still alive.

This is different from the existing SSE keepalive issues and from ctx_pool continuation issues. The affected path is the WS v2 passthrough relay.

Observed behavior

A Codex client using Responses WebSocket v2 through Sub2API passthrough may report errors similar to:

stream disconnected before completion: websocket closed by server before response.completed

This is easiest to reproduce with long-running image generation or other turns where the upstream can be busy for a while without producing downstream business frames.

Root cause

The passthrough relay forwards client/upstream frames, but it does not actively keep the downstream WebSocket alive when no downstream business frames are written for a while.

If an intermediate proxy has an idle timeout, it can close the client-facing WS connection during the long upstream turn. Sub2API then sees a graceful client-side EOF/close and has to drain upstream only for usage/accounting.

Expected behavior

For WS v2 passthrough mode, Sub2API should optionally send WebSocket Ping control frames to the downstream client after the first downstream business frame, when the downstream side has been idle for a configured interval.

Suggested behavior:

  • Default interval around 20 seconds.
  • Configurable timeout for waiting on Pong, around 5 seconds.
  • An interval of 0 disables the behavior.
  • If the downstream connection does not support active Ping, log/trace this and keep existing relay behavior.
  • If Ping/Pong fails, treat it as a graceful client disconnect and preserve the existing upstream drain/usage capture behavior.
  • Do not start downstream keepalive before the first downstream business write, so the relay does not keep never-started/failed handshakes alive artificially.

Validation from a downstream deployment

After adding downstream WS Ping/Pong keepalive, long Codex image_generation requests behind a proxy showed repeated successful downstream keepalive traces during the same long session, for example:

stage=downstream_ping_ok direction=downstream_keepalive graceful=true wrote_downstream=true

The same deployment continued to drain upstream usage correctly when the client connection closed.

Proposed fix

I can open a PR with a minimal implementation that:

  • Adds Ping(ctx) support to the passthrough downstream frame connection.
  • Starts a downstream keepalive goroutine after the first downstream business write.
  • Adds config keys:
    • gateway.openai_ws.passthrough_downstream_ping_interval_seconds
    • gateway.openai_ws.passthrough_downstream_ping_timeout_seconds
  • Adds relay tests for:
    • ping starts only after the first downstream write;
    • successful Ping/Pong during idle periods;
    • Ping failure is handled as a graceful client disconnect while preserving upstream drain;
    • idle timeout behavior is avoided while pings succeed.

Related issues checked

Related but not exact duplicates:
