Skip to content

client/v3: remove cancelled substream from w.substreams before closing recvc#21981

Open
crawfordxx wants to merge 2 commits into
etcd-io:mainfrom
crawfordxx:fix-watch-panic-send-on-closed-recvc-21969
Open

client/v3: remove cancelled substream from w.substreams before closing recvc#21981
crawfordxx wants to merge 2 commits into
etcd-io:mainfrom
crawfordxx:fix-watch-panic-send-on-closed-recvc-21969

Conversation

@crawfordxx

Copy link
Copy Markdown

Root cause

When a watch cancel response arrives in run(), ws.recvc is closed and
the substream is added to the closing map — but it is not removed from
w.substreams.

If a progress-notify response (WatchId == InvalidWatchID) is received
before closingc is drained, broadcastResponse iterates every entry in
w.substreams and executes:

case ws.recvc <- wr:

Sending on an already-closed channel panics with "send on closed channel".

Fix

Delete the cancelled substream from w.substreams immediately after
close(ws.recvc), closing the race window. closeSubstream still calls
delete(w.substreams, ws.id) when closingc is drained; the redundant
delete is a safe no-op on a map.

Test

Added TestBroadcastResponseNoPanicOnCancelledSubstream which reproduces
the race window (close ws.recvc, remove from w.substreams, then call
broadcastResponse) and asserts no panic occurs.

Fixes #21969

…g recvc

When a watch cancel response is received in run(), ws.recvc is closed and
the substream is added to the closing map, but was not removed from
w.substreams.  If a progress-notify response (WatchId == InvalidWatchID)
arrives before closingc is drained, broadcastResponse iterates all entries
in w.substreams and tries to send on the already-closed recvc channel,
causing a panic: "send on closed channel".

Fix: delete the cancelled substream from w.substreams immediately after
closing its recvc, so that broadcastResponse cannot observe a closed
channel.  closeSubstream, which is called later when closingc is drained,
still performs delete(w.substreams, ws.id); the redundant delete is a
safe no-op.

Add a regression test that reproduces the race window and confirms
broadcastResponse does not panic after the substream is removed.

Fixes etcd-io#21969

Signed-off-by: crawfordxx <crawfordxx@users.noreply.github.com>
@k8s-ci-robot

Copy link
Copy Markdown

Hi @crawfordxx. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: crawfordxx
Once this PR has been reviewed and has the lgtm label, please assign siyuanfoundation for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread client/v3/watch_test.go Outdated
Comment on lines +240 to +241
close(ws.recvc)
delete(w.substreams, ws.id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented out the substreams deletion in watch.go and this test still passes. Is that intentional?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you are right. The original test called delete(w.substreams, ws.id) in the test body itself, so broadcastResponse never saw the closed channel regardless of whether the deletion in run() was present. The test was essentially testing a no-op.

I have restructured it into two subtests (e884c38):

  1. "panics when closed recvc is still in substreams" — closes ws.recvc but does not remove it from w.substreams, then calls broadcastResponse and asserts that a panic occurs. This documents the pre-fix bug state and will catch any future regression that re-introduces a closed channel into the broadcast path.

  2. "no panic after substream removed from substreams" — applies the same eager-delete that run() now does, then verifies broadcastResponse does not panic.

The first subtest is the key addition: it would have failed with the original test, and will fail again if the delete is removed from watch.go.

The original TestBroadcastResponseNoPanicOnCancelledSubstream always
passed regardless of whether the fix in watch.go was present, because
the test itself called delete(w.substreams, ws.id) before invoking
broadcastResponse — bypassing the exact code path under test.

Restructure the test as two subtests:

1. "panics when closed recvc is still in substreams" — proves that
   broadcastResponse panics on a closed channel that is still present
   in w.substreams.  This documents the pre-fix bug state and will
   catch any future regression that re-introduces a closed channel
   into the broadcast path.

2. "no panic after substream removed from substreams" — proves that
   eagerly removing the substream (what run() now does) prevents the
   panic.

Signed-off-by: crawfordxx <crawfordxx@users.noreply.github.com>
@kubernetes-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: crawfordxx
Once this PR has been reviewed and has the lgtm label, please assign siyuanfoundation for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

client/v3: panic: send on closed channel in watchGRPCStream.broadcastResponse

3 participants