Skip to content

client/v3: panic: send on closed channel in watchGRPCStream.broadcastResponse #21969

Description

@chenk008

Bug report

What happened?

A panic: send on closed channel occurs in watchGRPCStream.broadcastResponse when a progress notify response is broadcast to a substream whose recvc channel has already been closed by a prior cancel response.

panic: send on closed channel

goroutine 15028947 [running]:
go.etcd.io/etcd/client/v3.(*watchGRPCStream).broadcastResponse(...)
	go.etcd.io/etcd/client/v3@v3.6.8/watch.go:741
go.etcd.io/etcd/client/v3.(*watchGRPCStream).dispatchEvent(0xd94df769380, 0xd94b8661ce0)
	go.etcd.io/etcd/client/v3@v3.6.8/watch.go:732 +0x2f4
go.etcd.io/etcd/client/v3.(*watchGRPCStream).run(0xd94df769380)
	go.etcd.io/etcd/client/v3@v3.6.8/watch.go:628 +0x62c

Root cause

In run(), when a watch cancel response is received (pbresp.Canceled && pbresp.CompactRevision == 0), the code closes ws.recvc and adds the substream to the closing map, but does not remove it from w.substreams:

https://github.com/etcd-io/etcd/blob/main/client/v3/watch.go#L620-L626

case pbresp.Canceled && pbresp.CompactRevision == 0:
    delete(cancelSet, pbresp.WatchId)
    if ws, ok := w.substreams[pbresp.WatchId]; ok {
        // signal to stream goroutine to update closingc
        close(ws.recvc)
        closing[ws] = struct{}{}
    }

Meanwhile, broadcastResponse iterates all entries in w.substreams and sends on ws.recvc:

https://github.com/etcd-io/etcd/blob/main/client/v3/watch.go#L745-L753

func (w *watchGRPCStream) broadcastResponse(wr *WatchResponse) bool {
    for _, ws := range w.substreams {
        select {
        case ws.recvc <- wr:   // panic: send on closed channel
        case <-ws.donec:
        }
    }
    return true
}

The case <-ws.donec guard does not protect against this because ws.donec is closed later in the substream goroutine's deferred cleanup, not when recvc is closed. So there is a window where recvc is closed but donec is not, the substream is still in w.substreams, and broadcastResponse panics.

Proposed fix

Add delete(w.substreams, pbresp.WatchId) after closing[ws] = struct{}{} in the cancel handler:

case pbresp.Canceled && pbresp.CompactRevision == 0:
    delete(cancelSet, pbresp.WatchId)
    if ws, ok := w.substreams[pbresp.WatchId]; ok {
        close(ws.recvc)
        closing[ws] = struct{}{}
        delete(w.substreams, pbresp.WatchId)
    }

This is safe because:

  • closeSubstream (called later via closingc) also calls delete(w.substreams, ws.id), which is a no-op on an already-deleted key.
  • The defer teardown at run() exit iterates w.substreams and checks the closing map, so it won't double-close recvc.
  • The "no more watchers" check (len(w.substreams) + len(w.resuming) == 0) correctly reflects reality one entry sooner.

What did you expect to happen?

No panic. A canceled substream should be unreachable by broadcastResponse immediately after its recvc is closed.

How can we reproduce it?

This requires a race between a watch cancel response and a progress notify response on the same gRPC watch stream. It is more likely to occur under high watch churn with progress notify enabled.

etcd version

Verified present on main, v3.6.12, v3.7.0-rc.0, and v3.8.0-alpha.0.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions