Bug report
What happened?
A panic: send on closed channel occurs in watchGRPCStream.broadcastResponse when a progress notify response is broadcast to a substream whose recvc channel has already been closed by a prior cancel response.
panic: send on closed channel
goroutine 15028947 [running]:
go.etcd.io/etcd/client/v3.(*watchGRPCStream).broadcastResponse(...)
go.etcd.io/etcd/client/v3@v3.6.8/watch.go:741
go.etcd.io/etcd/client/v3.(*watchGRPCStream).dispatchEvent(0xd94df769380, 0xd94b8661ce0)
go.etcd.io/etcd/client/v3@v3.6.8/watch.go:732 +0x2f4
go.etcd.io/etcd/client/v3.(*watchGRPCStream).run(0xd94df769380)
go.etcd.io/etcd/client/v3@v3.6.8/watch.go:628 +0x62c
Root cause
In run(), when a watch cancel response is received (pbresp.Canceled && pbresp.CompactRevision == 0), the code closes ws.recvc and adds the substream to the closing map, but does not remove it from w.substreams:
https://github.com/etcd-io/etcd/blob/main/client/v3/watch.go#L620-L626
case pbresp.Canceled && pbresp.CompactRevision == 0:
delete(cancelSet, pbresp.WatchId)
if ws, ok := w.substreams[pbresp.WatchId]; ok {
// signal to stream goroutine to update closingc
close(ws.recvc)
closing[ws] = struct{}{}
}
Meanwhile, broadcastResponse iterates all entries in w.substreams and sends on ws.recvc:
https://github.com/etcd-io/etcd/blob/main/client/v3/watch.go#L745-L753
func (w *watchGRPCStream) broadcastResponse(wr *WatchResponse) bool {
for _, ws := range w.substreams {
select {
case ws.recvc <- wr: // panic: send on closed channel
case <-ws.donec:
}
}
return true
}
The case <-ws.donec guard does not protect against this because ws.donec is closed later in the substream goroutine's deferred cleanup, not when recvc is closed. So there is a window where recvc is closed but donec is not, the substream is still in w.substreams, and broadcastResponse panics.
Proposed fix
Add delete(w.substreams, pbresp.WatchId) after closing[ws] = struct{}{} in the cancel handler:
case pbresp.Canceled && pbresp.CompactRevision == 0:
delete(cancelSet, pbresp.WatchId)
if ws, ok := w.substreams[pbresp.WatchId]; ok {
close(ws.recvc)
closing[ws] = struct{}{}
delete(w.substreams, pbresp.WatchId)
}
This is safe because:
closeSubstream (called later via closingc) also calls delete(w.substreams, ws.id), which is a no-op on an already-deleted key.
- The defer teardown at
run() exit iterates w.substreams and checks the closing map, so it won't double-close recvc.
- The "no more watchers" check (
len(w.substreams) + len(w.resuming) == 0) correctly reflects reality one entry sooner.
What did you expect to happen?
No panic. A canceled substream should be unreachable by broadcastResponse immediately after its recvc is closed.
How can we reproduce it?
This requires a race between a watch cancel response and a progress notify response on the same gRPC watch stream. It is more likely to occur under high watch churn with progress notify enabled.
etcd version
Verified present on main, v3.6.12, v3.7.0-rc.0, and v3.8.0-alpha.0.
Bug report
What happened?
A
panic: send on closed channeloccurs inwatchGRPCStream.broadcastResponsewhen a progress notify response is broadcast to a substream whoserecvcchannel has already been closed by a prior cancel response.Root cause
In
run(), when a watch cancel response is received (pbresp.Canceled && pbresp.CompactRevision == 0), the code closesws.recvcand adds the substream to theclosingmap, but does not remove it fromw.substreams:https://github.com/etcd-io/etcd/blob/main/client/v3/watch.go#L620-L626
Meanwhile,
broadcastResponseiterates all entries inw.substreamsand sends onws.recvc:https://github.com/etcd-io/etcd/blob/main/client/v3/watch.go#L745-L753
The
case <-ws.donecguard does not protect against this becausews.donecis closed later in the substream goroutine's deferred cleanup, not whenrecvcis closed. So there is a window whererecvcis closed butdonecis not, the substream is still inw.substreams, andbroadcastResponsepanics.Proposed fix
Add
delete(w.substreams, pbresp.WatchId)afterclosing[ws] = struct{}{}in the cancel handler:This is safe because:
closeSubstream(called later viaclosingc) also callsdelete(w.substreams, ws.id), which is a no-op on an already-deleted key.run()exit iteratesw.substreamsand checks theclosingmap, so it won't double-closerecvc.len(w.substreams) + len(w.resuming) == 0) correctly reflects reality one entry sooner.What did you expect to happen?
No panic. A canceled substream should be unreachable by
broadcastResponseimmediately after itsrecvcis closed.How can we reproduce it?
This requires a race between a watch cancel response and a progress notify response on the same gRPC watch stream. It is more likely to occur under high watch churn with progress notify enabled.
etcd version
Verified present on
main,v3.6.12,v3.7.0-rc.0, andv3.8.0-alpha.0.