balancer/pickfirst: fix flaky metrics tests by awaiting async metrics by mbissa · Pull Request #9020 · grpc/grpc-go

mbissa · 2026-03-25T17:50:13Z

Fixed a race condition in pickfirst metrics tests where the assertions would immediately poll the stats.TestMetricsRecorder. Depending on the goroutine scheduling, capturing of metrics like 'grpc.subchannel.open_connections' would occasionally not be fully recorded before the test checked them, causing flaky failures. This introduces an awaitMetric helper to wait for the metrics to reach their expected values.

RELEASE NOTES: none

codecov · 2026-03-25T17:55:34Z

Codecov Report

❌ Patch coverage is 70.10309% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.18%. Comparing base (0f3086d) to head (2669e2e).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
internal/testutils/stats/test_metrics_recorder.go	70.10%	18 Missing and 11 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #9020      +/-   ##
==========================================
- Coverage   83.28%   83.18%   -0.11%     
==========================================
  Files         418      419       +1     
  Lines       33741    33895     +154     
==========================================
+ Hits        28102    28194      +92     
- Misses       4232     4274      +42     
- Partials     1407     1427      +20

Files with missing lines	Coverage Δ
internal/testutils/stats/test_metrics_recorder.go	`72.58% <70.10%> (-5.28%)`	⬇️

... and 42 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

arjan-bal · 2026-03-26T06:50:46Z

 		t.Fatalf("EmptyCall() failed: %v", err)
 	}

-	if got, _ := tmr.Metric("grpc.lb.pick_first.connection_attempts_succeeded"); got != 1 {


All pick_first metrics (I'm not sure about subchannel metrics) MUST be emitted before an RPC returns, as the LB policy is responsible for making the channel active (which is what allows the RPC to succeed). If the metrics aren't emitted in this order, it likely indicates a bug causing race conditions within the channel. Because of this, we shouldn't need to change the validation logic. Could you please revert the changes to the pick_first metrics?

reverted pick first and metrics that do not need waiting out of the new construct.

arjan-bal · 2026-03-26T19:21:36Z

 	return gotMetrics
 }
+
+func awaitMetric(ctx context.Context, t *testing.T, tmr *stats.TestMetricsRecorder, name string, want float64) {


Instead of a frequent polling, we should use channels to efficiently block the test until the expected metric is available. The TestMetricsRecorder (TMR) already has a method that allows waiting for a particular metric to be emitted:

grpc-go/internal/testutils/stats/test_metrics_recorder.go

Lines 160 to 170 in c20fba0

func (r *TestMetricsRecorder) WaitForFloat64Count(ctx context.Context, metricsDataWant MetricsData) error {

got, err := r.floatCountCh.Receive(ctx)

if err != nil {

return fmt.Errorf("timeout waiting for float64Count")

}

metricsDataGot := got.(MetricsData)

if diff := cmp.Diff(metricsDataGot, metricsDataWant); diff != "" {

return fmt.Errorf("float64count metricsData received unexpected value (-got, +want): %v", diff)

}

return nil

}

However, it currently has a few issues that make it difficult to use:

Strict Ordering: It fails if the very next metric emitted doesn't match the expected one. When writing e2e tests, it is difficult to account for all metrics and the exact order in which they will be emitted.

Data Loss: It consumes and discards buffered metrics while waiting.

Currently, the only users of these WaitForX methods seem to be the metrics recorder list tests. I think we should improve this test utility to make it more ergonomic for other e2e tests to use.

I propose we combine the logic of the Metric() and WaitForX methods.

To address the first issue, we can update the WaitForX method to keep reading from the channel until either the context expires or a metric with a matching name is found for comparision.

To address the second issue, we can store all read MetricsData in the TMR's inner map. When WaitForX is called, it can first check this map for the required data; if it's found, it can proceed with the match directly without waiting on the channel.

What do you think?

sounds attempt in correct direction
What do you think about below @arjan-bal :

Instead of WaitForX consuming the channel directly, the TMR will internally "pump" all incoming metrics into a thread-safe map.

The internal map will store a slice of values (map[string][]float64) instead of just the latest value. This allows tests to verify state transitions (e.g., checking that a counter hit 1 and then 2) and nothing is discarded and users would know that in case of races if more than one values can come then they want to read the first one or the last one.

For point 2, the usual way to verify transitions is to pause the test, verify the current value of the metric, trigger the event that causes the metric to update, and then verify the new value. The test controls the events that cause the metrics to be emitted, instead of running to completion and verifying the metrics afterward. This seems like a stricter way to verify metrics. Is there a case where doing this is currently difficult?

We do follow the "Trigger -> Verify -> Trigger -> Verify" pattern (e.g., connect -> verify 1 -> disconnect -> verify -1).
Even after the connection state transitions, the test thread can run ahead of the metrics recorder.
Because of this asynchronous delay, we still need the WaitForX methods to block until the metric is recorded. Furthermore, under CPU load, the background goroutines can record multiple values (e.g., both 1 and -1) back-to-back before the test thread is rescheduled to verify. The slice-based map prevents the later value from overwriting the earlier one, ensuring both transitions are verified in the correct sequence.

Fixed a race condition in pickfirst metrics tests where the assertions would immediately poll the stats.TestMetricsRecorder. Depending on the goroutine scheduling, asynchronous metrics like 'grpc.subchannel.open_connections' would occasionally not be fully recorded before the test checked them, causing flaky failures. This introduces an awaitMetric helper to wait for the metrics to reach their expected values.

… in TestMetricsRecorder

mbissa · 2026-06-14T18:28:17Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the TestMetricsRecorder in internal/testutils/stats to record and track all metric updates sequentially rather than only storing the most recent value. It also updates pickfirst balancer metrics tests to use the new sequential waiting helpers. A critical race condition was identified in the Record* methods of TestMetricsRecorder where metrics are sent to channels before being appended to the internal data slice under lock, which can cause waitForMetric to miss updates and block indefinitely.

arjan-bal

Can

arjan-bal · 2026-06-15T08:10:34Z


-		data: make(map[string]float64),
+		data:     make(map[string][]MetricsData),
+		consumed: make(map[string]int),


Instead of using the consumed map to track the head pointer of the queue, we can remove the first element by slicing:

data[name] = data[name][1:]

arjan-bal · 2026-06-15T08:12:30Z

@@ -68,15 +71,27 @@
 func (r *TestMetricsRecorder) Metric(name string) (float64, bool) {


We should mention in the godoc that this doesn't consume the metric value.

arjan-bal · 2026-06-15T08:29:11Z

 	intCountCh       *testutils.Channel
 	floatCountCh     *testutils.Channel
 	intHistoCh       *testutils.Channel
 	floatHistoCh     *testutils.Channel
 	intGaugeCh       *testutils.Channel
 	intUpDownCountCh *testutils.Channel


These channels are used only to signal that a metric has been received; the actual metric is read directly from the data map. We can replace them with a single channel. The only side effect is that the loop in waitForMetric might unblock a few extra times for different metric types. However, it will simply check the data map and block again, so correctness is maintained.

arjan-bal · 2026-06-15T08:31:44Z

 // WaitForInt64CountIncr waits for an int64 count metric to be recorded and
 // verifies that the recorded metrics data incr matches the expected incr.
 // Returns an error if failed to wait or received wrong data.
 func (r *TestMetricsRecorder) WaitForInt64CountIncr(ctx context.Context, incrWant int64) error {


We should replace WaitForInt64CountIncr with WaitForInt64Count, which checks the metric type, not just the value. This would allow us to use a single channel for signaling.

arjan-bal · 2026-06-15T08:33:02Z

+	got, err := r.waitForMetric(ctx, r.intUpDownCountCh, want.Handle.Name)
+	if err != nil {
+		return fmt.Errorf("timeout waiting for int64UpDownCount")
+	}


We should propagate the error instead of returning a fresh error. Same for other methods.

arjan-bal · 2026-06-15T08:34:37Z

+// WaitForInt64UpDownCount waits for an int64 up-down count metric and
+// verifies that it matches want. Returns an error if failed to wait or
+// received wrong data.
+func (r *TestMetricsRecorder) WaitForInt64UpDownCount(ctx context.Context, want MetricsData) error {


We should mention that this method consumes the metrics until finds the expected metric in the godoc. Same for other methods.

arjan-bal · 2026-06-15T08:38:32Z

+func waitForInt64Count(ctx context.Context, t *testing.T, tmr *stats.TestMetricsRecorder, name string, want int64) {
+	t.Helper()
+	err := tmr.WaitForInt64Count(ctx, stats.MetricsData{
+		Handle:  &estats.MetricDescriptor{Name: name},
+		IntIncr: want,
+	})
+	if err != nil {
+		t.Fatal(err)
+	}
+}


style: Assertion helpers are not considered idiomatic in Go, see go/go-style/best-practices#test-functions

Instead, you can should inline the calls, even if it results in a little repetition.

mbissa added this to the 1.81 Release milestone Mar 25, 2026

mbissa added the Type: Bug label Mar 25, 2026

mbissa requested a review from arjan-bal March 25, 2026 18:42

mbissa assigned arjan-bal Mar 25, 2026

arjan-bal reviewed Mar 26, 2026

View reviewed changes

arjan-bal assigned mbissa and unassigned arjan-bal Mar 26, 2026

mbissa assigned arjan-bal and unassigned mbissa Mar 26, 2026

arjan-bal reviewed Mar 26, 2026

View reviewed changes

arjan-bal assigned mbissa and unassigned arjan-bal Mar 26, 2026

Pranjali-2501 modified the milestones: 1.81 Release, 1.82 Release Apr 15, 2026

mbissa assigned arjan-bal and unassigned mbissa Apr 28, 2026

arjan-bal assigned mbissa and unassigned arjan-bal Apr 28, 2026

mbissa modified the milestones: 1.82 Release, 1.83 Release Jun 5, 2026

mbissa added 5 commits June 10, 2026 09:41

balancer/pickfirst: revert pick_first metrics to synchronous checks

0c93633

review comments

2848dac

review comments

c969a89

testutils/stats: store metric slice and support sequence verification…

7d50135

… in TestMetricsRecorder

mbissa force-pushed the fix-flaky-pickfirst-metrics-test branch from d5d8759 to 7d50135 Compare June 13, 2026 10:03

mbissa assigned arjan-bal and unassigned mbissa Jun 13, 2026

gemini-code-assist Bot reviewed Jun 14, 2026

View reviewed changes

Comment thread internal/testutils/stats/test_metrics_recorder.go Outdated

fixing race review comments

2669e2e

arjan-bal reviewed Jun 15, 2026

View reviewed changes

arjan-bal assigned mbissa and unassigned arjan-bal Jun 15, 2026

	func (r *TestMetricsRecorder) WaitForFloat64Count(ctx context.Context, metricsDataWant MetricsData) error {
	got, err := r.floatCountCh.Receive(ctx)
	if err != nil {
	return fmt.Errorf("timeout waiting for float64Count")
	}
	metricsDataGot := got.(MetricsData)
	if diff := cmp.Diff(metricsDataGot, metricsDataWant); diff != "" {
	return fmt.Errorf("float64count metricsData received unexpected value (-got, +want): %v", diff)
	}
	return nil
	}

		@@ -68,15 +71,27 @@
		func (r *TestMetricsRecorder) Metric(name string) (float64, bool) {

Conversation

mbissa commented Mar 25, 2026

Uh oh!

codecov Bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mbissa commented Jun 14, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

arjan-bal left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Mar 25, 2026 •

edited

Loading