[CASCL-1386] (5/11) Scale down the cluster-autoscaler before eviction#3164
[CASCL-1386] (5/11) Scale down the cluster-autoscaler before eviction#3164L3n41c wants to merge 1 commit into
Conversation
|
1107a29 to
49deb5c
Compare
6423e9e to
dfa2be6
Compare
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (74.07%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## lenaic/CASCL-1386-evict-04-preflight #3164 +/- ##
========================================================================
+ Coverage 44.01% 44.04% +0.03%
========================================================================
Files 388 388
Lines 31061 31087 +26
========================================================================
+ Hits 13671 13692 +21
- Misses 16499 16501 +2
- Partials 891 894 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
49deb5c to
dd7f679
Compare
dfa2be6 to
1303499
Compare
dd7f679 to
a8fa15e
Compare
1303499 to
4a328cf
Compare
a8fa15e to
b03651d
Compare
05481ca to
ce64343
Compare
b03651d to
374d307
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce64343e28
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return nil | ||
| } | ||
| scale.Spec.Replicas = 0 | ||
| if _, err = clientset.AppsV1().Deployments(ca.Namespace).UpdateScale(ctx, ca.Name, scale, metav1.UpdateOptions{}); err != nil { |
There was a problem hiding this comment.
Wait for autoscaler pods before draining
When the Deployment scale update is accepted but the existing cluster-autoscaler pod is still terminating (for example during its normal deletion/grace period), this returns immediately and Run proceeds to PDB creation and target eviction without waiting for the Deployment/selector pods to reach zero. During that window the still-running autoscaler can react to pods made pending by the drain and create new legacy nodes, which is the race this step is intended to prevent; please wait for the Deployment status or matching pods to show the autoscaler is actually stopped before returning.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 1ed7ac0. After the scale update, scaleDownClusterAutoscaler now polls the Deployment's Scale subresource (wait.PollUntilContextTimeout, 2s interval / 2min timeout) until Status.Replicas == 0 before returning, so Run no longer proceeds to PDB creation and target eviction while the cluster-autoscaler pod is still running. New unit tests cover both the blocks-until-zero and the timeout paths.
| if _, err = clientset.AppsV1().Deployments(ca.Namespace).UpdateScale(ctx, ca.Name, scale, metav1.UpdateOptions{}); err != nil { | ||
| return err | ||
| } | ||
| log.Printf("Scaled Deployment %s/%s to 0 replicas.", ca.Namespace, ca.Name) |
There was a problem hiding this comment.
Shall it wait for actual scaledown to happen?
There was a problem hiding this comment.
Yes — it does now. After requesting the scale-down, scaleDownClusterAutoscaler polls the Deployment's Scale subresource (wait.PollUntilContextTimeout, 2s interval / 2min timeout) until Status.Replicas == 0 before returning, so the drain can't start while the cluster-autoscaler pod is still live. Done in 1ed7ac08.
ce64343 to
1ed7ac0
Compare
…-nodes Part of a stack splitting #3026 (too large to review in one piece) into small pieces that each build and pass tests on their own. The command is fully functional only once the whole stack lands.
1ed7ac0 to
563df65
Compare
What does this PR do?
Implements
scaleDownClusterAutoscaler: scales the cluster-autoscaler Deployment to 0 replicas so it cannot provision new legacy nodes mid-drain.Motivation
PR #3026 is too large to review as a single change. This is part 5 of 11 of a stack that splits it into small, individually-reviewable pieces that each build and pass tests on their own. See #3026 for the full feature context and end-to-end QA.
Additional Notes
Stacked PR — the base branch is
lenaic/CASCL-1386-evict-04-preflight, notmain, so the diff shows only this step's change. Implements logic the skeleton (part 1) left stubbed. The code is taken verbatim from #3026 (rebased ontomain); there are no logic changes versus #3026.Minimum Agent Versions
N/A —
kubectl-datadogplugin only.Describe your test plan
go test ./cmd/kubectl-datadog/autoscaling/cluster/...passes at this commit. Full end-to-end QA is tracked on #3026.Checklist
qa/skip-qalabel applied (not independently shippable; QA tracked on [CASCL-1386] Add evict-legacy-nodes subcommand to drain non-Datadog node groups #3026)