Skip to content

[CASCL-1386] (5/11) Scale down the cluster-autoscaler before eviction#3164

Open
L3n41c wants to merge 1 commit into
lenaic/CASCL-1386-evict-04-preflightfrom
lenaic/CASCL-1386-evict-05-cluster-autoscaler
Open

[CASCL-1386] (5/11) Scale down the cluster-autoscaler before eviction#3164
L3n41c wants to merge 1 commit into
lenaic/CASCL-1386-evict-04-preflightfrom
lenaic/CASCL-1386-evict-05-cluster-autoscaler

Conversation

@L3n41c

@L3n41c L3n41c commented Jun 18, 2026

Copy link
Copy Markdown
Member

What does this PR do?

Implements scaleDownClusterAutoscaler: scales the cluster-autoscaler Deployment to 0 replicas so it cannot provision new legacy nodes mid-drain.

Motivation

PR #3026 is too large to review as a single change. This is part 5 of 11 of a stack that splits it into small, individually-reviewable pieces that each build and pass tests on their own. See #3026 for the full feature context and end-to-end QA.

Additional Notes

Stacked PR — the base branch is lenaic/CASCL-1386-evict-04-preflight, not main, so the diff shows only this step's change. Implements logic the skeleton (part 1) left stubbed. The code is taken verbatim from #3026 (rebased onto main); there are no logic changes versus #3026.

Minimum Agent Versions

N/A — kubectl-datadog plugin only.

Describe your test plan

go test ./cmd/kubectl-datadog/autoscaling/cluster/... passes at this commit. Full end-to-end QA is tracked on #3026.

Checklist

@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented Jun 18, 2026

Copy link
Copy Markdown

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 3 Pipeline jobs failed

DataDog/datadog-operator | unit_tests   View in Datadog   GitLab

validation | build   View in Datadog   GitHub Actions

DataDog/datadog-operator | e2e_autoscaling   View in Datadog   GitLab

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 563df65 | Docs | Datadog PR Page | Give us feedback!

@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-04-preflight branch from 1107a29 to 49deb5c Compare June 18, 2026 15:41
@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-05-cluster-autoscaler branch from 6423e9e to dfa2be6 Compare June 18, 2026 15:41
@L3n41c L3n41c changed the title [CASCL-1386] (5/12) Scale down the cluster-autoscaler before eviction [CASCL-1386] (5/11) Scale down the cluster-autoscaler before eviction Jun 18, 2026
@codecov-commenter

codecov-commenter commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 74.07407% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.04%. Comparing base (374d307) to head (1ed7ac0).
⚠️ Report is 16 commits behind head on lenaic/CASCL-1386-evict-04-preflight.

Files with missing lines Patch % Lines
...dog/autoscaling/cluster/evict/clusterautoscaler.go 74.07% 4 Missing and 3 partials ⚠️

❌ Your patch status has failed because the patch coverage (74.07%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@                           Coverage Diff                            @@
##           lenaic/CASCL-1386-evict-04-preflight    #3164      +/-   ##
========================================================================
+ Coverage                                 44.01%   44.04%   +0.03%     
========================================================================
  Files                                       388      388              
  Lines                                     31061    31087      +26     
========================================================================
+ Hits                                      13671    13692      +21     
- Misses                                    16499    16501       +2     
- Partials                                    891      894       +3     
Flag Coverage Δ
unittests 44.04% <74.07%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...dog/autoscaling/cluster/evict/clusterautoscaler.go 75.00% <74.07%> (+75.00%) ⬆️

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 374d307...1ed7ac0. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-04-preflight branch from 49deb5c to dd7f679 Compare June 19, 2026 14:21
@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-05-cluster-autoscaler branch from dfa2be6 to 1303499 Compare June 19, 2026 14:21
@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-04-preflight branch from dd7f679 to a8fa15e Compare June 22, 2026 13:24
@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-05-cluster-autoscaler branch from 1303499 to 4a328cf Compare June 22, 2026 13:25
@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-04-preflight branch from a8fa15e to b03651d Compare June 23, 2026 08:49
@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-05-cluster-autoscaler branch 2 times, most recently from 05481ca to ce64343 Compare June 25, 2026 08:18
@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-04-preflight branch from b03651d to 374d307 Compare June 25, 2026 08:18
@L3n41c L3n41c marked this pull request as ready for review June 25, 2026 08:35
@L3n41c L3n41c requested review from a team as code owners June 25, 2026 08:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce64343e28

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

return nil
}
scale.Spec.Replicas = 0
if _, err = clientset.AppsV1().Deployments(ca.Namespace).UpdateScale(ctx, ca.Name, scale, metav1.UpdateOptions{}); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wait for autoscaler pods before draining

When the Deployment scale update is accepted but the existing cluster-autoscaler pod is still terminating (for example during its normal deletion/grace period), this returns immediately and Run proceeds to PDB creation and target eviction without waiting for the Deployment/selector pods to reach zero. During that window the still-running autoscaler can react to pods made pending by the drain and create new legacy nodes, which is the race this step is intended to prevent; please wait for the Deployment status or matching pods to show the autoscaler is actually stopped before returning.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1ed7ac0. After the scale update, scaleDownClusterAutoscaler now polls the Deployment's Scale subresource (wait.PollUntilContextTimeout, 2s interval / 2min timeout) until Status.Replicas == 0 before returning, so Run no longer proceeds to PDB creation and target eviction while the cluster-autoscaler pod is still running. New unit tests cover both the blocks-until-zero and the timeout paths.

if _, err = clientset.AppsV1().Deployments(ca.Namespace).UpdateScale(ctx, ca.Name, scale, metav1.UpdateOptions{}); err != nil {
return err
}
log.Printf("Scaled Deployment %s/%s to 0 replicas.", ca.Namespace, ca.Name)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall it wait for actual scaledown to happen?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — it does now. After requesting the scale-down, scaleDownClusterAutoscaler polls the Deployment's Scale subresource (wait.PollUntilContextTimeout, 2s interval / 2min timeout) until Status.Replicas == 0 before returning, so the drain can't start while the cluster-autoscaler pod is still live. Done in 1ed7ac08.

@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-05-cluster-autoscaler branch from ce64343 to 1ed7ac0 Compare June 26, 2026 12:16
…-nodes

Part of a stack splitting #3026 (too large to review in one piece) into
small pieces that each build and pass tests on their own. The command is
fully functional only once the whole stack lands.
@L3n41c L3n41c force-pushed the lenaic/CASCL-1386-evict-05-cluster-autoscaler branch from 1ed7ac0 to 563df65 Compare June 26, 2026 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants