Skip to content

test: cover LWS multi-vendor GPU resources#1350

Merged
ev-shindin merged 2 commits into
llm-d:mainfrom
xuhui-lu:xlu/1106-lws-multivendor-tests
Jun 30, 2026
Merged

test: cover LWS multi-vendor GPU resources#1350
ev-shindin merged 2 commits into
llm-d:mainfrom
xuhui-lu:xlu/1106-lws-multivendor-tests

Conversation

@xuhui-lu

Copy link
Copy Markdown
Contributor

Summary

  • add LWS GPU counting coverage for supported resource names: NVIDIA, AMD, Intel Gaudi, Intel i915, and Intel Xe
  • add a mixed-vendor LWS case across leader and worker containers
  • keep this as a unit-test-only follow-up for the multi-vendor GPU work

Part of #1106.

Testing

  • go test ./internal/utils/scaletarget
  • go test ./internal/utils/...

Also tried go test ./...; it is not clean locally for unrelated reasons:

  • internal/engines/analyzers/throughput: FitITLModel flat-line test currently fails
  • test/chart: local helm binary is not installed
  • test/e2e: entered cluster preflight / auth flow and was interrupted

Copilot AI review requested due to automatic review settings June 27, 2026 13:09
Signed-off-by: Xuhui Lu <swimming.fish06@gmail.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends unit-test coverage for LWSAccessor.GetTotalGPUsPerReplica() to ensure GPU counting works across the full set of supported vendor resource names (NVIDIA, AMD, Intel Gaudi, Intel i915, Intel Xe), including a mixed-vendor leader/worker scenario. This supports the broader multi-vendor GPU support work tracked in #1106 by guarding against regressions in GPU resource parsing for LeaderWorkerSet scale targets.

Changes:

  • Added a table-driven unit test covering all supported GPU resource names for LWS GPU counting.
  • Added a mixed-vendor LWS test case that combines different vendor resources across leader and worker containers.
  • Introduced small test helpers to reduce duplication when building LWS/container GPU requests.

@xuhui-lu xuhui-lu marked this pull request as draft June 27, 2026 13:20
@xuhui-lu xuhui-lu marked this pull request as ready for review June 28, 2026 07:54
ev-shindin
ev-shindin previously approved these changes Jun 29, 2026

@ev-shindin ev-shindin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Look to some minor (non-blocking) comments

Comment thread internal/utils/scaletarget/lws_test.go
Comment thread internal/utils/scaletarget/lws_test.go
Signed-off-by: Xuhui Lu <swimming.fish06@gmail.com>
@ev-shindin

Copy link
Copy Markdown
Collaborator

/ok-to-test

@github-actions

Copy link
Copy Markdown
Contributor

🚀 Kind E2E (full) triggered by /ok-to-test

View the Kind E2E workflow run

@github-actions

Copy link
Copy Markdown
Contributor

🚀 OpenShift E2E — approve and run (/ok-to-test)

View the OpenShift E2E workflow run

@github-actions

Copy link
Copy Markdown
Contributor

GPU Pre-flight Check ✅

GPUs are available for e2e-openshift tests. Proceeding with deployment.

Resource Total Allocated Available
GPUs 50 27 23
Cluster Value
Nodes 16 (7 with GPUs)
Total CPU 993 cores
Total Memory 10383 Gi
GPUs required 4 (min) / 6 (recommended)

@ev-shindin ev-shindin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xuhui-lu ! To follow-up: every existing and new test sets a non-zero GPU request, so the if total == 0 { return 1 } default is never exercised — change it to return 0 and the whole suite still passes. Add a CPU-only (no GPU requests) LWS case asserting 1

@ev-shindin ev-shindin merged commit 3be8e78 into llm-d:main Jun 30, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants