CI/WA: disable UCX direct-NIC on DL CPP test to avoid dmabuf 524 #1768
Closed
Alexey-Rivkin wants to merge 1 commit into
The DL CPP VRAM test fails with `ibv_reg_dmabuf_mr ... Unknown error 524` (ENOTSUPP) on the gb200-nvl4-ts2 nodes whenever mlx5_16 has a usable RoCE GID.

UCX finds a data-direct "sibling" rail NIC for the GPU (mlx5_0..15, in the GPU PCIe subtree) and therefore exports the CUDA buffer as a PCIe-mapped dma-buf, but then registers it on mlx5_16, a different, non-sibling NIC in a different IOMMU group. The NVIDIA driver rejects the PCIe-mapped attach across IOMMU groups (FORCE_PCIE topology check), which yields 524. Today, runs only "pass" when mlx5_16 has no GID and UCX silently falls back to TCP, i.e. RDMA isn't actually exercised.

Setting UCX_IB_DIRECT_NIC=n disables sibling detection, so the buffer is exported with the default mapping and registers successfully on mlx5_16. Verified on ts2-65/79/88: the default config fails with 524, UCX_IB_DIRECT_NIC=n passes. This lets CI exercise real RDMA on the GB200 fleet instead of dodging it, and mirrors the existing UCX_IB_REG_METHODS knob on the same step.
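The two node conditions described above (which IOMMU group a NIC sits in, and whether it has a populated GID) can be inspected from Linux sysfs. The following is a minimal diagnostic sketch, not part of this PR: the helper names are hypothetical, and "usable" is simplified here to "any non-zero GID entry on the port", which may be looser than the criteria UCX actually applies.

```python
import os

# An unpopulated GID table slot reads back as all zeros in sysfs.
ZERO_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def iommu_group(dev, sysfs="/sys"):
    """Return the IOMMU group number (as a string) of an RDMA device.

    Follows /sys/class/infiniband/<dev>/device/iommu_group, which is a
    symlink into /sys/kernel/iommu_groups/<N>. Returns None if the device
    or the link is missing (e.g. IOMMU disabled).
    """
    link = os.path.join(sysfs, "class", "infiniband", dev, "device", "iommu_group")
    if not os.path.exists(link):
        return None
    return os.path.basename(os.path.realpath(link))


def has_nonzero_gid(dev, port=1, sysfs="/sys"):
    """Return True if any GID table entry on the given port is non-zero.

    Assumption: a non-zero entry is treated as "usable"; this sketch does
    not check the GID type (RoCE v1/v2) or the port state.
    """
    gid_dir = os.path.join(sysfs, "class", "infiniband", dev, "ports", str(port), "gids")
    try:
        entries = os.listdir(gid_dir)
    except OSError:
        return False
    for name in entries:
        try:
            with open(os.path.join(gid_dir, name)) as f:
                gid = f.read().strip()
        except OSError:
            continue  # some slots may not be readable; skip them
        if gid and gid != ZERO_GID:
            return True
    return False


if __name__ == "__main__":
    # Device names are the ones mentioned in this PR; adjust per node.
    for dev in ("mlx5_0", "mlx5_16"):
        print(dev, "iommu_group =", iommu_group(dev), "gid =", has_nonzero_gid(dev))
```

On a node that reproduces the failure, one would expect mlx5_16 to report a non-zero GID; how its IOMMU group compares with the GPU's is the part the driver's topology check cares about, and this sketch only reports the NIC side.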
/build
Proven insufficient, as GDAKI ignores the |
What?

Workaround (WA). Add `UCX_IB_DIRECT_NIC=n` to the DL CPP test env.

Why?

DL CPP VRAM test fails `ibv_reg_dmabuf_mr ... 524` on `gb200-nvl4-ts2` nodes: UCX finds a data-direct ("direct NIC") sibling rail NIC for the GPU and exports the CUDA buffer PCIe-mapped, then registers it on `mlx5_16` - a non-sibling NIC in a different IOMMU group → NVIDIA driver rejects (`FORCE_PCIE` topology) → 524.

`UCX_IB_DIRECT_NIC=n` disables only the data-direct/direct-NIC path (unusable here anyway - the sibling rail NICs `mlx5_0..15` are down). GPU RDMA is not disabled: the buffer registers via the standard dmabuf mapping and the test still runs RDMA over `mlx5_16` (verified `ucp_mem_map` Success on ts2-65/79/88; default config 524s).

WA only; proper fix is cluster-side (enable the GPU-sibling rail NICs, or `GrdmaPciTopoCheckOverride=1`).
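For reproducing the workaround outside CI, the environment change can be sketched as a small Python launcher. This is an illustration only: the actual change lives in the CI step's env, `build_test_env` and `run_test` are hypothetical helper names, and the test command is whatever the DL CPP step runs on the node (not shown in this PR).

```python
import os
import subprocess


def build_test_env(base=None, extra=None):
    """Return a copy of the environment with the workaround applied.

    base:  mapping to start from (defaults to the current process env).
    extra: optional additional overrides, applied last.
    The input mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    # Workaround from this PR: disable the UCX data-direct / direct-NIC path.
    env["UCX_IB_DIRECT_NIC"] = "n"
    if extra:
        env.update(extra)
    return env


def run_test(cmd, env):
    """Run the test command with the given environment; return its exit code."""
    return subprocess.run(cmd, env=env, check=False).returncode
```

Usage would look like `run_test(cmd, build_test_env())`, where `cmd` is the argument list for the test binary. Because only the child's environment is changed, the setting does not leak into other steps run from the same shell.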