CI/WA: disable UCX direct-NIC on DL CPP test to avoid dmabuf 524 #1768

Closed
Alexey-Rivkin wants to merge 1 commit into ai-dynamo:main from Alexey-Rivkin:ci-ucx-ib-direct-nic-off

Conversation

Alexey-Rivkin commented Jun 12, 2026

Contributor

What?

Workaround (WA). Add UCX_IB_DIRECT_NIC=n to the DL CPP test env.

Why?

The DL CPP VRAM test fails with `ibv_reg_dmabuf_mr ... 524` on gb200-nvl4-ts2 nodes. UCX finds a data-direct ("direct NIC") sibling rail NIC for the GPU and exports the CUDA buffer as PCIe-mapped, but then registers it on mlx5_16, a non-sibling NIC in a different IOMMU group. The NVIDIA driver rejects the registration (FORCE_PCIE topology check), which yields error 524.

UCX_IB_DIRECT_NIC=n disables only the data-direct/direct-NIC path, which is unusable here anyway because the sibling rail NICs mlx5_0..15 are down. GPU RDMA is not disabled: the buffer registers via the standard dmabuf mapping and the test still runs RDMA over mlx5_16. Verified on ts2-65/79/88: ucp_mem_map returns Success with the flag set, while the default config fails with 524.

This is a workaround only; the proper fix is cluster-side (enable the GPU-sibling rail NICs, or set GrdmaPciTopoCheckOverride=1).

The DL CPP VRAM test fails `ibv_reg_dmabuf_mr ... Unknown error 524`
(ENOTSUPP) on the gb200-nvl4-ts2 nodes whenever mlx5_16 has a usable
RoCE GID. UCX finds a data-direct "sibling" rail NIC for the GPU
(mlx5_0..15, in the GPU PCIe subtree) and so exports the CUDA buffer as
a PCIe-mapped dma-buf, but then registers it on mlx5_16 - a different,
non-sibling NIC in a different IOMMU group. The NVIDIA driver rejects
the PCIe-mapped attach across IOMMU groups (FORCE_PCIE topology check),
giving 524. Runs only "pass" today when mlx5_16 has no GID and UCX
silently falls back to TCP - i.e. RDMA isn't actually exercised.
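
On the `Unknown error 524` wording: 524 is the Linux kernel-internal ENOTSUPP value, which has no entry in the userspace errno table, so libc cannot give it a name. A minimal Python sketch of that observation follows; the helper `describe_errno` is hypothetical and is not part of UCX or this repository, and the exact fallback message depends on the C library in use.

```python
import errno
import os

# Kernel-internal ENOTSUPP (Linux include/linux/errno.h); it is not exported
# to userspace headers, so Python's errno table has no entry for it.
KERNEL_ENOTSUPP = 524


def describe_errno(code: int) -> str:
    """Return the symbolic name if userspace knows the code, else libc's message."""
    name = errno.errorcode.get(code)
    if name is not None:
        return name
    # Falls through for codes userspace cannot name, such as 524.
    return os.strerror(code)


print(describe_errno(errno.ENOENT))     # a code userspace can name
print(describe_errno(KERNEL_ENOTSUPP))  # libc's generic fallback text
```

This is why the failure surfaces as a bare number in the log rather than a named errno.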

Setting UCX_IB_DIRECT_NIC=n disables sibling detection, so the buffer is
exported with the default mapping and registers successfully on mlx5_16.
Verified on ts2-65/79/88: default config fails 524, UCX_IB_DIRECT_NIC=n
passes. Lets CI exercise real RDMA on the GB200 fleet instead of dodging
it. Mirrors the existing UCX_IB_REG_METHODS knob on the same step.
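
As a rough illustration of the change, the sketch below launches a child process with `UCX_IB_DIRECT_NIC=n` added to its environment, which is the usual way a CI step passes a UCX setting to a test binary. The helper `with_ucx_override` and the stand-in child command are hypothetical; the actual CI step and test binary are not shown in this PR text.

```python
import os
import subprocess
import sys


def with_ucx_override(base_env=None):
    """Return a copy of the environment with the direct-NIC path disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_IB_DIRECT_NIC"] = "n"  # setting taken from this PR
    return env


# Stand-in for the real test binary: a child that only echoes the variable,
# showing that the override reaches the process that would load UCX.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['UCX_IB_DIRECT_NIC'])"],
    env=with_ucx_override(),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # prints "n"
```

Copying the environment instead of mutating `os.environ` keeps the override scoped to the one test step.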
@github-actions


👋 Hi Alexey-Rivkin! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@Alexey-Rivkin

Contributor Author

/build

Alexey-Rivkin changed the title from "CI: disable UCX direct-NIC on DL CPP test to avoid dmabuf 524" to "CI/WA: disable UCX direct-NIC on DL CPP test to avoid dmabuf 524" on Jun 12, 2026

@Alexey-Rivkin

Contributor Author

Proven insufficient, as GDAKI ignores the UCX_IB_DIRECT_NIC flag.
