[TENT] Fix CUDA event leak in nvlink/mnnvl transports by jfeng18 · Pull Request #2507 · kvcache-ai/Mooncake

jfeng18 · 2026-06-16T14:48:17Z

Description

startTransfer() in the NVLink and MNNVL TENT transports creates a cudaEvent per submit (cudaEventCreateWithFlags) and assigns it to every task in that submit, but nothing ever calls cudaEventDestroy — so each submit leaks one event for the lifetime of the process. (freeSubBatch only Slab-deallocated the sub-batch; the streams are intentionally pool-managed, but the manually-created events had no owner.) Confirmed there is no cudaEventDestroy anywhere under tent/.

Fix:

Add a SubBatch-level std::vector<cudaEvent_t> completion_events. startTransfer pushes the event there once per successful create.
freeSubBatch destroys every event in that vector. Because submitTransferTasks can be called multiple times on one SubBatch, a SubBatch may own several distinct events — the vector frees all of them, each exactly once. The event is shared across a submit's tasks, so it is stored once (keyed by submit, not by task), which avoids the double-free a per-task destroy would cause.
Also check the cudaEventCreateWithFlags return value: on failure, mark the tasks FAILED and return without recording/storing an uninitialized handle (previously the garbage handle was assigned to tasks and later polled).
Added the explicit #include <vector> the two headers were relying on transitively.

This is a follow-up to the discussion on #2505 (the events were flagged there; that PR was scoped to the status-staleness fix).

Module

Transfer Engine (mooncake-transfer-engine)

Type of Change

Bug fix

How Has This Been Tested?

Verified statically: every created event is pushed exactly once and destroyed exactly once; multiple submits each add and free their event; the create-failure path stores no handle; events are destroyed before Slab::deallocate; cudaEventDestroy on an in-flight event is legal (driver defers reclamation).
Not validated on hardware — I do not have NVLink / MNNVL devices locally. CI builds the relevant USE_* paths; a reviewer with the hardware can confirm at runtime.

Checklist

I have performed a self-review of my own code.
I have added tests (these paths require GPU hardware to exercise).

🤖 Generated with Claude Code

startTransfer() created a cudaEvent per submit (shared by all tasks in that submit) but nothing ever called cudaEventDestroy, so every submit leaked one event for the process lifetime. Track the events in a SubBatch-level vector (one push per successful create) and destroy them in freeSubBatch. Since submitTransferTasks can be called multiple times on one SubBatch, a SubBatch may own several distinct events; the vector frees all of them exactly once, avoiding the double-free that a per-task destroy would cause (the event is shared across a submit's tasks). Also check the cudaEventCreateWithFlags return: on failure, mark the tasks FAILED instead of recording/querying an uninitialized handle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces tracking and proper cleanup of CUDA completion events in both MNNVL and NVLink transports to prevent resource leaks. It adds error handling for cudaEventCreateWithFlags and stores the created events in the sub-batch to destroy them during sub-batch deallocation. The reviewer feedback suggests checking the return value of cudaEventRecord to avoid potential silent data corruption and recommends leveraging RAII by defining virtual destructors for MnnvlSubBatch and NVLinkSubBatch to automatically manage the lifecycle of completion events.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-16T14:50:48Z

+        return;
+    }
    cudaEventRecord(event, batch->async_stream.get());
+    batch->completion_events.push_back(event);


The return value of cudaEventRecord (on the previous line) is not checked. If cudaEventRecord fails, the event will not be recorded, but cudaEventQuery in getTransferStatus might still return cudaSuccess immediately (since a newly created event is considered complete/signaled), leading to silent data corruption because the application thinks the transfer is complete when it is not. Please check the return value of cudaEventRecord and handle any failures appropriately.

Good catch — added a cudaEventRecord return check in 5c2d495: on failure we log, cudaEventDestroy the (un-pushed) event, mark the batch's tasks FAILED, and return, so getTransferStatus never queries an un-recorded event.

gemini-code-assist · 2026-06-16T14:50:48Z

+        return;
+    }
    cudaEventRecord(event, batch->async_stream.get());
+    batch->completion_events.push_back(event);


The return value of cudaEventRecord (on the previous line) is not checked. If cudaEventRecord fails, the event will not be recorded, but cudaEventQuery in getTransferStatus might still return cudaSuccess immediately (since a newly created event is considered complete/signaled), leading to silent data corruption because the application thinks the transfer is complete when it is not. Please check the return value of cudaEventRecord and handle any failures appropriately.

Good catch — added a cudaEventRecord return check in 5c2d495: on failure we log, cudaEventDestroy the (un-pushed) event, mark the batch's tasks FAILED, and return, so getTransferStatus never queries an un-recorded event.

gemini-code-assist · 2026-06-16T14:50:48Z

    CUDAStreamHandle async_stream;
+    // Completion events created in startTransfer (one per submit). Owned here
+    // and destroyed in freeSubBatch so they are not leaked.
+    std::vector<cudaEvent_t> completion_events;


Instead of manually destroying the completion events in freeSubBatch, we can leverage RAII by defining a virtual destructor for MnnvlSubBatch. This ensures that the events are automatically and safely destroyed when the sub-batch is deallocated, preventing potential leaks and improving maintainability.

Suggested change

std::vector<cudaEvent_t> completion_events;

std::vector<cudaEvent_t> completion_events;

virtual ~MnnvlSubBatch() {

for (auto event : completion_events) {

cudaEventDestroy(event);

}

}

Adopted in 5c2d495 — added the destructor as suggested. Confirmed Slab<T>::deallocate() calls object->~T() and is the sole free path for this type, so events are released exactly once per free; the manual loop in freeSubBatch is removed.

gemini-code-assist · 2026-06-16T14:50:49Z

+    // Destroy completion events created in startTransfer (each pushed exactly
+    // once, so no double-destroy). Streams are pool-managed and not destroyed.
+    for (auto event : mnnvl_batch->completion_events) {
+        cudaEventDestroy(event);
+    }


With the destructor added to MnnvlSubBatch, we no longer need to manually destroy the completion events here. We can safely remove this loop.

Removed in 5c2d495 — cudaEventDestroy now lives in the ~SubBatch() destructor, which Slab<T>::deallocate() invokes, so this manual loop is no longer needed.

gemini-code-assist · 2026-06-16T14:50:49Z

    CUDAStreamHandle async_stream;
+    // Completion events created in startTransfer (one per submit). Owned here
+    // and destroyed in freeSubBatch so they are not leaked.
+    std::vector<cudaEvent_t> completion_events;


Instead of manually destroying the completion events in freeSubBatch, we can leverage RAII by defining a virtual destructor for NVLinkSubBatch. This ensures that the events are automatically and safely destroyed when the sub-batch is deallocated, preventing potential leaks and improving maintainability.

Suggested change

std::vector<cudaEvent_t> completion_events;

std::vector<cudaEvent_t> completion_events;

virtual ~NVLinkSubBatch() {

for (auto event : completion_events) {

cudaEventDestroy(event);

}

}

Adopted in 5c2d495 — added the destructor as suggested. Confirmed Slab<T>::deallocate() calls object->~T() and is the sole free path for this type, so events are released exactly once per free; the manual loop in freeSubBatch is removed.

gemini-code-assist · 2026-06-16T14:50:49Z

+    // Destroy completion events created in startTransfer (each pushed exactly
+    // once, so no double-destroy). Streams are pool-managed and not destroyed.
+    for (auto event : shm_batch->completion_events) {
+        cudaEventDestroy(event);
+    }


With the destructor added to NVLinkSubBatch, we no longer need to manually destroy the completion events here. We can safely remove this loop.

Removed in 5c2d495 — cudaEventDestroy now lives in the ~SubBatch() destructor, which Slab<T>::deallocate() invokes, so this manual loop is no longer needed.

codecov-commenter · 2026-06-16T15:26:54Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Move cudaEventDestroy into ~NVLinkSubBatch()/~MnnvlSubBatch(); Slab<T>::deallocate calls ~T() (the sole free path), so events are released exactly once per free without a manual loop. Check cudaEventRecord return: on failure, destroy the event, mark tasks FAILED, and return so getTransferStatus never queries an un-recorded event. Addresses Gemini review comments on kvcache-ai#2507. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

qoderai · 2026-06-17T03:57:09Z

I’ve reviewed the PR and left inline comments on the CUDA event lifetime and error-handling paths; overall the leak fix looks correct with a couple of minor, non-blocking robustness nits.

🤖 Generated by Qoder • View workflow run

qoderai

Overall this PR cleanly fixes the CUDA event leak in both MNNVL and NVLink TENT transports and tightens error handling around event creation/recording.

What looks good

Each submit now creates at most one completion event, stores it once per sub-batch in completion_events, and relies on ~MnnvlSubBatch / ~NVLinkSubBatch (invoked via Slab<T>::deallocate) to destroy all events, eliminating the previous leak without introducing double-destroys.
cudaEventCreateWithFlags and cudaEventRecord are both checked; on failure, the code logs, destroys any unregistered event, marks all tasks in the submit as FAILED, and returns, so getTransferStatus never queries an unrecorded event.
The NVLink and MNNVL paths are now symmetric in behavior and ownership, which makes future maintenance less error-prone.

Minor nits / suggestions (non-blocking)

In the cudaEventRecord failure paths, after calling cudaEventDestroy(event) you could optionally clear task->completion_event (e.g. set to nullptr) for each task, to guard against any future code paths that might look at the handle even when status_word is already FAILED.
If Transport::SubBatch has or grows a virtual destructor, consider marking the sub-batch destructors as override for extra clarity: virtual ~MnnvlSubBatch() override { ... } / virtual ~NVLinkSubBatch() override { ... }.

With those small nits in mind, the core behavior and resource management in this change look correct and robust.

🤖 Generated by Qoder • View workflow run

qoderai · 2026-06-17T03:57:44Z

+    // Completion events created in startTransfer (one per submit). Destroyed by
+    // the destructor (RAII); Slab<T>::deallocate() invokes ~MnnvlSubBatch()
+    // before reusing the storage, so this runs on every free.
+    std::vector<cudaEvent_t> completion_events;
+    ~MnnvlSubBatch() {
+        for (auto event : completion_events) cudaEventDestroy(event);
+    }


The RAII approach for completion events in MnnvlSubBatch looks good: events are created at submit time, stored once per submit in completion_events, and ~MnnvlSubBatch walks that vector and destroys each handle exactly once when the Slab deallocates the sub-batch.

Given that Transport::SubBatch already has a virtual method (size()), adding a virtual destructor here is appropriate and ensures the cleanup runs even if a SubBatchRef is held polymorphically. I don’t see any additional ownership paths for these events, so this design is sound.

Very small style nit: if there is (or ever will be) a non-trivial base-class destructor in Transport::SubBatch, explicitly marking this one as override could make the intent even clearer: virtual ~MnnvlSubBatch() override { ... }. Not a blocker either way.

🤖 Generated by Qoder • Fix in Qoder

qoderai · 2026-06-17T03:57:45Z

+    // Completion events created in startTransfer (one per submit). Destroyed by
+    // the destructor (RAII); Slab<T>::deallocate() invokes ~NVLinkSubBatch()
+    // before reusing the storage, so this runs on every free.
+    std::vector<cudaEvent_t> completion_events;
+    ~NVLinkSubBatch() {
+        for (auto event : completion_events) cudaEventDestroy(event);
+    }


Same as in the MNNVL header: the RAII ownership of completion events in NVLinkSubBatch is well defined and matches how the events are created and stored at submit time.

Given the polymorphic use of Transport::SubBatch, having a virtual destructor here ensures the events are always destroyed when the Slab frees the sub-batch.

Optional style nit: consider declaring this destructor as override if Transport::SubBatch has or will grow a virtual destructor, to make the inheritance relationship explicit: virtual ~NVLinkSubBatch() override { ... }.

🤖 Generated by Qoder • Fix in Qoder

qoderai · 2026-06-17T03:57:50Z

+        for (auto *task : tasks) task->status_word = TransferStatusEnum::FAILED;
+        return;
+    }
+    batch->completion_events.push_back(event);


MnnvlSubBatch event lifetime looks correct: each submit creates exactly one event, pushes it into completion_events, and the RAII destructor in the header destroys all stored events on Slab deallocation, so there is no leak and no double-destroy.

For error handling, both cudaEventCreateWithFlags and cudaEventRecord now check return codes, log failures, destroy the (un-pushed) event on record failure, and mark all tasks in the submit as FAILED, which prevents getTransferStatus from ever querying an unrecorded event. This closes the silent data corruption hole that would come from treating a never-recorded event as immediately complete.

One small suggestion: consider also clearing task->completion_event (e.g. setting it to nullptr) on the failure paths after destroying the event, to avoid any accidental reuse of a stale handle if future code paths start checking it before status_word.

🤖 Generated by Qoder • Fix in Qoder

qoderai · 2026-06-17T03:57:52Z

+        for (auto* task : tasks) task->status_word = TransferStatusEnum::FAILED;
+        return;
+    }
+    batch->completion_events.push_back(event);


The NVLink path mirrors the MNNVL fix and looks consistent: one event per submit is created, recorded on async_stream, stored in completion_events, and freed once via the NVLinkSubBatch destructor on Slab deallocation.

The new error handling for both cudaEventCreateWithFlags and cudaEventRecord prevents unrecorded events from being queried and correctly marks all tasks as FAILED, avoiding the silent completion issue.

Same minor nit as on the MNNVL side: after cudaEventDestroy in the cudaEventRecord failure path, clearing task->completion_event (e.g. to nullptr) would make it more robust against future changes that might accidentally rely on the handle even when status_word has been set to FAILED.

🤖 Generated by Qoder • Fix in Qoder

jfeng18 requested review from 00fish0, alogfans, chestnut-Q, doujiang24, dtcccc and staryxchen as code owners June 16, 2026 14:48

jfeng18 mentioned this pull request Jun 16, 2026

[TENT] Fix stale getTransferStatus in nvlink/mnnvl/ascend transports #2505

Open

4 tasks

github-actions Bot added run-ci Transfer Engine labels Jun 16, 2026

gemini-code-assist Bot reviewed Jun 16, 2026

View reviewed changes

qoderai Bot reviewed Jun 17, 2026

View reviewed changes

Conversation

jfeng18 commented Jun 16, 2026

Description

Module

Type of Change

How Has This Been Tested?

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

jfeng18 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

jfeng18 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

jfeng18 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

jfeng18 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

jfeng18 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

jfeng18 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jun 16, 2026

Codecov Report

Uh oh!

qoderai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

qoderai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

qoderai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

qoderai Bot commented Jun 17, 2026 •

edited

Loading