Skip to content

[TENT] Fix CUDA event leak in nvlink/mnnvl transports#2507

Open
jfeng18 wants to merge 2 commits into
kvcache-ai:mainfrom
jfeng18:fix/tent-cuda-event-leak
Open

[TENT] Fix CUDA event leak in nvlink/mnnvl transports#2507
jfeng18 wants to merge 2 commits into
kvcache-ai:mainfrom
jfeng18:fix/tent-cuda-event-leak

Conversation

@jfeng18

@jfeng18 jfeng18 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Description

startTransfer() in the NVLink and MNNVL TENT transports creates a cudaEvent per submit (cudaEventCreateWithFlags) and assigns it to every task in that submit, but nothing ever calls cudaEventDestroy — so each submit leaks one event for the lifetime of the process. (freeSubBatch only Slab-deallocated the sub-batch; the streams are intentionally pool-managed, but the manually-created events had no owner.) Confirmed there is no cudaEventDestroy anywhere under tent/.

Fix:

  • Add a SubBatch-level std::vector<cudaEvent_t> completion_events. startTransfer pushes the event there once per successful create.
  • freeSubBatch destroys every event in that vector. Because submitTransferTasks can be called multiple times on one SubBatch, a SubBatch may own several distinct events — the vector frees all of them, each exactly once. The event is shared across a submit's tasks, so it is stored once (keyed by submit, not by task), which avoids the double-free a per-task destroy would cause.
  • Also check the cudaEventCreateWithFlags return value: on failure, mark the tasks FAILED and return without recording/storing an uninitialized handle (previously the garbage handle was assigned to tasks and later polled).
  • Added the explicit #include <vector> the two headers were relying on transitively.

This is a follow-up to the discussion on #2505 (the events were flagged there; that PR was scoped to the status-staleness fix).

Module

  • Transfer Engine (mooncake-transfer-engine)

Type of Change

  • Bug fix

How Has This Been Tested?

  • Verified statically: every created event is pushed exactly once and destroyed exactly once; multiple submits each add and free their event; the create-failure path stores no handle; events are destroyed before Slab::deallocate; cudaEventDestroy on an in-flight event is legal (driver defers reclamation).
  • Not validated on hardware — I do not have NVLink / MNNVL devices locally. CI builds the relevant USE_* paths; a reviewer with the hardware can confirm at runtime.

Checklist

  • I have performed a self-review of my own code.
  • I have added tests (these paths require GPU hardware to exercise).

🤖 Generated with Claude Code

startTransfer() created a cudaEvent per submit (shared by all tasks in
that submit) but nothing ever called cudaEventDestroy, so every submit
leaked one event for the process lifetime.

Track the events in a SubBatch-level vector (one push per successful
create) and destroy them in freeSubBatch. Since submitTransferTasks can
be called multiple times on one SubBatch, a SubBatch may own several
distinct events; the vector frees all of them exactly once, avoiding the
double-free that a per-task destroy would cause (the event is shared
across a submit's tasks).

Also check the cudaEventCreateWithFlags return: on failure, mark the
tasks FAILED instead of recording/querying an uninitialized handle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tracking and proper cleanup of CUDA completion events in both MNNVL and NVLink transports to prevent resource leaks. It adds error handling for cudaEventCreateWithFlags and stores the created events in the sub-batch to destroy them during sub-batch deallocation. The reviewer feedback suggests checking the return value of cudaEventRecord to avoid potential silent data corruption and recommends leveraging RAII by defining virtual destructors for MnnvlSubBatch and NVLinkSubBatch to automatically manage the lifecycle of completion events.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

return;
}
cudaEventRecord(event, batch->async_stream.get());
batch->completion_events.push_back(event);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The return value of cudaEventRecord (on the previous line) is not checked. If cudaEventRecord fails, the event will not be recorded, but cudaEventQuery in getTransferStatus might still return cudaSuccess immediately (since a newly created event is considered complete/signaled), leading to silent data corruption because the application thinks the transfer is complete when it is not. Please check the return value of cudaEventRecord and handle any failures appropriately.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — added a cudaEventRecord return check in 5c2d495: on failure we log, cudaEventDestroy the (un-pushed) event, mark the batch's tasks FAILED, and return, so getTransferStatus never queries an un-recorded event.

return;
}
cudaEventRecord(event, batch->async_stream.get());
batch->completion_events.push_back(event);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The return value of cudaEventRecord (on the previous line) is not checked. If cudaEventRecord fails, the event will not be recorded, but cudaEventQuery in getTransferStatus might still return cudaSuccess immediately (since a newly created event is considered complete/signaled), leading to silent data corruption because the application thinks the transfer is complete when it is not. Please check the return value of cudaEventRecord and handle any failures appropriately.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — added a cudaEventRecord return check in 5c2d495: on failure we log, cudaEventDestroy the (un-pushed) event, mark the batch's tasks FAILED, and return, so getTransferStatus never queries an un-recorded event.

CUDAStreamHandle async_stream;
// Completion events created in startTransfer (one per submit). Owned here
// and destroyed in freeSubBatch so they are not leaked.
std::vector<cudaEvent_t> completion_events;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of manually destroying the completion events in freeSubBatch, we can leverage RAII by defining a virtual destructor for MnnvlSubBatch. This ensures that the events are automatically and safely destroyed when the sub-batch is deallocated, preventing potential leaks and improving maintainability.

Suggested change
std::vector<cudaEvent_t> completion_events;
std::vector<cudaEvent_t> completion_events;
virtual ~MnnvlSubBatch() {
for (auto event : completion_events) {
cudaEventDestroy(event);
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopted in 5c2d495 — added the destructor as suggested. Confirmed Slab<T>::deallocate() calls object->~T() and is the sole free path for this type, so events are released exactly once per free; the manual loop in freeSubBatch is removed.

Comment on lines +157 to +161
// Destroy completion events created in startTransfer (each pushed exactly
// once, so no double-destroy). Streams are pool-managed and not destroyed.
for (auto event : mnnvl_batch->completion_events) {
cudaEventDestroy(event);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

With the destructor added to MnnvlSubBatch, we no longer need to manually destroy the completion events here. We can safely remove this loop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 5c2d495cudaEventDestroy now lives in the ~SubBatch() destructor, which Slab<T>::deallocate() invokes, so this manual loop is no longer needed.

CUDAStreamHandle async_stream;
// Completion events created in startTransfer (one per submit). Owned here
// and destroyed in freeSubBatch so they are not leaked.
std::vector<cudaEvent_t> completion_events;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of manually destroying the completion events in freeSubBatch, we can leverage RAII by defining a virtual destructor for NVLinkSubBatch. This ensures that the events are automatically and safely destroyed when the sub-batch is deallocated, preventing potential leaks and improving maintainability.

Suggested change
std::vector<cudaEvent_t> completion_events;
std::vector<cudaEvent_t> completion_events;
virtual ~NVLinkSubBatch() {
for (auto event : completion_events) {
cudaEventDestroy(event);
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopted in 5c2d495 — added the destructor as suggested. Confirmed Slab<T>::deallocate() calls object->~T() and is the sole free path for this type, so events are released exactly once per free; the manual loop in freeSubBatch is removed.

Comment on lines +118 to +122
// Destroy completion events created in startTransfer (each pushed exactly
// once, so no double-destroy). Streams are pool-managed and not destroyed.
for (auto event : shm_batch->completion_events) {
cudaEventDestroy(event);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

With the destructor added to NVLinkSubBatch, we no longer need to manually destroy the completion events here. We can safely remove this loop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 5c2d495cudaEventDestroy now lives in the ~SubBatch() destructor, which Slab<T>::deallocate() invokes, so this manual loop is no longer needed.

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Move cudaEventDestroy into ~NVLinkSubBatch()/~MnnvlSubBatch(); Slab<T>::deallocate calls ~T() (the sole free path), so events are released exactly once per free without a manual loop.

Check cudaEventRecord return: on failure, destroy the event, mark tasks FAILED, and return so getTransferStatus never queries an un-recorded event.

Addresses Gemini review comments on kvcache-ai#2507.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@qoderai

qoderai Bot commented Jun 17, 2026

Copy link
Copy Markdown

I’ve reviewed the PR and left inline comments on the CUDA event lifetime and error-handling paths; overall the leak fix looks correct with a couple of minor, non-blocking robustness nits.


🤖 Generated by QoderView workflow run

@qoderai qoderai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this PR cleanly fixes the CUDA event leak in both MNNVL and NVLink TENT transports and tightens error handling around event creation/recording.

What looks good

  • Each submit now creates at most one completion event, stores it once per sub-batch in completion_events, and relies on ~MnnvlSubBatch / ~NVLinkSubBatch (invoked via Slab<T>::deallocate) to destroy all events, eliminating the previous leak without introducing double-destroys.
  • cudaEventCreateWithFlags and cudaEventRecord are both checked; on failure, the code logs, destroys any unregistered event, marks all tasks in the submit as FAILED, and returns, so getTransferStatus never queries an unrecorded event.
  • The NVLink and MNNVL paths are now symmetric in behavior and ownership, which makes future maintenance less error-prone.

Minor nits / suggestions (non-blocking)

  • In the cudaEventRecord failure paths, after calling cudaEventDestroy(event) you could optionally clear task->completion_event (e.g. set to nullptr) for each task, to guard against any future code paths that might look at the handle even when status_word is already FAILED.
  • If Transport::SubBatch has or grows a virtual destructor, consider marking the sub-batch destructors as override for extra clarity: virtual ~MnnvlSubBatch() override { ... } / virtual ~NVLinkSubBatch() override { ... }.

With those small nits in mind, the core behavior and resource management in this change look correct and robust.


🤖 Generated by QoderView workflow run

Comment on lines +49 to +55
// Completion events created in startTransfer (one per submit). Destroyed by
// the destructor (RAII); Slab<T>::deallocate() invokes ~MnnvlSubBatch()
// before reusing the storage, so this runs on every free.
std::vector<cudaEvent_t> completion_events;
~MnnvlSubBatch() {
for (auto event : completion_events) cudaEventDestroy(event);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RAII approach for completion events in MnnvlSubBatch looks good: events are created at submit time, stored once per submit in completion_events, and ~MnnvlSubBatch walks that vector and destroys each handle exactly once when the Slab deallocates the sub-batch.

Given that Transport::SubBatch already has a virtual method (size()), adding a virtual destructor here is appropriate and ensures the cleanup runs even if a SubBatchRef is held polymorphically. I don’t see any additional ownership paths for these events, so this design is sound.

Very small style nit: if there is (or ever will be) a non-trivial base-class destructor in Transport::SubBatch, explicitly marking this one as override could make the intent even clearer: virtual ~MnnvlSubBatch() override { ... }. Not a blocker either way.


🤖 Generated by QoderFix in Qoder

Comment on lines +52 to +58
// Completion events created in startTransfer (one per submit). Destroyed by
// the destructor (RAII); Slab<T>::deallocate() invokes ~NVLinkSubBatch()
// before reusing the storage, so this runs on every free.
std::vector<cudaEvent_t> completion_events;
~NVLinkSubBatch() {
for (auto event : completion_events) cudaEventDestroy(event);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in the MNNVL header: the RAII ownership of completion events in NVLinkSubBatch is well defined and matches how the events are created and stored at submit time.

Given the polymorphic use of Transport::SubBatch, having a virtual destructor here ensures the events are always destroyed when the Slab frees the sub-batch.

Optional style nit: consider declaring this destructor as override if Transport::SubBatch has or will grow a virtual destructor, to make the inheritance relationship explicit: virtual ~NVLinkSubBatch() override { ... }.


🤖 Generated by QoderFix in Qoder

for (auto *task : tasks) task->status_word = TransferStatusEnum::FAILED;
return;
}
batch->completion_events.push_back(event);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MnnvlSubBatch event lifetime looks correct: each submit creates exactly one event, pushes it into completion_events, and the RAII destructor in the header destroys all stored events on Slab deallocation, so there is no leak and no double-destroy.

For error handling, both cudaEventCreateWithFlags and cudaEventRecord now check return codes, log failures, destroy the (un-pushed) event on record failure, and mark all tasks in the submit as FAILED, which prevents getTransferStatus from ever querying an unrecorded event. This closes the silent data corruption hole that would come from treating a never-recorded event as immediately complete.

One small suggestion: consider also clearing task->completion_event (e.g. setting it to nullptr) on the failure paths after destroying the event, to avoid any accidental reuse of a stale handle if future code paths start checking it before status_word.


🤖 Generated by QoderFix in Qoder

for (auto* task : tasks) task->status_word = TransferStatusEnum::FAILED;
return;
}
batch->completion_events.push_back(event);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NVLink path mirrors the MNNVL fix and looks consistent: one event per submit is created, recorded on async_stream, stored in completion_events, and freed once via the NVLinkSubBatch destructor on Slab deallocation.

The new error handling for both cudaEventCreateWithFlags and cudaEventRecord prevents unrecorded events from being queried and correctly marks all tasks as FAILED, avoiding the silent completion issue.

Same minor nit as on the MNNVL side: after cudaEventDestroy in the cudaEventRecord failure path, clearing task->completion_event (e.g. to nullptr) would make it more robust against future changes that might accidentally rely on the handle even when status_word has been set to FAILED.


🤖 Generated by QoderFix in Qoder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants