[TENT] Fix CUDA event leak in nvlink/mnnvl transports#2507
Conversation
startTransfer() created a cudaEvent per submit (shared by all tasks in that submit) but nothing ever called cudaEventDestroy, so every submit leaked one event for the process lifetime. Track the events in a SubBatch-level vector (one push per successful create) and destroy them in freeSubBatch. Since submitTransferTasks can be called multiple times on one SubBatch, a SubBatch may own several distinct events; the vector frees all of them exactly once, avoiding the double-free that a per-task destroy would cause (the event is shared across a submit's tasks). Also check the cudaEventCreateWithFlags return: on failure, mark the tasks FAILED instead of recording/querying an uninitialized handle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces tracking and proper cleanup of CUDA completion events in both MNNVL and NVLink transports to prevent resource leaks. It adds error handling for cudaEventCreateWithFlags and stores the created events in the sub-batch to destroy them during sub-batch deallocation. The reviewer feedback suggests checking the return value of cudaEventRecord to avoid potential silent data corruption and recommends leveraging RAII by defining virtual destructors for MnnvlSubBatch and NVLinkSubBatch to automatically manage the lifecycle of completion events.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| return; | ||
| } | ||
| cudaEventRecord(event, batch->async_stream.get()); | ||
| batch->completion_events.push_back(event); |
There was a problem hiding this comment.
The return value of cudaEventRecord (on the previous line) is not checked. If cudaEventRecord fails, the event will not be recorded, but cudaEventQuery in getTransferStatus might still return cudaSuccess immediately (since a newly created event is considered complete/signaled), leading to silent data corruption because the application thinks the transfer is complete when it is not. Please check the return value of cudaEventRecord and handle any failures appropriately.
There was a problem hiding this comment.
Good catch — added a cudaEventRecord return check in 5c2d495: on failure we log, cudaEventDestroy the (un-pushed) event, mark the batch's tasks FAILED, and return, so getTransferStatus never queries an un-recorded event.
| return; | ||
| } | ||
| cudaEventRecord(event, batch->async_stream.get()); | ||
| batch->completion_events.push_back(event); |
There was a problem hiding this comment.
The return value of cudaEventRecord (on the previous line) is not checked. If cudaEventRecord fails, the event will not be recorded, but cudaEventQuery in getTransferStatus might still return cudaSuccess immediately (since a newly created event is considered complete/signaled), leading to silent data corruption because the application thinks the transfer is complete when it is not. Please check the return value of cudaEventRecord and handle any failures appropriately.
There was a problem hiding this comment.
Good catch — added a cudaEventRecord return check in 5c2d495: on failure we log, cudaEventDestroy the (un-pushed) event, mark the batch's tasks FAILED, and return, so getTransferStatus never queries an un-recorded event.
| CUDAStreamHandle async_stream; | ||
| // Completion events created in startTransfer (one per submit). Owned here | ||
| // and destroyed in freeSubBatch so they are not leaked. | ||
| std::vector<cudaEvent_t> completion_events; |
There was a problem hiding this comment.
Instead of manually destroying the completion events in freeSubBatch, we can leverage RAII by defining a virtual destructor for MnnvlSubBatch. This ensures that the events are automatically and safely destroyed when the sub-batch is deallocated, preventing potential leaks and improving maintainability.
| std::vector<cudaEvent_t> completion_events; | |
| std::vector<cudaEvent_t> completion_events; | |
| virtual ~MnnvlSubBatch() { | |
| for (auto event : completion_events) { | |
| cudaEventDestroy(event); | |
| } | |
| } |
There was a problem hiding this comment.
Adopted in 5c2d495 — added the destructor as suggested. Confirmed Slab<T>::deallocate() calls object->~T() and is the sole free path for this type, so events are released exactly once per free; the manual loop in freeSubBatch is removed.
| // Destroy completion events created in startTransfer (each pushed exactly | ||
| // once, so no double-destroy). Streams are pool-managed and not destroyed. | ||
| for (auto event : mnnvl_batch->completion_events) { | ||
| cudaEventDestroy(event); | ||
| } |
There was a problem hiding this comment.
Removed in 5c2d495 — cudaEventDestroy now lives in the ~SubBatch() destructor, which Slab<T>::deallocate() invokes, so this manual loop is no longer needed.
| CUDAStreamHandle async_stream; | ||
| // Completion events created in startTransfer (one per submit). Owned here | ||
| // and destroyed in freeSubBatch so they are not leaked. | ||
| std::vector<cudaEvent_t> completion_events; |
There was a problem hiding this comment.
Instead of manually destroying the completion events in freeSubBatch, we can leverage RAII by defining a virtual destructor for NVLinkSubBatch. This ensures that the events are automatically and safely destroyed when the sub-batch is deallocated, preventing potential leaks and improving maintainability.
| std::vector<cudaEvent_t> completion_events; | |
| std::vector<cudaEvent_t> completion_events; | |
| virtual ~NVLinkSubBatch() { | |
| for (auto event : completion_events) { | |
| cudaEventDestroy(event); | |
| } | |
| } |
There was a problem hiding this comment.
Adopted in 5c2d495 — added the destructor as suggested. Confirmed Slab<T>::deallocate() calls object->~T() and is the sole free path for this type, so events are released exactly once per free; the manual loop in freeSubBatch is removed.
| // Destroy completion events created in startTransfer (each pushed exactly | ||
| // once, so no double-destroy). Streams are pool-managed and not destroyed. | ||
| for (auto event : shm_batch->completion_events) { | ||
| cudaEventDestroy(event); | ||
| } |
There was a problem hiding this comment.
Removed in 5c2d495 — cudaEventDestroy now lives in the ~SubBatch() destructor, which Slab<T>::deallocate() invokes, so this manual loop is no longer needed.
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Move cudaEventDestroy into ~NVLinkSubBatch()/~MnnvlSubBatch(); Slab<T>::deallocate calls ~T() (the sole free path), so events are released exactly once per free without a manual loop. Check cudaEventRecord return: on failure, destroy the event, mark tasks FAILED, and return so getTransferStatus never queries an un-recorded event. Addresses Gemini review comments on kvcache-ai#2507. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I’ve reviewed the PR and left inline comments on the CUDA event lifetime and error-handling paths; overall the leak fix looks correct with a couple of minor, non-blocking robustness nits. 🤖 Generated by Qoder • View workflow run |
There was a problem hiding this comment.
Overall this PR cleanly fixes the CUDA event leak in both MNNVL and NVLink TENT transports and tightens error handling around event creation/recording.
What looks good
- Each submit now creates at most one completion event, stores it once per sub-batch in
completion_events, and relies on~MnnvlSubBatch/~NVLinkSubBatch(invoked viaSlab<T>::deallocate) to destroy all events, eliminating the previous leak without introducing double-destroys. cudaEventCreateWithFlagsandcudaEventRecordare both checked; on failure, the code logs, destroys any unregistered event, marks all tasks in the submit asFAILED, and returns, sogetTransferStatusnever queries an unrecorded event.- The NVLink and MNNVL paths are now symmetric in behavior and ownership, which makes future maintenance less error-prone.
Minor nits / suggestions (non-blocking)
- In the
cudaEventRecordfailure paths, after callingcudaEventDestroy(event)you could optionally cleartask->completion_event(e.g. set tonullptr) for each task, to guard against any future code paths that might look at the handle even whenstatus_wordis alreadyFAILED. - If
Transport::SubBatchhas or grows a virtual destructor, consider marking the sub-batch destructors asoverridefor extra clarity:virtual ~MnnvlSubBatch() override { ... }/virtual ~NVLinkSubBatch() override { ... }.
With those small nits in mind, the core behavior and resource management in this change look correct and robust.
🤖 Generated by Qoder • View workflow run
| // Completion events created in startTransfer (one per submit). Destroyed by | ||
| // the destructor (RAII); Slab<T>::deallocate() invokes ~MnnvlSubBatch() | ||
| // before reusing the storage, so this runs on every free. | ||
| std::vector<cudaEvent_t> completion_events; | ||
| ~MnnvlSubBatch() { | ||
| for (auto event : completion_events) cudaEventDestroy(event); | ||
| } |
There was a problem hiding this comment.
The RAII approach for completion events in MnnvlSubBatch looks good: events are created at submit time, stored once per submit in completion_events, and ~MnnvlSubBatch walks that vector and destroys each handle exactly once when the Slab deallocates the sub-batch.
Given that Transport::SubBatch already has a virtual method (size()), adding a virtual destructor here is appropriate and ensures the cleanup runs even if a SubBatchRef is held polymorphically. I don’t see any additional ownership paths for these events, so this design is sound.
Very small style nit: if there is (or ever will be) a non-trivial base-class destructor in Transport::SubBatch, explicitly marking this one as override could make the intent even clearer: virtual ~MnnvlSubBatch() override { ... }. Not a blocker either way.
🤖 Generated by Qoder • Fix in Qoder
| // Completion events created in startTransfer (one per submit). Destroyed by | ||
| // the destructor (RAII); Slab<T>::deallocate() invokes ~NVLinkSubBatch() | ||
| // before reusing the storage, so this runs on every free. | ||
| std::vector<cudaEvent_t> completion_events; | ||
| ~NVLinkSubBatch() { | ||
| for (auto event : completion_events) cudaEventDestroy(event); | ||
| } |
There was a problem hiding this comment.
Same as in the MNNVL header: the RAII ownership of completion events in NVLinkSubBatch is well defined and matches how the events are created and stored at submit time.
Given the polymorphic use of Transport::SubBatch, having a virtual destructor here ensures the events are always destroyed when the Slab frees the sub-batch.
Optional style nit: consider declaring this destructor as override if Transport::SubBatch has or will grow a virtual destructor, to make the inheritance relationship explicit: virtual ~NVLinkSubBatch() override { ... }.
🤖 Generated by Qoder • Fix in Qoder
| for (auto *task : tasks) task->status_word = TransferStatusEnum::FAILED; | ||
| return; | ||
| } | ||
| batch->completion_events.push_back(event); |
There was a problem hiding this comment.
MnnvlSubBatch event lifetime looks correct: each submit creates exactly one event, pushes it into completion_events, and the RAII destructor in the header destroys all stored events on Slab deallocation, so there is no leak and no double-destroy.
For error handling, both cudaEventCreateWithFlags and cudaEventRecord now check return codes, log failures, destroy the (un-pushed) event on record failure, and mark all tasks in the submit as FAILED, which prevents getTransferStatus from ever querying an unrecorded event. This closes the silent data corruption hole that would come from treating a never-recorded event as immediately complete.
One small suggestion: consider also clearing task->completion_event (e.g. setting it to nullptr) on the failure paths after destroying the event, to avoid any accidental reuse of a stale handle if future code paths start checking it before status_word.
🤖 Generated by Qoder • Fix in Qoder
| for (auto* task : tasks) task->status_word = TransferStatusEnum::FAILED; | ||
| return; | ||
| } | ||
| batch->completion_events.push_back(event); |
There was a problem hiding this comment.
The NVLink path mirrors the MNNVL fix and looks consistent: one event per submit is created, recorded on async_stream, stored in completion_events, and freed once via the NVLinkSubBatch destructor on Slab deallocation.
The new error handling for both cudaEventCreateWithFlags and cudaEventRecord prevents unrecorded events from being queried and correctly marks all tasks as FAILED, avoiding the silent completion issue.
Same minor nit as on the MNNVL side: after cudaEventDestroy in the cudaEventRecord failure path, clearing task->completion_event (e.g. to nullptr) would make it more robust against future changes that might accidentally rely on the handle even when status_word has been set to FAILED.
🤖 Generated by Qoder • Fix in Qoder
Description
startTransfer()in the NVLink and MNNVL TENT transports creates acudaEventper submit (cudaEventCreateWithFlags) and assigns it to every task in that submit, but nothing ever callscudaEventDestroy— so each submit leaks one event for the lifetime of the process. (freeSubBatchonly Slab-deallocated the sub-batch; the streams are intentionally pool-managed, but the manually-created events had no owner.) Confirmed there is nocudaEventDestroyanywhere undertent/.Fix:
std::vector<cudaEvent_t> completion_events.startTransferpushes the event there once per successful create.freeSubBatchdestroys every event in that vector. BecausesubmitTransferTaskscan be called multiple times on one SubBatch, a SubBatch may own several distinct events — the vector frees all of them, each exactly once. The event is shared across a submit's tasks, so it is stored once (keyed by submit, not by task), which avoids the double-free a per-task destroy would cause.cudaEventCreateWithFlagsreturn value: on failure, mark the tasks FAILED and return without recording/storing an uninitialized handle (previously the garbage handle was assigned to tasks and later polled).#include <vector>the two headers were relying on transitively.This is a follow-up to the discussion on #2505 (the events were flagged there; that PR was scoped to the status-staleness fix).
Module
mooncake-transfer-engine)Type of Change
How Has This Been Tested?
Slab::deallocate;cudaEventDestroyon an in-flight event is legal (driver defers reclamation).USE_*paths; a reviewer with the hardware can confirm at runtime.Checklist
🤖 Generated with Claude Code