Why were similar CPU memory leak fixes split across multiple PRs, and how do you pinpoint leaks down to specific lines? #23242

ShenYuhan · 2026-04-20T09:38:47Z

Hi SGLang maintainers,

We are currently doing an end-to-end investigation of a CPU memory leak in our own deployment (RSS keeps growing over time), and we’d like to learn from SGLang’s debugging methodology and best practices.

We noticed several PRs that appear to be similar in nature (memory leak / memory growth / resource lifecycle type fixes), but they were landed across multiple PRs over time:

PR #17191 (commit 737a118)
PR #17400 (commit 1b97fa7)
PR #20130 (commit 07359ef)
PR #21624 (commit 9cb362f)

Could you share some context on why these seemingly similar changes were split into multiple PRs? For example:

Were earlier PRs intentionally scoped/incremental to reduce risk, and later PRs addressed additional leak sites discovered afterwards?
Or is the general principle “fix leak sites as we find them”, rather than trying to fully eliminate all leaks in one large change?

More importantly, we’d love to understand your workflow to pinpoint a CPU memory leak down to a specific file/line of code in a complex serving system. Any practical details would help a lot, e.g.:

What symptoms/signals typically trigger the investigation (RSS trend, per-process memory, per-request growth, fragmentation patterns, etc.)?
What tools do you commonly rely on (Python-level: tracemalloc / objgraph / heapy, native-level: jemalloc/tcmalloc stats, heap profiling, memray, pprof, valgrind/massif, ASan/LSan, etc.)?
How do you narrow down from “RSS keeps growing” → “which component/request path retains memory” → “which object/allocation site is leaking”?
Any recommended minimal workload pattern in SGLang to reproduce memory growth faster (e.g., high concurrency, long streaming, specific scheduling/queueing paths)?

Thanks in advance!

ZX41R · 2026-05-06T21:33:57Z

For serving systems, I would not expect all CPU memory growth fixes to land in one PR. In practice the symptoms all look like “RSS grows”, but the causes are usually different enough that small PRs are safer: one path may retain request objects, another may keep tokenizer/cache metadata, another may leave queues or futures referenced, and another may just be allocator fragmentation after high-water traffic.

The workflow that has worked best for me is to avoid starting from RSS alone. Split the problem into four buckets first:

Python objects still referenced.
Native heap allocations still referenced.
Allocator fragmentation / arenas not returned to the OS.
Non-heap memory: mmap, file cache, CUDA pinned memory, shared memory, etc.

For a long-running Python service I usually run a fixed replay workload and sample these at intervals:

cat /proc/$PID/smaps_rollup
pmap -x $PID | tail -n 20
python -X tracemalloc ...
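
The interval sampling of smaps_rollup can also be scripted. The following is a minimal sketch, assuming Linux and read access to /proc/$PID; the function names, the chosen fields, and the default interval are illustrative and do not come from SGLang:

```python
import time
from pathlib import Path


def parse_smaps_rollup(text):
    """Return {field: kB} for the 'Name:   123 kB' lines of smaps_rollup."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip the address-range header line and anything not reported in kB.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def sample_rss(pid, samples=3, interval_s=1.0):
    """Sample Rss/Pss/Anonymous (kB) of `pid` at a fixed interval (Linux only)."""
    path = Path(f"/proc/{pid}/smaps_rollup")
    rows = []
    for i in range(samples):
        fields = parse_smaps_rollup(path.read_text())
        # .get() leaves a field as None if this kernel does not report it.
        rows.append({k: fields.get(k) for k in ("Rss", "Pss", "Anonymous")})
        if i + 1 < samples:
            time.sleep(interval_s)
    return rows
```

Calling sample_rss(os.getpid()) from inside the service, or passing the server PID from a sidecar script, gives a series that can be plotted against the number of replayed requests.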

Inside the process, take tracemalloc snapshots before and after N identical requests and compare them by traceback. If retained Python objects are the cause, that gets you to a file/line quickly. If tracemalloc is flat but RSS still grows, switch to native profiling: memray, jemalloc profiling (MALLOC_CONF=prof:true,prof_active:true,lg_prof_interval:...), or tcmalloc heap profiles, depending on how the deployment is built.

The fastest reproducer is usually not maximum QPS. It is a small deterministic loop that exercises one request shape thousands of times with warmup separated from measurement:

warm up model
record baseline
run same prompt / same sampling params / same streaming mode 1000x
force gc only for measurement, not as a fix
record Python heap, native heap, smaps_rollup
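
The Python-heap part of that loop might look roughly like the sketch below. It assumes the request is handled in the same process that takes the snapshots, since tracemalloc only sees in-process allocations; `send_request` is a hypothetical zero-argument callable that issues one fixed request, and the default counts are arbitrary:

```python
import gc
import tracemalloc


def measure_growth(send_request, warmup=50, iterations=1000, top=10):
    """Replay one request shape and report which call paths grew the Python heap.

    `send_request` is a hypothetical callable: same prompt, same sampling
    params, same streaming mode on every call.
    """
    for _ in range(warmup):          # let caches and pools reach steady state
        send_request()

    tracemalloc.start(25)            # keep up to 25 frames per allocation
    gc.collect()                     # for measurement only, not as a fix
    baseline = tracemalloc.take_snapshot()

    for _ in range(iterations):
        send_request()

    gc.collect()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()

    # Group by full traceback so the report shows the call path, not one line.
    stats = after.compare_to(baseline, "traceback")
    return [s for s in stats[:top] if s.size_diff > 0]
```

Each returned entry has `size_diff`, `count_diff`, and a `traceback` whose `format()` method yields the file/line chain. Native heap and smaps_rollup readings would be recorded at the same two points.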

If RSS grows only with concurrency, then inspect queues, pending futures, callbacks, and per-request state that should be released when streaming finishes. If it grows with one serial request shape, diff heap snapshots by allocation site. If heap profiles are flat but RSS remains high, it is probably allocator retention or fragmentation, and the fix may be batching/lifetime changes rather than a literal missing del.
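
That decision order can be written down as a small triage helper. This is only a sketch of the reasoning above: the `Deltas` type, the function name, and the 8 MiB noise threshold are all hypothetical and would need tuning per deployment.

```python
from dataclasses import dataclass


@dataclass
class Deltas:
    """Growth in bytes between baseline and measurement for one run."""
    rss: int            # e.g. Rss from smaps_rollup
    python_heap: int    # e.g. tracemalloc traced-memory delta
    native_heap: int    # e.g. in-use bytes from a jemalloc/tcmalloc profile


def triage(serial, concurrent, noise=8 * 1024 * 1024):
    """Map serial-run and concurrent-run deltas to the next thing to inspect."""
    if serial.rss <= noise and concurrent.rss <= noise:
        return "no growth above noise"
    if serial.rss <= noise:
        # Growth appears only under concurrency.
        return "inspect queues, pending futures, callbacks, per-request state"
    if serial.python_heap > noise:
        return "diff Python heap snapshots by allocation site"
    if serial.native_heap > noise:
        return "diff native heap profiles by allocation site"
    # Both heaps flat while RSS grew.
    return "likely allocator retention/fragmentation (or non-heap mappings)"
```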

That is also why the PRs tend to be incremental: once one retained path is removed, the next largest growth source becomes visible. Large “fix all leaks” PRs are hard to review because they mix unrelated lifetime bugs with allocator behavior.

1 reply

ShenYuhan May 11, 2026
Author

Thanks a lot for the detailed responses. This is exactly the kind of debugging methodology we were hoping to learn from.

reallyticsai · 2026-05-08T09:08:22Z

We’ve run into similar situations: memory leaks often show up in stages, and the fixes rarely fit into a single PR. In practice, the approach is usually “fix leak sites as we find them,” especially in a large codebase with complex resource lifecycles. Early PRs tend to be scoped tightly around the most obvious offenders (e.g., a specific module or function). Once those are fixed and monitored, secondary leaks often surface that weren’t visible before (e.g., due to increased throughput or new usage patterns). Incremental fixes also reduce risk and simplify review.

For pinpointing leaks, we monitor RSS trends for each service (Prometheus + Grafana dashboards), then drill down with tools that depend on the stack:

Python:
- tracemalloc for tracking allocation sources
- objgraph to visualize reference chains
- heapy for snapshotting heap and finding object types with unusual growth
Native (C/C++):
- jemalloc/tcmalloc stats (especially their profiling extensions)
- valgrind massif for heap profiling
- gperftools for live heap snapshots
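
The “object types with unusual growth” check can be approximated with the standard library alone, before reaching for objgraph or heapy. The sketch below is a stdlib stand-in, not the API of either tool, and the function names are illustrative. It only sees GC-tracked container objects, so plain ints, strings, and bytes are not counted:

```python
import gc
from collections import Counter


def type_counts():
    """Count live, GC-tracked objects by type name."""
    gc.collect()                 # drop unreachable cycles before counting
    return Counter(type(o).__name__ for o in gc.get_objects())


def type_growth(before, after, limit=10):
    """Return [(type_name, delta)] for types whose count grew, largest first."""
    growth = after - before      # Counter subtraction keeps positive deltas only
    return growth.most_common(limit)
```

Taking `type_counts()` before and after a batch of identical requests and passing both to `type_growth` shows which classes accumulate. A reference-chain tool can then explain why they are still reachable.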

Once a leak is suspected, we run with periodic heap dumps and correlate timestamped allocations with business logic. In Python, a quick way is to wrap suspect blocks with tracemalloc snapshots:

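import tracemalloc

tracemalloc.start()  # begin tracing Python allocations
snapshot1 = tracemalloc.take_snapshot()  # baseline, taken before the suspect block
# ... run the suspect block / workload here ...
snapshot2 = tracemalloc.take_snapshot()
# Rank source lines by how much their allocations changed between snapshots
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_stats[:10]:
    print(stat)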

Fragmentation is trickier—sometimes it’s not an actual leak but heap fragmentation. We’ve used jemalloc profiling and also checked if RSS is growing but resident heap objects aren’t. Overall, it’s iterative: fix, monitor, repeat. The incremental PRs you saw are a sign of healthy, ongoing root-cause investigation.

1 reply

ShenYuhan May 11, 2026
Author

Thanks a lot for the detailed explanations — this is very helpful.
