Replies: 2 comments 2 replies
-
For serving systems, I would not expect all CPU memory growth fixes to land in one PR. In practice the symptoms all look like “RSS grows”, but the causes are usually different enough that small PRs are safer: one path may retain request objects, another may keep tokenizer/cache metadata, another may leave queues or futures referenced, and another may just be allocator fragmentation after high-water traffic.

The workflow that has worked best for me is to avoid starting from RSS alone. Split the problem into four buckets first:

For a long-running Python service I usually run a fixed replay workload and sample these at intervals:

    cat /proc/$PID/smaps_rollup
    pmap -x $PID | tail -n 20
    python -X tracemalloc ...

Inside the process, take

The fastest reproducer is usually not maximum QPS. It is a small deterministic loop that exercises one request shape thousands of times with warmup separated from measurement:

If RSS grows only with concurrency, then inspect queues, pending futures, callbacks, and per-request state that should be released when streaming finishes. If it grows with one serial request shape, diff heap snapshots by allocation site. If heap profiles are flat but RSS remains high, it is probably allocator retention or fragmentation, and the fix may be batching/lifetime changes rather than a literal missing

That is also why the PRs tend to be incremental: once one retained path is removed, the next largest growth source becomes visible. Large “fix all leaks” PRs are hard to review because they mix unrelated lifetime bugs with allocator behavior.
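As a rough illustration of the deterministic replay loop described above, here is a minimal sketch. It assumes Linux (for `/proc/<pid>/smaps_rollup`) and a zero-argument callable that issues one request. The names `measure_growth`, `read_rss_kb`, `parse_rss_kb`, and `leaky_request` are hypothetical and not part of SGLang or any library. Warmup runs before tracing starts, so steady-state caches are not counted as growth.

```python
import gc
import tracemalloc
from pathlib import Path


def parse_rss_kb(text):
    """Extract the 'Rss:' total (in kB) from smaps_rollup-style text, or None."""
    for line in text.splitlines():
        if line.startswith("Rss:"):
            return int(line.split()[1])
    return None


def read_rss_kb(pid="self"):
    """Read RSS for a process from /proc (Linux only); None if unavailable."""
    try:
        return parse_rss_kb(Path(f"/proc/{pid}/smaps_rollup").read_text())
    except OSError:
        return None


def measure_growth(handle_request, warmup=200, rounds=5, per_round=1000):
    """Replay one request shape and report per-round memory samples.

    Returns (samples, top) where samples is a list of
    (rss_kb_or_None, traced_bytes) and top holds the ten allocation
    sites that grew the most since the post-warmup baseline.
    """
    for _ in range(warmup):      # warmup: let caches and pools fill first
        handle_request()
    gc.collect()
    tracemalloc.start(10)        # keep up to 10 frames per allocation
    baseline = tracemalloc.take_snapshot()
    samples = []
    for _ in range(rounds):
        for _ in range(per_round):
            handle_request()
        gc.collect()             # drop collectable garbage before sampling
        traced, _peak = tracemalloc.get_traced_memory()
        samples.append((read_rss_kb(), traced))
    top = tracemalloc.take_snapshot().compare_to(baseline, "lineno")[:10]
    tracemalloc.stop()
    return samples, top


_retained = []  # hypothetical leak: per-request state that is never released


def leaky_request():
    """Stand-in for a real request; it retains 1 KiB per call on purpose."""
    _retained.append(bytearray(1024))


if __name__ == "__main__":
    samples, top = measure_growth(leaky_request, warmup=10, rounds=3, per_round=100)
    for rss_kb, traced in samples:
        print(f"rss_kb={rss_kb} traced_bytes={traced}")
    for stat in top[:3]:
        print(stat)
```

In a real service, `handle_request` would be replaced by a call that sends one fixed request through the path under suspicion. A serial loop like this only covers the "one serial request shape" case, so growth that needs concurrency would require a concurrent driver instead.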
-
We’ve run into similar situations: memory leaks often show up in stages, and the fixes rarely fit into a single PR. In practice, the approach is usually “fix leak sites as we find them,” especially when dealing with a large codebase with complex resource lifecycles. Early PRs tend to be scoped tightly around the most obvious offenders (e.g., a specific module or function). Once those are fixed and monitored, secondary leaks often surface that weren’t obvious before (e.g., due to increased throughput or new usage patterns). Incremental fixes help reduce risk and simplify review.

For pinpointing leaks, we monitor RSS trends for each service (Prometheus + Grafana dashboards), and drill down using tools depending on the stack:
Once a leak is suspected, we run with periodic heap dumps and correlate timestamped allocations with business logic. In Python, a quick way is to wrap suspect blocks with tracemalloc snapshots:

    import tracemalloc

    tracemalloc.start()
    # baseline snapshot, taken before the suspect block runs
    snapshot1 = tracemalloc.take_snapshot()
    # ... run workload (the suspect block) ...
    snapshot2 = tracemalloc.take_snapshot()
    # rank allocation sites by how much they grew between the two snapshots
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')
    for stat in top_stats[:10]:
        print(stat)

Fragmentation is trickier: sometimes RSS growth is not an actual leak but heap fragmentation. We’ve used jemalloc profiling and also checked whether RSS is growing while resident heap objects aren’t.

Overall, it’s iterative: fix, monitor, repeat. The incremental PRs you saw are a sign of healthy, ongoing root-cause investigation.
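The "RSS grows but heap objects don't" check can be sketched as a small classifier over periodic samples. This is only an illustration: `classify_growth` and its 5% threshold are hypothetical, and the samples are assumed to be `(rss_bytes, traced_bytes)` pairs taken after warmup, with `traced_bytes` coming from `tracemalloc.get_traced_memory()`. Because tracemalloc only sees allocations made through Python's allocators, an "rss-only" result can also mean a native leak in a C extension, not just fragmentation.

```python
def relative_growth(first, last):
    """Fractional change from first to last; guards against a zero baseline."""
    return (last - first) / max(first, 1)


def classify_growth(samples, min_growth=0.05):
    """Classify a post-warmup run from (rss_bytes, traced_bytes) samples.

    min_growth is an arbitrary illustrative threshold, not a recommended value.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (rss0, heap0), (rss1, heap1) = samples[0], samples[-1]
    rss_up = relative_growth(rss0, rss1) > min_growth
    heap_up = relative_growth(heap0, heap1) > min_growth
    if heap_up:
        # Traced Python allocations grow: diff snapshots by allocation site.
        return "heap"
    if rss_up:
        # Heap is flat but RSS is up: allocator retention, fragmentation,
        # or native memory that tracemalloc cannot see.
        return "rss-only"
    return "flat"


if __name__ == "__main__":
    print(classify_growth([(1000, 500), (1500, 505)]))
```

Comparing only the first and last sample is deliberately crude. Over a long run, a slope fitted across all samples would be less sensitive to a single noisy reading.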
-
Hi SGLang maintainers,
We are currently doing an end-to-end investigation of a CPU memory leak in our own deployment (RSS keeps growing over time), and we’d like to learn from SGLang’s debugging methodology and best practices.
We noticed several PRs that appear to be similar in nature (memory leak / memory growth / resource lifecycle type fixes), but they were landed across multiple PRs over time:
PR #17191 (commit 737a118)
PR #17400 (commit 1b97fa7)
PR #20130 (commit 07359ef)
PR #21624 (commit 9cb362f)
Could you share some context on why these seemingly similar changes were split into multiple PRs? For example:
Were earlier PRs intentionally scoped/incremental to reduce risk, and later PRs addressed additional leak sites discovered afterwards?
Or is the general principle “fix leak sites as we find them”, rather than trying to fully eliminate all leaks in one large change?
More importantly, we’d love to understand your workflow to pinpoint a CPU memory leak down to a specific file/line of code in a complex serving system. Any practical details would help a lot, e.g.:
What symptoms/signals typically trigger the investigation (RSS trend, per-process memory, per-request growth, fragmentation patterns, etc.)?
What tools do you commonly rely on (Python-level: tracemalloc / objgraph / heapy, native-level: jemalloc/tcmalloc stats, heap profiling, memray, pprof, valgrind/massif, ASan/LSan, etc.)?
How do you narrow down from “RSS keeps growing” → “which component/request path retains memory” → “which object/allocation site is leaking”?
Any recommended minimal workload pattern in SGLang to reproduce memory growth faster (e.g., high concurrency, long streaming, specific scheduling/queueing paths)?
Thanks in advance!