[Bugfix] Environment pre-check code adaptation#1006
Conversation
ygwpz
left a comment
There was a problem hiding this comment.
Follow-up Review Summary
✅ Previous Issue Addressed
The encoding issue with the corrupted arrow character ("~R" instead of "→") has been fixed by replacing it with the word "to". Good fix!
🟠 New Issues Found
I found a few issues in the newly introduced bandwidth check implementation:
-
L3 - Global multiprocessing.Manager() at module level: The
dump_bw_listandload_bw_listare created at module level, causing unnecessary server process spawn and state accumulation across multiple runs. -
L4 - Resource leak: mmap objects in
make_array()are not explicitly closed. -
L4 - Exception swallowed: In
remote_local_hccn_cards_ping_test, exception is caught but not logged. -
Minor - Unused import: The
threadingmodule is imported but no longer used.
💡 Suggestions
- Consider making bandwidth test parameters configurable via
config.yaml - Add cleanup handling for mmap objects
See inline comments for specific code locations.
0f392fc to
18b9213
Compare
ygwpz
left a comment
There was a problem hiding this comment.
The environment pre-check documentation and script updates look reasonable.
Purpose
Implement the Environment PreCheck Suite to perform comprehensive health checks for Ascend NPU clusters before model training/deployment, ensuring device availability, network connectivity, and storage performance stability.
Modifications
run_hccn_device_status_checkfunction: Implements HCCN device status detection (link neighbor, physical link, network health, IP configuration) and adds overall status aggregation logic.remote_local_hccn_cards_ping_testfunction: Supports local/remote HCCN card ping tests for NPU network connectivity verification with unified result return.Test
All tests have passed.