[Store] add codec inference and recursive structure expansion#2521
[Store] add codec inference and recursive structure expansion#2521zxpdemonio wants to merge 3 commits into
Conversation
Add the first layer of DataProto-style structured object encoding (split from kvcache-ai#2050): type-aware codec selection and recursive dict/list expansion to leaf columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces codec inference and recursive structure expansion for structured object stores, allowing automatic detection of optimal codecs (such as PyTorch tensors, numeric sequences, JSON, or media) and recursive traversal of nested dictionary and list structures. The review feedback highlights several key robustness improvements: catching potential OverflowError exceptions when evaluating extremely large integers with NumPy, allowing None elements during list expansion to prevent unnecessary fallback to pickle, and verifying uniform dimensions (ndim) for PyTorch tensors to avoid downstream encoding failures.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…sion - _can_tensor: reject mixed-ndim tensor columns - _can_numeric_scalar: catch OverflowError from np.result_type - _try_expand_list: allow None items in list expansion - Add tests for all three fixes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _can_numeric_sequence: catch OverflowError from np.result_type - _can_json: catch OverflowError and RecursionError from json.dumps Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Description
Background & Motivation
In RLHF (Reinforcement Learning from Human Feedback) training pipelines, the actor and learner stages exchange rollout data through mooncake store. A single rollout batch (typically represented as a
DataProtoobject) contains heterogeneous fields: dense tensors for token IDs and attention masks, ragged tensors for variable-length responses, byte arrays for serialized reward signals, strings for prompts, JSON metadata, and nested dicts/lists that group these fields hierarchically.Today, the caller must manually serialize each field, choose an appropriate wire format, and manage the Mooncake put/get lifecycle per field. This is error-prone, verbose, and prevents Mooncake from applying type-specific optimizations (e.g., zero-copy tensor transfer via registered buffers vs. pickle fallback for opaque Python objects).
PR #2050 introduced a complete DataProto transfer framework that automates this: it inspects the runtime structure of a rollout batch, recursively decomposes nested dicts/lists into flat leaf columns, selects the best codec for each leaf, and encodes/decodes them through Mooncake's structured object store. That PR was large (~3000 lines) and is being split into reviewable pieces by functional layer.
What this PR does (PR 3 of the #2050 series)
This PR adds the first two layers of the encoding pipeline — the parts that run before any actual serialization or I/O:
Recursive structure expansion (
infer_structure): Given a batch of N rows where each row is an arbitrary Python value (dict, list, tensor, string, etc.), recursively decomposes dict and short-list rows into child columns. For example, N rows of{"tokens": tensor, "meta": {"step": int, "lr": float}}become three flat leaf columns:tokens,meta.step, andmeta.lr. Interior nodes (dict expansions, list expansions) are recorded for later reconstruction.Codec inference (
_choose_leaf_codec): For each leaf column (a list of N homogeneous-typed values), selects the optimal storage codec by trying type predicates in priority order:ragged_tensor→media_list_ragged→typed_ragged→ndarray→bytes_ragged→media_bytes→utf8_ragged→json_ragged→pickle_ragged_fallback.The decision includes the codec name and any type metadata (e.g., tensor dtype, numpy result dtype) needed by the encoder in the next PR.
What this PR does NOT do
put_structured_object/materialize(PR 5)This is a pure-function layer with no side effects, making it independently testable and reviewable.
Split plan from #2050
Design decisions
infer_structuretries dict/list expansion before calling_choose_leaf_codec. Otherwise,[{"x": 1}, {"x": 2}]would be accepted asjson_raggedinstead of being expanded into a scalar columnxwithndarraycodec — losing the opportunity for zero-copy numeric transfer._try_expand_dict/_try_expand_listreturn computed data: The expansion check and the actual expansion need the same derived values (sorted keys, max length, per-row lengths). These functions return the computed data on success (list[str]or(int, list[int])) andNoneon rejection, avoiding a second traversal ininfer_structure._check_allhelper: Five of the eight codec predicates follow an identical "filter nulls → type-check all → accept/reject" pattern._check_allfactors this out; the remaining three predicates (_can_tensor,_can_numeric_sequence,_can_numeric_scalar) have extra logic (dtype uniformity check,np.result_typepromotion) and stay as standalone functions.ValueError. Prevents stack overflow on pathological inputs.infer_structureare underscore-prefixed —_CodecDecision,_InferredLeaf,_InferredNode,_choose_leaf_codec, etc.Module
mooncake-wheel)Type of Change
How Has This Been Tested?
Test commands:
Test results (18 tests):
Codec selection:
Structure expansion:
Checklist
AI Assistance Disclosure
Claude Opus 4.6 assisted with code generation, review (3 parallel agents covering simplicity/safety/performance), and test writing. All changes reviewed and validated by the human submitter.