Skip to content

[Store] add codec inference and recursive structure expansion#2521

Open
zxpdemonio wants to merge 3 commits into
kvcache-ai:mainfrom
openanolis:cruz/codec
Open

[Store] add codec inference and recursive structure expansion#2521
zxpdemonio wants to merge 3 commits into
kvcache-ai:mainfrom
openanolis:cruz/codec

Conversation

@zxpdemonio

@zxpdemonio zxpdemonio commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Description

Background & Motivation

In RLHF (Reinforcement Learning from Human Feedback) training pipelines, the actor and learner stages exchange rollout data through mooncake store. A single rollout batch (typically represented as a DataProto object) contains heterogeneous fields: dense tensors for token IDs and attention masks, ragged tensors for variable-length responses, byte arrays for serialized reward signals, strings for prompts, JSON metadata, and nested dicts/lists that group these fields hierarchically.

Today, the caller must manually serialize each field, choose an appropriate wire format, and manage the Mooncake put/get lifecycle per field. This is error-prone, verbose, and prevents Mooncake from applying type-specific optimizations (e.g., zero-copy tensor transfer via registered buffers vs. pickle fallback for opaque Python objects).

PR #2050 introduced a complete DataProto transfer framework that automates this: it inspects the runtime structure of a rollout batch, recursively decomposes nested dicts/lists into flat leaf columns, selects the best codec for each leaf, and encodes/decodes them through Mooncake's structured object store. That PR was large (~3000 lines) and is being split into reviewable pieces by functional layer.

What this PR does (PR 3 of the #2050 series)

This PR adds the first two layers of the encoding pipeline — the parts that run before any actual serialization or I/O:

  1. Recursive structure expansion (infer_structure): Given a batch of N rows where each row is an arbitrary Python value (dict, list, tensor, string, etc.), recursively decomposes dict and short-list rows into child columns. For example, N rows of {"tokens": tensor, "meta": {"step": int, "lr": float}} become three flat leaf columns: tokens, meta.step, and meta.lr. Interior nodes (dict expansions, list expansions) are recorded for later reconstruction.

  2. Codec inference (_choose_leaf_codec): For each leaf column (a list of N homogeneous-typed values), selects the optimal storage codec by trying type predicates in priority order: ragged_tensormedia_list_raggedtyped_raggedndarraybytes_raggedmedia_bytesutf8_raggedjson_raggedpickle_ragged_fallback.
    The decision includes the codec name and any type metadata (e.g., tensor dtype, numpy result dtype) needed by the encoder in the next PR.

What this PR does NOT do

  • No actual encoding/decoding (PR 4)
  • No integration into put_structured_object / materialize (PR 5)
  • No store I/O, no buffer registration, no network calls

This is a pure-function layer with no side effects, making it independently testable and reviewable.

Split plan from #2050

  • PR 1 — Buffer pool (local buffer backing) — Merged
  • PR 2 — Structured object interfaces (tensor APIs) — Merged
  • PR 3 (this)Codec inference + recursive structure expansionThis PR
  • PR 4 — Leaf encoders/decoders (ragged tensor, typed ragged, bytes, media, text, json, pickle) — Planned
  • PR 5 — Integration into put/materialize paths + end-to-end DataProto transfer — Planned

Design decisions

  • Expansion before codec inference: infer_structure tries dict/list expansion before calling
    _choose_leaf_codec. Otherwise, [{"x": 1}, {"x": 2}] would be accepted as json_ragged instead of being expanded into a scalar column x with ndarray codec — losing the opportunity for zero-copy numeric transfer.
  • _try_expand_dict / _try_expand_list return computed data: The expansion check and the actual expansion need the same derived values (sorted keys, max length, per-row lengths). These functions return the computed data on success (list[str] or (int, list[int])) and None on rejection, avoiding a second traversal in infer_structure.
  • _check_all helper: Five of the eight codec predicates follow an identical "filter nulls → type-check all → accept/reject" pattern. _check_all factors this out; the remaining three predicates (_can_tensor, _can_numeric_sequence, _can_numeric_scalar) have extra logic (dtype uniformity check, np.result_type promotion) and stay as standalone functions.
  • Recursion depth limit: Capped at 32 levels with a descriptive ValueError. Prevents stack overflow on pathological inputs.
  • Internal symbols: All new names except infer_structure are underscore-prefixed — _CodecDecision,
    _InferredLeaf, _InferredNode, _choose_leaf_codec, etc.

Module

  • Python Wheel (mooncake-wheel)

Type of Change

  • New feature

How Has This Been Tested?

Test commands:

PYTHONPATH=mooncake-wheel python -m pytest tests/test_put_get_tensor.py::TestCodecInference -v

Test results (18 tests):

Codec selection:

  • tensor, mixed-dtype rejection, numeric sequence, bytes, text, json, scalar, fallback, with-nulls, empty input, all-None

Structure expansion:

  • flat leaf, dict expansion, dict with None rows, nested dict, list expansion with full path/lengths verification, depth limit
  • Unit tests pass
  • No store or network dependency — runs in pure Python + numpy + torch-cpu

Checklist

  • I have performed a self-review of my own code
  • I have added tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

  • AI tools were used (specify below)

Claude Opus 4.6 assisted with code generation, review (3 parallel agents covering simplicity/safety/performance), and test writing. All changes reviewed and validated by the human submitter.

Add the first layer of DataProto-style structured object encoding
(split from kvcache-ai#2050): type-aware codec selection and recursive dict/list
expansion to leaf columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces codec inference and recursive structure expansion for structured object stores, allowing automatic detection of optimal codecs (such as PyTorch tensors, numeric sequences, JSON, or media) and recursive traversal of nested dictionary and list structures. The review feedback highlights several key robustness improvements: catching potential OverflowError exceptions when evaluating extremely large integers with NumPy, allowing None elements during list expansion to prevent unnecessary fallback to pickle, and verifying uniform dimensions (ndim) for PyTorch tensors to avoid downstream encoding failures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread mooncake-wheel/mooncake/structured_object_store.py
Comment thread mooncake-wheel/mooncake/structured_object_store.py Outdated
Comment thread mooncake-wheel/mooncake/structured_object_store.py
zxpdemonio and others added 2 commits June 17, 2026 17:31
…sion

- _can_tensor: reject mixed-ndim tensor columns
- _can_numeric_scalar: catch OverflowError from np.result_type
- _try_expand_list: allow None items in list expansion
- Add tests for all three fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- _can_numeric_sequence: catch OverflowError from np.result_type
- _can_json: catch OverflowError and RecursionError from json.dumps

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants