[Store] add codec inference and recursive structure expansion by zxpdemonio · Pull Request #2521 · kvcache-ai/Mooncake

zxpdemonio · 2026-06-17T17:05:13Z

Description

Background & Motivation

In RLHF (Reinforcement Learning from Human Feedback) training pipelines, the actor and learner stages exchange rollout data through Mooncake Store. A single rollout batch (typically represented as a DataProto object) contains heterogeneous fields: dense tensors for token IDs and attention masks, ragged tensors for variable-length responses, byte arrays for serialized reward signals, strings for prompts, JSON metadata, and nested dicts/lists that group these fields hierarchically.

Today, the caller must manually serialize each field, choose an appropriate wire format, and manage the Mooncake put/get lifecycle per field. This is error-prone, verbose, and prevents Mooncake from applying type-specific optimizations (e.g., zero-copy tensor transfer via registered buffers vs. pickle fallback for opaque Python objects).

PR #2050 introduced a complete DataProto transfer framework that automates this: it inspects the runtime structure of a rollout batch, recursively decomposes nested dicts/lists into flat leaf columns, selects the best codec for each leaf, and encodes/decodes them through Mooncake's structured object store. That PR was large (~3000 lines) and is being split into reviewable pieces by functional layer.

What this PR does (PR 3 of the #2050 series)

This PR adds the first two layers of the encoding pipeline — the parts that run before any actual serialization or I/O:

Recursive structure expansion (infer_structure): Given a batch of N rows where each row is an arbitrary Python value (dict, list, tensor, string, etc.), recursively decomposes dict and short-list rows into child columns. For example, N rows of {"tokens": tensor, "meta": {"step": int, "lr": float}} become three flat leaf columns: tokens, meta.step, and meta.lr. Interior nodes (dict expansions, list expansions) are recorded for later reconstruction.
Codec inference (_choose_leaf_codec): For each leaf column (a list of N homogeneous-typed values), selects the optimal storage codec by trying type predicates in priority order: ragged_tensor → media_list_ragged → typed_ragged → ndarray → bytes_ragged → media_bytes → utf8_ragged → json_ragged → pickle_ragged_fallback.
The decision includes the codec name and any type metadata (e.g., tensor dtype, numpy result dtype) needed by the encoder in the next PR.
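
To make the expansion step concrete, here is a minimal sketch of dict-only expansion in plain Python. It is not the PR's `infer_structure`: the helper name `expand_columns`, the dotted-path output format, and the choice to fill missing keys and None rows with None are assumptions made for illustration. List expansion and the recording of interior nodes are left out.

```python
from typing import Any

MAX_DEPTH = 32  # the PR description caps recursion at 32 levels


def expand_columns(rows: list[Any], path: str = "", depth: int = 0) -> dict[str, list[Any]]:
    """Flatten one column of rows into {dotted_path: leaf_column} (hypothetical helper)."""
    if depth > MAX_DEPTH:
        raise ValueError(f"structure nested deeper than {MAX_DEPTH} levels at {path!r}")
    present = [row for row in rows if row is not None]
    # Expand only when every non-null row is a dict with string keys.
    is_dict_column = bool(present) and all(
        isinstance(row, dict) and all(isinstance(key, str) for key in row)
        for row in present
    )
    keys = sorted({key for row in present for key in row}) if is_dict_column else []
    if not keys:
        return {path: rows}  # leaf column: nothing left to expand
    leaves: dict[str, list[Any]] = {}
    for key in keys:
        # Assumption of this sketch: None rows and missing keys become None.
        child = [None if row is None else row.get(key) for row in rows]
        child_path = f"{path}.{key}" if path else key
        leaves.update(expand_columns(child, child_path, depth + 1))
    return leaves


rows = [
    {"tokens": [1, 2, 3], "meta": {"step": 1, "lr": 0.1}},
    {"tokens": [4, 5], "meta": {"step": 2, "lr": 0.2}},
]
columns = expand_columns(rows)
print(sorted(columns))  # ['meta.lr', 'meta.step', 'tokens']
```

Each resulting leaf column is a list of N values, which is the shape the codec inference step then inspects.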

What this PR does NOT do

No actual encoding/decoding (PR 4)
No integration into put_structured_object / materialize (PR 5)
No store I/O, no buffer registration, no network calls

This is a pure-function layer with no side effects, making it independently testable and reviewable.

Split plan from #2050

PR 1 — Buffer pool (local buffer backing) — Merged
PR 2 — Structured object interfaces (tensor APIs) — Merged
PR 3 (this) — Codec inference + recursive structure expansion — This PR
PR 4 — Leaf encoders/decoders (ragged tensor, typed ragged, bytes, media, text, json, pickle) — Planned
PR 5 — Integration into put/materialize paths + end-to-end DataProto transfer — Planned

Design decisions

Expansion before codec inference: infer_structure tries dict/list expansion before calling _choose_leaf_codec. Otherwise, [{"x": 1}, {"x": 2}] would be accepted as json_ragged instead of being expanded into a scalar column x with ndarray codec, losing the opportunity for zero-copy numeric transfer.
_try_expand_dict / _try_expand_list return computed data: The expansion check and the actual expansion need the same derived values (sorted keys, max length, per-row lengths). These functions return the computed data on success (list[str] or (int, list[int])) and None on rejection, avoiding a second traversal in infer_structure.
_check_all helper: Five of the eight codec predicates follow an identical "filter nulls → type-check all → accept/reject" pattern. _check_all factors this out; the remaining three predicates (_can_tensor, _can_numeric_sequence, _can_numeric_scalar) have extra logic (dtype uniformity check, np.result_type promotion) and stay as standalone functions.
Recursion depth limit: Capped at 32 levels with a descriptive ValueError. Prevents stack overflow on pathological inputs.
Internal symbols: All new names except infer_structure are underscore-prefixed: _CodecDecision, _InferredLeaf, _InferredNode, _choose_leaf_codec, etc.
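
The "filter nulls, type-check all, accept or reject" pattern and the priority-ordered codec choice can be sketched as follows. This is an illustration only: `check_all`, `CODEC_PRIORITY`, and `choose_codec` are hypothetical names, only two of the listed codecs are modeled, and the treatment of empty or all-None columns (they fall through to the pickle fallback here) is an assumption of this sketch rather than the PR's documented behavior.

```python
from typing import Any, Callable

Predicate = Callable[[Any], bool]


def check_all(values: list[Any], predicate: Predicate) -> bool:
    """Drop None entries, then require every remaining value to pass the predicate."""
    present = [v for v in values if v is not None]
    # Assumption: a column with no non-null values matches no typed codec.
    return bool(present) and all(predicate(v) for v in present)


# Earlier entries win; the order mirrors the relative order given in the description.
CODEC_PRIORITY: list[tuple[str, Predicate]] = [
    ("bytes_ragged", lambda v: isinstance(v, (bytes, bytearray))),
    ("utf8_ragged", lambda v: isinstance(v, str)),
]


def choose_codec(values: list[Any]) -> str:
    """Return the first codec whose predicate accepts the whole column."""
    for name, predicate in CODEC_PRIORITY:
        if check_all(values, predicate):
            return name
    return "pickle_ragged_fallback"


print(choose_codec([b"ab", None, b"c"]))  # bytes_ragged
print(choose_codec(["prompt", "reply"]))  # utf8_ragged
print(choose_codec([b"ab", "mixed"]))     # pickle_ragged_fallback
```

Keeping the predicates in an ordered table makes the priority explicit and lets predicates with extra logic (such as dtype checks) sit in the same list as plain type checks.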

Module

Python Wheel (mooncake-wheel)

Type of Change

New feature

How Has This Been Tested?

Test commands:

PYTHONPATH=mooncake-wheel python -m pytest tests/test_put_get_tensor.py::TestCodecInference -v

Test results (18 tests):

Codec selection:

tensor, mixed-dtype rejection, numeric sequence, bytes, text, json, scalar, fallback, with-nulls, empty input, all-None

Structure expansion:

flat leaf, dict expansion, dict with None rows, nested dict, list expansion with full path/lengths verification, depth limit
Unit tests pass
No store or network dependency — runs in pure Python + numpy + torch-cpu

Checklist

I have performed a self-review of my own code
I have added tests to prove my changes are effective
For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

AI tools were used (specify below)

Claude Opus 4.6 assisted with code generation, review (3 parallel agents covering simplicity/safety/performance), and test writing. All changes reviewed and validated by the human submitter.

Add the first layer of DataProto-style structured object encoding (split from kvcache-ai#2050): type-aware codec selection and recursive dict/list expansion to leaf columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces codec inference and recursive structure expansion for structured object stores, allowing automatic detection of optimal codecs (such as PyTorch tensors, numeric sequences, JSON, or media) and recursive traversal of nested dictionary and list structures. The review feedback highlights several key robustness improvements: catching potential OverflowError exceptions when evaluating extremely large integers with NumPy, allowing None elements during list expansion to prevent unnecessary fallback to pickle, and verifying uniform dimensions (ndim) for PyTorch tensors to avoid downstream encoding failures.
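
One of the review points, checking that a tensor column is uniform in both dtype and ndim, can be sketched without PyTorch. The helper `uniform_tensor_column` is hypothetical, and `SimpleNamespace` objects stand in for tensors; the sketch relies only on the `.dtype` and `.ndim` attributes, which `torch.Tensor` also exposes. It is not the PR's `_can_tensor`.

```python
from types import SimpleNamespace
from typing import Any, Optional


def uniform_tensor_column(values: list[Any]) -> Optional[tuple[Any, int]]:
    """Return (dtype, ndim) if all non-null values agree on both, else None."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    expected = (present[0].dtype, present[0].ndim)
    for value in present[1:]:
        if (value.dtype, value.ndim) != expected:
            return None  # mixed dtype or mixed ndim: reject the tensor codec
    return expected


# Stand-ins for tensors (assumption: only .dtype and .ndim matter for this check).
a = SimpleNamespace(dtype="float32", ndim=2)
b = SimpleNamespace(dtype="float32", ndim=2)
c = SimpleNamespace(dtype="float32", ndim=1)

print(uniform_tensor_column([a, None, b]))  # ('float32', 2)
print(uniform_tensor_column([a, c]))        # None
```

Rejecting the column at inference time means a later encoder never sees tensors it cannot lay out in a single ragged buffer.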


…sion

- _can_tensor: reject mixed-ndim tensor columns
- _can_numeric_scalar: catch OverflowError from np.result_type
- _try_expand_list: allow None items in list expansion
- Add tests for all three fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- _can_numeric_sequence: catch OverflowError from np.result_type
- _can_json: catch OverflowError and RecursionError from json.dumps

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
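
A predicate hardened in this way might look like the sketch below. It is not the PR's `_can_json`: the name `can_json` is hypothetical, and catching TypeError and ValueError in addition to the two exceptions named in the commit message is an assumption of this sketch (in CPython, `json.dumps` raises TypeError for unsupported types and ValueError for circular references).

```python
import json
from typing import Any


def can_json(value: Any) -> bool:
    """Return True if json.dumps accepts the value, False on any encoding failure."""
    try:
        json.dumps(value)
    except (TypeError, ValueError, OverflowError, RecursionError):
        return False
    return True


def nested_list(depth: int) -> list:
    """Build [[[...]]] nested `depth` levels deep without recursing."""
    node: list = []
    for _ in range(depth):
        node = [node]
    return node


print(can_json({"step": 1, "tags": ["a", "b"]}))  # True
print(can_json({"handle": object()}))             # False (TypeError)
print(can_json(nested_list(100_000)))             # False (RecursionError)
```

Returning False instead of propagating the exception lets the caller move on to the next codec in priority order, ending at the pickle fallback.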

codecov-commenter · 2026-06-17T18:07:15Z


Codecov Report

✅ All modified and coverable lines are covered by tests.


zxpdemonio requested review from ShangmingCai and stmatengss as code owners June 17, 2026 17:05

github-actions Bot added run-ci Installation Tests labels Jun 17, 2026

gemini-code-assist Bot reviewed Jun 17, 2026


Comment thread mooncake-wheel/mooncake/structured_object_store.py

Comment thread mooncake-wheel/mooncake/structured_object_store.py Outdated

Comment thread mooncake-wheel/mooncake/structured_object_store.py

zxpdemonio and others added 2 commits June 17, 2026 17:31

[Store] harden exception handling in codec predicates

9590b2f


zxpdemonio wants to merge 3 commits into kvcache-ai:main from openanolis:cruz/codec
