Skip to content

feat(datasets): support multimodal dataset items#1710

Merged
wochinge merged 24 commits into
mainfrom
feature/lfe-10289-python-sdk-changes
Jun 23, 2026
Merged

feat(datasets): support multimodal dataset items#1710
wochinge merged 24 commits into
mainfrom
feature/lfe-10289-python-sdk-changes

Conversation

@wochinge

@wochinge wochinge commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds multimodal dataset items: LangfuseMedia in input/expected_output/metadata is uploaded and replaced with a reference string on create_dataset_item.
  • get_dataset(...) resolves media reference strings to LangfuseMediaReference objects (with fetch_bytes/fetch_base64/fetch_data_uri helpers).
  • Regenerates the dataset-items + media API clients from the base branch.

Linear

Major Decisions

  • No new external dependency — hand-rolled the small JSONPath subset the API emits instead of pulling in jsonpath-ng.
  • Always resolve references — only callers of the new API receive them, so an opt-in flag added surface for no benefit.

Review Focus

  • langfuse/_utils/json_path.py — the hand-rolled parser (tests generated from jsonpath-plus).
  • Generated API client changes are hand-synced from the base branch via fern (spec not yet released).

Greptile Summary

This PR adds multimodal support to Langfuse dataset items, allowing LangfuseMedia objects to be embedded in input, expected_output, and metadata fields and uploaded trace-lessly, and adding a resolve_media_references=True option to get_dataset() that hydrates backend-provided JSONPath references into LangfuseMediaReference objects.

  • create_dataset_item now recursively discovers LangfuseMedia values via $..this`` (jsonpath-ng), uploads each one synchronously through the updated _upload_media_sync (which accepts `trace_id=None`), deduplicates by `media_id` within a single call, and replaces each value with its reference string before sending to the API.
  • get_dataset(resolve_media_references=True) passes include_media_references to the list endpoint, then walks backend-provided JSONPath annotations to replace reference strings with frozen LangfuseMediaReference dataclass instances that expose fetch_bytes(), fetch_base64(), and fetch_data_uri() helpers.
  • Generated API types (DatasetItemMediaReference, DatasetItemMediaReferenceField, DatasetItemMediaReferenceMedia) and the GetMediaUploadUrlRequest schema are updated to make trace_id and field optional, enabling trace-less uploads.

Confidence Score: 4/5

Safe to merge; all changed paths work correctly for the expected JSON-like data shapes, and the two main concerns are style/fragility rather than broken behavior.

The media-upload and hydration flows are logically sound and covered by both unit and E2E tests. The two notable findings are: _replace_json_path_value discards the return value of update() and relies implicitly on in-place mutation (works today, diverges from how _process_dataset_item_media handles the same operation); and fetch_bytes() makes a blocking HTTP call with no timeout. Neither produces wrong output in the current code paths, but both leave the door open for subtle failures on refactoring or slow network conditions.

langfuse/_client/client.py (_replace_json_path_value) and langfuse/media.py (LangfuseMediaReference.fetch_bytes) are the two spots that would benefit from a second look before shipping to customers at scale.

Sequence Diagram

sequenceDiagram
    participant User
    participant Langfuse SDK
    participant MediaManager
    participant Langfuse API
    participant Object Storage

    Note over User,Object Storage: create_dataset_item with LangfuseMedia

    User->>Langfuse SDK: create_dataset_item(input={"img": LangfuseMedia(...)})
    Langfuse SDK->>Langfuse SDK: _process_dataset_item_media()<br/>jsonpath-ng discovers LangfuseMedia nodes
    Langfuse SDK->>MediaManager: _upload_media_sync(media)
    MediaManager->>Langfuse API: media.get_upload_url(trace_id=None, field=None)
    Langfuse API-->>MediaManager: upload_url + media_id
    MediaManager->>Object Storage: PUT bytes
    Object Storage-->>MediaManager: 200 OK
    MediaManager->>Langfuse API: media.patch(media_id, upload_http_status=200)
    MediaManager-->>Langfuse SDK: done
    Langfuse SDK->>Langfuse API: dataset_items.create(input={"img": "@@@langfuseMedia:..."})
    Langfuse API-->>Langfuse SDK: DatasetItem
    Langfuse SDK-->>User: DatasetItem (reference strings)

    Note over User,Object Storage: get_dataset with resolve_media_references=True

    User->>Langfuse SDK: get_dataset(name, resolve_media_references=True)
    Langfuse SDK->>Langfuse API: dataset_items.list(includeMediaReferences=true)
    Langfuse API-->>Langfuse SDK: items with mediaReferences[] (JSONPaths + signed URLs)
    loop For each item
        Langfuse SDK->>Langfuse SDK: _hydrate_dataset_item_media_references()<br/>jsonpath-ng update per JSONPath to LangfuseMediaReference
    end
    Langfuse SDK-->>User: DatasetClient (LangfuseMediaReference in fields)
    User->>Langfuse SDK: item.input["img"].fetch_bytes()
    Langfuse SDK->>Object Storage: httpx.get(signed_url)
    Object Storage-->>Langfuse SDK: bytes
    Langfuse SDK-->>User: bytes
Loading
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
langfuse/_client/client.py:3418-3426
`_replace_json_path_value` silently discards the return value of `update()`, relying entirely on in-place mutation of `value`. This works today because API-deserialized data is always a mutable `dict` or `list`, but it diverges from the pattern used in `_process_dataset_item_media` (which correctly captures `data = match.full_path.update(data, ...)`). If `value` were ever an immutable type at a non-root path, the replacement would be silently lost without any log or exception.

```suggestion
        try:
            value = parse_jsonpath(json_path).update(value, replacement)
        except Exception as e:
            langfuse_logger.debug(
                f"Failed to hydrate dataset media reference at JSONPath {json_path}",
                exc_info=e,
            )

        return value
```

### Issue 2 of 3
langfuse/media.py:50-53
`httpx.get(self.url)` is called without a timeout, so it can block indefinitely if the signed URL is slow or the connection hangs. This is especially relevant in hot paths such as evaluation loops that call `fetch_bytes()` on each dataset item's resolved media.

```suggestion
    def fetch_bytes(self, timeout: float = 30.0) -> bytes:
        """Fetch the media content from the signed URL."""
        response = httpx.get(self.url, timeout=timeout)
        response.raise_for_status()
```

### Issue 3 of 3
langfuse/_task_manager/media_manager.py:233-260
**Silent return on incomplete media is undocumented**

`_upload_media_sync` silently returns `None` when any content field is `None`. The caller `_upload_dataset_item_media` then adds `media_id` to `uploaded_media_ids` and returns the reference string as if the upload succeeded — meaning the dataset item would store a reference to media that was never uploaded.

In practice this path is currently unreachable: the earlier check `if reference_string is None or media_id is None: raise ValueError(...)` in `_upload_dataset_item_media` ensures all content fields are populated before this method is called (because `_media_id` is derived from `_content_sha256_hash`, which requires `_content_bytes`). But the silent return offers no signal to future callers that skip that guard, and no explanation is left in a comment. A guarded `raise` or at minimum a comment explaining the invariant would make this safer.

Reviews (1): Last reviewed commit: "feat(datasets): support media references" | Re-trigger Greptile

Greptile also left 3 inline comments on this PR.

wochinge added 2 commits June 15, 2026 11:40
Adds dataset item media upload and optional get_dataset media hydration.
@github-actions

Copy link
Copy Markdown

@claude review

Comment thread langfuse/_client/client.py Outdated
Comment thread langfuse/media.py Outdated
Comment thread langfuse/_task_manager/media_manager.py Outdated
Comment thread langfuse/_client/client.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd791b64ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread langfuse/_client/client.py Outdated
Comment thread langfuse/media.py Outdated
Comment thread langfuse/_client/client.py
Comment thread langfuse/_client/client.py Outdated
Comment thread langfuse/_client/client.py
Comment thread langfuse/_client/client.py Outdated
Comment thread langfuse/_client/resource_manager.py
Comment thread langfuse/_task_manager/media_manager.py
Comment thread langfuse/media.py
Comment thread pyproject.toml Outdated
Comment thread langfuse/_client/client.py
Resolved LangfuseMediaReference items from get_dataset(resolve_media_
references=True) discarded the original @@@langfuseMedia:...@@@ string,
so feeding them back into create_dataset_item or run_experiment
serialized them as opaque dicts and orphaned the media. Persist the
reference string and emit it from both _process_dataset_item_media and
EventSerializer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_client/client.py
wochinge added a commit to langfuse/langfuse-js that referenced this pull request Jun 16, 2026
Mirror langfuse/langfuse-python#1710 to add media support to datasets:

- LangfuseMediaReference (core): a signed-URL media handle returned when
  resolving dataset media, with urlIsExpired / fetchBytes / fetchBase64 /
  fetchDataUri helpers for feeding media to LLM providers
- uploadMedia (core): reusable upload routine that works without a trace
  context; MediaService now delegates to it instead of duplicating the
  upload + backoff logic
- DatasetManager.createItem: uploads any LangfuseMedia found in input,
  expectedOutput, or metadata (deduped) and replaces it with a reference
  string before creating the item; createDatasetItem now routes here
- DatasetManager.get(resolveMediaReferences): requests includeMediaReferences
  and hydrates reference strings into LangfuseMediaReference objects using
  jsonpath-plus (eval-free, so it is safe under strict CSP / edge runtimes)
- re-export LangfuseMedia and LangfuseMediaReference from @langfuse/client

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wochinge and others added 2 commits June 16, 2026 16:25
_process_dataset_item_media only recursed into list/dict, so a
LangfuseMedia held in a tuple, set, or frozenset slipped through to
fern's encoder and got silently base64-inlined instead of uploaded.
Widen the walker to those containers and rebuild them in place.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…name

The DatasetItemMediaReferenceField enum value changed from
expected_output to expectedOutput. Decouple hydration from the wire
value by mapping the enum member to the model attribute, so the rename
(and any future one) no longer silently skips expected-output media
references.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_client/client.py Outdated
Rebuilding tuples via type(data)(iterable) breaks namedtuple/NamedTuple
subclasses, which take positional field args rather than an iterable.
Keep list/set/frozenset handling and leave tuples untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_client/resource_manager.py
get_singleton_httpx_client silently returned the first-inserted
instance, so a LangfuseMediaReference fetched from one client could go
out through another client's transport (proxy/CA/mTLS). Mirror
get_client: warn and fall back to default httpx when multiple clients
exist, and let fetch_bytes/fetch_base64/fetch_data_uri take an explicit
httpx client to deterministically honor per-client settings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_utils/serializer.py
Dataset item media uploads now send datasetId + datasetItemId + field
instead of a trace context. create_dataset_item settles the item id up
front (generating one when absent) so media can be linked before the
item exists, threads the field and resolves the dataset id for each
upload, and reuses the id for the create call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@blacksmith-sh

This comment has been minimized.

Comment thread langfuse/_client/client.py Outdated
Resolving the dataset id inside the per-media upload path fired one
datasets.get per distinct media and could orphan already-uploaded media
if a later lookup failed. Resolve it once up front, before any upload,
and thread dataset_id through the media processing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_client/client.py Outdated
Comment thread langfuse/_task_manager/media_upload_queue.py
wochinge added a commit to langfuse/langfuse-js that referenced this pull request Jun 19, 2026
The media upload endpoint now requires the (dataset, item) a media belongs
to instead of a trace context. createItem settles the item id up front (the
item need not exist yet), resolves the dataset id, and uploads each
LangfuseMedia with its field (input/expectedOutput/metadata). Mirrors
langfuse/langfuse-python#1710.
Comment thread langfuse/_client/client.py Outdated
wochinge and others added 2 commits June 19, 2026 15:21
URL-encode the dataset name in the create_dataset_item media lookup to
match the other datasets.* call sites, and add the new dataset_id /
dataset_item_id keys to the _upload_job test fixture so
_process_upload_media_job no longer KeyErrors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace each LangfuseMedia with its reference string and collect the
media to upload in a single traversal per field, then resolve the
dataset id once and upload the collected set. A plain item no longer
does any datasets.get; media items do exactly one, before any upload.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread langfuse/_client/client.py Outdated
Comment thread langfuse/_client/client.py Outdated
Comment thread langfuse/media.py Outdated
Comment thread langfuse/media.py

@hassiebp hassiebp left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Tobi - only major questions is around the added dependency

wochinge and others added 5 commits June 22, 2026 18:05
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the resolve_media_references flag on get_dataset and always hydrate
media reference strings to LangfuseMediaReference objects. Regenerate the
dataset-items and media API clients from the base branch: the
includeMediaReferences query flag is gone, mediaReferences is a
non-optional list, and the media upload field is now required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The API only emits jsonpath-plus normalized paths ($, ['key'], [int]) for
dataset item media references, so parse exactly that grammar in a small
json_path helper module instead of depending on jsonpath-ng (which also
breaks under python -OO). Tests are generated from jsonpath-plus, and the
e2e exercises multiple media across nested keys and list indices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Parametrizing with datetime.now() baked per-worker timestamps into the
test ids, so xdist workers collected different ids and aborted with a
collection error. Use fixed past/future timestamps instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wochinge wochinge requested a review from hassiebp June 22, 2026 17:06
Comment thread langfuse/_client/client.py

@hassiebp hassiebp left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

wochinge and others added 3 commits June 23, 2026 11:26
The earlier main merge kept the feature branch's generated code; sync it
to main's now-released codegen (sessions regenerated, media reference
media required again). Drop the dataset hydration's media-None guard,
which is dead now that media is non-optional.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 langfuse/api/commons/types/dataset_item.py:56-58DatasetItem.media_references is declared as required (bare pydantic.Field(), no default), so the moment a user upgrades the SDK while their Langfuse server still returns the old shape, every endpoint that parses a DatasetItem raises pydantic.ValidationError: media_references Field requiredget_dataset, api.dataset_items.get, api.dataset_items.create, create_dataset_item. The PR description acknowledges the spec is "not yet released", so this affects every existing self-hosted server today, and rolling/canary deploys after release will still see intermittent failures. One-line fix at langfuse/api/commons/types/dataset_item.py:58: pydantic.Field(default_factory=list) — the downstream _hydrate_dataset_item_media_references already does item.media_references or [], so no other code needs to change.

    Extended reasoning...

    What the bug is

    langfuse/api/commons/types/dataset_item.py:56-58 declares the new field as:

    media_references: typing_extensions.Annotated[
        typing.List[DatasetItemMediaReference], FieldMetadata(alias="mediaReferences")
    ] = pydantic.Field()

    pydantic.Field() with no default and no Optional[...] wrapper makes the field required. UniversalBaseModel.model_config sets extra="allow", which permits unknown extra keys in responses but does not relax required-field validation — missing required fields still raise.

    Step-by-step proof

    Reproduced locally against this branch:

    from langfuse.api.commons.types.dataset_item import DatasetItem
    from langfuse.api.core.pydantic_utilities import parse_obj_as
    
    parse_obj_as(DatasetItem, {
        "id": "item-1", "status": "ACTIVE",
        "input": "hello", "expectedOutput": "world", "metadata": None,
        "datasetId": "ds-1", "datasetName": "ds",
        "createdAt": "2024-01-01T00:00:00Z", "updatedAt": "2024-01-01T00:00:00Z",
        # No mediaReferences — exactly what older servers return today.
    })
    # pydantic.ValidationError: 1 validation error for DatasetItem
    # media_references
    #   Field required [type=missing, ...]

    The code path that triggers it

    Every dataset endpoint returns a DatasetItem and so hits this validation:

    • get_dataset(name)dataset_items.list() → parses each response item.
    • api.dataset_items.get(id) → single-item retrieval.
    • api.dataset_items.create(...) → parses the response of a create call.
    • create_dataset_item(...) → wraps api.dataset_items.create.

    So creating an item against an older server (which doesn't return mediaReferences) fails on the response parse even though the request itself was valid. There is no graceful degradation path.

    Why existing safeguards don't catch it

    • extra="allow" only allows additional keys; it does not make required keys optional.
    • _hydrate_dataset_item_media_references (langfuse/_client/client.py:3485) starts with the defensive media_references = item.media_references or [], signaling the author anticipated empty/None — but pydantic validation runs before this code is reached, so the guard is unreachable.
    • The new unit tests construct DatasetItem with media_references=[] explicitly (see tests/unit/test_propagate_attributes.py:2602); no test exercises the old-server response shape.

    Addressing the "intentional design" refutation

    The refuter argues the required-ness is deliberate: the commit message says "mediaReferences is a non-optional list", the file is Fern-generated, and the team has signaled coordinated server+SDK rollout. Each point has a counter:

    1. "Spec mandates the field be present." The PR description itself states "spec not yet released" — so today, the contract is "the field is not in any deployed server's response". Even when the spec ships, the SDK release will likely lead the server upgrade for self-hosted users (a well-documented Langfuse deployment model), and rolling/canary deploys will surface this as intermittent ValidationError on dataset endpoints.

    2. "Hand-editing Fern output creates maintenance debt." The maintainer's own inline comments in this PR (e.g. wochinge: "actual (non generated changes)") confirm the team already hand-syncs these files. A one-line change to default_factory=list matches existing practice and survives regeneration (Fern can emit it from a default in the spec).

    3. "Coordinated rollout is acceptable for opt-in features." media_references is not opt-in — every get_dataset / dataset_items.get call hits the new field, regardless of whether the user is using multimodal features at all. A text-only dataset workflow breaks the same as a multimodal one.

    4. "Reviewers didn't flag it." True, but reviewers also did not surface compat hazards generally on this PR — the only forward-compat concerns raised were about runtime envs (python -OO) and serializer opacity. The required-vs-optional question on a newly-added field was not part of the review thread.

    Impact

    Loud failure (raises ValidationError) on every dataset endpoint, against any server that doesn't yet emit mediaReferences. Affects:

    • Self-hosted Langfuse users until they upgrade the server.
    • Rolling/canary deploys where some replicas still return the old shape.
    • Any user who upgrades the SDK first (the typical "pip install --upgrade" path).

    How to fix

    One-line change at langfuse/api/commons/types/dataset_item.py:58:

    media_references: typing_extensions.Annotated[
        typing.List[DatasetItemMediaReference], FieldMetadata(alias="mediaReferences")
    ] = pydantic.Field(default_factory=list)

    The downstream _hydrate_dataset_item_media_references is already media_references = item.media_references or [], so it handles both an empty list and a None cleanly — no other code paths need to change.

@wochinge wochinge enabled auto-merge (squash) June 23, 2026 10:01
@wochinge wochinge merged commit 1134428 into main Jun 23, 2026
19 checks passed
@wochinge wochinge deleted the feature/lfe-10289-python-sdk-changes branch June 23, 2026 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants