feat(datasets): support multimodal dataset items#1710
Conversation
Adds dataset item media upload and optional get_dataset media hydration.
|
@claude review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cd791b64ee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…hon-sdk-changes # Conflicts: # uv.lock
Resolved LangfuseMediaReference items from get_dataset(resolve_media_ references=True) discarded the original @@@langfuseMedia:...@@@ string, so feeding them back into create_dataset_item or run_experiment serialized them as opaque dicts and orphaned the media. Persist the reference string and emit it from both _process_dataset_item_media and EventSerializer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mirror langfuse/langfuse-python#1710 to add media support to datasets: - LangfuseMediaReference (core): a signed-URL media handle returned when resolving dataset media, with urlIsExpired / fetchBytes / fetchBase64 / fetchDataUri helpers for feeding media to LLM providers - uploadMedia (core): reusable upload routine that works without a trace context; MediaService now delegates to it instead of duplicating the upload + backoff logic - DatasetManager.createItem: uploads any LangfuseMedia found in input, expectedOutput, or metadata (deduped) and replaces it with a reference string before creating the item; createDatasetItem now routes here - DatasetManager.get(resolveMediaReferences): requests includeMediaReferences and hydrates reference strings into LangfuseMediaReference objects using jsonpath-plus (eval-free, so it is safe under strict CSP / edge runtimes) - re-export LangfuseMedia and LangfuseMediaReference from @langfuse/client Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_process_dataset_item_media only recursed into list/dict, so a LangfuseMedia held in a tuple, set, or frozenset slipped through to fern's encoder and got silently base64-inlined instead of uploaded. Widen the walker to those containers and rebuild them in place. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…name The DatasetItemMediaReferenceField enum value changed from expected_output to expectedOutput. Decouple hydration from the wire value by mapping the enum member to the model attribute, so the rename (and any future one) no longer silently skips expected-output media references. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rebuilding tuples via type(data)(iterable) breaks namedtuple/NamedTuple subclasses, which take positional field args rather than an iterable. Keep list/set/frozenset handling and leave tuples untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
get_singleton_httpx_client silently returned the first-inserted instance, so a LangfuseMediaReference fetched from one client could go out through another client's transport (proxy/CA/mTLS). Mirror get_client: warn and fall back to default httpx when multiple clients exist, and let fetch_bytes/fetch_base64/fetch_data_uri take an explicit httpx client to deterministically honor per-client settings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dataset item media uploads now send datasetId + datasetItemId + field instead of a trace context. create_dataset_item settles the item id up front (generating one when absent) so media can be linked before the item exists, threads the field and resolves the dataset id for each upload, and reuses the id for the create call. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Resolving the dataset id inside the per-media upload path fired one datasets.get per distinct media and could orphan already-uploaded media if a later lookup failed. Resolve it once up front, before any upload, and thread dataset_id through the media processing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The media upload endpoint now requires the (dataset, item) a media belongs to instead of a trace context. createItem settles the item id up front (the item need not exist yet), resolves the dataset id, and uploads each LangfuseMedia with its field (input/expectedOutput/metadata). Mirrors langfuse/langfuse-python#1710.
URL-encode the dataset name in the create_dataset_item media lookup to match the other datasets.* call sites, and add the new dataset_id / dataset_item_id keys to the _upload_job test fixture so _process_upload_media_job no longer KeyErrors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace each LangfuseMedia with its reference string and collect the media to upload in a single traversal per field, then resolve the dataset id once and upload the collected set. A plain item no longer does any datasets.get; media items do exactly one, before any upload. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hassiebp
left a comment
There was a problem hiding this comment.
Thanks Tobi - only major questions is around the added dependency
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the resolve_media_references flag on get_dataset and always hydrate media reference strings to LangfuseMediaReference objects. Regenerate the dataset-items and media API clients from the base branch: the includeMediaReferences query flag is gone, mediaReferences is a non-optional list, and the media upload field is now required. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The API only emits jsonpath-plus normalized paths ($, ['key'], [int]) for dataset item media references, so parse exactly that grammar in a small json_path helper module instead of depending on jsonpath-ng (which also breaks under python -OO). Tests are generated from jsonpath-plus, and the e2e exercises multiple media across nested keys and list indices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Parametrizing with datetime.now() baked per-worker timestamps into the test ids, so xdist workers collected different ids and aborted with a collection error. Use fixed past/future timestamps instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The earlier main merge kept the feature branch's generated code; sync it to main's now-released codegen (sessions regenerated, media reference media required again). Drop the dataset hydration's media-None guard, which is dead now that media is non-optional. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🔴
langfuse/api/commons/types/dataset_item.py:56-58—DatasetItem.media_referencesis declared as required (barepydantic.Field(), no default), so the moment a user upgrades the SDK while their Langfuse server still returns the old shape, every endpoint that parses aDatasetItemraisespydantic.ValidationError: media_references Field required—get_dataset,api.dataset_items.get,api.dataset_items.create,create_dataset_item. The PR description acknowledges the spec is "not yet released", so this affects every existing self-hosted server today, and rolling/canary deploys after release will still see intermittent failures. One-line fix at langfuse/api/commons/types/dataset_item.py:58:pydantic.Field(default_factory=list)— the downstream_hydrate_dataset_item_media_referencesalready doesitem.media_references or [], so no other code needs to change.Extended reasoning...
What the bug is
langfuse/api/commons/types/dataset_item.py:56-58declares the new field as:media_references: typing_extensions.Annotated[ typing.List[DatasetItemMediaReference], FieldMetadata(alias="mediaReferences") ] = pydantic.Field()
pydantic.Field()with no default and noOptional[...]wrapper makes the field required.UniversalBaseModel.model_configsetsextra="allow", which permits unknown extra keys in responses but does not relax required-field validation — missing required fields still raise.Step-by-step proof
Reproduced locally against this branch:
from langfuse.api.commons.types.dataset_item import DatasetItem from langfuse.api.core.pydantic_utilities import parse_obj_as parse_obj_as(DatasetItem, { "id": "item-1", "status": "ACTIVE", "input": "hello", "expectedOutput": "world", "metadata": None, "datasetId": "ds-1", "datasetName": "ds", "createdAt": "2024-01-01T00:00:00Z", "updatedAt": "2024-01-01T00:00:00Z", # No mediaReferences — exactly what older servers return today. }) # pydantic.ValidationError: 1 validation error for DatasetItem # media_references # Field required [type=missing, ...]
The code path that triggers it
Every dataset endpoint returns a
DatasetItemand so hits this validation:get_dataset(name)→dataset_items.list()→ parses each response item.api.dataset_items.get(id)→ single-item retrieval.api.dataset_items.create(...)→ parses the response of a create call.create_dataset_item(...)→ wrapsapi.dataset_items.create.
So creating an item against an older server (which doesn't return
mediaReferences) fails on the response parse even though the request itself was valid. There is no graceful degradation path.Why existing safeguards don't catch it
extra="allow"only allows additional keys; it does not make required keys optional._hydrate_dataset_item_media_references(langfuse/_client/client.py:3485) starts with the defensivemedia_references = item.media_references or [], signaling the author anticipated empty/None — but pydantic validation runs before this code is reached, so the guard is unreachable.- The new unit tests construct
DatasetItemwithmedia_references=[]explicitly (seetests/unit/test_propagate_attributes.py:2602); no test exercises the old-server response shape.
Addressing the "intentional design" refutation
The refuter argues the required-ness is deliberate: the commit message says "mediaReferences is a non-optional list", the file is Fern-generated, and the team has signaled coordinated server+SDK rollout. Each point has a counter:
-
"Spec mandates the field be present." The PR description itself states "spec not yet released" — so today, the contract is "the field is not in any deployed server's response". Even when the spec ships, the SDK release will likely lead the server upgrade for self-hosted users (a well-documented Langfuse deployment model), and rolling/canary deploys will surface this as intermittent
ValidationErroron dataset endpoints. -
"Hand-editing Fern output creates maintenance debt." The maintainer's own inline comments in this PR (e.g. wochinge: "actual (non generated changes)") confirm the team already hand-syncs these files. A one-line change to
default_factory=listmatches existing practice and survives regeneration (Fern can emit it from a default in the spec). -
"Coordinated rollout is acceptable for opt-in features."
media_referencesis not opt-in — everyget_dataset/dataset_items.getcall hits the new field, regardless of whether the user is using multimodal features at all. A text-only dataset workflow breaks the same as a multimodal one. -
"Reviewers didn't flag it." True, but reviewers also did not surface compat hazards generally on this PR — the only forward-compat concerns raised were about runtime envs (
python -OO) and serializer opacity. The required-vs-optional question on a newly-added field was not part of the review thread.
Impact
Loud failure (raises
ValidationError) on every dataset endpoint, against any server that doesn't yet emitmediaReferences. Affects:- Self-hosted Langfuse users until they upgrade the server.
- Rolling/canary deploys where some replicas still return the old shape.
- Any user who upgrades the SDK first (the typical "pip install --upgrade" path).
How to fix
One-line change at
langfuse/api/commons/types/dataset_item.py:58:media_references: typing_extensions.Annotated[ typing.List[DatasetItemMediaReference], FieldMetadata(alias="mediaReferences") ] = pydantic.Field(default_factory=list)
The downstream
_hydrate_dataset_item_media_referencesis alreadymedia_references = item.media_references or [], so it handles both an empty list and aNonecleanly — no other code paths need to change.
Summary
LangfuseMediaininput/expected_output/metadatais uploaded and replaced with a reference string oncreate_dataset_item.get_dataset(...)resolves media reference strings toLangfuseMediaReferenceobjects (withfetch_bytes/fetch_base64/fetch_data_urihelpers).Linear
Major Decisions
jsonpath-ng.Review Focus
langfuse/_utils/json_path.py— the hand-rolled parser (tests generated from jsonpath-plus).Greptile Summary
This PR adds multimodal support to Langfuse dataset items, allowing
LangfuseMediaobjects to be embedded ininput,expected_output, andmetadatafields and uploaded trace-lessly, and adding aresolve_media_references=Trueoption toget_dataset()that hydrates backend-provided JSONPath references intoLangfuseMediaReferenceobjects.create_dataset_itemnow recursively discoversLangfuseMediavalues via$..this`` (jsonpath-ng), uploads each one synchronously through the updated_upload_media_sync(which accepts `trace_id=None`), deduplicates by `media_id` within a single call, and replaces each value with its reference string before sending to the API.get_dataset(resolve_media_references=True)passesinclude_media_referencesto the list endpoint, then walks backend-provided JSONPath annotations to replace reference strings with frozenLangfuseMediaReferencedataclass instances that exposefetch_bytes(),fetch_base64(), andfetch_data_uri()helpers.DatasetItemMediaReference,DatasetItemMediaReferenceField,DatasetItemMediaReferenceMedia) and theGetMediaUploadUrlRequestschema are updated to maketrace_idandfieldoptional, enabling trace-less uploads.Confidence Score: 4/5
Safe to merge; all changed paths work correctly for the expected JSON-like data shapes, and the two main concerns are style/fragility rather than broken behavior.
The media-upload and hydration flows are logically sound and covered by both unit and E2E tests. The two notable findings are:
_replace_json_path_valuediscards the return value ofupdate()and relies implicitly on in-place mutation (works today, diverges from how_process_dataset_item_mediahandles the same operation); andfetch_bytes()makes a blocking HTTP call with no timeout. Neither produces wrong output in the current code paths, but both leave the door open for subtle failures on refactoring or slow network conditions.langfuse/_client/client.py(_replace_json_path_value) andlangfuse/media.py(LangfuseMediaReference.fetch_bytes) are the two spots that would benefit from a second look before shipping to customers at scale.Sequence Diagram
sequenceDiagram participant User participant Langfuse SDK participant MediaManager participant Langfuse API participant Object Storage Note over User,Object Storage: create_dataset_item with LangfuseMedia User->>Langfuse SDK: create_dataset_item(input={"img": LangfuseMedia(...)}) Langfuse SDK->>Langfuse SDK: _process_dataset_item_media()<br/>jsonpath-ng discovers LangfuseMedia nodes Langfuse SDK->>MediaManager: _upload_media_sync(media) MediaManager->>Langfuse API: media.get_upload_url(trace_id=None, field=None) Langfuse API-->>MediaManager: upload_url + media_id MediaManager->>Object Storage: PUT bytes Object Storage-->>MediaManager: 200 OK MediaManager->>Langfuse API: media.patch(media_id, upload_http_status=200) MediaManager-->>Langfuse SDK: done Langfuse SDK->>Langfuse API: dataset_items.create(input={"img": "@@@langfuseMedia:..."}) Langfuse API-->>Langfuse SDK: DatasetItem Langfuse SDK-->>User: DatasetItem (reference strings) Note over User,Object Storage: get_dataset with resolve_media_references=True User->>Langfuse SDK: get_dataset(name, resolve_media_references=True) Langfuse SDK->>Langfuse API: dataset_items.list(includeMediaReferences=true) Langfuse API-->>Langfuse SDK: items with mediaReferences[] (JSONPaths + signed URLs) loop For each item Langfuse SDK->>Langfuse SDK: _hydrate_dataset_item_media_references()<br/>jsonpath-ng update per JSONPath to LangfuseMediaReference end Langfuse SDK-->>User: DatasetClient (LangfuseMediaReference in fields) User->>Langfuse SDK: item.input["img"].fetch_bytes() Langfuse SDK->>Object Storage: httpx.get(signed_url) Object Storage-->>Langfuse SDK: bytes Langfuse SDK-->>User: bytesPrompt To Fix All With AI
Reviews (1): Last reviewed commit: "feat(datasets): support media references" | Re-trigger Greptile