Feature Request: Async invocation mode for pipeline deployment /invoke (return run_id immediately, poll for status) #4865

Description

@eliottiti

Contact Details [Optional]

eliott.iticsohn@brevo.com

Feature Description

Add an asynchronous invocation mode to the pipeline deployment HTTP server. In async mode, POST /invoke would return a run_id (HTTP 202) immediately after enqueueing the run, instead of blocking until the pipeline finishes. Clients would then poll a status endpoint to retrieve progress and outputs.

Today, /invoke is sync-only. The timeout field already accepted by BaseDeploymentInvocationRequest is silently dropped on the server: src/zenml/deployers/server/service.py:355

# Unused parameters for future implementation
_ = request.run_name, request.timeout

The parent feature #3928 — Pipeline Serving (Deploy Pipelines as Always-Warm HTTP Endpoints) listed POST /invoke (sync/async) in its MVP description, but only the sync path was delivered before the issue was closed. This is a follow-up scoped to the missing async path.

Problem or Use Case

Pipeline deployments can take several minutes to execute. With sync-only /invoke, the caller must hold an HTTP connection open for the entire pipeline duration, which:

  • Risks hitting idle/read timeouts at any load balancer, reverse proxy, or ingress in the path; multi-minute waits are brittle and operationally painful.
  • Forces callers to pick between long client-side timeouts (risk of mid-flight termination by any intermediary) and short timeouts (premature failure of in-flight runs).
  • Provides no run_id until completion, so callers cannot expose progress to end users, deduplicate concurrent invocations, implement "run already in progress, here is its ID" semantics, or correlate logs and metrics with a specific execution before it ends.
  • Couples client lifecycle to server lifecycle: a client crash, network blip, or scale-down kills observability of an otherwise healthy run.

Cold-start alternatives (snapshot-based runs) impose a 1–3 min image-pull + ZenML bootstrap penalty per request, which is the trade-off that always-warm deployments were introduced to avoid in the first place.

Proposed Solution

Follow standard practice for long-running HTTP APIs (RFC 7231 §6.3.3, RFC 7240 Prefer: respond-async):

  1. Async opt-in on POST /invoke, via either a request body flag (async: true) or an HTTP header (Prefer: respond-async). Sync remains the default — non-breaking.
  2. 202 Accepted response carrying the run_id and a Location header pointing to the status resource (e.g., Location: /runs/{run_id}), with a response body such as {"run_id": ..., "status": "queued"}.
  3. GET /runs/{run_id} returns the current state of the run with a small, stable status state machine — e.g. queued | running | succeeded | failed | cancelled — alongside timing fields (created_at, started_at, finished_at) and the final outputs once succeeded.
  4. Run lifecycle is server-owned: cancelling or terminating the HTTP client after a 202 must not affect the pipeline. The placeholder run is created before returning so the run_id is always queryable, including for runs that fail to start.
  5. Backpressure / queue limits surfaced via standard status codes (429 Too Many Requests with Retry-After) when the deployment cannot accept new runs.
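
To make the five points above concrete, here is a minimal in-memory sketch of the server-side behaviour. It is not ZenML code: RunRegistry, RunRecord, QueueFull, invoke_async, the max_pending limit, and the Retry-After value of 5 seconds are all hypothetical names and values chosen for illustration, and a real implementation would presumably persist state in the ZenML server rather than in process memory.

```python
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Dict, Optional, Tuple


class QueueFull(Exception):
    """No capacity for another run; the HTTP layer would answer 429."""


@dataclass
class RunRecord:
    run_id: str
    status: str = "queued"  # queued | running | succeeded | failed
    created_at: float = field(default_factory=time.time)
    started_at: Optional[float] = None
    finished_at: Optional[float] = None
    outputs: Optional[Dict[str, Any]] = None
    error: Optional[str] = None


class RunRegistry:
    """Hypothetical registry: runs are owned by the server, not the request."""

    def __init__(self, max_pending: int = 4, workers: int = 2) -> None:
        self._runs: Dict[str, RunRecord] = {}
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._max_pending = max_pending

    def submit(
        self, fn: Callable[..., Dict[str, Any]], params: Dict[str, Any]
    ) -> RunRecord:
        with self._lock:
            pending = sum(
                r.status in ("queued", "running") for r in self._runs.values()
            )
            if pending >= self._max_pending:
                raise QueueFull("deployment cannot accept new runs")
            # Register the record before returning so the run_id is queryable.
            record = RunRecord(run_id=str(uuid.uuid4()))
            self._runs[record.run_id] = record
            snapshot = replace(record)  # copy taken while still "queued"
        self._pool.submit(self._execute, record, fn, params)
        return snapshot

    def _execute(
        self, record: RunRecord, fn: Callable[..., Dict[str, Any]],
        params: Dict[str, Any],
    ) -> None:
        with self._lock:
            record.status = "running"
            record.started_at = time.time()
        try:
            outputs, status, error = fn(**params), "succeeded", None
        except Exception as exc:  # any pipeline error becomes a failed run
            outputs, status, error = None, "failed", str(exc)
        with self._lock:
            record.outputs, record.error = outputs, error
            record.finished_at = time.time()
            record.status = status

    def get(self, run_id: str) -> Optional[RunRecord]:
        """What GET /runs/{run_id} would read; None maps to a 404."""
        with self._lock:
            record = self._runs.get(run_id)
            return replace(record) if record is not None else None


def invoke_async(
    registry: RunRegistry, fn: Callable[..., Dict[str, Any]],
    params: Dict[str, Any],
) -> Tuple[int, Dict[str, str], Dict[str, Any]]:
    """Return (status_code, headers, body) for an async POST /invoke."""
    try:
        record = registry.submit(fn, params)
    except QueueFull as exc:
        return 429, {"Retry-After": "5"}, {"detail": str(exc)}
    headers = {"Location": f"/runs/{record.run_id}"}
    return 202, headers, {"run_id": record.run_id, "status": record.status}
```

Because the worker pool rather than the request handler owns execution, dropping the client connection after the 202 has no effect on the run, which is the property point 4 asks for.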

Optional (out of MVP, listed for completeness): a DELETE /runs/{run_id} cancellation endpoint, and webhook delivery as a complement to polling (callback_url in the request body, server POSTs the terminal state).
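
If cancellation and webhooks are added later, the status state machine needs explicit transition rules so that a terminal run can never be moved again. The sketch below is one possible encoding. The transition table, the 202/409 response codes for DELETE, and the webhook payload shape are assumptions made for illustration and are not specified anywhere in this proposal.

```python
from typing import Any, Dict, FrozenSet, Tuple

# Allowed next states for each state (assumed; "queued" -> "failed" covers
# runs that fail to start).
ALLOWED: Dict[str, FrozenSet[str]] = {
    "queued": frozenset({"running", "failed", "cancelled"}),
    "running": frozenset({"succeeded", "failed", "cancelled"}),
    "succeeded": frozenset(),
    "failed": frozenset(),
    "cancelled": frozenset(),
}

# Terminal states are exactly those with no outgoing transition.
TERMINAL: FrozenSet[str] = frozenset(s for s, nxt in ALLOWED.items() if not nxt)


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, frozenset()):
        raise ValueError(f"illegal transition {current!r} -> {target!r}")
    return target


def cancel_response(current: str) -> Tuple[int, str]:
    """Status code and resulting state for a hypothetical DELETE /runs/{run_id}."""
    if current in TERMINAL:
        return 409, current  # already finished; nothing to cancel
    return 202, transition(current, "cancelled")


def webhook_payload(run_id: str, status: str) -> Dict[str, Any]:
    """Body the server might POST to callback_url once a run is terminal."""
    if status not in TERMINAL:
        raise ValueError("webhooks are only sent for terminal states")
    return {"run_id": run_id, "status": status}
```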

Alternatives Considered

  • Snapshot runs (zenml pipeline run): rejected — full cold-start per request, defeats the purpose of an always-warm deployment.
  • Custom startup_hook background thread looping back to /invoke: mentioned as a workaround in #4723 (Feature Request: Native Event-Driven & Transport-Agnostic Pipeline Deployment Ingest). Error-prone and not first-class (no health checks, no lifecycle management, no observability).
  • Long sync HTTP wrapped in an external orchestrator (Temporal/Argo/etc.) with heartbeats: keeps the orchestrator's worker visible but does not keep the HTTP connection itself alive, so it still relies on every infra hop tolerating multi-minute idle reads. Moves the problem rather than solving it.
  • Direct polling of the control plane via Client().get_pipeline_run(): viable for status retrieval, but only once a run_id is known — which today requires the sync call to complete. Async /invoke is the precondition that makes this pattern usable.
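
Once a run_id is available up front, the client side of the polling pattern is small. The helper below is a sketch under assumptions: wait_for_run and its parameters are hypothetical, the status fetcher is injected as a plain callable (it could wrap GET /runs/{run_id} or a control-plane lookup), and the terminal state names follow the vocabulary proposed in this issue rather than ZenML's existing run statuses.

```python
import time
from typing import Any, Callable, Dict, FrozenSet

TERMINAL_STATES: FrozenSet[str] = frozenset({"succeeded", "failed", "cancelled"})


def wait_for_run(
    fetch_status: Callable[[str], Dict[str, Any]],
    run_id: str,
    timeout_s: float = 600.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    terminal: FrozenSet[str] = TERMINAL_STATES,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the run reaches a terminal state or the timeout expires.

    Each poll is a short request, so no connection is held open for the
    duration of the pipeline. The delay doubles up to max_delay_s.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        state = fetch_status(run_id)
        if state["status"] in terminal:
            return state
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"run {run_id} still {state['status']!r}")
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

A timeout here only stops the client from waiting. The run itself keeps going on the server, and the same run_id can be polled again later.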

Additional Context

Priority

High - Critical for my use case

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

core-team: Issues that are being handled by the core team
planned: Planned for the short term

Projects

Status
In Progress

