Feature Request: Async invocation mode for pipeline deployment /invoke (return run_id immediately, poll for status)

### Contact Details [Optional]

eliott.iticsohn@brevo.com

### Feature Description

Add an asynchronous invocation mode to the pipeline deployment HTTP server. In async mode, `POST /invoke` would return a `run_id` (HTTP 202) immediately after enqueueing the run, instead of blocking until the pipeline finishes. Clients would then poll a status endpoint to retrieve progress and outputs.

Today, `/invoke` is sync-only. The `timeout` field already accepted by `BaseDeploymentInvocationRequest` is silently dropped on the server, as is `run_name` (`src/zenml/deployers/server/service.py:355`):

```python
# Unused parameters for future implementation
_ = request.run_name, request.timeout
```

The parent feature **#3928 — Pipeline Serving (Deploy Pipelines as Always-Warm HTTP Endpoints)** listed `POST /invoke (sync/async)` in its MVP description, but only the sync path was delivered before the issue was closed. This is a follow-up scoped to the missing async path.

### Problem or Use Case

Pipeline deployments can take several minutes to execute. With sync-only `/invoke`, the caller must hold an HTTP connection open for the entire pipeline duration, which:

- Runs into idle/read timeouts at every load balancer, reverse proxy, or ingress in the path; multi-minute waits are brittle and operationally painful.
- Forces callers to pick between long client-side timeouts (risk of mid-flight termination by any intermediary) and short timeouts (premature failure of in-flight runs).
- Provides no `run_id` until completion, so callers cannot expose progress to end users, deduplicate concurrent invocations, implement "run already in progress, here is its ID" semantics, or correlate logs and metrics with a specific execution before it ends.
- Couples client lifecycle to server lifecycle: a client crash, network blip, or scale-down kills observability of an otherwise healthy run.

Cold-start alternatives (snapshot-based runs) impose a 1–3 min image-pull + ZenML bootstrap penalty per request, which is the trade-off that always-warm deployments were introduced to avoid in the first place.

### Proposed Solution

Follow standard practice for long-running HTTP APIs (`202 Accepted` per RFC 7231 §6.3.3, now RFC 9110 §15.3.3; `Prefer: respond-async` per RFC 7240):

1. **Async opt-in on `POST /invoke`**, via either a request body flag (`async: true`) or an HTTP header (`Prefer: respond-async`). Sync remains the default — non-breaking.
2. **`202 Accepted`** response carrying the `run_id` and a `Location` header pointing to the status resource (e.g., `Location: /runs/{run_id}`), with a response body such as `{"run_id": ..., "status": "queued"}`.
3. **`GET /runs/{run_id}`** returns the current state of the run with a small, stable status state machine — e.g. `queued | running | succeeded | failed | cancelled` — alongside timing fields (`created_at`, `started_at`, `finished_at`) and the final outputs once `succeeded`.
4. **Run lifecycle is server-owned**: cancelling or terminating the HTTP client after a 202 must not affect the pipeline. The placeholder run is created before returning so the `run_id` is always queryable, including for runs that fail to start.
5. **Backpressure / queue limits** surfaced via standard status codes (`429 Too Many Requests` with `Retry-After`) when the deployment cannot accept new runs.
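To make points 1 to 5 concrete, here is a minimal, framework-agnostic sketch of the server side. It assumes an in-process thread pool and an in-memory run table. `AsyncRunRegistry`, `invoke`, `get_run`, `max_active` and `retry_after_s` are hypothetical names, not ZenML APIs. Handlers return `(status, headers, body)` tuples that would need adapting to whatever HTTP framework the deployment server uses. The body flag is modeled as a keyword argument `async_flag` because `async` is a reserved word in Python, so a Pydantic request model would need `Field(alias="async")` to accept the proposed JSON key. In a real implementation the `run_id` would presumably be the ID of the placeholder pipeline run created in ZenML rather than a local `uuid4`, and run state would live in the ZenML store so that it survives restarts.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional, Tuple

# (status_code, headers, json_body): framework-agnostic response shape for this sketch.
Response = Tuple[int, Dict[str, str], Dict[str, Any]]
ACTIVE_STATES = ("queued", "running")


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


class AsyncRunRegistry:
    """Hypothetical in-memory run registry; not part of ZenML's API."""

    def __init__(
        self,
        pipeline: Callable[[Dict[str, Any]], Dict[str, Any]],
        max_workers: int = 2,
        max_active: int = 8,
        retry_after_s: int = 30,
    ) -> None:
        self._pipeline = pipeline
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._max_active = max_active
        self._retry_after_s = retry_after_s
        self._runs: Dict[str, Dict[str, Any]] = {}
        self._lock = threading.Lock()

    def invoke(
        self,
        params: Dict[str, Any],
        prefer: Optional[str] = None,
        async_flag: bool = False,
    ) -> Response:
        """POST /invoke: sync by default, async on body flag or Prefer header."""
        wants_async = async_flag or "respond-async" in (prefer or "")
        if not wants_async:
            # Default path is unchanged: block until the pipeline returns.
            return 200, {}, {"outputs": self._pipeline(params)}
        with self._lock:
            active = sum(r["status"] in ACTIVE_STATES for r in self._runs.values())
            if active >= self._max_active:
                # Backpressure: refuse the run instead of queueing without bound.
                headers = {"Retry-After": str(self._retry_after_s)}
                return 429, headers, {"detail": "too many active runs"}
            run_id = str(uuid.uuid4())
            # Placeholder record exists before we return, so run_id is always queryable.
            self._runs[run_id] = {
                "run_id": run_id, "status": "queued", "created_at": _now(),
                "started_at": None, "finished_at": None, "outputs": None, "error": None,
            }
        accepted_status = "queued"
        try:
            self._executor.submit(self._execute, run_id, params)
        except RuntimeError as exc:
            # Executor is shut down: the run failed to start but remains queryable.
            accepted_status = "failed"
            self._update(run_id, status="failed", error=repr(exc), finished_at=_now())
        body = {"run_id": run_id, "status": accepted_status}
        return 202, {"Location": f"/runs/{run_id}"}, body

    def get_run(self, run_id: str) -> Response:
        """GET /runs/{run_id}: current state; outputs are set only once succeeded."""
        with self._lock:
            run = self._runs.get(run_id)
            if run is None:
                return 404, {}, {"detail": "run not found"}
            return 200, {}, dict(run)

    def _update(self, run_id: str, **fields: Any) -> None:
        with self._lock:
            self._runs[run_id].update(fields)

    def _execute(self, run_id: str, params: Dict[str, Any]) -> None:
        # Runs on a worker thread, independent of the HTTP client that submitted it.
        self._update(run_id, status="running", started_at=_now())
        try:
            outputs = self._pipeline(params)
        except Exception as exc:
            self._update(run_id, status="failed", error=repr(exc), finished_at=_now())
        else:
            self._update(run_id, status="succeeded", outputs=outputs, finished_at=_now())
```

Returning `429` with `Retry-After` instead of accepting every request keeps the number of in-flight runs bounded. The `Location` header lets clients follow the status resource without constructing URLs themselves.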

Optional (out of MVP, listed for completeness): a `DELETE /runs/{run_id}` cancellation endpoint, and webhook delivery as a complement to polling (`callback_url` in the request body, server `POST`s the terminal state).
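The optional cancellation and webhook extensions mostly come down to enforcing the state machine from point 3. Here is a sketch under assumed semantics. The `RunStatus` enum, the transition table, the choice of `200`/`202`/`409` for `DELETE /runs/{run_id}`, and the webhook payload shape are all hypothetical.

```python
from enum import Enum
from typing import Dict, FrozenSet, Optional


class RunStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


TERMINAL: FrozenSet[RunStatus] = frozenset(
    {RunStatus.SUCCEEDED, RunStatus.FAILED, RunStatus.CANCELLED}
)

# Legal transitions; terminal states have no outgoing edges.
ALLOWED: Dict[RunStatus, FrozenSet[RunStatus]] = {
    RunStatus.QUEUED: frozenset({RunStatus.RUNNING, RunStatus.FAILED, RunStatus.CANCELLED}),
    RunStatus.RUNNING: frozenset({RunStatus.SUCCEEDED, RunStatus.FAILED, RunStatus.CANCELLED}),
}


def transition(current: RunStatus, target: RunStatus) -> RunStatus:
    """Validate a state change; raises ValueError on an illegal transition."""
    if target not in ALLOWED.get(current, frozenset()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def cancel_status_code(current: RunStatus) -> int:
    """HTTP status for DELETE /runs/{run_id} (hypothetical semantics)."""
    if current in TERMINAL:
        return 409  # Conflict: the run already finished
    if current is RunStatus.QUEUED:
        return 200  # never started, so it can be cancelled immediately
    return 202  # running: cancellation requested, terminal state reported later


def webhook_payload(run_id: str, status: RunStatus, finished_at: str) -> Optional[Dict[str, str]]:
    """Body to POST to callback_url; only terminal states are delivered."""
    if status not in TERMINAL:
        return None
    return {"run_id": run_id, "status": status.value, "finished_at": finished_at}
```

Keeping the terminal states free of outgoing transitions means a late cancellation cannot overwrite a `succeeded` or `failed` result. It also guarantees that a webhook fires at most once per run, provided delivery is triggered by the transition itself.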

### Alternatives Considered

- **Snapshot runs (`zenml pipeline run`):** rejected — full cold-start per request, defeats the purpose of an always-warm deployment.
- **Custom `startup_hook` background thread looping back to `/invoke`:** mentioned as a workaround in #4723. Error-prone and not first-class (no health reporting, no lifecycle management, no observability).
- **Long sync HTTP wrapped in an external orchestrator (Temporal/Argo/etc.) with heartbeats:** keeps the orchestrator's worker visible but does not keep the HTTP connection itself alive, so it still relies on every infra hop tolerating multi-minute idle reads. Moves the problem rather than solving it.
- **Direct polling of the control plane via `Client().get_pipeline_run()`:** viable for status retrieval, but only once a `run_id` is known — which today requires the sync call to complete. Async `/invoke` is the precondition that makes this pattern usable.
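Once `/invoke` returns a `run_id`, the client side reduces to an ordinary polling loop. Below is a minimal sketch with capped exponential backoff, assuming the proposed status names. `poll_until_terminal` and its parameters are hypothetical. `fetch_status` is injected so that it could wrap either the proposed `GET /runs/{run_id}` or a control-plane lookup. When wrapping `Client().get_pipeline_run()`, the wrapper would have to map ZenML's own execution statuses onto this terminal set, since they may not use the same names as the proposal.

```python
import time
from typing import Any, Callable, Dict, FrozenSet

# Terminal states from the proposed state machine (assumed names).
TERMINAL_STATES: FrozenSet[str] = frozenset({"succeeded", "failed", "cancelled"})


def poll_until_terminal(
    fetch_status: Callable[[str], Dict[str, Any]],
    run_id: str,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until the run reaches a terminal state, doubling the delay up to max_delay.

    Raises TimeoutError if the next wait would pass the deadline. This only stops
    the client from waiting; the server-owned run keeps going.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        state = fetch_status(run_id)
        if state["status"] in TERMINAL_STATES:
            return state
        if clock() + delay > deadline:
            raise TimeoutError(f"run {run_id} still {state['status']!r} after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

The loop needs nothing but the `run_id`. A crashed or redeployed client can therefore resume observing the same run, which is the decoupling of client and server lifecycles described under Problem or Use Case.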

### Additional Context

- ZenML server / client: 0.94.2 (confirmed unchanged on `develop` for the relevant code path).
- Source pointer: `src/zenml/deployers/server/service.py::execute_pipeline` — the `timeout` field is read off the request but explicitly unused.
- Related: **#3928** (parent feature; async path scoped but not delivered), **#4723** (broader event-driven / transport-agnostic ingest — different surface, complementary).

### Priority

High - Critical for my use case

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature Request: Async invocation mode for pipeline deployment /invoke (return run_id immediately, poll for status) #4865

Contact Details [Optional]

Feature Description

Problem or Use Case

Proposed Solution

Alternatives Considered

Additional Context

Priority

Code of Conduct

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Feature Request: Async invocation mode for pipeline deployment /invoke (return run_id immediately, poll for status) #4865

Description

Contact Details [Optional]

Feature Description

Problem or Use Case

Proposed Solution

Alternatives Considered

Additional Context

Priority

Code of Conduct

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions