Version: 2.0 prerelease

Operator Operating Envelope

This guide defines the operator-facing contract for Durable Workflow v2. Use it to decide which diagnostics block rollouts, which ones are advisory, which queue facts belong to Waterline versus worker telemetry, how to verify rebuild and export workflows, and which deployment shapes are part of the documented operating envelope.

Source-of-truth surfaces

Use these surfaces together:

| Surface | Use it for | Contract class |
| --- | --- | --- |
| php artisan workflow:v2:doctor --strict | Backend capability gating before v2 traffic or upgrades | Blocking |
| GET /waterline/api/v2/health | Current engine-source readiness plus blocking vs advisory v2 health checks | Blocking when status = error, advisory when status = warning |
| GET /waterline/api/stats | Durable fleet totals, backlog counters, repair-loop facts, projection drift counts, worker compatibility summaries | Advisory and benchmarking |
| php artisan workflow:v2:rebuild-projections ... | Previewing and repairing projection drift | Maintenance |
| php artisan workflow:v2:backfill-command-contracts ... | Previewing and backfilling legacy command-contract snapshots | Maintenance |
| php artisan workflow:v2:history-export ... and Waterline history-export routes | Replay, archive handoff, and incident artifacts | Verification |
| Waterline archive actions and control-plane archive() | Lifecycle state transitions for closed runs | Lifecycle |
| Worker SDK metrics, traces, and logs | Schedule-to-start latency, poll success, sticky-cache behavior, and custom application telemetry | Runtime telemetry |

The durable-state operator contract lives in Waterline and the workflow package. Worker telemetry remains the source of truth for latency and process-level behavior inside your workers.

Surface mapping by deployment shape

The Waterline routes in the table above ship inside the embedded Laravel host that installs the durable-workflow/workflow package. Standalone-server deployments do not run Waterline; they publish the equivalent operator contracts as authenticated server endpoints and as dw CLI commands. Read each row of this guide against the surface that exists in your deployment:

| Operator question | Embedded shape (Waterline) | Standalone-server shape |
| --- | --- | --- |
| Engine-source readiness and blocking vs advisory health | GET /waterline/api/v2/health | GET /api/system/health (admin auth, control-plane v2); dw server:health for liveness and dw server:info for the topology, protocol, and rollout-safety summary |
| Durable fleet totals, backlog, repair, worker compatibility, projection drift | GET /waterline/api/stats | GET /api/system/operator-metrics and dw system:operator-metrics |
| Selected-run detail and history export | GET /waterline/api/instances/... and /waterline/api/.../history-export | GET /api/workflows/{workflowId}, /runs/{runId}, and /runs/{runId}/history/export (see the Server API Reference) |
| Operator commands (cancel, terminate, repair, archive, signal/update/query) | `POST /waterline/api/instances/.../{cancelterminate | |
| Topology and node-identity discovery | php artisan workflow:v2:doctor --json (topology object) | GET /api/cluster/info, GET /api/health, GET /api/ready (or dw server:info) |

The field families and contract names below stay the same regardless of which surface you read them through. When the rest of this guide names a /waterline/... route, treat the matching server route as the equivalent on standalone-server deployments.

Supported topologies

Durable Workflow v2 supports these operator shapes. The shape names in the first column match the topology.current_shape values published by /api/cluster/info and the Server Role Topology manifest, so the operator contract here lines up with the discovery contract your automation already reads.

| Operator shape (topology.current_shape) | Supported operator contract | Primary failure domains | Recovery and failover expectation |
| --- | --- | --- | --- |
| embedded, single node | Waterline, control-plane routes, health, rebuild, export, and archive all run from one app process against one durable database and one cache store. | The Laravel app process, the durable database, and the cache store on one host. | Treat host or database loss as a full service interruption. Restore durable state first, bring one app node back to readiness, then verify worker registration before resuming traffic. |
| embedded, small same-region cluster | Use one shared database, one shared cache backend for wake-signal coordination, identical workflow compatibility/config across nodes, and keep active nodes in the same datacenter or region so queue wake-up and timer wake-up latency stay bounded. | Shared database, shared cache/wake coordination, load balancer routing, and the singleton scheduler or maintenance role. | One app-node loss should reduce capacity, not correctness. Database or Redis failure still blocks the fleet. Scheduler failover and upgrades remain explicit operator procedures rather than automatic HA promises. |
| standalone_server distribution | Use the Self-Hosting Deployments guide for the server-specific deployment matrix, then apply the same health, stats, export, archive, and queue-health distinctions described here through the server-side /api/system/... and /api/workflows/... routes (see the surface mapping above). | Shared database, shared Redis, API container set, independently scaled workers, and the single scheduler or maintenance runner. | API containers are replaceable; the database, Redis, and singleton scheduler path define recovery order. Restore persistence first, then verify /api/ready, /api/cluster/info, and worker registration before shifting traffic back. |
| split_control_execution | Same product contract as standalone_server, with each role isolated into its own process class (ingress_node, control_plane_node, scheduler_node, matching_node, execution_node). The same operator-metrics, health, and command surfaces apply per-node; route admin reads to the node that hosts the role you are interrogating. | Each role runs as its own process class, so the failure-domain checklist in Server Role Topology governs which subsystem fails first. The shared database, Redis, and singleton scheduler election remain fleet-wide failure domains. | Recovery follows the same order as standalone_server, but verify topology.current_shape, topology.current_process_class, and topology.current_roles per node before declaring the deployment ready. Hosted routes return 503 topology_role_unavailable when sent to the wrong node class. |

split_control_execution is not a separate engine or product. It is the same operator contract as standalone_server with the role-specific process classes named in topology.shape_assignments. Treat the rest of this guide as shape-agnostic for those two server shapes unless a section calls out a specific role. Server Role Topology holds the role vocabulary, authority boundaries, and migration path.

Publish the restore order, backup cadence, expected failover lag, and any region-pinned behavior in the runbook for the topology you operate. The product contract tells you which facts to measure; your deployment contract records the recovery timing, manual steps, and failure domains you accept.

Failure-domain checklist by supported shape

Use the topology table above as the quick summary, then write your runbook against these more explicit loss models:

  • Embedded Laravel, single node: One application process owns the control plane, matching, projection, scheduler, and execution roles together. Losing that process is a full service interruption for durable commands, workflow progress, schedule firing, and operator reads until the same app returns to readiness against intact durable storage.
  • Embedded Laravel, small same-region cluster: Losing one ordinary app node should remove only a share of HTTP and worker capacity while the remaining nodes keep claiming work from the shared durable store. Treat the shared database, the shared cache-backed wake path, and whichever node currently owns the singleton scheduler or maintenance duty as the main correctness boundaries for the fleet.
  • Standalone server distribution (standalone_server): Losing one server_http_node should stop ingress and control-plane commands only on that node; healthy worker nodes can still finish leased work and other API nodes can keep serving traffic. Losing one worker_node should raise backlog, queue age, or compatibility warnings only for the affected (connection, queue, compatibility) scopes. Losing the scheduler_node should pause new schedule fires and maintenance sweeps without invalidating already running workflows. Database or Redis loss is still a fleet-level outage until readiness, topology identity, and worker registration recover.
  • Split-role server distribution (split_control_execution): Each role runs as its own process class: ingress_node, control_plane_node, scheduler_node, matching_node, and execution_node. Losing any one process class only degrades the role it owns: ingress loss stops external HTTP traffic at the edge, control-plane loss makes operator commands fail fast while leased work continues, matching loss falls back to direct ready-task discovery, scheduler loss pauses schedule fires and records missed runs, and execution loss accumulates ready tasks without losing durable state. Database or Redis loss remains a fleet-wide outage; recovery requires the same restore order as the standalone_server shape.

If your deployment depends on different assumptions, treat that topology as a separate runbook with its own validated contract instead of assuming the self-serve guidance still applies unchanged.

Published recovery packet by topology

The supported topologies above are only production-ready when the deployment runbook publishes the matching recovery packet alongside them:

| Topology | Publish these operator-owned facts |
| --- | --- |
| embedded, single node | Backup schedule for the database, cache-preservation expectations, the exact app revision and env/config snapshot used for restore, the maximum accepted restore lag, and the latest successful restore rehearsal evidence. |
| embedded, small same-region cluster | Everything from the single-node packet, plus which node or process currently owns scheduler or maintenance duty, the expected impact of losing one ordinary node versus losing the shared database or cache backend, and the failover steps required to restore queue wake coordination. |
| standalone_server distribution | Database and Redis backup cadence, pinned server image or digest, auth-material location, the expected failover behavior for server_http_node, worker_node, and scheduler_node, the latest /api/ready plus /api/cluster/info restore verification evidence, and the latest worker re-registration proof after restore. |
| split_control_execution distribution | Everything from the standalone_server packet, plus the per-process-class scaling and failure expectations for ingress_node, control_plane_node, scheduler_node, matching_node, and execution_node, and the routing rules clients use when a hosted route returns 503 topology_role_unavailable from the wrong node class. |

If that packet is missing, stale, or untested, treat the topology as development-grade regardless of how many nodes are currently running.

Verify live topology identity before trusting the baseline

For standalone-server and split-role deployments, confirm the node identity that the product itself reports before you interpret queue, scheduler, or role failure signals. GET /api/cluster/info is the source of truth for that identity:

| Field | Use it for |
| --- | --- |
| topology.current_shape | Confirms whether the node is currently advertising embedded, standalone_server, or split_control_execution. |
| topology.current_roles | Confirms the logical roles actually hosted by this node. |
| topology.supported_shapes | Confirms which deployment shapes the current server build publicly supports. |
| topology.shape_assignments | Maps each supported shape to its documented process-class role bundles so you can compare the current role bundle against the supported topology. |

Use those fields as the first topology-drift check during rollouts:

  • In the self-serve standalone-server shape, API nodes should continue to report the api_ingress, control_plane, matching, and history_projection role bundle; scheduler nodes should report scheduler; worker nodes should report execution_plane.
  • In the split-role shape, verify that each node's current_roles match one of the documented role bundles under shape_assignments before you interpret backlog or scheduler lag as a worker problem.
  • If current_roles drift from the deployment plan, treat queue and failover baselines as suspect until the node identity is corrected.
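
The drift check in the list above can be sketched as a small comparison over a parsed GET /api/cluster/info response. This is a minimal sketch, not the product's own tooling: the function name is hypothetical, and the layout of topology.shape_assignments (shape name, then process class, then a list of roles) is an assumption that this guide does not confirm.

```python
from typing import Any


def roles_match_documented_bundle(cluster_info: dict[str, Any]) -> bool:
    """Return True when current_roles equals one documented role bundle.

    Assumed layout (not confirmed by this guide):
    topology.shape_assignments = {shape: {process_class: [role, ...]}}
    """
    topology = cluster_info.get("topology", {})
    shape = topology.get("current_shape")
    current_roles = set(topology.get("current_roles", []))
    bundles = topology.get("shape_assignments", {}).get(shape, {})
    return any(current_roles == set(roles) for roles in bundles.values())


# Illustrative payloads only; real responses may carry more fields.
ASSIGNMENTS = {
    "standalone_server": {
        "server_http_node": [
            "api_ingress", "control_plane", "matching", "history_projection",
        ],
        "scheduler_node": ["scheduler"],
        "worker_node": ["execution_plane"],
    }
}

api_node = {
    "topology": {
        "current_shape": "standalone_server",
        "current_roles": [
            "api_ingress", "control_plane", "matching", "history_projection",
        ],
        "shape_assignments": ASSIGNMENTS,
    }
}

drifted_node = {
    "topology": {
        "current_shape": "standalone_server",
        "current_roles": ["api_ingress", "scheduler"],
        "shape_assignments": ASSIGNMENTS,
    }
}
```

A node that fails this comparison is the case where queue and failover baselines should be treated as suspect until the node identity is corrected.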

Embedded installs do not publish /api/cluster/info. For the package-local topology view, run php artisan workflow:v2:doctor --json and inspect the topology object. It publishes the same role-topology schema and includes the embedded app's current_shape, current_process_class, current_roles, execution_mode, and nested matching_role summary.

Blocking and advisory diagnostics

Durable Workflow v2 separates blocking diagnostics from advisory diagnostics.

| Severity | Meaning | Typical operator action |
| --- | --- | --- |
| Blocking | The current configuration or readiness state is not safe to trust for v2 traffic | Stop rollout, fix the prerequisite, rerun verification |
| Advisory | The surface remains readable, but some derived facts need rebuild, backfill, or manual review before you rely on them | Keep serving traffic when appropriate, then repair the named surface |
| Healthy | No current issue was found in that surface | Continue normal operation |

Apply that rule to the shipped surfaces:

  • workflow:v2:doctor --strict blocks when backend capability issues have error severity. Examples include an unsupported queue driver in queue mode or a cache store without locks. Informational queue diagnostics in poll mode remain advisory.
  • GET /waterline/api/v2/health returns:
    • status = ok when the v2 operator surface is ready and the current checks are aligned.
    • status = warning when the surface remains readable but specific facts need rebuild, backfill, or repair before you trust them fully.
    • status = error with HTTP 503 when the engine-source bridge is not ready or a blocking capability problem makes the v2 surface unavailable.
  • GET /waterline/api/stats publishes durable operator facts. Treat those JSON fields as operator diagnostics for dashboards and scripts, not as a metrics scrape endpoint.
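
A rollout script can turn the health response into the three severities above. The sketch below assumes the caller has already fetched the endpoint and parsed the JSON body; the function name is hypothetical, and treating an unrecognized status as blocking is a conservative assumption rather than documented behavior.

```python
def classify_health(http_status: int, payload: dict) -> str:
    """Map a v2 health response to blocking, advisory, or healthy."""
    status = payload.get("status")
    if http_status == 503 or status == "error":
        return "blocking"
    if status == "warning":
        return "advisory"
    if status == "ok":
        return "healthy"
    # Assumption: an unrecognized status fails closed rather than open.
    return "blocking"
```

For example, a 200 response with status = warning classifies as advisory, so the rollout can continue while the named surface is repaired.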

Correctness vs acceleration checks

Every v2 health check carries a category of either correctness or acceleration, and the snapshot publishes a per-category rollup so operators can answer two separate questions without re-aggregating the check list.

  • Correctness checks describe whether durable ready-task discovery, projection freshness, command-contract backfill, history retention, worker compatibility, and backend capabilities are intact. A correctness check in status = error means safe task pickup or operator-trusted state is at risk; rollouts should stop until it clears.
  • Acceleration checks describe whether optional wake-signal propagation is keeping up. The durable pollers are the correctness path, so an acceleration check in status = warning means cross-node wake-up latency may be higher than steady state but no task is stranded.

Each entry under checks carries its category, and the snapshot adds a categories rollup so dashboards can summarize both questions at a glance:

    {
        "status": "warning",
        "categories": {
            "correctness": {"status": "ok", "check_count": 8},
            "acceleration": {"status": "warning", "check_count": 1}
        }
    }

Treat a degraded acceleration rollup as acceleration-only: investigate cache or wake backend health, but do not block traffic that depends only on durable ready-task discovery. A degraded correctness rollup is the blocking signal. The long_poll_wake_acceleration check is the canonical acceleration entry and never escalates above warning; every other check is a correctness entry.
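
As a rough illustration, the two-question reading of the categories rollup might look like the following. The decision labels and the function name are hypothetical, and pausing for review on a correctness warning (as opposed to an error) is an assumption made for this sketch.

```python
import json


def rollout_decision(snapshot: dict) -> str:
    """Read the categories rollup and return a rollout decision label."""
    categories = snapshot.get("categories", {})
    correctness = categories.get("correctness", {}).get("status")
    acceleration = categories.get("acceleration", {}).get("status")
    if correctness == "error":
        return "block"
    if correctness != "ok":
        # Assumption: a correctness warning or a missing rollup pauses for review.
        return "hold_for_review"
    if acceleration != "ok":
        # Acceleration-only degradation: durable pollers still find ready tasks.
        return "proceed_and_investigate_wake_path"
    return "proceed"


snapshot = json.loads("""
{
    "status": "warning",
    "categories": {
        "correctness": {"status": "ok", "check_count": 8},
        "acceleration": {"status": "warning", "check_count": 1}
    }
}
""")
decision = rollout_decision(snapshot)  # "proceed_and_investigate_wake_path"
```

The top-level status alone would not distinguish these cases, which is why the sketch reads the per-category rollup instead.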

Queue-health semantics

Queue health is split between durable queue state and worker/runtime telemetry.

Durable queue facts

Use Waterline dashboard stats and queue views for durable task state:

| Fact | Meaning |
| --- | --- |
| operator_metrics.backlog.runnable_tasks | Durable tasks that are ready to be claimed now. |
| operator_metrics.backlog.delayed_tasks | Durable tasks that exist but are still waiting for available_at. |
| operator_metrics.backlog.leased_tasks | Durable tasks currently claimed by a worker. |
| operator_metrics.backlog.tasks_added_last_minute | Distinct durable task rows created in the trailing 60 seconds. Treat this as durable queue inflow, not as a transport-attempt counter. |
| operator_metrics.backlog.tasks_dispatched_last_minute | Distinct durable task rows whose latest successful last_dispatched_at landed in the trailing 60 seconds. Compare it with tasks_added_last_minute to tell whether durable inflow is outrunning dispatch. |
| operator_metrics.starts.pending_runs, operator_metrics.starts.pending_commands, operator_metrics.starts.ready_tasks, operator_metrics.starts.oldest_pending_start_at, operator_metrics.starts.max_pending_ms | Durable workflow-start backlog. Use these facts to distinguish starts that have been accepted but have not yet become active workflow-task work from ordinary worker-side queue lag. |
| operator_metrics.tasks.oldest_ready_due_at, operator_metrics.tasks.max_ready_due_age_ms | The oldest currently actionable task and its ready-to-dispatch age. This is the machine-readable backlog-latency pair behind "oldest ready task". |
| operator_metrics.tasks.dispatch_overdue, operator_metrics.tasks.oldest_dispatch_overdue_since, operator_metrics.tasks.max_dispatch_overdue_age_ms | Ready durable tasks that still have no successful dispatch wake plus the age of the stalest example. Use these facts to spot degraded notifier acceleration without confusing it for ordinary queue growth. |
| operator_metrics.backlog.unhealthy_tasks | Durable tasks with dispatch failure, claim failure, overdue dispatch, or expired lease state. |
| operator_metrics.backlog.repair_needed_runs | Open runs that do not currently have a trusted durable resume path. |
| operator_metrics.tasks.oldest_lease_expired_at, operator_metrics.tasks.max_lease_expired_age_ms | The oldest expired lease and its age. Use this pair as the primary stuck-lease and duplicate-risk age indicator. |
| operator_metrics.backlog.oldest_compatibility_blocked_started_at, operator_metrics.backlog.max_compatibility_blocked_age_ms | The oldest compatibility routing block and its age. Use this when work is preserved but no compatible worker is currently eligible to claim it. |
| Active vs stale pollers | Whether registered workers are still heartbeating for a queue. |
| Current leases | Which workflow or activity tasks are leased right now and whether the lease is expired. |

These facts describe durable workflow-task and activity-task traffic only.

When you need queue-local drill-down instead of fleet totals, use the server task-queue visibility routes for backlog age, poller state, current leases, and admission budgets. Those routes do not currently expose per-queue stats.tasks_added_last_minute or stats.tasks_dispatched_last_minute; use the fleet-level operator_metrics.backlog.* pair above to compare durable inflow with dispatch, then use queue-local routes to see which queue is building backlog or has no available worker capacity.

Waterline's GET /waterline/api/v2/health surface publishes the same queue drill-down under queue_visibility.* for the configured namespace. Treat these field families as the typed queue-health contract:

| Field family | Meaning |
| --- | --- |
| queue_visibility.available, queue_visibility.reason | Whether Waterline can currently produce queue-local visibility for the configured namespace, and why not when it cannot. |
| queue_visibility.task_queues[].stats.approximate_backlog_count, queue_visibility.task_queues[].stats.approximate_backlog_age | Queue-local backlog count and oldest durable backlog age. |
| queue_visibility.task_queues[].stats.tasks_added_last_minute, queue_visibility.task_queues[].stats.tasks_dispatched_last_minute | Per-queue durable inflow versus dispatch over the trailing 60 seconds. Use these when one hot queue is hidden inside healthy fleet totals. |
| queue_visibility.task_queues[].stats.pollers.active_count, queue_visibility.task_queues[].stats.pollers.stale_count, queue_visibility.task_queues[].stats.pollers.stale_after_seconds | Healthy versus stale pollers on that queue and the stale-heartbeat threshold the snapshot used. |
| queue_visibility.task_queues[].stats.workflow_tasks.*, queue_visibility.task_queues[].stats.activity_tasks.* | Queue-local ready, leased, and expired-lease counts split by workflow-task versus activity-task traffic. |
| queue_visibility.task_queues[].repair.candidates, dispatch_failed, expired_leases, dispatch_overdue | Queue-local repair pressure: durable tasks that already need repair, are dispatch-failed, hold expired leases, or are overdue for redispatch. |
| queue_visibility.task_queues[].repair.oldest_dispatch_failed_at, max_dispatch_failed_age_ms, oldest_lease_expired_at, max_lease_expired_age_ms, oldest_dispatch_overdue_since, max_dispatch_overdue_age_ms | Queue-local age signals for the stalest dispatch failure, expired lease, and dispatch-overdue durable task. |
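
The per-queue inflow and dispatch pair is what exposes a hot queue hidden inside healthy fleet totals. A minimal sketch of that scan follows; the function name is hypothetical, and the name key used to label each queue entry is an assumption, since this guide does not list the identifying field of a task_queues[] entry.

```python
def queues_outrunning_dispatch(health: dict) -> list[str]:
    """List queues whose trailing-minute inflow exceeds dispatch."""
    visibility = health.get("queue_visibility", {})
    if not visibility.get("available", False):
        # No queue-local visibility: check queue_visibility.reason instead.
        return []
    hot = []
    for queue in visibility.get("task_queues", []):
        stats = queue.get("stats", {})
        added = stats.get("tasks_added_last_minute", 0)
        dispatched = stats.get("tasks_dispatched_last_minute", 0)
        if added > dispatched:
            # "name" is an assumed key for the queue label.
            hot.append(queue.get("name", "<unnamed>"))
    return hot


sample_health = {
    "queue_visibility": {
        "available": True,
        "task_queues": [
            {"name": "default", "stats": {
                "tasks_added_last_minute": 40,
                "tasks_dispatched_last_minute": 40}},
            {"name": "billing", "stats": {
                "tasks_added_last_minute": 90,
                "tasks_dispatched_last_minute": 12}},
        ],
    }
}
```

One minute of imbalance is not an alert on its own; it only tells you which queue to drill into.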

coordination_alerts[] on the same GET /waterline/api/v2/health payload is the operator roll-up for those queue-local facts plus the health-check list. Use it as the page-ready summary for warnings and errors, then drill into the matching queue_visibility or checks entries for evidence.

Treat the queue-local admission status as the first-class slot and poller signal for that queue. saturated means live workers are present but every registered slot is already leased. throttled means a server-side lease or dispatch cap is intentionally holding new work. no_slots means workers are registered but exposed zero capacity for that task kind. no_active_workers means the queue has no healthy poller at all, and unavailable means a configured lock-backed admission guard cannot currently prove safety.

Use operator_metrics.starts.* when new workflow starts appear stuck even though steady-state queue lag looks normal. Those facts separate control-plane start admission and first-task creation debt from downstream worker pickup.

Poller pressure and admission budgets

Use task-queue detail routes or dw task-queue:describe when queue flow is degrading and you need to separate "not enough worker capacity" from "intentional server throttling" or "no live poller at all":

| Queue status | Meaning | Treat it as |
| --- | --- | --- |
| accepting | Workers still have available slots and no server cap is full. | Healthy baseline. |
| saturated | All registered worker slots are currently leased. | Worker-capacity pressure. |
| throttled | A server-side active-lease or dispatch-rate cap is intentionally holding the queue back. | Advisory unless the cap is unexpected or the backlog keeps growing beyond the published baseline. |
| no_slots | Active workers are registered, but none advertise slots for that task kind. | Blocking for that queue. |
| no_active_workers | No healthy poller is currently serving the queue. | Blocking for that queue. |
| unavailable | The queue cannot acquire the lock needed for its configured admission path. | Blocking until the admission dependency recovers. |

Read these statuses together with the queue-flow facts:

  • tasks_added_last_minute > tasks_dispatched_last_minute plus saturated means durable inflow is outrunning worker capacity.
  • The same rate imbalance plus throttled means the queue is being held back by an explicit server cap and should be judged against that cap's intended contract, not against unrestricted throughput.
  • A rising oldest-ready age plus no_active_workers or stale pollers means the queue has lost healthy claimers and should be treated as a routing outage for that scope.
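
The combinations above can be sketched as a single triage function. The returned labels are hypothetical names for this example, not values published by the product, and the function looks only at the trailing-minute rates, not at oldest-ready age.

```python
BLOCKING_STATUSES = {"no_slots", "no_active_workers", "unavailable"}


def diagnose_queue(status: str, added_last_minute: int,
                   dispatched_last_minute: int) -> str:
    """Combine a queue admission status with the add-vs-dispatch rates."""
    inflow_outrunning = added_last_minute > dispatched_last_minute
    if status in BLOCKING_STATUSES:
        return "blocking_for_queue"
    if status == "saturated" and inflow_outrunning:
        return "worker_capacity_pressure"
    if status == "throttled" and inflow_outrunning:
        # Judge against the cap's intended contract, not raw throughput.
        return "judge_against_server_cap"
    return "keep_watching"
```

For example, diagnose_queue("saturated", 120, 80) points at worker capacity, while the same rates with throttled point at the server-side cap.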

Matching-role deployment shape

Use operator_metrics.matching_role.* when you need to confirm which matching/dispatch contract the current node is actually serving:

| Fact | Meaning |
| --- | --- |
| operator_metrics.matching_role.queue_wake_enabled | Whether this node still runs the in-worker broad-poll wake path on queue-worker loop events. |
| operator_metrics.matching_role.shape | in_worker when the node still owns that wake path, dedicated when the wake/repair sweep is expected to run under a separate workflow:v2:repair-pass --loop process. |
| operator_metrics.matching_role.task_dispatch_mode | The dispatch mode this node is using for ready tasks: queue or poll. |
| operator_metrics.matching_role.partition_primitives | The frozen routing axes, in order: connection, queue, compatibility, namespace. |
| operator_metrics.matching_role.backpressure_model | The durable admission boundary the engine enforces. Current v2 reports lease_ownership. |

These fields are node-local, not fleet-wide. In a mixed-shape rollout, read the snapshot from each node or pod you are cutting over so you can confirm the matching role moved where you intended before you interpret backlog or poller changes as worker health.
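
Because the fields are node-local, a cutover check has to compare one snapshot per node against the plan. The sketch below assumes you have already collected each node's stats payload into a dictionary keyed by a node label of your choosing; the function name and the node labels are hypothetical.

```python
def nodes_off_plan(planned: dict[str, str],
                   snapshots: dict[str, dict]) -> list[str]:
    """Return nodes whose reported matching_role.shape differs from the plan."""
    off_plan = []
    for node, expected_shape in planned.items():
        role = (snapshots.get(node, {})
                .get("operator_metrics", {})
                .get("matching_role", {}))
        # A node with no snapshot counts as off plan until it reports.
        if role.get("shape") != expected_shape:
            off_plan.append(node)
    return sorted(off_plan)


planned = {"app-1": "dedicated", "app-2": "dedicated"}
snapshots = {
    "app-1": {"operator_metrics": {"matching_role": {"shape": "dedicated"}}},
    "app-2": {"operator_metrics": {"matching_role": {"shape": "in_worker"}}},
}
```

Here app-2 still reports in_worker, so backlog or poller changes on that node should not yet be read as worker health.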

Worker and SDK telemetry

Use worker metrics, traces, and logs for:

  • Workflow and activity schedule_to_start latency
  • Poll success rate and sync/eager-dispatch behavior
  • Sticky-cache size and eviction behavior
  • Worker CPU, memory, thread, and event-loop pressure
  • Custom application metrics emitted from activities or worker code

Synchronous queries, live-debug tooling, and other non-durable control-plane calls should be labeled separately in your dashboards. They do not count as durable task backlog and they do not change Waterline repair counters.

Worker compatibility and rollout health

operator_metrics.workers publishes the compatibility facts that determine whether the active worker fleet can safely handle the required workflow contract:

| Fact | Meaning |
| --- | --- |
| operator_metrics.workers.required_compatibility | Compatibility markers a worker must advertise to be eligible for work in the namespace. |
| operator_metrics.workers.active_workers | Count of distinct live workers seen through compatibility heartbeat. |
| operator_metrics.workers.active_worker_scopes | Count of (connection, queue) scopes covered by those workers. |
| operator_metrics.workers.active_workers_supporting_required | Workers whose advertised compatibility covers the required markers. |
| operator_metrics.workers.fleet | Per-scope list of every active worker with worker_id, connection, queue, advertised supported markers, a supports_required flag, the heartbeat source (database or cache), and recorded_at. |

Use the summary counts to detect rollout states where some workers cannot safely claim the required work, and drill into fleet to identify exactly which (connection, queue) scope is missing coverage. The Waterline operator dashboard renders the same fleet list under its worker compatibility panel so operators do not need to query the metric surface by hand.

When active_workers_supporting_required reaches zero for a namespace, Waterline surfaces a no_compatible_worker_for_task run diagnostic on affected runs so the gap is visible on the run-detail view as well as the metric surface. The companion worker_compatibility health check fires as warning under correctness under the same condition, which flips the correctness category rollup to warning so the fleet gap is visible at a glance and not buried inside the check list.
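
Drilling into fleet for scopes without coverage might look like this. It is a sketch under one assumption: that fleet is a flat list of worker entries carrying connection, queue, and supports_required. It can only report scopes that still have at least one active worker; a scope with no workers at all does not appear in fleet and has to be found through the queue-level no_active_workers status.

```python
def scopes_missing_coverage(workers: dict) -> list[tuple[str, str]]:
    """Return (connection, queue) scopes with workers but none compatible."""
    seen = set()
    covered = set()
    for entry in workers.get("fleet", []):
        scope = (entry.get("connection"), entry.get("queue"))
        seen.add(scope)
        if entry.get("supports_required"):
            covered.add(scope)
    return sorted(seen - covered)


# Illustrative data; worker ids and scope names are made up.
workers_metrics = {
    "fleet": [
        {"worker_id": "w1", "connection": "redis", "queue": "default",
         "supports_required": True},
        {"worker_id": "w2", "connection": "redis", "queue": "default",
         "supports_required": False},
        {"worker_id": "w3", "connection": "redis", "queue": "billing",
         "supports_required": False},
    ]
}
```

In this sample the default scope is still covered by w1, while billing has a live worker that cannot claim the required work.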

See Rolling Out Worker Builds With Build IDs for the drain/resume flow that coordinates with these facts during a build-id rollout, and Worker Compatibility and Routing for the pinning contract behind those diagnostics.

Alert semantics

Alert thresholds are deployment-specific. Publish your own numeric baselines for queue age, repair lag, worker coverage, and restore timing, then alert when the contract below stays breached longer than one normal repair or watchdog window for the topology you operate.

| Alert family | Source | Treat as | Escalate when | Operator response |
| --- | --- | --- | --- | --- |
| Blocking readiness | workflow:v2:doctor --strict, GET /waterline/api/v2/health | Blocking | doctor --strict returns an error or the health endpoint returns status = error / HTTP 503 | Stop rollout or traffic shift, fix the blocking prerequisite, then rerun readiness and compatibility checks. |
| Compatible-worker coverage | operator_metrics.workers.*, worker_compatibility health check, run diagnostic no_compatible_worker_for_task | Blocking | active_workers_supporting_required = 0 for a namespace or required (connection, queue) scope | Drain incompatible workers, register compatible workers, and confirm the correctness rollup clears before trusting new claims. |
| Durable queue lag | Waterline queue views, operator_metrics.backlog.*, worker schedule_to_start telemetry | Blocking when sustained; advisory when brief | The oldest ready-task age or schedule-to-start latency stays above the published topology baseline while compatible workers are available | Add worker capacity, inspect task-queue admission limits, and verify the scheduler or matching path is still making forward progress. |
| Poller pressure and admission saturation | Task-queue detail routes, dw task-queue:describe, queue status, stale pollers, and queue-local add/dispatch rates | Blocking for no_active_workers, no_slots, or unavailable; advisory for intentional throttled states | One queue stays saturated while its oldest-ready age and add-vs-dispatch gap keep growing, or any queue flips to no_active_workers, no_slots, or unavailable outside a planned maintenance window | Add worker slots, restore the missing poller cohort, or confirm the server-side cap and lock dependency are behaving as designed before you scale blindly. |
| Workflow-start backlog | operator_metrics.starts.*, control-plane start telemetry, worker schedule_to_start telemetry for first workflow tasks | Blocking when sustained; advisory when brief | pending_commands, ready_tasks, or max_pending_ms stay above the published topology baseline while compatible workers and queue capacity are available | Inspect the start boundary end to end: confirm start commands are turning into durable tasks, verify matching or dispatch is creating the first task promptly, and separate start-path debt from general worker lag before scaling. |
| Projection drift and repair debt | run_summary_projection / selected_run_projections health checks, operator_metrics.repair.* | Advisory | Drift warnings persist past one planned rebuild window or the max candidate age keeps climbing | Run the rebuild or repair previews, execute the repair, then verify the warning clears and stale ages return to baseline. |
| Retry or failure storm | operator_metrics.backlog.unhealthy_tasks, durable run diagnostics, worker error telemetry | Advisory, escalating to blocking if it prevents durable progress | Dispatch-failed, claim-failed, expired-lease, or retry-exhaustion facts climb above the topology baseline and stay elevated | Inspect the failing task family, compare worker telemetry with durable error facts, and decide whether to drain traffic or isolate the affected queue. |
| Wake acceleration degradation | long_poll_wake_acceleration health check and the acceleration category rollup | Advisory | The acceleration warning persists after cache or notifier maintenance windows | Investigate cache or wake propagation health. Do not treat this as a correctness outage unless the correctness rollup also degrades. |

The goal is to page on durable contract risk, not on every transient signal. Queue and worker alerts should only become blocking when they threaten the operator contract for the topology you actually run.
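
"Stays breached longer than one window" is the part that separates a page from a transient signal. A minimal sketch of that rule follows; the class name, the threshold, and the window length are placeholders you would replace with your own published baselines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedBreach:
    """Fire only after a value stays above threshold for a full window."""
    threshold: float
    window_seconds: float
    breached_since: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        if value <= self.threshold:
            # Back within baseline: reset so brief spikes never accumulate.
            self.breached_since = None
            return False
        if self.breached_since is None:
            self.breached_since = now
        return now - self.breached_since > self.window_seconds


# Example: max_ready_due_age_ms against a 5000 ms baseline, 60 s window.
tracker = SustainedBreach(threshold=5000, window_seconds=60)
```

Feeding the tracker one observation per scrape keeps brief spikes advisory and turns only a continuous breach into an escalation.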

Rebuild, repair, and restore expectations

Use these checks in order when the operator surface reports drift:

  1. Check GET /waterline/api/v2/health.

    • run_summary_projection and selected_run_projections warnings mean Waterline can still answer, but some list or detail facts need rebuild.
    • command_contract_snapshots warnings mean some legacy runs still need WorkflowStarted contract backfill before operators can trust declared signal, update, or query forms.
    • durable_resume_paths warnings mean open runs need repair before you rely on their projected next resume source.
  2. Preview projection work with:

    php artisan workflow:v2:rebuild-projections --needs-rebuild --dry-run
  3. Rebuild the affected projections:

    php artisan workflow:v2:rebuild-projections --needs-rebuild
  4. Preview command-contract backfill work with:

    php artisan workflow:v2:backfill-command-contracts --dry-run
  5. Backfill command contracts when the current workflow class is still available:

    php artisan workflow:v2:backfill-command-contracts
  6. Use --prune-stale only after your retention workflow has intentionally removed durable rows and you want to delete projection rows whose durable run or history row no longer exists.
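
The preview-then-apply order above can be wrapped in two small shell functions so that previews and repairs are never mixed up. This is a minimal sketch, not something the package ships: the function names and the ARTISAN override variable are hypothetical, and only the four artisan commands come from the steps above.

```shell
# Sketch: keep previews and repairs in separate functions so an operator
# reviews the dry-run output before anything is changed.
# ARTISAN is a hypothetical override; ARTISAN=echo prints the arguments
# instead of running them, which is handy for rehearsing the sequence.

artisan() {
  # Intentionally unquoted so "php artisan" splits into command and argument.
  ${ARTISAN:-php artisan} "$@"
}

preview_drift() {
  artisan workflow:v2:rebuild-projections --needs-rebuild --dry-run &&
  artisan workflow:v2:backfill-command-contracts --dry-run
}

apply_drift() {
  # Stops at the first failing command.
  artisan workflow:v2:rebuild-projections --needs-rebuild &&
  artisan workflow:v2:backfill-command-contracts
}
```

Run preview_drift first, read its output, and call apply_drift only when the preview matches what the health check reported.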

operator_metrics.repair.* publishes the repair-loop sweep footprint. Use the candidate counts, selected counts, maximum candidate age, and scan-limit pressure to decide whether repair work is comfortably within your baseline or needs capacity investigation.
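
One way to make the "comfortably within baseline" decision repeatable is to compare a sample against the numbers you recorded for your topology. The sketch below assumes you have already extracted two integers from operator_metrics.repair.* (a candidate count and a maximum candidate age, taken here to be in seconds); the function name, argument order, and unit are illustrative assumptions, not part of the published metrics contract.

```shell
# Sketch: compare one repair-loop sample against a recorded baseline.
# Usage: repair_within_baseline CANDIDATES MAX_AGE BASELINE_CANDIDATES BASELINE_MAX_AGE
# All four arguments are integers; the age unit (seconds) is an assumption.
repair_within_baseline() {
  local candidates=$1 max_age=$2 baseline_candidates=$3 baseline_max_age=$4
  if [ "$candidates" -le "$baseline_candidates" ] &&
     [ "$max_age" -le "$baseline_max_age" ]; then
    echo "within-baseline"
  else
    echo "investigate-capacity"
    return 1
  fi
}
```

The non-zero return on a breach lets a scheduled check fail loudly while the printed label stays readable in logs.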

Export and archive verification

History export and archive serve different purposes:

  • History export creates a replay/debug/archive artifact.
  • Archive marks a closed run as archived so it leaves active fleet views.
  • Prune removes projection or durable rows after retention has definitely expired.

Use this verification sequence:

  1. Export the selected run:

    php artisan workflow:v2:history-export <workflow-instance-id> --run-id=<workflow-run-id> --output=storage/app/workflow-history/run.json --pretty
  2. Verify the bundle includes the expected run id, schema version, and any configured redaction metadata.

  3. Archive the closed run only after the export artifact is stored where your runbook expects it.

  4. Keep archived-but-not-pruned runs available for incident review.

  5. Prune durable rows through your retention job, then rebuild/prune projections with workflow:v2:rebuild-projections --prune-stale.

For Waterline users, the matching history-export and archive routes are listed in the Waterline Operator API Reference.
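
Step 2 of the sequence above can start with a coarse file-level check before anyone inspects the bundle by hand. The sketch below only confirms that the export file is non-empty and mentions the expected run id; it deliberately assumes nothing about the bundle's field names, and the function name is hypothetical. Schema version and redaction metadata still need a check against the actual bundle format.

```shell
# Sketch: coarse sanity check on an exported history bundle.
# Usage: verify_export_bundle PATH RUN_ID
verify_export_bundle() {
  local bundle=$1 run_id=$2
  # -s is true only when the file exists and is not empty.
  [ -s "$bundle" ] || { echo "missing or empty: $bundle" >&2; return 1; }
  # -F treats the run id as a literal string, not a regular expression.
  grep -qF -- "$run_id" "$bundle" ||
    { echo "run id $run_id not found in $bundle" >&2; return 1; }
  echo "ok: $bundle mentions $run_id"
}
```

For example, after the export command above: verify_export_bundle storage/app/workflow-history/run.json "<workflow-run-id>".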

Backup, restore, and disaster-recovery contract

Backup, restore, and disaster recovery are part of the operating envelope, not an optional private runbook. For every supported topology, publish and rehearse these facts:

  1. The durable backup set: database backup, server or app image reference, runtime env file or config set, auth material location, and the exact topology or restore notes needed to reattach workers.
  2. The recovery targets: maximum accepted restore lag, expected failover lag, and who is allowed to declare traffic safe again.
  3. The restore order: restore durable persistence first, then cache, then bootstrap or migrations, then the singleton scheduler or maintenance role, then API readiness, then worker registration.
  4. The verification pass: /api/ready or /waterline/api/v2/health, /api/cluster/info where applicable, one representative worker registration, and one representative history export from restored state.
  5. The repair pass after restore: rebuild projections, backfill command contracts if needed, and confirm queue, compatibility, and repair metrics return to baseline before you call the environment healthy.
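
The restore order in item 3 is easier to rehearse when it is encoded once and halts at the first failed stage. The following is a sketch under stated assumptions: the six stage names are hypothetical hook functions that you would define for your own topology, and nothing here is provided by the package.

```shell
# Sketch: run restore stages in the documented order and stop at the first failure.
# Each name below is a hypothetical hook you define yourself; the order mirrors
# persistence, cache, bootstrap/migrations, scheduler role, API readiness, workers.
RESTORE_STAGES="restore_database restore_cache run_bootstrap start_scheduler check_api_ready register_worker"

run_restore() {
  local stage
  for stage in $RESTORE_STAGES; do
    echo "stage: $stage"
    "$stage" || { echo "halted at: $stage" >&2; return 1; }
  done
  echo "restore sequence complete"
}
```

Because later stages never run after a failure, the last "stage:" line in the log is the checkpoint to record in the rehearsal evidence.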

Do not imply automatic multi-region or hands-free HA behavior unless your published topology contract actually proves it. For the documented self-hosted topologies, including the active/passive multi-region contract in the self-hosting guide, recovery and regional failover are deliberate operator work with explicit checkpoints, not automatic product behavior. Hosted Cloud multi-region replication v1 is scoped separately in the Cloud control-plane contract: it can switch the active runtime target inside a configured primary/secondary pair, but it does not make arbitrary runtime-target migration or active/active writes a general deployment guarantee.

Treat restore rehearsal cadence as part of the public operating contract too. At minimum, rehearse the documented restore sequence:

  • before the first production rollout for a topology
  • after any change to the backup mechanism, schema/bootstrap path, auth model, or deployment topology
  • on a regular recurring cadence that is published in the same runbook as the backup schedule

If you cannot produce the latest successful rehearsal date, elapsed restore time, and verification evidence, then backup and DR remain an unproven claim for that topology.
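
A rehearsal date is only useful if something checks it against the cadence you published. As a rough sketch, assuming you record the last successful rehearsal as a Unix timestamp, the helpers below compute its age in whole days and compare it with your own cadence; the function names are hypothetical and the cadence value is yours to choose.

```shell
# Sketch: age of the last successful restore rehearsal, in whole days.
# Usage: rehearsal_age_days LAST_EPOCH NOW_EPOCH
rehearsal_age_days() {
  local last=$1 now=$2
  echo $(( (now - last) / 86400 ))   # 86400 seconds per day
}

# Usage: rehearsal_is_fresh LAST_EPOCH NOW_EPOCH MAX_DAYS
# Succeeds when the rehearsal is no older than MAX_DAYS.
rehearsal_is_fresh() {
  [ "$(rehearsal_age_days "$1" "$2")" -le "$3" ]
}
```

In practice NOW_EPOCH would come from date +%s, for example: rehearsal_is_fresh "$last_rehearsal" "$(date +%s)" 90.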

Benchmark envelope

Durable Workflow v2 publishes the dimensions you should benchmark for your own environment. Record these baselines in staging or canary before production traffic depends on them:

| Dimension | What to baseline | Source |
| --- | --- | --- |
| Projection health | Steady-state needs_rebuild = 0, rebuild duration after intentional drift, and stale/orphan cleanup time | /waterline/api/v2/health, /waterline/api/stats, workflow:v2:rebuild-projections |
| Queue pressure | Backlog age, oldest ready task age, runnable vs delayed task counts, task add vs dispatch rate, dispatch-overdue age, stale poller count, and queue admission status (accepting, saturated, throttled, no_slots, no_active_workers) | Waterline dashboard stats and queue views plus operator_metrics.backlog.* / operator_metrics.tasks.* |
| Workflow-start latency | Accepted start commands waiting for first-task creation, oldest pending-start age, and first-task pickup after admission | operator_metrics.starts.* plus worker schedule_to_start telemetry |
| Schedule-to-start latency | Workflow and activity queue wait from enqueue to start | Worker SDK metrics |
| Timer fan-out wake-up behavior | Wake-signal propagation time and the lag between scheduled fire time and ready-task visibility during burst timers | Worker telemetry plus same-region wake coordination checks |
| Repair-loop sweep cost | Candidate counts, selected counts, max candidate age, max missing-run age, and scan-pressure behavior | operator_metrics.repair.* |
| History pressure | Event count, history size, and continue-as-new recommendation thresholds | operator_metrics.history.* |

These are benchmark dimensions rather than universal latency promises. Publish your own acceptable ranges for the topology you operate.

Long-soak evidence

Benchmark snapshots are not enough on their own. Before you call a topology trusted for sustained traffic, keep a long-soak evidence packet that shows the system stayed inside its declared envelope over time.

Include at least:

  • workload shape: topology, server image or app revision, worker build ids, queue layout, cache backend, database backend, and the representative mix of workflow starts, timer load, activities, queries, and exports
  • soak window: start and end time, plus enough duration to cover at least one normal repair window, one retention or archive pass if applicable, and one representative business-cycle traffic swing for that environment
  • durable queue stability: backlog age, ready-task age, start backlog age, task add versus dispatch rate, and stale-poller counts staying within the published baseline for the topology
  • correctness stability: no sustained status = error from GET /waterline/api/v2/health, no unexplained growth in operator_metrics.repair.*, and no persistent compatibility gaps in operator_metrics.workers.*
  • process and cache stability: worker memory, CPU, event-loop or thread pressure, and cache/cardinality growth staying bounded rather than climbing monotonically under steady load
  • recovery evidence: the latest successful backup timestamp, latest restore rehearsal timestamp, elapsed restore time, and the verification commands that proved the restored environment was ready

Store the packet alongside the deployment runbook, where the same operators can retrieve it. If a topology claims published benchmark numbers, alert semantics, or recovery timing without a matching soak packet, treat those numbers as provisional rather than trusted operating-envelope evidence.
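
For the "bounded rather than climbing monotonically" criterion, a crude first-pass signal is whether every sample in a series is higher than the one before it. The sketch below takes integer samples (for example worker memory in MiB at a fixed interval, which is an assumption about how you collect them) and reports a strict climb; the function name is hypothetical, and a real soak review would also look at trend and noise rather than this check alone.

```shell
# Sketch: succeed when every sample is strictly greater than the previous one.
# Usage: climbs_monotonically SAMPLE1 SAMPLE2 [SAMPLE3 ...]
climbs_monotonically() {
  [ "$#" -ge 2 ] || return 1   # fewer than two samples cannot show a climb
  local prev=$1 cur
  shift
  for cur in "$@"; do
    [ "$cur" -gt "$prev" ] || return 1
    prev=$cur
  done
  return 0
}
```

A success here under steady load is a reason to investigate, not proof of a leak; a single dip anywhere in the series makes the function return non-zero.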

End-to-end operator checklist

Use this checklist after upgrades and before trusting a new environment:

  1. Run php artisan workflow:v2:doctor --strict.
  2. Check GET /waterline/api/v2/health and confirm whether the state is ok, warning, or error.
  3. Read GET /waterline/api/stats for backlog, repair, history, command contract, worker compatibility, and projection drift facts.
  4. Run projection rebuild or command-contract backfill previews when health reports drift.
  5. Export one representative run and verify the archive/replay artifact path.
  6. Confirm archived runs leave active fleet views while durable rows remain available until retention cleanup.
  7. Rehearse the restore or failover sequence recorded in your deployment runbook and verify the measured lag matches the published expectation for your topology.
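
Step 2 of the checklist can be turned into a rollout gate by mapping the health state to an exit code: error blocks, warning is advisory, ok passes. The sketch below takes the state as a plain string so it does not depend on any particular JSON tooling; the function name and the exit-code convention are assumptions for illustration.

```shell
# Sketch: map the v2 health state to a gate decision.
# ok -> exit 0, warning -> exit 0 with an advisory note, error -> exit 1 (blocking).
health_gate() {
  case "$1" in
    ok)      echo "ok"; return 0 ;;
    warning) echo "advisory: review warnings before trusting this environment"; return 0 ;;
    error)   echo "blocking: do not roll out"; return 1 ;;
    *)       echo "unknown status: $1"; return 2 ;;
  esac
}
```

How you extract the state is up to your tooling. If jq is available and the response carries a top-level status field (an assumption about the payload shape), something like health_gate "$(curl -fsS "$BASE_URL/waterline/api/v2/health" | jq -r .status)" would feed it, where BASE_URL is a placeholder for your host.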