Rolling Upgrades
Run a rolling upgrade when you want to replace API nodes, workers, or the scheduler without taking the deployment offline. This contract covers the small clustered shape from the self-hosting deployments guide: two or three API nodes behind a load balancer, shared external MySQL or PostgreSQL, shared Redis, independently scaled workers, and exactly one scheduler or maintenance runner.
A rolling upgrade is supported when every guarantee on this page holds. Outside that envelope, use the documented stop-the-world flow in the deployment guide instead.
The role vocabulary and shape manifest behind this contract are documented in Server Role Topology. This guide focuses on how the current standalone_server process classes roll in place.
What rolling upgrade means here
A rolling upgrade replaces processes one at a time, draining each one before stopping it, while the rest of the fleet keeps serving traffic. The result is zero downtime for the deployment as a whole and a bounded overlap window where old and new processes coexist.
The contract distinguishes four process classes in the current standalone_server shape, each with its own rollout posture:
- HTTP/API nodes: stateless server processes serving HTTP traffic and currently hosting the api_ingress, control_plane, matching, and history_projection roles.
- Workers: SDK processes that poll the worker plane and execute activity and workflow tasks as the execution_plane.
- Scheduler / maintenance runner: the singleton process that fires schedules and drives activity-timeout and history retention as the scheduler role.
- Bootstrap: the one-shot process that runs database migrations and default-namespace seeding.
API nodes and workers roll independently. The scheduler is a singleton — you stop the old one before starting the new one, but the rest of the deployment keeps serving traffic across that gap.
Compatible version-skew rules
The rolling-upgrade overlap window is the time during which more than one server image, workflow package version, or worker SDK version is live. The window must satisfy every rule in this section.
Server image and workflow package
- Adjacent versions only. During a rolling upgrade, every API node and worker must run a server image whose workflow package version is the same major version as the cluster's previous package and within one minor version of every other live process. Skipping a major version requires a stop-the-world upgrade.
- Forward-additive migrations. Every Durable Workflow v2 schema change is additive within a major version. New nodes must not require a column or table that has not been migrated in. Old nodes must not break when a new column they do not read is present. The migration order rules below enforce this.
- Adjacent control-plane and worker-protocol versions. Every node publishes its supported control_plane.version and worker_protocol.version ranges from GET /api/cluster/info. During a rolling upgrade, the new image's supported range must overlap with every live old node's supported range. Discover the range before the rollout, and abort if it does not overlap.
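The skew rules above reduce to two small checks that can run before a rollout starts. The sketch below is illustrative only: it assumes package versions are available as (major, minor) integer pairs and that each supported protocol range is an inclusive (min, max) integer pair. The actual field layout returned by GET /api/cluster/info is not shown on this page, so extracting those values is left to the caller, and all function names are hypothetical.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int]  # assumed shape: (major, minor) of the workflow package
Range = Tuple[int, int]    # assumed shape: inclusive (min, max) supported protocol version


def within_adjacent_skew(versions: Iterable[Version]) -> bool:
    """True when all live versions share one major and span at most one minor."""
    versions = list(versions)
    if not versions:
        return True
    majors = {major for major, _ in versions}
    minors = [minor for _, minor in versions]
    return len(majors) == 1 and max(minors) - min(minors) <= 1


def ranges_overlap(a: Range, b: Range) -> bool:
    """True when two inclusive ranges share at least one version."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def safe_to_roll(new_range: Range, live_old_ranges: Iterable[Range]) -> bool:
    """The new image must overlap with every live old node, not just one of them."""
    return all(ranges_overlap(new_range, old) for old in live_old_ranges)
```

Run the same overlap check once for control_plane.version and once for worker_protocol.version, and abort the rollout if either returns False.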
Worker SDK and build identity
- Workers tag every build. Every worker that may participate in a rolling upgrade must register through POST /api/worker/register with a stable build_id. The unversioned cohort (build_id: null) is the pre-rollout default; the worker build-id rollout guide explains the first cutover.
- Workflow definition fingerprints stay pinned. Server images carrying DW_V2_PIN_TO_RECORDED_FINGERPRINT=true (the default) keep in-flight runs pinned to the workflow definition fingerprint recorded at WorkflowStarted. A new worker that ships a different fingerprint for the same workflow refuses to claim those runs until they finish.
- Overlapping-build admission posture. Choose the DW_V2_FLEET_VALIDATION_MODE posture before you start. warn lets the rollout proceed even when the required compatibility marker has no live worker; fail blocks dispatch and fails the readiness contract closed during that window. Production rollouts that require a clean cutover should be on fail.
Schema/bootstrap ordering
Schema changes ride a single bootstrap pass. Order matters.
- Run bootstrap first, exactly once. Run php artisan server:bootstrap --force (or the published-image equivalent) from one container before starting any new API node, worker, or scheduler. Bootstrap runs migrate plus default-namespace seeding; it adopts any workflow package migrations whose tables already exist on the connection.
- New schema must be backwards-compatible with old code. Every v2 migration in this release path is additive (new tables, new columns, new indexes). Old API nodes and workers continue running against the new schema.
- Do not start new code before bootstrap completes. Roll the new server image only after bootstrap exits successfully. New API nodes and workers may rely on the freshly migrated tables, and starting them before bootstrap finishes is the most common cause of a 5xx surge during the cutover.
- Bootstrap is idempotent. Re-running it is safe. If bootstrap fails partway, fix the underlying error and re-run; the migration ledger picks up where it left off, and the namespace seed is a no-op when the row exists.
A migration that lands on one server before another MUST NOT corrupt the readiness surface. If you discover a non-additive change during planning, take a stop-the-world upgrade window for that release and return to rolling upgrades on the next one.
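The ordering rule (bootstrap exits successfully before any new code starts) can be enforced with a small gate in a deploy script. This is a minimal sketch, not part of the product: bootstrap_then_roll and the roll_new_image callback are hypothetical names, and the only command used is the bootstrap invocation quoted above.

```python
import subprocess
from typing import Callable, Sequence

# The bootstrap command as documented in this section.
BOOTSTRAP_CMD = ["php", "artisan", "server:bootstrap", "--force"]


def bootstrap_then_roll(
    roll_new_image: Callable[[], None],
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> bool:
    """Run bootstrap once; start new code only if it exits with code 0.

    `roll_new_image` is a hypothetical callback that begins the rollout.
    `run` is injectable so the gate can be exercised without a real server.
    """
    if run(BOOTSTRAP_CMD) != 0:
        # Bootstrap is idempotent: fix the underlying error, then call again.
        return False
    roll_new_image()
    return True
```

The gate deliberately does not retry on its own, because a failed bootstrap needs the underlying error fixed before the re-run.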
Drain and admission during the overlap window
Three admission surfaces enforce overlap-window safety automatically:
- Boot-time admission. Every server process loads BackendCapabilities, LongPollCacheValidator, WorkflowModeGuard, and the readiness contract at boot. A process whose backend or cache cannot satisfy the v2 contract refuses to mark itself ready.
- Worker compatibility. When DW_V2_FLEET_VALIDATION_MODE=fail and no live worker advertises the required compatibility marker for a task's connection and queue scope, the matching role blocks dispatch and the worker_compatibility health check escalates from warning to error. Tasks stay ready and visible (they are never dropped), and the readiness contract returns 503 on that node so the load balancer takes it out of rotation.
- Routing safety. A ready task whose required compatibility has no live worker is preserved and counted under the compatibility_blocked_runs backlog metric. Routing safety never silently escalates to task loss; the at-least-once execution guarantee still applies after a routing drain.
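The worker-compatibility behavior described above can be summarized as a small decision table. The sketch below models the documented outcomes only; it is not the server's implementation. The "ok" label for the healthy state is an assumption, as is reading warn mode with no supporting worker as a warning that leaves readiness at 200 (inferred from the check escalating from warning to error under fail).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdmissionOutcome:
    dispatch_blocked: bool  # does the matching role hold tasks back?
    health: str             # state of the worker_compatibility check
    ready_status: int       # what GET /api/ready returns for this check
    tasks_preserved: bool   # tasks are never dropped in any branch


def worker_compatibility_outcome(mode: str, live_supporters: int) -> AdmissionOutcome:
    """Illustrative model of DW_V2_FLEET_VALIDATION_MODE for one required marker."""
    if mode not in ("warn", "fail"):
        raise ValueError(f"unknown validation mode: {mode!r}")
    if live_supporters > 0:
        return AdmissionOutcome(False, "ok", 200, True)  # "ok" is an assumed label
    if mode == "fail":
        return AdmissionOutcome(True, "error", 503, True)
    return AdmissionOutcome(False, "warning", 200, True)
```

In every branch the task stays preserved; the modes differ only in whether the missing worker is surfaced as a warning or as a closed readiness check.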
To keep the worker overlap window short, drain old worker cohorts as new workers come online. Use the worker build-id rollout flow:
dw task-queue:drain orders-critical --build-id orders-worker-2026-04-21-z9
The cohort's drain_intent flips to draining. Workers under that
build keep finishing in-flight work but stop claiming new tasks. Wait
for the cohort's active_worker_count and draining_worker_count to
reach zero before stopping the old worker processes.
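Waiting for the drain to finish is a polling loop over the two counters. The following sketch assumes a caller-supplied fetch_cohort function that returns one cohort as a mapping containing active_worker_count and draining_worker_count; how that mapping is obtained (for example by parsing the build-ids JSON output) is not specified here, and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable, Mapping


def cohort_drained(cohort: Mapping) -> bool:
    """A cohort is safe to stop once both counters have reached zero."""
    return cohort["active_worker_count"] == 0 and cohort["draining_worker_count"] == 0


def wait_for_drain(
    fetch_cohort: Callable[[], Mapping],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the cohort is drained; return False if the timeout passes first."""
    deadline = clock() + timeout_s
    while True:
        if cohort_drained(fetch_cohort()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Only stop the old worker processes when wait_for_drain returns True; a False result means in-flight work may still be running under the old build.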
The scheduler does not run a long-lived task queue, so it does not need a worker drain. Stop the old scheduler container, run bootstrap if it has not run yet, and start the new one. The window between the two is bounded by how long it takes the new container to come up; schedule firing resumes from the persisted state on next tick.
If the rollout also introduces a dedicated matching-role deployment, make that
topology change explicit instead of assuming every execution node will keep
doing broad ready-task sweeps. See
Task Matching and Dispatch for
the documented workflow:v2:repair-pass --loop plus
DW_V2_MATCHING_ROLE_QUEUE_WAKE=0 shape. Verify the live node contract from
GET /api/cluster/info: topology.current_shape should still match the
deployment you are cutting over, topology.current_roles should still match
the documented role bundle for that node, and
topology.matching_role.queue_wake_enabled,
topology.matching_role.shape, and topology.matching_role.wake_owner
should show the expected broad-ready-task owner. The default shape reports
queue_wake_enabled: true, shape: "in_worker", and
wake_owner: "worker_loop"; dedicated matching rollouts flip execution nodes
to queue_wake_enabled: false, shape: "dedicated", and
wake_owner: "dedicated_repair_pass".
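The matching-role check above amounts to comparing three fields against the expected shape. A minimal sketch follows, assuming the dotted names map to nested JSON objects in the GET /api/cluster/info response (that nesting is an assumption, and the function name is hypothetical). The expected values are the ones quoted in this section.

```python
from typing import Any, List, Mapping

# Values as documented for the default and dedicated matching shapes.
DEFAULT_MATCHING_ROLE = {
    "queue_wake_enabled": True,
    "shape": "in_worker",
    "wake_owner": "worker_loop",
}
DEDICATED_MATCHING_ROLE = {
    "queue_wake_enabled": False,
    "shape": "dedicated",
    "wake_owner": "dedicated_repair_pass",
}


def matching_role_mismatches(
    cluster_info: Mapping[str, Any], expected: Mapping[str, Any]
) -> List[str]:
    """Return one message per field that differs from the expected shape."""
    actual = cluster_info.get("topology", {}).get("matching_role", {})
    return [
        f"topology.matching_role.{key}: expected {want!r}, got {actual.get(key)!r}"
        for key, want in expected.items()
        if actual.get(key) != want
    ]
```

An empty list means the node reports the intended broad-ready-task owner; any message is a reason to pause the cutover.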
Readiness and cutover
Use the readiness contract — not just the liveness probe — to decide when traffic flows to a node.
- GET /api/health proves the process is serving HTTP.
- GET /api/ready proves the process can use its configured runtime dependencies, including migrations, default namespace, and (under DW_V2_FLEET_VALIDATION_MODE=fail) the worker-compatibility admission check.
- GET /api/cluster/info proves an authenticated client can discover build identity, control-plane protocol, worker protocol, payload codecs, and server capabilities.
- POST /api/worker/register proves workers can authenticate into the expected namespace and task queue.
Cutover sequence for one API node:
- Take the node out of the load balancer rotation. The simplest path is to fail the load balancer's readiness probe by stopping the new image's pre-start hook before bringing the new container up.
- Drain in-flight HTTP requests. Most clients retry on connection reset; long-running connections (worker long-polls) reconnect against the rest of the fleet.
- Stop the old container, start the new one.
- Wait for GET /api/ready to return 200 and for GET /api/cluster/info to advertise the new build identity. When the rollout changes the matching topology, also confirm topology.matching_role.task_dispatch_mode, topology.matching_role.queue_wake_enabled, topology.matching_role.shape, topology.matching_role.wake_owner, topology.matching_role.partition_primitives, and topology.matching_role.backpressure_model match the intended deployment before returning the node to traffic. Use /api/system/operator-metrics when you want the same node-local matching-role contract alongside live backlog, repair, and worker counters from the responding process.
- Return the node to rotation.
Repeat one node at a time. Do not roll the next node until the previous one is back in rotation and serving traffic cleanly.
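The one-node-at-a-time rule can be sketched as a loop that halts at the first node that does not come back ready. All four callbacks below are hypothetical hooks into your own load balancer and container tooling; is_ready stands in for the GET /api/ready and GET /api/cluster/info checks, and remove_from_rotation is expected to include draining in-flight requests.

```python
from typing import Callable, Iterable, List


def roll_api_nodes(
    nodes: Iterable[str],
    remove_from_rotation: Callable[[str], None],
    replace_container: Callable[[str], None],
    is_ready: Callable[[str], bool],
    return_to_rotation: Callable[[str], None],
) -> List[str]:
    """Roll nodes strictly in sequence; return the nodes that completed."""
    rolled: List[str] = []
    for node in nodes:
        remove_from_rotation(node)   # also drains in-flight requests
        replace_container(node)      # stop old container, start new one
        if not is_ready(node):
            # Leave the failed node out of rotation and touch nothing else.
            break
        return_to_rotation(node)
        rolled.append(node)
    return rolled
```

Stopping at the first failure keeps the remaining nodes on the old image, so the fleet continues serving while you decide whether to roll back.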
For workers, the cutover is per-cohort:
- Bring the new worker cohort online with a new build_id.
- Confirm both cohorts report rollout_status: "active" and non-zero active_worker_count from dw task-queue:build-ids <queue> --json.
- Drain the old cohort with dw task-queue:drain.
- Wait until the old cohort's active_worker_count and draining_worker_count reach zero.
- Stop the old worker processes.
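The second step above (confirm both cohorts are active before draining) works well as an explicit precondition. The sketch assumes the build-ids JSON output can be parsed into a list of mappings with build_id, rollout_status, and active_worker_count keys. The exact output layout is not documented on this page, so treat that shape and the function name as assumptions.

```python
from typing import Iterable, Mapping


def safe_to_drain_old(
    cohorts: Iterable[Mapping], old_build_id: str, new_build_id: str
) -> bool:
    """True when both cohorts are active with at least one live worker each."""
    by_id = {cohort["build_id"]: cohort for cohort in cohorts}

    def serving(build_id: str) -> bool:
        cohort = by_id.get(build_id)
        return (
            cohort is not None
            and cohort.get("rollout_status") == "active"
            and cohort.get("active_worker_count", 0) > 0
        )

    return serving(old_build_id) and serving(new_build_id)
```

Draining the old cohort before the new one has live workers would leave the queue with no active claimers, which is the situation this check is meant to catch.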
Rollback
Every step is reversible. Plan for rollback before you start.
- Bootstrap rollback. v2 migrations are reversible by the standard Laravel down() path. A rollback that reverts a migration the new image relies on requires stopping every new node first; otherwise the new code observes a missing column and the readiness contract returns 503 on those nodes. Most rollbacks do not need to reverse migrations because schema changes are additive.
- API node rollback. Stop the new container, restart the old one on the same node. Take the node out of rotation while it boots and return it once GET /api/ready is green. Repeat for any other upgraded API nodes. Old code keeps reading the new schema cleanly because the schema change was additive.
- Worker rollback. Resume the previously drained cohort, drain the bad cohort, and scale the known-good build back up:
  dw task-queue:resume orders-critical --build-id orders-worker-2026-04-21-z9
  dw task-queue:drain orders-critical --build-id orders-worker-2026-04-22
  Resume clears drain_intent and drained_at, and any worker heartbeating under the resumed build_id flips back to active on the next poll. Both calls are idempotent.
If the rollout exposed a non-additive schema problem, take a stop-the-world upgrade window to roll back, run the corrective migrations, and re-plan.
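For scripted rollbacks, the two worker rollback calls above can be generated in a fixed order. The helper below is a hypothetical convenience that only assembles the documented dw task-queue:resume and dw task-queue:drain invocations as argument lists; it does not execute them. Issuing the resume first follows the order given in this section, presumably so the known-good cohort is claiming again before the bad one stops.

```python
from typing import List


def worker_rollback_commands(
    queue: str, good_build_id: str, bad_build_id: str
) -> List[List[str]]:
    """Build the resume-then-drain command pair for a worker rollback."""
    return [
        ["dw", "task-queue:resume", queue, "--build-id", good_build_id],
        ["dw", "task-queue:drain", queue, "--build-id", bad_build_id],
    ]


# Example using the queue and build ids from this section.
commands = worker_rollback_commands(
    "orders-critical",
    "orders-worker-2026-04-21-z9",
    "orders-worker-2026-04-22",
)
```

Because both calls are idempotent, re-running the pair after a partial failure is safe.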
Operator verification
Verify each phase of the rollout from operator surfaces, not from logs.
| Question | Surface |
|---|---|
| Is bootstrap finished? | php artisan server:bootstrap --force exit code 0; migrate:status shows every migration ran. |
| Is the new node ready? | GET /api/ready returns 200; GET /api/cluster/info reports the new build. |
| Is compatibility admission healthy? | GET /api/system/operator-metrics workers.fleet, workers.active_workers, and workers.active_workers_supporting_required agree on a non-zero supporter count for every required compatibility marker. |
| Is the worker drain progressing? | dw task-queue:build-ids <queue> --json shows active_worker_count and draining_worker_count falling for the draining cohort. |
| Is routing safe? | GET /api/system/operator-metrics backlog.compatibility_blocked_runs and backlog.max_compatibility_blocked_age_ms stay near zero; the worker_compatibility health check is not in error. |
| Is the scheduler caught up? | GET /api/system/operator-metrics schedules.missed is zero and schedules.oldest_overdue_at is null. |
| Are stuck runs piling up? | GET /api/system/operator-metrics runs.repair_needed and runs.max_repair_needed_age_ms stay near their pre-rollout baseline. |
dw system:operator-metrics --json exposes the same operator-metrics snapshot
on the console for the standalone-server fleet, so operators may pick whichever
surface matches their existing workflow. The standalone-server distribution
does not run Waterline; embedded Laravel deployments read the matching
dashboard signals under /waterline/api/v2/health and /waterline/api/stats,
as documented in the
Operator Operating Envelope.
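Several rows of the table above can be checked mechanically from one operator-metrics snapshot. The sketch below assumes the dotted metric names correspond to nested JSON objects, which is not confirmed on this page. The max_blocked_runs threshold is a hypothetical stand-in for "near zero" and should be tuned to your baseline.

```python
from typing import Any, List, Mapping


def lookup(metrics: Mapping[str, Any], dotted_path: str) -> Any:
    """Resolve a dotted name such as 'schedules.missed' in nested mappings."""
    node: Any = metrics
    for part in dotted_path.split("."):
        if not isinstance(node, Mapping) or part not in node:
            raise KeyError(dotted_path)
        node = node[part]
    return node


def rollout_concerns(metrics: Mapping[str, Any], max_blocked_runs: int = 0) -> List[str]:
    """Return a list of human-readable concerns; empty means the checks pass."""
    concerns: List[str] = []
    if lookup(metrics, "backlog.compatibility_blocked_runs") > max_blocked_runs:
        concerns.append("routing: compatibility-blocked runs above threshold")
    if lookup(metrics, "schedules.missed") != 0:
        concerns.append("scheduler: missed schedules")
    if lookup(metrics, "schedules.oldest_overdue_at") is not None:
        concerns.append("scheduler: overdue schedule present")
    return concerns
```

A missing key raises KeyError instead of passing silently, so a changed response shape shows up as a failure and not as a false green.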
Failure modes and what to do
| Symptom | Likely cause | Action |
|---|---|---|
| New API node fails GET /api/ready after start | Bootstrap did not finish, or DW_V2_FLEET_VALIDATION_MODE=fail and no compatible worker is live yet | Rerun bootstrap; bring a compatible worker cohort online before adding the API node back to rotation. |
| worker_compatibility health check escalates to error mid-rollout | The required compatibility marker has no live supporting worker | Bring more workers under a supporting build_id online; resume a previously drained cohort if rollback is the right call. |
| backlog.compatibility_blocked_runs climbs and max_compatibility_blocked_age_ms grows | Tasks are queued for a marker no live worker supports | Same as above; tasks are preserved and will redispatch automatically once a compatible worker heartbeats. |
| dw task-queue:drain returns success but workers keep claiming tasks | The worker process did not heartbeat after the drain | Wait one heartbeat cycle; if the cohort stays active, restart the worker process so it picks up the drain intent. |
| Schedule fires stop after a scheduler restart | Old and new scheduler are both stopped | Start the new scheduler container; verify schedules.missed returns to zero on next operator-metrics scrape. |
If a symptom is not on this list, treat the rollout as failed: stop adding new processes, drain whatever new cohorts you started, and restore the previous build before debugging further.
Related references
- Self-Hosting Deployments for the deployment shapes this contract assumes.
- Worker Build-Id Rollout for the per-cohort drain and resume calls.
- Operator Operating Envelope for the diagnostic, queue, and rebuild contract that operators read alongside the rollout signals.
- Server Config Reference for the rollout-safety environment variables (DW_V2_FLEET_VALIDATION_MODE, DW_V2_PIN_TO_RECORDED_FINGERPRINT, DW_V2_GUARDRAILS_BOOT, DW_V2_CACHE_VALIDATION_MODE, DW_V2_MULTI_NODE, DW_V2_VALIDATE_CACHE_BACKEND, DW_V2_TASK_REPAIR_*).
- Server API Reference for the readiness, cluster info, and operator-metrics endpoints used to verify each phase of the rollout.