Recovery Protocol

How a logical execution survives the failure of a physical process — settlement chain, lease and heartbeat, and the invariants a conformant server maintains.

Recovery across time and space

The Distributed Recovery Protocol is responsible for the detection and recovery of crash failures.

What recovery means#

Recovery is the mechanism by which a logical execution survives the failure of a physical process. Distributed Async Await assumes a fail-stop process failure model: a process may non-deterministically terminate without warning. When that happens, the logical execution suspends and a successor physical execution can resume the work.

Recovery is the second of the two foundational protocols in Distributed Async Await:

  • The Coordination Protocol ensures coordination across partial order — the non-deterministic interleaving inherent to concurrent systems.
  • The Recovery Protocol ensures recovery across partial failure — the non-deterministic termination inherent to distributed systems.

Three load-bearing pieces#

The recovery semantics described below rest on three load-bearing pieces:

  1. Durable Promises persist execution state. The state required to recreate a function execution is stored externally to any single process. See the Durable Promise Specification.
  2. Function executions are interruption-tolerant. Re-running a durable function from the beginning, with deduplication via durable promises, converges to the same result as an uninterrupted execution. See the Durable Function Specification.
  3. Tasks claim work and report liveness. Workers claim tasks from the durable system, heartbeat to indicate progress, and surrender claims on crash. See the Tasks page for the full lifecycle.

Lease and heartbeat#

Liveness is enforced by a lease. When a worker acquires a task, the server records a deadline by which the worker must signal it is still making progress. Each successful heartbeat extends that deadline.

code
acquire     →  lease = now + leaseTimeout
heartbeat   →  lease = now + leaseTimeout
release     →  lease deleted, version++
lease expiry → release back to pending, version++

Heartbeat is process-level, not task-level: a worker sends one heartbeat for all the acquired tasks it holds, and the server refreshes every matching (id, version) pair in a single round-trip. Workers typically heartbeat at roughly half the lease interval to absorb transient network delay.

If a worker crashes — or loses the ability to make progress — its lease expires. The server transitions the task back to pending with an incremented version, deletes the lease, and re-enqueues the execute message. A successor worker can then claim the task with the new version.

The version is the optimistic concurrency control (OCC) token: every mutating task operation must present the current version, and the server returns 409 Conflict on mismatch. This guarantees a worker that has lost the lease cannot still effect a settle or a fulfill — the new version on the task means the late operation collides with whatever the successor is doing.

The settlement chain#

Every promise settlement — whether triggered by promise.settle, task.fulfill, task.fence, or a server-driven timeout — proceeds through the same three stages, each producing a distinct kind of effect. We call this the settlement chain because the stages form a linear causal order:

code
settle → resume → execute
  1. Settle (intra-object). The composite settles itself. The promise transitions to its terminal state, value and settledAt are recorded, the promise timeout is deleted, the task (if present) transitions to fulfilled, and the callbacks and listeners are consumed. External side effect: one unblock message is sent to each listener address.

  2. Resume (inter-object, internal side effect). For each callback that was registered on the settled promise, enqueue a resume on the awaiter's task — mutating another composite's task state within the server.

  3. Execute (external side effect). When the resume transitions an awaiter task from suspended to pending, the server sends an execute message to that task's resonate:target address.

The chain is linear, not recursive. Settling composite X does not settle any other composite. It may resume awaiter tasks, which may eventually lead those composites to settle through their own future operations — but not as part of this settlement.

The essential guarantee: when a promise settles, all dependents are reliably notified.

Properties a conformant implementation must guarantee#

A Recovery Protocol implementation must satisfy:

  • Liveness under failure. A logical execution's progress is bounded by the availability of some worker, not any specific worker. If a worker crashes, a successor must be able to resume.
  • Safety under recovery. A re-invoked durable function must not externalize observable side effects twice. Side-effect deduplication is provided by the durable promises representing each step.
  • Bounded re-execution. A worker that has lost the ability to make progress must yield its task claim within a bounded time so a successor can take over without indefinite waiting. The lease enforces this bound.

Server invariants#

Recovery rests on structural invariants the server maintains over its promise, task, and callback stores at all times. A violation indicates a bug in the implementation, not a recoverable runtime condition.

Promise invariants#

InvariantCondition
orphan_invokesEvery pending promise with resonate:target must have a task.
missing_ptimeoutEvery pending promise must have a promise timeout entry.
stale_ptimeoutNo promise timeout should exist for a non-pending promise.
timedout_settled_at_mismatchEvery rejected_timedout promise must have settledAt == timeoutAt.
listener_for_settled_promiseNo listener should exist for a non-pending promise.

Task invariants#

InvariantCondition
orphan_tasksEvery task must have a corresponding promise.
pending_task_no_ttimeoutEvery pending task must have a retry timeout.
acquired_task_no_leaseEvery acquired task must have a lease.
suspended_no_callbackEvery suspended task must have at least one callback registered on an awaited promise.
suspended_with_consumed_callbacksNo suspended task should have any consumed callbacks.
suspended_task_has_ttimeoutNo suspended task should have any timeout.
fulfilled_task_has_ttimeoutNo fulfilled task should have any timeout.

Callback invariant#

InvariantCondition
callback_awaiter_no_targetEvery callback's awaiter promise must have a resonate:target tag.

See also#

  • Coordination Protocol — the sibling protocol covering partial order.
  • Message Passing Protocol — the underlying transport layer including the wire envelope and address scheme.
  • Tasks — the lifecycle and operations recovery operationalizes.
  • Processes — the failure model assumed by recovery.