Durable Kafka workers with crash recovery

A worker that consumes from Kafka, dispatches a durable workflow for each message, and resumes cleanly after crashes. Redpanda works as a drop-in replacement for the broker.


A consumer pulls messages from a Kafka topic and dispatches each one through a durable Resonate workflow. Per-message state lives in durable promises, so a worker that dies mid-batch resumes cleanly without redoing completed steps.

SDK versions

TypeScript: @resonatehq/sdk v0.10.1 (current). Python: resonate-sdk v0.6.x against the legacy Resonate Server. Rust: v0.1.0, in active development.

Kafka or Redpanda — same code

All three repos run against Redpanda for local development because the broker boots in seconds, but the code uses the Kafka wire protocol — kafkajs for TypeScript, confluent-kafka-python for Python, rdkafka for Rust. Point any of them at any Kafka-compatible broker — Apache Kafka, Redpanda, or a managed offering — by changing the bootstrap-server config.
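As a sketch of what that swap looks like in the TypeScript repo (helper name and env variable are hypothetical; kafkajs is the client named above), the only broker-specific input is the bootstrap list:

```typescript
// Hypothetical helper: turn a BOOTSTRAP_SERVERS value such as
// "localhost:9092" or "host1:9092,host2:9092" into the broker list
// kafkajs expects. Identical for Apache Kafka, Redpanda, or a managed broker.
export function parseBrokers(bootstrap: string): string[] {
  return bootstrap
    .split(",")
    .map((b) => b.trim())
    .filter((b) => b.length > 0);
}

// Usage with kafkajs (assumed):
// const kafka = new Kafka({
//   clientId: "worker",
//   brokers: parseBrokers(process.env.BOOTSTRAP_SERVERS ?? "localhost:9092"),
// });
```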

The problem

A worker sitting between two queues has two recurring problems. Head-of-line blocking: if the first message kicks off an hour-long task and the next one would take 30 seconds, the short job waits behind the long one unless the worker handles concurrency. Repeated work on crashes: if a multi-step task is half-done when the worker dies, a restart usually means redoing completed steps and emitting duplicate side effects downstream.

Most teams write per-message dedup tables, idempotency keys, and recovery scaffolding — repeatedly, and rarely well.

Resonate's solution

Each Kafka message becomes the ID of a durable workflow. Resonate runs the workflow concurrently with whatever else is in flight, checkpoints every step, and dedups by ID — so a retry of a partially-completed message reconnects to its in-flight execution rather than starting over. When the workflow finishes, the worker emits a "done" message to the next topic.
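The dedupe-by-ID guarantee can be pictured with a toy in-memory version (a sketch of the idea only, not the SDK's implementation, which persists promises durably in the Resonate Server):

```typescript
// Toy illustration of dedupe-by-id: the first call for an id starts the work;
// any later call with the same id reattaches to the same in-flight promise
// instead of starting over. Resonate does this durably, across process restarts.
const inflight = new Map<string, Promise<string>>();

export function beginRunOnce(id: string, work: () => Promise<string>): Promise<string> {
  const existing = inflight.get(id);
  if (existing) return existing; // duplicate delivery: reconnect, don't rerun
  const p = work();
  inflight.set(id, p);
  return p;
}
```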

Code walkthrough

The pattern is a small consume loop that hands each message to resonate.beginRun (begin_run in the Python SDK). The workflow itself does the per-message work: here, deleting batches of records until none remain, then emitting a completion event.

The durable workflow

src/workflow.ts
import { Resonate, type Context } from "@resonatehq/sdk";

export const resonate = new Resonate({
  url: "http://localhost:8001",
  group: "workers",
});

function* workflow(ctx: Context, recordId: string, offset: string) {
  console.log(`processing record ${recordId} (offset ${offset})`);
  // Loop until deleteBatch reports no more rows.
  while (yield* ctx.run(deleteBatch, recordId)) {
    yield* ctx.sleep(5_000);
  }
  yield* ctx.run(enqueueCompletion, recordId, offset);
}

resonate.register("workflow", workflow);

Each ctx.run is a checkpoint. A worker crash mid-loop resumes from the next batch rather than from message zero, and deleteBatch (delete_batch in the Python repo) doesn't run again for batches that already completed.
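The workflow calls two helpers the snippet doesn't show. The bodies below are hypothetical in-memory stand-ins that only demonstrate the expected shape: deleteBatch removes up to one batch and reports whether rows remain, enqueueCompletion emits the done event.

```typescript
// In-memory stand-in for a record's remaining rows; real code would hit a database.
const rows = new Map<string, number>();
const BATCH_SIZE = 100;

// Matches the call ctx.run(deleteBatch, recordId): Resonate invokes the function
// with the context first. Returns true while more rows remain, so the workflow loops.
export async function deleteBatch(_ctx: unknown, recordId: string): Promise<boolean> {
  const remaining = rows.get(recordId) ?? 0;
  const left = Math.max(0, remaining - BATCH_SIZE);
  rows.set(recordId, left);
  return left > 0;
}

// Real code would producer.send() a "done" message to the next topic here.
export async function enqueueCompletion(_ctx: unknown, recordId: string, offset: string): Promise<void> {
  console.log(`record ${recordId} (offset ${offset}) complete`);
}
```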

The Kafka consumer

src/consumer.ts
const consumer = kafka.consumer({ groupId: CONSUMER_GROUP_ID });
await consumer.connect();
await consumer.subscribe({ topic: TOPIC_TO_DELETE, fromBeginning: true });

await consumer.run({
  eachMessage: async ({ message }) => {
    const recordId = JSON.parse(message.value!.toString("utf-8")) as string;
    // beginRun is non-blocking. The recordId acts as the durable promise id,
    // so duplicate invocations dedupe.
    await resonate.beginRun(recordId, "workflow", recordId, message.offset);
  },
});

beginRun / begin_run / .spawn() is non-blocking — the consumer keeps polling while the workflow runs in the background. The same recordId deduplicates retries by design.

Run it locally

All three repos ship a docker-compose.yml for Redpanda (it boots faster locally than Apache Kafka; the worker code is identical for both brokers).

git clone https://github.com/resonatehq-examples/example-kafka-worker-ts
cd example-kafka-worker-ts
bun install

# Terminal 1: the broker
docker compose up

# Terminal 2: Resonate Server
brew install resonatehq/tap/resonate
resonate dev

# Terminal 3: the worker
bun run consumer

# Terminal 4: produce records
bun run producer -- -n 10

Watch the worker terminal — it consumes each message, runs the durable workflow, and re-emits a completion event when each finishes.
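The consumer JSON-parses each message value into a plain string id, so the producer must encode it the same way. A minimal sketch (function name hypothetical; the topic constant comes from the repo's config):

```typescript
// Encode a record id the way src/consumer.ts decodes it:
// JSON.parse(message.value.toString("utf-8")) must yield a plain string.
export function encodeRecordValue(recordId: string): string {
  return JSON.stringify(recordId);
}

// Usage with a kafkajs producer (assumed):
// await producer.send({
//   topic: TOPIC_TO_DELETE,
//   messages: [{ value: encodeRecordValue("record-42") }],
// });
```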

Try the recovery story

While the worker is processing a batch, kill it. Restart it. Resonate replays the workflow from the last unfinished step — already-deleted batches are not redone, the loop continues, the completion event still fires exactly once.