Durability

Durability is the core value proposition of Papayya. Every step is checkpointed so that runs survive crashes, deploys, and restarts.

Execution guarantees

Papayya provides at-least-once execution. This means:

  • Every task/step will execute at least once
  • If a crash occurs between executing a task and saving its checkpoint, the task may execute again on resume
  • Tasks that have side effects (sending emails, charging credit cards, writing to databases) should be idempotent — safe to run more than once

This is the same guarantee provided by systems like Temporal, Inngest, and most durable execution frameworks. Exactly-once execution of side effects is not generally achievable in a distributed system; at-least-once execution paired with idempotent tasks is the practical standard.

Rule of thumb: Design your tasks so that running them twice produces the same result as running them once.
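One common way to satisfy this rule is an idempotency key derived from the run and step, so that a crash-replay reuses the same key instead of repeating the side effect. The sketch below illustrates the pattern with an in-memory store; `payments` and `charge_card` are illustrative names, not Papayya APIs.

```python
# Sketch: making a side-effecting task idempotent with an idempotency key.
# `payments` stands in for any external system; names are illustrative.
payments = {}  # idempotency_key -> charge record

def charge_card(run_id: str, step_label: str, customer: str, amount_cents: int) -> dict:
    # A stable key from run + step means a retry after a crash reuses
    # the same key instead of charging twice.
    key = f"{run_id}:{step_label}"
    if key in payments:              # already charged on a previous attempt
        return payments[key]
    record = {"customer": customer, "amount_cents": amount_cents}
    payments[key] = record           # "perform" the charge once per key
    return record

first = charge_card("run-42", "charge", "cust_1", 500)
retry = charge_card("run-42", "charge", "cust_1", 500)  # crash-replay: no double charge
```

Running the task twice produces one charge, which is exactly the property at-least-once execution requires.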

How checkpointing works

Cloud

Worker picks up run from queue
  → Launches container with agent code
    → Container executes agent, reports each step to control plane
      → Step persisted to Postgres
        → If container crashes, heartbeat expires after 60s
          → Recovery sweep re-enqueues run
            → New container resumes from last persisted step

Local

run.task("label", fn) called
  → Check cache: if label already executed for this run_id, return cached result
  → Execute fn()
  → Save checkpoint to Postgres via control plane API
  → On process restart with same run_id: cached tasks skip, uncached tasks re-execute

Important: The checkpoint is saved after the function executes. If your process crashes between execution and checkpoint save, the function runs again on the next attempt.
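The local flow above amounts to a memoizing wrapper with an execute-then-save ordering. The sketch below captures that shape; a dict stands in for the Postgres-backed checkpoint store, and the function name is illustrative rather than the actual SDK surface.

```python
# Sketch of the local checkpointing flow: cache keyed by (run_id, label),
# checkpoint saved only after the function returns.
checkpoints: dict[tuple[str, str], object] = {}

def task(run_id: str, label: str, fn):
    key = (run_id, label)
    if key in checkpoints:          # already executed for this run_id: skip
        return checkpoints[key]
    result = fn()                   # execute first...
    checkpoints[key] = result       # ...then checkpoint; a crash between these
    return result                   # two lines means fn() runs again on resume

calls = []
def expensive():
    calls.append(1)
    return "done"

task("run-1", "step-a", expensive)
task("run-1", "step-a", expensive)   # resumed process: cached, fn not re-run
```

The second call returns the cached result without re-executing `expensive`, which is why a restart with the same run_id skips completed work.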

Worker model (Cloud)

The cloud path uses stateless workers:

  • Workers pull runs from a Redis-backed queue
  • Each run launches an isolated container
  • Container reports steps via HTTP to the control plane
  • If a container dies, heartbeat-based detection marks the run as failed within 60 seconds
  • Recovery sweep re-enqueues orphaned runs

Workers can be scaled horizontally, deployed, or restarted without losing progress on any run.
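Heartbeat-based detection plus the recovery sweep can be sketched as a periodic scan over run records. The 60-second timeout matches the description above; the in-memory dict and field names are illustrative stand-ins for the Postgres schema.

```python
import time

# Sketch of heartbeat-based orphan detection and the recovery sweep.
HEARTBEAT_TIMEOUT = 60.0  # seconds, per the description above

runs = {
    "run-1": {"status": "running", "last_heartbeat": time.time()},          # healthy
    "run-2": {"status": "running", "last_heartbeat": time.time() - 120.0},  # container died
}

def recovery_sweep(now: float) -> list[str]:
    """Re-enqueue runs whose heartbeat expired more than HEARTBEAT_TIMEOUT ago."""
    requeued = []
    for run_id, run in runs.items():
        if run["status"] == "running" and now - run["last_heartbeat"] > HEARTBEAT_TIMEOUT:
            run["status"] = "queued"     # back on the queue; a new container
            requeued.append(run_id)      # resumes from the last persisted step
    return requeued

orphans = recovery_sweep(time.time())
```

Because the sweep reads only persisted state, any worker can run it: the workers themselves stay stateless.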

Locking (Cloud)

To prevent two workers from executing the same run concurrently:

  • Each run has a locked_by and locked_until field
  • A worker acquires the lock atomically before launching a container
  • If a worker crashes (lock expires), a recovery sweep re-enqueues the run
  • The lock acquisition uses WHERE locked_by IS NULL OR locked_until < now(), so two workers can never hold the same run at once

What gets persisted

Data                         Storage    Purpose
Run state                    Postgres   Source of truth for run lifecycle
Step results / checkpoints   Postgres   Execution trace and replay
Tool call I/O                Postgres   Debugging
Queue notifications          Redis      Work distribution (rebuildable from Postgres)

Redis is a notification layer only. If Redis is lost, it can be rebuilt from Postgres. No run data is lost.
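Rebuilding the queue is straightforward because Postgres is the source of truth: the queue is just the set of runs that are still waiting for a worker. A minimal sketch, with a list of dicts standing in for the runs table:

```python
# Sketch: rebuilding the Redis queue from Postgres after a Redis wipe.
# The runs list stands in for the Postgres runs table; names are illustrative.
runs = [
    {"id": "run-1", "status": "queued"},
    {"id": "run-2", "status": "completed"},
    {"id": "run-3", "status": "queued"},
]

def rebuild_queue(runs: list[dict]) -> list[str]:
    # Redis held only notifications, so the queue is exactly the set of
    # runs the source of truth says are still waiting for a worker.
    return [r["id"] for r in runs if r["status"] == "queued"]

queue = rebuild_queue(runs)
```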

Budget enforcement

Budget is enforced at step boundaries:

  • Before each step, the system checks if the budget is exhausted
  • A single step (one LLM call) may exceed the remaining budget — the check happens before the call, but the call's cost isn't known until it completes
  • After the step, if budget is exceeded, the run pauses (cloud) or raises an error (local)

This means the actual spend may exceed the budget by the cost of one step. For most use cases this is a few cents. Set your budget with this margin in mind.
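The overshoot-by-one-step behavior follows directly from checking before the step while only learning the cost after it. A minimal sketch with illustrative numbers:

```python
# Sketch of budget enforcement at step boundaries: the check runs *before*
# each step, so the final step may overshoot by its own cost, which is
# unknown until the step completes. Names and numbers are illustrative.
class BudgetExceeded(Exception):
    pass

def run_steps(step_costs: list[float], budget: float) -> float:
    spent = 0.0
    for cost in step_costs:
        if spent >= budget:          # pre-step check: budget already exhausted
            raise BudgetExceeded(f"spent {spent:.2f}, budget {budget:.2f}")
        spent += cost                # the step's cost is only known afterwards
    return spent

# Budget of 1.00: the third step starts while spent is 0.90 < 1.00, so the
# actual spend ends above the budget by part of one step's cost.
total = run_steps([0.50, 0.40, 0.40], budget=1.00)
```

A fourth step would fail the pre-step check and pause (cloud) or raise (local), capping the overshoot at one step's cost.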

Durability by path

Feature                Cloud (@agent + deploy)                              Local (run.task())
Checkpointing          Every LLM call to Postgres (via shim interceptors)   Every run.task() to Postgres
Crash recovery         Heartbeat timeout + recovery sweep                   Resume with same run_id
Budget enforcement     Automatic (shim intercepts LLM calls)                Per-task check
Execution trace        Full (steps table with tokens, cost, tool calls)     Checkpoints (labels + results)
Resume after restart   Automatic                                            Manual (same run_id)