Durability

Durability is the core value proposition of Papayya. Every step is checkpointed so that runs survive crashes, deploys, and restarts.

Execution guarantees

Papayya provides at-least-once execution. This means:

  • Every task/step will execute at least once
  • If a crash occurs between executing a task and saving its checkpoint, the task may execute again on resume
  • Tasks that have side effects (sending emails, charging credit cards, writing to databases) should be idempotent — safe to run more than once

This is the same guarantee provided by systems like Temporal, Inngest, and most durable execution frameworks. Exactly-once execution of side effects is not generally achievable in a distributed system; at-least-once execution paired with idempotent tasks is the practical standard.

Rule of thumb: Design your tasks so that running them twice produces the same result as running them once.
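One common way to satisfy this rule is an idempotency key derived from the run and step, so that a crash-replay reuses the same key instead of repeating the side effect. The sketch below illustrates the pattern with an in-memory store; `payments` and `charge_card` are illustrative names, not Papayya APIs.

```python
# Sketch: making a side-effecting task idempotent with an idempotency key.
# `payments` stands in for any external system; names are illustrative.
payments = {}  # idempotency_key -> charge record

def charge_card(run_id: str, step_label: str, customer: str, amount_cents: int) -> dict:
    # A stable key from run + step means a retry after a crash reuses
    # the same key instead of charging twice.
    key = f"{run_id}:{step_label}"
    if key in payments:              # already charged on a previous attempt
        return payments[key]
    record = {"customer": customer, "amount_cents": amount_cents}
    payments[key] = record           # "perform" the charge once per key
    return record

first = charge_card("run-42", "charge", "cust_1", 500)
retry = charge_card("run-42", "charge", "cust_1", 500)  # crash-replay: no double charge
```

Running the task twice produces one charge, which is exactly the property at-least-once execution requires.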

How checkpointing works

Cloud

Worker picks up run from queue
  → Launches container with agent code
    → Container executes agent, reports each step to control plane
      → Step persisted to Postgres
        → If container crashes, heartbeat expires after 60s
          → Recovery sweep re-enqueues run
            → New container resumes from last persisted step

Local

run.task("label", fn) called
  → Check cache: if label already executed for this run_id, return cached result
  → Execute fn()
  → Save checkpoint to Postgres via control plane API
  → On process restart with same run_id: cached tasks skip, uncached tasks re-execute

Important: The checkpoint is saved after the function executes. If your process crashes between execution and checkpoint save, the function runs again on the next attempt.
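The local flow above amounts to a memoizing wrapper with an execute-then-save ordering. The sketch below captures that shape; a dict stands in for the Postgres-backed checkpoint store, and the function name is illustrative rather than the actual SDK surface.

```python
# Sketch of the local checkpointing flow: cache keyed by (run_id, label),
# checkpoint saved only after the function returns.
checkpoints: dict[tuple[str, str], object] = {}

def task(run_id: str, label: str, fn):
    key = (run_id, label)
    if key in checkpoints:          # already executed for this run_id: skip
        return checkpoints[key]
    result = fn()                   # execute first...
    checkpoints[key] = result       # ...then checkpoint; a crash between these
    return result                   # two lines means fn() runs again on resume

calls = []
def expensive():
    calls.append(1)
    return "done"

task("run-1", "step-a", expensive)
task("run-1", "step-a", expensive)   # resumed process: cached, fn not re-run
```

The second call returns the cached result without re-executing `expensive`, which is why a restart with the same run_id skips completed work.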

Worker model (Cloud)

The cloud path uses stateless workers:

  • Workers pull runs from a Redis-backed queue
  • Each run launches an isolated container
  • Container reports steps via HTTP to the control plane
  • If a container dies, heartbeat-based detection marks the run as failed within 60 seconds
  • Recovery sweep re-enqueues orphaned runs

Workers can be scaled horizontally, deployed, or restarted without losing progress on any run.
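Heartbeat-based detection plus the recovery sweep can be sketched as a periodic scan over run records. The 60-second timeout matches the description above; the in-memory dict and field names are illustrative stand-ins for the Postgres schema.

```python
import time

# Sketch of heartbeat-based orphan detection and the recovery sweep.
HEARTBEAT_TIMEOUT = 60.0  # seconds, per the description above

runs = {
    "run-1": {"status": "running", "last_heartbeat": time.time()},          # healthy
    "run-2": {"status": "running", "last_heartbeat": time.time() - 120.0},  # container died
}

def recovery_sweep(now: float) -> list[str]:
    """Re-enqueue runs whose heartbeat expired more than HEARTBEAT_TIMEOUT ago."""
    requeued = []
    for run_id, run in runs.items():
        if run["status"] == "running" and now - run["last_heartbeat"] > HEARTBEAT_TIMEOUT:
            run["status"] = "queued"     # back on the queue; a new container
            requeued.append(run_id)      # resumes from the last persisted step
    return requeued

orphans = recovery_sweep(time.time())
```

Because the sweep reads only persisted state, any worker can run it: the workers themselves stay stateless.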

Locking (Cloud)

To prevent two workers from executing the same run concurrently:

  • Each run has a locked_by and locked_until field
  • A worker acquires the lock atomically before launching a container
  • If a worker crashes (lock expires), a recovery sweep re-enqueues the run
  • The lock acquisition uses WHERE locked_by IS NULL OR locked_until < now(), so two workers can never hold the same run at once

What gets persisted

Data                         Storage    Purpose
Run state                    Postgres   Source of truth for run lifecycle
Step results / checkpoints   Postgres   Execution trace and replay
Tool call I/O                Postgres   Debugging
Queue notifications          Redis      Work distribution (rebuildable from Postgres)

Redis is a notification layer only. If Redis is lost, it can be rebuilt from Postgres. No run data is lost.
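Rebuilding the queue is straightforward because Postgres is the source of truth: the queue is just the set of runs that are still waiting for a worker. A minimal sketch, with a list of dicts standing in for the runs table:

```python
# Sketch: rebuilding the Redis queue from Postgres after a Redis wipe.
# The runs list stands in for the Postgres runs table; names are illustrative.
runs = [
    {"id": "run-1", "status": "queued"},
    {"id": "run-2", "status": "completed"},
    {"id": "run-3", "status": "queued"},
]

def rebuild_queue(runs: list[dict]) -> list[str]:
    # Redis held only notifications, so the queue is exactly the set of
    # runs the source of truth says are still waiting for a worker.
    return [r["id"] for r in runs if r["status"] == "queued"]

queue = rebuild_queue(runs)
```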

Budget enforcement

Budget is enforced at step boundaries:

  • Before each step, the system checks if the budget is exhausted
  • A single step (one LLM call) may exceed the remaining budget — the check happens before the call, but the call's cost isn't known until it completes
  • After the step, if budget is exceeded, the run pauses (cloud) or raises an error (local)

This means the actual spend may exceed the budget by the cost of one step. For most use cases this is a few cents. Set your budget with this margin in mind.
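The overshoot-by-one-step behavior follows directly from checking before the step while only learning the cost after it. A minimal sketch with illustrative numbers:

```python
# Sketch of budget enforcement at step boundaries: the check runs *before*
# each step, so the final step may overshoot by its own cost, which is
# unknown until the step completes. Names and numbers are illustrative.
class BudgetExceeded(Exception):
    pass

def run_steps(step_costs: list[float], budget: float) -> float:
    spent = 0.0
    for cost in step_costs:
        if spent >= budget:          # pre-step check: budget already exhausted
            raise BudgetExceeded(f"spent {spent:.2f}, budget {budget:.2f}")
        spent += cost                # the step's cost is only known afterwards
    return spent

# Budget of 1.00: the third step starts while spent is 0.90 < 1.00, so the
# actual spend ends above the budget by part of one step's cost.
total = run_steps([0.50, 0.40, 0.40], budget=1.00)
```

A fourth step would fail the pre-step check and pause (cloud) or raise (local), capping the overshoot at one step's cost.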

Durability by path

Feature                Cloud (@agent + deploy)                              Local (run.task())
Checkpointing          Every LLM call to Postgres (via shim interceptors)   Every run.task() to Postgres
Crash recovery         Heartbeat timeout + recovery sweep                   Resume with same run_id
Budget enforcement     Automatic (shim intercepts LLM calls)                Per-task check
Execution trace        Full (steps table with tokens, cost, tool calls)     Checkpoints (labels + results)
Resume after restart   Automatic                                            Manual (same run_id)