Designing Banking Processes for Automation
April 20, 2026


When to Use Sagas and When to Simplify

Banking workflows — onboarding, loan approval, settlement, chargeback resolution — all share the same structural problem: they span multiple service boundaries, run for minutes or days, and operate under regulatory constraints where a partial failure isn't just a bug, it's a compliance event. The systems I've worked on in this space make one thing very clear: the cost of getting the orchestration model wrong compounds quickly. 


The saga pattern gets reached for often in this domain, sometimes appropriately, sometimes not. The real skill is knowing when its complexity is actually buying you something. 


What a saga is — and what it isn't 


A saga is a sequence of local transactions coordinated across service boundaries, where each step has a corresponding compensating transaction that semantically undoes the effect if a later failure requires rollback. It's not a distributed transaction — there's no two-phase commit, no global lock. Atomicity is replaced with eventual consistency and deliberate compensation. 


Two flavours exist. Choreography-based sagas have no central coordinator — each service reacts to events and publishes its own. Orchestration-based sagas (process managers) use an explicit stateful entity that issues commands and tracks progress. In banking, I almost always prefer the orchestrator. Choreography makes reconstruction of a process's current state expensive — you're joining events across multiple services. An orchestrator makes process state first-class and queryable, which is exactly what audits and support queues demand. 


Loan approval — orchestrated saga (happy path)

KYC check → Credit score → Risk assessment → Approval issued

Compensation path (on failure)

Release credit hold → Void risk entry → Application voided
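To make the orchestrator concrete, here is a minimal Python sketch of the flow above: the process manager runs each step's local transaction in order and, on a failure, runs the compensations for the completed steps in reverse. The service calls, step names, and the `LoanApprovalSaga` class are illustrative stand-ins, not a real workflow engine or bureau API.

```python
# Minimal sketch of an orchestration-based saga for the loan-approval flow above.
# All names and service calls are illustrative stand-ins.

class SagaStep:
    def __init__(self, name, action, compensation=None):
        self.name = name
        self.action = action              # callable performing the local transaction
        self.compensation = compensation  # callable that semantically undoes it

class LoanApprovalSaga:
    def __init__(self, steps):
        self.steps = steps
        self.completed = []               # steps whose effects may need compensating

    def run(self, application_id):
        for step in self.steps:
            try:
                step.action(application_id)
                self.completed.append(step)
            except Exception as exc:
                print(f"{step.name} failed ({exc}); compensating")
                self.compensate(application_id)
                return "APPLICATION_VOIDED"
        return "APPROVAL_ISSUED"

    def compensate(self, application_id):
        # Undo completed steps in reverse order. A real process manager would also
        # retry and audit each compensation; this sketch omits that.
        for step in reversed(self.completed):
            if step.compensation:
                step.compensation(application_id)

saga = LoanApprovalSaga([
    SagaStep("kyc_check", lambda a: print(f"KYC check for {a}")),
    SagaStep("credit_score", lambda a: print(f"Credit score for {a}"),
             lambda a: print(f"Release credit hold for {a}")),
    SagaStep("risk_assessment", lambda a: print(f"Risk assessment for {a}"),
             lambda a: print(f"Void risk entry for {a}")),
    SagaStep("issue_approval", lambda a: print(f"Approval issued for {a}")),
])
print(saga.run("APP-1042"))
```

In practice this state lives in a durable store or a workflow engine rather than in memory, but the shape is the same: an explicit, queryable record of which steps have run and what remains to be compensated.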


When the pattern earns its complexity cost 


Sagas are the right tool when: the workflow spans multiple service boundaries that can't share a database transaction; the process is long-running and must survive restarts; individual steps have meaningful compensating actions; and the business needs to observe process state as a discrete entity over time. Onboarding checks all of these boxes. A KYC provider, credit bureau, document store, and core banking system are four separate services, four separate failure domains, and the process can sit open for days. 


Use sagas when:

- Steps cross service boundaries
- Duration exceeds a single request
- Each step has a compensating action
- Audit trail is a hard requirement
- Partial failure has material impact

Simplify when:

- All steps share one data store
- Process completes in one request
- Retry alone handles failure cleanly
- Workflow is linear with no branching
- No regulatory audit requirement


Settlement reconciliation is a useful counterexample. If the reconciliation runs within a single service against a single database, wrapping it in a distributed saga adds surface area with no correctness benefit. A well-structured state machine with idempotency keys on the write operations is the right call. Don't pay the saga tax if you're not getting saga guarantees in return. 
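As a contrast to the saga, here is a rough sketch of that simpler shape: a single-service state machine whose write operations carry idempotency keys, so a retried call cannot post the same entry twice. The states, events, and the `ReconciliationRun` class are hypothetical.

```python
# Sketch of the simpler alternative: one service, one data store, a state machine
# with idempotency keys on writes. All names are illustrative.

RECON_TRANSITIONS = {
    ("PENDING", "entries_loaded"): "MATCHING",
    ("MATCHING", "all_matched"): "RECONCILED",
    ("MATCHING", "mismatch_found"): "EXCEPTION_REVIEW",
}

class ReconciliationRun:
    def __init__(self, run_id):
        self.run_id = run_id
        self.state = "PENDING"
        self._applied = set()   # idempotency keys already written

    def apply_entry(self, idempotency_key, write):
        # A repeated call with the same key is a no-op, so retries are safe.
        if idempotency_key in self._applied:
            return "already_applied"
        write()
        self._applied.add(idempotency_key)
        return "applied"

    def handle(self, event):
        next_state = RECON_TRANSITIONS.get((self.state, event))
        if next_state is None:
            raise ValueError(f"{event} not valid in state {self.state}")
        self.state = next_state
        return self.state

run = ReconciliationRun("RECON-2026-04-20")
run.apply_entry("RECON-2026-04-20/entry-17", lambda: print("posting entry 17"))
run.apply_entry("RECON-2026-04-20/entry-17", lambda: None)  # retried call, no double post
print(run.handle("entries_loaded"))
```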


Compensation, retry, and timeout — the tricky parts 


Compensating transactions are harder than they look. The naive assumption is that compensation is the inverse of the original action. In practice, time passes between a step executing and a compensation being triggered. A credit reservation compensated three hours later may have already propagated into downstream reports. The compensation has to handle that — often by issuing a corrective entry rather than a deletion. Design compensations as first-class operations: their own idempotency keys, their own retry logic, their own audit events. A compensation that fails silently is worse than no compensation at all. 
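A minimal sketch of that idea, assuming an in-memory ledger and audit log as stand-ins: the compensation posts a corrective entry rather than deleting the original reservation, carries its own idempotency key, and emits its own audit event.

```python
# Sketch of a compensation as a first-class operation. The ledger, audit log, and
# function names are illustrative stand-ins for real downstream systems.

import uuid

ledger = []           # append-only: compensations add entries, they never remove them
audit_log = []
processed_keys = set()

def reserve_credit(application_id, amount):
    ledger.append({"type": "CREDIT_RESERVED", "app": application_id, "amount": amount})

def compensate_credit_reservation(application_id, amount, idempotency_key):
    # Safe to retry: the same key never produces a second corrective entry.
    if idempotency_key in processed_keys:
        return "already_compensated"
    ledger.append({"type": "CREDIT_RESERVATION_REVERSED", "app": application_id,
                   "amount": -amount})
    audit_log.append({"event": "compensation_applied", "key": idempotency_key,
                      "app": application_id})
    processed_keys.add(idempotency_key)
    return "compensated"

reserve_credit("APP-1042", 25_000)
key = f"comp-{uuid.uuid4()}"
compensate_credit_reservation("APP-1042", 25_000, key)
compensate_credit_reservation("APP-1042", 25_000, key)   # retry is a no-op
print(ledger, audit_log, sep="\n")
```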


Implementation note 


Every operation in a saga — including compensations — must be idempotent. The process manager will retry on transient failures; the downstream service must handle being called twice with the same idempotency key and return the same result. This isn't optional: without it, retry logic becomes a liability rather than a safety net. 
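One way the downstream side of that contract can look, sketched with a hypothetical `PaymentService`: the result of the first call is recorded per idempotency key and returned verbatim on any retry, so the orchestrator's retry never executes a second payment.

```python
# Sketch of an idempotent downstream handler. The service and its in-memory store
# are illustrative; a real implementation would persist the keyed results.

class PaymentService:
    def __init__(self):
        self._results = {}   # idempotency key -> result of the first execution

    def execute_instruction(self, idempotency_key, amount):
        if idempotency_key in self._results:
            return self._results[idempotency_key]   # retry: same result, no new payment
        result = {"status": "EXECUTED", "amount": amount, "key": idempotency_key}
        self._results[idempotency_key] = result
        return result

svc = PaymentService()
first = svc.execute_instruction("saga-77/step-3", 1500)
retry = svc.execute_instruction("saga-77/step-3", 1500)
assert first == retry   # the retried call did not execute a second payment
```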




Retry policies need explicit bounds and backoff. Exponential backoff with jitter, a hard cap on attempts, and a dead-letter path for exhausted retries. Unbounded retries on a payment instruction will eventually cause a duplicate payment — a failure mode that's far more expensive than a failed transaction. 
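A sketch of such a policy, with hypothetical names: exponential backoff with full jitter, a hard cap on attempts, and a dead-letter hand-off once attempts are exhausted.

```python
# Sketch of a bounded retry policy. The operation being retried and the dead-letter
# queue are stand-ins; the bounds and delays are illustrative.

import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       dead_letter=None):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                if dead_letter:
                    dead_letter({"error": str(exc), "attempts": attempt})
                raise
            # Full jitter keeps simultaneous retries from hammering the dependency.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            time.sleep(delay)

dead_letters = []

def flaky_credit_bureau_call():
    raise TimeoutError("bureau unavailable")

try:
    retry_with_backoff(flaky_credit_bureau_call, max_attempts=3,
                       dead_letter=dead_letters.append)
except TimeoutError:
    print("exhausted retries; dead-lettered:", dead_letters)
```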


Timeouts are business decisions. How long should a loan approval wait for a credit bureau response before escalating? That's an SLA conversation, not a http.client.timeout value. Model timeouts as explicit events in the process manager: when one fires, it should trigger a defined transition — retry, escalate, compensate, alert — not a silent expiry. Timeouts that disappear into logs without driving a state change are operational debt. 
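One way to model that, sketched with illustrative states and a made-up 48-hour SLA: a scheduler emits a named timeout event, and the process manager maps it to a defined transition (here, escalation) instead of letting the deadline expire silently.

```python
# Sketch of a timeout as an explicit process event driving a defined transition.
# States, events, and the SLA value are illustrative.

from datetime import datetime, timedelta, timezone

TRANSITIONS = {
    ("AWAITING_BUREAU", "bureau_response"): "RISK_ASSESSMENT",
    ("AWAITING_BUREAU", "bureau_timeout"): "MANUAL_ESCALATION",
}

class LoanProcess:
    def __init__(self, sla=timedelta(hours=48)):
        self.state = "AWAITING_BUREAU"
        self.deadline = datetime.now(timezone.utc) + sla

    def check_timeout(self, now):
        # Called by a scheduler; emits the timeout event rather than flipping state silently.
        if self.state == "AWAITING_BUREAU" and now >= self.deadline:
            return self.handle("bureau_timeout")
        return self.state

    def handle(self, event):
        self.state = TRANSITIONS[(self.state, event)]
        return self.state

p = LoanProcess(sla=timedelta(hours=48))
print(p.check_timeout(datetime.now(timezone.utc) + timedelta(hours=49)))  # MANUAL_ESCALATION
```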


Invariant enforcement in regulated environments 


Banking processes carry hard invariants: a loan cannot disburse if KYC is not in an approved state; a chargeback cannot close without a linked dispute record. The instinct is to enforce these in the downstream services, but I've seen this pattern break badly when a new team member adds a direct API call that bypasses the workflow. Put the invariant checks in the process manager itself. The business rules stay visible, centralised, and version-controlled alongside the workflow definition. 
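A sketch of what that can look like, using the two invariants above; the guard functions and process fields are hypothetical, and the point is simply that the rules sit in the workflow code rather than in the downstream services.

```python
# Sketch of invariant guards living in the process manager. Field names and
# guard functions are illustrative.

class InvariantViolation(Exception):
    pass

def guard_disbursement(process):
    # Rule versioned alongside the workflow: no disbursement without approved KYC.
    if process.get("kyc_status") != "APPROVED":
        raise InvariantViolation("loan cannot disburse: KYC is not in an approved state")

def guard_chargeback_close(process):
    # Same idea for the chargeback flow: no close without a linked dispute record.
    if not process.get("dispute_record_id"):
        raise InvariantViolation("chargeback cannot close without a linked dispute record")

def disburse(process):
    guard_disbursement(process)      # the orchestrator checks before issuing the command
    process["state"] = "DISBURSED"
    return process

try:
    disburse({"kyc_status": "PENDING"})
except InvariantViolation as exc:
    print(exc)
```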


Every state transition should be logged with a timestamp, the triggering event, and the actor — whether that's a service identity or a human user. The process state should be queryable as a single authoritative record, not reconstructed from event log joins. When a regulator asks why a specific customer's onboarding was paused for 48 hours, the answer should come from a direct query, not a forensic investigation across three services. 
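A rough sketch of that shape, with an illustrative schema: every transition appended to the process record carries the timestamp, the triggering event, and the actor, so the regulator's question becomes a filter over a single record.

```python
# Sketch of a transition log kept as part of the process record itself.
# The schema and identifiers are illustrative.

from datetime import datetime, timezone

class OnboardingProcess:
    def __init__(self, process_id):
        self.process_id = process_id
        self.state = "STARTED"
        self.transitions = []    # authoritative history, stored with the process

    def transition(self, new_state, event, actor):
        self.transitions.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "from": self.state, "to": new_state,
            "event": event, "actor": actor,
        })
        self.state = new_state

p = OnboardingProcess("ONB-3318")
p.transition("AWAITING_DOCUMENTS", "documents_requested", "service:onboarding-orchestrator")
p.transition("PAUSED", "sanctions_hit_review", "user:compliance.analyst")

# Answering "why was this paused?" is a filter over one record, not a cross-service join.
pauses = [t for t in p.transitions if t["to"] == "PAUSED"]
print(pauses)
```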


Takeaway 


Sagas give you resilience and auditability at the cost of real operational complexity. That trade-off is worth making for long-running, cross-service, regulated workflows. For everything else, reach for the simpler pattern first — a reliable queue, idempotent writes, and a well-defined state machine will cover more ground than most teams expect. The discipline is in making that call deliberately, not defaulting to the most sophisticated tool in the box.