Prompt-Driven Development
Building Banking Workflows with LLMs
Building Banking Workflows with LLMs
Modern banking systems are no longer just codebases — they are increasingly prompt-orchestrated systems where LLMs handle transaction classification, fraud signal interpretation, customer communication, and compliance summarization. As prompts become load-bearing components of production infrastructure, the question is no longer whether to manage them with engineering discipline, but how.
In traditional software development, a logic change requires a code review, a test suite run, and a deployment pipeline. In prompt-driven systems, a subtle wording change in a fraud detection prompt can silently shift model behavior across thousands of daily decisions. Yet most teams still store prompts in spreadsheets, Notion pages, or hardcoded strings.
Prompt-Driven Development (PDD) formalizes the lifecycle of a prompt the same way software engineering formalizes the lifecycle of a function: it is authored, versioned, tested, reviewed, and deployed with explicit controls.
For banking systems specifically — where regulatory auditability and behavioral consistency are non-negotiable — this discipline is not optional. A prompt that classifies a transaction as high-risk cannot be modified ad hoc by a product manager between deployments. It must carry a version identifier, a change history, and a traceable owner.
Prompt versioning goes beyond storing text in Git. A production-grade versioning strategy for banking LLM workflows requires:
Each prompt version should declare its intended behavioral contract: what input schema it expects, what output schema it guarantees, and under what conditions it should be deprecated. A fraud_signal_classifier_v2.3 prompt must document whether it handles multi-currency edge cases differently than v2.2 — and that difference must be traceable to a ticket, a review, and a test delta.
Prompts in production should be stored in an immutable registry — not edited in place. Tools like LangSmith, PromptLayer, or a custom internal registry enforce this. Each version is addressable by hash or semantic ID. Rollbacks become deterministic: reverting to v2.2 is a registry lookup, not a manual copy-paste.
Prompt templates — the structural logic — should be versioned independently from the runtime context injected at inference time (account metadata, transaction history, regulatory jurisdiction). Conflating the two makes behavioral debugging intractable.
Unit testing a prompt means asserting that given a specific input, the model produces output that satisfies a defined behavioral contract — not just that it produces some output.
For a payment dispute classification prompt, a behavioral test suite might assert:
Given a chargeback claim with a matching merchant refund within 48 hours, the output classification must be RESOLVED, not DISPUTED
Given an input with a flagged BIN range, the risk score must exceed threshold 0.75
Output JSON must conform to the declared schema on 100% of test cases
These are not LLM evals in the academic sense — they are deterministic assertions against curated fixtures, run on every prompt version before promotion.
When fraud_signal_classifier_v2.3 is a candidate to replace v2.2, it must be run against the full regression corpus of historical inputs with known outputs. Behavioral divergence — even in cases where v2.3 is technically "better" — must be surfaced explicitly and reviewed before deployment.
A mature PDD pipeline treats prompt promotion with the same ceremony as a service deployment.
A prompt change targeting a KYC summarization workflow should move through: authoring → lint/schema validation → behavioral test suite → human review gate → staging deployment → shadow mode evaluation → production promotion. Each stage has entry and exit criteria. Shadow mode — running the new prompt in parallel without acting on its output — is particularly valuable in banking, where the cost of a behavioral regression is high.
Before a prompt reaches a test runner, automated linting should flag structural issues: missing output schema declarations, injection-vulnerable template variables, token budget overruns on known input distributions, or missing jurisdiction-scoping instructions required by compliance policy.
Prompt deployments must support zero-downtime rollback. When a production incident is traced to a prompt behavioral regression, the resolution path must be: identify the offending version, revert the registry pointer, verify behavioral restoration against the regression suite, and file a post-mortem. The same incident response playbook that governs service outages applies.
Banking systems built on LLMs inherit all the operational complexity of traditional distributed systems — plus the nondeterminism of language models. Prompt-Driven Development provides the engineering framework to manage that complexity: version prompts as first-class artifacts, test them against behavioral contracts, and promote them through CI/CD pipelines with the same rigor applied to production code. In regulated environments, this is not a best practice. It is a baseline requirement.