Prompt-Driven Development

Modern banking systems are no longer just codebases — they are increasingly prompt-orchestrated systems where LLMs handle transaction classification, fraud signal interpretation, customer communication, and compliance summarization. As prompts become load-bearing components of production infrastructure, the question is no longer whether to manage them with engineering discipline, but how.

Prompts Are Code — Treat Them That Way

In traditional software development, a logic change requires a code review, a test suite run, and a deployment pipeline. In prompt-driven systems, a subtle wording change in a fraud detection prompt can silently shift model behavior across thousands of daily decisions. Yet many teams still store prompts in spreadsheets, Notion pages, or hardcoded strings, where none of those controls apply.

Prompt-Driven Development (PDD) formalizes the lifecycle of a prompt the same way software engineering formalizes the lifecycle of a function: it is authored, versioned, tested, reviewed, and deployed with explicit controls.

For banking systems specifically — where regulatory auditability and behavioral consistency are non-negotiable — this discipline is not optional. A prompt that classifies a transaction as high-risk cannot be modified ad hoc by a product manager between deployments. It must carry a version identifier, a change history, and a traceable owner.

Prompt Versioning: Structural Requirements

Prompt versioning goes beyond storing text in Git. A production-grade versioning strategy for banking LLM workflows requires:

Semantic versioning with behavioral contracts

Each prompt version should declare its intended behavioral contract: what input schema it expects, what output schema it guarantees, and under what conditions it should be deprecated. A fraud_signal_classifier_v2.3 prompt must document whether it handles multi-currency edge cases differently than v2.2 — and that difference must be traceable to a ticket, a review, and a test delta.
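
One way to make such a contract machine-checkable is to store it as data next to the prompt. The following is a minimal sketch, not a standard format: the class name PromptContract, its fields, the field names in the example, and the ticket ID are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptContract:
    """Hypothetical behavioral contract attached to one prompt version."""
    name: str
    version: str
    input_fields: frozenset    # field names the prompt expects
    output_fields: frozenset   # field names the prompt guarantees
    deprecation_condition: str
    change_ticket: str         # placeholder ID, not a real ticket

    def validate_input(self, payload: dict) -> list:
        """Return the sorted names of missing input fields (empty if valid)."""
        return sorted(self.input_fields - set(payload))

    def validate_output(self, output: dict) -> list:
        """Return the sorted names of missing output fields (empty if valid)."""
        return sorted(self.output_fields - set(output))


contract = PromptContract(
    name="fraud_signal_classifier",
    version="2.3",
    input_fields=frozenset({"amount", "currency", "merchant_id"}),
    output_fields=frozenset({"risk_score", "label"}),
    deprecation_condition="superseded by a promoted 2.4 or later",
    change_ticket="TICKET-0000",
)

# A payload without merchant_id violates the declared input schema.
missing = contract.validate_input({"amount": 10, "currency": "EUR"})
print(missing)
```

Because the dataclass is frozen, a contract cannot be altered after creation; changing it means publishing a new version with its own ticket reference.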

Immutable prompt registries

Prompts in production should be stored in an immutable registry and never edited in place. Tools such as LangSmith or PromptLayer, or a custom internal registry, can support this workflow. Each version is addressable by hash or semantic ID, so rollbacks become deterministic: reverting to v2.2 is a registry lookup, not a manual copy-paste.
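
To illustrate the idea, here is a minimal in-memory sketch of an append-only registry. It is not the API of LangSmith, PromptLayer, or any other product; the class PromptRegistry and its methods are hypothetical, and a real registry would persist versions durably and record who promoted what.

```python
import hashlib


class PromptRegistry:
    """Append-only registry sketch: versions can be added, never overwritten."""

    def __init__(self):
        self._versions = {}  # (name, version) -> template text
        self._active = {}    # name -> version currently served

    def register(self, name: str, version: str, template: str) -> str:
        """Store a new version and return the SHA-256 digest of its text."""
        key = (name, version)
        if key in self._versions:
            raise ValueError(f"{name} {version} already exists; publish a new version")
        self._versions[key] = template
        return hashlib.sha256(template.encode("utf-8")).hexdigest()

    def promote(self, name: str, version: str) -> None:
        """Point the active pointer at an existing version."""
        if (name, version) not in self._versions:
            raise KeyError(f"unknown version: {name} {version}")
        self._active[name] = version

    def get_active(self, name: str) -> tuple:
        """Return (version, template) for the version currently served."""
        version = self._active[name]
        return version, self._versions[(name, version)]


registry = PromptRegistry()
registry.register("fraud_signal_classifier", "2.2", "Classify risk (2.2 wording): {transaction}")
digest = registry.register("fraud_signal_classifier", "2.3", "Classify risk (2.3 wording): {transaction}")
registry.promote("fraud_signal_classifier", "2.3")

# Rollback is a pointer change, not an edit to stored text.
registry.promote("fraud_signal_classifier", "2.2")
```

The design choice worth noting is that promotion and storage are separate operations: stored text never changes, and only the pointer moves.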

Separation of template and context

Prompt templates — the structural logic — should be versioned independently from the runtime context injected at inference time (account metadata, transaction history, regulatory jurisdiction). Conflating the two makes behavioral debugging intractable.
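
A small sketch of that separation, using Python's standard string.Template, is shown below. The template wording, the variable names, and the build_request helper are illustrative assumptions; the point is only that the versioned template contains no customer data, and the runtime context is recorded alongside the template version rather than baked into it.

```python
from string import Template

# Versioned artifact: structural logic only, no customer data.
TEMPLATE_VERSION = "2.3"
TEMPLATE = Template(
    "Jurisdiction: $jurisdiction\n"
    "Account tier: $account_tier\n"
    "Classify the following transaction as LOW, MEDIUM, or HIGH risk:\n"
    "$transaction"
)


def build_request(context: dict) -> dict:
    """Render the template with runtime context and keep both traceable.

    Template.substitute raises KeyError if a variable is missing, so an
    incomplete context fails loudly instead of producing a partial prompt.
    """
    return {
        "template_version": TEMPLATE_VERSION,
        "context": dict(context),
        "prompt": TEMPLATE.substitute(context),
    }


request = build_request({
    "jurisdiction": "DE",
    "account_tier": "retail",
    "transaction": "card payment, 120.00 EUR, merchant category 5411",
})
print(request["prompt"])
```

When a decision later needs explaining, the template version and the context can be inspected separately, which is what keeps behavioral debugging tractable.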

Testing Prompts Like Code

Unit testing a prompt means asserting that given a specific input, the model produces output that satisfies a defined behavioral contract — not just that it produces some output.

Behavioral test suites

For a payment dispute classification prompt, a behavioral test suite might assert:

Given a chargeback claim with a matching merchant refund within 48 hours, the output classification must be RESOLVED, not DISPUTED

Given an input with a flagged BIN range, the risk score must exceed the 0.75 threshold

Output JSON must conform to the declared schema on 100% of test cases

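These are not LLM evals in the academic sense. They are deterministic pass/fail assertions against curated fixtures, run on every prompt version before promotion. Because model output can vary between runs, the results are most repeatable when the model version and decoding settings (for example, temperature) are pinned for the test run.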
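
The three assertions above could be expressed as a small fixture-driven suite along the following lines. This is a sketch under stated assumptions: classify_dispute is a rule-based stand-in for the real prompt and model call, and the fixture fields (refund_hours, bin_flagged) and output keys are hypothetical.

```python
import json

REQUIRED_KEYS = {"classification", "risk_score"}
RISK_THRESHOLD = 0.75


def classify_dispute(case: dict) -> str:
    """Stand-in for the real prompt + model call; returns a JSON string."""
    hours = case.get("refund_hours")
    refunded = hours is not None and hours <= 48
    return json.dumps({
        "classification": "RESOLVED" if refunded else "DISPUTED",
        "risk_score": 0.9 if case.get("bin_flagged") else 0.2,
    })


FIXTURES = [
    {"case": {"refund_hours": 24, "bin_flagged": False},
     "expect_classification": "RESOLVED"},
    {"case": {"refund_hours": None, "bin_flagged": True},
     "expect_min_risk": RISK_THRESHOLD},
]


def run_suite(model, fixtures) -> list:
    """Return a list of failure messages; an empty list means the suite passed."""
    failures = []
    for i, fx in enumerate(fixtures):
        try:
            out = json.loads(model(fx["case"]))
        except json.JSONDecodeError:
            failures.append(f"fixture {i}: output is not valid JSON")
            continue
        # Schema check applies to every fixture.
        if not isinstance(out, dict) or set(out) != REQUIRED_KEYS:
            failures.append(f"fixture {i}: output does not match schema")
            continue
        expected = fx.get("expect_classification")
        if expected is not None and out["classification"] != expected:
            failures.append(f"fixture {i}: expected {expected}")
        min_risk = fx.get("expect_min_risk")
        if min_risk is not None and out["risk_score"] <= min_risk:
            failures.append(f"fixture {i}: risk score not above {min_risk}")
    return failures


failures = run_suite(classify_dispute, FIXTURES)
print(failures)
```

In a real pipeline the stand-in would be replaced by a call that renders the candidate prompt version and invokes the model, while the fixtures and assertions stay unchanged across versions.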

Regression testing across versions

When fraud_signal_classifier_v2.3 is a candidate to replace v2.2, it must be run against the full regression corpus of historical inputs with known outputs. Behavioral divergence — even in cases where v2.3 is technically "better" — must be surfaced explicitly and reviewed before deployment.
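
A regression comparison of this kind can be sketched as a diff over the corpus. In the example below, classify_v2_2 and classify_v2_3 are rule-based stand-ins for the two prompt versions, and the corpus records and their fields are invented for illustration only.

```python
def classify_v2_2(tx: dict) -> str:
    """Stand-in for the baseline prompt version."""
    return "HIGH" if tx["amount"] >= 10_000 else "LOW"


def classify_v2_3(tx: dict) -> str:
    """Stand-in for the candidate; treats cross-currency transfers more strictly."""
    if tx["currency"] != tx["account_currency"] and tx["amount"] >= 5_000:
        return "HIGH"
    return classify_v2_2(tx)


CORPUS = [
    {"id": "tx-1", "known_output": "HIGH",
     "input": {"amount": 12_000, "currency": "USD", "account_currency": "USD"}},
    {"id": "tx-2", "known_output": "LOW",
     "input": {"amount": 200, "currency": "USD", "account_currency": "USD"}},
    {"id": "tx-3", "known_output": "LOW",
     "input": {"amount": 6_000, "currency": "EUR", "account_currency": "USD"}},
]


def diff_versions(baseline, candidate, corpus) -> list:
    """Return one record for every corpus input where the versions disagree."""
    divergences = []
    for record in corpus:
        old = baseline(record["input"])
        new = candidate(record["input"])
        if old != new:
            divergences.append({
                "id": record["id"],
                "baseline": old,
                "candidate": new,
                "candidate_matches_known": new == record["known_output"],
            })
    return divergences


divergences = diff_versions(classify_v2_2, classify_v2_3, CORPUS)
for d in divergences:
    print(d)
```

Note that the diff reports every disagreement, including ones where the candidate might arguably be the better answer; deciding which divergences are acceptable is left to the human review step.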

Integrating Prompts into CI/CD

A mature PDD pipeline treats prompt promotion with the same ceremony as a service deployment.

Pipeline stages for prompt changes

A prompt change targeting a KYC summarization workflow should move through: authoring → lint/schema validation → behavioral test suite → human review gate → staging deployment → shadow mode evaluation → production promotion. Each stage has entry and exit criteria. Shadow mode — running the new prompt in parallel without acting on its output — is particularly valuable in banking, where the cost of a behavioral regression is high.
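
The shadow mode stage in particular lends itself to a short illustration. The sketch below assumes hypothetical stand-in functions for the two prompt versions and an in-memory list as the divergence log; a production system would typically run the shadow call asynchronously and write to durable storage.

```python
def prompt_v2_2(request: dict) -> str:
    """Stand-in for the prompt version currently in production."""
    return "HIGH" if request["amount"] >= 10_000 else "LOW"


def prompt_v2_3(request: dict) -> str:
    """Stand-in for the candidate version under shadow evaluation."""
    return "HIGH" if request["amount"] >= 5_000 else "LOW"


def handle_with_shadow(request: dict, primary, shadow, divergence_log: list) -> str:
    """Serve the primary output; run the shadow only to record differences."""
    primary_output = primary(request)
    try:
        shadow_output = shadow(request)
    except Exception as exc:
        # A failing shadow must never affect the production decision.
        divergence_log.append({"request": request, "shadow_error": repr(exc)})
        return primary_output
    if shadow_output != primary_output:
        divergence_log.append({
            "request": request,
            "primary": primary_output,
            "shadow": shadow_output,
        })
    return primary_output


log = []
decisions = [
    handle_with_shadow(r, prompt_v2_2, prompt_v2_3, log)
    for r in ({"amount": 200}, {"amount": 6_000})
]
print(decisions, log)
```

The caller only ever receives the primary output, so the candidate can be observed on live traffic without any of its decisions being acted on.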

Prompt linting and static analysis

Before a prompt reaches a test runner, automated linting should flag structural issues: missing output schema declarations, injection-vulnerable template variables, token budget overruns on known input distributions, or missing jurisdiction-scoping instructions required by compliance policy.
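
A linter for these checks can start out very simple. The sketch below is an assumption-heavy illustration: the marker strings OUTPUT_SCHEMA: and JURISDICTION:, the variable allowlist, and the character limit (a crude proxy for a token budget) are all hypothetical conventions, and the allowlist check is only a rough stand-in for real injection analysis.

```python
import re

# Hypothetical conventions a team might require in every template.
REQUIRED_MARKERS = {
    "output schema": "OUTPUT_SCHEMA:",
    "jurisdiction scope": "JURISDICTION:",
}
ALLOWED_VARIABLES = {"transaction", "jurisdiction"}
MAX_TEMPLATE_CHARS = 4000  # crude proxy for a token budget


def lint_prompt(template: str) -> list:
    """Return a list of findings; an empty list means the template passed."""
    findings = []
    for label, marker in REQUIRED_MARKERS.items():
        if marker not in template:
            findings.append(f"missing {label} declaration ({marker})")
    # Variables written as {name}; anything off the allowlist needs review.
    variables = set(re.findall(r"\{(\w+)\}", template))
    for name in sorted(variables - ALLOWED_VARIABLES):
        findings.append(f"template variable not on allowlist: {name}")
    if len(template) > MAX_TEMPLATE_CHARS:
        findings.append(f"template exceeds {MAX_TEMPLATE_CHARS} characters")
    return findings


good = (
    "JURISDICTION: {jurisdiction}\n"
    "OUTPUT_SCHEMA: risk_score, label\n"
    "Classify: {transaction}"
)
bad = "Classify: {transaction} {free_text}"

print(lint_prompt(good))
print(lint_prompt(bad))
```

Running the linter before the behavioral suite keeps cheap structural failures from consuming model calls.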

Rollback and incident response

Prompt deployments must support zero-downtime rollback. When a production incident is traced to a prompt behavioral regression, the resolution path must be: identify the offending version, revert the registry pointer, verify behavioral restoration against the regression suite, and file a post-mortem. The same incident response playbook that governs service outages applies.
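
The revert-and-verify steps of that path can be sketched as a single function. Everything here is illustrative: the pointers dictionary stands in for the registry's active pointer, the implementations are rule-based stand-ins for prompt versions, and the regression cases are invented.

```python
def rollback(pointers: dict, name: str, target_version: str,
             implementations: dict, regression_cases: list) -> dict:
    """Repoint `name` at target_version, then verify it against known cases."""
    if target_version not in implementations:
        raise KeyError(f"unknown version: {target_version}")
    previous = pointers[name]
    pointers[name] = target_version  # the revert itself is a pointer change
    model = implementations[target_version]
    failures = [
        case["id"] for case in regression_cases
        if model(case["input"]) != case["expected"]
    ]
    return {
        "reverted_from": previous,
        "active": target_version,
        "restored": not failures,
        "failures": failures,
    }


# Stand-ins for two prompt versions; 2.3 is the one with the regression.
implementations = {
    "2.2": lambda tx: "HIGH" if tx["amount"] >= 10_000 else "LOW",
    "2.3": lambda tx: "HIGH",
}
regression_cases = [
    {"id": "tx-1", "input": {"amount": 12_000}, "expected": "HIGH"},
    {"id": "tx-2", "input": {"amount": 200}, "expected": "LOW"},
]
pointers = {"fraud_signal_classifier": "2.3"}

report = rollback(pointers, "fraud_signal_classifier", "2.2",
                  implementations, regression_cases)
print(report)
```

The returned report is the kind of artifact a post-mortem can cite: which version was reverted, which is now active, and whether the regression suite confirms restored behavior.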

Conclusion

Banking systems built on LLMs inherit all the operational complexity of traditional distributed systems — plus the nondeterminism of language models. Prompt-Driven Development provides the engineering framework to manage that complexity: version prompts as first-class artifacts, test them against behavioral contracts, and promote them through CI/CD pipelines with the same rigor applied to production code. In regulated environments, this is not a best practice. It is a baseline requirement.