AI Prompting for Test Automation in Banking Systems
Prompting for Test Generation
Test coverage in core banking environments has a structural problem: the systems most in need of rigorous testing are often the ones with the least documented behavior. Decades-old COBOL and PL/I modules encode business rules that were never written down elsewhere: interest accrual edge cases, batch cutoff logic, currency rounding conventions inherited from pre-euro systems. Writing meaningful tests against this kind of system requires first reconstructing what "correct" actually means. This is where AI-assisted prompting earns its place in the QA workflow: not as a shortcut to more tests, but as a method for surfacing the scenarios a team didn't know to test for.
The naive use of AI in test automation is asking a model to "write unit tests for this function." Against a legacy core banking module, this produces tests that mirror the code's existing logic, which is useful for regression safety but blind to the cases where the logic itself is wrong or incomplete. Engineering teams get more value from prompts that separate two distinct tasks: understanding behavior, then testing behavior.
"Trace this COBOL paragraph and list every branch condition, numeric threshold, and date comparison it evaluates. For each one, state the business rule it appears to implement. Do not generate tests yet."
This first pass surfaces the implicit rule set before any test code is written. Claude is better suited to the second stage, where reasoning needs to span an entire transaction flow rather than a single paragraph or function:
"Here is the full settlement batch routine across these four COBOL programs [paste/attach]. Identify any business rule applied inconsistently between them, for example a rounding method used in one program but not another. List discrepancies before suggesting tests."
Only once these discrepancies and rules are confirmed by an engineer does it make sense to ask for test generation against them.
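The gate between analysis and test generation can be made explicit in tooling. The following is a minimal sketch, not a prescribed implementation: `CandidateRule`, `build_analysis_prompt`, and `build_test_prompt` are hypothetical names introduced here for illustration, and the wording of the test-generation prompt is an assumption. The sketch only assembles prompt text; sending it to a model is left out.

```python
from dataclasses import dataclass


@dataclass
class CandidateRule:
    """A business rule proposed by the model during the analysis pass."""
    description: str
    confirmed: bool = False  # set to True only after engineer review


# Analysis prompt taken from the text above; {source} is the pasted COBOL.
ANALYSIS_PROMPT = (
    "Trace this COBOL paragraph and list every branch condition, numeric "
    "threshold, and date comparison it evaluates. For each one, state the "
    "business rule it appears to implement. Do not generate tests yet.\n\n"
    "{source}"
)


def build_analysis_prompt(source: str) -> str:
    """Stage 1: ask for rules only, never for tests."""
    return ANALYSIS_PROMPT.format(source=source)


def build_test_prompt(rules: list[CandidateRule]) -> str:
    """Stage 2: request tests, but only for engineer-confirmed rules."""
    confirmed = [r for r in rules if r.confirmed]
    if not confirmed:
        raise ValueError("no engineer-confirmed rules; refusing to request tests")
    bullet_list = "\n".join(f"- {r.description}" for r in confirmed)
    # Hypothetical wording; adapt to the team's own prompt conventions.
    return (
        "Generate tests for the following confirmed business rules only:\n"
        + bullet_list
    )
```

Raising an error when nothing has been confirmed keeps unreviewed model output from flowing straight into test generation, which is the failure mode the two-stage split is meant to prevent.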
Legacy core banking systems accumulate edge cases the way old buildings accumulate plumbing modifications: each one solving a real problem at the time, none of them documented for the next person. Prompting AI models to generate edge cases works best when the prompt supplies the domain constraints explicitly rather than asking generically for "edge cases."
"Generate edge-case test scenarios for an account interest calculation routine. Constraints to cover: month-end and year-end batch boundaries, leap-year date handling, multi-currency rounding to two vs. zero decimal places, accounts mid-migration between core systems, and timing gaps between transaction authorization and settlement. For each scenario, state the expected behavior and why it's non-obvious."
Given these constraints, AI models reliably generate scenario lists that QA engineers would otherwise assemble manually from tribal knowledge and past incident reports. The output still requires validation against actual system behavior, since a model has no visibility into a specific bank's historical workarounds, but it consistently surfaces more candidate scenarios per hour than manual enumeration.
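Once scenarios have been validated, they fit naturally into a table-driven test. The sketch below illustrates the shape under loudly stated assumptions: `daily_interest` is a hypothetical stand-in for the legacy routine, the actual/actual day count and banker's rounding are illustrative conventions rather than any particular bank's rules, and the expected values follow from those assumptions only.

```python
import calendar
from datetime import date
from decimal import Decimal, ROUND_HALF_EVEN

# Assumed minor-unit digits per currency (illustrative subset).
CURRENCY_DECIMALS = {"EUR": 2, "JPY": 0}


def daily_interest(balance: Decimal, annual_rate: Decimal,
                   on: date, currency: str) -> Decimal:
    """Hypothetical stand-in for the legacy routine.

    Assumes an actual/actual day count and banker's rounding to the
    currency's minor unit; the real system may differ on both.
    """
    days_in_year = 366 if calendar.isleap(on.year) else 365
    raw = balance * annual_rate / days_in_year
    quantum = Decimal(1).scaleb(-CURRENCY_DECIMALS[currency])
    return raw.quantize(quantum, rounding=ROUND_HALF_EVEN)


# (label, balance, annual rate, accrual date, currency, expected interest)
SCENARIOS = [
    ("year-end, non-leap year", Decimal("1000000"), Decimal("0.0365"),
     date(2023, 12, 31), "EUR", Decimal("100.00")),
    ("leap day uses 366-day basis", Decimal("1000000"), Decimal("0.0365"),
     date(2024, 2, 29), "EUR", Decimal("99.73")),
    ("zero-decimal currency", Decimal("1000000"), Decimal("0.0365"),
     date(2023, 6, 30), "JPY", Decimal("100")),
]


def failing_scenarios() -> list:
    """Run every scenario and return (label, expected, actual) for mismatches."""
    failures = []
    for label, balance, rate, on, currency, expected in SCENARIOS:
        actual = daily_interest(balance, rate, on, currency)
        if actual != expected:
            failures.append((label, expected, actual))
    return failures
```

Keeping each scenario as a labelled row makes it easy for an engineer to strike out or correct an expectation that contradicts observed system behavior, without touching the test logic.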
Where legacy cores connect to modern services (API gateways, fraud engines, customer-facing applications), contract testing matters more than end-to-end testing. The legacy side of that boundary rarely has a machine-readable schema; its "contract" is whatever behavior the mainframe has always exhibited, including its quirks.
"Here are 20 sample request/response pairs captured from the legacy account-lookup interface [paste samples], and a description of what the consuming microservice expects from each field. Generate a Pact-style consumer contract, including field-level assertions that would catch a silent format change, for example an account number changing length or a field changing from required to optional."
Field-level assertions of this kind matter more in legacy integration work than in modern microservice contexts, because the legacy side often cannot be modified to fail loudly when its contract changes; a downstream consumer may be the first to notice a break in production.
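To make the idea of field-level assertions concrete, here is a minimal plain-Python sketch. It does not use the Pact library or its file format; it only shows the kind of check such a contract encodes. The field names, patterns, and the `contract_violations` helper are hypothetical, standing in for rules an engineer would confirm from the captured samples.

```python
import re

# Hypothetical field rules, as an engineer might confirm them from samples.
CONTRACT = {
    "account_number": {"required": True, "pattern": r"\d{10}"},
    "branch_code": {"required": True, "pattern": r"[A-Z0-9]{4}"},
    "nickname": {"required": False, "pattern": r".{0,30}"},
}


def contract_violations(response: dict) -> list[str]:
    """Return a human-readable list of contract breaks in one response."""
    problems = []
    for name, rule in CONTRACT.items():
        if name not in response:
            # A missing optional field is fine; a missing required one is not.
            if rule["required"]:
                problems.append(f"{name}: required field missing")
            continue
        value = response[name]
        # fullmatch pins the whole value, so a change in length is caught.
        if not isinstance(value, str) or not re.fullmatch(rule["pattern"], value):
            problems.append(
                f"{name}: value {value!r} does not match {rule['pattern']}"
            )
    return problems
```

For example, a response whose `account_number` has grown from 10 to 12 digits yields one violation, which is the silent format change the prompt above asks the contract to catch.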
Property-based testing, which generates large volumes of varied input to verify that certain invariants always hold rather than checking fixed examples, is well suited to financial logic, where properties like "debits and credits across a transaction always net to zero" are easier to state than exhaustive example sets. The challenge has always been articulating the properties in the first place, particularly for legacy modules where the rules are implicit.
"Review this fund-transfer module and propose candidate invariants it should always satisfy: properties true regardless of input values, such as conservation of balance across accounts or non-negative balances without an overdraft flag. List each candidate property separately so I can confirm or reject it before any test generation."
Once an engineer confirms which candidates are genuine invariants, those properties translate cleanly into property-based test frameworks, with the AI generating the input generators and boundary distributions needed to exercise them meaningfully.
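As an illustration of how confirmed invariants become executable checks, the sketch below hand-rolls a small randomized test using only the standard library. It is a simplified stand-in for a property-based framework such as Hypothesis, which would add input shrinking and richer generators. The `transfer` function and `InsufficientFunds` exception are hypothetical placeholders for the module under test, and the boundary-biased amounts are one possible choice of distribution.

```python
import random
from decimal import Decimal


class InsufficientFunds(Exception):
    """Raised when a transfer would overdraw an account without a flag."""


def transfer(balances: dict, src: str, dst: str, amount: Decimal,
             overdraft_ok: frozenset = frozenset()) -> dict:
    """Hypothetical stand-in for the fund-transfer module under test."""
    if balances[src] - amount < 0 and src not in overdraft_ok:
        raise InsufficientFunds(src)
    updated = dict(balances)
    updated[src] -= amount
    updated[dst] += amount
    return updated


def check_invariants(trials: int = 1000, seed: int = 0) -> int:
    """Exercise transfer() with random inputs; return how many were checked."""
    rng = random.Random(seed)  # seeded so any failure is reproducible
    accounts = ["A", "B", "C"]
    checked = 0
    for _ in range(trials):
        balances = {a: Decimal(rng.randint(0, 100_000)) / 100 for a in accounts}
        src, dst = rng.sample(accounts, 2)
        # Bias toward boundaries: zero, exact balance, one cent over, random.
        amount = rng.choice([
            Decimal("0.00"),
            balances[src],
            balances[src] + Decimal("0.01"),
            Decimal(rng.randint(0, 200_000)) / 100,
        ])
        try:
            after = transfer(balances, src, dst, amount)
        except InsufficientFunds:
            continue  # rejected transfers are not checked further here
        # Invariant 1: total balance is conserved across accounts.
        assert sum(after.values()) == sum(balances.values()), "not conserved"
        # Invariant 2: no negative balance without an overdraft flag.
        assert all(v >= 0 for v in after.values()), "negative balance"
        checked += 1
    return checked
```

Calling `check_invariants()` raises an `AssertionError` on the first violated invariant; using `Decimal` rather than floats keeps the conservation check exact.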
None of this replaces the judgment of an engineer who understands the institution's specific history and risk tolerance. What changes is the starting point: instead of beginning test design from a blank page or a legacy system's own undocumented assumptions, QA teams begin from a structured, reviewable set of candidate behaviors, edge cases, and invariants, generated through deliberately scoped prompts rather than open-ended requests. For banks running critical logic on systems older than the engineers maintaining them, that shift in starting point is often the difference between testing what the system does and testing what it was meant to do.