Introduction

IT infrastructure for financial institutions is becoming increasingly complex. Modern banking infrastructure combines mainframe applications, monolithic applications based on Service-Oriented Architecture (SOA), relational databases, and big data platforms. This complexity has led to “brittleness”: under certain conditions, a failure in one subsystem can lead to catastrophic failure of an entire business function such as clearing and settlement (see below).

BNY Lost Payments Capability for 19 Hours

In addition to lacking resiliency, these systems lack robustness. Robustness in the face of failure means that, as subsystems become unavailable, services degrade gracefully in a manner that is consistent with business needs. Finally, disaster recovery mechanisms must be put in place that allow full capabilities to be restored in a timely manner after subsystems have failed.

The key question is: What best practices can be employed to ensure greater resiliency and robustness for IT infrastructure?

Problem Description

System is Heterogeneous

Financial institutions today use a combination of mainframes, monolithic applications, relational databases, microservices, and big data platforms. These subsystems communicate across a variety of channels, including mainframe communication fabrics, enterprise service buses, lightweight message queues, and integration middleware. Each of these has its own way of storing and transmitting state changes, transforming data, maintaining consistency, and queuing transactions and messages.

This makes it difficult to debug and restore systems when there are problems. There may be no central repository for debug information. Subsystems may have unknown dependencies. Messages and transactions lost in-flight may not be recoverable in the event of system failure.

Legacy Systems

Banks and other institutions employ a wide range of infrastructure, some of it legacy in nature. Mainframes, in particular, have maintained backwards compatibility from generation to generation. As a result, some code written fifty years ago may still be running in production. This code may be poorly documented, as well as difficult and expensive to re-engineer.

Centralized Databases

Many different applications and services may all store and retrieve state information from a single centralized relational database. This can create a single point of failure and couple subsystems to one another. The use of centralized databases is often mandated to reduce licensing and administration costs. However, in the era of FOSS (free and open source software), this restriction is no longer necessary.

Architectures Not Aligned to Business

Service-Oriented Architecture (SOA) decomposes enterprise software into business processes, services, service components, and operational systems. SOA was designed to maximize re-use of software and hardware components. This design was driven by a desire to minimize software licensing costs (such as those for commercial relational databases and operating systems) and to maximize hardware utilization. However, it had some undesirable consequences. A change to a single business requirement could impact a large number of layers and components. Because components are shared, the architecture makes it difficult to optimize them for each business. There are also problems caused by a misalignment of ownership between lines of business and the projects that create and maintain the various services and components.

Best Practices

Centralized Logging

All services and subsystems should send their logs to a central logging facility for debugging and monitoring purposes. This makes information available in a single location for analysis. Modern logging platforms support both streaming and batch processing of data, and allow extensive analytics to be performed on log data across data sources.
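The idea can be illustrated with a minimal sketch built on Python's standard logging module. The CentralStore class is a hypothetical in-memory stand-in for a real logging platform, and the service names and JSON field names are assumptions chosen for the example. Each service gets its own handler that formats records as JSON and forwards them to the shared store.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so the central store can index it."""

    def __init__(self, service):
        super().__init__()
        self.service = service  # name of the emitting service (illustrative)

    def format(self, record):
        return json.dumps({
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        })


class CentralStore:
    """Hypothetical stand-in for a central logging platform (in-memory only)."""

    def __init__(self):
        self.records = []

    def ingest(self, line):
        self.records.append(json.loads(line))

    def query(self, **fields):
        """Return the records whose fields all equal the given values."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in fields.items())]


class ForwardingHandler(logging.Handler):
    """Forward every formatted record from one service to the central store."""

    def __init__(self, store, service):
        super().__init__()
        self.store = store
        self.setFormatter(JsonFormatter(service))

    def emit(self, record):
        self.store.ingest(self.format(record))


def make_service_logger(service, store):
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.handlers = [ForwardingHandler(store, service)]
    return logger


store = CentralStore()
payments = make_service_logger("payments", store)
settlement = make_service_logger("settlement", store)

payments.info("payment %s accepted", "P-1")
settlement.error("settlement queue unavailable")

# One query covers every service that logs to the store.
errors = store.query(level="ERROR")
```

In a real deployment the forwarding handler would ship records over the network to whichever logging platform the institution runs. The in-memory list is used here only to keep the sketch self-contained.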

Correlation IDs

Correlation IDs are identifiers that are passed between processes, programs, and subsystems in order to trace dependencies in the system. This design pattern is particularly important in microservices architectures where a business activity may be carried out by hundreds of microservices, and other applications, acting in concert. Centralized logs can be searched for specific correlation IDs to debug specific errors, and diagnose overall system behavior.
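As a minimal sketch of this pattern, assuming a Python service that uses the standard logging and contextvars modules, the correlation ID can be held in a context variable and stamped onto every log record by a filter. The function names, logger name, and log format below are illustrative assumptions, and a string buffer stands in for the central log.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID for the current unit of work ("-" when none is set).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("payments-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
    handler.addFilter(CorrelationIdFilter())
    logger.handlers = [handler]
    return logger


def validate_payment(logger, amount):
    # Downstream step: it logs without being handed the ID explicitly.
    logger.info("validating payment of %s", amount)


def handle_request(logger, amount, incoming_id=None):
    # Reuse the caller's ID if one arrived (for example in a message header),
    # otherwise mint a new one at the edge of the system.
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        validate_payment(logger, amount)
    finally:
        correlation_id.reset(token)
    return cid


def search(log_text, cid):
    """Return the log lines that carry the given correlation ID."""
    return [line for line in log_text.splitlines()
            if line.startswith(cid + " ")]


stream = io.StringIO()  # stands in for the central log
log = build_logger(stream)
first = handle_request(log, 100)
second = handle_request(log, 250, incoming_id="abc123")
matches = search(stream.getvalue(), "abc123")
```

Across process boundaries the ID would travel in a message header or request header. The receiving service passes it in as incoming_id so that both sides log under the same identifier.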

Bounded Context

As mentioned above, using centralized databases to store state information can create a single point of failure in the system. The microservices architecture dictates that context be bounded to each microservice. This means that each microservice is responsible for maintaining its own state. This shifts responsibility from a centralized, shared DBA team to the team delivering the microservices themselves. Bounded contexts reduce coupling between services, making systemic failures less likely.
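The following sketch illustrates the principle under simplifying assumptions: two hypothetical services, AccountService and PaymentService, each own a private in-memory SQLite database, and the payment service reaches account data only through the account service's public methods. The class names, table layouts, and status strings are invented for the example.

```python
import sqlite3


class AccountService:
    """Owns account balances; no other service touches its database."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")  # private store
        self._db.execute(
            "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")

    def open_account(self, account_id, balance):
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT INTO accounts VALUES (?, ?)", (account_id, balance))

    def debit(self, account_id, amount):
        """Debit the account if funds allow; return True on success."""
        with self._db:
            cur = self._db.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE id = ? AND balance >= ?",
                (amount, account_id, amount))
        return cur.rowcount == 1

    def balance(self, account_id):
        row = self._db.execute(
            "SELECT balance FROM accounts WHERE id = ?",
            (account_id,)).fetchone()
        return None if row is None else row[0]


class PaymentService:
    """Owns payment records; reaches accounts only through AccountService."""

    def __init__(self, accounts):
        self._accounts = accounts
        self._db = sqlite3.connect(":memory:")  # separate private store
        self._db.execute(
            "CREATE TABLE payments (id INTEGER PRIMARY KEY, "
            "account_id TEXT, amount INTEGER, status TEXT)")

    def pay(self, account_id, amount):
        debited = self._accounts.debit(account_id, amount)
        status = "SETTLED" if debited else "REJECTED"
        with self._db:
            self._db.execute(
                "INSERT INTO payments (account_id, amount, status) "
                "VALUES (?, ?, ?)", (account_id, amount, status))
        return status

    def statuses(self, account_id):
        rows = self._db.execute(
            "SELECT status FROM payments WHERE account_id = ? ORDER BY id",
            (account_id,)).fetchall()
        return [r[0] for r in rows]


accounts = AccountService()
payments = PaymentService(accounts)
accounts.open_account("ACC-1", 10_000)  # amounts in minor units (cents)
first = payments.pay("ACC-1", 4_000)
second = payments.pay("ACC-1", 7_000)
```

Because neither service reads the other's tables, the account schema can change, or the account store can be replaced entirely, without touching the payment service as long as the public methods keep their behavior.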

Employ Domain Driven Design

As mentioned above, one of the main weaknesses of SOA was the difficulty of adapting services to needs that are specific to certain businesses. Adopters of microservices architectures are attempting to change that by recognizing the importance of domain-driven design (DDD) in best practices. DDD should be used to determine how best to partition services along business lines. In addition, DDD can drive the definition of what behavior should be exhibited in the event of subsystem failure or degradation in performance. For instance, if an AML (anti-money laundering) service fails to respond, perhaps a manual approval user interface should be presented to administrators. It is important to keep in mind that failures can be partial, can cascade to other applications and services, and may only show up when a service is interacting with other parts of the system. Resiliency and disaster-recovery requirements cannot come purely from a technical understanding of the system. These requirements must be driven by business requirements from the domain.
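The AML example can be sketched as a fallback rule. The version below is only an illustration: the Payment type, the AmlUnavailable exception, the status strings, and the toy screening rule are all assumptions, and the real policy for what to do when screening is unavailable would have to come from the business domain.

```python
import queue
from dataclasses import dataclass


class AmlUnavailable(Exception):
    """Raised when the AML service fails or does not answer in time."""


@dataclass(frozen=True)
class Payment:
    payment_id: str
    amount: int  # minor units (cents)


def screen_payment(payment, aml_check, manual_queue):
    """Screen a payment, degrading to manual review if AML is unavailable."""
    try:
        cleared = aml_check(payment)
    except AmlUnavailable:
        # Business rule (assumed): never auto-approve unscreened payments;
        # park them for a human decision instead of failing the whole flow.
        manual_queue.put(payment)
        return "PENDING_MANUAL_REVIEW"
    return "APPROVED" if cleared else "BLOCKED"


def healthy_aml(payment):
    # Toy rule for the demo only, not a real AML policy.
    return payment.amount < 1_000_000


def failing_aml(payment):
    raise AmlUnavailable("AML service did not respond")


manual_review = queue.Queue()  # would feed the manual approval interface
ok = screen_payment(Payment("P-1", 5_000), healthy_aml, manual_review)
degraded = screen_payment(Payment("P-2", 5_000), failing_aml, manual_review)
```

The choice made in the except branch is the business decision. A different domain might instead hold payments, or allow low-value payments through, which is why the rule cannot be derived from the technical design alone.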

Reengineer Legacy Subsystems as Appropriate

Legacy code and systems are often portrayed as the immovable object of IT. Rather than assuming that legacy code cannot be changed or replaced, changes should be prioritized based on business requirements. Legacy programs may have static routing, inadequate logging, or bugs that can put business continuity at risk. If a legacy program shows any of these weaknesses, and fixing it is a high priority given business considerations, it may be warranted to migrate it to a new architecture or to fix the bugs in the current program.