Fleet Bidirectional Data Synchronization
This case study provides an overview and a how-to for a number of technical challenges that the Fleet modernization addressed successfully.
Introduction
The FAS-IT modernization aims, among other business objectives, to retire mainframe systems. As Fleet modernizes its systems, the newly deployed products need access to legacy data. In turn, the modernized applications generate transactional data that must find its way back to the legacy systems. Hence, there is a need for bidirectional data synchronization between the two environments.
Problem Statement
By 2019, the constellation of Fleet systems had proliferated significantly as new business requirements yielded ever more interconnected Fleet systems. Fleet embarked on a modernization journey to build a cloud-based solution: GSAFleet.gov. The modernization is a multi-year effort that requires systems, old and newly deployed, to operate in concert. During the transition period, both GSAFleet.gov and the legacy Fleet systems produce new transactional data, and that data must be kept synchronized across all of these data stores. In principle, a customer can buy a car through GSAFleet.gov and service it using legacy systems; the requisite data therefore has to flow in both directions to maintain such "live-live" operations across the cloud, the mainframe, and the data center.

How can interdependent systems that run on the mainframe, in the cloud, and in data centers operate for several years without interruption? Can the requisite data be kept synchronized well enough that both legacy and new systems are integrated seamlessly? To answer these questions, let's delve into the business requirements for the data.
Data Requirements
- In near real-time, systems on the various platforms will have access to the same data values.
- As features are carved out of a legacy system, the remaining pieces must continue to function on their own data plus the data written back from the new system.
- As entire legacy systems are decommissioned, the data pipelines must be rewired to reflect the state of the new dataflows.
- As modernization progresses, the scope of the data synchronization solution will resemble a bell curve. The number of pipelines grows as new GSAFleet.gov systems come online while a significant number of legacy systems remain in service, peaks mid-project, and shrinks again as legacy systems are retired; at the beginning and the end of the project, the number of synchronization pipelines is small.
The central problem was how to achieve bidirectional synchronization across systems on different platforms (mainframe, data center, and cloud) while supporting different data formats (structured, relational, and schemaless/NoSQL). Hosting a single, central data repository that all systems could refer to was not feasible; the data is therefore necessarily duplicated, and synchronization becomes paramount. Bidirectional synchronization also carries an inherent risk: the same record can be edited on both sides before the changes made on one end are known to the other. There is also the well-known tradeoff between data availability and data consistency in a distributed system.
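The conflict risk can be made concrete with a small example. The sketch below is a simplified illustration, not code from the project; the record structure, version scheme, and conservative "flag, don't overwrite" policy are assumptions chosen for clarity. It shows how a record edited on both sides between synchronization runs can be detected by comparing the version an incoming edit was based on against the current local version.

```python
from dataclasses import dataclass

@dataclass
class IncomingChange:
    key: str
    new_value: str
    base_version: int        # version the remote side read before editing

@dataclass
class LocalRow:
    key: str
    value: str
    version: int             # incremented on every committed local write

def apply_change(local: LocalRow, change: IncomingChange) -> LocalRow:
    """Apply a replicated edit, detecting concurrent edits on both sides.

    If the local row has moved past the version the remote edit was based
    on, both environments changed the record before either saw the other's
    update. The change is rejected and flagged rather than silently
    overwriting data (a deliberately conservative policy for this sketch).
    """
    if change.base_version != local.version:
        raise RuntimeError(
            f"Conflict on {change.key}: local version {local.version}, "
            f"remote edit based on version {change.base_version}"
        )
    return LocalRow(key=local.key, value=change.new_value, version=local.version + 1)
```

In practice a real pipeline would route such conflicts to a business rule or a review queue rather than raising an exception, but the detection step is the same.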
Considered Solutions
The initial architecture included bidirectional synchronization at every node of the data architecture, e.g., between the consolidated data storage hosted in a data center and a document store residing in the cloud.

We also considered peer-to-peer database replication. Our proofs of concept demonstrated that this approach could not satisfy the enterprise-level requirements. For example, in the case of a catastrophic failure, a full data recovery would not be possible, potentially leading to data loss. Moreover, peer-to-peer replication is not supported across two different schemas, which was our situation.
In an iterative redesign process, the document store was replaced with a relational database, the bidirectional data pipelines became one-directional, and almost the entire infrastructure moved into the cloud. Still, in this architecture every node of the pipelines would have implemented bidirectional synchronization, subject to all of the associated risks, complicating development, processing, and disaster recovery.
Adopted Solution
We simplified the bidirectional synchronization at each node as much as possible, so that:
- Data flows one way 80% of the time. This cuts down on risks associated with the bidirectional synchronization at every node.
- Bidirectional channels are used only with our legacy data center systems which use hierarchical databases.
- The only bidirectional synchronization present is handled via a commercial tool.
Data still flows in both directions. However, the bidirectional aspect is limited to a single node and is handled by a commercial tool, Data Exchange, while the other nodes move data one way using simpler means, such as flags and timestamps. This solution eliminated multiple development tasks and allowed us to jumpstart GSAFleet.gov development, providing data to the new system while keeping it synchronized with the legacy side.
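For the one-directional pipelines, the "flags and timestamps" approach amounts to incremental change detection: each run picks up only rows stamped after a stored watermark (or whose sync flag is not yet set) and upserts them into the target. The following is a minimal sketch of that idea, assuming a source table with a last_updated column and a small watermark table on the target; the table and column names are illustrative, not the actual Fleet schema.

```python
import sqlite3

def sync_changed_rows(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Copy rows changed since the last successful run, one direction only.

    A watermark table on the target records how far the pipeline has read.
    Assumes vehicle_id is the primary key on the target table and that
    last_updated holds sortable ISO-8601 timestamps.
    """
    (watermark,) = target.execute(
        "SELECT last_synced_at FROM sync_watermark WHERE pipeline = 'vehicles'"
    ).fetchone()

    changed = source.execute(
        "SELECT vehicle_id, vin, status, last_updated FROM vehicles "
        "WHERE last_updated > ? ORDER BY last_updated",
        (watermark,),
    ).fetchall()

    for vehicle_id, vin, status, last_updated in changed:
        target.execute(
            "INSERT INTO vehicles (vehicle_id, vin, status, last_updated) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT(vehicle_id) DO UPDATE SET "
            "vin = excluded.vin, status = excluded.status, "
            "last_updated = excluded.last_updated",
            (vehicle_id, vin, status, last_updated),
        )

    if changed:
        target.execute(
            "UPDATE sync_watermark SET last_synced_at = ? WHERE pipeline = 'vehicles'",
            (changed[-1][3],),
        )
    target.commit()
    return len(changed)
```

Because the data moves one way, there is no conflict to resolve at these nodes; the only state the pipeline has to protect is the watermark itself.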

Data Exchange
Data Exchange is a key component of our solution. The tool enables us to bring over mainframe databases and create mirror copies of them in relational databases. It replicates data from the legacy databases and propagates updates to and from their cloud-based relational equivalents, keeping both in sync with minimal latency.
ETLs and a Consolidated Database
To provide a single source of truth for the legacy data, we built a consolidated database (AIC) that serves as a central data hub. A purpose-built framework executes about 300 interdependent Extract, Transform, and Load (ETL) scripts migrating 22 data domains consisting of over 4,000 attributes and millions of records. The ETLs consolidate data from 11 relational databases to build AIC.
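Because the roughly 300 ETL scripts are interdependent, the framework has to run them in an order that respects those dependencies. The sketch below illustrates that idea with a topological sort over a dependency map; the script names and dependencies are invented for the example and do not reflect the actual Fleet ETL inventory.

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# Hypothetical fragment of the dependency map: each ETL lists the ETLs
# whose output it consumes.
dependencies: dict[str, set[str]] = {
    "load_customers": set(),
    "load_vehicles": {"load_customers"},
    "load_orders": {"load_customers", "load_vehicles"},
}

def run_etls(deps: dict[str, set[str]]) -> None:
    """Execute ETL scripts in dependency order (predecessors first)."""
    for etl_name in TopologicalSorter(deps).static_order():
        print(f"running {etl_name}")      # placeholder for invoking the real ETL script
```

A real orchestrator would also handle failures and restarts, but the ordering problem itself reduces to this dependency graph.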
GSAFleet.gov Data Architecture
GSAFleet.gov implements a microservice architecture and follows the Domain-Driven Design pattern. Each microservice maintains its own data store independently. The data in these stores needs to flow from the consolidated database and into the mirrored legacy databases. This is achieved via a series of pipelines built with StreamSets, an off-the-shelf data migration product that lets us build data pipelines through a graphical user interface with drag-and-drop features.
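StreamSets pipelines are assembled graphically rather than written as code, but conceptually each one is an origin, zero or more processors, and a destination. The sketch below is a purely illustrative Python analogue of one such pipeline, moving records from the consolidated database toward a microservice's own store; the stage names, field names, and reshaping logic are invented for the example.

```python
from typing import Iterable

Record = dict

def origin_read_aic() -> Iterable[Record]:
    """Origin stage: read records from the consolidated (AIC) database (stubbed here)."""
    yield {"vehicle_id": 1, "status": "ACTIVE"}

def transform_to_domain_model(records: Iterable[Record]) -> Iterable[Record]:
    """Processor stage: reshape legacy fields into the microservice's domain model."""
    for r in records:
        yield {"id": r["vehicle_id"], "state": r["status"].lower()}

def destination_write_service_store(records: Iterable[Record]) -> None:
    """Destination stage: upsert into the microservice's own data store (stubbed here)."""
    for r in records:
        print("upsert", r)

def run_pipeline() -> None:
    destination_write_service_store(transform_to_domain_model(origin_read_aic()))

if __name__ == "__main__":
    run_pipeline()
```

The drag-and-drop tooling hides this plumbing, which is precisely what makes it practical to maintain a large and changing set of pipelines during the transition period.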
This fully developed and robust architecture has successfully kept the data in sync between GSAFleet.gov and the legacy systems and databases: hundreds of thousands of records are created or updated on a daily basis.
