Data Capabilities

Capabilities are discrete functional components that enable a business or technical activity, and they can be used in concert to enable various plays within the playbook. Capabilities are grouped into Services that are pre-approved from a security standpoint, with associated security controls that accelerate the ATO of our tenant applications. For data, the defined services are Data Governance, Data Management, and Data Analytics.

The following 18 data capabilities are numbered and organized into three categories: Data Governance, Data Management, and Data Analytics. For each capability you will find a description, key capabilities, maturity (as defined by the maturity index below), and, when applicable, corresponding technologies and documentation. For further exploration, visit the FAS Enterprise Data Architecture section of the Playbook.

Maturity Index

Capabilities in this section are rated as Concept Phase, Early Adoption, or Common Service.

1. Data Catalog (Data Governance)

Description

The Data Catalog is an organized inventory of data assets in the organization. It uses metadata to help FAS manage its data, and helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.

The data catalog stores data set and attribute-level metadata and enables data stewards to create and maintain that metadata for existing and new data sets.
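
As an illustration only (this is not Alation's data model), the kind of data set and attribute-level metadata a steward maintains might look like the following sketch; all names and fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AttributeMetadata:
    name: str              # physical column name
    data_type: str         # e.g. VARCHAR(36), INTEGER
    description: str       # business definition supplied by the data steward
    is_sensitive: bool = False

@dataclass
class DataSetMetadata:
    name: str                        # hypothetical data set name
    owner: str                       # accountable data steward
    source_system: str               # system of record
    tags: List[str] = field(default_factory=list)
    attributes: List[AttributeMetadata] = field(default_factory=list)

# Example entry a data steward might register in the catalog
orders = DataSetMetadata(
    name="fas.orders",
    owner="Acquisition Data Steward",
    source_system="Order Management",
    tags=["acquisition", "transactional"],
    attributes=[
        AttributeMetadata("order_id", "VARCHAR(36)", "Unique order identifier"),
        AttributeMetadata("vendor_ein", "VARCHAR(10)", "Vendor tax ID", is_sensitive=True),
    ],
)
```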

Key Capabilities
Data Search & Discovery Find relevant information within the huge volumes of enterprise data, contextualize it, and determine how the data can be accessed and used.
Curation & Governance Ensure analytics and insights are derived from the best, most trusted data. By applying governance at the point of data use, the data catalog reduces misuse of data and ensures compliance with agency and regulatory policies.
Collaboration & Analysis Through wiki-like articles, ratings, reviews, and conversations, the data catalog facilitates collaboration among an increasingly global and remote workforce.

Maturity

Common Service

FCS Product Offerings

  • Data Governance

Technologies

Alation

Additional Documentation

Data Catalog Capability

2. Data Quality (Data Governance)

Description

The Data Quality Service provides the necessary capabilities to assess the validity, accuracy, completeness, correctness, and timeliness of the data. The service supports data users as they evaluate new data sets, and supports production applications and data pipelines as they perform CRUD functions and process data.

Key Capabilities
Data profiling Generate descriptive metadata about a data set (e.g., schema, data types, field lengths, value distribution, valid values), as sketched after this list.
Rule definition Specify data quality rules based on prescriptive (e.g., business rules) and descriptive (e.g., technical) constraints, and specify their applicability (e.g., full data set, sampling).
Rule execution Invoke rules through data pipelines/orchestration solutions and support corrective data quality, including logging of data corrections and rule execution results.
Rule lifecycle management Modify rules and track changes over time.
DQ Results Reporting/Notification Includes DQ dashboard results and alerts/notifications for users and systems.
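
A minimal profiling and rule-execution sketch, assuming pandas and a hypothetical data set (pandas is not a stated FAS technology; it is used here only to illustrate the capability):

```python
import pandas as pd

# Load a data set to evaluate (file path and columns are hypothetical)
df = pd.read_csv("contracts.csv")

# Data profiling: descriptive metadata about the data set
profile = {
    "row_count": len(df),
    "columns": {
        col: {
            "dtype": str(df[col].dtype),
            "null_pct": float(df[col].isna().mean() * 100),
            "distinct": int(df[col].nunique()),
        }
        for col in df.columns
    },
}
print(profile["row_count"], "rows profiled")

# Rule definition: a prescriptive completeness rule (hypothetical column)
rule = {"column": "contract_number", "max_null_pct": 0.0}

# Rule execution: evaluate the rule and capture results
null_pct = float(df[rule["column"]].isna().mean() * 100)
passed = null_pct <= rule["max_null_pct"]

# DQ results reporting/notification
print(f"{rule['column']}: {null_pct:.2f}% null -> {'PASS' if passed else 'FAIL'}")
```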

Maturity

Concept Phase

Technologies

* TBD

Additional Documentation

* TBD

3. Master Data (Data Governance)

Description

The Master Data Service provides capabilities to rationalize core data domains and create authoritative data sets that can be exposed and leveraged across systems and business domains.

Key Capabilities
Unique Identifier Creation and Management Create/apply unique IDs to drive consistent identification/linking of master data elements across systems.
Data Standardization Apply consistent formatting and correct inconsistencies in master data elements (e.g., address formatting standardization).
Exact Matching Identify master data relationships across systems based on byte-for-byte matching values.
Fuzzy Matching Identify potential master data relationships across systems based on similar values and/or complex logic across multiple attributes (see the sketch after this list).
Recommendations Show potential master data record matches across systems and allow users to determine if they are valid or invalid matches.
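
A minimal sketch of exact and fuzzy matching using Python's standard library; the vendor records, identifiers, and threshold are hypothetical:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized similarity between two standardized values (0.0 - 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Hypothetical vendor records from two source systems
system_a = {"V-001": "Acme Corporation"}
system_b = {"7734": "ACME Corp."}

THRESHOLD = 0.6  # tuned per data domain

for id_a, name_a in system_a.items():
    for id_b, name_b in system_b.items():
        if name_a.lower() == name_b.lower():
            # Exact match: byte-for-byte (after case normalization)
            print(f"Exact match: {id_a} <-> {id_b}")
        elif (score := similarity(name_a, name_b)) >= THRESHOLD:
            # Recommendation: surface for a steward to confirm or reject
            print(f"Possible match ({score:.2f}): {id_a} <-> {id_b}")
```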

Maturity

Concept Phase

Technologies

* TBD

Additional Documentation

* TBD

4. Data Lifecycle (Data Governance)

Description

The Data Lifecycle Service provides the mechanism to manage data storage in alignment with data retention, archiving, and purge requirements to reduce data sprawl and storage costs.

Key Capabilities
Lifecycle Definition Define the conditions under which data can be retained, archived, and/or purged.
Time Driven Lifecycle Move data to lower tiered storage based on elapsed calendar time since the data was created or modified (see the sketch after this list).
Utilization Driven Lifecycle Move data to lower tiered storage based on elapsed calendar time since the data was last touched.
Intelligent Tiering Move data between storage tiers based on utilization/access patterns.
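
Because AWS S3 is one of the listed technologies, a time-driven lifecycle with a purge rule can be expressed as an S3 lifecycle configuration. A minimal boto3 sketch, with a hypothetical bucket, prefix, and retention periods:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: archive staged extracts after 90 days, purge after 7 years
s3.put_bucket_lifecycle_configuration(
    Bucket="example-fas-data-lake",              # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "staged-extract-retention",
                "Filter": {"Prefix": "staging/"},
                "Status": "Enabled",
                # Time-driven lifecycle: move to lower-cost storage
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # Purge once the retention period has elapsed
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```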

Maturity

Concept Phase

FCS Product Offerings

  • Data Lake
  • Data Warehouse

Technologies

Alation
AWS Redshift
AWS S3

Additional Documentation

* AWS S3 Lifecycle Management

5. Lineage (Data Governance)

Description

The Data Lineage Service enables understanding, recording, and visualizing data as it flows from data sources to consumption. This includes all transformations the data underwent along the way—how the data was transformed, what changed, and why.

Data lineage shows the history of the data you are looking at today, detailing where it originated and how it may have changed over time. It is a reflection of the data life cycle, the source, what processes or systems may have altered it and how it arrived at its current location and state.
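
As an illustration only (not any particular lineage tool's model), run-time lineage can be thought of as a graph of data sets connected by the transformations that produced them; the data set names below are hypothetical.

```python
from datetime import datetime, timezone

# Each edge records one hop of run-time lineage:
# (source data set, transformation applied, target data set, when it ran)
lineage_edges = [
    ("raw.orders", "deduplicate on order_id", "staged.orders",
     datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)),
    ("staged.orders", "join to reference vendor codes", "warehouse.orders_fact",
     datetime(2024, 1, 15, 2, 30, tzinfo=timezone.utc)),
]

def upstream_of(target: str) -> list[str]:
    """Walk the graph backwards to show where a data set originated."""
    sources = [src for src, _, tgt, _ in lineage_edges if tgt == target]
    return sources + [s2 for s in sources for s2 in upstream_of(s)]

print(upstream_of("warehouse.orders_fact"))
# ['staged.orders', 'raw.orders']
```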

Key Capabilities
Lineage Mapping Graphical representation of the data flow between source and target.
Lineage Details Description of data transformations applied to the data through each step of the data processing pipeline.
Design-time Lineage Lineage based on the intended process flow when the data pipeline was being created.
Run-time Lineage Lineage based on the actual data pipeline execution.

Maturity

Early Adoption

FCS Product Offerings

  • Data Governance

Technologies

Alation

Additional Documentation

* Data Catalog Capability

6. Reference Data (Data Governance)

Description

The Reference Data Service provides a means to manage bounded, common data sets across data domains to drive consistency. Reference data is slowly changing by nature and is used to group or organize other data. Within OLAP models, reference data is often represented through dimension tables.

Managing reference data centrally ensures the ability to consistently group and organize data, which enables easier cross-domain analytics.

Key Capabilities
Reference Data Inventory Store and manage reference data sets centrally.
Reference Data Publication Generate and expose authoritative copies of reference data to support different data consumers.
Change Notification Create systematic alerts when reference data records are created, modified, or deleted.
Reference Data Harmonization Standardization of multi-source reference data through business rules applied as transformation logic (see the sketch after this list).
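
A minimal harmonization sketch, assuming two source systems encode the same status concept with different local codes; the canonical values and mappings are hypothetical.

```python
# Canonical reference data set, managed centrally
ORDER_STATUS = {"OPEN", "AWARDED", "CLOSED"}

# Harmonization rules: map each source system's local codes to the canonical set
HARMONIZATION = {
    "system_a": {"O": "OPEN", "A": "AWARDED", "C": "CLOSED"},
    "system_b": {"open": "OPEN", "award": "AWARDED", "closed": "CLOSED"},
}

def harmonize(source: str, code: str) -> str:
    """Apply harmonization rules and reject values outside the canonical set."""
    canonical = HARMONIZATION[source].get(code)
    if canonical not in ORDER_STATUS:
        raise ValueError(f"Unmapped reference value {code!r} from {source}")
    return canonical

print(harmonize("system_b", "award"))  # AWARDED
```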

Maturity

Early Adoption

FCS Product Offerings

  • Data Warehouse

Technologies

AWS Redshift

Additional Documentation

* TBD

7. Data Policy (Data Governance)

Description

The Data Policy service provides a centralized location to define and manage the rules for user interaction with data. Stewards can map the rules to specific data sets and identify which policies are being applied to which data and user groups.

Key Capabilities
Policy Definition Specify rules, conditions, and warnings mapped to data sets and elements (see the sketch after this list).
Policy Execution Based on the defined rules, manage user access to and interaction with data consistent with the policy definition.
Policy Audit Detailed view of policy definitions and how they are applied to specific data sets and elements.
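
As a simple illustration of policy definition and execution (not a specific product's policy engine), a rule can map a data set to the user groups allowed to interact with it; the data sets, roles, and warning text are hypothetical.

```python
# Policy definition: rules mapped to data sets and user groups (hypothetical values)
POLICIES = [
    {"dataset": "warehouse.vendor_pii", "allowed_roles": {"privacy_officer"},
     "warning": "Contains PII; handle per agency policy."},
    {"dataset": "warehouse.orders_fact", "allowed_roles": {"analyst", "privacy_officer"},
     "warning": None},
]

def check_access(dataset: str, role: str) -> bool:
    """Policy execution: allow or deny interaction based on the defined rules."""
    for policy in POLICIES:
        if policy["dataset"] == dataset:
            if policy["warning"]:
                print("WARNING:", policy["warning"])
            return role in policy["allowed_roles"]
    return False  # no policy defined: deny by default

print(check_access("warehouse.vendor_pii", "analyst"))  # False
```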

Maturity

Concept Phase

FCS Product Offerings

  • Data Governance

Technologies

Alation

Additional Documentation

* TBD

8. Sensitive Data Detection (Data Governance)

Description

The Sensitive Data Detection service provides an automated means to identify data elements that require additional data protection or special handling based on organizational or regulatory rules.

Key Capabilities
Pattern Matching Identification of sensitive data elements based on attribute structure/format (see the sketch after this list).
Metadata Matching Identification of sensitive data elements based on attribute name or definition.
Rule Definition Creation of detection rules based on business-defined conditions.
Catalog Integration Automated updating of the data catalog with tags for sensitive data attributes.
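
A minimal sketch of pattern and metadata matching; the regular expressions (SSN-like and email-like formats) and column names are illustrative only.

```python
import re

# Pattern matching: detect sensitive values by structure/format
PATTERNS = {
    "ssn":   re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

# Metadata matching: detect sensitive attributes by name
SENSITIVE_NAME_HINTS = ("ssn", "tax_id", "email", "dob")

def detect(column_name: str, sample_values: list[str]) -> set[str]:
    tags = set()
    if any(hint in column_name.lower() for hint in SENSITIVE_NAME_HINTS):
        tags.add("metadata-match")
    for value in sample_values:
        for label, pattern in PATTERNS.items():
            if pattern.match(value):
                tags.add(label)
    return tags  # tags could then be pushed to the data catalog

print(detect("vendor_poc_email", ["jane.doe@example.com"]))
# {'metadata-match', 'email'}
```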

Maturity

Concept Phase

Technologies

* TBD

Additional Documentation

* TBD

9. Scheduling and Orchestration (Data Management)

Description

The Scheduling and Orchestration Service provides the ability to set up recurring executions of data pipelines/processes based on time parameters or conditions. This service reduces the need for manual intervention and can be used in conjunction with infrastructure provisioning capabilities.

Key Capabilities
Time-based Schedule Creation Configuration of recurring job executions based on time conditions (e.g., time of day, day of week, first day of the month).
Condition-based Schedule Creation Configuration of recurring job executions based on specific conditions being true. This could include a dependency on another job, a specific file being delivered, or a notification from another system.
Job Execution Retry In the event that a job does not complete successfully, automatically restart the job (see the sketch after this list).
Point-of-Failure Restartability In the event of a process failure, the ability to restart the job from the point where the failure occurred rather than restarting the entire process.
Job Branching and Merging Complex orchestration that allows jobs to initiate other jobs, wait for other jobs to complete before executing, and feed processing details into subsequent jobs.
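
Because Linux cron is the listed technology, a time-based schedule is simply a crontab entry; retry behavior typically lives in the job itself. The wrapper below is a hypothetical sketch, not an FAS pipeline.

```python
# Time-based schedule (crontab): run the pipeline at 02:00 every day
#   0 2 * * * /usr/bin/python3 /opt/pipelines/run_nightly.py

import time

def run_step(name: str) -> None:
    print(f"running {name}")  # placeholder for real pipeline work

def run_pipeline(steps: list[str], max_retries: int = 3) -> None:
    """Job execution retry: re-run a failed step before giving up."""
    for step in steps:
        for attempt in range(1, max_retries + 1):
            try:
                run_step(step)
                break  # success: move on to the next step
            except Exception as exc:
                print(f"{step} failed (attempt {attempt}): {exc}")
                if attempt == max_retries:
                    raise  # a real orchestrator would checkpoint completed
                           # steps here to support point-of-failure restart
                time.sleep(30)  # back off before retrying

run_pipeline(["extract", "transform", "load"])
```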

Maturity

Early Adoption

FCS Product Offerings

  • Extract Transform Load (ETL)

Technologies

Linux cron / crontab

Additional Documentation

* TBD

10. Data Model (Data Management)

Description

The Data Model service provides a means to manage and map key organizational data and the relationships between that data, and to represent those relationships graphically. This is key for supporting data governance, management, and design activities. Integration of the data modeling solution and the Data Catalog is key to ensuring consistent data management.

Key Capabilities
Model Creation Define a model including key entities/tables, attributes/fields, and relationships.
Attribute Management Configuration of attributes, including defining business and technical metadata.
Relationship Management Define how different entities are related based on the attributes that each entity contains.
Constraint Management Establish rules for attributes (e.g., key values, valid values, nullability, format).
Data Definition Language Generation Creation of scripts from the data model that can be used to create/modify database objects (see the sketch after this list).
Reverse Engineering Generating a data model based on a database's DDL.
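
A toy sketch of Data Definition Language generation from a logical model; the entity, attributes, and constraints are hypothetical and the output is ANSI-style SQL.

```python
# A hypothetical logical model: entity, attributes, and constraints
model = {
    "table": "vendor",
    "attributes": [
        {"name": "vendor_id",   "type": "VARCHAR(36)",  "nullable": False, "pk": True},
        {"name": "vendor_name", "type": "VARCHAR(200)", "nullable": False, "pk": False},
        {"name": "cage_code",   "type": "CHAR(5)",      "nullable": True,  "pk": False},
    ],
}

def generate_ddl(m: dict) -> str:
    """Generate a CREATE TABLE script from the model definition."""
    cols = []
    for a in m["attributes"]:
        null = "NULL" if a["nullable"] else "NOT NULL"
        cols.append(f'    {a["name"]} {a["type"]} {null}')
    pk = ", ".join(a["name"] for a in m["attributes"] if a["pk"])
    cols.append(f"    PRIMARY KEY ({pk})")
    return f'CREATE TABLE {m["table"]} (\n' + ",\n".join(cols) + "\n);"

print(generate_ddl(model))
```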

Maturity

Concept Phase

Technologies

* TBD

Additional Documentation

* TBD

11. Data Sharing (Data Management)

Description

The Data Sharing service provides systematic means for data owners to expose data to interested parties through controlled interfaces.

Key Capabilities
Direct Access Provide data consumers direct access to the data storage layer through defined access controls based on the sensitivity of the data and the permissions of the user.
Data Abstraction Creation of a semantic layer to manage data access and provide a managed view of the data to consumers who do not have direct access to the underlying data.

Maturity

Concept Phase

Technologies

* TBD

Additional Documentation

* TBD

12. Data Exchange (Data Management)

Description

The Data Exchange service provides a means to deliver authoritative copies of data to downstream users/systems to support local application processing and/or analytics.

Key Capabilities
Bulk Data Transfer Creation of an authoritative copy of data that can be delivered to the consumer for reuse, either as batch files delivered to a specified location or as database replicas for one-time or ongoing (change data capture) data transfer (see the sketch after this list).
Application Programming Interface (API) Brokered real-time synchronous interface between data owner and consumer based on a request/response paradigm, whereby the consumer makes a specific request for data to the data owner based on a predefined data specification.
Event Publication Brokered real-time asynchronous interface in which the data owner publishes notifications of state changes (or the changed data itself) to a centralized queue; consumers monitor the queue for data of interest and consume and process the data as the events occur.
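
Because AWS S3 is one of the listed technologies, a bulk data transfer can be as simple as delivering a batch extract to an agreed S3 location. A minimal boto3 sketch, with a hypothetical bucket, prefix, and file:

```python
import boto3

s3 = boto3.client("s3")

# Bulk data transfer: deliver an authoritative batch extract to the consumer's
# agreed drop zone (bucket, prefix, and file name are hypothetical)
s3.upload_file(
    Filename="/tmp/orders_2024-01-15.csv",
    Bucket="example-fas-data-exchange",
    Key="outbound/orders/orders_2024-01-15.csv",
)
```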

Maturity

Early Adoption

FCS Product Offerings

  • Data Lake
  • Database Migration

Technologies

AWS DMS
AWS S3

Additional Documentation

* TBD

13. Data Processing (Data Management)

Description

The Data Processing service enables integration, standardization, organization, and derivation of data to make it easier to consume and use downstream. It supports data integration to manipulate and consolidate data from disparate sources into a useful form, giving users easy, reliable access to the information needed by applications, users, and business processes, and producing a unified view from which actionable information can be gleaned.

Key Capabilities
Extract Transform Load (ETL) Access and pull data from sources, apply transformations, and refine and publish data for downstream consumption (see the sketch after this list).
Extract Load Transform (ELT) Access and pull data from sources, persist a copy of the source data for additional refinement, apply transformations, and refine and publish data for downstream consumption.
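
A minimal ETL sketch using pandas purely for illustration (the FCS ETL offering is Pentaho Data Integration); the file names and transformations are hypothetical.

```python
import pandas as pd

# Extract: pull data from a source (hypothetical CSV export)
orders = pd.read_csv("orders_raw.csv")

# Transform: standardize and derive fields for downstream use
orders["order_date"] = pd.to_datetime(orders["order_date"])
# Federal fiscal year: October through December roll into the next year
orders["fiscal_year"] = orders["order_date"].dt.year.where(
    orders["order_date"].dt.month < 10, orders["order_date"].dt.year + 1
)
orders = orders.drop_duplicates(subset="order_id")

# Load: publish the refined data for downstream consumption
orders.to_parquet("orders_refined.parquet", index=False)
```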

Maturity

Common Service

FCS Product Offerings

  • Extract Transform Load (ETL)
  • Data Processing Cluster
  • Database Migration

Technologies

Pentaho Data Integration
Amazon EMR Serverless
AWS DMS

Additional Documentation

* Data Integration Play

14. Data Storage (Data Management)

Description

The Data Storage service provides the ability to store, manage, and expose data for data consumers to access, query, explore, analyze, and use to generate new insights and reports.

Key Capabilities
Unstructured Data Storage Capturing and persisting data in a scalable manner to enable centralized storage of cross-domain data for further downstream processing and consumption. Unstructured data storage can handle any file/object type and store it in a cost-effective manner with easy ingestion and access methods.
Structured Data Storage Capturing and persisting conformed data organized in a business context to support ease of data exploration, analytics, and reporting. Structured data storage enforces data design specifications such as schema to improve quality and usability of the data.
Data Access Query/interact with data through standard interfaces based on user roles and data protection policies.
Data Protection Encrypt data to further protect it from unnecessary exposure/access (see the sketch after this list).
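
A minimal sketch of unstructured data storage with data protection, using boto3 against S3 (one of the listed technologies); the bucket, key, and file are hypothetical, and server-side encryption is shown explicitly for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Unstructured data storage: persist any object type in the data lake
# Data protection: request server-side encryption on write
with open("survey_responses.json", "rb") as body:
    s3.put_object(
        Bucket="example-fas-data-lake",          # hypothetical bucket
        Key="raw/surveys/survey_responses.json",
        Body=body,
        ServerSideEncryption="aws:kms",
    )
```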

Maturity

Common Service

FCS Product Offerings

  • Data Lake
  • Data Warehouse

Technologies

AWS Redshift
AWS S3

Additional Documentation

* Data Warehousing with AWS Redshift

15. Self Service (Data Analytics)

Description

Self-Service provides capabilities that allow users to query data through a command line interface supporting ANSI standard SQL, and to manipulate, integrate, and transform data to derive new insights. This service is intended to allow business users to generate new insights and prototype data pipelines.

Key Capabilities
Query creation Writing of custom SQL against analytic data stores to explore the data and generate insights (see the sketch after this list).
Query optimization and editing Refactoring of a query based on new business requirements or to improve performance based on systematic recommendations (e.g., explain plan).
Query version control Saving/persisting versions of a query, tracking changes, and potentially branching/merging code across users.
Extract Transform Load (ETL) Access and pull data from sources, apply transformations, and refine and publish data to support localized analytics/reporting.
Extract Load Transform (ELT) Access and pull data from sources, persist a copy of the source data for additional refinement, apply transformations, and refine and publish data to support localized analytics/reporting.
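
A minimal query-creation sketch: the kind of ANSI SQL a business user might write in a self-service client. It is wrapped in Python's built-in sqlite3 only so the example is self-contained; the actual analytic store would be Redshift or similar, and the table and columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT, region TEXT, obligated_amount REAL);
    INSERT INTO orders VALUES ('A1', 'NCR', 1200.0), ('A2', 'NCR', 800.0),
                              ('A3', 'R7', 400.0);
""")

# Query creation: explore the data and generate an insight
query = """
    SELECT region,
           COUNT(*)              AS order_count,
           SUM(obligated_amount) AS total_obligated
    FROM orders
    GROUP BY region
    ORDER BY total_obligated DESC;
"""
for row in conn.execute(query):
    print(row)
# ('NCR', 2, 2000.0)
# ('R7', 1, 400.0)
```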

Maturity

Early Adoption

Technologies

SQuirreL SQL Client
Alation Compose
Tableau
Elastic Map Reduce (EMR)
MicroStrategy

Additional Documentation

* TBD

16. Computational Service (Data Analytics)

Description

The Computational service provides a means for scalable, parallelized, complex data processing and compute. It is intended to provide core capabilities for advanced data processing in support of analytics, data science, and machine learning (ML).

Key Capabilities
Apache Spark-based Processing Leverages Spark's in-memory processing to improve scale and parallelization for large-scale data processing (see the sketch after this list).
Multi-language Support Use Python, Scala, or Java to write Spark-based data processes.
Library Integrations Extend data science functionality through common open source libraries.
EMR Studio / Jupyter Notebooks Integrated development environment (IDE) for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark.
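
A minimal PySpark sketch of Spark-based processing of the kind that runs on EMR; the input path, columns, and aggregation are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-aggregation").getOrCreate()

# Spark-based processing: parallelized aggregation over a large data set
# (the S3 path below is hypothetical)
orders = spark.read.parquet("s3://example-fas-data-lake/refined/orders/")

summary = (
    orders
    .filter(F.col("fiscal_year") == 2024)
    .groupBy("region")
    .agg(F.sum("obligated_amount").alias("total_obligated"))
)

summary.write.mode("overwrite").parquet(
    "s3://example-fas-data-lake/analytics/orders_by_region/"
)
spark.stop()
```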

Maturity

Early Adoption

FCS Product Offerings

  • Data Processing Cluster

Technologies

AWS EMR
Amazon EMR Serverless

Additional Documentation

EMR User Guide

17. Business Intelligence (Data Analytics)

Description

The Business Intelligence service includes all facets of standard reporting, dashboarding, and data visualization capabilities, including authoring, publication, lifecycle management, and access to reporting and visualization artifacts.

Key Capabilities
Pixel-perfect Reporting Structured reporting conformed to exact specifications to meet organizational or regulatory requirements.
Standard Reporting Structured tabular reports where the user can interact with the data to filter, drill up/down/across, and explore the underlying data.
Visualization/Dashboards Interactive reports including charts, visual representations, and graphs.

Maturity

Common Service

FCS Product Offerings

  • Business Intelligence

Technologies

MicroStrategy
Tableau

Additional Documentation

* Data Visualization Play

18. AI/ML Lifecycle (Data Analytics)

Description

The AI/ML Lifecycle service enables data scientists to manage all facets of model creation and execution through standardized tools and methods aligned with best practices for model management and DevSecOps approaches.

Key Capabilities
Data Acquisition and Refinement Import/access data and standardize it for input to a machine learning model.
Model Development Create and refine models (see the sketch after this list).
Model Training Harmonize models through additional input data and refactoring.
Model Testing Validate model outcomes and functionality.
Model Versioning Retain model versions, including input data, code, and output data, for development and compliance requirements.
Model Promotion Migrate approved models to execute in a production environment and/or integrate with production applications.
Model Monitoring Recurring validation of models to identify data and/or model drift.
Model Refactoring Update/retrain models to ensure the model produces appropriate outcomes.
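
A minimal sketch of model development, training, testing, and versioning using scikit-learn (an assumption; the service does not name an ML toolkit). Synthetic data stands in for a refined data set, and the file name is hypothetical.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data acquisition and refinement (synthetic data stands in for a refined data set)
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model development and training
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Model testing: validate outcomes before promotion
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy: {accuracy:.3f}")

# Model versioning: persist the trained artifact for promotion and later audits
joblib.dump(model, "model_v1.joblib")
```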

Maturity

Concept Phase

Technologies

* TBD

Additional Documentation

* TBD