Data Mesh Principles and Logical Architecture
Our aspiration to augment and
improve every aspect of business and life with data demands a paradigm shift
in how we manage data at scale. While the technological advances of the past
decade have addressed the scale of data volume and of data processing compute,
they have failed to address scale in other dimensions: changes in the data
landscape, proliferation of sources of data, diversity of data use cases and
users, and speed of response to change. Data mesh addresses these dimensions,
founded in four principles: domain-oriented decentralized data ownership and
architecture, data as a product, self-serve data infrastructure as a platform,
and federated computational governance. Each principle drives a new logical
view of the technical architecture and organizational structure.
This article is written
as a foundation for follow-ups. It summarizes the data mesh approach by
enumerating its underpinning principles, and the high level logical
architecture that the principles drive. Establishing this high-level logical
model is a necessary foundation before I dive into the detailed architecture of
data mesh core components in future articles. Hence, if you are in search of a
prescription around exact tools and recipes for data mesh, this article may
disappoint you. If you are seeking a simple and technology-agnostic model that
establishes a common language, come along.
The great divide of data
What do we really mean by data? The answer depends on whom you ask. Today’s
landscape is divided into operational data and analytical data. Operational data sits in databases
behind business capabilities served with microservices, has a transactional
nature, keeps the current state and serves the needs of the applications
running the business. Analytical data is a temporal and aggregated view of the
facts of the business over time, often modeled to provide retrospective or
future-perspective insights; it trains the ML models or feeds the analytical
reports.
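To make the distinction concrete, here is a minimal sketch in Python; the `Order` and `DailyOrderSummary` types and their fields are illustrative assumptions, not taken from the article.

```python
from dataclasses import dataclass
from datetime import date

# Operational data: the current state of a business fact, kept by the
# application running the business (field names are illustrative).
@dataclass
class Order:
    order_id: int
    customer_id: int
    status: str      # current state only; prior states are overwritten
    total: float

# Analytical data: a temporal, aggregated view of facts over time,
# modeled for retrospective insight rather than transactions.
@dataclass
class DailyOrderSummary:
    day: date
    orders_placed: int
    revenue: float

orders = [
    Order(1, 42, "shipped", 30.0),
    Order(2, 42, "placed", 20.0),
]

# A trivial derivation from the operational view to the analytical view.
summary = DailyOrderSummary(
    day=date(2024, 1, 1),
    orders_placed=len(orders),
    revenue=sum(o.total for o in orders),
)
print(summary.revenue)  # 50.0
```

The operational record answers "what is the state of order 1 right now?"; the analytical record answers "what happened across all orders on this day?".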
The current state of
technology, architecture and organization design is reflective of the
divergence of these two data planes - two levels of existence, integrated yet
separate. This divergence has led to a fragile architecture. Continuously
failing ETL (Extract, Transform, Load) jobs and the ever-growing complexity of
labyrinthine data pipelines are a familiar sight to many who attempt to connect
these two planes, flowing data from the operational plane to the analytical
plane, and back to the operational plane.
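A minimal sketch of such an ETL job, in Python, may help illustrate the fragility; all names and schemas here are illustrative assumptions, and real pipelines add scheduling, retries, and schema checks.

```python
# Extract, Transform, Load: moving data from the operational plane
# to the analytical plane (names and schemas are illustrative).

def extract(operational_db: list) -> list:
    # Extract: read completed rows from an operational store.
    return [row for row in operational_db if row.get("status") == "completed"]

def transform(rows: list) -> dict:
    # Transform: aggregate transactional rows into an analytical view.
    return {"completed_orders": len(rows),
            "revenue": sum(r["amount"] for r in rows)}

def load(analytical_store: dict, view: dict) -> None:
    # Load: write the derived view into the analytical plane.
    analytical_store["daily_summary"] = view

operational_db = [
    {"order_id": 1, "status": "completed", "amount": 30.0},
    {"order_id": 2, "status": "pending", "amount": 20.0},
]
analytical_store = {}
load(analytical_store, transform(extract(operational_db)))
print(analytical_store["daily_summary"])
# {'completed_orders': 1, 'revenue': 30.0}
```

The fragility the article describes shows up even in this toy: any upstream change in the operational schema (say, renaming `amount`) silently breaks the centralized pipeline, and the team that owns the pipeline is not the team that made the change.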
Analytical data plane
itself has diverged into two main architectures and technology stacks: data lake and data warehouse; with data lake supporting data science access
patterns, and data warehouse supporting analytical and business intelligence
reporting access patterns. For this conversation, I put aside the dance between
the two technology stacks: data warehouse attempting to onboard
data science workflows and data lake attempting to serve data analysts and business intelligence. The original writeup on
data mesh explores the challenges of the existing analytical data plane
architecture.
Core principles and logical architecture of data mesh
The objective of data mesh is to create a foundation for getting value from analytical data and historical facts at scale, where scale applies to: constant change of the data landscape, proliferation of both sources and consumers of data, diversity of the transformation and processing that use cases require, and speed of response to change. To achieve this objective, I suggest that any data mesh implementation embodies four underpinning principles to achieve the promise of scale, while delivering the quality and integrity guarantees needed to make data usable: 1) domain-oriented decentralized data ownership and architecture, 2) data as a product, 3) self-serve data infrastructure as a platform, and 4) federated computational governance.
While I expect the practices, technologies and implementations of these principles vary and mature over time, these principles remain unchanged.
I have intended for the four principles to be collectively necessary and sufficient: to enable scale with resiliency while addressing concerns around siloing of incompatible data or increased cost of operation. Let's dive into each principle and then design the conceptual architecture that supports it.
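As a thought experiment, the four principles can be mapped onto the shape of a single domain-owned data product. This Python sketch is a hypothetical illustration only; the field names and values are my assumptions, not part of any data mesh specification.

```python
from dataclasses import dataclass, field

# A hypothetical, self-describing data product artifact owned by a
# domain team. Each field loosely corresponds to one of the four
# principles (mapping and names are illustrative assumptions).
@dataclass
class DataProduct:
    name: str                 # addressable, discoverable: data as a product
    owner_domain: str         # domain-oriented decentralized ownership
    output_ports: list        # served via self-serve platform capabilities
    quality_guarantees: dict  # product-grade SLOs, e.g. freshness
    governance_policies: list = field(default_factory=list)  # federated,
                              # computationally enforced governance

orders_product = DataProduct(
    name="orders.daily-summary",
    owner_domain="order-management",
    output_ports=["parquet://example/orders/daily"],  # illustrative URI
    quality_guarantees={"freshness": "24h", "completeness": 0.99},
    governance_policies=["pii-masked", "gdpr-retention"],
)
print(orders_product.owner_domain)  # order-management
```

The point of the sketch is the ownership boundary: the domain team that produces the data also declares its access ports, quality guarantees, and the global policies it complies with, rather than deferring all of that to a central data team.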