Chapter 3. Before The Inflection Point

Today’s problems come from yesterday’s “solutions.”

Senge, Peter M. The Fifth Discipline

Organizational complexity, growth of data sources, proliferation of data expectations. These are the forces that have put stress on our existing approaches to analytical data management. Our existing methods have made remarkable progress in scaling the machines: managing large volumes of a variety of data types with planet-scale distributed storage, reliably transmitting high-velocity data through streams, and processing data-intensive workloads concurrently and fast. However, these methods run into limits when it comes to organizational complexity and scale, the human scale.

In this chapter, I give a short introduction to the current landscape of analytical data architectures, their underlying characteristics, and the reasons why, moving into the future, they limit us.

Evolution of Analytical Data Architectures

How we manage analytical data has gone through evolutionary changes; changes driven by new consumption models, ranging from traditional analytics in support of business decisions to intelligent business functions augmented with ML. While we have seen an accelerated growth in the number of analytical data technologies, the high-level architecture has seen very few changes. Let’s take a quick look at the high-level analytical data architectures, followed by a review of their unchanged characteristics.

Note

The underlying technologies supporting each of the following architectural paradigms have gone through many iterations and improvements. The focus here is on the architectural pattern, not on the evolution of the technologies and implementations.

First Generation: Data Warehouse Architecture

Data warehousing architecture today is influenced by early concepts such as facts and dimensions, formulated in the 1960s. The architecture intends to flow data from operational systems to business intelligence systems that have traditionally served management with the operations and planning of an organization. While data warehousing solutions have greatly evolved, many of the original characteristics and assumptions of their architectural model remain the same:

  • Data is extracted from many operational databases and sources
  • Data is transformed into a universal schema, represented as a multidimensional and time-variant tabular format
  • Data is loaded into the warehouse tables
  • Data is accessed through SQL-like querying operations (see the sketch after this list)
  • Data mainly serves data analysts for their reporting and analytical visualization use cases
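
To make the dimensional model concrete, here is a minimal sketch of the kind of query a warehouse answers: a join of a fact table with its dimensions, followed by an aggregation. I express it with pandas purely for illustration (a warehouse would express the same shape in SQL), and the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical star schema: one fact table and two dimension tables.
plays = pd.DataFrame({          # fact table: one row per aggregated play record
    "date_key": [20240101, 20240101, 20240102],
    "artist_key": [1, 2, 1],
    "play_count": [120, 45, 98],
})
dim_date = pd.DataFrame({       # time dimension
    "date_key": [20240101, 20240102],
    "month": ["2024-01", "2024-01"],
})
dim_artist = pd.DataFrame({     # artist dimension
    "artist_key": [1, 2],
    "artist_name": ["Rainy Day", "Blue Note"],
})

# The warehouse-style query: join facts to dimensions, then aggregate
# along the time and artist axes (the equivalent of a GROUP BY in SQL).
monthly_plays = (
    plays
    .merge(dim_date, on="date_key")
    .merge(dim_artist, on="artist_key")
    .groupby(["month", "artist_name"], as_index=False)["play_count"]
    .sum()
)
print(monthly_plays)
```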

The data warehouse approach also appears in the form of data marts, with the usual distinction that a data mart serves a single department in an organization, while a data warehouse serves the larger organization, integrating data across multiple departments. Regardless of their scope, from an architectural modeling perspective both have similar characteristics.

In my experience, the majority of enterprise data warehouse solutions are proprietary, expensive, and require specialization to use. Over time, they grow to thousands of ETL jobs, tables, and reports that only a specialized group can understand and maintain. They don’t lend themselves to modern engineering practices such as CI/CD, and they accrue technical debt over time along with an increased cost of maintenance. Organizations attempting to escape this debt find themselves in an inescapable cycle of migrating from one data warehouse solution to another.

Figure 3-1. Analytical data architecture - warehouse

Second Generation: Data Lake Architecture

Data lake architecture was introduced in 2010 in response to the challenges data warehousing architecture faced in satisfying new uses of data: access to data for data science and machine learning model training workflows, and support for massively parallelized access to data. Like the data warehouse, data lake architecture assumes that data gets extracted from the operational systems and is loaded into a central repository, often an object store that can hold any type of data. Unlike data warehousing, however, the data lake assumes no or very little upfront transformation and modeling of the data; it attempts to retain the data close to its original form. Once the data becomes available in the lake, the architecture gets extended with elaborate transformation pipelines that model the higher-value data and store it in lakeshore marts.

This evolution of the data architecture aims to address the ineffectiveness and friction introduced by the extensive upfront modeling that data warehousing demands. The upfront transformation is a blocker and leads to slower iterations of model training. Additionally, it alters the nature of the operational system’s data and mutates it in a way that models trained with the transformed data fail to perform against real production queries.

In our example, a music recommender trained against transformed and modeled data in a warehouse fails to perform when invoked in an operational context, e.g., invoked by the recommender service with the logged-in user’s session information. The heavily transformed data used to train the model either misses some of the user’s signals or has created a different representation of the user’s attributes. The data lake comes to the rescue in this scenario.

Notable characteristics of a data lake architecture include:

  • Data is extracted from many operational databases and sources
  • Data is minimally transformed to fit the storage format, e.g., Parquet, Avro, etc.
  • Data, kept as close to the source syntax as possible, is loaded into scalable object storage
  • Data is accessed through the object storage interface, read as files or as data frames, a two-dimensional array-like structure (see the sketch after this list)
  • Lake storage is accessed mainly for analytical and machine learning model training use cases, and is used by data scientists
  • Downstream from the lake, lakeshore marts, fit-for-purpose data marts or data services, serve the modeled data
  • Lakeshore marts are used by applications and analytics use cases
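
As a minimal sketch of this flow, the snippet below keeps raw events close to their source form, writes them to Parquet, and reads them back as a data frame. It uses pandas with the pyarrow engine purely for illustration; the file path and field names are hypothetical, and a real lake would point at object storage paths such as s3:// or gs://.

```python
import json
import pandas as pd

# Raw events, kept close to the source syntax (no upfront modeling).
raw_events = [
    '{"user_id": "u1", "song_id": "s42", "ts": "2024-01-01T10:00:00Z"}',
    '{"user_id": "u2", "song_id": "s7",  "ts": "2024-01-01T10:01:30Z"}',
]
df = pd.DataFrame([json.loads(e) for e in raw_events])

# Minimal transformation: only fit the data to a columnar storage format.
# In a real lake this path would point at object storage rather than local disk.
df.to_parquet("play_events.parquet", engine="pyarrow", index=False)

# Consumers (data scientists, ML training jobs) read it back as files or data frames.
plays = pd.read_parquet("play_events.parquet")
print(plays.head())
```
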
Figure 3-2. Analytical data architecture - data lake

Data lake architecture suffers from complexity and deterioration: complex and unwieldy pipelines of batch or streaming jobs operated by a central team of hyper-specialized data engineers, and deteriorated, unmanaged datasets that are untrusted, inaccessible, and provide little value.

Third Generation: Multimodal Cloud Architecture

The third and current generation of data architectures is more or less similar to the previous generations, with a few modern twists:

  • Streaming for real-time data availability, with architectures such as Kappa
  • Attempting to unify batch and stream processing for data transformation with frameworks such as Apache Beam (see the sketch after this list)
  • Fully embracing cloud-based managed services, with modern cloud-native implementations that isolate compute and storage
  • Convergence of warehouse and lake, either by extending the data warehouse to include embedded ML training, e.g., Google BigQuery ML, or by building data warehouse integrity, transactionality, and querying capabilities into data lake solutions, e.g., Databricks Lakehouse
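
As a minimal sketch of the unified batch/stream idea, the Apache Beam pipeline below counts plays per artist in fixed event-time windows; the same transforms could run over a bounded file source or an unbounded streaming source by swapping only the read step. The file paths and field names are hypothetical.

```python
import json

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as pipeline:
    (
        pipeline
        # Bounded source for batch; swapping in a streaming source (e.g. Pub/Sub)
        # would leave the rest of the transforms unchanged.
        | "Read" >> beam.io.ReadFromText("play_events.jsonl")
        | "Parse" >> beam.Map(json.loads)
        | "Timestamp" >> beam.Map(
            lambda e: window.TimestampedValue((e["artist_id"], 1), e["epoch_seconds"])
        )
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows
        | "CountPerArtist" >> beam.CombinePerKey(sum)
        | "Write" >> beam.io.WriteToText("play_counts")
    )
```
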

The third-generation data platform addresses some of the gaps of the previous generations, such as real-time data analytics, and reduces the cost of managing big data infrastructure. However, it suffers from many of the underlying characteristics that led to the limitations of the previous generations.

Characteristics of Analytical Data Architecture

From this quick glance at the history of analytical data management architecture, it is apparent that the architecture has gone through evolutionary improvements. The technology and product landscape in support of data management has gone through a Cambrian explosion and continuous growth. The dizzying view of FirstMark’s1 annual landscape and “state of the union” of big data and AI is an indication of the sheer number of innovative solutions developed in this space.

Figure 3-4. The Cambrian explosion of big data and AI tooling; it’s not intended to be read, just glanced at to feel dizzy. Courtesy of FirstMark

So the question is, what hasn’t changed? What are the underlying characteristics that all generations of analytical data architecture carry? Despite the undeniable innovation, there are fundamental assumptions that have remained unchallenged for the last few decades and must be closely evaluated:

  • Data must be centralized to be useful - managed by a centralized organization, with an intention to have an enterprise-wide taxonomy.

  • Data management architecture, technology and organization are monolithic.

  • The enabling technologies dictate the paradigm - architecture and organization.

Note

The architectural characteristics discussed in this chapter, including centralization, apply only to the logical architecture. Physical architecture concerns, such as where the data is physically stored and whether it is physically collocated or not, are out of scope for our conversation and are independent of the logical architecture concerns. The logical architecture focuses on the experience layer of data developers and consumers: whether data is managed by a single team or not (data ownership), whether data has a single schema or not (data modeling), and whether a change to one data model tightly couples to and impacts downstream users (dependencies).

Let’s look a bit more closely at each of these underlying assumptions and the limitations each imposes.

Monolithic

Architecture styles can be classified into two main types: monolithic (single deployment unit of all code) and distributed (multiple deployment units connected through remote access protocols)

Fundamentals of Software Architecture

Monolithic Architecture

At 30,000 feet the data platform architecture looks like Figure 3-5 below; a monolithic architecture whose goal is to:

  • Ingest data from all corners of the enterprise and beyond, ranging from operational and transactional systems and domains that run the business, to external data providers that augment the knowledge of the enterprise. For example, in the case of Daff Inc., the data platform is responsible for ingesting a large variety of data: the ‘media player performance’, how ‘users interact with the players’, ‘songs they play’, ‘artists they follow’, the ‘labels’ and ‘artists’ that the business has onboarded, the ‘financial transactions’ with the artists, and external market research data such as ‘customer demographic’ information.

  • Cleanse, enrich, and transform the source data into trustworthy data that can address the needs of a diverse set of consumers. In our example, one of the transformations turns the ‘user click stream’ into ‘meaningful user journeys’, enriched with details of the user. This attempts to reconstruct the journey and behavior of the user into an aggregate longitudinal view (a minimal sketch of such a transformation follows this list).

  • Serve the datasets to a variety of consumers with a diverse set of needs, ranging from data exploration and machine learning training to business intelligence reports. In the case of Daff Inc., the platform must serve the ‘media player’s near real-time errors’ through a distributed log interface and, at the same time, serve the batched aggregate view of a particular ‘artist played record’ to calculate the monthly financial payments.
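
To illustrate the kind of transformation the platform takes on, here is a minimal sketch of turning a raw click stream into user sessions, a crude stand-in for ‘meaningful user journeys’. It uses pandas purely for illustration; the column names and the 30-minute session gap are hypothetical.

```python
import pandas as pd

clicks = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "ts": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:05",  # same session for u1
        "2024-01-01 12:00",                      # new session after a long gap
        "2024-01-01 10:02",
    ]),
    "event": ["open_player", "play_song", "play_song", "play_song"],
})

clicks = clicks.sort_values(["user_id", "ts"])
gap = clicks.groupby("user_id")["ts"].diff()                 # time since previous click
new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))  # session boundary
clicks["session_id"] = new_session.groupby(clicks["user_id"]).cumsum()

# Aggregate each session into a journey: ordered list of events plus its time span.
journeys = clicks.groupby(["user_id", "session_id"]).agg(
    events=("event", list),
    started=("ts", "min"),
    ended=("ts", "max"),
)
print(journeys)
```
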

Figure 3-5. The 30,000 ft view of the monolithic data platform

While a monolithic architecture can be a good and simple starting point for building a solution, e.g., managing one code base with one team, it falls short as the solution scales. The drivers we discussed in Chapter 1, organizational complexity and the proliferation of sources and use cases, create tension and friction on the architecture and organizational structure:

  • Ubiquitous data and source proliferation: As more data becomes ubiquitously available, the ability to consume it all and harmonize it in one place, logically, under the control of a centralized platform and team, diminishes. Imagine the domain of ‘customer information’. There is an increasing number of sources inside and outside the boundaries of the organization that provide information about existing and potential customers. The assumption that we need to ingest and harmonize the data under a central customer master data management system to get value creates a bottleneck and slows down our ability to take advantage of diverse data sources. The organization’s response to making data available from new sources slows down as the number of sources increases.

  • Organizations’ innovation agenda and consumer proliferation: Organizations’ need for rapid experimentation introduces a larger number of use cases that consume the data from the platform. This implies an ever-growing number of transformations to create data: aggregates, projections, and slices that can satisfy the test-and-learn cycle of innovation. The long response time to satisfy data consumer needs has historically been a point of organizational friction and remains so in the modern data platform architecture. The disconnect between the people and systems that need the data and understand the use case, and the source teams and systems that originated the data and are most knowledgeable about it, impedes the company’s data-driven innovation. It lengthens the time needed to access the right data and becomes a blocker for hypothesis-driven development.

  • Organizational complexity: Add a volatile, continuously shifting and changing data landscape of sources and consumers to the mix, and a monolithic approach to data management becomes a synchronization and prioritization hell. Aligning the priorities and activities of continuously changing data sources and consumers with the capabilities and priorities of the monolithic solution, itself isolated from those sources and consumers, is a no-win situation.

Monolithic Technology

From the technology perspective, the monolithic architecture has been in harmonious accordance with its enabling technology; technologies supporting data lake or data warehouse architecture, by default, assume a monolithic architecture. For example, data warehousing technologies such as Snowflake, Google BigQuery, or Synapse all have a monolithic logical architecture, that is, the architecture as seen by developers and users. While at the physical and implementation layer there has been immense progress in resource isolation and decomposition, for example Snowflake separates compute resource scaling from storage resources and BigQuery uses the latest generation of distributed file system, the user experience of the technology remains monolithic.

Data lake technologies such as object storage and pipeline orchestration tools can be deployed in a distributed fashion. However, by default they lead to the creation of monolithic lake architectures. For example, data processing pipeline DAG definitions and orchestration lack constructs such as interfaces and contracts that abstract pipeline jobs’ dependencies and complexity. This leads to a big ball of mud monolithic architecture with tightly coupled, labyrinthine pipelines, where it is difficult to isolate a change or failure to one step in the process; the sketch below illustrates the coupling. Some cloud providers also have limitations on the number of lake storage accounts, having assumed that there will only be a small number of monolithic lake setups.
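
The sketch below illustrates this coupling, assuming Apache Airflow as the orchestrator; the task names and scripts are hypothetical. Every task implicitly depends on the file layout and schema produced by the previous one, with no explicit interface or contract between the steps.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="central_lake_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
) as dag:
    ingest = BashOperator(task_id="ingest_play_events", bash_command="python ingest.py")
    cleanse = BashOperator(task_id="cleanse_play_events", bash_command="python cleanse.py")
    enrich = BashOperator(task_id="enrich_with_user_profiles", bash_command="python enrich.py")
    publish = BashOperator(task_id="publish_to_lakeshore_mart", bash_command="python publish.py")

    # The only 'contract' between steps is their position in the chain; a schema
    # change in ingest.py silently breaks every downstream task.
    ingest >> cleanse >> enrich >> publish
```
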

Monolithic Organization

From the organizational perspective, Conway’s Law has been at work and in full swing with monolithic organizational structures - business intelligence team, data analytics group, or data platform team - responsible for the monolithic platform, its data and infrastructure.

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.

Melvin Conway, 1968

When we zoom in close enough to observe the life of the people who build and operate a data platform, what we find is a group of hyper-specialized data engineers siloed from the operational units of the organization, where the data originates or where it is used. The data engineers are not only siloed organizationally but also separated and grouped into a team based on their technical expertise with data tooling, often absent of business and domain knowledge.

Figure 3-6. Siloed hyper-specialized data team

I personally don’t envy the life of a data engineer. They need to consume data from operational teams that have no incentive to provide meaningful, truthful, and correct data based on an agreed-upon contract. Given the data team’s organizational silo, data engineers have very little understanding of the source domains that generate the data, and they lack the domain expertise in their teams. They need to provide data for a diverse set of needs, operational or analytical, without a clear understanding of the application of the data and without access to the consuming domain’s experts.

For example, at Daff Inc., on the source side we have a cross-functional ‘media player’ team that provides signals of how users interact with media player features, e.g., ‘play song events’, ‘purchase events’, and ‘play audio quality’; and on the other end sit cross-functional consumer teams such as the ‘song recommendation’ team, the ‘sales’ team reporting sales KPIs, the ‘artists payment’ team that calculates and pays artists based on play events, and so on. Sadly, in the middle sits the data team that through sheer effort provides analytical data on behalf of all sources and for all consumers.

In reality what we find are disconnected source teams, frustrated consumers fighting for a spot on top of the data team’s backlog, and an overstretched data team.

The complicated monolith

Monolithic architectures, when they meet scale (here, scale in the diversity of sources, consumers, and transformations), all face a similar destiny: becoming complex and difficult-to-manage systems.

The complexity debt of the sprawling data pipelines, the duct-taped scripts implementing the ingestion and transformation logic, the large number of datasets (tables or files) with no clear architectural and organizational modularity, and the thousands of reports built on top of those datasets keeps the team busy paying interest on the debt instead of creating value.

In short, a monolithic architecture, technology and organizational structure is not suitable for analytical data management of large scale and complex organizations.

Centralized

It’s an accepted convention that the monolithic data platform hosts and owns the data that belongs to different domains, e.g., ‘play events’, ‘sales KPIs’, ‘artists’, ‘albums’, ‘labels’, ‘audio’, ‘podcasts’, ‘music events’, etc.; data collected from a large number of disparate domains.

While over the last decade we have successfully applied domain-driven design and bounded contexts to the design of our operational systems to manage complexity at scale, we have largely disregarded the domain-driven design paradigm in data platforms. DDD’s strategic design introduces a set of principles to manage modeling at scale in a large and complex organization. It encourages moving away from a single canonical model to many bounded contexts’ models. It defines separate models, each owned and managed by a unit of the organization, and it explicitly articulates the relationships between the models; the sketch below illustrates the idea.
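
As a minimal sketch of what “many bounded contexts’ models” can look like in code, the two hypothetical models below represent the same listener differently in the player domain and in the payments domain, each owned by its own team, instead of forcing one canonical ‘customer’ model; the explicit listener_id reference stands in for an articulated relationship between the contexts.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PlayerListener:
    """Listener as modeled by the hypothetical 'media player' bounded context."""
    listener_id: str
    preferred_audio_quality: str
    last_played_song_id: str


@dataclass
class PayoutAccount:
    """The same person as modeled by the hypothetical 'artist payments' bounded context."""
    account_id: str
    listener_id: str          # explicit relationship to the player context's model
    billing_country: str
    last_invoiced_at: datetime
```
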

While operational systems have applied DDD’s strategic design techniques toward domain-oriented data ownership, aligning the services and their data with existing business domains, analytical data systems have maintained a centralized data ownership outside of the domains.

Figure 3-7. Centralization of data with no clear data domain boundaries and domain-oriented ownership of data

While this centralized model can work for organizations that have a simpler domain with a smaller number of consumption cases, it fails for enterprises with rich and complex domains.

In addition to limitations of scale, other challenges of data centralization include providing quality data that is resilient to change: data that reflects the facts of the business as closely as possible, with integrity. The reason for this is that the business domains and teams who are most familiar with the data, and who are best positioned to provide quality data right at the source, are not responsible for data quality. The central data team, far from the source of the data and isolated from the domains of the data, is tasked with building quality back into the data through data cleansing and enriching pipelines. Often, the data that pops out of the other end of the pipelines into the central system loses its original form and meaning.

Centralization of the analytical data has been our industry’s response to the siloed and fragmented data, commonly known as Dark Data. Coined by Gartner, Dark Data refers to the information assets organizations collect, process and store during regular business activities, but generally fail to use for analytical or other purposes.

Technology driven

Looking back at the different generations of analytical data management architectures, from warehouse to lake and all on the cloud, we have heavily leaned on a technology-driven architecture. A typical solution architecture of a data management system merely wires various technologies together, each performing a technical function, a piece of an end-to-end flow. This is evident from a glance at any cloud provider’s modern solution architecture diagram, like the one below. The core technologies listed are powerful and helpful in enabling a data platform. However, the proposed solution architecture decomposes and then integrates the components of the architecture based on their technical function and the technology supporting that function. For example, first we encounter the ingestion function supported by Cloud Pub/Sub, then data is published to Cloud Storage, which then serves data through BigQuery. This approach leads to a technically-partitioned architecture and, consequently, an activity-oriented team decomposition.

Figure 3-8. Modern analytical solutions architecture biased toward a technology-driven decomposition - example from GCP https://cloud.google.com/solutions/build-a-data-lake-on-gcp
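
To make the technology-by-technology wiring tangible, here is a minimal sketch of that flow using the Google Cloud Python clients; the project, subscription, bucket, and table names are hypothetical, and error handling is omitted.

```python
from google.cloud import bigquery, pubsub_v1, storage

# 1. Ingestion function: pull raw events from Cloud Pub/Sub.
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("hypothetical-project", "play-events-sub")

# 2. Storage function: land each raw message in a Cloud Storage "lake" bucket.
bucket = storage.Client().bucket("hypothetical-lake-bucket")

def land_message(message):
    bucket.blob(f"raw/play_events/{message.message_id}.json").upload_from_string(message.data)
    message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=land_message)

# 3. Serving function: periodically load the landed files into BigQuery for querying.
bq = bigquery.Client()
load_job = bq.load_table_from_uri(
    "gs://hypothetical-lake-bucket/raw/play_events/*.json",
    "hypothetical-project.analytics.play_events",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to complete
```

Each hop is decomposed by technical function (ingest, store, serve) rather than by business domain, which is exactly the partitioning the following sections critique.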

Technically-Partitioned Architecture

One of the limitations of data management solutions today comes down to how we have attempted to manage their unwieldy complexity: how we have decomposed an ever-growing monolithic data platform and team into smaller partitions. We have chosen the path of least resistance, a technical partitioning of the high-level architecture.

Architects and technical leaders in organizations decompose an architecture in response to its growth. The need to onboard numerous new sources, or to respond to the proliferation of new consumers, requires the platform to grow. Architects need to find a way to scale the system by breaking it into its top-level components.

Top-level technical partitioning, as defined by Fundamentals of Software Architecture, decomposes the system into its components based on their technical capabilities and concerns; it’s a decomposition that is closer to the implementation concerns than the business domain concerns. Architects and leaders of monolithic data platforms have decomposed the monolithic solutions, following a pipeline architecture, into technical functions such as ingestion, cleansing, aggregation, enrichment, and serving. This top-level functional decomposition leads to synchronization overhead and a slow response to data changes, such as updating and creating new sources or use cases. An alternative approach is a domain-oriented top-level partitioning, where these technical functions are embedded in the domain, and a change to the domain can be managed locally without top-level synchronization.

Figure 3-9. Top-level technical partitioning of monolithic data platform

Activity-oriented Team Decomposition

The motivation behind breaking a system down into its architectural components is to create independent teams, each of which can build and operate an architectural component. These teams in turn can parallelize work to reach higher operational scalability and velocity. The consequence of top-level technical decomposition is decomposing teams into activity-oriented groups, each focused on a particular activity required by a stage of the pipeline: for example, a team focusing on the ingestion of data from various sources, or a team responsible for serving the lakeshore marts. Each team attempts to optimize its own activity, for example, finding common patterns of ingestion.

Though this model provides some level of scale, by assigning teams to different activities of the flow, it has an inherent limitation: it does not scale what matters, the delivery of outcome, in this case the delivery of new, quality, and trustworthy data. Delivering an outcome demands synchronization between teams and aligning changes across the activities. Such a decomposition is orthogonal to the axis of change, or outcome; it slows down the delivery of value and introduces organizational friction.

Conversely, an outcome-oriented team decomposition is optimized for achieving an end-to-end outcome fast, with low synchronization overhead.

Let’s look at an example. Daff Inc. started its services with ‘songs’ and ‘albums’, and then extended to ‘music events’, ‘podcasts’, and ‘radio shows’. Enabling a single new feature, such as visibility into the ‘podcast play rate’, requires a change in all components of the pipeline. Teams must introduce new ingestion services, new cleansing and preparation, as well as new served aggregates for viewing podcast play rates. This requires synchronization across the implementation of different components and release management across teams. Many data platforms provide generic, configuration-based ingestion services that can cope with extensions, such as adding new sources or modifying existing sources, to minimize the overhead of introducing new sources. However, this does not remove the end-to-end dependency management of introducing new datasets from the consumer’s point of view. The smallest unit that must change to cater for a new functionality, unlocking a new dataset and making it available for new or existing consumption, remains the whole pipeline, the monolith. This limits our ability to achieve higher velocity and scale in response to new consumers or sources of the data.

Figure 3-10. Architecture decomposition is orthogonal to the axis of change (outcome) when introducing or enhancing features, leading to coupling and slower delivery

We have created an architecture and organization structure that does not scale and does not deliver the promised value of creating a data-driven organization.

Recap

The definition of insanity is doing the same thing over and over again, but expecting different results.

Albert Einstein

You made it, walking with me through the evolution of analytical data management architecture. We looked at the current state of the two-plane division between operational data and analytical data, and their fragile ETL-based integration model. We dug deeper into the limitations of analytical data management: limitations to scale, organizational scale in the expansion of ubiquitous data, scale in the diversity of usage patterns, scale in the dynamic topology of data, and the need for a rapid response to change. We looked critically into the root causes of these limitations.

The angle we explored was architecture and its impact on the organization. We explored the evolution of analytical data architectures from data warehousing and data lake to the multimodal warehouse and lake on the cloud. While acknowledging the evolutionary improvement of each architecture, we challenged some of the fundamental characteristics that all these architectures share: monolithic, centralized, and technology driven. These characteristics are driven by an age-old assumption that to satisfy the analytical use cases, data must be extracted from domains and consolidated and integrated under the central repositories of a warehouse or a lake. This assumption was valid when the use cases of data were limited to low-frequency reports; it was valid when data was being sourced from a handful of systems. It is no longer valid when data is sourced from hundreds of microservices and millions of devices, from within and outside the enterprise. It is no longer valid when tomorrow’s use cases for data are beyond our imagination today.

We made it to the end of Part I. With an understanding of the current landscape and expectations of the future, let’s move to Part II and unpack what Data Mesh is based on its core principles.

1 An early-stage venture capital firm in New York City
