Chapter 1. The Inflection Point

A strategic inflection point is a time in the life of a business when its fundamentals are about to change. That change can mean an opportunity to rise to new heights. But it may just as likely signal the beginning of the end.

Andrew S. Grove, CEO of Intel Corporation

Data Mesh is what comes after an inflection point, shifting our approach, attitude, and technology toward data. Mathematically, an inflection point is a magic moment at which a curve stops bending one way and starts curving in the other direction. It’s a point at which the old picture dissolves, giving way to a new one.

This won’t be the first or the last inflection point in the evolution of data management, but it is the one that is most relevant now. There are drivers and empirical signals that point us in a new direction. I personally found myself at this turning point in 2018, when many of our clients at ThoughtWorks, a global technology consultancy, were simultaneously seeking a new data architecture that could respond to the scale, complexity, and aspirations of their businesses. After reading this chapter, I hope that you too arrive at this critical point, where you feel the urge for change, to wash away some of the fundamental assumptions made about data and imagine something new.

Figure 1-1 is a simplified illustration of the inflection point in question. The x-axis represents the macro drivers that have pushed us to this inflection point: ever-increasing business complexity combined with uncertainty, a proliferation of data expectations and use cases, and the availability of data from ubiquitous sources. On the y-axis we see the impact of these drivers on business agility, the ability to get value from data, and resilience to change. At the center is the inflection point, where we have a choice to make: continue with our existing approach and, at best, reach a plateau of impact; or take the Data Mesh approach, with the promise of reaching new heights in the agility of acting on data, immunity to rapid change, and the ability to get value from data at a larger scale. Part II of this book goes through the details of what the Data Mesh approach entails.

Figure 1-1. The inflection point of the approach to data management

In this chapter, I share today’s data landscape realities that are the main drivers for Data Mesh.

Great Expectations of Data

One of the perks of being a technology consultant is traveling through many industries and companies, and getting to know their deepest desires and challenges. Through this journey, one thing is evident: being a data-driven organization remains one of the top strategic goals of executives.

Here are a few examples, all truly inspiring:

Our mission at Intuit is to power prosperity around the world as an AI-driven expert platform company, by addressing the most pressing financial challenges facing our consumer, small business and self-employed customers.

Financial SaaS Company

Our mission is to improve every single member’s experience at every single touchpoint with our organization through data and AI.

Healthcare provider and payer company

By People, For People: We incorporate human oversight into AI. With people at the core, AI can enhance the workforce, expand capability and benefit society as a whole.

Telco

No matter the industry or the company, the message is loud and clear: we want to become intelligently empowered1 to:

  • provide the best customer experience based on data and hyper-personalization
  • reduce operational costs and time through data-driven optimizations
  • empower employees to make better decisions with trend analysis and business intelligence

All of these scenarios require data--a high volume of diverse, up-to-date, and truthful data that can, in turn, fuel the underlying analytics and machine learning models.

A decade ago, many companies’ data aspirations were mainly limited to business intelligence (BI). They wanted the ability to generate reports and dashboards to manage operational risk, respond to compliance requirements, and ultimately make business decisions based on facts, on a slower cadence. In addition to BI, classical statistical learning has been used in pockets of business operations in industries such as insurance, healthcare, and finance. These early use cases, delivered by highly specialized teams, have been the most influential drivers of many past data management approaches.

Today, data aspirations have evolved beyond business intelligence to every aspect of an organization: using machine learning in the design of products, such as automated assistants; in the design of services and customer experiences, such as personalized healthcare; and in streamlining operations, such as real-time logistics optimization. Not only that, the expectation is to democratize data, so that the majority of the workforce can put data into action.

Meeting these expectations requires a new approach to data management: an approach that can seamlessly fulfill the diversity of modes of access to data. Access ranges from simple structured views of the data for reporting, to continuously reshaped semi-structured data for machine learning training; from real-time, fine-grained access to individual events, to aggregations. We need to meet these expectations with an approach and architecture that natively supports diverse use cases and does not require copying data from one technology stack to another across the organization just to meet the needs of yet another use case.

More importantly, the widespread use of machine learning requires a new attitude toward application development and data. We need to move from deterministic, rule-based applications - where, given a specific input, the output can be determined - to nondeterministic, probabilistic, data-driven applications - where, given a specific input, the output is a range of possibilities that can change over time. This approach to application development requires continuous refinement of the model over time, and continuous, frictionless access to the latest data.
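To make the contrast concrete, here is a minimal sketch in Python. The function names and thresholds are hypothetical, and the probabilistic version assumes any trained classifier exposing a predict_proba-style method, such as a scikit-learn estimator retrained on fresh data:

    def is_eligible_rule_based(age: int, income: float) -> bool:
        # Deterministic: the same input always yields the same output.
        return age >= 21 and income >= 40_000

    def eligibility_score_model_based(model, features: list) -> float:
        # Probabilistic: the output is a likelihood that shifts as the model
        # is retrained on the latest data, so the model must be continuously
        # refreshed with frictionless access to new data.
        return model.predict_proba([features])[0][1]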

The great and diverse expectations of data require us to step back, acknowledge the accidental technical complexities that we have created over time, and wonder if there is a simpler approach to data management that can universally address the diversity of needs today, and beyond.

The Great Divide of Data

Many of the technical complexities organizations face today stem from how we have divided the data into operational and analytical planes, siloed the teams that manage them, proliferated the technology stacks that support them, and integrated them.

Today, we divide data and its supporting technology stacks and architectures into two major categories. Operational data lives in the databases that support running the business and keep its current state; it is also known as transactional data. Analytical data lives in a data warehouse or lake, providing a historical, integrated, and aggregated view of data created as a byproduct of running the business. Today, operational data is collected and transformed to form the analytical data. Analytical data trains the machine learning models that then make their way back into the operational systems as intelligent services.

Figure 1-2. The two planes of data

Operational Data

Operational data sits in the databases of microservices, applications, or systems of record that support the business capabilities. Operational data keeps the current state of the business. It is optimized for the application’s or microservice’s logic and access patterns, and it often has a transactional nature. It’s referred to as data on the inside: the private data of an application or a microservice that performs CRUD (create, read, update, delete) operations on it. Operational data is constantly updated, so its access requires both reads and writes. The design has to account for multiple people updating the same data at the same time in unpredictable sequences (hence the need for transactions). Access is also about relatively in-the-moment activity. Operational data records what happens in the business, supporting decisions that are specific to the business transaction. In short, operational data is used to run the business and serve its users.

Imagine Daff Inc., a digital media streaming business that streams music, podcasts, and other digital content to its subscribers and listeners. A registration service implements the business function of registering new users or unregistering them. The database that supports the registration and deregistration process, keeping the list of current users, holds operational data.
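As a minimal sketch of what this operational access could look like, assuming a hypothetical users table in a relational database (the names daff.db, register_user, and unregister_user are illustrative, not from the text):

    import sqlite3

    conn = sqlite3.connect("daff.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)"
    )

    def register_user(email: str) -> None:
        # Transactional write: the table keeps only the current state of the business.
        with conn:  # commits on success, rolls back on failure
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))

    def unregister_user(email: str) -> None:
        # Deleting the row erases it entirely; operational stores are mutable.
        with conn:
            conn.execute("DELETE FROM users WHERE email = ?", (email,))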

Analytical Data

Analytical data is the temporal, historic, and often aggregated view of the facts of the business over time. It is modeled to provide retrospective or future-perspective insights. Analytical data is optimized for analytical logic: training machine learning models and creating reports and visualizations. It is called data on the outside: data directly accessed by analytical consumers. Analytical data is immutable and has a sense of history; analytical use cases look for comparisons and trends over time, while many operational uses don’t require much history. The original definition of analytical data as a nonvolatile, integrated, time-variant collection of data2 still remains valid.

In short, analytical data is used to optimize the business and user experience. This is the data that fuels the AI and analytics aspirations that we talked about in the previous section.

For example, in the case of Daff Inc., it’s important to optimize the listeners’ experience with playlists recommended based on their music taste and favorite artists. The analytical data that helps train the playlist recommendation machine learning model captures the listener’s past behavior as well as the characteristics of the music the listener has favored. This aggregated and historical view is analytical data.
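A minimal sketch of this kind of analytical access, assuming a hypothetical play_events dataset with one immutable row per play (the table and column names are illustrative):

    import pandas as pd

    # Immutable, append-only history of plays: listener_id, genre, played_at, ...
    events = pd.read_parquet("play_events.parquet")

    # Aggregate each listener's full history into per-genre play counts,
    # the kind of feature a playlist-recommendation model might train on.
    listener_genre_profile = (
        events.groupby(["listener_id", "genre"])
              .size()
              .rename("play_count")
              .reset_index()
    )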

Over time, the analytical data plane itself has diverged into two generations of architecture and technology stacks: first the data warehouse, followed by the data lake. The data lake supports data science access patterns and preserves data in its original form, while the data warehouse supports analytical and business intelligence reporting access patterns, with data conforming to a centrally unified ontology. For this conversation, I put aside the dance between the two technology stacks: the data warehouse attempting to onboard data science workflows, and the data lake attempting to serve data analysts and business intelligence.

Analytical and Operational Data Misintegration

The current state of technology, architecture, and organization design reflects the divergence of the analytical and operational data planes: two levels of existence, integrated yet separate. Each plane operates under a different organizational vertical. Business intelligence, data analytics, and data science teams, under the leadership of the Chief Data and Analytics Officer (CDAO), manage the analytical data plane, while business units and their corresponding technology domains manage the operational data. From the technology perspective, two independent technology stacks have grown to serve each plane, though there are some points of convergence, such as infinite event logs.

This divergence has led to a two-plane data topology with a fragile integration architecture between the planes. The operational data plane feeds the analytical data plane through a set of scripts or automated processes often referred to as ETL jobs: Extract, Transform, and Load. Operational databases often have no explicitly defined contract with the ETL pipelines for sharing their data. This leads to fragile ETL jobs, where unanticipated upstream changes to the operational systems and their data lead to downstream pipeline failures. Over time, the ETL pipelines grow in complexity as they try to provide various transformations over the operational data, flowing data from the operational plane to the analytical plane, and back to the operational plane.
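A minimal sketch of such a brittle job, with hypothetical table and column names. Note that there is no contract with the upstream database: if the operational team renames or drops signup_date, the job fails only at run time, downstream:

    import sqlite3
    import pandas as pd

    def nightly_user_etl() -> None:
        # Extract: reach directly into the application's private tables.
        source = sqlite3.connect("daff.db")  # operational database, no schema contract
        users = pd.read_sql("SELECT id, email, signup_date FROM users", source)
        # Transform: reshape the data toward the warehouse's unified model.
        users["signup_month"] = pd.to_datetime(users["signup_date"]).dt.strftime("%Y-%m")
        # Load: append into the centralized analytical store.
        warehouse = sqlite3.connect("warehouse.db")
        users.to_sql("dim_users", warehouse, if_exists="append", index=False)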

Figure 1-3. Pipeline-based integration of the data planes

The challenges of the two-plane data management approach - brittle integration through pipelines, and a centralized data warehouse or lake as the point of access to data - are a major driver for reimagining future solutions.

Scale, Encounter of a New Kind

Since the mid-2000s, we have evolved our technologies to deal with the scale of data in terms of its volume, velocity, and variety. We built first-generation batch data processing to manage the large volume of data that our applications and touchpoints generated; we built stream processing architectures to handle the speed of data that started flowing from our mobile devices; and we built different types of storage systems to manage the diversity of data: text, images, voice, graphs, files, and more. Then we got carried away and kept tagging more Vs onto data, to encourage access to clean data - veracity - and to aim to get value3 from data.

Today, we are encountering a new kind of scale: the origins and locations of the data. Data-driven solutions often require access to data beyond a business domain, organizational, or technical boundary. The data can originate from every system that runs the business, from every touchpoint with customers, and from other organizations. The next approach to data management needs to recognize the proliferation of the origins of data, and their ubiquitous nature.

The most interesting and unexpected patterns emerge when we connect data from a variety of sources, when we can access information beyond the transactional data that we generate running our business. The future of intelligent healthcare requires a longitudinal record of a patient’s diagnostics, pharmaceutical records, personal habits, and more, in comparison with the histories of all other patients. These sources are beyond a single organization’s control. The future of intelligent banking requires data beyond the financial transactions that customers perform with their banks. Banks will need to know customers’ housing needs, the housing market, their shopping habits, and their dreams, to offer them the services they need, when they need them.

This unprecedented scale and diversity of sources requires a shift in data management: a shift away from collecting data from sources into one big centralized place, repeated across every single organization, toward connecting data, wherever it is.

Beyond Order

I’m writing this book during the pandemic of 2020-2021. If there was any doubt that our organizations need to navigate complexity, uncertainty, and volatility, the pandemic has made that abundantly clear. Even on a good day outside of the pandemic, the complexity of our organizations demands a new kind of immunity: immunity to change.

The complexity that has arisen from the ever-changing landscape of a business is also reflected in its data. Rapid delivery of new product features, new and changed offerings and business functions, new touchpoints, new partnerships, and new acquisitions all result in a continuous reshaping of the data.

More than ever now, organizations need to have the pulse of their data and the ability to act quickly and respond to change with agility.

What does this mean for the approach to data management? It requires access to quality, trustworthy facts of the business at the time they happen. Data platforms must close the distance - in time and space - between when an event happens and when it gets consumed and processed for analysis. Analytics solutions must guide real-time decision making. Rapid response to change is no longer a premature optimization4 of the business; it’s a baseline capability.

Data management of the future must build in change, by default. Rigid data modeling and querying languages that expect to put the system in a straitjacket of a never-changing schema can only result in a fragile and unusable analytics system.
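As a small illustration of building in change, consider schema-on-read handling of semi-structured events; the field names here are hypothetical. A pipeline that hard-codes an exact schema breaks the moment the business adds podcast_id alongside track_id, while tolerant code absorbs the change:

    import json

    def content_id(event: dict):
        # Accept either shape instead of failing on a schema the code didn't anticipate.
        return event.get("track_id") or event.get("podcast_id")

    event = json.loads('{"listener_id": 42, "podcast_id": "p-99"}')
    print(content_id(event))  # "p-99": the newly added field is handled gracefully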

Data management of the future must embrace the complex nature of today’s organizations and allow for autonomy of teams with peer-to-peer data collaborations.

Today, complexity has stretched beyond processes and products to the technology platforms themselves. In any organization, solutions span multiple cloud and on-premises platforms. The data management of the future must support managing and accessing data across multiple cloud providers and on-premises data centers, by default.

Approaching the Plateau of Return

In addition to the seismic shifts listed above, there are other telling tales of the mismatch between data and AI investment and its results. To get a glimpse of this, I suggest browsing the NewVantage Partners annual reports: an annual survey of senior corporate executives on the topics of data and AI business adoption. What you find is a recurring theme of increasing effort and investment in building the enabling data and analytics platforms, paired with low success rates.

For example, in their 2021 report, only 26.8% of firms reported having forged a data culture. Only 37.8% of firms reported that they had become data-driven, and only 45.1% reported that they were competing on data and analytics. These are meager results for the pace and amount of investment: 64.8% of surveyed companies reported investments greater than $50MM in their Big Data and AI strategies.

Despite continuous effort and investment in one generation of data and analytics platforms after another, organizations find the results middling.

I recognize that organizations face a multifaceted challenge in transforming to become data-driven: migrating from decades of legacy systems, a legacy culture’s resistance to relying on data, and competing business priorities.

The future approach to data management must look carefully at this phenomenon and ask why the solutions of the past are not producing results commensurate with the human and financial investment we put in today. Some of the root causes include a lack of the skill sets needed to build and run data and AI solutions; organizational, technological, and governance bottlenecks; and friction in discovering, trusting, accessing, and using data.

Recap

As a decentralized approach to managing data, Data Mesh embraces the data realities of organizations today, and their trajectory, while acknowledging the limitations of our current solutions.

Data Mesh assumes a new default starting state: a proliferation of data origins within and beyond an organization’s boundaries, on one or across multiple cloud platforms. It assumes a diverse range of use cases for analytical data, from hypothesis-driven machine learning model development to reports and analytics. It works with a highly complex and volatile organizational environment, not against it.

In the next two chapters, I set the stage further. Next, we look at our expectations of Data Mesh as a post-inflection-point solution: what organizational impact we expect to see, and how Data Mesh achieves it.

1 Christoph Windheuser, What is Intelligent Empowerment?, (ThoughtWorks, 2018).

2 Definition provided by William H. Inmon known as the father of data warehousing.

3 https://www.bbva.com/en/five-vs-big-data

4 Donald Knuth made the statement, “Premature optimization is the root of all evil,” in the context of writing code.
