Enterprises are struggling with the proliferation of data in terms of both volume and variety. Data modernization addresses this key challenge and involves establishing a strong foundation of data by making it simple and accessible, regardless of where that data resides. Since data that's used in AI is often very dynamic and fluid with ever-expanding sources, virtualizing how data is collected is critical for clients. Cloud Pak for Data offers a flexible approach to address these modern challenges with a mix of proprietary, open source, and third-party services.
In this chapter, we will look at the importance of data and the challenges that occur with data-centric delivery. We will also look at what data virtualization is and how it can be used to simplify data access. Toward the end of this chapter, we will be looking at how Cloud Pak for Data enables data estate modernization.
In this chapter, we're going to cover the following main topics:
The fact is, every company in the world is a data company. As The Economist rightly pointed out in 2017, data (not oil) is the world's most valuable resource, and unless you are leveraging your data as a strategic differentiator, you are likely missing out.
If you look at successful companies over time, all of them had sustainable competitive advantages – either economies of scale (Apple, Intel, AWS) or network effects (Facebook, Twitter, Uber, and so on). Data is the new basis for having a sustainable competitive advantage. Over 90% of the world's data cannot be googled, which means most of the world's valuable data is private to the organizations that own it. So, what can you do to unleash the potential that's inherent to your proprietary data?
As we discussed in Chapter 1, Data Is the Fuel that Powers AI-Led Digital Transformation, CEOs and business leaders know they need to harness digital transformation to jumpstart growth, speed up time to market, and foster innovation. In order to accelerate that transformation, they need to integrate processes across organizational boundaries by leveraging enterprise data as a strategic differentiator. It's critical to remember that your data is only accessible to you.
Let's look at a few examples of enterprises leveraging data as a strategic differentiator:
Next, we will cover the challenges associated with data-centric delivery.
Now that we have established that data is everywhere and that the best businesses in the world today are data-driven, let's look at what data-centric means. Enterprises are collecting data from more and increasingly diverse sources to analyze and drive their operations, with those sources perhaps numbering in the thousands or millions.
Here is an interesting fact: according to Forrester (https://go.forrester.com/blogs/hadoop-is-datas-darling-for-a-reason/), up to 73% of the data you create in your enterprise goes unused. That's a very expensive and ineffective approach.
Note to Remember
Data under management is not the same as data stored. Data under management is data that can be consumed by the enterprise through a governed and common access point. This is something that hasn't really existed until today, but it is quickly becoming the leading indicator of a company's market capitalization.
And we are just getting started. Today, enterprises have roughly 800 terabytes of data under management. In 5 years, that number will explode to 5 petabytes. If things are complicated, slow, and expensive today, think about what will happen in 5 years.
The complexity, cost, time, and risk of error in collecting, governing, storing, processing, and analyzing that data centrally are also increasing exponentially. At the same time, the databases and repositories that are the sources of all of this data are more powerful than ever, with abundant processing and data storage capabilities of their own.
Historically, enterprises managed data through systems of record (mainframe and client/server applications). The number of applications was limited, and the data generated was structured for the most part – relational databases were leveraged to persist the data. However, that paradigm is not valid anymore. With the explosion of mobile phones and social networks over the past decade, the number of applications and the amount of data generated have increased exponentially. More importantly, the data is not structured anymore – relational databases, while still relevant for legacy applications, are not built to handle unstructured data produced in real time. This led to a new crop of data stores such as MongoDB, CouchDB, GraphDB, and more, jointly referred to as NoSQL databases. The applications built on them are categorized as systems of engagement:
While the evolution of systems of engagement has led to an exponential increase in the volume, velocity, and variety of data, it has also made it equally challenging for enterprises to tap into their data, given it is now distributed across a wide number and variety of data stores within the organization:
At the same time, business users are more sophisticated these days when it comes to creating and processing their own datasets, tapping into Excel and other desktop utilities to cater for their business requirements. Data warehouses and data lakes are unable to address the breadth and depth of the data landscape, and enterprises are beginning to look for more sophisticated solutions. To add to this challenge, we are beginning to see an explosion of smart devices in the market, which is bound to further disrupt the data architecture in industries such as manufacturing, distribution, healthcare, utilities, oil, gas, and many more. Enterprises must modernize and continue to reinvent to address these changing business conditions.
Before we explore potential solutions and how Cloud Pak for Data is addressing these challenges, let's review the typical enterprise data architecture and the inherent gaps that are yet to be addressed.
A typical enterprise today has several data stores, systems of record, data warehouses, data lakes, and end user applications, as depicted in the following diagram:
Also, these data stores are typically distributed across different infrastructures – a combination of on-premises and multiple public clouds. While most of the data is structured, increasingly, we are seeing unstructured and semi-structured datasets being persisted in NoSQL databases, Hadoop, or object stores. The evolving complexity and the various integration touchpoints are beginning to overwhelm enterprises, often making it a challenge for business users to find the right datasets for their business needs. This is represented in the following architecture diagram of a typical enterprise IT landscape, wherein the data and its associated infrastructure are distributed, growing, and interconnected:
The following is a detailed list of all the popular data stores being embraced by enterprises, along with the specific requirements they address:
The following is a detailed list of NoSQL data stores.
The following is a detailed list of data warehouses and data lakes:
Now that we have covered enterprise data architecture, let's move on to NoSQL data stores.
NoSQL data stores can be broadly classified into four categories based on their architecture and usage. They are as follows:
Next, we'll review one of the capabilities of Cloud Pak for Data that helps address these challenges: data virtualization.
Historically, enterprises have consolidated data from multiple sources into central data stores, such as data marts, data warehouses, and data lakes, for analysis. While this is still very relevant for certain use cases, the time, money, and resources required make it prohibitive to scale every time a business user or data scientist needs new data. Extracting, transforming, and consolidating data is resource-intensive, expensive, and time-consuming and can be avoided through data virtualization.
Data virtualization enables users to tap into data at the source, removing complexity and the manual processes of data governance and security, as well as incremental storage requirements. This also helps simplify application development and infuses agility. Extract, Transform, and Load (ETL), on the other hand, is helpful for complex transformational processes and nicely complements data virtualization. Data virtualization allows users to bypass many of the early rounds of data movement and transformation, thus providing an integrated, business-friendly view in near real time.
The following diagram shows how data virtualization connects data sources and data consumers, thereby enabling a single pane of glass to access distributed datasets across the enterprise:
There are very few vendors that offer data virtualization. According to TechTarget, the top five vendors in the market for data virtualization are as follows:
Among these, IBM stands out for its integrated approach and scale. Its focus on a unified experience of bringing data management, data governance, and data analysis into a single platform resonates with today's enterprise needs, while its unique IP enables IBM's data virtualization to scale both horizontally and vertically. Among other things, IBM leverages push-down optimization to tap into the resources of the data sources, enabling it to scale without constraints.
Data virtualization connects all the data sources to a single, self-balancing collection of data sources or databases, referred to as a constellation. No longer are analytics queries performed on data that's been copied and stored in a centralized location. The analytics application submits a query that's processed on the server where the data source exists. The results of the query are consolidated within the constellation and returned to the original application. No data is copied, and it only exists at the source.
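The query flow described above can be illustrated with a minimal sketch. It is not IBM's implementation: two in-memory SQLite databases stand in for remote data sources, and the names `make_source`, `federated_totals`, and the `orders` table are hypothetical. The point it shows is that the filter and aggregation run at each source, and only small partial results travel back to be consolidated.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


# The query that is pushed down: each source filters and aggregates locally.
PUSHED_DOWN_SQL = (
    "SELECT region, SUM(amount), COUNT(*) FROM orders "
    "WHERE amount >= ? GROUP BY region"
)


def federated_totals(sources, min_amount):
    """Run the aggregate at every source and consolidate the partial results."""
    totals = {}
    rows_transferred = 0
    for conn in sources:
        for region, subtotal, count in conn.execute(PUSHED_DOWN_SQL, (min_amount,)):
            rows_transferred += 1  # only one row per region leaves each source
            prev_total, prev_count = totals.get(region, (0.0, 0))
            totals[region] = (prev_total + subtotal, prev_count + count)
    return totals, rows_transferred


# Hypothetical sample data held by two separate sources.
source_a = make_source([("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)])
source_b = make_source([("EMEA", 200.0), ("AMER", 300.0), ("AMER", 10.0)])

totals, rows_transferred = federated_totals([source_a, source_b], min_amount=50.0)
print(totals, rows_transferred)
```

In this toy run, six source rows are reduced to four partial results before anything is transferred. With realistic table sizes, the gap between rows scanned at the source and rows moved over the network is what makes push-down worthwhile.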
By using the processing power of every data source and accessing the data that each data source has physically stored, latency from moving and copying data is avoided. In addition, all repository data is accessible in real time, and governance and erroneous data issues are virtually eliminated. There's no need for extract, transform, and load processes or duplicate data storage, which accelerates processing times. This process brings real-time insights to decision-making applications or analysts more quickly and dependably than existing methods. It also remains highly complementary to existing methods and can easily coexist with them when it remains necessary to copy and move some data for historical, archival, or regulatory purposes:
A common scenario in distributed data systems is that many databases store data in a common schema. For example, you may have multiple databases storing sales data or transactional data, each for a set of tenants or a region. Data virtualization in Cloud Pak for Data can automatically detect common schemas across systems and allow them to appear as a single schema in data virtualization – a process known as schema folding. For example, a SALES table that exists in each of the 20 databases can now appear as a single SALES table and can be queried through Structured Query Language (SQL) as one virtual table.
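As a rough illustration of the idea behind schema folding, the following sketch attaches three in-memory SQLite databases to one connection, checks that their `SALES` tables share the same column signature, and exposes them as one virtual `SALES` view. This is only an analogy built with SQLite, not how Cloud Pak for Data implements the feature; the source names (`emea`, `apac`, `amer`), the columns, and the `signature` helper are all hypothetical.

```python
import sqlite3

# Autocommit mode keeps ATTACH statements outside any implicit transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
regions = ["emea", "apac", "amer"]  # hypothetical source databases

for i, region in enumerate(regions):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {region}")
    conn.execute(f"CREATE TABLE {region}.SALES (order_id INTEGER, amount REAL)")
    conn.execute(
        f"INSERT INTO {region}.SALES VALUES (?, ?)", (i + 1, 100.0 * (i + 1))
    )


def signature(schema, table):
    """Return the (column name, declared type) pairs of a table."""
    rows = conn.execute(f"PRAGMA {schema}.table_info({table})").fetchall()
    return tuple((name, col_type) for _, name, col_type, *_ in rows)


# Fold only the sources whose SALES table matches the first one's signature.
reference = signature(regions[0], "SALES")
foldable = [r for r in regions if signature(r, "SALES") == reference]

# A TEMP view may reference tables in any attached database.
union_sql = " UNION ALL ".join(
    f"SELECT '{r}' AS source, order_id, amount FROM {r}.SALES" for r in foldable
)
conn.execute(f"CREATE TEMP VIEW SALES AS {union_sql}")

# Consumers now query a single SALES table with ordinary SQL.
row_count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM SALES").fetchone()
print(row_count, total)
```

Keeping a `source` column in the folded view is a design choice made for this sketch: it lets a consumer still filter by the originating database when needed.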
Historically, data warehouses and data lakes have been built by moving data in bulk using ETL. One of the leading ETL products in the market happens to be from IBM and is called IBM DataStage. This raises the question of when someone should use data virtualization versus an ETL offering. The answer depends on the use case. If the intent is to explore and analyze small sets of data in real time, where the data can change every few minutes or hours, data virtualization is recommended. Please note that the reference to small sets of data alludes to the actual data that's transferred, not the dataset that a query is performed on. On the flip side, if the use case requires processing huge datasets across multiple sources, where the data is more or less static over time (historical datasets), an ETL-based solution is highly recommended.
Cloud Pak for Data also includes the concept of platform connections, which enable enterprises to define data source connections universally. These can then be shared by all the different services in the platform. This not only simplifies administration but enables users across different services to easily find and connect to the data sources. Cloud Pak for Data also offers flexibility to define connections at the service level to override platform connections. Finally, administrators can opt to either set user credentials for a given connection or force individual users to enter their respective credentials. The following is a screenshot of the platform connections capability of Cloud Pak for Data v4.0, which was released on June 23, 2021:
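The resolution rules described above (platform-wide definitions, service-level overrides, and shared versus personal credentials) can be sketched in a few lines. This is a conceptual model only and does not use any Cloud Pak for Data API; the `Connection` class, the `resolve` function, the service names, and the `db2://` URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Connection:
    name: str
    url: str
    # None means every user must supply personal credentials.
    shared_user: Optional[str] = None


# Platform-level connections, visible to every service (hypothetical values).
platform: Dict[str, Connection] = {
    "warehouse": Connection(
        "warehouse", "db2://dw.example.com:50000/SALES", shared_user="svc_reader"
    ),
}

# Service-level definitions that override the platform connection.
service_overrides: Dict[Tuple[str, str], Connection] = {
    ("notebooks", "warehouse"): Connection(
        "warehouse", "db2://dw-replica.example.com:50000/SALES"
    ),
}


def resolve(service: str, name: str, personal_user: Optional[str] = None):
    """Return (url, user) for a service, preferring a service-level override."""
    conn = service_overrides.get((service, name), platform[name])
    user = conn.shared_user or personal_user
    if user is None:
        raise PermissionError(f"{name}: personal credentials are required")
    return conn.url, user


# A service without an override inherits the shared platform definition.
print(resolve("dashboards", "warehouse"))
# The overriding service requires each user to supply credentials.
print(resolve("notebooks", "warehouse", personal_user="alice"))
```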
Now that we have gone through the collect capabilities of Cloud Pak for Data, let's see how customers can modernize their data estates with Cloud Pak for Data and the underlying benefits of doing so.
So far, we have seen the evolving complexity of the data landscape and the challenges enterprises are trying to address. Increasing data volumes, expanding data stores, and hybrid multi-cloud deployment scenarios have made it very challenging to consolidate data for analysis. IBM's Cloud Pak for Data offers a very modern solution to this challenge. At its core is the data virtualization service, which lets customers tap into the data in source systems without moving the data. More importantly, its integration with the enterprise catalog means that any data that's accessed is automatically discovered, profiled, and cataloged for future searches. Customers can join data from multiple sources into virtualized views and can easily enforce governance and privacy policies, making it a one-stop shop for data access. Finally, its ability to scale and leverage source system resources is extremely powerful.
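To give a feel for what a virtualized view over multiple sources looks like to a consumer, here is a small sketch that uses SQLite as a stand-in. Two attached in-memory databases play the role of separate source systems, and a view joins them while masking a sensitive column. The schema names (`crm`, `billing`), tables, columns, and the masking rule are assumptions made for illustration; the actual service defines views and policies through its own tooling.

```python
import sqlite3

# Autocommit mode keeps ATTACH statements outside any implicit transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("ATTACH DATABASE ':memory:' AS crm")      # hypothetical source 1
conn.execute("ATTACH DATABASE ':memory:' AS billing")  # hypothetical source 2

conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT, email TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?, ?)",
    [(1, "Acme", "ops@acme.example"), (2, "Globex", "it@globex.example")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 250.0), (1, 100.0), (2, 75.0)],
)

# The view joins both sources and masks the email column for consumers.
conn.execute(
    """
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.customer_id, c.name, '***' AS email, SUM(i.amount) AS revenue
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    """
)

rows = conn.execute(
    "SELECT name, email, revenue FROM customer_revenue ORDER BY name"
).fetchall()
print(rows)
```

The consumer sees one table-like object and never needs to know which system holds customers and which holds invoices; the underlying rows stay in their source tables.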
IBM's data virtualization service in Cloud Pak for Data supports over 90% of the enterprise data landscape:
While tapping into data without moving it is a great starting point, Cloud Pak for Data also enables customers to persist data on its platform. You can easily deploy, provision, and scale a variety of data stores on Cloud Pak for Data. The supported databases on Cloud Pak for Data are as follows:
Cloud Pak for Data enjoys a vibrant and open ecosystem, and more data sources are scheduled to be onboarded over the next 1-2 years. This offers customers the freedom to embrace the data stores of their choice while continuing to tap into their existing data landscape. Finally, it's worth mentioning that all the data stores that are available on the platform are containerized and cloud-native by design, allowing customers to easily provision, upgrade, scale, and manage their data stores. Also, customers can deploy these data stores on any private or public cloud, which enables portability. This is critical in situations where data gravity and the co-location of data and analytics matter.
Data is the world's most valuable resource, and to be successful, enterprises need to become data-centric and leverage their data as a key differentiator. However, increasing data volumes, evolving data stores, and distributed datasets are making it difficult for enterprises and business users to easily find and access the data they need. Today's enterprise data architecture is fairly complex, with a plethora of data stores optimized for specific workloads. Data virtualization offers a way to address this challenge, and IBM's Cloud Pak for Data is one of the key offerings in the market today. It is differentiated by its ability to scale and its integrated approach, which addresses data management, data organization, and data analysis requirements.
Finally, IBM's Cloud Pak for Data complements its data virtualization service with several containerized data stores that allow customers to persist data on the platform, while also allowing them to access their existing data without moving it.
In the next chapter, you will learn how to create a trusted analytics foundation and organize the data you collected across your data stores.