Chapter 3. Getting Started Building Your Data Lake

By now you’re probably thinking, “How does the cloud fit in? Why do organizations decide to build their data lakes in the cloud? How could this work in my environment?” As it turns out, there isn’t a one-size-fits-all reason for building a data lake in the cloud. But many of the requirements for getting value out of a data lake can be satisfied only in the cloud.

In this chapter, we answer the following questions:

  • As your company’s data initiatives mature, at what point should you think about moving to a data lake?

  • Why should you move your data into the cloud? What are the benefits?

  • What are the security concerns with moving data into the cloud?

  • How can you ensure proper governance of your data in the cloud?

The Benefits of Moving a Data Lake to the Cloud

The Enterprise Strategy Group asked companies which attributes were most important when working with big data and analytics. Not surprisingly, virtually all of the attributes cited are ones you get by building a big data lake in the cloud, as depicted in Figure 3-1.

Figure 3-1. Most important attributes when building a data lake

With the cloud come the following key benefits:

Built-in security

When it comes to security, cloud providers have collected knowledge and best practices from all of their customers and have learned from the trials and errors of literally thousands of other companies. What’s more, they have dedicated security professionals—the best in the industry—working on continually improving the security of their platforms. After all, trust in their ability to keep their customers’ data safe is key to their success.

High performance

The resources available from cloud providers are virtually infinite, giving you the ability to scale out performance as well as a broad range of configurations for memory, processors, and storage options.

Greater reliability

As in your on-premises datacenters, cloud providers have layers of redundancy throughout their entire technology stacks. Service interruptions are extraordinarily rare.

Cost savings

If you do it well, you’ll get a lower total cost of ownership (TCO) for your data lake than if you remained on-premises. You pay only for the compute you need, and the tools you use are built specifically for the cloud architecture. Automation technologies like Qubole help prevent failures and thus lower the risk of operating on big data at scale, which is a huge win for the DevOps team, who are no longer called in the middle of the night when things break.

A lot of these benefits come from not having to reinvent and maintain the wheel when it comes to your infrastructure. As we’ve said before, you need to buy only as much as you use, and build only what you need to scale out based on demand. The cloud is perfect for that. Rather than spending all your time and energy spinning up clusters as your data grows and provisioning new servers and storage, you can increase your resources with just a few clicks in the cloud.

Key Benefit: The Ability to Separate Compute and Storage

One requirement inexorably pushing companies toward the cloud is that to work with truly big data, you must separate compute from storage resources. And this type of architecture is possible only in the cloud, making it one of the biggest differences from data platforms built for on-premises infrastructure. Two technologies make this separation possible: virtualization and object stores.

Virtualization makes it possible for you to provision compute in the cloud on demand. That’s because compute in the cloud is ephemeral: you can instantaneously provision and deprovision virtual machines (VMs).
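To make this concrete, here is a minimal sketch of on-demand provisioning and deprovisioning using the AWS SDK for Python (boto3); the AMI ID, instance type, and region are placeholders, and a real deployment would also handle networking, IAM, and error handling.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Provision a VM only for the duration of the job (placeholder AMI and size).
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="m5.xlarge",
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # ...run the workload, then deprovision so you stop paying for compute.
    ec2.terminate_instances(InstanceIds=[instance_id])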

Precisely because compute is ephemeral, separating compute from storage is critical for “persistent” assets such as data. The cloud provides two persistent storage technologies for this purpose: block stores and object stores. For large datasets, object stores are especially well suited.

Object storage is an architecture that manages storage as objects instead of as a hierarchy (as filesystems do) or as blocks (as block storage does). In a cloud architecture, you must store data in object stores and use compute when needed to process data directly from the object stores. The object stores become the place where you dump all of your data in its raw format. Cloud-native big data platforms then create the compute infrastructure needed to process this data on the fly.
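As an illustration, the sketch below assumes a Spark cluster spun up on demand and already configured with the s3a connector and credentials; the bucket, paths, and field names are hypothetical. The point is that the ephemeral compute reads raw data straight out of the object store, does its work, and writes results back, with nothing stored on the cluster itself.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-events-rollup").getOrCreate()

    # Read raw JSON directly from the object store; no data lives on the cluster.
    events = spark.read.json("s3a://example-data-lake/raw/events/2019/06/")

    # Process on ephemeral compute, then write results back to the object store.
    daily_counts = events.groupBy("event_date", "country").count()
    daily_counts.write.mode("overwrite").parquet(
        "s3a://example-data-lake/curated/daily_counts/")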

Note that this is different from typical on-premises data architectures for big data. Because of the lack of highly scalable object stores that can support thousands of machines reading data from and writing data to them, on-premises data lake architectures stress convergence of compute and storage, as illustrated in Figure 3-2. In fact, Hadoop is based on the principle that compute and storage should be converged. The same physical machines that store data also perform the computation for different applications on that data.

Figure 3-2. The key benefits of cloud-based data platforms: elasticity and separation of compute and storage

Although this is the standard architecture for on-premises big data platforms, lifting and shifting it to the cloud largely forfeits the elasticity and flexibility benefits the cloud offers.

Thus, the cloud allows you to tailor and structure your compute and storage resources exactly the way you need them. This enables you to eliminate worries about resource capacity and utilization. It’s like using Lyft or Uber rather than owning and maintaining your own car. Instead, a car is a just-in-time resource—just as raw goods are made available just in time in manufacturing—that you can use when you need it. This saves time and money, and reduces risk.

The separation of compute from storage is important because compute and storage have different elasticity requirements. Businesses don’t tend to buy and then throw away storage. If you write something to storage, it’s very rarely deleted—not just for compliance reasons (although that’s part of it), but because businesses are very reluctant to throw data out.

Compute, on the other hand, is wanted only when needed. You don’t want thousands of CPUs sitting idle. That’s just wasteful—not just for your organization, but for the environment, too. This is just one of the reasons big data platforms are being modernized for the cloud; the value of data inherently has an expiration date. You might need to process information only once or a few times. Elastic compute allows far more efficient use of resources across the analytics life cycle.

This separation of compute and storage also helps you to have true financial governance over your data operations. You can appropriately tune and size your clusters for the given workloads without having to allow for extra compute resources during peak processing times, as you would with an on-premises infrastructure.

Note

Some companies are simply lifting and shifting their on-premises Hadoop platforms, such as those distributed by Cloudera or Hortonworks, into the cloud. These companies are actually falling short because they aren’t getting the benefits of separating storage from compute and the advantages of being able to spin up ephemeral clusters. Having their platforms in the cloud thus won’t improve cost and flexibility results over hosting their data lakes on-premises.

Object storage in the cloud

When it comes to object storage in the cloud, each provider offers its own solution. An object store files data differently from traditional systems: whereas traditional systems manage blocks of data, object stores manage data as discrete units called objects. Each object bundles the data that makes up a file with its relevant metadata and is assigned a unique identifier. This combines the power of a high-performance filesystem with massive scale and economy to help you speed up your time to insight.
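A minimal sketch of this model using boto3 against Amazon S3 (the bucket name, keys, and metadata values are hypothetical): the object key serves as the identifier, and user-defined metadata travels with the object.

    import boto3

    s3 = boto3.client("s3")

    # The key acts as the unique identifier; metadata is stored with the object.
    s3.put_object(
        Bucket="example-data-lake",
        Key="raw/clickstream/2019/06/01/events-0001.json",
        Body=open("events-0001.json", "rb"),
        Metadata={"source": "web", "ingested-by": "pipeline-v1"},
    )

    # Retrieve it later by the same identifier; the metadata comes back with it.
    obj = s3.get_object(Bucket="example-data-lake",
                        Key="raw/clickstream/2019/06/01/events-0001.json")
    print(obj["Metadata"], obj["Body"].read()[:100])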

AWS has its Simple Storage Service (Amazon S3), which is storage for the internet. Designed to make web-scale computing easier for developers, it has a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. You pay only for what you use.

Microsoft offers Azure Blob storage as well as Azure Data Lake Storage (ADLS). Azure Blob storage is Microsoft’s object storage solution for the cloud; it is optimized for storing massive amounts of unstructured data such as text or binary data. ADLS Gen2 is a highly scalable and cost-effective data lake solution for big data analytics.

Finally, Google offers Cloud Storage. Cloud Storage allows worldwide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.

Distributed SQL versus a data warehouse

Another critical decision that organizations face when moving to a cloud data lake is how to migrate their data assets. More often than not, traditional organizations run on online analytical processing (OLAP) systems with RDBMS and SQL Server backends. These tend to be highly structured systems that keep data persistent alongside the compute that processes it, so moving to a cloud object store and separating data processing from storage can be challenging. Furthermore, certain applications such as enterprise resource planning, financial software, and customer-facing interfaces require that data be highly available.

Data lakes and data warehouses are both widely used for storing big data, but these are not interchangeable terms. Data lakes are vast pools of raw data that have a number of different purposes and can be used for processing later on. Data warehouses (and their smaller, department-specific cousins, data marts), on the other hand, store information that has already been processed for a specific purpose; they are intended as repositories for structured and filtered data.

In the strictest sense of the term, a data warehouse is meant to hold a subset of the information stored in the data lake, only in a more optimized and readable format for nontechnical users. Many companies have different data warehouses for finance, manufacturing, ordering, and procurement, for example. These data warehouses are quite often separate from one another because they serve different sectors of the organization. The idea behind a data lake is to bring these all together in a cohesive system that allows users to query across organizational verticals and obtain more value from the data.

Differences in users

The users of data warehouses and data lakes are also different. Data warehouses are typically operated by less technical users than those using a data lake. That’s because data warehouses typically abstract away from the users how data is stored, organized, and optimized.

Data lakes, on the other hand, require much more thought and planning as to where data lives, how it’s updated and secured, as well as knowledge of the data’s life cycle. Thus, data warehouses are typically used by people who know SQL but couldn’t care less about the minutiae of the underlying system. Data lakes are typically managed and used by people who think in computer science terms and are concerned with such things as column-oriented storage, block replication, compression, and the order of data within data files.

In terms of roles and responsibilities, data analysts are typically the target audience of data warehouses. Data engineers and data scientists are the primary users of data lake tools such as Hadoop and Spark. That being said, if a data lake is correctly built, it can serve the needs of all three types of users (analysts, scientists, and data engineers) by supplying an engine for each use case that meets their skillsets and goals.

The data may live in S3, for example, but a cloud-native platform like Qubole offers something for each user with a rich set of engines that can fulfill each user’s desired outcome. Data engineers typically use Hadoop, Hive, and Spark; data scientists typically use Spark; and data analysts tend to like Presto. Qubole can support all of these simultaneously.

When Moving from an Enterprise Data Warehouse to a Data Lake

If your business is primarily using an enterprise data warehouse, it is important to ensure that you don’t disrupt existing IT operations in the process of building out your data lake platform. A key best practice is to start copying data from your warehouse to your data lake. This creates a foundation for a platform that can scale storage and compute resources separately while figuring out which operations you can migrate. Your data will then be available across a multitude of use cases while you map out a plan of which teams or data operations make sense to move in the near term. From there, your best next step might be helping to get a successful data product out or creating a new reporting pipeline operation that can demonstrate measurable success to your organization.

How you want to prioritize your plan will be driven by a number of common factors:

Capital expenditure (CapEx) and operational expenditure (OpEx)

Your business might not want to increase budget on CapEx hardware, so offloading data mining onto the cloud could help mitigate the need to buy more servers.

Analytics bottlenecks

As more users begin depending on your analytics models to do their jobs, reporting and query volume rapidly increases, which causes bottlenecks on other critical systems or hinders the customer experience. Offloading these users could help stabilize your analytics processes and, subsequently, the business.

Employee resources and expertise

The company might be hiring more data scientists or analysts, adding to the workloads of the data engineering and DevOps teams who will support them. Focusing on these new and different demands will help you to see the gaps in personnel that you need to fill.

Data analysts (SQL users) are typically the target audience of data warehouses, and data engineers and data scientists are the primary users of data lake tools such as Hadoop and Spark. This said, a well-built data lake will serve the needs of all types of users (analysts, product managers, and business users) by supplying an engine and interface for all use cases that align with users’ skillsets and goals.

Deciding whether to start your cloud analytics operations by separating storage from compute will determine whether you go with a cloud data warehouse (such as Redshift) or a distributed SQL engine (such as Presto).

Cloud Data Warehouse

There are two fundamental differences between cloud data warehouses and cloud data lakes: the data types and the processing framework. In a cloud data warehouse model, you need to transform the data into the right structure to make it usable. This is often referred to as schema-on-write.

In a cloud-based data lake, you can load raw data—unstructured or structured—from a variety of sources and apply structure only when the data is read for analysis. This is called schema-on-read. Most business users can analyze only data that has been “cleaned” by formatting it and assigning metadata to it, and with schema-on-read that preparation happens at query time rather than at load time. When you marry this operational model with the cloud’s unlimited storage and compute availability, your business can scale its operations with growing volumes of data, a variety of sources, and concurrent querying while paying only for the resources utilized.
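The contrast can be sketched as follows, assuming a Spark-based lake; the table names, paths, and fields are hypothetical. In the warehouse, the schema must be declared before any data can be loaded; in the lake, raw files land as-is and a schema is applied only when they are read.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # Schema-on-write (warehouse): structure is fixed before data is loaded.
    schema_on_write_ddl = """
    CREATE TABLE clicks (user_id VARCHAR(64), url VARCHAR(2048), ts TIMESTAMP);
    -- data must be transformed to fit this structure before it is inserted
    """

    # Schema-on-read (lake): raw JSON landed as-is; structure applied at read time.
    click_schema = StructType([
        StructField("user_id", StringType()),
        StructField("url", StringType()),
        StructField("ts", TimestampType()),
    ])
    clicks = spark.read.schema(click_schema).json("s3a://example-data-lake/raw/clicks/")
    clicks.createOrReplaceTempView("clicks")
    spark.sql("SELECT url, count(*) AS views FROM clicks GROUP BY url").show()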

With this in mind, some of the biggest innovations in data warehouses focus on using data in cloud object storage as the primary repository while scaling compute nodes out horizontally. The most popular of these are Amazon Redshift, Snowflake, Google BigQuery, and Azure SQL Data Warehouse.

We describe Redshift in detail, but they all work in a similar manner.

Redshift

Redshift is a managed data warehouse service from Amazon, delivered via AWS. Redshift stores data in local storage distributed across multiple compute nodes, although Redshift Spectrum uses a Redshift cluster to query data stored in S3 instead of local storage.

With Redshift, there is no separation of compute and storage. It does provide “dense-storage” and “dense-compute” options, but storage and compute are still tightly coupled. With Redshift, as your data grows you inevitably buy more resources than you need, increasing the cost of your big data initiatives.

Redshift, originally based on ParAccel, is proprietary and its source code closed. This means that you have no control over what is implemented or what is put in the roadmap. However, on the plus side, with Redshift, you don’t need to manage your own hardware. Simply load data from your S3 object store or disk, and you’re ready to query from there. This is great for cases in which you are serving reports or data through a frontend interface from which your users are slicing and dicing different dimensions of the data.
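For example, a load from S3 into Redshift is typically a single COPY statement. The sketch below assumes the psycopg2 driver (Redshift speaks the PostgreSQL wire protocol); the cluster endpoint, credentials, table, bucket, and IAM role are placeholders.

    import psycopg2

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439, dbname="analytics", user="loader", password="...",
    )

    copy_sql = """
        COPY web.page_views
        FROM 's3://example-data-lake/curated/page_views/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'  -- placeholder role
        FORMAT AS CSV GZIP;
    """
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # the data is now local to the cluster, ready to query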

Cloud data warehouse services are attractive because they’re straightforward to use, perform well, and don’t demand the in-house expertise and support savvy that open source requires. That being said, ease of use and performance come at a high cost. Because data remains persistent on local disk, adding new users or data drives costs up steeply.

Understanding which use cases best fit a distributed SQL technology such as Presto or a data warehouse like Redshift will require you to analyze your needs carefully. You also could find that you have a data pipeline in which your reports are processed by Presto and sent to individual customers via PDF, and then a subset of that data is pushed down to the data warehouse to generate other reports or to feed into customer-facing applications.

Distributed SQL

If you want the advantages of separating compute and storage, but with a feel that is similar to a data warehouse, Presto is an excellent place to start.

Some important reasons why you might use a distributed SQL engine include:

  • You have bursty volumes of users analyzing data regularly.

  • Data sizes you’re analyzing are unpredictable and large.

  • You want to analyze data from multiple sources and data marts.

  • You need to join large tables together for further analysis.

In the end, you should have a combination of both: scheduled batches with bursty data volumes and ad hoc workloads run on a distributed SQL engine, while business intelligence analytics and other systems that rely on that information are fed from a data warehouse with more uptime.

Presto

Presto is an open source SQL query engine built for running fast, large-scale analytics workloads distributed across multiple servers. Presto supports standard ANSI SQL and has enterprise-ready distributions available from services such as Qubole, Amazon Athena, GE Digital Predix, and Azure HDInsight. This makes it easier for companies running data warehouses such as Redshift, Vertica, and Greenplum to move legacy workloads to Presto.

Presto can plug in to several storage systems, such as HDFS, S3, and others. This connector layer exposes an API that lets users author and use their own storage connectors as well. As we explained in the previous section, this separation of compute from storage allows you to scale each independently, which means you use resources more efficiently and ultimately can save costs.

Presto has several other advantages:

Supports ANSI SQL

This includes complex interactive analytics, window functions, large aggregations, joins, and more, which you can use against structured and unstructured data.

Separates storage from compute

Presto is built so that each query passes through a coordinator, which uses a scheduler to decide which worker nodes will run the job; the workers read data from external storage rather than from disks attached to the compute cluster.

Supports federated queries to other data sources

Presto supports querying of other database systems like MySQL, Postgres, Cassandra, MongoDB, and more, so you can bring together disparate data sources.

Performs fast distributed SQL query processing

The in-memory engine enables massive amounts of data to be processed quickly.

And as we’ve said, Presto is open source. As such, it has accepted contributions from third parties that enhance it, such as Uber, AWS, Netflix, Qubole, FINRA, Starburst Data, Teradata, and Lyft. Facebook has kept a close eye on the project as well, and continues to contribute its own improvements to Presto to the open source community. Using an open source engine like Presto means that you get advantages from others’ work while remaining in control of your own technology.

Because Presto doesn’t typically care what storage you use, you can quickly join or aggregate datasets and can have a unified view of your data to query against. The engine is also built to handle data processing in memory. Why does this matter? If you can read data more swiftly, the performance of your queries improves correspondingly—always a good thing when you have business analyst, executive, and customer reports that need to be made available regularly.
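As a sketch of what this looks like in practice, the example below assumes the presto-python-client (prestodb) package, a reachable Presto coordinator, and hypothetical hive and mysql catalogs and tables; it joins clickstream data in S3 (via the Hive catalog) with customer records in MySQL in a single query.

    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com", port=8080,  # placeholder coordinator
        user="analyst", catalog="hive", schema="web",
    )

    cur = conn.cursor()
    cur.execute("""
        SELECT c.segment, count(*) AS clicks
        FROM hive.web.clicks AS k                 -- clickstream files in S3
        JOIN mysql.crm.customers AS c             -- customer records in MySQL
          ON k.user_id = c.user_id
        GROUP BY c.segment
        ORDER BY clicks DESC
    """)
    for segment, clicks in cur.fetchall():
        print(segment, clicks)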

How Companies Adopt Data Lakes: The Maturity Model

A TDWI report released in March surveyed how companies are using data lakes and what benefits or drawbacks they’re seeing. In the survey, conducted in late 2017, 23% of respondents said their organizations were already using data lakes, whereas another 24% expected to have one in production in the next 12 months.

Given this, how do companies move from traditional on-premises data warehouses to data lakes in the cloud?

Qubole’s five-step Big Data Cloud Maturity Model closely correlates with a company’s migration to a data lake, as demonstrated in Figure 3-3.

Figure 3-3. The Qubole Big Data Cloud Maturity Model

Stage 1: Aspiration—Thinking About Moving Away from the Data Warehouse

In the first stage, the company is typically using a traditional data warehouse with production reporting and ad hoc analyses.

Signs of a Stage 1 company include having a large number of apps collecting growing volumes of data, researching big data but not investing in it yet, and beginning to hire big data engineers. The company likely also has a data warehouse or a variety of databases instead of a data lake.

The classic sign of a Stage 1 company is that the data team acts as a conduit to the data, so all employees must go through that team to access data.

The key to getting from Stage 1 to Stage 2 is to not think too big. A company should begin by focusing on one problem it has that might be solved by a big data initiative, starting with something small and concrete that will provide measurable ROI. For ecommerce companies, that could mean building a straightforward A/B testing algorithm, or if your company has an on-premises setup, it could be moving some customer reports so that you can deliver them under a faster SLA.
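For instance, a first project might be no more than a two-proportion significance test over experiment results pulled from the lake. The sketch below uses only the Python standard library; the conversion counts are made up.

    import math

    def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
        """Two-proportion z-test: is variant B's conversion rate different from A's?"""
        p_a = conversions_a / visitors_a
        p_b = conversions_b / visitors_b
        p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
        z = (p_b - p_a) / se
        # Two-sided p-value from the standard normal CDF.
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        return p_a, p_b, z, p_value

    print(ab_test(480, 10_000, 560, 10_000))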

It’s important to understand that every new project build is a proof of concept (PoC), whether you are new to data lakes or a pro. Companies should use the power of elasticity in the cloud to be able to test and fail fast to iterate on what technology works best for each use case.

Stage 2: Experimentation—Moving from a Data Warehouse to a Data Lake

In this stage, you deploy your first big data initiative in a data lake. This is typically small and targeted at one specific problem that you hope to solve.

You know you’re in Stage 2 if you have successfully identified a big data initiative. The project should have a name, a business objective, and an executive sponsor. You probably haven’t yet decided on a platform, and you don’t have a clear strategy for going forward. That comes in Stage 3. You still need to circumvent numerous challenges in this stage.

Some typical characteristics of a Stage 2 company:

  • Company personnel don’t know the potential pitfalls ahead. Because of that, they are confused about how to proceed.

  • The company lacks the resources and skills to manage a big data project. This is extremely common in a labor market in which people with big data skills are snapped up at exorbitant salaries.

  • The company cannot expand beyond its initial success, usually because the initial project was not designed to scale, and expanding it proves too complex.

  • The company doesn’t have a clearly defined support plan.

  • The company lacks cross-group collaboration.

  • The company has not defined the budget.

  • The company is unclear about the security requirements.

Whereas the impetus to reach the aspirational stage typically comes from senior executives—where perhaps the CEO or CIO has read about data lakes or heard about them at a conference—the experimental stage must be driven by hands-on technologists who need to determine how to fulfill the CEO’s or CIO’s vision. It’s still very much a PoC.

Although you can certainly build your initial data lake on-premises, it makes sense to start in the cloud. Why? Because you want to fail fast, and you want to separate compute from storage so that you don’t have racks upon racks of servers and storage arrays sitting there on the chance that you might find workloads to fill them up one day.

Starting with the cloud means that you have access to elastic resources to do your PoCs and begin training your workforce. They will fail at first—this is inevitable—but they can fail fast and cheaply in the cloud, and they can realize success sooner than if they were on-premises.

Analyzing data from different formats

During Stage 2, you’re getting initial datasets into the data lake to meet initial use cases and preparing the foundations of your data lake by doing database dumps in either JSON or CSV format. You’re also beginning to analyze log files or clickstream data.

Remember, every build is a PoC. Whether you are new to data lakes or a pro, you want to be able to test and fail fast to iterate your way to success.

Note

The JSON library in Python can parse JSON from strings or files. The library parses JSON into a Python dictionary or list. There are various circumstances in which you receive data in JSON format and you need to send or store it in CSV format. Of course, JSON files can have much more complex structures than CSV files, so a direct conversion is not always possible.
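A minimal sketch of that conversion for flat records, using only the standard library (the file names are hypothetical):

    import csv
    import json

    # Parse JSON (from a string or a file) into Python dictionaries/lists.
    with open("orders.json") as f:
        records = json.load(f)  # e.g., a list of flat JSON objects

    # Write the flat fields out as CSV; nested structures would need flattening first.
    fieldnames = sorted({key for rec in records for key in rec})
    with open("orders.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)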

Then there’s clickstream data. Many platforms such as Facebook and Google Ads rely on the data generated by user clicks. To analyze clickstream data, you must follow a user’s click-by-click activity on a web page or application. This is extraordinarily valuable because having a 360-degree view of what customers are and are not clicking on can dramatically improve both your products and your customers’ experiences.

Important considerations for the data lake

Here are some common considerations during the exploratory phase of data lake maturity:

1. Onboard and ingest data quickly

One innovation of the data lake is rapid—and early—ingestion coupled with late processing. This allows you to make integrated data immediately available for reporting and analytics.

2. Put governance in place to control who loads which data into the lake

Without these types of controls, a data lake can easily turn into a data swamp, which is a disorganized and undocumented dataset from which it’s difficult to derive value. Establish control via policy-based data governance and, above all, enforce so-called “antidumping” policies. You should also document data as it enters the data lake by using metadata, an information catalog, business glossary, or other semantics so that users can find the data they need and optimize queries.

3. Keep data in a raw state to preserve its original details and schema

Detailed source data is preserved in storage so that it can be used over and over again as new business requirements emerge.

Stage 3: Expansion—Moving the Data Lake to the Cloud

In this stage, multiple projects are using big data, so you have the foundation for a big data infrastructure. You have created a roadmap for building out teams to support the environment.

You also face a plethora of possible projects. These typically are “top-down” projects; that is, they come from high up in the organization, from executives or directors. You are focused on scalability and automation, but you’re not yet evaluating new technologies to see whether they can help you. However, you do have the capacity and resources to meet future needs and have won management buy-in for the project on your existing infrastructure.

If you’re not in the cloud by this time, you should be moving there during this stage.

However, you’re still pretty rudimentary at this point. You probably don’t have governance in place, and you probably don’t have well-structured teams in place, either. You’re really just a step beyond performing PoCs and you probably haven’t started hiring additional team members. You haven’t started reaching out for people who are visionaries or have a lot of experience in this area. Your teams are still learning. And you probably haven’t figured out what storage formats you should have, how well they’re optimized, how to convert one to another, or how to order them. This is all perfectly natural.

As far as challenges go, here’s what Stage 3 companies often encounter:

  • A skills gap: needing access to more specialized talent

  • Difficulty prioritizing possible projects

  • No budget or roadmap to keep TCO within reasonable limits

  • Difficulty keeping up with the speed of innovation

Getting from Stage 3 to Stage 4 is the most difficult transition. At Stage 3, people throughout the organization are clamoring for data, and you realize that having a centralized team as the conduit to the data and infrastructure puts a tremendous amount of pressure on that team. To avoid this bottleneck in the company’s big data initiatives, you need to find a way to invert your current model and open up infrastructure resources to everyone, to fully operationalize the data lake and expand its use cases.

The bottom line is that Stage 3 naturally pushes you out of your comfort zone to the point where you will need to invest in new technologies and shift your corporate mindset and culture. At this time, you must begin thinking about self-service infrastructure and start looking at the data team as a data platform team. You’re ready to move to Stage 4.

Stage 4: Inversion

It is at this stage that your company achieves an enterprise transformation and begins seeing “bottom-up” use cases—meaning that employees are identifying projects for big data themselves rather than depending on executives to commission them. Though this is a huge shift for the company, it brings a new challenge: staying organized and deciding where to spend smarter and more efficiently across teams’ workloads.

You know you are in Stage 4 if you have spent many months building a cluster and have invested a considerable amount of money, but you no longer feel in control. Your users used to be happy with the big data infrastructure, but now they complain. You’re also simultaneously seeing high growth in your business—which means more customers and more data—and you’re finding it difficult, if not impossible, to scale quickly. This results in massive queuing for data. You’re not able to serve your “customers”—employees and lines of business are not getting the insight they need to make decisions.

Stage 4 companies worry about the following:

  • Not meeting SLAs

  • Not being able to grow the data products and users

  • Not being able to control rising costs from growing or new projects

  • Not seeing collaboration across teams or data

Still, at this point the company is pretty mature. You know what you’re doing. You’re now a data-driven organization and your data lake is returning measurable ROI and has become the heartbeat of the business.

You are now concerned about governance: controlling this very valuable asset you now have in place. Governance structures need to be mandated from the top down and strictly adhered to, and you must publish action items that are followed up on every month or every quarter to ensure that your governance activities live up to these standards. You need to update your governance plans and add on security and reporting procedures. That’s why governance must be a living, breathing thing.

You actually need three governance plans:

  • Data governance

  • Financial governance

  • Security governance

We discuss governance more thoroughly in Chapter 5.

Stage 5: Nirvana

If you’ve reached this stage, you’re on par with the Facebooks and Googles of the world. You are a truly data-driven enterprise that is gaining invaluable insights from its data. Your business has been successfully transformed.
