Chapter 5. Governing Your Data Lake

Now that you’ve built the “house” for your data lake, you need to consider governance. And you need to do it before you open the data lake up to users, because the kind of governance you put into place will directly affect data lake security, user productivity, and overall operational costs. As described in Chapter 3, you need to create three governance plans:

  • Data governance

  • Financial governance

  • Security governance

Data Governance

When formulating a data governance policy, you’ll inevitably encounter these questions about the data life cycle:

  • How long is this data good for?

  • How long will it be valuable?

  • Should I keep it forever or eventually throw it away?

  • Do I need to store it because of government regulations?

  • Should I put it into “colder” storage to lower costs?

Many enterprises have data that doesn’t need to be accessed frequently. In fact, your data has a natural life cycle, and an important data governance task is to manage data as it moves between various storage resources over the course of that life cycle. Storage life-cycle management is thus becoming an increasingly important aspect of data storage decisions. The cloud offers a variety of storage options based on volume, cost, and performance that you can choose from depending on where in its life cycle your data currently resides.

Public cloud providers like AWS and Azure offer storage life-cycle management services. These allow you to move data to and from various storage services. In most cases, life-cycle management options allow you to set rules that automatically move your data—or schedule deletion of unneeded data—after specified amounts of time.
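
For example, here is a minimal sketch of such a rule using boto3, the AWS SDK for Python; the bucket name and the "raw/" prefix are hypothetical, and Azure Blob Storage offers equivalent life-cycle management policies:

    import boto3

    # Minimal sketch of a storage life-cycle rule (hypothetical bucket and prefix).
    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-data-lake",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-and-expire-raw-data",
                    "Filter": {"Prefix": "raw/"},   # apply only to raw-zone objects
                    "Status": "Enabled",
                    # Move objects to colder, cheaper storage after 90 days...
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                    # ...and delete them entirely after one year.
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )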

Although the details vary, data-management experts commonly identify seven stages in the data life cycle (see also Figure 5-1).

1. Create

Data enters or is created by the business. This can be from a transaction, from the purchase of third-party data sources, or, increasingly, from sources like sensors on the IoT.

2. Store

If the data is stored in a data lake, it is kept in its raw state, and governance makes sure it doesn’t become corrupted, changed, or lost.

3. Optimize

The data is scrubbed, organized, optimized, and democratized to be accessible to the rest of the organization.

4. Consume

This is where analysts, scientists, and business users access the data and perform analyses and transformations on it.

5. Share

Analysts, scientists, and users can share the data analyses and transformations they’ve created with others.

6. Archive

Data is removed from the data lake. It is no longer processed, used, or published but is stored in case it is needed again in the future.

7. Destroy

In this phase, data that was previously available in hot storage is deleted. This is usually done after the data has been archived.

Figure 5-1. The seven stages in the data life cycle

Privacy and Security in the Cloud

Data privacy has become a central focus because of ever-increasing regulatory requirements, including government regulations such as GDPR and HIPAA.

This is especially important when you move to the cloud because you are conceding some control of your environment to a third-party provider. Traditional forms of security that are perimeter-centric are no longer sufficient, because data has become the perimeter in the cloud data lake.

Traditionally, all data operations were kept in-house in your on-premises datacenter, but now that you’re using the cloud, the cloud provider controls where the data resides, and it’s up to you to manage the rights of the data subjects. For example, under GDPR, you are responsible for protecting the privacy rights of customers, whether it is through anonymization or deletion of their data. You need to keep in mind that your cloud provider might be working with other third-party vendors. In such cases, it becomes essential to look at who controls which people can access what data. You need to ensure that you have the appropriate controls in your application infrastructure, but you also need to take a more data-centric approach to security that offers granular data security using technologies like encryption, masking, filtering, and data redaction.
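
As a simple illustration of that data-centric approach, the sketch below pseudonymizes an email field with a keyed hash before a record lands in the lake. The record layout and key handling are hypothetical; in practice you would use a purpose-built masking or tokenization service:

    import hashlib
    import hmac

    SECRET_KEY = b"rotate-me"  # hypothetical key; keep it in a secrets manager, not in code

    def mask_email(email: str) -> str:
        """Replace an email address with a stable, keyed pseudonym."""
        digest = hmac.new(SECRET_KEY, email.lower().encode(), hashlib.sha256).hexdigest()
        return f"user-{digest[:12]}"

    record = {"customer_id": 42, "email": "jane@example.com", "order_total": 99.50}
    record["email"] = mask_email(record["email"])  # redact PII before writing to the lake
    print(record)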

Depending on your industry, you’ll need to think carefully about conforming to HIPAA or PCI regulations. And you need auditability so that you can demonstrate who has access to data, how the data is being accessed or is proliferating, and how you ensure that appropriate controls are in place to demonstrate compliance.

Ensuring that your data lake is secure also requires you to think about data retention. You have two different kinds of responsibilities when it comes to data retention. First, there are the general data retention policies that enterprises must follow as defined by regulations. The best guidelines for these are from the National Institute of Standards and Technology and the International Organization for Standardization. Second, you need to ensure that you put provisions in place so you can delete, purge, or archive data that you’ve been collecting about individuals or businesses—especially given the EU’s GDPR and “right to be forgotten” rules.
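
As a sketch of the delete-or-purge provision, and assuming (a significant assumption) that per-subject data lands under a predictable prefix, a "right to be forgotten" request might be handled like this with boto3:

    import boto3

    def forget_subject(bucket: str, subject_id: str) -> int:
        """Delete every object stored under a per-subject prefix (hypothetical layout)."""
        s3 = boto3.client("s3")
        deleted = 0
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=f"subjects/{subject_id}/"):
            keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
            if keys:
                s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
                deleted += len(keys)
        return deleted

    # forget_subject("example-data-lake", "customer-12345")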

Governing your data for privacy and security also requires that you take certain actions. First, you need to be aware of your data. Technologies and tools exist to help you automate this process, so you can scan your data lake to identify exactly what kind of data you have and what’s appropriate for your business.
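
Commercial discovery tools do this at scale, but the toy sketch below conveys the idea by scanning local CSV exports for a couple of common PII patterns; the directory and patterns are illustrative only:

    import re
    from pathlib import Path

    # Illustrative PII patterns only; real discovery tools use far richer classifiers.
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def scan_file(path: Path) -> dict:
        """Return a count of each PII pattern found in a text/CSV file."""
        text = path.read_text(errors="ignore")
        return {name: len(rx.findall(text)) for name, rx in PII_PATTERNS.items()}

    for csv_file in Path("exports").glob("*.csv"):   # hypothetical staging directory
        hits = scan_file(csv_file)
        if any(hits.values()):
            print(f"{csv_file}: {hits}")  # flag files that need masking or restricted access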

After data discovery, you need to confirm that the intelligence you’ve gathered can be continuously and automatically augmented. You also need to be able to identify things like data lineage, which gives you the ability to keep up to date on when data moves across the enterprise, when data is transformed, and when data is deleted.

The third aspect is to ensure that the catalog you've created and are continually augmenting is available for users to consume. These could be business users, data analysts, or data scientists, but they will all want the ability to see the data lineage: where the data has come from and who has done what to it.

Security Governance

One of the biggest concerns businesses have is whether the cloud is secure. One way of reassuring yourself is to first count how many security engineers you have working in your own datacenter. One, two, or maybe three?

Then ask yourself how many security engineers Amazon, Google, or Microsoft have working for them. Probably thousands, if not tens of thousands. Security is central to their business model: if cloud providers can't deliver adequate security, no one will trust them with data.

Most security breaches come from internal vulnerabilities—for example, people who don’t use strong passwords, don’t change their passwords often enough, tape their passwords under the keyboard, or share their passwords with colleagues. One way of dealing with this type of threat is to segregate your work into different systems. You might have one system for ad hoc reporting, one for canned reports, and one for dashboards. Users are looking at the same information, but in separate systems.

Ultimately, this is where object storage comes in. Remember, object storage is ubiquitous across the entire cloud infrastructure. You can keep a single system of record, or source of truth, on Amazon S3 or Azure Blob Storage, and everyone looks at the same data even though they aren't running on the same clusters or the same hardware. You can then grant different levels of security access to that data.
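
For instance, one way to express "everyone reads the same data, with different levels of access" on Amazon S3 is a bucket policy that grants an analyst role read-only access to the curated zone. The account ID, role name, and bucket below are hypothetical:

    import json
    import boto3

    # Hypothetical account ID, role, and bucket; adapt to your environment.
    read_only_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AnalystsReadCuratedZone",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::123456789012:role/analyst"},
                "Action": ["s3:GetObject"],
                "Resource": "arn:aws:s3:::example-data-lake/curated/*",
            }
        ],
    }

    boto3.client("s3").put_bucket_policy(
        Bucket="example-data-lake",
        Policy=json.dumps(read_only_policy),
    )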

Financial Governance

Financial governance for data lakes is actually easier when you’re using the cloud model than when you’re using the on-premises model. The reason is very simple: you have visibility and control.

First, let’s talk about financial governance over compute. Compute in the cloud is a logical, not physical, entity. It’s not like a CPU in your datacenter. In the cloud, you can actually know at very fine granularity exactly how many compute hours you use. And because of that ability, you can be very specific about what you need—and know how much it will cost. You can also put limits on compute hours so that you can create a policy-based financial governance model that says one group has a certain amount of compute hours, whereas another group has more or less.
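
A policy-based model can be as simple as comparing each group's metered compute hours against its quota; the figures below are invented for illustration, and the usage numbers would come from your platform's billing or metering API:

    # Hypothetical monthly compute-hour quotas per team (invented numbers).
    quotas = {"data-science": 2_000, "bi-reporting": 800, "etl": 5_000}

    # Metered usage so far this month, e.g., pulled from your billing API.
    usage = {"data-science": 2_350, "bi-reporting": 410, "etl": 4_980}

    for team, quota in quotas.items():
        used = usage.get(team, 0)
        pct = 100 * used / quota
        status = "OVER BUDGET" if used > quota else "ok"
        print(f"{team:15s} {used:>6} / {quota} compute hours ({pct:5.1f}%) {status}")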

In the on-premises world, the notion of the compute hour never existed. There was no need for it. After you bought a particular compute unit or CPU, you had no incentive to actually break it down and see exactly how many hours each particular workload took. This lack of visibility led to a great deal of “server sprawl.”

Many businesses find costs growing out of their control in the on-premises world for a number of other reasons. First, scaling compute resources also means buying more storage nodes, regardless of whether you need them, because compute and storage are intertwined (unlike in the cloud). Next, as your investments in hardware and software grow, you need additional engineering resources to manage them. Migrating or upgrading on-premises hardware and software is also costly, and raises the risk of creating siloed data operations as it becomes more difficult for a single team to manage a growing infrastructure and data platform. You also need to invest in disaster recovery measures, which can mean having to buy duplicate systems to locate in a safe remote location.

In fact, the overall costs of running a physical datacenter are high whether you rent or buy. Think of the expenses that accrue for power, cooling, uninterruptible power supplies, and the space itself. It all adds up.

On the other hand, financial governance becomes much easier to achieve in the cloud. Reporting and monitoring is simpler, as is enforcement. This can lead to significant cost savings overall.

Why is that? You now have a metric (compute hours) and you can use that metric to pinpoint where, if at all, cost overruns or abuse is occurring. Combine this elastic infrastructure with a billing model that allows businesses to quickly scale as both data and users increase, and you have an OpEx model that your CFO will love. No more CapEx that can encourage overspending on unneeded capacity. Or, the reverse frequently happens: your data teams run out of resources and must wait for new ones to be procured and set up. It’s no wonder that the OpEx model is being enthusiastically embraced by traditional and cloud-native businesses alike.

Warning

You will need a plan for governing the costs of data storage in the cloud. Otherwise, you could find your storage bills begin to rise dramatically as data begets data: the more users get into the data, the more they’re creating other derived datasets. So, make sure that you keep an eye on your storage bills as well as compute hours. Happily, there are plenty of tools that give you the visibility and control you need to do this.
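
One lightweight way to keep that eye on storage is to total object sizes per top-level prefix, which shows where derived datasets are accumulating. The bucket name below is hypothetical, and managed tools such as S3 Storage Lens or Azure Storage metrics give you similar visibility without code:

    from collections import defaultdict
    import boto3

    def storage_by_prefix(bucket: str) -> dict:
        """Total bytes stored under each top-level prefix of a bucket."""
        s3 = boto3.client("s3")
        totals = defaultdict(int)
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                prefix = obj["Key"].split("/", 1)[0]
                totals[prefix] += obj["Size"]
        return dict(totals)

    for prefix, size in sorted(storage_by_prefix("example-data-lake").items(),
                               key=lambda kv: kv[1], reverse=True):
        print(f"{prefix:20s} {size / 1e9:8.2f} GB")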

A Deeper Dive into Why the Cloud Makes Solid Financial Sense

To understand why so many businesses save money in the cloud, let’s look more closely at some of the fundamental components of the cloud that enable financial governance. In particular, we define big data clusters, cloud virtual machine clusters, and cloud servers and explain how they contribute to the financial story of the cloud.

What is a big data cluster?

When you request compute resources in the cloud, you are getting a section of a cluster. A big data cluster is a collection of machines, called nodes, that provide the compute resources. The entire collection of nodes is referred to as the cluster, as illustrated in Figure 5-2. You can easily manage clusters by using one of several available frameworks. Qubole uses YARN’s framework for processing and resource allocation with Apache Hadoop, Hive, and Spark engines (see the upcoming sidebar), whereas Presto has its own internal resource manager.

Figure 5-2. A big data cluster

Cloud virtual machine cluster

In the cloud, clusters are composed of VMs that reside together within the compute space and are paid for when needed, providing an elastic infrastructure to meet the demands of a business, as shown in Figure 5-3. By using VMs, you get the following:

  • Decreased spending and higher utilization

  • Capacity to automate infrastructure management

  • The right environment and the right tools for each workload and team

Figure 5-3. How a virtual compute cluster works

Ultimately, this model of having ephemeral servers available to scale up or down dynamically according to the workload demand of various big data clusters provides the foundational model of ensuring financial governance for each of your workloads in the cloud.

Cloud servers

Cloud servers are delivered primarily through an Infrastructure-as-a-Service (IaaS) model. There are two types of cloud servers: logical and physical. A cloud server is considered logical when it is delivered through virtualization: the physical server is partitioned into two or more logical servers, each with its own operating system, user interface, and applications, although they share components of the underlying physical server. A physical cloud server is also accessed remotely over the internet, but it isn’t shared or distributed; this is commonly known as a dedicated cloud server.

How to Mitigate Cloud Costs: Autoscaling

There are several ways of imposing financial governance on your cloud-based data lake. The most efficient way is autoscaling. Autoscaling is a way to automatically scale up or scale down the number of compute resources that are being allocated to your application based on its needs at any given time.
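
Conceptually, an autoscaler repeatedly compares demand against current capacity and adds or removes nodes within configured bounds. The sketch below is a simplified illustration of that loop's core calculation, not any vendor's actual algorithm:

    def desired_node_count(pending_tasks: int, tasks_per_node: int,
                           min_nodes: int, max_nodes: int) -> int:
        """Return the node count needed for the queued work, clamped to cluster bounds."""
        needed = -(-pending_tasks // tasks_per_node)   # ceiling division
        return max(min_nodes, min(max_nodes, needed))

    # Example: 170 queued tasks, each node can run 8 tasks concurrently.
    target = desired_node_count(pending_tasks=170, tasks_per_node=8,
                                min_nodes=2, max_nodes=50)
    print(f"scale cluster to {target} nodes")   # -> 22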

Figure 5-4 depicts this scaling for an autoscaling Apache Spark cluster on Qubole. The x-axis represents a one-month span, broken down by hour, and the y-axis is the number of cloud servers (nodes) in use.

Figure 5-4. Autoscaling on an Apache Spark cluster

Look at the first spikes of blue—that is, the cluster scaling up to around 80 nodes to meet the demand of the workload. When that workload is complete, autoscaling kicks back in and brings the nodes back down to the amount needed to process that data.

Another interesting aspect of this workload is the jump in overall volume, which indicates a seasonal spike in the business and thus the need to process more data to meet customer demand.

These servers can be quickly provisioned and turned off, which means that you can have different workloads with different server needs. For instance, in an ad hoc environment, you might want nodes that have a lot of memory to handle the concurrency of your teams’ queries, whereas if you are running a massive ETL job, you will likely need a lot of disk space to be able to handle larger volumes of data. The nature of these workloads will also affect the way autoscaling works.
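
To make that concrete, a cluster-per-workload setup might map each workload type to a different node profile. The instance families below are real AWS examples (memory-optimized r5, storage-optimized i3), but the mapping itself and the field names are hypothetical:

    # Hypothetical per-workload cluster profiles; instance families are AWS examples.
    cluster_profiles = {
        "ad_hoc_sql": {
            "instance_type": "r5.2xlarge",   # memory-optimized for concurrent queries
            "min_nodes": 2,
            "max_nodes": 20,
        },
        "nightly_etl": {
            "instance_type": "i3.4xlarge",   # storage-optimized for large shuffles and spills
            "min_nodes": 5,
            "max_nodes": 100,
        },
    }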

Figure 5-5 depicts the cluster workloads over a one-week period. Zooming in on a 24-hour period shows how autoscaling can significantly reduce the costs of running clusters due to the fact that clusters are upscaled for only a small percentage of the day.

Figure 5-5. Cluster workloads over a one-month period

Spot Instances

Another way to save money in the cloud is to use spot instances. Spot instances arise when AWS has any EC2 instance types sitting idle in one of its datacenters. It releases these “spot” instances out to a marketplace where any AWS customer can bid to use the extra capacity. You can usually get away with bidding a fraction of the full value. This is a great way to save money for various ephemeral big data and analytics workloads because you can easily find EC2 instances at up to 90% off the on-demand cost.
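
You can check how deep that discount currently runs with the EC2 API. The sketch below compares recent spot prices for one instance type against a hardcoded on-demand price; that price is an assumption you would replace with a lookup against the AWS Pricing API:

    from datetime import datetime, timedelta
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ON_DEMAND_PRICE = 0.252  # assumed USD/hour for r5.xlarge; verify against current pricing

    history = ec2.describe_spot_price_history(
        InstanceTypes=["r5.xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.utcnow() - timedelta(hours=1),
    )

    for record in history["SpotPriceHistory"][:5]:
        spot = float(record["SpotPrice"])
        discount = 100 * (1 - spot / ON_DEMAND_PRICE)
        print(f"{record['AvailabilityZone']}: ${spot:.3f}/hr ({discount:.0f}% below on-demand)")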

In Figure 5-6, the lighter blue shading indicates spot nodes running alongside on-demand nodes in a Qubole autoscaling cluster configured to run spot instances on 90% to 100% of its nodes. This allows the 200- to 1,000-node cluster to run at more than an 80% discount, far beyond any vendor commitment discounts from AWS.

Figure 5-6. Using spot instances to achieve significant cost savings

Using AWS spot instances does come with risk, though: someone willing to pay a bit more can outbid you for the spot instances you want. Finding the right configuration for different workloads requires a bit of skill, art, and occasional luck, although we have noticed that stability and alerting for spot instances have improved significantly over the past few years.

Here are a few considerations to keep in mind when using spot instances:

  • Spot instances can be taken away at any time. Resiliency must be built into the pipeline if you’re going to use them (see the sketch after this list for one way to detect an imminent interruption).

  • Availability of spot instances can be lower at certain times of the year. Examples of this are holiday seasons and large sporting events such as the Super Bowl.

  • A high number of spot nodes introduces volatility into the entire cluster, so you must provide for fault tolerance.
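
One building block for that resiliency on AWS is the spot interruption notice, which EC2 exposes through instance metadata roughly two minutes before reclaiming a node. The sketch below polls for it using only the standard library and assumes IMDSv1 is enabled (IMDSv2 requires fetching a session token first):

    import json
    import time
    import urllib.error
    import urllib.request

    METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def interruption_notice():
        """Return the spot interruption notice as a dict, or None if the node is safe."""
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=1) as resp:
                return json.load(resp)   # e.g. {"action": "terminate", "time": "..."}
        except urllib.error.URLError:
            return None                  # 404 (or no metadata service) means no interruption

    while True:
        notice = interruption_notice()
        if notice:
            print(f"Spot interruption at {notice['time']}: draining tasks, flushing state...")
            break
        time.sleep(5)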

Measuring Financial Impact

Ultimately, the cloud data lake—and, more specifically, autoscaling for cloud computing—allows you to segment your spending by workload as needed. This is a huge shift for your teams’ budgets, as you might find spending fluctuates when your teams are investigating or developing new projects, rather than focusing on system optimizations. Here are a few questions to ask yourself regularly when managing analytics in a cloud data lake:

  • Are your projects delivering ROIs that push key performance indicators in a positive direction?

  • Are you spending well? Are you getting your resources for the best price possible?

  • Are you keeping in mind that sometimes it’s not about spending less, it’s about spending smarter?

Qubole’s Approach to Autoscaling

When Qubole started building autoscaling technologies, it evaluated existing approaches to autoscaling and rejected them as insufficient for building a truly cloud-native big data solution. At the time, and still today, most autoscaling operates at the server level, which makes decision making reactive and leads to latency and inaccurate decisions (for example, misjudging the volume of data a query will scan, or the memory impact when more queries arrive).

To avoid this, Qubole built autoscaling into Hadoop and Spark, enabling it to access the details of big data applications and the detailed states of the cluster nodes. Being workload-aware makes a big difference when your business is trying to orchestrate Hadoop and Spark in the cloud. Figure 5-7 shows how Qubole uses the following features to enable autoscaling (a hypothetical configuration sketch follows the figure):

Workload-aware autoscaling

This is automation that allows autoscaling to adapt to different use cases and clusters based on workload demands, whether they are bursty or more constant.

Cluster life cycle management

This is when a cluster automatically starts on demand and terminates when idle. This is a big difference from on-premises systems or prebought cloud instances that remain running 24/7.

On-demand instances (nodes)

These are common cloud servers that are available to any customer immediately.

Spot nodes

These cloud servers are excess capacity for the infrastructure/datacenter. They’re sold at discount prices, but are unstable given that they can be taken back at any moment.

Figure 5-7. How Qubole autoscaling works for cost management
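
Put together, these features boil down to a handful of per-cluster knobs. The sketch below is a hypothetical configuration with invented field names (it is not Qubole's actual API), shown only to illustrate how workload-aware autoscaling, cluster life-cycle management, and the on-demand/spot mix might be expressed:

    # Hypothetical cluster configuration; field names are invented for illustration
    # and do not correspond to Qubole's actual API.
    etl_cluster = {
        "engine": "spark",
        "min_nodes": 2,                        # floor kept warm for baseline work
        "max_nodes": 100,                      # ceiling for workload-aware upscaling
        "idle_cluster_timeout_minutes": 30,    # life-cycle management: terminate when idle
        "spot_percentage": 90,                 # run up to 90% of nodes on discounted spot capacity
        "fallback_to_on_demand": True,         # replace reclaimed spot nodes with on-demand ones
    }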