Chapter 6. Tools for Making the Data Lake Platform

Now that you’ve got the frame of your data lake “house” in place, it’s time to think about the tools you need to build up the data lake platform in the cloud. These tools—in various combinations—can be used with your cloud-based data lake. And certain platforms, like Qubole, allow you to use these tools at will depending on the skills of your team and your particular use cases.

The Six-Step Model for Operationalizing a Cloud-Native Data Lake

Figure 6-1 illustrates a step-by-step model for operationalizing the data in your cloud-native data lake platform. In the subsections that follow, we discuss each step and the tools involved.

Figure 6-1. The six-step model for operationalizing data

Step 1: Ingest Data

As Chapter 5 discusses, you have various sources of data: applications, transactional systems, and emails, as well as a plethora of streaming data from sensors and the IoT, logs and clickstream data, and third-party information. Let’s focus on streaming data. Why is it so important?

First, it provides the data for the services that depend on the data lake. Data must of course arrive on time and go where it needs to go for these services to work. Second, it lets you get value from data quickly in cases where batch processing would simply take too long. Recommendation engines that improve the user experience, network anomaly detection, and other real-time applications all require streaming data.

You first need tools that provide sinks, or endpoints, so that the data can be automatically grouped into logical buckets. In other words, you need tools to help you organize the streaming data in an automated fashion, so you don’t have one “big blob” of data; you can separate it into logical groupings. Relevant tools include Kinesis Data Streams for capturing streaming data from edge devices, and Kinesis Data Firehose for delivering that data to Amazon S3 or Redshift. Then there’s Kafka plus Kafka Streams, and Spark Structured Streaming. Although aggressively promoted by Hortonworks, Apache Flume and Apache Storm are less popular today because they lack the performance and scalability of Kafka or Spark Structured Streaming.
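
To make this concrete, here is a minimal sketch of pushing a single clickstream event into a Kinesis stream with the boto3 client; Kinesis Data Firehose could then deliver the accumulated records to Amazon S3 or Redshift. The stream name, Region, and event fields are illustrative assumptions rather than part of any particular deployment.

    # A minimal sketch of writing one clickstream event to a Kinesis stream with boto3.
    # The stream name, Region, and event fields are illustrative assumptions.
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    event = {"user_id": "u-123", "action": "page_view", "page": "/pricing"}

    kinesis.put_record(
        StreamName="clickstream-events",        # hypothetical stream
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],          # keeps a user's events ordered within a shard
    )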

When you first set up your streaming pipelines, a refresh interval of every 24 hours is often treated as the default best practice, but the right interval depends on your business. For example, a financial company probably wants to know what’s been happening in the markets within the last five minutes. A shipping company has different needs: updating its data every five minutes wouldn’t add much value, so allowing a few hours of latency in the pipeline would probably be fine for that particular use case.

At this point, many businesses are already moving away from the raw batch data derived from transactional systems, applications, or IoT sensors. As you might recall, this raw data is neither cleansed nor formatted. This is just an initial stage for the data.

After the data is moved into an optimal data store in the data lake, it is cleaned and transformed. At this point, operators perform necessary governance, such as stripping out names or Social Security numbers, depending on whether your business is responsible for meeting compliance mandates such as HIPAA or PCI regulations.
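
As a simple illustration of this kind of governance, the following sketch masks anything that looks like a US Social Security number before a record lands in the cleansed zone. The record layout and the replacement token are assumptions, and real HIPAA or PCI compliance involves far more than a regular expression.

    # Minimal sketch: mask SSN-like values in a record before landing it in the cleansed zone.
    import re

    SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def mask_pii(record: dict) -> dict:
        """Return a copy of the record with SSN-like strings replaced by a token."""
        return {
            key: SSN_PATTERN.sub("***-**-****", value) if isinstance(value, str) else value
            for key, value in record.items()
        }

    print(mask_pii({"name": "Jane Doe", "ssn": "123-45-6789", "amount": 42}))
    # {'name': 'Jane Doe', 'ssn': '***-**-****', 'amount': 42}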

You can transform and clean data inline at the same time; there is no need for a separate batch process. Your users can then move on to real-time predictions and learning.

At this stage, you need to be sure that the data lineage—the data’s origins—has been tracked. You need to know not only where the data came from, but also where it moved over time and whether anything was done to it along the way.

Apache Atlas (discussed in more detail later in this section) can help you track data lineage. Remember when we talked about all the tribal knowledge held within these data silos? Atlas helps you to concentrate it into one system and expose it to people searching for data. Your users then can search for a particular column, data type, or particular expression of data.

One way to make this search process easier is to assign metadata in plain English, free of acronyms or terms that only a few technical users might understand. Think of the business user, not a member of the technical team, who will be reading the metadata and making decisions and discoveries based on it.

Querying the data stream

First, what is a data stream? In computer science, a stream is a potentially unbounded sequence of data elements made available over time. You can think of a stream as items on a conveyor belt being processed one at a time, in a continuous flow, rather than in large batches or, to continue the warehouse analogy, rather than a delivery truck periodically dropping off a large load of items all at once. Streams are processed differently than batch data: most normal system functions can’t operate on streams, because streams have potentially unlimited data.

You can run queries on the data that’s in the stream, but it isn’t the same as querying all of the data in the data lake: you query over a shorter time window, such as the past hour or the past 24 hours. You can ask questions like, how many users signed up in the past hour? How many financial transactions occurred in the past two hours?

Streams are not meant to hold massive amounts of data; they’re designed to hold data only for a short time, typically from 5 minutes to 24 hours. A streaming platform has three key capabilities. First, it can publish and subscribe to streams of data, like traditional message queues or enterprise messaging systems can. Second, it stores data streams in a reliable, fault-tolerant way. Finally, it can process streams as they arrive.

Apache Kafka

Kafka is an open source stream-processing software platform that started as a distributed message queue for stream data ingestion. Over time, it developed into a full-fledged streaming platform by including processing capabilities as well. Written in Scala and Java, Kafka offers a unified, high-throughput, low-latency platform for handling real-time data feeds.

Apache Kafka is used for two types of applications:

  • Real-time streaming data pipelines that reliably get data between systems or applications

  • Real-time streaming applications that transform or react to the streams of data
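
Here is a minimal sketch of publishing to and consuming from a Kafka topic, assuming the kafka-python client library and a broker running on localhost; the topic name and payload are hypothetical.

    # Minimal publish/subscribe sketch using the kafka-python client (an assumption;
    # confluent-kafka is another common choice). Topic and payload are hypothetical.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("orders", {"order_id": 1001, "amount": 25.0})
    producer.flush()

    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)   # e.g., {'order_id': 1001, 'amount': 25.0}
        break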

Tools to use for stream processing

The following tools can be used to process streaming data:

Spark Streaming

Although Kafka is often referred to as the “message bus,” Spark provides a streaming engine that works in conjunction with Kafka. The same data flows through Kafka; the Spark engine enables processing of the live data streams. Spark Streaming is the streaming module of the Apache Spark ecosystem (Figure 6-2), known for scalable, high-throughput, and fault-tolerant stream data processing.

Figure 6-2. How Spark Streaming works

With Spark 2.0, Spark Streaming evolved into Structured Streaming, which offers significantly more capability and simplicity, enabling you to write code for streaming applications the same way you write batch jobs. Internally, it uses the same core APIs as batch Spark, with all of the optimizations intact. It supports Java, Scala, and Python. Structured Streaming can read data from file sources such as HDFS or Amazon S3 as well as from Kafka, and you can also define your own custom sources and sinks, which could be storage such as the Amazon S3 object store or a streaming database like Druid.
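
Here is a minimal Structured Streaming sketch in PySpark that reads signup events from a Kafka topic and counts them in one-hour windows, essentially the “how many users signed up in the past hour?” question from earlier. The topic name, schema, and checkpoint path are assumptions, and the Spark Kafka connector package must be available on the cluster.

    # Minimal Structured Streaming sketch: hourly signup counts from a Kafka topic.
    # Topic name, schema, and paths are illustrative assumptions; requires the
    # spark-sql-kafka connector package to be supplied at submit time.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col, window
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("signup-counts").getOrCreate()

    schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_time", TimestampType()),
    ])

    signups = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "signups")                      # hypothetical topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    hourly_counts = (
        signups
        .withWatermark("event_time", "1 hour")
        .groupBy(window(col("event_time"), "1 hour"))
        .count()
    )

    query = (
        hourly_counts.writeStream
        .outputMode("update")
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/signups")  # assumed path
        .start()
    )
    query.awaitTermination()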

Apache Flink

Apache Flink is a scalable, high-throughput, fault-tolerant stream-processing framework popular for its very low-latency processing capabilities. Flink was built by developers in the Apache Software Foundation community, many of whom were employed by data Artisans (since acquired by Alibaba).

The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes operators in a continuous flow, allowing multiple jobs to be processed in parallel as new data arrives.

There are several key differences between Spark Streaming and Flink. Spark is a microbatch technology, so its latency is measured in seconds. In contrast, Flink offers event-by-event stream processing, so its latency can be measured in milliseconds. Flink is usually used for business scenarios in which very low end-to-end latency is important, such as real-time fraud detection. Spark is usually used for streaming ingestion and streaming processing in which subsecond latency is not required.

A point in favor of using Spark most of the time is its popularity. Data engineers are familiar with Spark for batch and ETL use cases, so they end up using Spark for streaming as well for familiarity and ease of use—unless there is a strong need for very low-latency stream processing.

Apache Druid

Apache Druid is an open source, high-performance, column-oriented distributed database built for event-driven data and designed for real-time, subsecond OLAP queries on large datasets. Druid is currently in incubation (see the upcoming note) at the Apache Software Foundation. Druid has two query languages: a SQL dialect and a JSON-over-HTTP API. Druid is extremely powerful when it comes to running fast interactive analytics on real-time and historical information.
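
Because Druid exposes its SQL dialect over HTTP, a quick way to try it is to POST a query with an ordinary HTTP client. The sketch below assumes a Druid router or broker listening on localhost:8888 and a hypothetical datasource named clickstream.

    # Minimal sketch: run a Druid SQL query over its HTTP API with the requests library.
    # Host, port, and the "clickstream" datasource are illustrative assumptions.
    import requests

    DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

    query = """
    SELECT page, COUNT(*) AS views
    FROM clickstream
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
    """

    response = requests.post(DRUID_SQL_URL, json={"query": query})
    response.raise_for_status()
    for row in response.json():
        print(row)   # e.g., {'page': '/pricing', 'views': 1234}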

Note

What Is “Incubation” for Open Source Projects?

After a project has been created within the Apache Foundation, the incubation phase is used to establish a fully functioning open source project. In this context, incubation is about developing the process, the community, and the technology. Incubation is a phase rather than a place: new projects can be incubated under any existing Apache top-level project.

Step 2: Store, Monitor, and Manage Your Data

As you begin to develop your data lake, the structures in it become well defined.

You now have some governance around what the structures of your data look like. You might now even have a DevOps team that is monitoring the data using tools like LogicMonitor or Datadog (more on these later).

At this point, you’re monitoring for operations that might be running out of bounds or not working the way you expect. For data engineers, in particular, dataflow governance is really important. As data becomes more critical to the organization, the data engineering team is in effect the headwaters of the data river. This is the team to which everyone looks if something goes wrong. And if the data doesn’t flow because the ETLs don’t happen, or if the data isn’t processed correctly, increasingly serious problems can occur downstream—potentially all the way to the senior executives, or even the CEO.

Monitoring your data lake

Monitoring is a critical and broad-ranging function for any successful data lake. Different types of users—DataOps professionals, data engineers, data scientists, and data analysts—all have different monitoring requirements. For example, DataOps professionals might be looking at resource usage, data ingest rates, and overall CapEx and OpEx efficiency, as well as cloud provider health. Data engineers, meanwhile, might be looking at SLA objectives for ETL pipelines and at data-quality reports. Data analysts and data scientists might be interested in the data-quality reports so that they can have confidence in the data provided to them.

Modern monitoring systems provide a rich set of services such as dashboarding, anomaly detection, alerting, and messaging.

Reports and anomaly detection

Monitoring services for cloud-scale applications are extremely important to build into your systems at this stage. They provide dashboarding capabilities to visualize the health of your servers, databases, tools, and services through a SaaS-based data analytics platform. Management tools with cutting-edge capabilities such as collaboration, workflow, and dashboarding include LogicMonitor, Datadog, and VictorOps.

Alerting

Alerting is one potential output of a monitoring system, drawing attention to an issue based on predetermined metrics so you can take action to resolve the problem. Modern alerting platforms let you program on-call rotations as well as escalation logic. Leaders in this space include PagerDuty, Opsgenie, and VictorOps.

Messaging/communications

Monitoring systems typically have messaging and communication components that allow your team to collaborate. Whereas the alerting system can alert a specified group or user if a specified system threshold is passed, the messaging and communication capabilities help your team resolve issues quickly and effectively. Another benefit of a centralized communication system is that it gives you the ability to search historical discussions for solutions if issues recur. Leaders in this space include Slack, Google Groups, and Discord.
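
As a simple illustration of how monitoring, alerting, and messaging fit together, the following sketch checks the freshness of a table’s latest partition and posts to a Slack incoming webhook if the data is too stale. The webhook URL, lookup, and threshold are assumptions; in practice, a tool like Datadog or PagerDuty would typically drive this kind of alert.

    # Minimal sketch: alert on stale data via a Slack incoming webhook.
    # The webhook URL, freshness lookup, and threshold are illustrative assumptions.
    from datetime import datetime, timedelta, timezone
    import requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical
    FRESHNESS_THRESHOLD = timedelta(hours=2)

    def latest_partition_time() -> datetime:
        """Stand-in for a metastore or object-store lookup of the newest partition."""
        return datetime(2023, 1, 1, 6, 0, tzinfo=timezone.utc)  # hypothetical value

    lag = datetime.now(timezone.utc) - latest_partition_time()
    if lag > FRESHNESS_THRESHOLD:
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"orders table is {lag} behind its SLA, please investigate"},
        )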

Tools for managing data services

Data services play an important role in managing an organization’s critical data. These tools are responsible for handling data ingestion, preparation, federation, and propagation in the data lake.

Many times, these systems can directly impact the overall success of a big data project by making it easier for users to discover and have confidence in the data early in a project life cycle. Data services also allow administrators to enforce governance by obfuscating sensitive data or controlling which users can access which data.

Apache Atlas

Apache Atlas is a low-level service in the Hadoop stack that provides core metadata services.

Atlas supports data lineage by helping people understand the metadata that describes the data. It’s intended to provide a discoverability mechanism so that people can better navigate and make sense of the data.

Atlas concentrates all the tribal knowledge held within all these data silos and exposes it to the people searching for data. Users can search for a column, a data type, or even a particular expression of data.
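
Atlas exposes this search capability through a REST API. The following sketch runs a basic search for Hive tables, assuming an Atlas server on localhost:21000 with default sandbox credentials; endpoint details can vary by version, so treat it as illustrative.

    # Minimal sketch: search Apache Atlas for Hive tables whose name matches "customer".
    # Host, credentials, and the exact result fields are illustrative assumptions.
    import requests

    ATLAS_URL = "http://localhost:21000/api/atlas/v2/search/basic"

    response = requests.get(
        ATLAS_URL,
        params={"typeName": "hive_table", "query": "customer", "limit": 10},
        auth=("admin", "admin"),   # default sandbox credentials; change in production
    )
    response.raise_for_status()
    for entity in response.json().get("entities", []):
        print(entity["typeName"], entity["attributes"].get("qualifiedName"))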

Step 3: Prepare and Train Data

Data preparation is important at the data lake level during the ingestion process. If data is not transformed and cleaned here, you will need to do it downstream in other data pipelines, which can cause inconsistency or duplicate workloads. You need to focus on preparing data to put into a data lake that’s accessible for one source of truth; in effect, this allows your data teams to speak the same language about data and easily consume it. We can break down data preparation into four main areas:

  • Data discoverability

  • Data preparation

  • Data fusion (augmentation)

  • Data filtering

Let’s take a closer look at each of these areas.

Data discoverability is important during data preparation because as your organization grows, it can be very time-consuming to try to identify what data is available. You need to make it easy for users to log into a data portal and see what data assets they can access. It is also critical that your users have a high degree of confidence in the data that you are putting into the data lake.

Data preparation is not one step, but a combination of many steps. For example, ingestion basically means consuming data from sources. That’s the first step in data preparation, which is ensuring that the data is coming in. It might be coming from multiple sources, and you need to connect to all of them and get the data, which is essentially raw, into a new system (data lake).

This data is not yet ready for direct consumption, which is where sanitizing comes in. This is also known as data “cleansing,” in which you remove redundancies and incorrect records and values.

Data fusion, also referred to as data augmentation, is when you fuse data from one source with data from other sources. Implementing good master-data management and data-preparation practices enables users to more easily fuse disparate data sources. Then, having a single shared repository of data and breaking down data silos means that users can extract value from more types of data in a much shorter timeframe.

Data filtering comes in during data preparation, when the data team helps you filter out everything but the data that’s useful to you. Some groups want only a subset of the data. For instance, if you’re working in a big bank and you’re in the equities group, you might not want to see bond data.

This is especially important to companies that must comply with a lot of data privacy or security mandates, like health care organizations or large banks. For example, the latter has tightly controlled structures, so data could be coming from the mortgage capital division, the equities division, the bonds division, or wealth management. Users might need different segments of data that come out from the general data lake in which this data has been stored. You typically need to filter based on your requirements. When you filter, the datasets are refined into simply what you need, without including other data that can be repetitive, irrelevant, or sensitive.
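
Here is a minimal PySpark sketch of these steps: cleansing raw orders, fusing them with customer reference data, and filtering down to a single division. The paths, column names, and the “equities” filter are illustrative assumptions.

    # Minimal sketch of cleansing, fusing, and filtering data with PySpark.
    # Paths, columns, and the "equities" filter are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("prepare-orders").getOrCreate()

    # Cleanse: drop duplicates and records missing key fields.
    raw_orders = spark.read.json("s3://my-lake/raw/orders/")        # hypothetical path
    clean_orders = (
        raw_orders.dropDuplicates(["order_id"])
                  .dropna(subset=["order_id", "customer_id"])
    )

    # Fuse (augment): join with customer reference data from another source.
    customers = spark.read.parquet("s3://my-lake/reference/customers/")
    enriched = clean_orders.join(customers, on="customer_id", how="left")

    # Filter: keep only the subset a particular group needs (e.g., the equities division).
    equities_orders = enriched.filter(col("division") == "equities")

    equities_orders.write.mode("overwrite").parquet("s3://my-lake/prepared/equities_orders/")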

The Importance of Data Confidence

Data preparation is all about sanitizing, filtering, and cataloging the data to capture its lineage. Why do you need to do this? Because you’ll always have issues related to data quality. How confident are you in the data? Do you have a clean set of the data? Is the data complete and was it processed correctly? These are probably the most important questions you’ll hear from users within your company.

Data preparation is all about ensuring the quality and consistent timely arrival of data. The absolute last thing you want is the CEO coming to you and asking, “Where’s my data?” or worse, “Where did these numbers come from? The CFO thinks they’re wrong.”

That’s not fun, because then you must scramble and go through all the different channels and groups that might have touched that data before it got to the CEO.

Tools for Data Preparation

Hive Metastore is a service that stores the metadata for Hive tables and partitions in a relational database. It then provides clients (including Hive itself) with access to this information through the metastore service API, which is used later, during SQL query processing.
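
Because engines such as Spark resolve table definitions through the metastore, a quick way to see what it knows is the Spark catalog API, as in this minimal sketch; the "sales" database is an assumption.

    # Minimal sketch: list tables and columns registered in the Hive Metastore via Spark.
    # The "sales" database is an illustrative assumption.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metastore-browse").enableHiveSupport().getOrCreate()

    for table in spark.catalog.listTables("sales"):
        print(table.name, table.tableType)
        for column in spark.catalog.listColumns(table.name, dbName="sales"):
            print("  ", column.name, column.dataType)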

Apache Atlas, as we’ve discussed, is a scalable and extensible framework for data governance including metadata management and potentially also data usage and lineage. It allows enterprises to define and classify data and build a unified catalog of governable data assets. This unified catalog can be used by data analysts and scientists for their data analysis as well as by engineers and IT administrators in defining data pipelines. Security personnel can use it for audit and compliance monitoring through data usage and lineage.

Step 4: Model and Serve Data

The more you can empower users by lowering the barriers to access data and the right engines to process said data, the faster you will derive value from your investment in big data technologies.

In particular, moving to a self-service culture—as described in Chapter 2—will free up time for your data scientists and engineers to begin using advanced analytics and AI tools such as machine learning instead of worrying about provisioning infrastructure resources. This shift is essential if you want to reap the competitive benefits of big data in the data lake.

First, let’s examine some definitions and talk about some ways in which you can use machine learning in the data lake.

Machine learning defined

Machine learning is a specific type of AI technology that enables systems to automatically learn and improve from accessing data and having “experiences.” Many experts today consider machine learning to be a “productized statistical implementation.”

By creating a machine learning algorithm, data scientists are basically teaching a machine how to respond to inputs that it has not seen before, while still producing accurate outputs. Machine learning is often used to learn from big data and forecast future events based on that data.
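
As a minimal, generic illustration of that teaching process, the following sketch trains a classifier on labeled historical examples and then scores inputs it has never seen; the features and the churn label are synthetic, for illustration only.

    # Minimal sketch: train a model on labeled history and predict on unseen inputs.
    # The churn features and labels are synthetic, for illustration only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X = rng.random((1000, 3))                        # e.g., tenure, usage, support tickets
    y = (X[:, 0] + 0.5 * X[:, 2] > 0.9).astype(int)  # synthetic "churned" label

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    print("accuracy on unseen examples:", model.score(X_test, y_test))
    print("prediction for a new customer:", model.predict([[0.7, 0.1, 0.9]]))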

Machine learning is becoming quite common, and many of our day-to-day activities are affected and informed by it. Take Netflix, which predicts what other movies or television shows you might like depending on your past viewing history. Or Amazon’s famous recommendation engine, which makes shopping suggestions based on your purchasing patterns. These both utilize machine learning.

Use cases for machine learning

Here are some of the ways in which machine learning can be used in a real-world business setting:

Deploy AI-powered robotic process automation (RPA)

Cognitive RPA combines process automation with machine learning. RPA by itself is good for basic, repetitive rules-based tasks such as streamlining HR onboarding procedures or processing standard purchase orders. When machine learning is added to it, RPA can do more sophisticated tasks like automating insurance risk assessments. By augmenting traditional rules-based automation with machine learning, RPA software robots (“bots”) can even make simple judgments and decisions.

Improve sales and marketing

Sales and marketing operations generate huge amounts of unstructured data that previously went untapped. For example, companies are using machine learning to do customer sentiment analysis based on remarks made in social media or on sales calls as well as forecasting sales and customer churn based on detecting complex patterns of customer behavior that would otherwise go unnoticed.

Streamline customer service

When coupled with other AI techniques such as natural language processing, machine learning and big data have the opportunity to transform customer service. Already, customers interact with companies using chatbots and virtual digital assistants, and the huge quantities of analysis-ready data that these interactions produce boggle the imagination. Such “virtual agents” today use machine learning algorithms to parse customer questions or statements about problems and provide a speedy resolution—problem-solving that gets better as more time passes and more data becomes available for the system to learn from.

Bolster security

Machine learning can also help companies enhance threat analyses and prevent potential security breaches. Predictive analytics can detect threats early, and machine learning enables you to monitor the millions of data logs from IoT devices and learn from each incident how best to thwart attacks.

This isn’t to say that machine learning is without challenges. According to the 2018 Qubole big data survey, many obstacles impede the effective use of machine learning. The number one obstacle is learning how to analyze extremely large datasets (40% of respondents), followed by being able to secure adequate resources—including expert staff (see Figure 6-3).

Figure 6-3. Common challenges with machine learning

Tools for Deploying Machine Learning in the Cloud

When it comes to deploying machine learning models, a number of technologies have recently emerged to help provide data scientists with more self-service capabilities that enable them to take their models into a production pipeline.

Open Source Machine Learning Tools

In the open source world, a few of these tools that have gained popularity recently are MLflow, Kubeflow, and MLeap, which focus on cataloging and managing models that are deployed into production for large-scale datasets.
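
For example, MLflow’s tracking API lets a data scientist record the parameters, metrics, and trained model from a run so that the model can later be promoted into a production pipeline. Here is a minimal sketch; the experiment name and toy model are illustrative assumptions.

    # Minimal sketch: log a training run with MLflow so the model can be cataloged and deployed.
    # The experiment name and toy model are illustrative assumptions.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)

    mlflow.set_experiment("churn-model")           # hypothetical experiment
    with mlflow.start_run():
        model = LogisticRegression(C=0.5).fit(X, y)
        mlflow.log_param("C", 0.5)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, "model")   # stored so it can be served later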

Managed Machine Learning Services

DataRobot, Dataiku, and H2O are managed services that help data scientists build and train machine learning models for predictive applications. These companies are also helping provide the new wave of “Algorithms-as-a-Service” offerings, which aim to provide point-and-click solutions for machine learning across a wide variety of verticals and data. They also provide capabilities for managing and deploying models.

Cloud Machine Learning Services

Lastly, cloud providers are also beginning to offer solutions that manage the end-to-end workflow for data scientists. Recently, AWS released SageMaker, a tool designed to make it easier for novices to develop machine learning models. SageMaker does this by providing common, built-in machine learning algorithms along with easy-to-use tools for building machine learning models.

SageMaker supports Jupyter Notebooks and includes drivers and libraries for common machine learning platforms and frameworks. Data scientists can use SageMaker to launch prebuilt notebooks from AWS, customizing them based on the dataset and schema they want to train. Additionally, data scientists can take advantage of custom-built algorithms written in one of the supported machine learning frameworks or any code that has been packaged as a Docker image.

Traditionally, data engineers have had to rework models built by data scientists before embedding them in production-quality applications. SageMaker hosting services enable data scientists to deploy their models independently by decoupling them from application code. SageMaker can pull virtually unlimited data from Amazon S3 for training.

Step 5: Extract Intelligence

What helps people understand data best are pictures and charts. That’s because we’re human and can take in only so much information at once. It really is true that a picture is worth a thousand words.

Databases are the cornerstone of how we store and interact with data on a daily basis. These systems have been evolving and maturing since the 1970s and support a wide range of business tools that users deploy to digest and report on data. These databases work well for structured data such as OLTP and OLAP workloads. At the same time, the increase in volume and variety of data now requires database engines to run in more distributed ways.

Examples of these types of engines are Google BigQuery, Snowflake, Redshift, Druid, and Presto. Because most of these tools are built with SQL capabilities, you can plug and play your BI solution into most big data technologies that support ANSI-SQL. Having these tools connected to your data lake eliminates the huge wait time it previously took to query data warehouses, reducing the time to get access to information from hours or days to near real time or minutes.

Visualization tools such as Looker, Power BI, and Tableau will always be important. Even Excel on top of the data lake can be valuable, because now you can connect your BI tools directly to your data lake—not just a traditional data warehouse—to perform advanced analytics that require immediate access to petabytes of data, to support data discovery, or to serve the many other use cases we see today.

Tools for Extracting Intelligence

The following are some tools that you can use to get actionable intelligence from your data lake.

Looker

Looker is a BI tool that provides visualization capabilities and real-time analysis. Users can select different types of visualizations from the Looker library or create custom visualizations, including bubble charts, word clouds, chord diagrams, spider-web charts, and heat maps. Looker offers analytics code blocks (Looker Blocks) with SQL patterns, data models, and visualizations included. Although the blocks are prebuilt, they are customizable and can be adjusted to the needs of the user. Looker’s modeling language, LookML, builds on SQL concepts while adding reusable, higher-level modeling constructs.

Power BI

Power BI is a business analytics service by Microsoft. It offers interactive visualizations with self-service BI capabilities, allowing users to create reports and dashboards without having to ask IT staff or database administrators for assistance.

Tableau

Tableau is a BI and data visualization tool with an intuitive user interface. Users don’t need to know how to code, and they can drill down into data and create reports and visualizations without intervention from the data team or IT.

Getting Data Out of Your Data Lake

When you begin thinking of extracting data from your data lake to drive business insight, you generally have two options: canned reports and ad hoc queries.

Canned reports are typically executive reports that are regularly generated. Whether they’re created every morning or even every hour, a pipeline runs a set of predefined data operations that usually results in a PDF or dashboard. Canned reports tend not to change very often, because they are really what runs the day-to-day business.

Ad hoc reports involve users asking general questions of the data, such as “I have a customer on the phone, what’s happening with his order? When will this product arrive in inventory?” and so on. These reports are useful when you’re trying to solve a problem in the moment, but they don’t necessarily need to be persistent like a daily or weekly report.

Presto for Ad Hoc Analytics

One of the key personas in a large company’s data team is the data analyst. Typically, the job of a data analyst is to explore data, generate reports, and interpret how the business is doing. Usually, a lot of data comes in. It might be in real time, it might be on an hour’s delay, or it might be on a delay of a day or more. It doesn’t matter. The job of the analyst is to analyze the data to reveal various metrics about the business.

Sometimes, an analyst needs data about something that is happening now. For example, a shared-ride aggregator like Lyft might want to know how one of its ride categories is doing in Seattle today because it did a promo there. Or, it may want to do a little bit more historical analysis that investigates which category of service different customer demographics like. For example, are college students going for the cheap cars, are they going for the midsized cars, or are they going for the SUVs?

This is where Presto shines compared to Spark and Hive. Recall from Chapter 3 that Presto is built for running fast, large-scale analytics workloads distributed across multiple servers. Presto is built for SQL, which is the common skill among analysts. Because SQL is a declarative language that most analysts already know, it helps them transition to Presto from other databases. They do not need to worry about the underlying technology or the scale of the system needed to handle big data analysis.

When it comes to query runtime performance, Presto will not match the millisecond responses offered by some of the traditional databases out there (e.g., Oracle, Greenplum, or DB2), but it is generally considered faster than Spark and Hive. Presto also shines for common analytics ETL operations, such as large joins and table aggregations. All in all, there is a good chance that Presto will work with more business tools than the other engines. For these reasons, if you must move away from one of the more traditional mainstream databases to something that is built for the data lake, Presto is the easiest option.

Because of its support for federated queries, Presto fits well into almost any ecosystem, as depicted in Figure 6-4. It works well with Hive Metastore, HDFS, relational database services, and cloud storage. It also integrates with other data stores such as MySQL, PostgreSQL, MongoDB, and Oracle. This helps analysts quickly bring together ad hoc data sources for impromptu analysis without the need for an ETL phase.

Figure 6-4. Operational dataflow using Presto in the cloud
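
Assuming the presto-python-client package (imported as prestodb) and a Presto coordinator reachable on port 8080, the following sketch runs a federated ad hoc query that joins a Hive table in the data lake with a reference table in MySQL; the host, catalogs, and table names are hypothetical.

    # Minimal sketch: a federated ad hoc query through Presto's Python DB-API client.
    # Assumes the presto-python-client package; hosts, catalogs, and tables are hypothetical.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )

    cursor = conn.cursor()
    cursor.execute("""
        SELECT r.city, count(*) AS rides
        FROM hive.rides.trips r
        JOIN mysql.reference.promotions p
          ON r.promo_id = p.promo_id
        WHERE r.trip_date = DATE '2019-06-01'
        GROUP BY r.city
        ORDER BY rides DESC
        LIMIT 10
    """)

    for city, rides in cursor.fetchall():
        print(city, rides)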

Because you have a lot of data, speed is definitely important. You probably don’t need subsecond speeds, but you definitely need subminute speeds. And because you have hundreds of terabytes of data and want subminute speeds, it’s safe to assume you’ll want a cluster of machines to do the job. With the right capacity planning, Presto can handle this workload well.

On top of that, Presto has also been adopted by large companies like Uber, Lyft, and Netflix. This means that it is also battle hardened, not just at Facebook, but at some of the other top technology companies in the world. There is therefore less of a burden on other companies to do quality and stress testing. The technology world is doing that for us.

Ad Hoc Versus Canned Reporting

As we’ve discussed, the data lake is meant to work for both structured reporting—often called prepackaged, prebuilt, or canned reporting—and ad hoc reporting, in which you’re searching for data in an unexpected or unique way. Data lakes are especially good for ad hoc reporting—the spontaneous SQL or search queries to which you wouldn’t know how to tune your relational database to respond.

The type of report depends on the type of user

Different users will prefer one type of report over the others. Users without particular technical expertise, or who lack SQL experience, are generally given the prebuilt, canned reports along with access to the drill-down reports available from popular visualization tools. Typically, the higher you go in an organization, the more users will want these simple canned reports that show only high-level views of data.

As you go down the organizational hierarchy, you will want to deliver more elaborate reports—still canned—that are consumed by mid-level management and some data analysts. These kinds of reports go into more depth, supporting “what if” varieties of ad hoc exploration on a limited dataset.

Finally, some data analysts will want to do their own ad hoc querying in SQL for ad hoc reporting.

The challenges of dealing with ad hoc queries

There are two particular challenges when it comes to ad hoc queries: ensuring quality and balancing speed of reporting with speed of searching.

The first challenge is data quality. Organizationally, you need a formal program for evaluating data quality. After all, the reports will only be helpful if the data is accurate, up to date, and meaningful. To set up a program like this, you would need to inject quality checks into your ETL processes. You might do random checks of certain datasets or metadata.
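
One lightweight way to inject such checks into an ETL job is to assert a few expectations about each batch before publishing it, as in the sketch below; the file path, columns, and thresholds are assumptions, and dedicated tools such as Great Expectations or Deequ take this idea much further.

    # Minimal sketch: simple data-quality assertions on a batch before it is published.
    # The file path, columns, and thresholds are illustrative assumptions.
    import pandas as pd

    batch = pd.read_parquet("/data/staging/orders/2019-06-01.parquet")  # hypothetical path

    checks = {
        "has rows": len(batch) > 0,
        "order_id is unique": batch["order_id"].is_unique,
        "amount is never negative": (batch["amount"] >= 0).all(),
        "under 1% missing customer_id": batch["customer_id"].isna().mean() < 0.01,
    }

    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise ValueError(f"Data-quality checks failed: {failed}")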

The other challenge is balancing speed of reporting and speed of searching. Typically, the data analytics team that receives requests from users monitors the various types of requests. If a number of requests for the same type of query grows, you have an obvious need to create a standardized report based around that query. It could even become a separate dataset prebuilt for that query that is faster to access and therefore can satisfy those requests quickly.

This will reduce your ability to do additional ad hoc searching on that dataset, but it increases speed and helps you avoid having to do additional programming to deliver similar reports to users over and over again. And, ultimately, you want to shift users as far as possible from having to ask for ad hoc reports to receiving prepared reports.

The need for competence in analytics reporting

In analytics reporting, you also need what we’ll call competence. The three main areas of competence when it comes to analytics are: data-enablement competence, decision-support competence, and data-excellence competence. Let’s look a bit more closely at each of these:

Data-enablement competence

This means providing platform information, data integrity, and data and enterprise integration in a way that ensures the data will always be available and reliable. This is critical to a smooth analytics reporting function.

Decision-support competence

For your team to be competent at decision support, you first must acknowledge that there are two sides to successfully applying data science. One is that the data scientist is expected to be knowledgeable about implementing various mathematical models to detect patterns in the data. The other side of it is that data scientists also need to understand the business, or they won’t be of any real help to the bottom line.

Data-excellence competence

What happens with data in the end is very much a function of how well and how tight the data-excellence processes are. These processes include what the data team—data engineering in particular—is doing to ensure that the data is of the highest quality so that business decisions can be made using it.

Structurally, this means that you can organize your data teams in two different ways. The first way is to have data scientists work closely and directly with users within your different business units or with executives in the C-suite. To do this, the data scientists must both understand the business’s needs and use their experience and expertise to deliver insights to the business based on big data.

Alternatively, you could put the data scientists into two different groups by function. One group would primarily interact with business users to gather their needs. Then, they would take those needs to the other group, who would do the actual data science. We think the former is a much healthier way to go, and thus we recommend that businesses go that route.

Step 6: Productionize and Automate

At this stage, you have your data lake built. You are focused on making it into a production-ready resource and improving its operations through automation.

Here’s how you’ll know that your data lake is production ready:

  • Your users have clearly defined expectations and specific goals that they’ve prioritized for using the big data in your data lake.

  • Your data lake possesses the security and governance required of any enterprise-class infrastructure or resource.

  • You can quickly scale and add new storage, compute, and network capacity that exactly matches your needs at the moment, with no waste.

  • Your data team has the required skills to support the data lake throughout the data life cycle.

  • You can efficiently perform incident response, manage trouble tickets, provide training, and in general deliver all the support functions required of an enterprise-class operation.

  • You have automated as much as possible to reduce errors and the stress on the data team.

Tools for Moving to Production and Automating

Now that you are ready to put your data lake into production and automate it, you will need tools. Happily, a number of great tools have emerged so that you don’t need to build them yourself. Specifically, you can deploy workflow schedulers and ETL managed services from the open source world to help you with this final stage of building your data lake.

Open Source Workflow Schedulers

Apache Airflow is an open source tool for orchestrating complex computational workflows and data-processing pipelines. An Airflow workflow is designed as a directed acyclic graph (DAG). When authoring a workflow in Airflow, you divide it into tasks that can be executed independently.
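
Here is a minimal DAG sketch, assuming Airflow 2.x; the task commands and schedule are placeholders. Each task can be retried and monitored independently, which is exactly the point of dividing the workflow into tasks.

    # Minimal sketch of an Airflow DAG (assuming Airflow 2.x); commands are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_orders_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ingest = BashOperator(task_id="ingest", bash_command="echo ingest raw orders")
        cleanse = BashOperator(task_id="cleanse", bash_command="echo cleanse and mask PII")
        publish = BashOperator(task_id="publish", bash_command="echo publish to prepared zone")

        ingest >> cleanse >> publish   # run the tasks in order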

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are DAGs of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Azkaban is an open source batch workflow scheduler created at LinkedIn to run Hadoop jobs. It resolves job ordering through dependencies and gives developers control over job execution, especially inside Hadoop projects.

Pinball is workflow management software developed to manage big data pipelines. It is available as open source.

ETL Managed Services

ETL is a process for preparing data for analysis. It is most commonly associated with data warehouses, but it can also apply to data lakes. Extract involves pulling data from homogeneous or heterogeneous sources. Transform converts that data into the chosen format for analysis. Finally, load places the data into the final repository—in our case, the data lake. Following are three tools that you can use to manage your ETL tasks:

Informatica

A widely used ETL tool, Informatica’s PowerCenter helps you to extract data from multiple sources, transform it to the right format for your data lake, and load it into the data lake.

Pentaho

This is BI software that provides data integration, OLAP services, reporting, information dashboards, data mining, and ETL. Pentaho enables businesses to access, prepare, and analyze data from multiple sources.

Talend

This is an open source data integration platform that provides various software and services for data integration, data management, enterprise application integration, data quality, and cloud storage.
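
Whichever tool you choose, the underlying pattern is the same extract, transform, load sequence. Here is a minimal sketch of that pattern in plain Python, with hypothetical paths and columns, just to make the three stages concrete.

    # Minimal extract-transform-load sketch; paths and columns are illustrative assumptions.
    import pandas as pd

    # Extract: pull raw data from a source (here, a CSV export from a transactional system).
    raw = pd.read_csv("/exports/crm/contacts.csv")

    # Transform: cleanse and reshape into the format chosen for analysis.
    transformed = (
        raw.dropna(subset=["email"])
           .assign(email=lambda df: df["email"].str.lower())
           .drop_duplicates(subset=["email"])
    )

    # Load: write the result into the data lake's prepared zone as Parquet.
    transformed.to_parquet("/data-lake/prepared/contacts/contacts.parquet", index=False)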
