Data lakes

A data lake is a centralized repository for both structured and unstructured data, and it is becoming a popular way to store and analyze large volumes of data. It stores data as is, using open source file formats to enable direct analytics. Because data is stored in its current format, you don't need to convert it into a predefined schema, which speeds up data ingestion. As illustrated in the following diagram, the data lake is a single source of truth for all data in your organization:

Object store for data lake

The following are the benefits of a data lake:

Data ingestion from various sources: Data lakes let you store and analyze data from various sources, such as relational databases, non-relational databases, and streams, in one centralized location that acts as a single source of truth. This answers questions such as "Why is the data distributed in many locations?" and "Where is the single source of truth?"
Collecting and efficiently storing data: A data lake can ingest any kind of data, including semi-structured and unstructured data, without requiring a predefined schema. This answers questions such as "How can I ingest data quickly from various sources and in various formats, and store it efficiently at scale?"
Scale up with the volume of generated data: Data lakes allow you to separate the storage layer from the compute layer so that you can scale each component independently. This answers questions such as "How can I scale up with the volume of data generated?"
Applying analytics to data from different sources: With a data lake, you can determine the schema on read and create a centralized data catalog over data collected from various sources. This enables you to perform quick ad hoc analysis. This answers questions such as "Is there a way I can apply multiple analytics and processing frameworks to the same data?"
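To make schema on read concrete, the following is a minimal sketch rather than a production pattern. It assumes raw events are kept as JSON lines exactly as they arrived, and that the reader decides which columns and types it wants at query time. The names RAW_EVENTS, SALES_SCHEMA, and read_with_schema are hypothetical and chosen only for illustration.

```python
import json
from io import StringIO

# Raw events stored "as is" (hypothetical sample data): fields vary between
# records and nothing was validated when the data was written.
RAW_EVENTS = StringIO(
    '{"user": "ana", "amount": "19.90", "ts": "2021-03-01"}\n'
    '{"user": "bo", "amount": 5, "channel": "web"}\n'
    '{"user": "cy"}\n'
)

# Hypothetical read-time schema: column name -> (converter, default value).
SALES_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Project each raw record onto the schema chosen by the reader."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, (convert, default) in schema.items():
            value = record.get(column)
            # Missing fields fall back to the default instead of failing ingestion.
            row[column] = convert(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
total = sum(row["amount"] for row in rows)
```

A different analysis could read the same raw lines with a different schema, for example one that keeps only the channel field, without rewriting the stored data.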

Your data lake needs a scalable storage solution with virtually unlimited capacity. Decoupling processing from storage provides significant benefits, including the ability to process and analyze the same data with a variety of tools. Although this may require an additional step to load your data into the right tool, using Amazon S3 as your central data store provides even more benefits over traditional storage options.
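As an illustrative sketch of this decoupling, the example below keeps one copy of the data in a stand-in object store and analyzes it with two independent tools. The in-memory dictionary OBJECT_STORE is only a hypothetical placeholder for a service such as Amazon S3, and the key name and sample rows are assumptions. The second tool shows the extra load step mentioned above.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for an object store: key -> raw bytes, stored as is.
OBJECT_STORE = {
    "sales/2021-03.csv": b"region,amount\neu,10\nus,20\neu,5\n",
}


def read_rows(key):
    """Fetch an object and parse it as CSV; storage knows nothing about the tools."""
    text = OBJECT_STORE[key].decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def total_by_region(key):
    """Tool 1: plain Python aggregation directly over the stored object."""
    totals = {}
    for row in read_rows(key):
        totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])
    return totals


def max_sale_sql(key):
    """Tool 2: a SQL engine, which first needs the data loaded into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(r["region"], int(r["amount"])) for r in read_rows(key)],
    )
    (result,) = conn.execute("SELECT MAX(amount) FROM sales").fetchone()
    conn.close()
    return result


totals = total_by_region("sales/2021-03.csv")
largest = max_sale_sql("sales/2021-03.csv")
```

Both functions read the same stored object, so either compute path can be scaled, replaced, or removed without touching the data itself.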

The beauty of the data lake is that you are future-proofing your architecture. Twelve months from now, there may be new technology you want to use. With your data in the data lake, you can insert this new technology into your workflow with minimal overhead. By building modular systems in your big data processing pipeline, with common object storage such as Amazon S3 as the backbone, you can replace specific modules when they become obsolete or when a better tool becomes available.
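The modular idea can be sketched in a few lines, under the assumption that each pipeline stage is a function that takes and returns a list of records. All stage names here are hypothetical. Replacing a stage means changing one entry in the list of stages, while the raw data and the other stages stay untouched.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Stage = Callable[[List[Record]], List[Record]]


def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Apply each stage in order; stages only share the record format."""
    for stage in stages:
        records = stage(records)
    return records


def drop_empty_users(records: List[Record]) -> List[Record]:
    return [r for r in records if r.get("user")]


def uppercase_users_v1(records: List[Record]) -> List[Record]:
    # Older module that we may want to retire later.
    return [{**r, "user": r["user"].upper()} for r in records]


def titlecase_users_v2(records: List[Record]) -> List[Record]:
    # Replacement module with the same input and output contract.
    return [{**r, "user": r["user"].title()} for r in records]


raw = [{"user": "ana"}, {"user": ""}, {"user": "bo"}]
old_result = run_pipeline(raw, [drop_empty_users, uppercase_users_v1])
new_result = run_pipeline(raw, [drop_empty_users, titlecase_users_v2])
```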

One tool cannot do everything. You need to use the right tool for the right job, and data lakes enable you to build a highly configurable big data architecture to meet your specific needs. Business problems are far too broad, deep, and complex for one tool to solve, and this is especially true in the big data and analytics space.
