282 | Big Data Simplified
11.3 DATA WAREHOUSES VS. DATA LAKES: WHAT IS YOUR STRATEGY?
In the preceding section, we were introduced to the concept of a data warehouse: a storehouse of
integrated data that pours in from data sources across different parts of an enterprise. The data
stored in a data warehouse is primarily used for reporting, for analysis and to support the various
processes involved in making business decisions. Typically, the data is gathered from transactional
and operational sources.
Let us now introduce the concept of a ‘Data Lake’. What a data warehouse and a data lake have in
common is that both offer storage for data. However, there are significant differences between the
two, which are discussed later in this section. Let us first define a data lake.
Simply put, a data lake is a storage repository that can handle large volumes and varied types of
data from a theoretically unlimited number of sources. Most importantly, it can hold vast amounts
of raw data in its native format, whether structured, semi-structured or unstructured. Neither the
structure of the data nor the requirements placed on it need to be defined until the data is
needed for processing. As such, the data can be stored in the data lake without any processing
and processed as and when the need arises. This, in turn, makes the data lake more efficient in
terms of both storage and processing.
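The idea of storing data raw and applying a structure only at processing time is often called
schema-on-read. The following is a minimal Python sketch of that idea, not a real data lake
implementation: the in-memory list raw_lake, the functions ingest and read_orders, and the field
names order_id and amount are all hypothetical names chosen for illustration.

```python
import json

# Hypothetical in-memory "lake": raw payloads kept exactly as they arrived.
raw_lake = []


def ingest(raw: str) -> None:
    """Store the payload untouched; no parsing or validation at write time."""
    raw_lake.append(raw)


def read_orders() -> list[dict]:
    """Apply a structure only when the data is needed (schema-on-read)."""
    orders = []
    for raw in raw_lake:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON (for example a log line); irrelevant to this query
        if isinstance(record, dict) and "order_id" in record and "amount" in record:
            orders.append(
                {"order_id": record["order_id"], "amount": float(record["amount"])}
            )
    return orders


# Three payloads of different shapes are all accepted without processing.
ingest('{"order_id": 1, "amount": "19.90"}')
ingest('127.0.0.1 - - "GET /index.html" 200')
ingest('{"sensor": "t-01", "reading": 21.5}')

# Structure is imposed only now, and only for the question being asked.
orders = read_orders()
```

Note that ingest never fails, whatever the payload looks like; all interpretation is deferred
to read_orders, which could be replaced by a different reader for a different question.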
Did You Know?
The term “Data Lake” was coined by the CTO of Pentaho, James Dixon.
Dixon draws an analogy between a body of water, where water is stored in its natural state,
and a Data Lake, where data is stored unprocessed for later use; hence the name Data Lake.
In his words:
‘If you think of a Data Warehouse as a store of bottled water – cleansed and packaged and
structured for easy consumption – the Data Lake is a large body of water in a more natural
state. The contents of the data lake stream in from a source to fill the lake, and various users
of the lake can come to examine, dive in, or take samples.’
11.3.1 Differences between Data Warehouse and Data Lake
It is important to understand the differences between a traditional data warehouse and
a data lake in order to decide which is more suitable for a given implementation.
a. Type of Data: A traditional data warehouse imposes constraints on data storage and
processing. It accepts and stores only data that is highly processed, transformed and
structured. A data lake, on the other hand, accepts all kinds of data, irrespective of whether
the data is structured, semi-structured or unstructured. It can keep the data secure. It can
also ingest data from a wide variety of sources that a traditional data warehouse will reject.
Examples of such sources include, but are not limited to, web server logs, social network
data, sensor data, textual data and images.
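The contrast in point (a) can be sketched in a few lines of Python. This is an illustrative toy
under stated assumptions, not how any particular product works: SALES_SCHEMA, the two loader
functions and the sample payloads are hypothetical. The warehouse loader checks each item against
a fixed schema before storing it, while the lake loader stores whatever it is given.

```python
# Hypothetical warehouse table schema: column name -> required Python type.
SALES_SCHEMA = {"order_id": int, "customer": str, "amount": float}

warehouse_rows = []  # holds only rows that conform to SALES_SCHEMA
lake_objects = []    # holds any payload exactly as received


def load_into_warehouse(row) -> bool:
    """Reject anything that is not a row with exactly the expected columns and types."""
    if not isinstance(row, dict) or set(row) != set(SALES_SCHEMA):
        return False
    if not all(isinstance(row[col], typ) for col, typ in SALES_SCHEMA.items()):
        return False
    warehouse_rows.append(row)
    return True


def load_into_lake(payload) -> bool:
    """Accept structured, semi-structured and unstructured payloads alike."""
    lake_objects.append(payload)
    return True


incoming = [
    {"order_id": 7, "customer": "acme", "amount": 120.0},  # structured sales row
    {"user": "u42", "post": "great product!"},             # social network data
    '10.0.0.5 "GET /cart" 200',                            # web server log line
    b"\x89PNG\r\n",                                        # first bytes of an image
]

accepted_by_warehouse = [load_into_warehouse(item) for item in incoming]
accepted_by_lake = [load_into_lake(item) for item in incoming]
```

Only the first item passes the warehouse check; the lake keeps all four. In practice the
rejected items would have to be transformed into the warehouse schema before they could be
loaded, which is exactly the up-front processing a data lake defers.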