Data Storage and the Lambda Batch Layer

In Chapter 2, Comprehensive Concepts of a Data Lake, you got a glimpse of the Data Storage Layer. This layer's responsibility is to persist the gathered data permanently in our Data Lake. The Lambda Batch Layer's responsibility is to create batch views of the data stored in the Data Storage Layer. The following figure will refresh your memory and give you a pictorial view of these layers:

Figure 01: Data Lake - Data Storage and Lambda Batch Layer

The Data Storage Layer is one of the most important layers: it must persist different types of raw data coming from different source systems, and it must be easy to scale as needs grow. It is important for this layer to have a defined IOPS figure (see info), because this can be one of the deciding factors in how much data can be taken from the source systems into the Data Lake, and how frequently. For unstructured data, Hadoop is one of the de facto technologies used for this purpose. Other storage mechanisms include NoSQL (see info) and NewSQL (see info) databases.

Input/output operations per second (IOPS, pronounced eye-ops) is a performance measurement used to characterize computer storage devices, such as hard disk drives (HDDs), Solid State Drives (SSDs), and Storage Area Networks (SANs).

A NoSQL (originally referring to non-SQL, non relational, or not only SQL) database provides a mechanism for the storage and retrieval of data, which is modeled by means other than the tabular relations used in relational databases.

NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.

- Wikipedia
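To make the link between IOPS and ingestion capacity concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes that sustained throughput is roughly IOPS multiplied by the I/O size, which ignores caching, queue depth, and the read/write mix. The function names and the sample figures (5,000 IOPS, 64 KB operations, 100 GB batches) are illustrative assumptions, not values taken from this book.

```python
def throughput_mb_per_s(iops: int, io_size_kb: int) -> float:
    """Rough sustained throughput in MB/s: IOPS x I/O size (assumed model)."""
    return iops * io_size_kb / 1024


def max_batches_per_hour(iops: int, io_size_kb: int, batch_size_gb: int) -> int:
    """How many complete batches of the given size fit into one hour of writes."""
    mb_per_hour = throughput_mb_per_s(iops, io_size_kb) * 3600
    return int(mb_per_hour // (batch_size_gb * 1024))


# Hypothetical storage tier: 5,000 IOPS with 64 KB operations.
rate = throughput_mb_per_s(5000, 64)
batches = max_batches_per_hour(5000, 64, 100)
print(f"{rate} MB/s, about {batches} batches of 100 GB per hour")
```

Under these assumptions, the storage tier sustains about 312.5 MB/s, so at most ten 100 GB batches can land per hour. An estimate of this kind is only a starting point for deciding how much and how often source data can be pulled in.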

The storage layer should be able to handle the following:

  • A wide variety of analytics tools running on top of it for various queries
  • Different types of data in different modes (batch and real-time)
  • Different formats of data, such as structured, unstructured, and semi-structured data, with ease
  • Different scaling requirements
  • Various compression methodologies for efficient persistence
  • Different data velocities (KB per second, MB per second, and so on)
  • Different querying mechanisms and language capabilities for extracting relevant data out of the lake for various analyses, as the case may be
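
The compression requirement in the list above can be explored with a small experiment. The following sketch compares how much space the same payload takes under three codecs from the Python standard library. These codecs are chosen only because they ship with Python; they are an assumption for illustration and not necessarily the codecs used in a given Data Lake. The sample record and the `compressed_sizes` helper are hypothetical as well.

```python
import bz2
import gzip
import lzma


def compressed_sizes(payload: bytes) -> dict:
    """Return the size in bytes of the payload under each codec."""
    return {
        "raw": len(payload),
        "gzip": len(gzip.compress(payload)),
        "bz2": len(bz2.compress(payload)),
        "lzma": len(lzma.compress(payload)),
    }


# Hypothetical semi-structured data: one JSON-like record repeated many times.
sample = b'{"sensor": "s-01", "reading": 21.5}\n' * 10_000
sizes = compressed_sizes(sample)

for codec, size in sizes.items():
    print(f"{codec:>5}: {size} bytes")
```

Highly repetitive data such as this compresses very well with any codec. Real data will show smaller gains, and the trade-off between compression ratio and CPU cost differs per codec, so it is worth measuring with a representative sample.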