In Chapter 2, Comprehensive Concepts of a Data Lake, you got a glimpse of the Data Storage Layer. This layer is responsible for persisting the gathered data in a permanent place in our Data Lake. The Lambda Batch Layer, in turn, is responsible for creating batch views of the data stored in the Data Storage Layer. The following figure will refresh your memory and give you a good pictorial view of this layer:
The Data Storage Layer is one of the most important layers: it must persist different types of raw data coming from different source systems, and it must be easy to scale according to need. It is important for this layer to have a defined IOPS (see info), as this can be one of the deciding factors in how much data can be taken from the source systems into the Data Lake, and how frequently. For unstructured data, Hadoop is one of the de facto technologies used for this purpose. There are other mechanisms as well, such as NoSQL (see info) and NewSQL (see info).
A NoSQL (originally referring to non-SQL, non-relational, or not only SQL) database provides a mechanism for the storage and retrieval of data, which is modeled by means other than the tabular relations used in relational databases.
NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.
- Wikipedia
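To make the earlier point about IOPS (input/output operations per second) more concrete, here is a minimal sketch of the arithmetic that links a storage system's IOPS to how quickly a batch can be ingested. The figures (5,000 IOPS, 64 KiB per operation, a 10 GiB batch) and the function names are hypothetical, chosen only for illustration. Real systems are also limited by network bandwidth, replication, and access patterns, so treat the result as an upper bound rather than a prediction.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def max_throughput_bytes_per_sec(iops, io_size_bytes):
    """Upper bound on throughput: operations per second times bytes per operation."""
    return iops * io_size_bytes


def seconds_to_ingest(batch_bytes, iops, io_size_bytes):
    """Best-case time to write a batch, ignoring every bottleneck except IOPS."""
    return batch_bytes / max_throughput_bytes_per_sec(iops, io_size_bytes)


# Hypothetical storage profile: 5,000 IOPS with 64 KiB written per operation.
throughput = max_throughput_bytes_per_sec(5000, 64 * KIB)
ingest_seconds = seconds_to_ingest(10 * GIB, 5000, 64 * KIB)

print(f"Throughput ceiling: {throughput / MIB} MiB/s")      # 312.5 MiB/s
print(f"Best-case time for 10 GiB: {ingest_seconds} s")     # 32.768 s
```

Under these assumptions, a source system that produces 10 GiB per batch could not be drained more often than roughly every 33 seconds, which is why a defined IOPS figure influences both the volume and the frequency of ingestion.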
The storage layer should be able to handle the following:
- A wide variety of analytics tools bound on top of it for various queries
- Different types of data in different modes (batch and real-time)
- Different formats of data, such as structured, unstructured, and semi-structured data, with ease
- Different scaling requirements
- Various compression methodologies for efficient persistence
- Different data velocities (KB per second, MB per second, and so on)
- Different querying mechanisms and language capabilities for extracting relevant data out of the lake for various kinds of analysis, as the case may be
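As a small illustration of the compression requirement in the list above, the following sketch compares three codecs from the Python standard library on a repetitive, semi-structured payload. The sample records and the `compressed_sizes` helper are hypothetical and exist only for this example. Storage technologies used in a Data Lake typically offer their own codecs, so this is a way to see the trade-off, not a recommendation of a particular one.

```python
import bz2
import gzip
import lzma


def compressed_sizes(payload):
    """Return the size in bytes of the payload, raw and under each codec."""
    return {
        "raw": len(payload),
        "gzip": len(gzip.compress(payload)),
        "bz2": len(bz2.compress(payload)),
        "lzma": len(lzma.compress(payload)),
    }


# Hypothetical semi-structured records, repeated to mimic a log-like feed.
sample = b'{"sensor_id": 17, "status": "OK", "reading": 21.5}\n' * 10_000

sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
```

Highly repetitive data such as this compresses far better than typical production data, so representative samples from the actual source systems should be used when comparing codecs.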