Chapter 2, Comprehensive Concepts of a Data Lake, gave you a glimpse of the Data Ingestion Layer. This layer’s responsibility is to gather both stream and batch data and then apply whatever processing logic your chosen use case demands. The following figure will refresh your memory and give you a good pictorial view of this layer:
In our Data Lake implementation, the Data Ingestion Layer is responsible for consuming messages from the messaging layer and performing the transformations required to ingest them into the Lambda Layer (batch and speed layers), so that the transformed output conforms to the expected storage or processing formats. The Data Ingestion Layer must also ensure that its rate of message consumption is always greater than or equal to the rate at which messages arrive, so that messages do not queue up and add latency to processing.
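The rate requirement can be made concrete with a little arithmetic. The following is a minimal sketch, assuming steady arrival and consumption rates measured in messages per second; the function names and the figures used are hypothetical and serve only to illustrate how a backlog grows when consumption falls behind.

```python
def backlog_after(arrival_rate, consumption_rate, seconds, initial_backlog=0.0):
    """Messages still waiting after `seconds`, assuming steady rates (messages/second)."""
    net_growth = (arrival_rate - consumption_rate) * seconds
    # A backlog cannot be negative: a faster consumer simply drains the queue.
    return max(0.0, initial_backlog + net_growth)


def keeps_up(arrival_rate, consumption_rate):
    """True when the ingestion layer consumes at least as fast as messages arrive."""
    return consumption_rate >= arrival_rate


# Hypothetical figures: 1,000 msg/s arriving, 800 msg/s consumed, over one minute.
print(keeps_up(1000, 800))           # False
print(backlog_after(1000, 800, 60))  # 12000.0
```

With these assumed numbers, a shortfall of 200 messages per second leaves 12,000 messages waiting after one minute, which is exactly the kind of growing latency the ingestion layer has to avoid.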
Some of the characteristics of the Data Ingestion Layer can be summarized as follows:
- Simple and fast enough to keep up with incoming data (in our case, the output of the messaging layer)
- Capable of handling different data flows (real-time or batch, continuous or asynchronous)
- Capable of handling various data types (structured, unstructured, and semi-structured)
- Integration with various persistence store mechanisms
- Multiple transport protocol support
- Capable of handling the four V's of big data (volume, velocity, variety, and veracity)
- Capable of connecting with disparate systems and technologies
As shown in the preceding figure, we will take data from the messaging layer, enrich and transform it, and then pass it to the Lambda Layer (both the Speed Layer and the Batch Layer).
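The flow just described (consume, enrich and transform, then hand over to both layers) can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory `queue.Queue` stands in for the messaging layer, plain lists stand in for the Batch and Speed Layers, messages are assumed to be JSON, and the field names (`customer_id`, `event`, `ingested_at`, `source`) are hypothetical.

```python
import json
import queue
from datetime import datetime, timezone


def transform(raw_message, source="messaging-layer"):
    """Parse a raw JSON message and enrich it with ingestion metadata."""
    record = json.loads(raw_message)
    record["ingested_at"] = datetime.now(timezone.utc).isoformat()
    record["source"] = source
    return record


def ingest(messages, batch_sink, speed_sink):
    """Drain the queue, transform each message, and pass it to both layers."""
    count = 0
    while True:
        try:
            raw = messages.get_nowait()
        except queue.Empty:
            break  # nothing left to consume
        record = transform(raw)
        batch_sink.append(record)  # stand-in for the Batch Layer
        speed_sink.append(record)  # stand-in for the Speed Layer
        count += 1
    return count


# Stand-in for the messaging layer, loaded with two sample messages.
messaging_layer = queue.Queue()
messaging_layer.put('{"customer_id": 1, "event": "login"}')
messaging_layer.put('{"customer_id": 2, "event": "purchase"}')

batch_layer, speed_layer = [], []
processed = ingest(messaging_layer, batch_layer, speed_layer)
```

In a real deployment the queue and the two lists would be replaced by the actual messaging system and the storage or processing targets of each layer; the point of the sketch is only that every consumed message is transformed once and delivered to both paths.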