Data ingestion

The role of the ingestion layer is to receive data over as wide a range of commonly used transport protocols and data formats as possible, and to provide capabilities to extract and transform this data before finally storing it.

Processing data at this stage can be viewed as extracting, transforming, and loading (ETL) it. The component that does this is often called an ingestion pipeline: it receives data from the shipping layer and pushes it to a storage layer. It has the following characteristics:

  • Generally, the ingestion layer has a pluggable architecture: a set of plugins eases integration with the various data sources and destinations. Only some of these plugins receive data from shippers; others read directly from a data source such as a file, a network, or even a database. This can make the choice ambiguous in some cases: should you use a shipper or a pipeline to ingest data from a file? The answer depends on the use case and on the expected SLAs.
  • The ingestion layer should be used to prepare the data before storage, for example by parsing and formatting it, correlating it with other data sources, and normalizing and enriching it. The most important advantage is better data quality, which in turn provides better insights for visualization. Another advantage is reduced processing overhead later on, because a value can be precomputed or a reference looked up at ingestion time. The drawback is that you may need to ingest the data again if it was not properly formatted or enriched for visualization. Fortunately, there are ways to process the data after it has been ingested.
  • Ingesting and transforming data consumes compute resources. It is essential to account for this, usually in terms of maximum data throughput per instance, and to plan for ingestion by distributing the load over multiple ingestion instances. This is a very important aspect of real-time or, to be precise, near real-time visualization: spreading ingestion across multiple instances speeds up the storage of the data and therefore makes it available sooner for visualization.
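
To make the three points above concrete, here is a minimal sketch of an ingestion pipeline in plain Python. It is not tied to any particular ingestion tool: the function names (parse, normalize, enrich, run_pipeline, partition), the event fields, and the HOST_LOCATIONS lookup table are all hypothetical and chosen only for illustration. Transforms are passed in as a list to mimic a pluggable architecture, and partition shows one simple way (round-robin) to split input across several ingestion instances.

```python
import json
from typing import Callable, Iterable, List

Event = dict
Transform = Callable[[Event], Event]

# Hypothetical reference table used for enrichment (illustrative data only).
HOST_LOCATIONS = {"web-01": "paris", "web-02": "london"}


def parse(line: str) -> Event:
    """Extract: turn one raw JSON line into a dictionary."""
    return json.loads(line)


def normalize(event: Event) -> Event:
    """Transform: lower-case the level and precompute the duration in ms."""
    out = dict(event)
    out["level"] = str(event.get("level", "info")).lower()
    out["duration_ms"] = round(float(event.get("duration_s", 0)) * 1000)
    return out


def enrich(event: Event) -> Event:
    """Transform: look up a reference value so it is stored with the event."""
    out = dict(event)
    out["location"] = HOST_LOCATIONS.get(event.get("host"), "unknown")
    return out


def run_pipeline(lines: Iterable[str],
                 transforms: List[Transform],
                 sink: Callable[[Event], None]) -> int:
    """Parse each line, apply every transform in order, then load via sink."""
    count = 0
    for line in lines:
        event = parse(line)
        for transform in transforms:
            event = transform(event)
        sink(event)
        count += 1
    return count


def partition(lines: Iterable[str], instances: int) -> List[List[str]]:
    """Round-robin split of the input, one batch per ingestion instance."""
    batches: List[List[str]] = [[] for _ in range(instances)]
    for index, line in enumerate(lines):
        batches[index % instances].append(line)
    return batches


if __name__ == "__main__":
    raw = [
        '{"host": "web-01", "level": "ERROR", "duration_s": 0.25}',
        '{"host": "web-03", "level": "Info", "duration_s": 1.5}',
    ]
    storage: List[Event] = []  # stand-in for a real storage layer
    run_pipeline(raw, [normalize, enrich], storage.append)
    for stored in storage:
        print(stored)
```

In this sketch the sink is just a list's append method; in practice it would be whatever writes to the storage layer. Because duration_ms and location are computed once at ingestion time, a later visualization step does not need to repeat that work, which is the trade-off described in the second bullet.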