Unstructured data stores

When you look at the requirements for an unstructured data store, Hadoop seems like a perfect choice because it is scalable, extensible, and very flexible. It can run on commodity hardware, has a vast ecosystem of tools, and is cost-effective to run. Hadoop uses a master-and-worker-node model, in which data is distributed across multiple worker nodes and the master node coordinates the jobs that run queries on that data. Hadoop is based on massively parallel processing (MPP), which makes it fast to run queries on all types of data, whether structured or unstructured.

When a Hadoop cluster is created, each worker node comes with a block of attached disk storage, called the local Hadoop Distributed File System (HDFS) store. You can run queries against the stored data using common processing frameworks such as Hive, Pig, and Spark. However, data on the local disk persists only for the lifetime of the associated instance.
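As a minimal sketch of that pattern, the following PySpark snippet queries data held on the cluster's local HDFS. The dataset path and column names are hypothetical, and it assumes Spark is already available on the cluster:

```python
from pyspark.sql import SparkSession

# Create a Spark session; on a Hadoop cluster this would typically run on YARN.
spark = SparkSession.builder.appName("hdfs-query-example").getOrCreate()

# Read data stored on the cluster's local HDFS (the path is hypothetical).
orders = spark.read.parquet("hdfs:///data/orders")

# Run a simple aggregation; Spark distributes the work across the worker nodes.
orders.groupBy("region").count().show()

spark.stop()
```

Because the data lives on the instances' local HDFS disks, it disappears when those instances are terminated, which is exactly the coupling problem discussed next.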

If you use Hadoop's storage layer (that is, HDFS) to store your data, then you couple storage with compute: increasing storage capacity means adding more machines, which increases your compute capacity as well. For maximum flexibility and cost effectiveness, you need to separate compute and storage and scale each independently. Overall, object storage is better suited to data lakes because it can store all kinds of data in a cost-effective and performant manner, and cloud-based data lakes backed by object storage provide the flexibility to decouple compute and storage.
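For illustration, the sketch below shows the same Spark query pattern reading from and writing to an object store instead of local HDFS. The bucket name, paths, and column names are hypothetical, and it assumes the cluster has the Hadoop s3a connector configured; the point is that compute nodes can be resized or terminated without losing data, because the data lives outside the cluster:

```python
from pyspark.sql import SparkSession

# Spark session for a transient cluster; compute can be scaled or terminated
# independently because the data lives in object storage, not on local HDFS.
spark = SparkSession.builder.appName("object-store-query-example").getOrCreate()

# Read from an object store instead of HDFS (bucket and paths are hypothetical).
events = spark.read.json("s3a://my-data-lake/raw/events/")

# The same query API works regardless of where the data is stored.
events.filter(events.event_type == "purchase") \
      .groupBy("country").count() \
      .write.parquet("s3a://my-data-lake/curated/purchases_by_country/")

spark.stop()
```

Let's learn more about data lakes.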
