Selection of the hardware stack

The choice of hardware often depends on the type of solution chosen and on where the hardware will be located. It is driven by several key metrics, such as the type of data (structured, unstructured, or semi-structured), the size of the data (gigabytes versus terabytes versus petabytes), and, to an extent, the frequency with which the data will be updated. The optimal choice requires a formal assessment of these variables, which is discussed later in the book. At a high level, we can identify three broad models of hardware architecture:

  • Multinode architecture: This typically entails multiple interconnected nodes (or servers) that work on the principle of distributed computing. A classic example is Hadoop, where multiple servers communicate bidirectionally to coordinate a job. Other technologies, such as the NoSQL database Cassandra and the search and analytics platform Elasticsearch, also run on a multinode architecture. Most of them leverage commodity servers, that is, machines that are relatively low-end by enterprise standards, which work in tandem to provide large-scale data mining and analytics capabilities. Multinode architectures are suitable for hosting data in the range of terabytes and above.
  • Single-node architecture: Single-node refers to computation done on a single server. This has become relatively uncommon with the advent of multinode computing tools, but it retains a significant advantage over distributed architectures: it sidesteps the network-related pitfalls of distributed systems. The Fallacies of Distributed Computing are a set of false assumptions commonly made when implementing distributed systems, concerning the reliability of the network, latency, bandwidth, and other considerations.
    If the dataset is structured, contains primarily textual data, and is on the order of 1-5 TB, it is entirely possible in today’s computing environment to host it on a single-node machine using specific technologies, as demonstrated in later chapters.
  • Cloud-based architecture: Over the past few years, numerous cloud-based solutions have appeared in the industry. These solutions have greatly reduced the barrier to entry in big data analytics by making it easy to provision hardware resources on demand, based on the needs of the task at hand. This materially reduces the overhead of procuring, managing, and maintaining physical hardware and hosting it in in-house data center facilities.

Cloud platforms such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform permit enterprises to provision tens to thousands of nodes at costs starting as low as 1 cent per hour per instance.

In the wake of the growing dominance of cloud vendors over traditional brick-and-mortar hosting facilities, several complementary services to manage client cloud environments have come into existence.

Examples include cloud management companies such as Altiscale, which provides big-data-as-a-service solutions, and IBM Cloud Brokerage, which facilitates the selection and management of multiple cloud-based solutions.
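The three hardware models described earlier can be summarized as a rough rule of thumb. The following Python sketch is only an illustration of that reasoning, not the formal assessment discussed later in the book. The `Workload` class, the `suggest_architecture` function, and the `own_hardware` flag are hypothetical names introduced here, and the thresholds simply restate the approximate sizes quoted in this section.

```python
from dataclasses import dataclass

GB_PER_TB = 1000  # decimal units; an assumption made for this sketch


@dataclass
class Workload:
    size_gb: float      # total size of the dataset in gigabytes
    structured: bool    # True if the data is structured
    mostly_text: bool   # True if the data is primarily textual
    own_hardware: bool  # hypothetical flag: is the team willing to procure and host servers?


def suggest_architecture(w: Workload) -> str:
    """Return a rough first guess at a hardware model for the workload."""
    # Without in-house hardware, on-demand cloud provisioning is the natural choice.
    if not w.own_hardware:
        return "cloud"
    # Structured, primarily textual data up to about 5 TB can fit on one machine.
    if w.structured and w.mostly_text and w.size_gb <= 5 * GB_PER_TB:
        return "single-node"
    # Other datasets in the terabyte range and above suit a multinode cluster.
    if w.size_gb >= 1 * GB_PER_TB:
        return "multinode"
    # Anything smaller is comfortably handled by a single server.
    return "single-node"


if __name__ == "__main__":
    example = Workload(size_gb=3000, structured=True, mostly_text=True, own_hardware=True)
    print(suggest_architecture(example))
```

In practice, the update frequency of the data and the chosen software stack would also influence the decision; they are left out here to keep the sketch short.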

The exponential decrease in the cost of hardware: The cost of hardware has gone down exponentially over the past few years. As a case in point, per Statistic Brain’s research, the cost of hard drive storage in 2013 was approximately 4 cents per GB. Compare that with $7 per GB as recently as 2000 and over $100,000 per GB in the early 1980s. Given the high cost of licensing commercial software, which can often exceed the cost of the hardware, it makes sense to allocate enough budget toward procuring capable hardware. Software needs appropriate hardware to perform optimally, so hardware selection deserves as much attention as software selection.
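To put these per-gigabyte figures in perspective, the short Python sketch below computes what storing 1 TB would have cost at each of the quoted prices. The prices are the ones cited above. The function name `storage_cost_usd` and the use of 1,000 GB per TB are assumptions made for this illustration, and the early 1980s figure is a lower bound ("over $100,000").

```python
# Approximate hard drive prices in US dollars per GB, as quoted in the text.
PRICE_PER_GB_USD = {
    "early 1980s": 100_000.00,  # lower bound: "over $100,000 per GB"
    "2000": 7.00,
    "2013": 0.04,
}

GB_PER_TB = 1000  # decimal units; an assumption made for this sketch


def storage_cost_usd(size_gb: float, price_per_gb: float) -> float:
    """Return the cost of storing size_gb gigabytes at the given price per GB."""
    return size_gb * price_per_gb


if __name__ == "__main__":
    for period, price in PRICE_PER_GB_USD.items():
        cost = storage_cost_usd(GB_PER_TB, price)
        print(f"{period}: about ${cost:,.2f} per TB")
```

At the 2013 price, a terabyte costs roughly $40, compared with $7,000 in 2000, which is a 175-fold drop in 13 years.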