Selection of the software stack

The choice of software stack for data mining depends on individual circumstances. The most popular options specific to data mining are outlined below, along with a couple of alternatives that, although less well known, are just as capable of managing large-scale datasets:

  • The Hadoop ecosystem: The term big data arguably got its start in the popular domain with the advent of Hadoop. The Hadoop ecosystem consists of multiple projects run under the auspices of the Apache Software Foundation. Hadoop supports nearly all the types of datasets well known in the big data space, such as structured, unstructured, and semi-structured data. Its thriving ecosystem of auxiliary tools that add new functionality, and a rapidly evolving marketplace in which companies vie to deliver the next big thing in big data, mean that Hadoop will be here for the foreseeable future. Apart from the projects in the wider ecosystem, Hadoop has four primary components. They are as follows:
    • Hadoop Common: The common utilities that support the other Hadoop modules
    • Hadoop Distributed File System (HDFS™): A distributed filesystem that provides high-throughput access to application data
    • Hadoop YARN: A framework for job scheduling and cluster resource management
    • Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
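To make the MapReduce model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which allows the mapper and reducer to be plain Python scripts that read from stdin and write to stdout. The script names and paths are illustrative, and the exact location of the streaming JAR varies by distribution.

```python
#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum the counts per word; Hadoop sorts mapper output by
# key, so all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job of this shape would typically be submitted with something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out`, with the JAR and HDFS paths adjusted to the cluster at hand.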
  • Apache Spark™: Apache Spark is a multinode computing framework first conceived at the University of California, Berkeley's AMPLab as a platform that provides a seamless interface for running parallel computations and overcomes the limitations of the Hadoop MapReduce framework. In particular, Spark internally leverages DAGs (directed acyclic graphs) to optimize a set of operations into a smaller, more computationally efficient set. In addition, Spark exposes APIs (application programming interfaces) for commonly used languages such as Python (PySpark) and Scala (the natively available interface). This removes one of the barriers to entry into the Hadoop space, where knowledge of Java is essential.

Finally, Spark introduces a data structure called the Resilient Distributed Dataset (RDD), which provides a mechanism to store data in memory, dramatically improving data retrieval and, in turn, processing times. Beyond the core API, a Spark deployment involves two further components (a short PySpark sketch follows this list):

    • Cluster manager: The nodes constituting a Spark cluster communicate using cluster managers, which handle the overall coordination among the nodes in the cluster. As of this writing, the cluster manager can be the standalone Spark cluster manager, Apache Mesos, or YARN. There is also a facility for running Spark on AWS EC2 instances using spark-ec2, which automatically sets up an environment for running Spark programs.
    • Distributed storage: Spark can access data from a range of underlying distributed storage systems such as HDFS, S3 (AWS storage), Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. It should be noted that Spark can be used as a standalone product and does not require Hadoop to operate. Newcomers to Spark are often under the impression that Hadoop, or more concretely HDFS, is needed for Spark operations; this is not true. Spark supports multiple cluster managers as well as backend storage systems, as shown in this section.
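Here is a minimal PySpark sketch, assuming a local installation (for example, via `pip install pyspark`) and no Hadoop cluster at all. It illustrates the points above: transformations merely extend the DAG, `cache()` marks an RDD for in-memory storage, and nothing executes until an action is invoked.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; no Hadoop/HDFS installation is required.
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Transformations (filter/map) only extend the DAG; nothing runs yet.
numbers = sc.parallelize(range(1_000_000))
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# cache() requests in-memory storage of the RDD after its first computation.
even_squares.cache()

print(even_squares.count())  # action: triggers execution and populates the cache
print(even_squares.take(5))  # action: served from the in-memory copy

spark.stop()
```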
  • NoSQL and traditional databases: A third consideration in selecting the software stack is NoSQL databases. The term NoSQL is relatively recent and is meant to distinguish databases that do not follow the traditional relational database model. There are both open source and commercial variations of NoSQL databases, and cloud-based options have become increasingly common. Some of the more common paradigms among the broad classifications of NoSQL databases are as follows:
    • Key-value: These NoSQL databases store data on the principle of hashing: a unique key identifies a set of properties about that key. An example of a key in this parlance could be the national ID number of an individual (such as the Social Security Number, or SSN, in the US and Aadhaar in India), associated with various attributes of the individual such as name, address, and phone number. The end user of the database queries by the ID number to directly access information about the individual (see the Redis sketch after this list). Open source key-value databases such as Redis and commercial ones such as Riak are very popular.
    • In-memory: Databases that use in-memory facilities, such as caching data in memory for faster access than disk allows, have always existed, but they were adopted more broadly with the advent of big data. Accessing data in memory (~100 nanoseconds) is orders of magnitude faster than accessing the same information from disk (1–10 milliseconds, roughly 10,000 to 100,000 times slower). Several NoSQL databases, such as Redis and kdb+, leverage in-memory storage to provide faster access to frequently used data.
    • Columnar: These databases store tables as columns of data rather than as rows. The primary advantage of columnar storage over row-based storage is that data can be accessed faster, with reduced I/O overhead, making it particularly well suited to analytics use cases. By segregating data into individual columns, a database query can retrieve data by scanning only the appropriate columns instead of scanning a table row by row, and such scans lend themselves extremely well to parallel processing (a small illustration follows this list). Well-known columnar databases include Cassandra, Google BigTable, and others.
    • Document-oriented: In many ways considered a step up from pure key-value stores, document-oriented databases store data that does not conform to any specific schema, such as unstructured text like news articles. These databases encapsulate the information in multiple key-value pairs that need not be consistent in structure across entries (see the MongoDB sketch below). As a consequence, document databases such as MongoDB are used widely in media organizations such as the NY Times and Forbes, in addition to other mainstream companies.
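The key-value (and in-memory) pattern is easy to see in a minimal sketch using the open source redis-py client (`pip install redis`), assuming a Redis server is running on localhost; the ID number and field names below are illustrative.

```python
import redis

# Connect to a local Redis server (assumed to be running on the default port).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store an individual's attributes in a hash keyed by a (fictitious) national ID.
r.hset("person:123-45-6789", mapping={
    "name": "Jane Doe",
    "address": "42 Main St, Springfield",
    "phone": "555-0100",
})

# Retrieval is a direct in-memory lookup by key; no table scan is involved.
print(r.hgetall("person:123-45-6789"))
```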
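To see why the columnar layout suits analytical scans, here is a small pure-Python illustration (no database required): summing one column in the column layout touches only that column's array, whereas the row layout forces a pass over every record.

```python
# Row-oriented layout: each record stores all of its fields together.
rows = [
    {"id": 1, "region": "east", "sales": 100.0},
    {"id": 2, "region": "west", "sales": 250.0},
    {"id": 3, "region": "east", "sales": 75.0},
]

# Column-oriented layout: one array per column.
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100.0, 250.0, 75.0],
}

# Row scan: every record (and every field within it) must be read.
total_from_rows = sum(row["sales"] for row in rows)

# Column scan: only the "sales" array is read; "id" and "region" are never touched.
total_from_columns = sum(columns["sales"])

assert total_from_rows == total_from_columns == 425.0
```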
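And here is a sketch of the document model using the pymongo client (`pip install pymongo`), assuming a MongoDB server on localhost; note that the two articles carry different fields, something a fixed relational schema would only accommodate with sparsely filled columns.

```python
from pymongo import MongoClient

# Connect to a local MongoDB server (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017/")
articles = client["newsroom"]["articles"]

# Documents in the same collection need not share a structure.
articles.insert_many([
    {"title": "Markets rally", "author": "A. Writer", "tags": ["finance"]},
    {"title": "Local elections", "byline": {"desk": "politics"}, "word_count": 842},
])

# Query by any field; matching documents come back as dictionaries.
for doc in articles.find({"tags": "finance"}):
    print(doc["title"])
```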
  • Cloud-based solutions: Finally, cloud-based solutions for large-scale data mining, such as AWS Redshift, Azure SQL Data Warehouse, and Google BigQuery, permit users to query datasets directly on the cloud vendor's platform without having to build out their own architecture. Although the end user can choose to keep in-house specialists, such as Redshift system administrators, the management of the infrastructure, maintenance, and day-to-day routine tasks are mostly carried out by the vendor, thus reducing the operational overhead on the client side (a BigQuery sketch follows).
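As a final sketch, here is a query against one of Google's public sample datasets using the google-cloud-bigquery client (`pip install google-cloud-bigquery`), assuming Google Cloud credentials are already configured in the environment; the query itself runs entirely on the vendor's infrastructure.

```python
from google.cloud import bigquery

# Requires Google Cloud credentials (for example, via GOOGLE_APPLICATION_CREDENTIALS).
client = bigquery.Client()

# The vendor's platform executes the SQL; there is no cluster to provision.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)
```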