
2. Big Data, Machine Learning, and Deep Learning Frameworks

Tshepo Chris Nokeri, Pretoria, South Africa

This chapter presents Apache Spark, a big data framework for parallel data processing. It also covers several machine learning (ML) and deep learning (DL) frameworks useful for building scalable applications, namely Scikit-Learn, Spark MLlib, and XGBoost, as well as a deep learning framework called Keras. After reading this chapter, you will understand how big data is collected, manipulated, and examined using resilient and fault-tolerant technologies. The chapter concludes by discussing effective ways of setting up and managing these frameworks.

Big data frameworks support parallel data processing. They enable you to distribute and process big data across many clusters. The most popular big data framework is Apache Spark, which builds on ideas from the Hadoop framework and integrates with the Hadoop ecosystem.

Big Data

Big data means different things to different people. In this book, we define big data as data too large to handle and manipulate adequately using classic methods; processing it and drawing insight from it requires scalable frameworks and modern technologies. We typically consider data “big” when it does not fit within the available in-memory storage. For instance, if the data at your disposal exceeds your personal computer’s storage capacity, it is big data. The same applies to large corporations whose data outgrows large clusters of storage. In practice, we often speak about big data when we use a Hadoop/Spark stack.

Big Data Features

The features of big data are described as the four Vs—velocity, volume, variety, and veracity. Table 2-1 highlights these features of big data.
Table 2-1. Big Data Features

Velocity: Modern technologies and improved connectivity enable you to generate data at an unprecedented speed. Velocity is characterized by batch data, near-real-time data, real-time data, and streams.

Volume: The scale at which data grows. The nature of the data sources and the infrastructure influences the volume of data. Volume is measured at scales such as exabytes and zettabytes.

Variety: Data can come from many unique sources. Modern technological devices leave digital footprints everywhere, increasing the number of sources from which businesses and people can get data. Variety is characterized by the structure and complexity of the data.

Veracity: Data must come from reliable sources, and it must be of high quality, consistent, and complete.

Impact of Big Data on Business and People

Without a doubt, big data affects the way we think and do business. Data-driven organizations use it to establish the basis for evidence-based management, measuring the key aspects of the business with quantitative methods to support decision-making. The next sections discuss ways in which big data affects businesses and people.

Better Customer Relationships

Insights from big data help manage customer relationships. Organizations with big data about their customers can study customers’ behavioral patterns and use descriptive analytics to drive customer-management strategies.

Refined Product Development

Data-driven organizations use big data analytics and predictive analytics to drive product development and management strategies. This approach is useful for the incremental and iterative delivery of applications.

Improved Decision-Making

When a business has big data, it can use it to uncover complex patterns of a phenomenon to influence strategy. This approach helps management make well-informed decisions based on evidence, rather than on subjective reasoning. Data-driven organizations foster a culture of evidence-based management.

Big data is also used in fields like the life sciences, physics, economics, and medicine. There are many ways in which big data affects the world; this chapter does not attempt to cover them all. The next sections explain big data warehousing and ETL activities.

Big Data Warehousing

Over the past few decades, organizations have invested in on-premise databases, including Microsoft Access, Microsoft SQL Server, SAP Hana, Oracle Database, and many more. There has recently been widespread adoption of cloud databases like Microsoft Azure SQL and Oracle XE. There are also standard big data (distributed) databases like Cassandra and HBase, among others. Businesses are shifting toward scalable cloud-based databases to harness benefits associated with increasing computational power, fault-tolerant technologies, and scalable solutions.

Big Data ETL

Although there have been significant advances in database management, the way that people manipulate data from databases remains the same. Extracting, transforming, and loading (ETL) still play an integral part in analysis and reporting. Table 2-2 discusses ETL activities.
Table 2-2. ETL Activities

Extract: Involves getting data from a database.

Transform: Involves converting data from a database into a format suitable for analysis and reporting.

Load: Involves warehousing data in a database management system.

To perform ETL activities, you use a query language. The most popular query language is SQL (Structured Query Language). Other query languages and dialects emerged with the open source movement, such as HiveQL and BigQuery’s SQL dialect. The Python programming language supports SQL: Python can connect to and query databases through libraries such as SQLAlchemy, pyodbc, sqlite3, Spark SQL, and pandas, among others.
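As a minimal sketch of performing ETL from Python, the following example uses the standard library’s sqlite3 module together with pandas. The database file, table, and column names are hypothetical; any database reachable through SQLAlchemy or pyodbc could stand in for SQLite.

import sqlite3
import pandas as pd

# Extract: connect to a (hypothetical) SQLite database and read a table with SQL.
connection = sqlite3.connect("sales.db")
sales_data = pd.read_sql_query("SELECT order_id, amount, order_date FROM orders", connection)

# Transform: convert types and derive a monthly aggregate suitable for reporting.
sales_data["order_date"] = pd.to_datetime(sales_data["order_date"])
monthly_totals = (
    sales_data.set_index("order_date")
    .resample("M")["amount"]
    .sum()
    .reset_index()
)

# Load: write the transformed result back into a warehouse table.
monthly_totals.to_sql("monthly_sales", connection, if_exists="replace", index=False)
connection.close()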

Big Data Frameworks

Big data frameworks enable developers to collect, manage, and manipulate distributed data. Most open source big data frameworks use in-memory cluster computing. The most popular frameworks include Hadoop, Spark, Flink, Storm, and Samza. This book uses PySpark to perform ETL activities, explore data, and build machine learning pipelines.

Apache Spark

Apache Spark executes in-memory cluster computing. It enables developers to build scalable applications using Java, Scala, Python, R, and SQL. It includes cluster components like the driver, cluster manager, and executors. You can run it with its standalone cluster manager or on top of Mesos, Hadoop YARN, or Kubernetes. You can use it to access data in the Hadoop Distributed File System (HDFS), Cassandra, HBase, and Hive, among other data sources. Spark’s core data structure is the resilient distributed data set (RDD). This book introduces PySpark, a framework that integrates Python and Apache Spark, and uses it to operate Spark MLlib. To understand PySpark, you first need to grasp the idea behind resilient distributed data sets.

Resilient Distributed Data Sets

Resilient distributed data sets (RDDs) are immutable collections created by parallelizing existing data or by transforming other RDDs. The chief RDD operations are transformations and actions. RDDs can be persisted to any storage supported by Hadoop, for instance the Hadoop Distributed File System (HDFS), Cassandra, HBase, or Amazon S3.
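As a minimal sketch of these two kinds of operations, the following code parallelizes a small Python list into an RDD, applies a transformation (map), and triggers computation with an action (collect). The local SparkContext and the names used here are for illustration only.

from pyspark import SparkContext

# A minimal sketch: create a local SparkContext (any existing context would do).
spark_context = SparkContext(master="local", appName="rdd_example")

# Transformation: lazily describe a new RDD by squaring each element.
numbers = spark_context.parallelize([1, 2, 3, 4, 5])
squares = numbers.map(lambda value: value ** 2)

# Action: trigger the computation and collect the results to the driver.
print(squares.collect())  # [1, 4, 9, 16, 25]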

Spark Configuration

Areas of Spark configuration include Spark properties, environment variables, and logging. The default configuration directory is SPARK_HOME/conf.

You can install the findspark library in your environment using pip install findspark and install the pyspark library using pip install pyspark.

Listing 2-1 prepares the PySpark framework using the findspark framework.
import findspark as initiate_pyspark
# Point findspark to your local Spark installation directory.
initiate_pyspark.init("/path/to/spark-3.0.0-bin-hadoop2.7")
Listing 2-1

Prepare the PySpark Framework

Listing 2-2 specifies the PySpark app’s configuration using the SparkConf() method.
from pyspark import SparkConf
# Name the application and run it on the local machine.
pyspark_configuration = SparkConf().setAppName("pyspark_linear_method").setMaster("local")
Listing 2-2

Stipulate the PySpark App

Listing 2-3 prepares the PySpark session with the SparkSession() method, using the configuration defined in Listing 2-2.
from pyspark import SparkContext
from pyspark.sql import SparkSession
pyspark_context = SparkContext(conf=pyspark_configuration)  # uses the configuration from Listing 2-2
pyspark_session = SparkSession(pyspark_context)
Listing 2-3

Prepare the Spark Session
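Once the session exists, you can use it to load data into a Spark DataFrame. The sketch below reads a hypothetical banks.csv file; the file name is an assumption for illustration.

# Read a (hypothetical) CSV file into a Spark DataFrame using the session from Listing 2-3.
pyspark_dataframe = pyspark_session.read.csv("banks.csv", header=True, inferSchema=True)
pyspark_dataframe.show(5)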

Spark Frameworks

Spark frameworks extend the core of the Spark API. There are four main Spark frameworks—SparkSQL, Spark Streaming, Spark MLlib, and GraphX.

SparkSQL

SparkSQL enables you to query data with relational query languages like SQL and HiveQL, as well as from Scala code. It centers on the schemaRDD (known in recent Spark versions as the DataFrame), which couples row objects with a schema. You can create one from an existing RDD, a Parquet file, or a JSON data set, and you create a SQL context from an existing Spark context.
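As a minimal sketch, assuming the pyspark_session from Listing 2-3, the following code registers a small DataFrame as a temporary view and queries it with SQL; the data and names are made up for illustration.

# Build a small DataFrame from an in-memory list of rows (illustrative data).
people = pyspark_session.createDataFrame(
    [("Lerato", 34), ("Thabo", 41), ("Zanele", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")
pyspark_session.sql("SELECT name FROM people WHERE age > 30").show()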

Spark Streaming

Spark Streaming is a scalable streaming framework that supports Apache Kafka, Apache Flume, HDFS, and Amazon Kinesis, among others. It represents input data as DStreams, processes it in small batches, and pushes the results to HDFS, databases, and dashboards. Recent versions of Python do not fully support Spark Streaming, so this book does not cover the framework. A typical Spark Streaming application reads input from a data source, stores a copy of the data in HDFS, and runs an algorithm on the incoming data.

Spark MLlib

MLlib is an ML framework that allows you to develop and test ML and DL models. In Python, the framework works hand-in-hand with the NumPy framework. Spark MLlib can be used with several Hadoop data sources and incorporated alongside Hadoop workflows. Common algorithms include regression, classification, clustering, collaborative filtering, and dimensionality reduction. Key workflow utilities include feature transformation, standardization and normalization, pipeline development, model evaluation, and hyperparameter optimization.
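As a minimal sketch of the MLlib workflow, assuming the pyspark_session from Listing 2-3, the following code assembles two feature columns into a vector and fits a linear regression model; the column names and data are illustrative assumptions.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Illustrative training data with two features and a label.
training_data = pyspark_session.createDataFrame(
    [(1.0, 2.0, 3.5), (2.0, 1.0, 4.0), (3.0, 4.0, 8.5)],
    ["feature_1", "feature_2", "label"],
)

# Combine the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
assembled_data = assembler.transform(training_data)

# Fit a linear regression model and inspect its coefficients.
linear_model = LinearRegression(featuresCol="features", labelCol="label").fit(assembled_data)
print(linear_model.coefficients)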

GraphX

GraphX is a scalable and fault-tolerant framework for iterative, fast graph-parallel computation, with applications such as social network analysis and language modeling. It includes graph algorithms such as PageRank, which estimates the importance of each vertex in a graph; connected components, which labels each connected component of the graph with the ID of its lowest-numbered vertex; and triangle counting, which finds the number of triangles passing through each vertex.

ML Frameworks

To solve ML problems, you need a framework that supports building and scaling ML models, and there is no shortage of options. Subsequent chapters cover frameworks such as Scikit-Learn, Spark MLlib, H2O, and XGBoost.

Scikit-Learn

The Scikit-Learn framework includes ML algorithms like regression, classification, and clustering, among others. You can use it with other frameworks such as NumPy and SciPy. It can perform most of the tasks required for ML projects, like data processing, transformation, data splitting, normalization, hyperparameter optimization, model development, and evaluation. Scikit-Learn is included in most Python distributions. Use pip install scikit-learn to install it in your Python environment.
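A minimal sketch of a typical Scikit-Learn workflow follows: split a data set, fit a model, and evaluate it. The synthetic data and the choice of logistic regression are assumptions for demonstration only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate a small synthetic classification data set.
features, labels = make_classification(n_samples=200, n_features=4, random_state=0)

# Split the data, fit a logistic regression classifier, and evaluate it on the holdout set.
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=0)
classifier = LogisticRegression().fit(x_train, y_train)
print(accuracy_score(y_test, classifier.predict(x_test)))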

H2O

H2O is an ML framework built around automated, “driverless” ML technology. It helps accelerate the adoption of AI solutions, is easy to use, and requires little technical expertise. It supports numerical and categorical data, including text. Before you train an ML model, you must first load the data into the H2O cluster. It supports CSV, Excel, and Parquet files, and its default data sources include local file systems, remote files, Amazon S3, and HDFS, among others. It offers ML algorithms like regression, classification, cluster analysis, and dimensionality reduction. It can also perform most tasks required for ML projects, like data processing, transformation, data splitting, normalization, hyperparameter optimization, model development, checkpointing, evaluation, and productionizing. Use pip install h2o to install the package in your environment.

Listing 2-4 initializes the H2O framework.
import h2o
# Start (or connect to) a local H2O cluster.
h2o.init()
Listing 2-4

Initializing the H2O Framework
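Once the cluster is running, you load data into it before training a model. A minimal sketch follows; the CSV file name is hypothetical.

# Import a (hypothetical) CSV file into the H2O cluster as an H2OFrame.
bank_frame = h2o.import_file("bank_data.csv")
bank_frame.describe()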

XGBoost

XGBoost is an ML framework available in several programming languages, including Python. It implements scalable gradient-boosted models and supports fast, parallel, and distributed computing without sacrificing memory efficiency. It is also an ensemble learner, and ensemble learners can solve both regression and classification problems. XGBoost uses boosting, in which each tree learns from the errors committed by the preceding trees; this is useful when individual tree-based models overfit. Use pip install xgboost to install the library in your Python environment.
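A minimal sketch of training an XGBoost classifier through its Scikit-Learn-style interface follows; the synthetic data and parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Generate a small synthetic classification data set and split it.
features, labels = make_classification(n_samples=200, n_features=4, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=0)

# Train a gradient-boosted ensemble of 50 shallow trees and score it on the holdout set.
booster = XGBClassifier(n_estimators=50, max_depth=3)
booster.fit(x_train, y_train)
print(booster.score(x_test, y_test))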

DL Frameworks

DL frameworks provide a structure that supports building and scaling artificial neural networks. You can use them on their own or alongside other models, and they typically provide prebuilt components for defining, training, and deploying networks. Primary DL frameworks include TensorFlow, PyTorch, Deeplearning4j, Microsoft Cognitive Toolkit (CNTK), and Keras.

Keras

Keras is a high-level DL framework written in Python; it runs on top of an ML platform known as TensorFlow. It is effective for rapid prototyping of DL models. You can run Keras on tensor processing units (TPUs) or on massive graphics processing units (GPUs). The main Keras APIs include models, layers, and callbacks. Chapter 7 covers this framework. Execute pip install keras and pip install tensorflow to use the Keras framework.
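A minimal sketch of the Keras workflow follows: define a small Sequential model, compile it, and train it on synthetic data. The layer sizes, data shapes, and training settings are illustrative assumptions.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic data: 100 samples with 8 features and a binary label.
features = np.random.rand(100, 8)
labels = np.random.randint(0, 2, size=(100, 1))

# Define and compile a small feed-forward network.
network = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(8,)),
    layers.Dense(1, activation="sigmoid"),
])
network.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train briefly; callbacks such as EarlyStopping could be added here.
network.fit(features, labels, epochs=5, batch_size=16, verbose=0)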
