Working with Big Data inR | 207
   > sd(data$mpg)
   [1] 7.815984
8.2.2 Basic Plots for Data Exploration
Although statistical techniques give a good idea of the nature and quality of the data in a
data set, more effective data exploration is possible through visualization. R provides
several powerful libraries for data exploration using charts and plots.
The front-runner among them is the ggplot2 library. Created by Hadley Wickham, ggplot2
offers a comprehensive graphics module for creating elaborate and complex
plots.
In order to start using the library functions of ggplot2, we need to load the library as follows.
   > library(ggplot2)
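The base R functions used in the rest of this section (boxplot, hist and plot) are available without loading ggplot2. As a rough illustration of what the ggplot2 library itself looks like in use, the following sketch draws a box plot with it. The choice of the iris data set, the features plotted and the labels are assumptions made only for this example, and the code is guarded so that it does nothing if ggplot2 is not installed.

```r
# Minimal ggplot2 sketch (assumes the ggplot2 package is installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Map species to the x axis and sepal width to the y axis,
  # then draw one box per species.
  p <- ggplot(iris, aes(x = Species, y = Sepal.Width)) +
    geom_boxplot() +
    labs(title = "Sepal width by species", y = "Sepal.Width")

  print(p)  # explicit print so the plot is drawn when run as a script
}
```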
Let us now understand the different graphs used for data exploration and how to generate them
using R code.
Box Plot: A box plot is an extremely effective way to get a one-shot view of the nature of
the data. A box plot (also called a box-and-whisker plot) gives a standard
visualization of the five-number summary statistics of a data set, namely Minimum, First Quartile
(Q1), Median (Q2), Third Quartile (Q3) and Maximum. Below is a detailed interpretation of
a box plot.
The central rectangle or the box spans from first to third quartile (i.e., Q1 to Q3), thereby
giving the interquartile range or IQR.
The median is given by the line or band within the box.
The lower whisker extends from the bottom of the box, i.e., the first quartile or Q1, down to
the smallest data value lying within 1.5 times the interquartile range (IQR) of Q1.
The upper whisker extends from the top of the box, i.e., the third quartile or Q3, up to
the largest data value lying within 1.5 times the interquartile range (IQR) of Q3.
Data values lying beyond the lower or upper whisker are unusually low or high values,
respectively. These are the outliers, which may deserve special
consideration.
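The interpretation above can be checked numerically. The sketch below is only an illustration: it picks the Sepal.Width feature of the iris data set as an assumed example, computes the five-number summary, the IQR and the 1.5 x IQR limits by hand, and compares the result with boxplot.stats(), the helper that boxplot() relies on. All variable names are made up for this example.

```r
# Sketch: compute the quantities a box plot displays for one numeric feature.
x <- iris$Sepal.Width  # feature chosen only for illustration

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum.
# R's box plot uses the hinges as Q1 and Q3.
five <- fivenum(x)
names(five) <- c("min", "Q1", "median", "Q3", "max")

iqr <- five[["Q3"]] - five[["Q1"]]          # height of the box
lower_limit <- five[["Q1"]] - 1.5 * iqr     # whisker cannot go below this
upper_limit <- five[["Q3"]] + 1.5 * iqr     # whisker cannot go above this

# Whiskers end at the most extreme observations inside the limits.
whisker_low  <- min(x[x >= lower_limit])
whisker_high <- max(x[x <= upper_limit])

# Observations outside the limits are drawn individually as outliers.
outliers <- x[x < lower_limit | x > upper_limit]

# Compare the manual values with what boxplot() computes internally.
bs <- boxplot.stats(x)
print(five)
print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(sort(outliers))
print(bs$stats)  # lower whisker, Q1, median, Q3, upper whisker
```

If outliers are present, the whisker ends differ from the minimum and maximum of the data, which is why the manual whisker values are compared with bs$stats rather than with the raw five-number summary.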
Syntax: boxplot(x, data, notch, varwidth, names, main)
Usage:
    > # iris is a popular data set used in machine learning; it comes bundled with the R installation
    > boxplot(iris)
A separate window comes up in R console with the box plot generated as shown in Figure 8.3.
M08 Big Data Simplified XXXX 01.indd 207 5/10/2019 10:01:14 AM
208 | Big Data Simplified
FIGURE 8.3 Box plot for an entire data set
Figure 8.3 gives the box plot for the entire iris data set, i.e., the overall plot contains
one component box plot for each feature in the iris data set. However,
to review an individual feature separately, we can use the following
R command.
    > boxplot(iris$Sepal.Width, main = "Boxplot", ylab = "Sepal.Width")
The output of the command, i.e., the box plot of an individual feature called ‘sepal width’ of the
iris data set is shown in Figure 8.4.
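The syntax line for boxplot also lists the data, notch, varwidth and names parameters, which the usage examples do not exercise. As a hedged sketch of how they might be combined, the following call splits sepal width by species using a formula. The grouping by Species, the capitalised labels and the title are assumptions made for illustration only.

```r
# Sketch: one box per species, using the parameters named in the syntax line.
bp <- boxplot(Sepal.Width ~ Species, data = iris,
              notch = TRUE,     # draw a notch around each median
              varwidth = TRUE,  # box width reflects the group size
              names = c("Setosa", "Versicolor", "Virginica"),
              main = "Sepal width by species",
              ylab = "Sepal.Width")

# boxplot() invisibly returns the statistics it plotted:
# one column per group, rows are lower whisker, Q1, median, Q3, upper whisker.
print(bp$stats)
print(bp$n)  # number of observations in each group
```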
Histogram: A histogram is another plot that helps in the effective visualization of numeric
attributes. It shows how the values of a numeric attribute are distributed across a series of
intervals, also termed ‘bins’. Histograms take different shapes depending on the nature of
the data, for example, its skewness.
Syntax: hist (v, main, xlab, xlim, ylim, breaks, col, border)
Usage:
    > hist(iris$Sepal.Length, main = "Histogram", xlab = "Sepal Length",
           col = "blue", border = "green")
The output of the command, i.e., the histogram of an individual feature, such as ‘sepal length’ of
the iris data set is shown in Figure 8.5.
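To see how the bins are formed, the histogram can be computed without drawing it. The sketch below is an illustration under assumed settings: the suggested bin count of 8 and the explicit 0.5-wide break points are arbitrary choices for this example, not values taken from the text.

```r
# Sketch: inspect the bins behind a histogram of sepal length.
x <- iris$Sepal.Length

# plot = FALSE returns the bin boundaries and counts without drawing.
# A single number for breaks is only a suggestion; R picks "pretty" boundaries.
h <- hist(x, breaks = 8, plot = FALSE)
print(h$breaks)  # bin boundaries
print(h$counts)  # number of observations in each bin

# Passing a vector of break points fixes the bins exactly.
# The break points must cover the full range of the data.
hist(x, breaks = seq(4, 8, by = 0.5),
     main = "Histogram", xlab = "Sepal Length",
     col = "blue", border = "green")
```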
FIGURE 8.4 Box plot for a specific feature
Scatter Plot: A scatter plot helps in visualizing bivariate relationships, i.e., the relationship between
two variables. It is a two-dimensional plot in which each point is drawn at the coordinates
given by the values of the two attributes.
Syntax: plot (x, y, main, xlab, ylab, xlim, ylim, axes)
Usage:
    > plot(Sepal.Length ~ Petal.Length, data = iris, main = "Scatter Plot",
           xlab = "Petal Length", ylab = "Sepal Length")
    > # Fit a regression line (red) to show the trend
    > abline(lm(Sepal.Length ~ Petal.Length, data = iris), col = "red")
FIGURE 8.5 Histogram for a specific feature
The output of the command, i.e., the scatter plot for the feature pair petal length and sepal length
of the iris data set is shown in Figure 8.6.
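The trend that the regression line shows can also be read off as numbers. The following sketch, offered as an illustration rather than as part of the original listing, fits the same linear model, prints its intercept and slope, and reports the correlation between the two features. The variable names are assumptions made for this example.

```r
# Sketch: quantify the relationship shown in the scatter plot.
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

coefs <- coef(fit)  # intercept and slope of the fitted line
r <- cor(iris$Petal.Length, iris$Sepal.Length)  # Pearson correlation

print(coefs)
print(r)

# Redraw the scatter plot and add the fitted line from the stored model.
plot(Sepal.Length ~ Petal.Length, data = iris, main = "Scatter Plot",
     xlab = "Petal Length", ylab = "Sepal Length")
abline(fit, col = "red")
```

For a regression with a single predictor, the squared correlation equals the R-squared value reported by summary(fit), so either number can be used to judge how closely the points follow the line.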
To see the scatter plots of all feature pairs of the iris data set in one frame, use the
following code. Its output is exhibited in Figure 8.7.
   > plot(iris)
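When plot() is given a data frame with more than two columns, it draws a scatter plot matrix through pairs(). As a hedged variation, the sketch below calls pairs() directly on the four numeric features, colours the points by species, and prints the matching correlation matrix. The colour mapping, the title and the variable names are assumptions made for this example.

```r
# Sketch: scatter plot matrix of the numeric features, coloured by species.
num <- iris[, 1:4]  # the four numeric columns; Species is left out

# Integer codes of the factor pick one colour per species.
pairs(num, col = as.integer(iris$Species) + 1,
      main = "Scatter plots of all feature pairs")

# Numeric companion to the plot: pairwise Pearson correlations.
cm <- cor(num)
print(round(cm, 2))
```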
FIGURE 8.6 Scatter plot of petal length vs. sepal length