
7. Analyzing the Data


Exploratory Data Analysis

Exploratory data analysis (EDA) is the process of understanding data by summarizing its main characteristics. This step is important before modeling the data for machine learning. From this analysis, the user can extract useful information, identify the root cause of any issues in the data, and decide what steps to take next. In simple terms, this type of analysis explores the data to identify the patterns and trends in it. There is no single method for doing EDA; it depends on the data we are working with. For simplicity, in this chapter we will use common methods and plots for doing EDA.

Choosing a Dataset

To do the EDA, we’ll use the Boston housing dataset, which can be imported from the Scikit-Learn library and was already described in Chapter 6. This dataset contains 506 samples with 13 feature attributes, such as per capita crime rate by town (CRIM), average number of rooms per dwelling (RM), and index of accessibility to radial highways (RAD), as well as a target attribute, MEDV, which indicates the median value of owner-occupied homes in thousands of dollars.
1. Import the required libraries.
The first step is to load the required libraries for doing the EDA. In this chapter, we will use packages such as Pandas, NumPy, and Matplotlib for plotting:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_boston
2. Import a dataset.
The Boston housing dataset can be imported from the Scikit-Learn library and loaded into the dataset variable, as given in the following code:
dataset = load_boston()
More importantly, most open source data is stored in a comma-separated format, which is not convenient to fetch and analyze directly. Such comma-separated data can be converted into a dataframe using the Pandas package in Python. The same approach is used here to convert the loaded dataset into a dataframe:
import pandas as pd
boston_data=pd.DataFrame(dataset.data,columns=dataset.feature_names)
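If the data were instead available as a CSV file on disk, it could be read into a dataframe in a single step with the Pandas read_csv function. The following is a minimal sketch, assuming a hypothetical file named boston.csv with a header row of column names:
import pandas as pd
# Read the comma-separated file directly into a DataFrame (hypothetical file name)
csv_data = pd.read_csv('boston.csv')
print(csv_data.head())  # inspect the first few rows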
If the dataset is very large, we can display the top and bottom five rows with headings using the following code:
# To display top 5 rows of data
print(boston_data.head(5))
      CRIM    ZN  INDUS  CHAS    NOX  ...  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  ...  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  ...  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  ...  3.0  222.0     18.7  396.90   5.33
# To display bottom 5 rows of data
print(boston_data.tail(5))
        CRIM   ZN  INDUS  CHAS    NOX  ...  RAD    TAX  PTRATIO       B  LSTAT
501  0.06263  0.0  11.93   0.0  0.573  ...  1.0  273.0     21.0  391.99   9.67
502  0.04527  0.0  11.93   0.0  0.573  ...  1.0  273.0     21.0  396.90   9.08
503  0.06076  0.0  11.93   0.0  0.573  ...  1.0  273.0     21.0  396.90   5.64
504  0.10959  0.0  11.93   0.0  0.573  ...  1.0  273.0     21.0  393.45   6.48
505  0.04741  0.0  11.93   0.0  0.573  ...  1.0  273.0     21.0  396.90   7.88
3. Check the information about the data in the dataset.
Before doing data analysis, it is important to check information such as the data types and the size of the data, describe the data, and know how much data is available in the dataset, because sometimes the numerical values in a dataset may be stored as string data types. It is difficult to plot and analyze numerical values stored as strings, so any numerical column stored as a string data type should be converted to a numeric type for better analysis. The size of the dataset can be viewed with the help of the following code:
boston_data.shape
Output:
     (506, 13)

This output shows that the dataset has 506 rows and 13 columns. In other words, we can say that the dataset has 506 samples with 13 features.

Then, the information about the dataset can be viewed with the help of the following code:
boston_data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
dtypes: float64(13)
memory usage: 51.5 KB
boston_data.dtypes
Output:
CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
dtype: object
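All the columns in this dataset are already stored as float64. If a numerical column had instead been stored as strings, it could be converted to a numeric type before analysis. The following is a minimal sketch using the Pandas to_numeric function on a hypothetical string-typed column named PRICE:
# Convert a string-typed column to numeric values (PRICE is a hypothetical column)
# errors='coerce' turns any value that cannot be parsed into NaN
boston_data['PRICE'] = pd.to_numeric(boston_data['PRICE'], errors='coerce')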
Moreover, with the help of the describe() function, we can see the distribution of the data, such as the minimum values, maximum values, mean, etc. The description of the Boston data can be viewed using the following code:
boston_data.describe()
Output:
       CRIM        ZN          INDUS       ...  PTRATIO     B           LSTAT
count  506.000000  506.000000  506.000000  ...  506.000000  506.000000  506.000000
mean   3.613524    11.363636   11.136779   ...  18.455534   356.674032  12.653063
std    8.601545    23.322453   6.860353    ...  2.164946    91.294864   7.141062
min    0.006320    0.000000    0.460000    ...  12.600000   0.320000    1.730000
25%    0.082045    0.000000    5.190000    ...  17.400000   375.377500  6.950000
50%    0.256510    0.000000    9.690000    ...  19.050000   391.440000  11.360000
75%    3.677083    12.500000   18.100000   ...  20.200000   396.225000  16.955000
max    88.976200   100.000000  27.740000   ...  22.000000   396.900000  37.970000

Modifying the Columns in the Dataset

Modifications to the data, such as removing unnecessary columns, adding dummy columns, dropping duplicate columns, encoding columns, and normalizing the data, are required when the dataset needs preprocessing. Dropping unnecessary columns is particularly important when many columns are not used for the analysis; removing them makes the data lighter and more reliable. Dropping the unnecessary columns in the Boston dataset can be done with the following code:
boston_data = boston_data.drop(['CRIM','ZN','LSTAT'], axis=1)  # axis=1 drops columns rather than rows
print(boston_data.head(5))
Output:
    INDUS CHAS NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B
 0  2.31  0.0  0.538  6.575  65.2  4.0900  1.0  296.0  15.3  396.90
 1  7.07  0.0  0.469  6.421  78.9  4.9671  2.0  242.0  17.8  396.90
 2  7.07  0.0  0.469  7.185  61.1  4.9671  2.0  242.0  17.8  392.83
 3  2.18  0.0  0.458  6.998  45.8  6.0622  3.0  222.0  18.7  394.63
 4  2.18  0.0  0.458  7.147  54.2  6.0622  3.0  222.0  18.7  396.90

In the previous code, the columns CRIM, ZN, and LSTAT are dropped, and only 10 columns of data remain.

Renaming a column helps improve the readability of the data. In the following code, the column DIS is renamed to Distance:
boston_data= boston_data.rename(columns={"DIS":"Distance"})
boston_data.head(5)
    INDUS CHAS NOX     RM    AGE  Distance RAD  TAX    PTRATIO  B
 0  2.31  0.0  0.538  6.575  65.2  4.0900  1.0  296.0  15.3  396.90
 1  7.07  0.0  0.469  6.421  78.9  4.9671  2.0  242.0  17.8  396.90
 2  7.07  0.0  0.469  7.185  61.1  4.9671  2.0  242.0  17.8  392.83
 3  2.18  0.0  0.458  6.998  45.8  6.0622  3.0  222.0  18.7  394.63
 4  2.18  0.0  0.458  7.147  54.2  6.0622  3.0  222.0  18.7  396.90
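The other modifications mentioned earlier, such as encoding a column and normalizing the data, can be carried out in a similar way. The following is a minimal sketch, assuming we treat the CHAS column as a categorical variable and scale the RM column to the range 0 to 1:
# One-hot encode a column treated as categorical (CHAS holds only 0.0 or 1.0)
encoded_data = pd.get_dummies(boston_data, columns=['CHAS'])
# Min-max normalization of a numeric column to the range [0, 1]
rm = boston_data['RM']
boston_data['RM_norm'] = (rm - rm.min()) / (rm.max() - rm.min())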

Identifying duplicates, dropping the duplicates, and detecting outliers were already discussed in the previous chapters.

Statistical Analysis

A better understanding of the data at hand can go a long way in simplifying the job of a data scientist, and this is where statistics can come in handy. Statistics can provide the tools necessary to identify structures in the data, and such insights can prove to be valuable in building a model to best fit our data. The role of statistics with respect to data can vary from simple analysis to creating self-learning models. In this section, we will introduce the various types of distributions, statistical measures of data, and ways to fit data to distributions.

Before discussing distributions, let’s first understand how data is associated with probability. When we consider a dataset, it normally represents a single sample from a population. For instance, if we have a dataset consisting of the height and weight of all the students in a school, the model developed from this data after some statistical analysis can be used to predict the height and weight of students from another school. The dataset in our hands is just one sample, whereas the population may consist of many such schools.

The numerical data that we encounter may be continuous or discrete in nature. The difference between the two is that continuous data may take any value, whereas discrete data can take only certain values. For example, data such as the number of cars manufactured per day, the number of feedback responses received from customers, etc., is discrete in nature, whereas data such as height, weight, humidity, temperature, etc., represents continuous data.

Probability distributions, a fundamental concept in statistics, provide a way to represent the possible values taken by a random variable and the respective probabilities. The probability mass function (PMF) denotes the discrete probability distribution, and the probability density function (PDF) denotes the continuous probability distribution. Some of the common distributions that a data scientist needs to be aware of are discussed in the following section.

Uniform Distribution

Uniform distribution, also called rectangular distribution, has a constant probability. In other words, all the outcomes have the same probability of occurrence. The number of possible outcomes in a uniform distribution may be unlimited. The most common example of a uniform distribution is the roll of a fair die, where all six outcomes have an equal probability of 1/6. Let’s illustrate the uniform distribution by plotting the probabilities of the outcomes of the fair die experiment; that is, the probabilities of occurrence for each face of the die are equally likely. Figure 7-1 shows the distribution plot.
import numpy as np
import matplotlib.pyplot as plt
probabilities = np.full((6),1/6)
events = [1,2,3,4,5,6]
plt.bar(events,probabilities)
plt.xlabel('Die roll events')
plt.ylabel('Event Probability')
plt.title('Fair die - Uniform Distribution')
plt.show()
Figure 7-1

Uniform distribution of fair die experiment

If a histogram is plotted for a dataset by dividing the numerical data into a number of bins and all the bins are found to have roughly equal counts, then the dataset can be said to be uniformly distributed.
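For example, a histogram of samples drawn from a uniform random number generator shows roughly equal counts in every bin. The following is a small sketch illustrating this check:
import numpy as np
import matplotlib.pyplot as plt
samples = np.random.uniform(low=0.0, high=1.0, size=10000)  # 10,000 uniform samples
plt.hist(samples, bins=10)  # roughly equal counts are expected in each bin
plt.xlabel('Value')
plt.ylabel('Count')
plt.title('Histogram of uniform random samples')
plt.show()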

Binomial Distribution

As the name suggests, this distribution is used when there are only two possible outcomes. A random variable X that follows a binomial distribution is dependent on two parameters:
  • The number of trials n in the case of binomial distribution must be fixed, and the trials are considered to be independent of each other. In other words, the outcome of a particular trial does not depend on the outcomes of the previous trials.

  • There are only two possible outcomes for each event: success or failure. The probability of success, say p, remains the same from trial to trial.

Therefore, the binomial distribution function in Python normally takes two values as inputs: the number of trials n and the probability of success p. To understand binomial distribution, let’s look at the common experiment of tossing a coin:
from scipy.stats import binom
import matplotlib.pyplot as plt
import numpy as np
n=15 # no of times coin is tossed
r_values = list(range(n + 1))
x=[0.2,0.5,0.7,0.9]  #probabilities of getting a head
k=1
for p in x:
    dist = [binom.pmf(r, n, p) for r in r_values ]
    plt.subplot(2,2,k)
    plt.bar(r_values,dist)
    plt.xlabel('number of heads')
    plt.ylabel('probability')
    plt.title('p= %.1f' %p)
    k+=1
plt.show()
In the previous code, we have 15 trials for tossing the coin. The probability of getting a head remains the same for each trial, and the outcome of each trial is independent of the previous outcomes. The binomial distribution is computed using the binom.pmf function available in the stats module of the scipy package. The experiment is repeated for different probabilities of success using a for loop, and Figure 7-2 shows the resulting distribution plot.
Figure 7-2

Binomial distribution for tossing a coin 15 times

Figure 7-2 shows the binomial distribution for our coin toss experiment for different probabilities of success. The first subplot shows the binomial distribution when the probability of getting a head is 0.2. This implies that there is a 20 percent chance of getting a head on each toss. Twenty percent of 15 tosses is 3, so getting three heads in 15 tosses is the most likely outcome, and the probability therefore peaks at three heads. It can be seen that the binomial distribution has a bell-shaped response. The response is skewed to the left when the probability of success is low and shifts to the right as the probability increases, as illustrated in the rest of the subplots.

Binomial distribution can be encountered in various domains of data science. For instance, when a pharmaceutical company wants to test a new vaccine, then there are only two possible outcomes: the vaccine works or it does not. Also, the result for an individual patient is an independent event and does not depend on other trials for different patients. Binomial distribution can be applied to various business issues as well. For example, consider people working in the sales department making calls all day to sell their company’s products. The outcome of the call is whether a successful sale is made or not, and the outcome is independent for each worker. Similarly, there are many other areas in a business with binary outcomes where binomial distribution can be applied, and hence it plays an important role in business decision-making.
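As a small illustration of this kind of decision-making, the same binom functions can answer questions such as the chance of reaching a sales target. The following sketch uses hypothetical numbers: 50 calls in a day and a 15 percent chance that any single call ends in a sale:
from scipy.stats import binom
n_calls = 50   # hypothetical number of calls made in a day
p_sale = 0.15  # hypothetical probability that a single call results in a sale
# Probability of making at least 10 sales: P(X >= 10) = 1 - P(X <= 9)
prob_at_least_10 = binom.sf(9, n_calls, p_sale)
print('P(at least 10 sales) = %.3f' % prob_at_least_10)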

Normal Distribution

Normal distribution, also known as Gaussian distribution, is a bell-shaped curve centered at the mean, where the probability is at its maximum; the probability reduces the further we move from the mean. This implies that the values closer to the mean occur more frequently, and the values that are further away from the mean occur less frequently. This distribution depends on two parameters: the mean (μ) of the data and the standard deviation (σ). The probability density function (PDF) for a normal distribution is given as follows:
$$ f(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi \sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
To illustrate the pdf function, consider the following code. An array x with 100 values in the range of -10 to 10 is created, and the pdf function of x is computed using the norm.pdf function in the stats module of the scipy package. The pdf function is computed for four different values of mean 0, 2.5, 5, and 7.5 using a for loop. If the mean value is not given, the norm.pdf function takes a default value of zero.
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
mean=[0.0,2.5,5,7.5] # mean values for the normal distribution
x=np.linspace(-10,10,100) # array of 100 numbers in the range -10 to 10
for m in mean:
     y=norm.pdf(x,loc=m)
     plt.plot(x,y,label='mean= %.1f' %m)
plt.xlabel('x')
plt.ylabel('pdf(x)')
plt.legend(frameon=True)
plt.show()
Figure 7-3 shows that the normal distribution produces a bell-shaped curve that is centered on the mean value. That is, the curve is at the maximum at the point of mean, and it starts decreasing on either side as we move away from the mean value. Note that we have not specified the value of standard deviation. In that case, the norm.pdf function takes the default value of 1.
Figure 7-3

Normal distribution plot for different mean values

Similarly, let’s keep the value of the mean constant and plot the distribution for different values of standard deviation using the following code:
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
stdev=[1.0,2.0,3.0,4.0] # standard deviation values for the normal distribution
x=np.linspace(-10,10,100)
for s in stdev:
    y=norm.pdf(x,scale=s)
    plt.plot(x,y,label='stdev= %.1f' %s)
plt.xlabel('x')
plt.ylabel('pdf(x)')
plt.legend(frameon=True)
plt.show()
From Figure 7-4, we can see that all four curves are centered at the default mean value of zero. As the value of standard deviation σ is increased, the density is distributed across a wide range. In other words, the distribution of data is more spread out from the mean as the standard deviation value is increased and there is a high likelihood that more observations are further away from the mean.
Figure 7-4

Normal distribution plot for different values of standard deviation

An important property of the normal distribution that makes it an important statistical distribution for data scientists is the empirical rule. According to this rule, if we divide the range of observations on the x-axis in terms of standard deviations, then approximately 68.3 percent of the values fall within one standard deviation of the mean, 95.5 percent fall within two standard deviations, and 99.7 percent fall within three standard deviations. This empirical rule can be used for identifying outliers in the data if the data can be fit to a normal distribution. This principle is used in the Z-score method for outlier detection, which we discussed earlier in Chapter 5.
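As a quick sketch of how the empirical rule translates into code, the following computes Z-scores for a roughly bell-shaped feature such as RM and flags the observations lying more than three standard deviations from the mean (assuming boston_data is the DataFrame created earlier):
rm = boston_data['RM']
z_scores = (rm - rm.mean()) / rm.std()     # standardize the RM values
outliers = boston_data[abs(z_scores) > 3]  # rows more than three standard deviations away
print(outliers[['RM']])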

Statistical Analysis of Boston Housing Price Dataset

Let’s take the Boston housing price dataset and try to identify the best features that can be used to model the data based on the statistical properties of the features. As we have already discussed, the Boston dataset consists of 13 different features for 506 cases (506 × 13). In addition to these features, the median value of owner-occupied homes (in the thousands) denoted by the variable MEDV is identified as the target. That is, given the 13 different features, the median value of a house is to be estimated. The features from the dataset are first converted to a dataframe using the Pandas package. Then the target variable is added to the last column of this dataframe, making its dimension 506 × 14. This is illustrated in the following code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
dataset = load_boston()
boston_data=pd.DataFrame(dataset.data,columns=dataset.feature_names)
boston_data['MEDV'] = dataset['target']
print(boston_data.head())
   CRIM     ZN    INDUS  CHAS  NOX   ...  TAX    PTRATIO  B    LSTAT  MEDV
0  0.00632  18.0  2.31   0.0  0.538  ...  296.0  15.3  396.90  4.98   24.0
1  0.02731  0.0   7.07   0.0  0.469  ...  242.0  17.8  396.90  9.14   21.6
2  0.02729  0.0   7.07   0.0  0.469  ...  242.0  17.8  392.83  4.03   34.7
3  0.03237  0.0   2.18   0.0  0.458  ...  222.0  18.7  394.63  2.94   33.4
4  0.06905  0.0   2.18   0.0  0.458  ...  222.0  18.7  396.90  5.33   36.2
[5 rows x 14 columns]
Once we have the data in hand, the best way to go about it is to plot the histogram of all the features so that we can get an understanding of the nature of their distribution. Rather than plotting the histogram of each feature individually, the hist function in the Pandas package can be used to plot them all in one go, as illustrated here:
fig, axis = plt.subplots(2,7,figsize=(16, 16))
boston_data.hist(ax=axis,grid=False)
plt.show()
From Figure 7-5, we can see that the distribution of the target variable MEDV is like a normal distribution.
Figure 7-5

Histogram plots of the Boston dataset features
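Since the distribution of MEDV looks approximately normal, one way to examine this further is to fit a normal distribution to the column and overlay the fitted curve on its histogram. The following is a minimal sketch using the norm.fit function from the stats module of the scipy package:
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
medv = boston_data['MEDV']
mu, sigma = norm.fit(medv)             # estimate the mean and standard deviation
plt.hist(medv, bins=30, density=True)  # normalized histogram of MEDV
x = np.linspace(medv.min(), medv.max(), 100)
plt.plot(x, norm.pdf(x, loc=mu, scale=sigma))  # fitted normal curve
plt.xlabel('MEDV')
plt.ylabel('Density')
plt.title('MEDV with fitted normal curve (mean=%.1f, std=%.1f)' % (mu, sigma))
plt.show()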

Further, if we observe all the other parameters, the distribution of the parameter RM (which denotes the average number of rooms per dwelling) is also similar to that of the target MEDV. Therefore, RM can definitely be used for modeling the dataset. Also, the parameters DIS (weighted mean of distances to five Boston employment centers) and LSTAT (percentage of lower-status population) have similar distributions. The distribution of the parameter AGE (proportion of owner-occupied units built prior to 1940) is exactly the opposite of these two parameters. The rest of the parameters have distributions that are less similar to that of the target parameter. Since these three parameters seem to be related either positively or negatively, it is pointless to use all three for building the model. So, we have to see which of these three parameters is most strongly related to our target variable MEDV. The best way to do this is to measure the correlation between these parameters using the corr function in the Pandas package, as illustrated here:
cols=['RM','AGE','DIS','LSTAT','MEDV']
print(boston_data[cols].corr())
       RM           AGE          DIS           LSTAT        MEDV
RM     1.000000    -0.240265     0.205246     -0.613808     0.695360
AGE   -0.240265     1.000000    -0.747881      0.602339     -0.376955
DIS    0.205246    -0.747881     1.000000     -0.496996     0.249929
LSTAT -0.613808     0.602339    -0.496996      1.000000     -0.737663
MEDV   0.695360    -0.376955     0.249929     -0.737663     1.000000
From these results, it can be seen that the diagonal elements are all 1s, which implies maximum correlation; they represent the self-correlation values. If we look at the row corresponding to our target parameter MEDV, we can see that RM is the most positively correlated with MEDV, as we judged earlier by looking at the histogram distributions. It can also be seen that the parameter LSTAT is the most negatively correlated with MEDV, which implies that there is an inverse relationship between these two parameters. A scatter plot of RM and LSTAT against MEDV, respectively, gives us a better understanding of these relationships, as illustrated here:
plt.subplot(1,2,1)
plt.scatter(list(boston_data['RM']),list(boston_data['MEDV']))
plt.xlabel('RM')
plt.ylabel('MEDV')
plt.subplot(1,2,2)
plt.scatter(list(boston_data['LSTAT']),list(boston_data['MEDV']))
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
plt.show()
Figure 7-6 confirms the conclusions derived using the distribution graphs and the correlation values. It can be seen that RM and MEDV are positively correlated; i.e., the median value of owner-occupied homes increases with an increase in the average number of rooms per dwelling. Similarly, it can be seen that LSTAT and MEDV are negatively correlated; i.e., the median value of owner-occupied homes drops with an increase in the percentage of the lower-status population. Therefore, these two parameters are good choices for modeling the Boston housing dataset. It can also be seen from the figure that there are some outliers in the RM versus MEDV plot, which could be treated using the techniques discussed in Chapter 5 before further processing.
Figure 7-6

Scatter plot of RM and LSTAT versus MEDV
