
5. Preparing the Data


The most important step in data science is preparing the data. Data preparation is the process of cleaning, processing, and transforming raw data for analysis. At this stage, errors in the data can be handled effectively by cleaning it: identifying missing values, handling outliers, and so on. This chapter therefore discusses methodologies for preparing data using the Pandas package in Python.

Pandas and Data Structures

Pandas is a software library written for the Python programming language that is used mainly for data manipulation and analysis.

In a nutshell, Pandas is like Excel for Python, with tables (which in Pandas are called dataframes) made of rows and columns (which in Pandas are called series). Pandas has many functionalities that make it an awesome library for data processing, inspection, and manipulation.

Installing and Using Pandas

Installing Pandas requires NumPy to be installed, and if you are building the library from source, you also need the appropriate tools to compile the C and Cython sources on which Pandas is built.

You can find details about installation in the Pandas documentation. Pandas can be installed with pip: pip install pandas. Once Pandas is installed, you can import it and check the version, as shown here:
import pandas
pandas.__version__
Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd, and this import convention will be used throughout the remainder of this book.
import pandas as pd

Pandas Data Structures

A data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data. Pandas introduces two new data structures to Python, Series and DataFrame, both of which are built on top of NumPy (which means they are fast).

Series

A series is a one-dimensional object similar to an array, list, or column in a table. Pandas assigns a labeled index to each item in the series. By default, each item receives an index label from 0 to N, where N is the length of the series minus 1, as illustrated here:
s = pd.Series([1, 'Raspberry Pi', 3.14, -500, 'Data'])
print(s)
Output:
0               1
1    Raspberry Pi
2            3.14
3            -500
4            Data
dtype: object
Instead of providing the default index, we can specify an index to be used for each entry while creating the series, as illustrated here:
s = pd.Series([1, 'Raspberry Pi', 3.14, -500, 'Data'],
               index=['M', 'A', 'X', 'I', 'E'])
print(s)
Output:
M               1
A    Raspberry Pi
X            3.14
I            -500
E            Data
dtype: object
The Series constructor can convert a dictionary into a series as well, using the keys of the dictionary as its index, as illustrated here:
d = {'English': 95, 'Math': 100, 'Science': 98, 'Social Science': 93}
marks = pd.Series(d)
print(marks)
Output:
English           95
Math             100
Science           98
Social Science    93
dtype: int64
The index can be used to select specific items from the series. For instance, the marks for math can be selected by specifying the index Math. Similarly, a group of items can be printed by providing their corresponding indices separated by commas in a list, as illustrated here:
print (marks['Math'])
print(marks[['English', 'Science', 'Social Science']])
Output:
100
English           95
Science           98
Social Science    93
dtype: int64
Boolean indexing can also be used to filter values. For example, the expression marks < 96 returns a series of Boolean values, which we then pass to our series marks, returning the items where the condition is True, as illustrated here:
marks[marks < 96]
Output:
English           95
Social Science    93
dtype: int64
The value of a particular item in the series can be changed on the fly by accessing the corresponding index of the item, as illustrated here:
print('Old value:', marks['Math'])
marks['Math'] = 99
print('New value:', marks['Math'])
Output:
Old value: 100
New value: 99
We can also check whether an item exists in the series or not using the following code:
print('Math' in marks)
print('French' in marks)
Output:
True
False
Mathematical operations can also be done on a series of numerical values, as illustrated here:
marks * 10
Output:
English           950
Math              990
Science           980
Social Science    930
dtype: int64
import numpy as np
np.square(marks)
Output:
English           9025
Math              9801
Science           9604
Social Science    8649
dtype: int64

DataFrame

The tabular DataFrame data structure is composed of rows and columns, similar to a spreadsheet or a database table. You can also think of a DataFrame as a group of Series objects, one per column, that share a common row index.
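To make this concrete, here is a minimal sketch (using a made-up DataFrame) showing that selecting a single column of a DataFrame returns a Series that shares the DataFrame's row index:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
col = df['a']                       # selecting one column returns a Series
print(type(col))                    # <class 'pandas.core.series.Series'>
print(col.index.equals(df.index))   # True: the Series shares the DataFrame's index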

Reading Data

To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor (here via the from_dict method):
a = {'Name': ['Augustus', 'Hazel', 'Esther', 'Cavas'],
     'Gender': ['Male', 'Female', 'Female', 'Male'],
     'Age': [19, 18, 22, 21]}
b = pd.DataFrame.from_dict(a)
print(b)
Output:
       Name  Gender  Age
0  Augustus    Male   19
1     Hazel  Female   18
2    Esther  Female   22
3     Cavas    Male   21

Reading CSV Data

Reading a CSV file is as simple as calling the read_csv function. By default, read_csv expects the column separator to be a comma, but you can change that with the sep parameter. The following code shows the syntax to read a CSV file into a DataFrame df and print the first five rows of df using the head() function:
df = pd.read_csv('data.csv')
print(df.head())
There’s also a set of writer functions for writing the DataFrame object to a variety of formats such as CSV files, HTML tables, JSON, etc. The following line of code shows the syntax to write a DataFrame object to a CSV file:
df.to_csv('path_to_file.csv')
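As a brief sketch of the sep parameter mentioned above (the file names here are placeholders, not files from the book), and of the index parameter of to_csv:
df_tab = pd.read_csv('data.tsv', sep='\t')    # read a tab-separated file
df_semi = pd.read_csv('data.csv', sep=';')    # read a semicolon-separated file
df.to_csv('path_to_file.csv', index=False)    # write without the row index column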

Reading Excel Data

Pandas allows us to read and write Excel files, so we can easily read data from Excel into Python and then write it back out to Excel. Reading Excel files requires the xlrd library (newer versions of Pandas read .xlsx files with the openpyxl library instead), which can be installed using pip, as shown here:
pip install xlrd
The following code illustrates the syntax used to read a sheet from an Excel file into a DataFrame df. Replace data.xlsx with the path/filename of your Excel file to run the code.
df = pd.read_excel('data.xlsx', 'Sheet1')
print(df.head())
Similarly, the data from a DataFrame object can be written to an Excel file, as shown here:
df.to_excel('path_to_file.xlsx', index=False)

Reading URL Data

The read_table function can be used to read directly from a URL. The following code illustrates a DataFrame created using raw data from a given URL:
url = 'https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv'
from_url = pd.read_table(url, sep='\t')
from_url.head(3)
Output:
   rank           sandwich  ...        lat        lng
0     1                BLT  ...  41.895734 -87.679960
1     2      Fried Bologna  ...  41.884672 -87.647754
2     3  Woodland Mushroom  ...  41.890602 -87.630925

Cleaning the Data

In most data analytics projects, the available data is rarely perfect. Raw data tends to be messy, containing corrupt or inaccurate records alongside the useful ones. Data scientists therefore have to treat these messy samples to convert the raw data into a workable form, and they spend a considerable amount of time doing so.

Data cleaning is the process of identifying inaccurate, incorrect, or incomplete parts of the data and treating them by replacing, deleting, or modifying the data. In other words, it is the process of preparing the data for analysis by treating all the irregularities in the raw data. In the following sections, we will discuss how to handle missing values and outliers, fill in the inappropriate values, and remove duplicate entries.

Handling Missing Values

Missing values are quite common in raw data. Assume that the input data consists of product feedback from thousands of customers, collected using survey forms. It is common for customers to skip a few entries while filling out the forms. For instance, a few customers may not share their experience with the product, some may not state how long they have been using it, and others may not fill in their contact information. When these survey forms are compiled into a table, there are sure to be plenty of missing values.

Data from sensors may also have missing values due to various reasons, like a temporary power outage at the sensor node, hardware failure, interference in communication, etc. Therefore, handling these missing values is the foremost task for data scientists dealing with raw data. The following code illustrates the creation of a DataFrame of random numbers using the random.randn function from the NumPy library:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(6, 4),
                  index=['1', '3', '4', '6', '7', '9'],
                  columns=['a', 'b', 'c', 'd'])
It can be seen from the previous code that the indices for rows and columns have been allocated manually. From the indices allotted to the rows, it can be seen that indices 2, 5, and 8 are missing. Using the reindex function in the Pandas library, these indices are created with missing "Not a Number" (NaN) values, as illustrated here:
df=df.reindex(['1','2','3','4','5','6','7','8','9'])
print(df)
Output:
         a         b         c         d
1  0.099344  0.293956  1.002970  0.516942
2       NaN       NaN       NaN       NaN
3  1.608906 -1.748396 -1.013634 -0.651055
4  3.211263 -2.555312 -1.036068 -0.728020
5       NaN       NaN       NaN       NaN
6 -0.101766 -0.205572  1.369707 -1.133026
7  0.062344  1.483505  0.026995  1.560656
8       NaN       NaN       NaN       NaN
9 -0.324347 -0.342040  0.107224  0.272153
Now that a DataFrame with missing values has been created, the next step is to treat these values. Before considering the treatment options, the first task is to detect where the missing values are located. The isna() function (an alias of isnull()) in the Pandas library can be used to detect the rows containing missing values, as illustrated here:
df1=df[df.isna().any(axis=1)]
print(df1)
Output:
      a   b   c   d
2   NaN NaN NaN NaN
5   NaN NaN NaN NaN
8   NaN NaN NaN NaN

The previous process gives us a fair idea of the amount of missing data in our DataFrame. Once this missing data is detected, the next step is to treat it. There are two ways to do this: one is to fill the missing data with values, and the other is to simply remove it.

The fillna() function in the Pandas library can be used to fill the missing values with a user-specified scalar value, as illustrated here. In the first five rows printed by head(), the missing values in rows 2 and 5 are replaced by 0.000000 (row 8 is filled in the same way).
df2=df.fillna(0)
print(df2.head())
Output:
          a         b         c         d
1  0.099344  0.293956  1.002970  0.516942
2  0.000000  0.000000  0.000000  0.000000
3  1.608906 -1.748396 -1.013634 -0.651055
4  3.211263 -2.555312 -1.036068 -0.728020
5  0.000000  0.000000  0.000000  0.000000
Another way to replace missing values is forward or backward filling with the fillna function in the Pandas library. ffill stands for "forward fill," which fills each missing value by repeating the last value that occurs before it, and bfill stands for "backward fill," which fills each missing value by repeating the first value that occurs after it. The following code illustrates the forward fill approach:
df3= df.fillna(method='ffill')
print(df3.head())
Output:
          a         b         c         d
1  0.099344  0.293956  1.002970  0.516942
2  0.099344  0.293956  1.002970  0.516942
3  1.608906 -1.748396 -1.013634 -0.651055
4  3.211263 -2.555312 -1.036068 -0.728020
5  3.211263 -2.555312 -1.036068 -0.728020
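For completeness, here is the corresponding backward fill on the same DataFrame; row 2 would take the values of row 3, and row 5 those of row 6. (In newer versions of Pandas, the method argument of fillna is deprecated in favor of the equivalent ffill() and bfill() methods, used here.)
df3b = df.bfill()   # equivalent to df.fillna(method='bfill')
print(df3b.head())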
The second possible way to deal with missing values is to simply drop them by using the dropna function in the Pandas library, as illustrated here:
df4=df.dropna()
print(df4)
Output:
          a         b         c         d
1  0.099344  0.293956  1.002970  0.516942
3  1.608906 -1.748396 -1.013634 -0.651055
4  3.211263 -2.555312 -1.036068 -0.728020
6 -0.101766 -0.205572  1.369707 -1.133026
7  0.062344  1.483505  0.026995  1.560656
9 -0.324347 -0.342040  0.107224  0.272153
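dropna also accepts optional parameters (standard Pandas arguments) that change what gets dropped; a brief sketch on the same DataFrame:
df5 = df.dropna(axis=1)     # drop columns containing any NaN (here, all four columns)
df6 = df.dropna(thresh=2)   # keep only rows with at least two non-NaN values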

We created a simple dataset with missing values to illustrate the concept of treating them. In reality, the datasets used in analytics projects are large and may easily contain 500 to 1,000 rows or more. You are encouraged to apply the lessons from this example to real datasets. The appropriate method for treating missing values depends on the nature of the application as well as on the number and frequency of missing values in the dataset.

Handling Outliers

In a dataset, outliers are observations (i.e., data points) that stand out from all the others. In other words, outliers are data points that are distant from the rest of the data in the dataset. Outliers can originate either from errors in measurement or data entry or from genuine extreme values in the data. For instance, consider the series of numbers 112, 123, 120, 132, 106, 26, 118, 140, and 125. In this series, all the numbers lie roughly between 106 and 140 except 26. Hence, 26 is an outlier, as it is vastly distant from the rest of the numbers.

Outliers can be detected in two ways: using visualization techniques and using a mathematical approach. In this section, we introduce two mathematical approaches to identify outliers in our data, namely, interquartile range (IQR) and the Z-score.

Interquartile range is a measure of the variability or spread of data in a dataset. The data is first ordered and divided into four quarters. The values that divide the total range into four quarters are called quartiles; three quartiles are therefore needed to split the data into four quarters. The quartiles are Q1, Q2, and Q3, where Q2 is the median of the entire data, Q1 is the median of the lower half of the data, and Q3 is the median of the upper half of the data. The IQR is the difference between the third and first quartiles, i.e., Q3 - Q1.

To illustrate the process of removing outliers using IQR, let’s first create a DataFrame with 15 entries that includes outliers.
import pandas as pd
a = {'Name': ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'],
     'Weight': [56,62,48,72,15,80,76,64,68,180,75,47,58,63,71]}
df = pd.DataFrame.from_dict(a)
print(df.head())
Output:
   Name  Weight
0    A      56
1    B      62
2    C      48
3    D      72
4    E      15
In the previous code, we created a DataFrame containing the weights in kilograms of 15 adults. For convenience, we have named the adults with the letters A to O. Weights of 15kg and 180kg have been included as outliers, as it is unlikely for healthy adults to weigh so little or so much. To detect these outliers, we need to compute the 25th and 75th percentile values, Q1 and Q3, respectively. From these values, the IQR can be calculated as the difference Q3 - Q1. This process is illustrated here:
Q1=df.Weight.quantile(0.25)
Q3=df.Weight.quantile(0.75)
IQR=Q3-Q1
print('Q1=',Q1,'Q3=',Q3,'IQR=',IQR)
Output:
Q1= 57.0 Q3= 73.5 IQR= 16.5
By comparing the entries in the DataFrame with the quartiles calculated previously, it can be seen that there are four values below Q1, seven values between Q1 and Q3, and four values above Q3. But we know that there is only one outlier below Q1 and one above Q3. To detect them, we form an interval whose lower limit lies 1.5 times the IQR below Q1 and whose upper limit lies 1.5 times the IQR above Q3 (the conventional fences). Values below the lower limit or above the upper limit can then safely be considered outliers, as illustrated in the following code:
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
df1=df[(df.Weight < lower_limit) | (df.Weight > upper_limit)]
print(df1)
Output:
   Name  Weight
4    E      15
9    J     180
It can be seen that the limits created using the IQR value have detected the outliers in our data accurately. Now these outliers can be easily filtered out using the following code:
df2 = df.drop(df1.index)
print(df2)
Output:
   Name  Weight
0     A      56
1     B      62
2     C      48
3     D      72
5     F      80
6     G      76
7     H      64
8     I      68
10    K      75
11    L      47
12    M      58
13    N      63
14    O      71

Z-Score

The Z-score, also called the standard score, gives an idea of how far a data point is from the mean value. Technically, the Z-score fits the data to a normal distribution and measures the number of standard deviations by which each data point differs from the mean of the entire dataset, as illustrated in Figure 5-1.
Figure 5-1. Normal distribution of data for outlier detection based on the Z-score

Figure 5-1 shows that each data point is mapped onto a normal distribution centered at the zero mean. The Z-score of a point x is z = (x - μ)/σ, where μ and σ are the mean and standard deviation of the data. Data points that are too far from the zero mean are treated as outliers. In most cases, the threshold is fixed at 3, and any data point beyond 3σ or -3σ (i.e., |z| > 3) is treated as an outlier. Let's take the same dataset that we used in the previous section and identify the outliers using the Z-score.
import pandas as pd
from scipy import stats
import numpy as np
a = {'Name': ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'],
     'Weight': [56,62,48,72,15,80,76,64,68,180,75,47,58,63,71]}
df = pd.DataFrame.from_dict(a)
z = np.abs(stats.zscore(df.Weight))
df1 = df[z > 3]
print(df1)
Output:
   Name  Weight
9    J     180
From the previous code, it can be seen that the Z-score corresponding to the weight value of 180 exceeds the threshold of 3, and hence it is displayed as an outlier. Unfortunately, the weight value of 15 is not detected as an outlier. The reason can be understood by comparing the value with the mean and standard deviation, which can be obtained through the np.mean and np.std functions, as illustrated here:
print(np.mean(df.Weight))
print(np.std(df.Weight))
Output:
69.0
33.448467827390836
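To make the comparison concrete, we can compute the Z-scores of the two suspect weights directly from the mean and standard deviation printed above:
mean = np.mean(df.Weight)    # 69.0
std = np.std(df.Weight)      # approximately 33.45
print((180 - mean) / std)    # approximately 3.32, beyond the threshold of 3
print((15 - mean) / std)     # approximately -1.61, well within the threshold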
Let's approximate the standard deviation as 33.45. The difference between the weight value 180 and the mean is 111, which is greater than three times the standard deviation (>3σ), whereas the difference between the weight value 15 and the mean is just 54, which is less than two times the standard deviation (<2σ). One way to overcome this problem is to reduce the threshold. Let's assume a threshold value of 1.
df1 = df[z > 1]
print(df1)
Output:
  Name  Weight
4    E      15
9    J     180
From the previous illustration, it can be seen that the usual threshold of 3 may not hold for every dataset, and hence the threshold should be selected based on the distribution of the data. Now, as in the IQR case, these outliers can simply be filtered out using the following code:
df2=df.drop(df.Name[z>1].index)
print(df2)
Output:
    Name  Weight
0     A      56
1     B      62
2     C      48
3     D      72
5     F      80
6     G      76
7     H      64
8     I      68
10    K      75
11    L      47
12    M      58
13    N      63
14    O      71

Filtering Out Inappropriate Values

In some cases, the dataset may contain inappropriate values that are completely irrelevant to the analysis. This is especially true of sensor data. Data recorded from a sensor is normally time-series data with a unique timestamp for each data point. These timestamps are often not required for analysis and can then be treated as inappropriate values. To illustrate this concept, we create time-series temperature data similar to sensor data, as follows:
import pandas as pd
data={'Time':['12:00:05','12:08:33','12:25:12','12:37:53','12:59:08'],
      'Temperature':['T=22','T=22','T=23','T=23','T=24']}
df=pd.DataFrame.from_dict(data)
print(df)
Output:
       Time Temperature
0  12:00:05        T=22
1  12:08:33        T=22
2  12:25:12        T=23
3  12:37:53        T=23
4  12:59:08        T=24
Now, the timestamp corresponding to each data point and the prefix 'T=' in each temperature value should be removed. The timestamp column can be removed using the drop function in the Pandas library, whereas the prefix can be removed using the str.replace function. Because of the prefix, the temperature values are initially stored as strings, so the data type has to be changed to int or float after removing the prefix. These procedures are illustrated as follows:
df.drop('Time',inplace=True,axis=1)
df=df.Temperature.str.replace('T=','')
df=df.astype(float)
print(df)
Output:
0    22.0
1    22.0
2    23.0
3    23.0
4    24.0
Name: Temperature, dtype: float64
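An alternative sketch (reusing the same data dictionary) cleans the column in place and keeps the tabular DataFrame structure instead of reducing df to a Series:
df2 = pd.DataFrame.from_dict(data)
df2['Temperature'] = df2['Temperature'].str.replace('T=', '').astype(float)
df2 = df2.drop('Time', axis=1)   # drop the timestamp column, keep the DataFrame
print(df2)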

Removing Duplicates

Duplicate entries are common in data science, especially when we collect data from various sources and consolidate them for processing. Depending on the nature of our analysis, these duplicates may pose a problem. Therefore, it is better to remove these duplicates before analyzing the data, as illustrated here:
import pandas as pd
a = {'Name': ['Alan','Joe','Jim','Tom','Alan','Anna','Ellen','Rachel','Mindy'],
     'Age': [22,24,25,24,22,23,21,22,23]}
df = pd.DataFrame.from_dict(a)
print('DATA', df, sep='\n')
print('DUPLICATES', df[df.duplicated()], sep='\n')
df1 = df.drop_duplicates()
print('DATA AFTER REMOVING DUPLICATES', df1, sep='\n')
Output:
DATA
     Name  Age
0    Alan   22
1     Joe   24
2     Jim   25
3     Tom   24
4    Alan   22
5    Anna   23
6   Ellen   21
7  Rachel   22
8   Mindy   23
DUPLICATES
   Name  Age
4  Alan   22
DATA AFTER REMOVING DUPLICATES
     Name  Age
0    Alan   22
1     Joe   24
2     Jim   25
3     Tom   24
5    Anna   23
6   Ellen   21
7  Rachel   22
8   Mindy   23

As shown in the code, a DataFrame is created from a dictionary consisting of the name and age of a few people, and we have deliberately created a duplicate entry for the name Alan. It can be seen that the duplicated function in the Pandas library clearly identifies the second entry for this name. This duplicate entry is then removed by using the drop_duplicates function in the Pandas library.
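drop_duplicates also accepts optional parameters (standard Pandas arguments) that control which columns count toward a duplicate and which occurrence is kept; a brief sketch:
df2 = df.drop_duplicates(subset=['Name'])   # compare only the Name column
df3 = df.drop_duplicates(keep='last')       # keep the last occurrence instead of the first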
