Appendix D
Introduction to NumPy and Pandas

In this appendix you will learn to use two popular Python libraries used by data scientists: NumPy and Pandas. These libraries are commonly used during the data exploration and feature engineering phase of a project. The examples in this appendix require the use of Jupyter Notebooks.

NOTE To follow along with this appendix, ensure you have installed Anaconda Navigator and Jupyter Notebook as described in Appendix A.
You can download the code files for this chapter from www.wiley.com/go/machinelearningawscloud or from GitHub using the following URL:

https://github.com/asmtechnology/awsmlbook-appendixd.git

NumPy

NumPy is a math library for Python that allows for fast and efficient manipulation of arrays. The main object provided by NumPy is a homogeneous multidimensional array called ndarray. All the elements of an ndarray must be of the same data type, and dimensions are referred to as axes in NumPy terminology.

To use NumPy in a Python project, you typically add the following import statement to your Python file:

import numpy as np

The lowercase alias np is a standard convention for referring to NumPy in Python projects.

Creating NumPy Arrays

You can create an ndarray in a number of ways. To create an ndarray object out of a three-element Python list, use the np.array() function. The following code snippet, when run in a Jupyter Notebook, creates a NumPy array with three elements:
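A minimal sketch of such a statement is shown here. The variable name x follows the surrounding text; the three element values are illustrative assumptions.

```python
import numpy as np

# Create a one-axis ndarray from a three-element Python list.
# The values 10, 27, and 34 are illustrative, not taken from the text.
x = np.array([10, 27, 34])

print(x)  # [10 27 34]
```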

Note the use of the square brackets within the parentheses. Omitting the square brackets will result in an error. The ndarray x has one axis and three elements. Unlike arrays created using the Python library class called array, NumPy arrays can be multidimensional. The following statements create a NumPy array with two axes, and print the contents of the ndarray object:
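As a sketch, a two-axis array can be built by passing a list of lists to np.array(). The element values match the four-by-three matrix used in this section; the variable name matrix is an assumption made for illustration.

```python
import numpy as np

# Each inner list becomes one row of the two-axis ndarray.
# The name "matrix" is a hypothetical choice for this example.
matrix = np.array([[11, 28, 9],
                   [56, 38, 91],
                   [33, 87, 36],
                   [87, 8, 4]])

print(matrix)
```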

The ndarray created by the preceding statement will have two axes, and can be represented for visualization purposes as a two-dimensional matrix with four rows and three columns:

11 28 9

56 38 91

33 87 36

87 8 4

The first axis contains four elements (the number of rows), and the second axis contains three elements (the number of columns). Table D.1 contains some of the commonly used attributes of ndarrays.

TABLE D.1: Commonly Used Ndarray Attributes

ATTRIBUTE	DESCRIPTION
`ndim`	Returns the number of axes.
`shape`	Returns the number of elements along each axis.
`size`	Returns the total number of elements in the ndarray.
`dtype`	Returns the data type of the elements in the ndarray.
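The attributes in Table D.1 can be inspected as in the following sketch. The variable name matrix is an assumption, and the data type is set explicitly to np.int32 so that the reported dtype does not depend on the platform.

```python
import numpy as np

# A four-row, three-column array with an explicit element type.
matrix = np.array([[11, 28, 9],
                   [56, 38, 91],
                   [33, 87, 36],
                   [87, 8, 4]], dtype=np.int32)

print(matrix.ndim)   # 2 (number of axes)
print(matrix.shape)  # (4, 3) (elements along each axis)
print(matrix.size)   # 12 (total number of elements)
print(matrix.dtype)  # int32 (data type of the elements)
```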

All elements in an ndarray must have the same data type. NumPy provides its own data types, the most commonly used of which are np.int16, np.int32, and np.float64. Unlike Python, NumPy provides multiple data types for a particular data class. This is similar to the C language concept of short, int, long, signed, and unsigned variants of a data class. For example, NumPy provides the following data types for signed integers:

byte: This is compatible with a C char.
short: This is compatible with a C short.
intc: This is compatible with a C int.
int_: This is compatible with a Python int.
longlong: This is compatible with a C long long.
intp: This data type can be used to represent a pointer. The number of bytes depends on the processor architecture and operating system your code is running on.
int8: An 8-bit signed integer.
int16: A 16-bit signed integer.
int32: A 32-bit signed integer.
int64: A 64-bit signed integer.

Some NumPy data types are compatible with Python; these usually end in an underscore character (such as int_, float_). You can find a complete list of NumPy data types at https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.scalars.html#arrays-scalars-built-in.

You can specify the data type when creating ndarrays as illustrated in the following snippet:
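One way to do this is with the dtype argument of np.array(), as in this minimal sketch. The variable name and element values are illustrative assumptions.

```python
import numpy as np

# Request 16-bit signed integers instead of letting NumPy infer the type.
x = np.array([10, 27, 34], dtype=np.int16)

print(x.dtype)  # int16
```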

When you specify the elements in the array at the time of creation, but do not specify the data type, NumPy infers the most appropriate data type from the elements. For functions that create arrays with placeholder content, the default data type is float_ (a 64-bit float, np.float64), which is compatible with a Python float.

When creating a NumPy array, if you do not know the values of the elements, but know the size and number of axes, you can use one of the following functions to create ndarrays with placeholder content:

zeros: Creates an ndarray of specified dimensions, with each element being zero.
ones: Creates an ndarray of specified dimensions, with each element being one.
empty: Creates an uninitialized ndarray of specified dimensions.
random.random: Creates an array of random values that lie in the half open interval [0.0, 1.0).

As an example, if you wanted to create an ndarray with four rows and three columns, where each element is an int16 with value 1, you would use a statement similar to the following:
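A sketch of such a statement follows; the variable name ones_array is an assumption. The shape is passed as a tuple of (rows, columns).

```python
import numpy as np

# Four rows, three columns, every element an int16 with value 1.
ones_array = np.ones((4, 3), dtype=np.int16)

print(ones_array)
```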

The following statement would create an ndarray with two rows and three columns populated with random numbers. The data type of random numbers is float_:
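For instance, assuming the variable name random_array, the call could look like this. The values differ on every run, so no output is shown.

```python
import numpy as np

# Two rows, three columns of random floats in the half-open interval [0.0, 1.0).
random_array = np.random.random((2, 3))

print(random_array)
```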

NumPy provides methods to create sequences of numbers. The most commonly used are:

arange: Creates a single-axis ndarray of evenly spaced elements, given the step size between consecutive elements.
linspace: Creates a single-axis ndarray of evenly spaced elements, given the total number of elements to generate.

The arange function is similar to the Python range function in functionality. The arange function takes up to four arguments: the first value is the start of the range, the second is the end of the range, the third is the step increment between numbers within the range, and the fourth is an optional parameter that allows you to specify the data type.

For example, the following snippet creates an ndarray, the elements of which lie in the half-open interval [0, 9). Each element in the ndarray is greater than the previous element by three:
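A minimal sketch of such a call, assuming the variable name sequence:

```python
import numpy as np

# Start at 0, stop before 9, step by 3.
sequence = np.arange(0, 9, 3)

print(sequence)  # [0 3 6]
```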

The upper limit of the range is not included in the numbers that are generated by the arange function. Therefore, the ndarray generated by the preceding statement will have the following elements, and not include the number 9:

0, 3, 6

If you want a sequence of integers from 0 up to a specific number, you can call the arange function with a single argument. The following statements achieve identical results:
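As an illustration, the three calls below all produce the integers 0 through 4. The upper bound of 5 and the variable names are assumptions chosen for the example.

```python
import numpy as np

# Explicit start, stop, and step.
a = np.arange(0, 5, 1)

# Explicit start and stop; the step defaults to 1.
b = np.arange(0, 5)

# A single argument is treated as the stop value; the start defaults to 0.
c = np.arange(5)

print(a)  # [0 1 2 3 4]
print(np.array_equal(a, b) and np.array_equal(b, c))  # True
```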

The linspace function is similar to arange in that it generates a sequence of numbers that lie within a range. The difference is that the third argument is the number of values to generate, including the start and end values. The following statement creates an ndarray with three elements, evenly spaced from 0 to 9:
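A sketch of such a statement, assuming the variable name samples:

```python
import numpy as np

# Three evenly spaced values, with both 0 and 9 included.
samples = np.linspace(0, 9, 3)

print(samples.tolist())  # [0.0, 4.5, 9.0]
```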

Unlike the arange function, the linspace function ensures that the specified lower and upper bounds are part of the sequence.

Modifying Arrays

NumPy provides several functions that allow you to modify the contents of arrays. While it is not possible to cover each of them in this appendix, the most commonly used operations will be discussed.

ARITHMETIC OPERATIONS

NumPy allows you to perform element-wise arithmetic operations between two arrays. The result of the operation is stored in a new array. The +, -, /, and * operators retain their arithmetic meaning. The following code snippet demonstrates how to perform arithmetic operations on two ndarrays:

# Elementwise Arithmetic operations can be performed on ndarrays
array1 = np.array([[1,2,3], [2, 3, 4]])
array2 = np.array([[3,4,5], [4, 5, 6]])
 
Sum = array1 + array2
Difference = array1 - array2
Product = array1 * array2
Division = array1 / array2
 
print (Sum)
[[ 4 6 8]
 [ 6 8 10]]
 
print (Difference)
[[-2 -2 -2]
 [-2 -2 -2]]
 
print (Product)
[[ 3 8 15]
 [ 8 15 24]]
 
print (Division)
[[0.33333333 0.5    0.6    ]
 [0.5    0.6    0.66666667]]

You can use the +=, -=, *=, and /= operators to perform in-place element-wise arithmetic operations. The result of each operation overwrites the array on the left-hand side instead of being stored in a new array. The use of these operators is demonstrated in the following snippet:

# in-place elementwise arithmetic operations
array1 = np.array([1,2,3], dtype=np.float64)
array2 = np.array([3,4,5], dtype=np.float64)
array3 = np.array([4,5,6], dtype=np.float64)
array4 = np.array([5,6,7], dtype=np.float64)
 
# in-place arithmetic can be performed using arrays of the same size
array1 += np.array([10,10,10], dtype=np.float64)
array2 -= np.array([10,10,10], dtype=np.float64)
array3 *= np.array([10,10,10], dtype=np.float64)
array4 /= np.array([10,10,10], dtype=np.float64)
 
# in-place arithmetic can be performed using a scalar value
array1 += 100.0
array2 -= 100.0
array3 *= 100.0
array4 /= 100.0
 
print (array1)
[111. 112. 113.]
 
print (array2)
[-107. -106. -105.]
 
print (array3)
[4000. 5000. 6000.]
 
print (array4)
[0.005 0.006 0.007]

The exponent operator is represented by two asterisk symbols (**). The following statements demonstrate the use of the exponent operator:
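The sketch below shows the operator applied with a scalar exponent and with an array of exponents. The variable names and values are illustrative assumptions.

```python
import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([3, 2, 1])

# Raise every element to the same scalar power.
squares = array1 ** 2

# Raise each element of array1 to the matching element of array2.
powers = array1 ** array2

print(squares)  # [1 4 9]
print(powers)   # [1 4 3]
```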

COMPARISON OPERATIONS

NumPy provides the standard comparison operators <, >, <=, >=, !=, and ==. The comparison operators can be used between two ndarrays of the same size, or between an ndarray and a scalar. The result of using a comparison operator is an ndarray of Booleans. The use of comparison operators is demonstrated in the following snippet:

array1 = np.array([1,4,5])
array2 = np.array([3,2,5])
 
# less than
print (array1 < array2)
[True False False]
 
# less than or equal to
print (array1 <= array2)
[True False True]
 
# greater than
print (array1 > array2)
[False True False]
 
# greater than or equal to
print (array1 >= array2)
[False True True]
 
# equal to
print (array1 == array2)
[False False True]
 
# not equal to
print (array1 != array2)
[ True True False]

MATRIX OPERATIONS

NumPy provides the ability to perform matrix operations on ndarrays. The following list contains some of the most commonly used matrix operations:

inner: Computes the inner product of two arrays (the dot product, for one-dimensional arrays).
outer: Performs the outer product between two arrays.
cross: Performs the cross product between two arrays.
transpose: Swaps the rows and columns of an array.

The use of matrix operations is demonstrated in the following snippet:

# np.float64 is used here because the np.float_ alias was removed in NumPy 2.0
array1 = np.array([1,4,5], dtype=np.float64)
array2 = np.array([3,2,5], dtype=np.float64)
 
# inner (dot product)
print (np.inner(array1, array2))
36.0
 
# outer product
print (np.outer(array1, array2))
[[ 3. 2. 5.]
 [12. 8. 20.]
 [15. 10. 25.]]
 
# cross product
print (np.cross(array1, array2))
[ 10. 10. -10.]
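The listing above does not exercise transpose. A minimal sketch, using a hypothetical two-row, three-column array, might look like this:

```python
import numpy as np

# A hypothetical array with two rows and three columns.
array3 = np.array([[1, 2, 3],
                   [4, 5, 6]])

# transpose swaps the axes, giving three rows and two columns.
transposed = np.transpose(array3)

print(transposed.shape)     # (3, 2)
print(transposed.tolist())  # [[1, 4], [2, 5], [3, 6]]
```

The same result is available through the T attribute of the array (array3.T).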

Indexing and Slicing

NumPy provides the ability to index elements in an array as well as slice larger arrays into smaller ones. NumPy array indexes are zero-based. The following code snippet demonstrates how to index and slice one-dimensional arrays:

# create a one-dimensional array with 10 elements
array1 = np.linspace(0, 9, 10)
 
print (array1)
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
 
# get the element at index 3 (the fourth element). Indexes are zero-based.
print (array1[3])
3.0
 
# extract the elements at indexes 2, 3, and 4 into a sub-array
print (array1[2:5])
[2. 3. 4.]
 
#extract first 6 elements of the array (elements 0 to 5)
print (array1[:6])
[0. 1. 2. 3. 4. 5.]
 
# extract elements 5 onwards
print(array1[5:])
[5. 6. 7. 8. 9.]
 
# extract every alternate element, step value is specified as 2
print (array1[::2])
[0. 2. 4. 6. 8.]
 
# reverse all the elements in array1
print (array1[::-1])
[9. 8. 7. 6. 5. 4. 3. 2. 1. 0.]

Indexing a multidimensional array requires you to provide a tuple with the value for each axis. The following code snippet demonstrates how to index and slice multidimensional arrays:

# create a two-dimensional array with 12 elements
array1 = np.array([[1,2,3,4], [5,6,7,8], [9, 10, 11, 12]])
 
print (array1)
[[ 1 2 3 4]
 [ 5 6 7 8]
 [ 9 10 11 12]]
 
# get the element in the second row, third column. Indexes are zero-based.
print (array1[1,2])
7
 
# get all the elements in the first column
print (array1[:,0])
[1 5 9]
 
#get all the elements in the first row
print (array1[0,:])
[1 2 3 4]
 
# get a sub 2-dimensional array
print (array1[:3, :2])
[[ 1 2]
 [ 5 6]
 [ 9 10]]

Pandas

Pandas is a free, open source data analysis library for Python and is one of the most commonly used tools for data munging. The key objects provided by Pandas are the series and dataframe. A Pandas series is similar to a one-dimensional list and a dataframe is similar to a two-dimensional table. One of the key differences between Pandas dataframes and NumPy arrays is that the columns in a dataframe object can have different data types, and can even handle missing values.

To use Pandas in a Python project, you typically add the following import statement to your Python file:

import pandas as pd

The lowercase alias pd is a standard convention for referring to Pandas in Python projects.

Creating Series and Dataframes

You have many ways to create a series and a dataframe object with Pandas. The simplest way is to create a Pandas series out of a Python list, as demonstrated in the following snippet:

# create a Pandas series from a Python list.
car_manufacturers = ['Volkswagen','Ford','Mercedes-Benz','BMW','Nissan']
pds_car_manufacturers = pd.Series(data=car_manufacturers)
print (pds_car_manufacturers)
 
0     Volkswagen
1           Ford
2  Mercedes-Benz
3            BMW
4         Nissan
dtype: object

A Pandas series contains an additional index column, which contains a unique integer value for each row of the series. In most cases, Pandas automatically creates this index column, and the index value can be used to select an item using square brackets [ ]:
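As a sketch, selecting by index value looks like the following. The series is rebuilt here so that the example is self-contained.

```python
import pandas as pd

car_manufacturers = ['Volkswagen', 'Ford', 'Mercedes-Benz', 'BMW', 'Nissan']
pds_car_manufacturers = pd.Series(data=car_manufacturers)

# Select the item whose automatically generated index value is 2.
selected = pds_car_manufacturers[2]

print(selected)  # Mercedes-Benz
```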

If your data is loaded into a Python dictionary, you can convert the dictionary into a Pandas series. A series built from a Python dictionary does not, by default, have an integer index for each row; the dictionary keys are used as the index instead. The following example shows how to convert a Python dictionary into a Pandas series:

# create a Pandas series from a Python dictionary
#
# Pandas uses the dictionary keys as the index
# instead of generating an integer index.
cars = {'RJ09VWQ':'Blue Volkswagen Polo',
        'WQ81R09':'Red Ford Focus',
        'PB810AQ':'White Mercedes-Benz E-Class',
        'TU914A8':'Silver BMW 1 Series'}
 
pds_cars = pd.Series(data=cars)
 
print (pds_cars)
RJ09VWQ         Blue Volkswagen Polo
WQ81R09               Red Ford Focus
PB810AQ  White Mercedes-Benz E-Class
TU914A8          Silver BMW 1 Series
dtype: object

Even though the series does not have a numeric index, you can still select a value by its integer position:
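A minimal sketch of positional selection follows. It uses the iloc indexer, which is an assumption about how to express the lookup: recent Pandas releases deprecate plain integer keys in square brackets as positions when the index holds labels.

```python
import pandas as pd

cars = {'RJ09VWQ': 'Blue Volkswagen Polo',
        'WQ81R09': 'Red Ford Focus',
        'PB810AQ': 'White Mercedes-Benz E-Class',
        'TU914A8': 'Silver BMW 1 Series'}
pds_cars = pd.Series(data=cars)

# iloc selects by integer position, regardless of the index labels.
second_car = pds_cars.iloc[1]

print(second_car)  # Red Ford Focus
```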

Since the series was created from a Python dictionary, the keys of the dictionary can be used to select a value:
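For example, a lookup by dictionary key can be sketched as follows (the series is rebuilt so that the example runs on its own):

```python
import pandas as pd

cars = {'RJ09VWQ': 'Blue Volkswagen Polo',
        'WQ81R09': 'Red Ford Focus',
        'PB810AQ': 'White Mercedes-Benz E-Class',
        'TU914A8': 'Silver BMW 1 Series'}
pds_cars = pd.Series(data=cars)

# Use a dictionary key as the index label.
selected_car = pds_cars['PB810AQ']

print(selected_car)  # White Mercedes-Benz E-Class
```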

The reason you are able to use the keys from the dictionary object is that Pandas creates an Index object for your series from those keys. You can use the following statement to view the contents of the index of the series:
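A sketch of inspecting the index through the index attribute is shown here. The exact printed representation of the Index object varies between Pandas versions, so no output is shown.

```python
import pandas as pd

cars = {'RJ09VWQ': 'Blue Volkswagen Polo',
        'WQ81R09': 'Red Ford Focus',
        'PB810AQ': 'White Mercedes-Benz E-Class',
        'TU914A8': 'Silver BMW 1 Series'}
pds_cars = pd.Series(data=cars)

# The index holds the dictionary keys, in insertion order.
print(pds_cars.index)

index_labels = list(pds_cars.index)
```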

In most real-world use cases, you do not create Pandas dataframes from Python lists and dictionaries; instead, you will want to load the entire contents of a CSV file straight into a dataframe. The following snippet shows how to load the contents of a CSV file that is included with the resources that accompany this lesson into a Pandas dataframe:
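The name of the CSV file is not given in the text, so the sketch below reads a small in-memory stand-in with pd.read_csv() instead. The stand-in holds only a few columns and three rows, with values taken from the output shown later in this appendix. With a real file, you would pass its path to read_csv() in place of the StringIO object.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file; an assumption made so the
# example runs without the accompanying resources.
csv_text = """PassengerId,Survived,Pclass,Sex,Fare,Age
1,0,3,male,7.25,22
2,1,1,female,71.2833,38
3,1,3,female,7.925,26
"""

# read_csv accepts a file path or any file-like object.
df_iris = pd.read_csv(io.StringIO(csv_text))

print(df_iris.head())
```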

The dataframe created in this case is a matrix of columns and rows. You can get a list of column names by using the columns attribute:

# print the names of the columns
print (df_iris.columns)
 
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')

You can also create a dataframe by selecting a subset of named columns from an existing dataframe, as shown in the following example:
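A minimal sketch of column selection follows. The small df_iris dataframe built here is a stand-in for the loaded file, and its SibSp values are illustrative. The six selected column names match the df_iris_subset output shown later in this appendix.

```python
import pandas as pd

# Small stand-in for the full dataframe (an assumption for this example).
df_iris = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'SibSp': [1, 1, 0],
    'Fare': [7.25, 71.2833, 7.925],
    'Age': [22.0, 38.0, 26.0],
})

# Pass a list of column names inside the square brackets.
df_iris_subset = df_iris[['PassengerId', 'Survived', 'Pclass', 'Sex', 'Fare', 'Age']]

print(df_iris_subset.columns.tolist())
```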

Getting Dataframe Information

Pandas provides several useful functions to inspect the contents of a dataframe, get information on the memory footprint of the dataframe, and get statistical information on the columns of a dataframe. Some of the commonly used functions to inspect the contents of a dataframe are listed here:

shape: Use this attribute (not a function call) to find out the number of rows and columns in a dataframe.
head(n ): Use this function to inspect the first n rows of the dataset. If you do not specify a value for n, the default used is 5.
tail(n ): Use this function to inspect the last n rows of the dataset. If you do not specify a value for n, the default used is 5.
sample(n ): Use this function to inspect a random sample of n rows from the dataset. If you do not specify a value for n, the default used is 1.

The following code snippet demonstrates the use of the shape attribute and the head(), tail(), and sample() functions:

# how many rows and columns in the dataset
print (df_iris_subset.shape)
(891, 6)
 
# print first 5 rows
print (df_iris_subset.head())
 
     PassengerId Survived Pclass     Sex    Fare  Age
0              1        0      3    male  7.2500 22.0
1              2        1      1  female 71.2833 38.0
2              3        1      3  female  7.9250 26.0
3              4        1      1  female 53.1000 35.0
4              5        0      3    male  8.0500 35.0
 
# print last 3 rows
print (df_iris_subset.tail(3))
 
     PassengerId Survived Pclass    Sex  Fare  Age
888          889        0      3 female 23.45  NaN
889          890        1      1   male 30.00 26.0
890          891        0      3   male  7.75 32.0
 
# print a random sample of 10 rows
print (df_iris_subset.sample(10))
 
     PassengerId Survived Pclass    Sex    Fare  Age
710          711        1      1 female 49.5042 24.0
127          128        1      3   male  7.1417 24.0
222          223        0      3   male  8.0500 51.0
795          796        0      2   male 13.0000 39.0
673          674        1      2   male 13.0000 31.0
115          116        0      3   male  7.9250 21.0
451          452        0      3   male 19.9667  NaN
642          643        0      3 female 27.9000  2.0
853          854        1      1 female 39.4000 16.0
272          273        1      2 female 19.5000 41.0

Pandas provides several useful functions to get information on the statistical characteristics and memory footprint of the data. Some of the most commonly used functions to obtain statistical information are:

describe(): Provides information on the number of non-null values, mean, standard deviation, minimum value, maximum value, and quartiles of all numeric columns in the dataframe.
mean(): Provides the mean value of each column.
median(): Provides the median value of each column.
std(): Provides the standard deviation of the values of each column.
count(): Returns the number of non-null values in each column.
max(): Returns the largest value in each column.
min(): Returns the smallest value in each column.

The following snippet demonstrates the use of some of these statistical functions:

# get statistical information on numeric columns
print (df_iris_subset.describe())
 
       PassengerId   Survived     Pclass       Fare        Age
count   891.000000 891.000000 891.000000 891.000000 714.000000
mean    446.000000   0.383838   2.308642  32.204208  29.699118
std     257.353842   0.486592   0.836071  49.693429  14.526497
min       1.000000   0.000000   1.000000   0.000000   0.420000
25%     223.500000   0.000000   2.000000   7.910400  20.125000
50%     446.000000   0.000000   3.000000  14.454200  28.000000
75%     668.500000   1.000000   3.000000  31.000000  38.000000
max     891.000000   1.000000   3.000000 512.329200  80.000000
 
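# mean of all numeric columns
# (numeric_only=True is required on Pandas 2.0 and later,
# because the Sex column contains strings)
print (df_iris_subset.mean(numeric_only=True))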
 
PassengerId  446.000000
Survived       0.383838
Pclass         2.308642
Fare          32.204208
Age           29.699118
dtype: float64
 
# the following statement is identical to the previous one,
# as axis=0 (the default) computes one mean per column.
print (df_iris_subset.mean(axis=0, numeric_only=True))
 
PassengerId  446.000000
Survived       0.383838
Pclass         2.308642
Fare          32.204208
Age           29.699118
dtype: float64
 
# correlation between numeric columns
print (df_iris_subset.corr(numeric_only=True))
 
             PassengerId  Survived    Pclass      Fare       Age
PassengerId     1.000000 -0.005007 -0.035144  0.012658  0.036847
Survived       -0.005007  1.000000 -0.338481  0.257307 -0.077221
Pclass         -0.035144 -0.338481  1.000000 -0.549500 -0.369226
Fare            0.012658  0.257307 -0.549500  1.000000  0.096067
Age             0.036847 -0.077221 -0.369226  0.096067  1.000000
 
# number of non-null values in each column
print (df_iris_subset.count())
 
PassengerId  891
Survived     891
Pclass       891
Sex          891
Fare         891
Age          714
dtype: int64

The info() function provides information on the data type of each column and the total memory required to store the dataframe. The following snippet demonstrates the use of the info() function:

# get information on data types and memory footprint
# (info() prints its report directly and returns None)
df_iris_subset.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
PassengerId  891 non-null int64
Survived     891 non-null int64
Pclass       891 non-null int64
Sex          891 non-null object
Fare         891 non-null float64
Age          714 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 41.8+ KB

The isnull() function can be used to highlight the null values in a dataframe. The output of the isnull() function is a dataframe of the same dimensions as the original, with each location containing a Boolean value that is true if the corresponding location in the original dataframe is null. The use of the isnull() function is demonstrated in the following snippet:

# highlight the null values in a random sample of data
print (df_iris_subset.sample(10).isnull())
 
     PassengerId Survived Pclass   Sex  Fare   Age
385        False    False  False False False False
530        False    False  False False False False
166        False    False  False False False  True
334        False    False  False False False  True
603        False    False  False False False False
807        False    False  False False False False
819        False    False  False False False False
99         False    False  False False False False
767        False    False  False False False False
480        False    False  False False False False

If you want to quickly find out the number of null values in each column of the dataframe, use the sum() function with the isnull() function, as demonstrated in the following snippet:

# find out if there are any missing values in the data
print (df_iris_subset.isnull().sum())
 
PassengerId   0
Survived      0
Pclass        0
Sex           0
Fare          0
Age         177
dtype: int64

Selecting Data

Pandas provides powerful functions to select data from a dataframe. You can create a dataframe (or series) that contains a subset of the columns in an existing dataframe by specifying the names of the columns you want to select:

# extract a single column; double brackets return a one-column dataframe,
# whereas df_iris_subset['Pclass'] would return a series object
pds_class = df_iris_subset[['Pclass']]
print (pds_class.head())
 
   Pclass
0       3
1       1
2       3
3       1
4       3
 
# extract a specific subset of named columns into a new dataframe
df_test1 = df_iris_subset[['PassengerId', 'Age']]
print (df_test1.head())
 
  PassengerId  Age
0           1 22.0
1           2 38.0
2           3 26.0
3           4 35.0
4           5 35.0

You can create a dataframe that contains a subset of the rows in an existing dataframe by specifying a range of row index numbers:

# extract first 3 rows into a new data frame
df_test2 = df_iris_subset[0:3]
print (df_test2.head())
 
   PassengerId Survived Pclass    Sex    Fare  Age
0            1        0      3   male  7.2500 22.0
1            2        1      1 female 71.2833 38.0
2            3        1      3 female  7.9250 26.0

You can use the iloc indexer, which takes integer positions in square brackets, to extract a sub matrix from an existing dataframe:

# extract first 3 rows and 3 columns into a new dataframe
df_test3 = df_iris_subset.iloc[0:3,0:3]
print (df_test3.head())
 
   PassengerId Survived Pclass
0            1        0      3
1            2        1      1
2            3        1      3

You can also use comparison operators to extract from a dataframe all rows that fulfill a specific criterion. For example, the following snippet will extract all rows from the df_iris_subset dataframe that have a value greater than 26 in the Age column:

# extracting all rows where Age > 26 into a new dataframe
df_test4 = df_iris_subset[df_iris_subset['Age'] > 26]
print (df_test4.count())
 
PassengerId    395
Survived       395
Pclass         395
Sex            395
Fare           395
Age            395
dtype: int64

In addition to the functions covered in this appendix, Pandas provides several others, including functions that can be used to sort data and techniques to use a standard Python function as a filter function over all the values of a dataframe. To find out more about the capabilities of Pandas, visit http://pandas.pydata.org.
