Appendix D
Introduction to NumPy and Pandas

In this appendix you will learn to use two popular Python libraries used by data scientists: NumPy and Pandas. These libraries are commonly used during the data exploration and feature engineering phase of a project. The examples in this appendix require the use of Jupyter Notebooks.

NumPy

NumPy is a math library for Python that allows for fast and efficient manipulation of arrays. The main object provided by NumPy is a homogeneous multidimensional array called ndarray. All the elements of an ndarray must be of the same data type, and dimensions are referred to as axes in NumPy terminology.

To use NumPy in a Python project, you typically add the following import statement to your Python file:

import numpy as np

The lowercase alias np is a standard convention for referring to NumPy in Python projects.

Creating NumPy Arrays

You can create an ndarray in a number of ways. To create an ndarray object out of a three-element Python list, use the np.array() function. The following code snippet, when typed in a Jupyter Notebook, creates a NumPy array with three elements:
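A minimal sketch of such a snippet follows. The element values are illustrative; the variable name x matches the name used in the surrounding text.

```python
import numpy as np

# Create a one-axis ndarray from a three-element Python list.
# The values 10, 20 and 30 are illustrative.
x = np.array([10, 20, 30])
print(x)
```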

Note the use of the square brackets within the parentheses. Omitting the square brackets will result in an error. The ndarray x has one axis and three elements. Unlike arrays created with the array class in the Python standard library's array module, NumPy arrays can be multidimensional. The following statements create a NumPy array with two axes, and print the contents of the ndarray object:
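One way to write these statements is sketched here. The values are taken from the matrix shown in the text; the variable name points is an assumption.

```python
import numpy as np

# Each inner list becomes one row of the two-axis array.
points = np.array([[11, 28, 9],
                   [56, 38, 91],
                   [33, 87, 36],
                   [87, 8, 4]])
print(points)
```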

The ndarray created by the preceding statement will have two axes, and can be represented for visualization purposes as a two-dimensional matrix with four rows and three columns:

11   28   9
56   38   91
33   87   36
87   8    4

The first axis contains four elements (the number of rows), and the second axis contains three elements (the number of columns). Table D.1 contains some of the commonly used attributes of ndarrays.

TABLE D.1: Commonly Used Ndarray Attributes

ATTRIBUTE   DESCRIPTION
ndim        Returns the number of axes.
shape       Returns the number of elements along each axis.
size        Returns the total number of elements in the ndarray.
dtype       Returns the data type of the elements in the ndarray.
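As an illustration, the attributes in Table D.1 can be read from the four-row, three-column array described earlier. The variable name points is an assumption.

```python
import numpy as np

points = np.array([[11, 28, 9],
                   [56, 38, 91],
                   [33, 87, 36],
                   [87, 8, 4]])

print(points.ndim)   # 2 (two axes)
print(points.shape)  # (4, 3)
print(points.size)   # 12
print(points.dtype)  # default integer type; the exact name depends on the platform
```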

All elements in an ndarray must have the same data type. NumPy provides its own data types, the most commonly used of which are np.int16, np.int32, and np.float64. Unlike Python, NumPy provides multiple data types for a particular data class. This is similar to the C language concept of short, int, long, signed, and unsigned variants of a data class. For example, NumPy provides the following data types for signed integers:

  • byte: This is compatible with a C char.
  • short: This is compatible with a C short.
  • intc: This is compatible with a C int.
  • int_: This is compatible with a Python int.
  • longlong: This is compatible with a C long long.
  • intp: This data type can be used to represent a pointer. The number of bytes depends on the processor architecture and operating system your code is running on.
  • int8: An 8-bit signed integer.
  • int16: A 16-bit signed integer.
  • int32: A 32-bit signed integer.
  • int64: A 64-bit signed integer.

Some NumPy data types are compatible with Python; these usually end in an underscore character (such as int_, float_). You can find a complete list of NumPy data types at https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.scalars.html#arrays-scalars-built-in.

You can specify the data type when creating ndarrays as illustrated in the following snippet:
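A short sketch, assuming illustrative values and the variable name y:

```python
import numpy as np

# The dtype argument fixes the element type instead of letting NumPy infer it.
y = np.array([1, 2, 3], dtype=np.int16)
print(y.dtype)  # int16
```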

When you specify the elements in the array at the time of creation, but do not specify the data type, NumPy attempts to guess the most appropriate data type. When there are no elements to infer from, as with the placeholder-creation functions described next, the default data type is float_, which is compatible with a Python float.

When creating a NumPy array, if you do not know the values of the elements, but know the size and number of axes, you can use one of the following functions to create ndarrays with placeholder content:

  • zeros: Creates an ndarray of specified dimensions, with each element being zero.
  • ones: Creates an ndarray of specified dimensions, with each element being one.
  • empty: Creates an uninitialized ndarray of specified dimensions.
  • random.random: Creates an array of random values that lie in the half-open interval [0.0, 1.0).

As an example, if you wanted to create an ndarray with four rows and three columns, where each element is an int16 with value 1, you would use a statement similar to the following:
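Such a statement might look like the following sketch. The variable name all_ones is an assumption.

```python
import numpy as np

# The shape is passed as a tuple: four rows, three columns.
all_ones = np.ones((4, 3), dtype=np.int16)
print(all_ones)
```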

The following statement would create an ndarray with two rows and three columns populated with random numbers. The data type of random numbers is float_:
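A minimal sketch, with the variable name r chosen for illustration. The printed values differ on every run.

```python
import numpy as np

# Two rows and three columns of values drawn from [0.0, 1.0).
r = np.random.random((2, 3))
print(r)
print(r.dtype)  # float64
```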

NumPy provides methods to create sequences of numbers. The most commonly used are:

  • arange: Creates a single-axis ndarray with evenly spaced elements, given the step between consecutive elements.
  • linspace: Creates a single-axis ndarray with evenly spaced elements, given the number of elements to generate.

The arange function is similar to the Python range function in functionality. The arange function takes four arguments: the first value is the start of the range, the second is the end of the range, the third is the step increment between numbers within the range, and the fourth is an optional parameter that allows you to specify the data type.

For example, the following snippet creates an ndarray, the elements of which lie in the range [0, 9). Each element in the ndarray is greater than the previous element by three:
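A sketch of the statement, with the variable name seq as an assumption:

```python
import numpy as np

# start=0, stop=9 (excluded), step=3
seq = np.arange(0, 9, 3)
print(seq)
```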

The upper limit of the range is not included in the numbers that are generated by the arange function. Therefore, the ndarray generated by the preceding statement will have the following elements, and not include the number 9:

0, 3, 6

If you want a sequence of integers from 0 up to a specific number, you can call the arange function with a single argument. The following statements achieve identical results:
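The equivalence can be sketched as follows. The upper bound of 5 and the variable names are illustrative.

```python
import numpy as np

a = np.arange(0, 5, 1)  # explicit start, stop and step
b = np.arange(0, 5)     # step defaults to 1
c = np.arange(5)        # start defaults to 0

print(a, b, c)  # each prints [0 1 2 3 4]
```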

The linspace function is similar to arange in that it generates a sequence of numbers that lie within a range. The difference is that the third argument is the number of values to generate, counting the start and end values. The following statement creates an ndarray with three elements, between 0 and 9:
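A minimal sketch, with the variable name spaced as an assumption:

```python
import numpy as np

# Three evenly spaced values from 0 to 9, both endpoints included.
spaced = np.linspace(0, 9, 3)
print(spaced)  # [0.  4.5 9. ]
```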

Unlike the arange function, the linspace function ensures that the specified lower and upper bounds are part of the sequence.

Modifying Arrays

NumPy provides several functions that allow you to modify the contents of arrays. While it is not possible to cover each of them in this appendix, the most commonly used operations will be discussed.

ARITHMETIC OPERATIONS

NumPy allows you to perform element-wise arithmetic operations between two arrays. The result of the operation is stored in a new array. The +, -, /, and * operators retain their arithmetic meaning. The following code snippet demonstrates how to perform arithmetic operations on two ndarrays:
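A short sketch with illustrative values; the variable names are assumptions.

```python
import numpy as np

a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])

total = a + b       # [11 22 33 44]
difference = a - b  # [ 9 18 27 36]
product = a * b     # [ 10  40  90 160]
quotient = a / b    # [10. 10. 10. 10.]

print(total, difference, product, quotient)
```

Each result is a new array; a and b are left unchanged.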

You can use the +=, -=, *=, and /= operators to perform in-place element-wise arithmetic operations. The results of these operations are not stored in a new array. The use of these operators is demonstrated in the following snippet:
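The sketch below uses illustrative floating-point values. Floats are chosen deliberately, because in-place true division on an integer array raises an error in NumPy.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

a += b    # a is now [11. 22. 33.]
a -= 1.0  # a is now [10. 21. 32.]
a *= 2.0  # a is now [20. 42. 64.]
a /= 4.0  # a is now [ 5.  10.5 16. ]

print(a)
```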

The exponent operator is represented by two asterisk symbols (**). The following statements demonstrate the use of the exponent operator:
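A minimal sketch with illustrative values:

```python
import numpy as np

a = np.array([1, 2, 3])

squares = a ** 2                   # scalar exponent: [1 4 9]
powers = a ** np.array([3, 2, 1])  # element-wise exponents: [1 4 3]

print(squares, powers)
```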

COMPARISON OPERATIONS

NumPy provides the standard comparison operators <, >, <=, >=, !=, and ==. The comparison operators can be used with ndarrays of the same size or a scalar. The result of using a comparison operator is an ndarray of Booleans. The use of comparison operators is demonstrated in the following snippet:
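A short sketch with illustrative values, comparing two arrays of the same size and then an array with a scalar:

```python
import numpy as np

a = np.array([1, 5, 3])
b = np.array([2, 5, 1])

less_than = a < b      # [ True False False]
equal = a == b         # [False  True False]
at_least_three = a >= 3  # [False  True  True]

print(less_than, equal, at_least_three)
```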

MATRIX OPERATIONS

NumPy provides the ability to perform matrix operations on ndarrays. The following list contains some of the most commonly used matrix operations:

  • inner: Performs a dot product between two arrays.
  • outer: Performs the outer product between two arrays.
  • cross: Performs the cross product between two arrays.
  • transpose: Swaps the rows and columns of an array.

The use of matrix operations is demonstrated in the following snippet:
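A minimal sketch of the four operations listed above, using illustrative three-element vectors and a small two-row matrix:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
m = np.array([[1, 2, 3],
              [4, 5, 6]])

dot = np.inner(u, v)        # 1*4 + 2*5 + 3*6 = 32
outer = np.outer(u, v)      # 3 x 3 array; outer[i, j] == u[i] * v[j]
cross = np.cross(u, v)      # [-3  6 -3]
flipped = np.transpose(m)   # shape changes from (2, 3) to (3, 2)

print(dot)
print(outer)
print(cross)
print(flipped)
```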

Indexing and Slicing

NumPy provides the ability to index elements in an array as well as slice larger arrays into smaller ones. NumPy array indexes are zero-based. The following code snippet demonstrates how to index and slice one-dimensional arrays:
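A short sketch with an illustrative five-element array:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

first = a[0]        # 10
last = a[-1]        # 50 (negative indexes count from the end)
middle = a[1:4]     # [20 30 40]; the stop index is excluded
every_other = a[::2]  # [10 30 50]

print(first, last, middle, every_other)
```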

Indexing a multidimensional array requires you to provide a tuple with the value for each axis. The following code snippet demonstrates how to index and slice multidimensional arrays:
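The sketch below reuses the four-row, three-column values from earlier in this appendix. The variable names are assumptions.

```python
import numpy as np

m = np.array([[11, 28, 9],
              [56, 38, 91],
              [33, 87, 36],
              [87, 8, 4]])

element = m[1, 2]     # row 1, column 2: 91
first_row = m[0]      # [11 28  9]
second_col = m[:, 1]  # [28 38 87  8]
block = m[1:3, 0:2]   # rows 1-2, columns 0-1: [[56 38] [33 87]]

print(element, first_row, second_col)
print(block)
```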

Pandas

Pandas is a free, open source data analysis library for Python and is one of the most commonly used tools for data munging. The key objects provided by Pandas are the series and dataframe. A Pandas series is similar to a one-dimensional list and a dataframe is similar to a two-dimensional table. One of the key differences between Pandas dataframes and NumPy arrays is that the columns in a dataframe object can have different data types, and can even handle missing values.

To use Pandas in a Python project, you typically add the following import statement to your Python file:

import pandas as pd

The lowercase alias pd is a standard convention for referring to Pandas in Python projects.

Creating Series and Dataframes

You have many ways to create a series and a dataframe object with Pandas. The simplest way is to create a Pandas series out of a Python list, as demonstrated in the following snippet:
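A minimal sketch, with illustrative values and the variable name s as an assumption:

```python
import pandas as pd

# Build a series from a plain Python list; Pandas adds a 0-based index.
s = pd.Series([10, 20, 30])
print(s)
```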

A Pandas series contains an additional index column, which contains a unique integer value for each row of the series. In most cases, Pandas automatically creates this index column, and the index value can be used to select an item using square brackets [ ]:
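For instance, assuming the same illustrative series as before:

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# Select the item whose index value is 1.
second = s[1]
print(second)  # 20
```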

If your data is loaded into a Python dictionary, you can convert the dictionary into a Pandas series. A series built from a Python dictionary does not, by default, have an integer index; the keys of the dictionary become the index labels. The following example shows how to convert a Python dictionary into a Pandas series:

Even though the series does not have a numeric index, you can still use numbers to select a value by position:

Since the series was created from a Python dictionary, the keys of the dictionary can be used to select a value:

The reason you are able to use the keys from the dictionary object is that Pandas is clever enough to create an Index object for your series. You can use the following statement to view the contents of the index of the series:
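The behavior described above (dictionary keys acting as the index, with positional access still available) is what Pandas provides when a dictionary of scalar values is passed to pd.Series. The sketch below assumes that constructor; the dictionary contents and variable names are hypothetical. Positional selection goes through iloc because recent Pandas versions deprecate plain integer keys on a non-integer index.

```python
import pandas as pd

# Hypothetical dictionary of scalar values.
car = {"doors": 4, "wheels": 4, "seats": 5}
car_series = pd.Series(car)

by_position = car_series.iloc[2]  # third item by position: 5
by_key = car_series["seats"]      # same item, selected by dictionary key: 5

print(by_position, by_key)
print(car_series.index)  # the Index object built from the dictionary keys
```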

In most real-world use cases, you do not create Pandas dataframes from Python lists and dictionaries; instead, you will want to load the entire contents of a CSV file straight into a dataframe. The following snippet shows how to load the contents of a CSV file that is included with the resources that accompany this lesson into a Pandas dataframe:
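The accompanying CSV file is not reproduced here, so the sketch below substitutes a small in-memory CSV with hypothetical columns (Name, Age, Fare). With a real file, you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for the CSV file on disk; the columns and rows are hypothetical.
csv_text = io.StringIO(
    "Name,Age,Fare\n"
    "Alice,29,72.5\n"
    "Bob,41,13.0\n"
    "Carol,,8.05\n"
)

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(csv_text)
print(df)
```

The empty Age field in the last row is loaded as a missing value.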

The dataframe created in this case is a matrix of columns and rows. You can get a list of column names by using the columns attribute:
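A short sketch, using a hypothetical dataframe in place of the one loaded from the CSV file:

```python
import pandas as pd

# Hypothetical data standing in for the loaded CSV contents.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [29.0, 41.0, None],
    "Fare": [72.5, 13.0, 8.05],
})

print(df.columns)  # an Index object holding the column names
column_names = list(df.columns)
```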

You can also create a dataframe by selecting a subset of named columns from an existing dataframe, as shown in the following example:
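A minimal sketch; the dataframe contents and the name df_subset are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [29.0, 41.0, None],
    "Fare": [72.5, 13.0, 8.05],
})

# Passing a list of column names returns a new dataframe with only those columns.
df_subset = df[["Name", "Age"]]
print(df_subset)
```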

Getting Dataframe Information

Pandas provides several useful functions to inspect the contents of a dataframe, get information on the memory footprint of the dataframe, and get statistical information on the columns of a dataframe. Some of the commonly used functions to inspect the contents of a dataframe are listed here:

  • shape: Use this attribute to find out the number of rows and columns in a dataframe.
  • head(n): Use this function to inspect the first n rows of the dataset. If you do not specify a value for n, the default used is 5.
  • tail(n): Use this function to inspect the last n rows of the dataset. If you do not specify a value for n, the default used is 5.
  • sample(n): Use this function to inspect a random sample of n rows from the dataset. If you do not specify a value for n, the default used is 1.

The following code snippet demonstrates the use of the shape attribute and the head(), tail(), and sample() functions:
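A short sketch on a hypothetical eight-row dataframe. The rows returned by sample differ from run to run.

```python
import pandas as pd

# Hypothetical data; eight rows so that head() and tail() return a subset.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.1, 51.86, 21.07, 11.13, 30.07],
})

print(df.shape)      # (8, 2): rows first, then columns
first_rows = df.head()      # first 5 rows by default
last_rows = df.tail(3)      # last 3 rows
random_rows = df.sample(2)  # 2 rows chosen at random

print(first_rows)
print(last_rows)
print(random_rows)
```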

Pandas provides several useful functions to get information on the statistical characteristics and memory footprint of the data. Some of the most commonly used functions to obtain statistical information are:

  • describe(): Provides information on the number of non-null values, mean, standard deviation, minimum value, maximum value, and quartiles of all numeric columns in the dataframe.
  • mean(): Provides the mean value of each column.
  • median(): Provides the median value of each column.
  • std(): Provides the standard deviation of the values of each column.
  • count(): Returns the number of non-null values in each column.
  • max(): Returns the largest value in each column.
  • min(): Returns the smallest value in each column.

The following snippet demonstrates the use of some of these statistical functions:
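The sketch below uses a hypothetical dataframe with numeric columns only, so that every function applies to every column. One Age value is missing to show that it is excluded from the statistics.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, None],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})

print(df.describe())  # count, mean, std, min, quartiles and max per column

means = df.mean()      # Age 30.0, Fare 25.0
medians = df.median()  # Age 30.0, Fare 25.0
counts = df.count()    # Age 3, Fare 4 (non-null values only)
largest = df.max()     # Age 40.0, Fare 40.0
smallest = df.min()    # Age 20.0, Fare 10.0

print(means, medians, counts, largest, smallest, sep="\n")
```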

The info() function provides information on the data type of each column and the total memory required to store the dataframe. The following snippet demonstrates the use of the info() function:
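A minimal sketch on a hypothetical dataframe. The exact memory figure in the output depends on the platform and Pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [29.0, 41.0, None],
})

# Prints the column names, non-null counts, data types and memory usage.
df.info()
```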

The isnull() function can be used to highlight the null values in a dataframe. The output of the isnull() function is a dataframe of the same dimensions as the original, with each location containing a Boolean value that is true if the corresponding location in the original dataframe is null. The use of the isnull() function is demonstrated in the following snippet:
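A short sketch, assuming a hypothetical dataframe with one missing Age value:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [29.0, 41.0, None],
})

# Same shape as df; True marks the locations that hold a null value.
mask = df.isnull()
print(mask)
```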

If you want to quickly find out the number of null values in each column of the dataframe, use the sum() function with the isnull() function, as demonstrated in the following snippet:
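For example, on the same kind of hypothetical dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [29.0, 41.0, None],
})

# True counts as 1 when summed, so this yields the null count per column.
null_counts = df.isnull().sum()
print(null_counts)
```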

Selecting Data

Pandas provides powerful functions to select data from a dataframe. You can create a dataframe (or series) that contains a subset of the columns in an existing dataframe by specifying the names of the columns you want to select:
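A minimal sketch on a hypothetical dataframe. A single column name returns a series, while a list of names returns a dataframe.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [29.0, 41.0, None, 19.0],
    "Fare": [72.5, 13.0, 8.05, 26.0],
})

ages = df["Age"]                # one column: a series
name_age = df[["Name", "Age"]]  # several columns: a dataframe

print(ages)
print(name_age)
```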

You can create a dataframe that contains a subset of the rows in an existing dataframe by specifying a range of row index numbers:
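A short sketch, again with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [29.0, 41.0, None, 19.0],
})

# Rows 1 and 2; as with Python lists, the stop value is excluded.
rows = df[1:3]
print(rows)
```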

You can use the iloc indexer, which takes row and column positions in square brackets, to extract a submatrix from an existing dataframe:
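A minimal sketch on a hypothetical dataframe; both the row range and the column range are given as positions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [29.0, 41.0, None, 19.0],
    "Fare": [72.5, 13.0, 8.05, 26.0],
})

# First two rows, second and third columns (Age and Fare).
sub = df.iloc[0:2, 1:3]
print(sub)
```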

You can also use comparison operators to extract from a dataframe all rows that fulfill a specific criterion. For example, the following snippet will extract all rows from the df_iris_subset dataframe that have a value greater than 26 in the Age column:
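The sketch below keeps the dataframe name used in the text, but its contents are hypothetical stand-ins, as is the variable name older.

```python
import pandas as pd

# Hypothetical contents for the dataframe named in the text.
df_iris_subset = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [29.0, 41.0, None, 19.0],
})

# The comparison yields a Boolean series; indexing with it keeps the True rows.
older = df_iris_subset[df_iris_subset["Age"] > 26]
print(older)
```

Rows with a missing Age compare as False and are therefore left out of the result.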

In addition to the functions covered in this appendix, Pandas provides several others, including functions that can be used to sort data and techniques to use a standard Python function as a filter function over all the values of a dataframe. To find out more about the capabilities of Pandas, visit http://pandas.pydata.org.
