NumPy Arrays
NumPy arrays are perhaps the most important and ubiquitous non-native data storage object in the Python data science ecosystem. In this section, we will learn how to construct and manipulate NumPy arrays.
NumPy Array Construction
We can construct a NumPy array by passing a list into the np.array() constructor function. For instance, arr = np.array([0, 1, 2, 3, 4, 5]) creates a NumPy array named arr holding the values 0, 1, 2, 3, 4, 5.
There are many instances in which we want a NumPy array of elements organized by some sort of pattern or identity, but don’t desire to type out manually – like a 1000-element-long array of zeros or an array that counts from 0 to 10⁶. NumPy offers several helpful functions for common pattern-based arrays you may want to generate:
np.arange(start, stop, step) acts like range() in native Python, taking in start and stop values as well as an optional step size (1 by default). For instance, np.arange(1, 10, 2) creates an array with values [1, 3, 5, 7, 9]. Recall that the end/stop value is not an inclusive bound (i.e., it is not included in the resulting array). Using a negative step value allows for counting backward, in which case start > stop.
np.linspace(start, end, num) returns an array of length num elements, equally spaced from the first number start to the end number end (inclusive). For instance, np.linspace(1, 10, 5) creates an array with values [1., 3.25, 5.5, 7.75, 10.].
np.zeros(shape) takes in a tuple and initializes an array of that shape with all zeros. For instance, np.zeros((2, 2, 2)) returns the NumPy array with contents [[[0, 0], [0, 0]], [[0, 0], [0, 0]]].
np.ones(shape) takes in a tuple and initializes an array of that shape with all ones. For instance, np.ones((2, 2, 2)) returns the NumPy array with contents [[[1, 1], [1, 1]], [[1, 1], [1, 1]]].
np.random.uniform(low, high, shape) takes in a low bound and a high bound and fills an array with the given shape with uniform-randomly sampled values from that range. If no tuple is provided for the shape parameter, the function returns a single value instead of an array.
np.random.normal(mean, std, shape) takes in a mean and a standard deviation and fills an array with the given shape with values sampled from a normal distribution with that mean and standard deviation. If no tuple is provided for the shape parameter, the function returns a single value instead of an array.
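These constructors can be sketched in a short session; a minimal example, assuming numpy is installed (the random draws will of course vary between runs):

```python
import numpy as np

# Evenly spaced integer values (stop is exclusive)
print(np.arange(1, 10, 2))        # [1 3 5 7 9]

# num evenly spaced values, endpoints inclusive
print(np.linspace(1, 10, 5))      # [ 1.    3.25  5.5   7.75 10.  ]

# Constant-valued arrays built from a shape tuple
print(np.zeros((2, 3)))
print(np.ones((2, 3)))

# Random sampling: a shape tuple yields an array; no shape yields a scalar
print(np.random.uniform(0, 1, (2, 2)))
print(np.random.normal(0, 1, (2, 2)))
```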
NumPy arrays are of type numpy.ndarray – here, “n-d” indicates that the array can be of any integer n dimensions. The examples we have explored so far are one-dimensional, but we can also build arrays of higher dimensions. A two-dimensional array is an array in which each element itself holds another list/array of elements. A three-dimensional array is an array in which each element holds another list/array of elements and each element of that array holds a third level of list/array – and so on. The shape of an array indicates its dimensionality and the length/size of each dimension. For instance, the shape (128, 64, 32) indicates that the corresponding array is three-dimensional and has 128 lists, each containing 64 lists of 32 elements each.
NumPy arrays can be reshaped into any desired shape, as long as the total number of elements is the same in the resulting array. For instance, np.arange(100) returns an array with values [0, 1, 2, ..., 98, 99], but np.arange(100).reshape((10,10)) organizes the 100 elements in 10 arrays of 10 elements each, like [[0, 1, ..., 8, 9], [10, 11, ..., 18, 19], ..., [90, 91, ..., 98, 99]] (Figure A-1).
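A quick sketch of reshaping (remember that the total element count must match the target shape):

```python
import numpy as np

arr = np.arange(100)
grid = arr.reshape((10, 10))
print(grid.shape)   # (10, 10)
print(grid[0])      # [0 1 2 3 4 5 6 7 8 9]
print(grid[9])      # [90 91 92 93 94 95 96 97 98 99]

# The element counts must agree:
# np.arange(100).reshape((7, 7)) would raise a ValueError.
```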
Simple NumPy Indexing
Indexing one-dimensional NumPy arrays is the same as with native Python. For instance, arr[1:4] can be used to select the second through the fourth elements of the array arr = [0, 1, 2, 3, 4]. NumPy also works with Python’s negative indexing syntax. The index arr[1:-1] yields the same result as arr[1:4].
Indexing n-dimensional NumPy arrays follows a similar structure, in which the indexing syntax for each dimension of the n-d array is given in sequence. The expression np.arange(100).reshape((10, 10))[:5, :5] indexes the first five rows and the first five columns of the 10-by-10 array (Figure A-2). Note that the indexing specification for each dimension is separated by a comma.
If you wish to set indexing specifications for certain dimensions but not others, indicate the lack of an indexing range for a dimension by simply typing a colon ‘:’. For instance, Figure A-3 demonstrates the 10-by-10 array indexed via [:5, :] (left) and [5:, :] (right). Figure A-4 demonstrates the result obtained by indexing via [:, :5] and [:, 5:].
Another important concept to understand is the difference between the indexing commands [i] and [i:i+1]. Functionally, the two index the same information: calling np.array([0, 1, 2, 3])[1] indexes the second element (which has value 1), and calling np.array([0, 1, 2, 3])[1:2] begins indexing at the second element and stops at the third element (noninclusive) – this also only indexes the second element. The difference in actual result, however, is that the former syntax indexes a single element and the latter syntax indexes a range of elements, even if such a range includes only one element. Thus, np.array([0, 1, 2, 3])[1] returns 1, whereas np.array([0, 1, 2, 3])[1:2] returns np.array([1]).
As another exercise, consider the array initialized by np.zeros((5, 5, 5, 5, 5)) – what is the shape of the result of the index command [:, 0, 3:, 1:2, 2:4]? We can follow how the indexing specification for each dimension affects the shape of the resulting array: the lone colon keeps all 5 elements, the scalar index 0 removes its dimension entirely, 3: keeps 2 elements, 1:2 keeps 1, and 2:4 keeps 2 – giving shape (5, 2, 1, 2). You can verify this yourself by calling .shape on the indexed array.
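You can check the dimension-by-dimension reasoning directly:

```python
import numpy as np

arr = np.zeros((5, 5, 5, 5, 5))
sub = arr[:, 0, 3:, 1:2, 2:4]

# The scalar index 0 drops its axis entirely; slices keep theirs
print(sub.shape)   # (5, 2, 1, 2)
```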
Quantitative Manipulation
NumPy offers many functions to manipulate both native Python quantitative objects (integer and float types, for instance) and NumPy arrays (Table A-1). Familiar Python mathematical operations like addition, subtraction, multiplication, division, modulus, exponentiation, bitwise operations, and comparisons can be applied element-wise (i.e., the operation is applied to the ith index of the first array and the ith index of the second array if two arrays are involved).
Table A-1 Sample operations performed on NumPy arrays

| Array 1 Contents | Operation | Array 2 Contents | Result |
|---|---|---|---|
| 0, 1, 2 | + | 2, 1, 0 | 2, 2, 2 |
| 5, 4, 3 | - | 3, 2, 1 | 2, 2, 2 |
| 4, 2, 1 | * | 0.5, 1, 2 | 2, 2, 2 |
| 4, 2, 1 | / | 2, 1, 0.5 | 2, 2, 2 |
| 6, 8, 11 | % | 4, 3, 3 | 2, 2, 2 |
| 1, 2, 3 | ** | 2, 2, 2 | 1, 4, 9 |
| 2, 3, 4 | ^ | 2, 2, 2 | 0, 1, 6 |
| 2, 3, 4 | & | 2, 2, 2 | 2, 2, 0 |
| 2, 3, 4 | \| | 2, 2, 2 | 2, 3, 6 |
| -1, 0, 1 | < | 0, 0, 0 | True, False, False |
| -1, 0, 1 | == | 0, 0, 0 | False, True, False |
This is different from Python syntax. For instance, [0, 1, 2] + [3, 4, 5] will not return [3, 5, 7] but rather [0, 1, 2, 3, 4, 5] if we use standard lists and not NumPy arrays.
The two arrays involved in an operation must be the same length, unless one of the arrays is a repetition of the same value; in this case, that array can be replaced with a one-element array containing that value, or just that value by itself. For instance, np.arange(100) * np.array([2, 2, ..., 2]) can be replaced with np.arange(100) * np.array([2]) or np.arange(100) * 2.
Otherwise, applying relationships between two NumPy arrays of different lengths that do not fall in the previously mentioned category will yield a ValueError: operands could not be broadcast together.
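A minimal sketch of these element-wise operations and the broadcasting of scalars:

```python
import numpy as np

a = np.array([0, 1, 2])
b = np.array([2, 1, 0])
print(a + b)              # element-wise addition: [2 2 2]
print(a * 2)              # the scalar is broadcast: [0 2 4]
print(a * np.array([2]))  # a one-element array broadcasts the same way

try:
    a + np.array([1, 2])  # lengths 3 and 2: not broadcastable
except ValueError as e:
    print(e)              # operands could not be broadcast together ...
```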
NumPy also offers multiple mathematical functions that can be applied to a single value or array (in which case the function is applied element-wise and returns an array of the same length) (Table A-2).
Table A-2 Example NumPy functions

| Function | Usage | Function | Usage |
|---|---|---|---|
| Sine | np.sin(0) -> 0.0 | Floor | np.floor(2.4) -> 2.0 |
| Cosine | np.cos(0) -> 1.0 | Ceiling | np.ceil(2.4) -> 3.0 |
| Tangent | np.tan(0) -> 0.0 | Round | np.round(2.4) -> 2.0 |
| Arcsine | np.arcsin(0) -> 0.0 | Exponential | np.exp(0) -> 1.0 |
| Arccosine | np.arccos(1) -> 0.0 | Natural Log | np.log(np.e) -> 1.0 |
| Arctangent | np.arctan(0) -> 0.0 | Base 10 Log | np.log10(100) -> 2.0 |
| Maximum | np.max([1, 2]) -> 2 | Square Root | np.sqrt(9) -> 3.0 |
| Minimum | np.min([1, 2]) -> 1 | Absolute Value | np.abs(-2.5) -> 2.5 |
| Mean | np.mean([1, 2]) -> 1.5 | Median | np.median([1, 2, 3]) -> 2.0 |
These functions are efficient and very helpful in obtaining mathematical derivations from arrays. For instance, we may implement the sigmoid function σ(x) = 1/(1 + e^(−x)) as sigmoid = lambda x: 1/(1 + np.exp(-x)). This function can work with both single scalar values and NumPy arrays.
Advanced NumPy Indexing
This set of simple NumPy indexing techniques should cover the most important and common array manipulations you’ll perform. However, it can be incredibly helpful to learn the syntax of more advanced NumPy indexing methods, which allow you to express more complex operations in a syntactically compact form.
NumPy colon-and-bracket indexing accepts a third parameter (in addition to the start and stop indices) indicating the step size. For instance, np.arange(10)[2:6:2] indexes every other element from index 2 up to (but not including) 6: [2, 4]. As expected, leaving the start and stop indices unspecified while providing a step size indexes the entire array with the provided step size: np.arange(10)[::2] yields [0, 2, 4, 6, 8].
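A short demonstration of step-size slicing:

```python
import numpy as np

arr = np.arange(10)
print(arr[2:6:2])   # [2 4]
print(arr[::2])     # [0 2 4 6 8]
print(arr[::-1])    # a step of -1 reverses: [9 8 7 6 5 4 3 2 1 0]
```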
If working with arrays containing a large number of axes, NumPy offers the ellipsis (...) to represent lack of an indexing specification for certain dimensions. For instance, if we want to index only the first and last dimensions of an array z with shape (5, 5, 5, 5, 5, 5) and leave the rest untouched, we could write something like z[1:4, :, :, :, :, 1:4] without ellipsis notation or equivalently write (in much more compact form) z[1:4, ..., 1:4].
NumPy arrays also support reassignment. Individual elements can be changed via arr[index] = new_value. Multiple elements can be reassigned: consider an array defined by arr = np.arange(6); replacing the third through fifth elements with arr[2:5] = np.arange(3) yields arr as [0, 1, 0, 1, 2, 5]. Moreover, the indices do not need to be consecutive: arr[::2] = np.arange(3) yields arr as [0, 1, 1, 3, 2, 5].
Reassignment can be dangerous if you’re not paying attention. Consider the following series of array manipulations (Listing A-1): we initialize an array of numbers from 0 to 9 (inclusive), set a new array copy to that array, and then reassign the first element of the first array to 10.
arr = np.arange(10)
copy = arr
arr[0] = 10
Listing A-1 Danger of reassignment
As expected, the contents of arr are [10, 1, 2, 3, 4, 5, 6, 7, 8, 9]. However, the contents of copy are also [10, 1, 2, 3, 4, 5, 6, 7, 8, 9]! When we set copy to arr, we’re not actually copying the contents of arr: we’re creating another reference to the original array’s location in memory. Thus, when a reassignment is made to arr, it also appears in copy. In order to prevent this linking, we must physically copy the array; this can be done with copy = np.copy(arr) or copy = arr.copy(). Note that, unlike with native Python lists, copy = arr[:] does not create a copy: slicing a NumPy array returns a view into the same memory, so reassignments would remain linked.
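A quick demonstration of the difference between a plain reference, a slice view, and a true copy:

```python
import numpy as np

arr = np.arange(10)
alias = arr             # another reference to the same memory
view = arr[1:]          # slices of NumPy arrays are views, not copies
real_copy = arr.copy()  # an independent copy (np.copy(arr) also works)

arr[1] = 99
print(alias[1])         # 99 -- the alias sees the change
print(view[0])          # 99 -- so does the slice view
print(real_copy[1])     # 1  -- the copy is unaffected
```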
Note that the indices used in colon-bracket syntax (start:stop:step) are only specifying a set of indices generated by a given set of rules, which means there is no reason we can’t specify our own custom indices. If we want the second, fourth, and sixth elements of an array, we can index it with subset = arr[[1, 3, 5]]. The double brackets may feel unnecessary at first, but think of the command as a shorthand for two lines of code: indices = [1, 3, 5] and subset = arr[indices]. For specialized indexing, you can programmatically generate your own index lists.
For certain specialized indexing operations, however, NumPy can help us with conditional indexing. For instance, if we want to retrieve all items in an array that are larger than 3 in value, we can call arr[arr > 3]. Recall that arr > 3 returns a Boolean array in which each element is either True if the corresponding index element in arr satisfies the condition of being larger than 3 and False otherwise. When we index an array with these element-wise Boolean specifications, NumPy includes an element of arr if the corresponding Boolean is True and does not if it is False.
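A minimal sketch of Boolean-mask indexing:

```python
import numpy as np

arr = np.array([1, 5, 2, 8, 3])
mask = arr > 3
print(mask)        # [False  True False  True False]
print(arr[mask])   # [5 8]

# Conditions combine with & and | (parentheses are required)
print(arr[(arr > 1) & (arr < 8)])   # [5 2 3]
```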
NumPy Data Types
NumPy allows values in NumPy arrays to be stored in multiple different forms, or data types. Here are some of the most important:
Boolean (np.bool_)
Unsigned integers (np.uint8, np.uint16, np.uint32, np.uint64)
Signed integers (np.int8, np.int16, np.int32, np.int64)
Floats (np.float16, np.float32, np.float64)
When you initialize an array, you can pass in the desired data type with the dtype parameter, for instance, np.array([-1, 0, 1, 2, 3], dtype=np.int8).
You can cast (“convert”) one data type into another using arr.astype(np.datatype). Consider the following arrays (Listing A-2).
arr1 = np.array([1,2,3])
arr2 = arr1.astype(np.uint8)
Listing A-2 Casting data types
Calling arr1.dtype yields dtype('int64'); calling arr2.dtype yields dtype('uint8'). When we first construct arr1, integers are represented by the np.int64 type; they are then cast as unsigned integers into arr2 with no effect on the contents. However, note that casting to a lower representation size can alter the values of the array; for instance, casting an array with the values [-1, -2, -3] to np.uint8 yields [255, 254, 253] (negative values wrap around modulo 2⁸ = 256). As another example, casting an array with value [1.123456789] to np.float16 yields the value [1.123].
Generally, you’ll cast to a lower representation size more often than to a higher one, usually in response to memory/storage constraints. Casting is also often utilized to prepare images for image processing libraries, which may require image data to be stored in np.uint8 type to guarantee values consist only of integers from 0 to 255.
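A short sketch of this casting behavior (the int64 default assumes a typical 64-bit platform; the default integer type can differ elsewhere):

```python
import numpy as np

arr = np.array([-1, -2, -3])
print(arr.dtype)                # int64 on most 64-bit platforms
print(arr.astype(np.uint8))     # values wrap modulo 2**8: [255 254 253]

# Casting floats to an integer type truncates toward zero
print(np.array([1.9, -1.9]).astype(np.int8))   # [ 1 -1]
```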
Function Application and Vectorization
Often, we would like to apply a function element-wise to a NumPy array when that function is not already supported by NumPy. (You always want to use built-in functions when they exist, for potential efficiency gains.) For instance, say we would like to graph the piecewise function f(x) = x²/25 for x < 0 and f(x) = sin(x) ⋅ x² for x ≥ 0 over −5 ≤ x ≤ 5. The array containing the x-axis values can be generated with inputs = np.linspace(-5, 5, 100). In this case, we are sampling 100 points from the function, which is high enough precision for our visualization purposes.
We can implement the function as follows (Listing A-3).
def f(x):
    if x < 0: return x**2/25
    else: return np.sin(x) * x**2
Listing A-3 A custom piecewise function
However, simply applying f(inputs) yields a ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_33/829457706.py in <module>
----> 1 f(np.linspace(-5, 5, 100))
/tmp/ipykernel_33/3998949136.py in f(x)
1 def f(x):
----> 2 if x < 0: return x**2/25
3 else: return np.sin(x) * x**2
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Our multipart function involves some relatively complex logic (i.e., if statements and comparisons), and thus applying the function directly fails. In this case, the ValueError stems from using if with multiple Boolean values; since some elements of the Boolean array formed by x < 0 are True and some are False, Python cannot decide whether to execute the code within the if or not – the truth value of the array is ambiguous. Python cannot quite tell that we want it to apply the function element-wise; we must communicate this explicitly.
One manual method is to use list comprehension and create a new array formed by applying the function individually to each element of inputs: outputs = np.array([f(element) for element in inputs]). The shorter but functionally (get it?) equivalent alternative is to use function vectorization. np.vectorize takes in a Python function and returns another function that applies the original function element-wise: outputs = np.vectorize(f)(inputs) or alternatively vectorized = np.vectorize(f) and outputs = vectorized(inputs) for a longer but perhaps more readable representation.
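Both approaches can be compared directly, reusing the piecewise f from Listing A-3:

```python
import numpy as np

def f(x):
    if x < 0: return x**2/25
    else: return np.sin(x) * x**2

inputs = np.linspace(-5, 5, 100)

# List comprehension and np.vectorize produce the same output array
manual = np.array([f(element) for element in inputs])
vectorized = np.vectorize(f)(inputs)
print(np.allclose(manual, vectorized))   # True
```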
Function vectorization is also convenient when we want to apply element-wise operations to multiple inputs. For instance, we may want to return True when the sum of elements across three input arrays is larger than 10 and False otherwise (Listing A-4).
def f(x, y, z):
    if x + y + z > 10: return True
    return False
Listing A-4 A sample multi-input function
We can apply the function as follows (Listing A-5).
x = np.arange(0, 5)
y = np.arange(7, 2, -1)
z = np.arange(-1, 9, 2)
np.vectorize(f)(x, y, z)
# array([False, False, False, True, True])
Listing A-5 Using function vectorization on a function with multiple inputs
Note that even though some have observed minor speedups by using np.vectorize, the function is “provided primarily for convenience, not for performance” (from the NumPy documentation website).
NumPy Array Application: Image Manipulation
Let’s use our knowledge of NumPy arrays to have some fun with image manipulation. The skimage.io.imread function can take in a URL of an image and return it as a NumPy array. Our sample image will be a landscape view of the New York City skyline (Listing A-6, Figure A-5).
from skimage import io
import matplotlib.pyplot as plt
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/2/2b/NYC_Downtown_Manhattan_Skyline_seen_from_Paulus_Hook_2019-12-20_IMG_7347_FRD_%28cropped%29.jpg/1920px-NYC_Downtown_Manhattan_Skyline_seen_from_Paulus_Hook_2019-12-20_IMG_7347_FRD_%28cropped%29.jpg'
image = io.imread(url)
plt.figure(figsize=(10, 5), dpi=400)
plt.imshow(image)
plt.show()
Listing A-6 Loading a sample image
Calling image.shape yields the tuple (770, 1920, 3). This means the image is 770 pixels high and 1920 pixels wide. The image is in color, and so by convention it has three channels corresponding to red, green, and blue (RGB). We can separate the image into its “color composition” by indexing each of the channels independently and displaying the two-dimensional slice in the corresponding color (Listing A-7, Figure A-6).
for i, color in enumerate(['Reds', 'Greens', 'Blues']):  # channel order is R, G, B
    plt.figure(figsize=(10, 5), dpi=400)
    plt.imshow(image[:,:,i], cmap=color)
    plt.show()
Listing A-7 Separating and visualizing the individual red, green, and blue color maps of the single image from Figure A-5
Say we want to “collapse” the three-dimensional image into a two-dimensional one by converting it from color to grayscale. One natural approach is to take the mean of channels for each pixel, which can be implemented with np.mean(image, axis=2) (Listing A-8, Figure A-7). Here, the axis parameter indicates that we are taking the mean along the third axis, indicated by 2 just as the third element of a tuple is indexed with 2.
plt.figure(figsize=(10, 5), dpi=400)
plt.imshow(np.mean(image, axis=2), cmap='gray')
plt.show()
Listing A-8 Visualizing a mean-based grayscale representation
The result is a pretty good grayscale representation! We can also produce a similar effect by taking the maximum value of each channel per pixel with np.max(image, axis=2), which produces a vintage-looking “overexposed” grayscale representation (Figure A-8). And taking the minimum value for each pixel instead with np.min(image, axis=2) yields – as one may expect – a generally darker grayscale representation (Figure A-9).
We can augment the original image by adding noise (Listing A-9, Figures A-10 and A-11). An array of normally distributed noise with the same shape as the original image can be generated with noise_vector = np.random.normal(0, 40, (770, 1920, 3)). In this case, we center the mean at 0 and use standard deviation 40. Recall that images are generally stored with numerical pixel values between 0 and 255. A larger standard deviation will yield larger visual noise, whereas a smaller one will yield less visible noise. Additionally, note that we need to cast the noisy image as an unsigned 8-bit integer (between 0 and 2⁸ − 1 = 255) because the noise vector is drawn from a continuous distribution, yielding pixel values that are not within the valid set of integers from 0 to 255 inclusive accepted for image display. Trying to display the image without casting the array as uint8 type will yield a bizarre, mostly blank canvas.
noise_vector = np.random.normal(0, 40, (770, 1920, 3))
altered_image = image + noise_vector
display_image = altered_image.astype(np.uint8)
Listing A-9 Generating a noisy image by adding noise randomly drawn from a normal distribution to the image from Figure A-5
We can also adjust the mean value of the normal distribution from which values for the noise matrix are sampled to generally influence the “feel” of the image overall (Figure A-12, Figure A-13).
The features of the image can also be enhanced or dimmed by multiplying all the values in the image by some constant – 0 ≤ k < 1 to dim the image and k > 1 to enhance it. This is known as contrast. Note that we need to similarly cast the altered image as an unsigned 8-bit integer because multiplying each value by a non-integer factor does not guarantee the integer outcome required for picture display (Listing A-10). Observe that minute differences form harsh, colorful boundaries in high-contrast images due to the exaggerated quantitative value difference (Figure A-14).
for factor in [0.2, 0.6, 1.5, 3, 8]:
    altered_image = image * factor
    display_image = altered_image.astype(np.uint8)
    plt.figure(figsize=(10, 5), dpi=400)
    plt.imshow(display_image)
    plt.show()
Listing A-10 Generating and visualizing different levels of contrast by multiplying values in the sample image by varying factors
Another familiar parameter in image editing tools is brightness, which can be adjusted by adding or subtracting the same value to or from all pixels in an image, thus uniformly increasing or decreasing the array values.
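Brightness adjustment can be sketched on a small synthetic array (the 2×2 “image” below is a hypothetical stand-in for the loaded skyline; the np.clip call is an added detail that keeps values from wrapping around the uint8 range):

```python
import numpy as np

# A tiny synthetic 2x2 RGB "image" standing in for the downloaded one
image = np.array([[[100, 150, 200], [0, 50, 100]],
                  [[250, 200, 150], [30, 60, 90]]], dtype=np.uint8)

# Brighten by 40; clip to [0, 255] so bright pixels saturate rather than wrap
brighter = np.clip(image.astype(np.int16) + 40, 0, 255).astype(np.uint8)
print(brighter[0, 0])   # [140 190 240]
print(brighter[1, 0])   # [255 240 190] -- 250 + 40 saturates at 255
```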
It’s common knowledge that the New York City skyline just isn’t complete without King Kong and Godzilla battling it out. Let’s load in a sample image of the scene (Listing A-11, Figure A-15).
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/f4/KK_v_G_trailer_%281962%29.png/440px-KK_v_G_trailer_%281962%29.png'
beasts = io.imread(url)
plt.figure(figsize=(10, 5), dpi=400)
plt.imshow(beasts)
plt.show()
Listing A-11 Loading and displaying a sample image of King Kong battling Godzilla
We’ll use a simple bitwise AND operation between two sample images to merge them. In order to do this, we first need to make sure that the images are the same size. One way to ensure equivalent array shapes is to resize the higher-resolution image (the New York City skyline in this case) to the shape of the lower-resolution image. This can be accomplished with the Python cv2 computer vision library (OpenCV), which offers the helpful function cv2.resize: resized = cv2.resize(original, (width, height)). Note that cv2.resize expects the target size as (width, height) – the reverse of NumPy’s (height, width) shape order.
The resulting merge (Listing A-12, Figure A-16) isn’t bad!
import cv2
# cv2.resize expects (width, height), so we reverse NumPy's shape order
merged = cv2.resize(image, (beasts.shape[1], beasts.shape[0])) & beasts
plt.figure(figsize=(10, 5), dpi=400)
plt.imshow(merged)
plt.show()
Listing A-12 Generating and visualizing a “merger” by using the bitwise AND operation
(See more fun image manipulation with convolutions in Chapter 4!)
We were able to do a lot simply by manipulating NumPy arrays with minimal help from additional libraries! A strong grasp of NumPy will prove helpful not only for manipulating data stored in array form but for working with almost any data type in the Python data science ecosystem, given NumPy’s ubiquity.
Pandas DataFrames
While NumPy arrays allow for the efficient storage of raw data from images to tables to text in array form, their generality can limit how efficiently we work with specific types of data. Perhaps the best-developed library for working with table-based data is Pandas, which is built around the DataFrame, a two-dimensional container for tabular data. (In fact, Pandas is built upon NumPy!) With Pandas, you can read and write from and to files, select data, filter data, and transform data. Pandas is an essential tool in the context of tabular data: at the time of this book’s writing, there simply is no other Python library as well-maintained and appropriate for effective tabular data manipulation.
Constructing Pandas DataFrames
To construct a Pandas DataFrame from scratch, pass a dictionary into the pd.DataFrame() constructor in which each key is a string representing the column name and the value is a list or array representing its values (Listing A-13, Figure A-17).
df = pd.DataFrame({'a':[1, 2, 3],
'b':[4, 5, 6],
'c':[7, 8, 9]})
Listing A-13 Generating a dummy DataFrame
If the lists provided for each column are not the same length, you will encounter a ValueError: All arrays must be of the same length error.
This method of constructing DataFrames is especially helpful when attempting to create small DataFrames, for instance, as a dummy table to test out manipulations or record and collect data for visualizations.
You can accomplish the same outcome by first initializing a blank DataFrame by passing no information into the constructor and then creating the columns one by one (Listing A-14).
df = pd.DataFrame()
df['a'] = [1, 2, 3]
df['b'] = [4, 5, 6]
df['c'] = [7, 8, 9]
Listing A-14 Initializing a dummy DataFrame via column creation/assignment
It should be noted here that this operation is very similar to NumPy array reassignment mechanics (Listing A-15). The bracket notation allows for the indexing of an element or collection of elements along a given object’s axes. Note, however, that an element or dimension of an array needs to already exist to be reassigned in NumPy, whereas in Pandas the DataFrame can be empty before assignment.
arr = np.zeros((3, 3))
arr[0] = [1, 2, 3]
arr[1] = [4, 5, 6]
arr[2] = [7, 8, 9]
Listing A-15 Analog in NumPy to the column assignment operation in Listing A-14
DataFrame columns can be indexed using brackets and the column’s name (Listing A-16). This returns a Series object, which can be thought of as a dictionary. In a dictionary, each key is associated with a value; in a Series, each index is associated with a value.
df['a']
'''
Returns:
0    1
1    2
2    3
Name: a, dtype: int64
'''
Listing A-16 Result of indexing a single column in a Pandas DataFrame
Thus, we can obtain the first item of the series indexed in Listing A-16 by df['a'][0], which returns 1.
DataFrames are more like a dictionary than a list because the index is explicit and can be modified, even though it is ordered like a list by default.
You’ll more commonly be reading data from a file. For instance, if you want to read the data in a comma-separated value file, use data = pd.read_csv(path). Depending on the organization of your .csv file, you may need to specify certain delimiters. Pandas has corresponding reading functions for Excel spreadsheets (pd.read_excel), JSON (pd.read_json), HTML tables (pd.read_html), SQL data (pd.read_sql), and many other file types. Correspondingly, you can export Pandas DataFrames in a desired supported form – for example, data.to_csv(path) or data.to_excel(path). See the IO tools page on the Pandas documentation for a full and up-to-date list of Pandas file reading and processing functionality: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html.
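As a minimal sketch of the write/read round trip (the filename demo.csv is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.to_csv('demo.csv', index=False)   # write; index=False omits the index column
loaded = pd.read_csv('demo.csv')     # read it back
print(loaded.equals(df))             # True
```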
Simple Pandas Mechanics
We can write a function to construct a multiplication table, represented in Pandas with a DataFrame (Listing A-17, Figure A-18). The multiplication table is a square DataFrame with both indices and columns containing the integers from 1 to n (inclusive); each element within the table is the product of the corresponding index and column coordinates. The function initializes a blank DataFrame with the desired index and column values and then iteratively fills in the desired elements using standard array logic.
def makeTable(n = 10):
    table = pd.DataFrame(index=range(1, n+1),
                         columns=range(1, n+1))
    for num1 in table.columns:
        for num2 in table.index:
            table.loc[num2, num1] = num1 * num2
    return table
Listing A-17 A function that generates a multiplication table of an arbitrary n × n size using Pandas value reassignment
Recall that you can index a column in a DataFrame with a bracket. When we call table[5], we obtain the following displayed Series. In our context, this returns all the multiples of 5 from 1 ⋅ 5 to 100 ⋅ 5:
1 5
2 10
3 15
4 20
5 25
...
96 480
97 485
98 490
99 495
100 500
Name: 5, Length: 100, dtype: object
Say we want to view multiples of 5, 10, and 15 all at once. Rather than passing in a single reference to a column, we pass in a list of column references: table[[5, 10, 15]] returns the DataFrame shown in Figure A-19.
We can also index rows with .loc. You can pass in a single row to index or a list of rows to index. table.loc[[5, 10, 15]] returns the DataFrame shown in Figure A-20.
Naturally, if you want to specify indices for both columns and rows, you can chain individual index commands together: table[[5, 10, 15]].loc[[5, 10, 15]]. However, the preferred way is to take advantage of .loc, which supports simultaneous row and column indexing in .loc[row, col] format and therefore is more efficient than chaining separate calls. The equivalent command to index both rows and columns with [5, 10, 15] would be table.loc[[5, 10, 15], [5, 10, 15]] (Figure A-21).
Note that by selecting certain indices, our new table does not have “standard” indices 0, 1, 2, .... The data.reset_index() method pops out the original index and replaces it with a fresh “standard” index axis (Figure A-22). To prevent the popping out of the old axis as a new column, specify drop=True as an argument in the reset_index() method (Figure A-23).
In general, to drop a column or set of columns, call data.drop(col, axis=1, inplace=True) or data.drop([col1, col2, ...], axis=1, inplace=True). The 1 axis represents the columns, whereas the 0 axis represents the rows. If you want to drop certain rows instead, set axis = 0. The inplace argument determines whether to execute the drop command on the current object or on a copy. If inplace is set to False, the original DataFrame will not be altered, but another DataFrame with the dropped data will be returned.
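A quick sketch of dropping columns and rows on a dummy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

print(df.drop('a', axis=1).columns.tolist())         # ['b', 'c']
print(df.drop(['a', 'b'], axis=1).columns.tolist())  # ['c']
print(df.drop(0, axis=0).index.tolist())             # [1, 2]

# With inplace=False (the default), df itself is unchanged
print(df.columns.tolist())                           # ['a', 'b', 'c']
```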
To index a range of columns or indices, Pandas also supports native Python slice indexing. However, unlike Python, Pandas includes the end index. For instance, the command table.loc[90:100, 5:100:3] includes all rows from indices 90 to 100 (inclusive) and all columns from 5 to 100 with step size 3 (Figure A-24).
Say we want to be a bit mischievous and mess up the multiplication table. We can reassign, for instance, the name of each column to a different random name. In order to do this, we need to give Pandas a mapping between an original column name and a new column name in the form of a dictionary (Listing A-18, Figure A-25). Then, call the .rename() method from the DataFrame and specify columns=dictionary_mapping. The .rename() method has the same inplace parameter as the .drop() command.
newCol = {}
nums = list(range(1, 101))
np.random.shuffle(nums)
for i in range(len(nums)):
    newCol[nums[i]] = nums[(i+1) % 100]
table = table.rename(columns=newCol)
Listing A-18 A “sabotage” renaming operation that randomly assigns each original column to a new column name. This is done by shuffling the column names and setting each column name to be renamed as the column name after it. The mod 100 operation allows for “wrapping around” (i.e., the very last element is renamed to the name of the very first element)
In machine learning and deep learning, we are often interested in the scaling property of computational structures. (For a deep learning example, see Chapter 4, “Why Do We Need Convolutions?”) In the case of generating multiplication tables, we may wonder how the storage needed to hold the Pandas DataFrame in memory scales as we increase the dimension of the table n (Listing A-19, Figure A-26).
import sys

import matplotlib.pyplot as plt
from tqdm import tqdm

# makeTable(n) is the multiplication table builder defined earlier
x = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
y = [sys.getsizeof(makeTable(n=n)) / 1000 for n in tqdm(x)]
plt.figure(figsize=(10, 5), dpi=400)
plt.plot(x, y, color='black')
plt.grid()
plt.xlabel('$n$')
plt.ylabel('KB')
plt.title(r'Storage Size for Pandas $n \cdot n$ Multiplication DataFrames')
plt.show()
Listing A-19. Plotting out the storage scaling of the current multiplication table function
The storage size grows roughly quadratically, as expected. Even so, the storage size for large values of n becomes very large. While we cannot alter the computational complexity of the multiplication-table-building algorithm, we can generally improve the scaling by recognizing and cutting redundancies in the table.
For one, the table is symmetric about the diagonal extending from the top-left corner to the bottom-right corner (i.e., a ⋅ b = b ⋅ a). Thus, slightly less than half of the table contains duplicate information. Let’s alter our table function to fill in only unique multiplication equations by starting each column’s fill at the row matching the current column value (Listing
A-20).
def makeHalfTable(n=10):
    # Initialize an n-by-n table of NaNs with labels 1..n on both axes
    table = pd.DataFrame(index=range(1, n + 1),
                         columns=range(1, n + 1))
    for num1 in table.columns:
        # Only fill rows at or below the diagonal (num2 >= num1)
        for num2 in table.index[num1 - 1:]:
            table.loc[num2, num1] = num1 * num2
    return table
Listing A-20. Adapting the multiplication table function to fill in only half of the multiplication table
We see that all the values that were not filled in contain NaN (Figure
A-27), which are left there from initialization (created in line 2 of Listing
A-20).
Perhaps surprisingly, the scaling of this half-table method is only negligibly better than the complete table (Figure
A-28).
However, if we replace all the np.nan values with 0s, we obtain substantial storage savings. By returning table.fillna(0) – which, as the name suggests, fills all NA/null/NaN values with 0 – we get a multiplication table generator with much more lightweight storage scaling (Figure
A-29).
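The change is a one-line addition to Listing A-20; a sketch of the lighter version (the makeHalfTableFilled name is hypothetical):

```python
import pandas as pd

def makeHalfTableFilled(n=10):
    # Initialize an n-by-n table of NaNs with labels 1..n on both axes
    table = pd.DataFrame(index=range(1, n + 1),
                         columns=range(1, n + 1))
    for num1 in table.columns:
        # Only fill rows at or below the diagonal (num2 >= num1)
        for num2 in table.index[num1 - 1:]:
            table.loc[num2, num1] = num1 * num2
    # Replace the remaining NaNs with 0s to shrink the stored table
    return table.fillna(0)
```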
One simplistic explanation is that np.nan is a “bulky higher-level object,” whereas 0 is a “primitive Python value”; thus, it makes sense intuitively that Python can handle storage of a large number of 0s more efficiently than a large number of np.nan’s. There are, of course, many more complex low-level details that contribute to the optimization of storage and efficiency. This demonstration, however, shows some quick high-level approaches that can be used to cut redundancy and improve scaling.
Advanced Pandas Mechanics
Pandas provides several functions with advanced functionality for manipulating the contents of DataFrames. Let’s create a dummy DataFrame to use throughout this section to illustrate the various manipulation
functions (Listing
A-21).
construct_dict = {'foo': ['A'] * 3 + ['B'] * 3,
                  'bar': ['I', 'II', 'III'] * 2,
                  'baz': range(1, 7)}
dummy_df = pd.DataFrame(construct_dict)
Listing A-21. Constructing a dummy DataFrame
The contents of such a DataFrame are displayed in Table
A-3.
Table A-3. Contents of the sample DataFrame created in Listing A-21
|   | foo | bar | baz |
|---|-----|-----|-----|
| 0 | A   | I   | 1   |
| 1 | A   | II  | 2   |
| 2 | A   | III | 3   |
| 3 | B   | I   | 4   |
| 4 | B   | II  | 5   |
| 5 | B   | III | 6   |
Pivot
A pivot operation projects two existing columns of the data onto the axes of a new pivoted table (the index and the columns). It then fills the pivoted table’s elements with values from one or more additional columns, matched to each combination of the first two. Consider the scheme in Figure
A-30: we set “foo” and “bar” to be the
index and
columns of the new pivoted table (respectively) and use “baz” to fill in the
values of the table. Because “baz” is 1 when “foo” is “A” and “bar” is “I”, the element at index “A” and column “I” is 1.
Pivoting can be implemented as df.pivot(index=..., columns=..., values=...). In Figure A-30, we use the command dummy_df.pivot(index='foo', columns="bar", values="baz").
If you pass in more than one column to the values parameter via a list (Listing
A-22), Pandas creates a multilevel column to accommodate different values for combinations of the index and column features (Figure
A-31).
mod_dummy_df = dummy_df.copy()
mod_dummy_df['baz2'] = range(101, 107)
mod_dummy_df.pivot(index='foo', columns='bar',
                   values=['baz', 'baz2'])
Listing A-22. Code to perform a pivot
If there are multiple entries for the same combination of index and column features, Pandas will (as one may expect) throw an error: "Index contains duplicate entries, cannot reshape".
Pivoting is a convenient operation for automatically finding values at the combination of two features.
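As a quick check, the pivot from Figure A-30 can be reproduced on the dummy DataFrame from Listing A-21:

```python
import pandas as pd

dummy_df = pd.DataFrame({'foo': ['A'] * 3 + ['B'] * 3,
                         'bar': ['I', 'II', 'III'] * 2,
                         'baz': range(1, 7)})

# 'foo' becomes the index, 'bar' the columns, 'baz' the cell values
pivoted = dummy_df.pivot(index='foo', columns='bar', values='baz')
```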
Melt
Melting can be thought of as an “unpivot.” It converts matrix-based data (i.e., the index and column are both “significant,” instead of the index serving as a counter) into list-based data, whereas pivots do the opposite. You can think of the operation as “melting” away the rigid, organized structure of a matrix into a primitive stream of data, like complex ice sculptures melt into elementary puddles of water.
Consider a melting operation with an
ID feature “baz” and
value features “foo” and “bar” (Figure
A-32). Two columns in the melted DataFrame are created: “variable” and “value”. “value” holds the value stored by the feature name referenced in the “variable” column in the original dataset. The “baz” ID column is used to keep track of which row the melted variable-value pairs belong to.
Melting can be implemented as df.melt(id_vars=["baz"], value_vars=["foo", "bar"]).
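Applied to the dummy DataFrame from Listing A-21, this melt might look like:

```python
import pandas as pd

dummy_df = pd.DataFrame({'foo': ['A'] * 3 + ['B'] * 3,
                         'bar': ['I', 'II', 'III'] * 2,
                         'baz': range(1, 7)})

# 'baz' is kept as an identifier; 'foo' and 'bar' are melted into
# variable/value pairs, one row per original cell
melted = dummy_df.melt(id_vars=['baz'], value_vars=['foo', 'bar'])
```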
Explode
Exploding is an operation that separates columns containing lists into a melted-style form. Consider the schematic in Figure
A-33: the original DataFrame contains a list [1, 2] in index 0, where “foo” has value “A”, “bar” has value “I”, and “baz” has value 1. After the list is exploded, the exploded DataFrame will contain two entries at index 0, with the same values for non-exploded columns (“foo”, “bar”, “baz”) but separate values corresponding to items in the list (1 in one row, 2 in the other).
It’s generally unlikely that you’ll encounter a raw dataset with lists as a column. However, knowing that the explode function exists can be helpful when you’re artificially constructing DataFrames; rather than writing code to create a specific organization of elements, just create a column with relevant list values and explode it. While the end result may be the same, it becomes much simpler for you to implement.
Exploding can be implemented as df.explode('column_name'). You can also pass in a list of column names to explode if there are multiple columns containing list elements.
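A minimal sketch of an explode, using a small hypothetical DataFrame with a list-valued column:

```python
import pandas as pd

df = pd.DataFrame({'foo': ['A', 'B'],
                   'bar': [[1, 2], [3, 4, 5]]})

# Each list element gets its own row; the non-exploded 'foo' values
# (and the original index labels) are repeated for each element
exploded = df.explode('bar')
```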
Stack
The stack operation converts a two-dimensional Pandas DataFrame into a Series with a multilevel index (displayed like a one-column table) by rearranging the elements into “vertical” form.
In the stacking operation visualized in Figure
A-34, the value at row 0 of column “foo” becomes the entry whose first-level index is 0 and whose second-level index is “foo”.
To access multilevel data, simply call index-based retrieval twice, like df.loc[0].loc['foo'].
The code to stack is simply df.stack().
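Applied to the dummy DataFrame from Listing A-21, a stack might look like (note that df.stack() technically returns a Series, which displays like a one-column table):

```python
import pandas as pd

dummy_df = pd.DataFrame({'foo': ['A'] * 3 + ['B'] * 3,
                         'bar': ['I', 'II', 'III'] * 2,
                         'baz': range(1, 7)})

# Each (row, column) cell becomes one entry of a Series whose
# two-level index is (original row label, original column name)
stacked = dummy_df.stack()
first_value = stacked.loc[0].loc['foo']  # value at row 0, column 'foo'
```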
Unstack
True to its name, the
unstacking operation acts as an inverse to the stack operation, converting a stacked-style DataFrame with multilevel indices into a standard two-dimensional DataFrame. Performing a stack followed by an unstack operation produces no change in the DataFrame, except for the existence of a “0” introduced during stacking (Figure
A-35).
The code to unstack is simply df.unstack().
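A quick sketch of the stack/unstack round trip on the dummy DataFrame from Listing A-21:

```python
import pandas as pd

dummy_df = pd.DataFrame({'foo': ['A'] * 3 + ['B'] * 3,
                         'bar': ['I', 'II', 'III'] * 2,
                         'baz': range(1, 7)})

# Stacking then unstacking recovers the original two-dimensional layout
round_trip = dummy_df.stack().unstack()
```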