NumPy Arrays
NumPy arrays are perhaps the most important and ubiquitous non-native data storage object in the Python data science ecosystem. In this section, we will learn how to construct and manipulate NumPy arrays.
NumPy Array Construction
We can construct a NumPy array by passing a list into the np.array() constructor function. For instance, arr = np.array([0, 1, 2, 3, 4, 5]) creates a NumPy array named arr holding the values 0, 1, 2, 3, 4, 5.
There are many instances in which we want a NumPy array of elements organized by some sort of pattern or identity, but don’t desire to type out manually – like a 1000-element-long array of zeros or an array that counts from 0 to 10⁶. NumPy offers several helpful functions for common pattern-based arrays you may want to generate:
np.arange(start, stop, step) acts like range() in native Python, taking in start and stop values as well as an optional step size (1 by default). For instance, np.arange(1, 10, 2) creates an array with values [1, 3, 5, 7, 9]. Recall that the end/stop value is not an inclusive bound (i.e., it is not included in the resulting array). Using a negative step value allows for counting backward, in which case start > stop.
np.linspace(start, end, num) returns an array of length num elements, equally spaced from the first number start to the end number end (inclusive). For instance, np.linspace(1, 10, 5) creates an array with values [1., 3.25, 5.5, 7.75, 10.].
np.zeros(shape) takes in a tuple and initializes an array of that shape with all zeros. For instance, np.zeros((2, 2, 2)) returns the NumPy array with contents [[[0, 0], [0, 0]], [[0, 0], [0, 0]]].
np.ones(shape) takes in a tuple and initializes an array of that shape with all ones. For instance, np.ones((2, 2, 2)) returns the NumPy array with contents [[[1, 1], [1, 1]], [[1, 1], [1, 1]]].
np.random.uniform(low, high, shape) takes in a low bound and a high bound and fills an array with the given shape with uniform-randomly sampled values from that range. If no tuple is provided for the shape parameter, the function returns a single value instead of an array.
np.random.normal(mean, std, shape) takes in a mean and a standard deviation and fills an array with the given shape with values sampled from a normal distribution with that mean and standard deviation. If no tuple is provided for the shape parameter, the function returns a single value instead of an array.
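These constructors can be sketched in a short session; a minimal example, assuming numpy is installed (the random draws will of course vary between runs):

```python
import numpy as np

# Evenly spaced integer values (stop is exclusive)
print(np.arange(1, 10, 2))        # [1 3 5 7 9]

# num evenly spaced values, endpoints inclusive
print(np.linspace(1, 10, 5))      # [ 1.    3.25  5.5   7.75 10.  ]

# Constant-valued arrays built from a shape tuple
print(np.zeros((2, 3)))
print(np.ones((2, 3)))

# Random sampling: a shape tuple yields an array; no shape yields a scalar
print(np.random.uniform(0, 1, (2, 2)))
print(np.random.normal(0, 1, (2, 2)))
```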
NumPy arrays are of type numpy.ndarray – here, “n-d” indicates that the array can be of any integer n dimensions. The examples we have explored so far are one-dimensional, but we can also build arrays of higher dimensions. A two-dimensional array is an array in which each element itself holds another list/array of elements. A three-dimensional array is an array in which each element holds another list/array of elements and each element of that array holds a third level of list/array – and so on. The shape of an array indicates its dimensionality and the length/size of each dimension. For instance, the shape (128, 64, 32) indicates that the corresponding array is three-dimensional and has 128 lists, each containing 64 lists of 32 elements each.
NumPy arrays can be reshaped into any desired shape, as long as the total number of elements is the same in the resulting array. For instance, np.arange(100) returns an array with values [0, 1, 2, ..., 98, 99], but np.arange(100).reshape((10,10)) organizes the 100 elements in 10 arrays of 10 elements each, like [[0, 1, ..., 8, 9], [10, 11, ..., 18, 19], ..., [90, 91, ..., 98, 99]] (Figure A-1).
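A quick sketch of reshaping (remember that the total element count must match the target shape):

```python
import numpy as np

arr = np.arange(100)
grid = arr.reshape((10, 10))
print(grid.shape)   # (10, 10)
print(grid[0])      # [0 1 2 3 4 5 6 7 8 9]
print(grid[9])      # [90 91 92 93 94 95 96 97 98 99]

# The element counts must agree:
# np.arange(100).reshape((7, 7)) would raise a ValueError.
```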
Simple NumPy Indexing
Indexing one-dimensional NumPy arrays is the same as with native Python. For instance, arr[1:4] can be used to select the second through the fourth elements of the array arr = [0, 1, 2, 3, 4]. NumPy also works with Python’s negative indexing syntax. The index arr[1:-1] yields the same result as arr[1:4].
Indexing n-dimensional NumPy arrays follows a similar structure, in which the indexing syntax for each dimension of the n-d array is given in sequence. The expression np.arange(100).reshape((10, 10))[:5, :5] indexes the first five rows and the first five columns of the 10-by-10 array (Figure A-2). Note that the indexing specification for each dimension is separated by a comma.
If you wish to set indexing specifications for certain dimensions but not others, indicate the lack of an indexing range for a dimension by simply typing a colon ‘:’. For instance, Figure A-3 demonstrates the 10-by-10 array indexed via [:5, :] (left) and [5:, :] (right). Figure A-4 demonstrates the result obtained by indexing via [:, :5] and [:, 5:].
Another important concept to understand is the difference between the indexing commands [i] and [i:i+1]. Functionally, the two index the same information: calling np.array([0, 1, 2, 3])[1] indexes the second element (which has value 1), and calling np.array([0, 1, 2, 3])[1:2] begins indexing at the second element and stops at the third element (noninclusive) – this also only indexes the second element. The difference in actual result, however, is that the former syntax indexes a single element and the latter syntax indexes a range of elements, even if such a range includes only one element. Thus, np.array([0, 1, 2, 3])[1] returns 1, whereas np.array([0, 1, 2, 3])[1:2] returns np.array([1]).
As another exercise, consider the array initialized by np.zeros((5, 5, 5, 5, 5)) – what is the shape of the result of the index command [:, 0, 3:, 1:2, 2:4]? We can follow how the indexing specification for each dimension affects the shape of the resulting array: the lone colon keeps all 5 elements, the scalar index 0 removes its dimension entirely, 3: keeps 2 elements, 1:2 keeps 1, and 2:4 keeps 2 – giving shape (5, 2, 1, 2). You can verify this yourself by calling .shape on the indexed array.
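You can check the dimension-by-dimension reasoning directly:

```python
import numpy as np

arr = np.zeros((5, 5, 5, 5, 5))
sub = arr[:, 0, 3:, 1:2, 2:4]

# The scalar index 0 drops its axis entirely; slices keep theirs
print(sub.shape)   # (5, 2, 1, 2)
```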
Quantitative Manipulation
NumPy offers many functions to manipulate both native Python quantitative objects (integer and float types, for instance) and NumPy arrays (Table A-1). Familiar Python mathematical operations like addition, subtraction, multiplication, division, modulus, exponentiation, bitwise operations, and comparisons can be applied element-wise (i.e., the operation is applied to the ith index of the first array and the ith index of the second array if two arrays are involved).
Table A-1 Sample operations performed on NumPy arrays

| Array 1 Contents | Operation | Array 2 Contents | Result |
|---|---|---|---|
| 0, 1, 2 | + | 2, 1, 0 | 2, 2, 2 |
| 5, 4, 3 | - | 3, 2, 1 | 2, 2, 2 |
| 4, 2, 1 | * | 0.5, 1, 2 | 2, 2, 2 |
| 4, 2, 1 | / | 2, 1, 0.5 | 2, 2, 2 |
| 6, 8, 11 | % | 4, 3, 3 | 2, 2, 2 |
| 1, 2, 3 | ** | 2, 2, 2 | 1, 4, 9 |
| 2, 3, 4 | ^ | 2, 2, 2 | 0, 1, 6 |
| 2, 3, 4 | & | 2, 2, 2 | 2, 2, 0 |
| 2, 3, 4 | \| | 2, 2, 2 | 2, 3, 6 |
| -1, 0, 1 | < | 0, 0, 0 | True, False, False |
| -1, 0, 1 | == | 0, 0, 0 | False, True, False |
This is different from Python syntax. For instance, [0, 1, 2] + [3, 4, 5] will not return [3, 5, 7] but rather [0, 1, 2, 3, 4, 5] if we use standard lists and not NumPy arrays.
The two arrays involved in an operation must be the same length, unless one of the arrays is a repetition of the same value; in this case, that array can be replaced with a one-element array containing that value, or just that value by itself. For instance, np.arange(100) * np.array([2, 2, ..., 2]) can be replaced with np.arange(100) * np.array([2]) or np.arange(100) * 2.
Otherwise, applying relationships between two NumPy arrays of different lengths that do not fall in the previously mentioned category will yield a ValueError: operands could not be broadcast together.
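A minimal sketch of these element-wise operations and the broadcasting of scalars:

```python
import numpy as np

a = np.array([0, 1, 2])
b = np.array([2, 1, 0])
print(a + b)              # element-wise addition: [2 2 2]
print(a * 2)              # the scalar is broadcast: [0 2 4]
print(a * np.array([2]))  # a one-element array broadcasts the same way

try:
    a + np.array([1, 2])  # lengths 3 and 2: not broadcastable
except ValueError as e:
    print(e)              # operands could not be broadcast together ...
```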
NumPy also offers multiple mathematical functions that can be applied to a single value or array (in which case the function is applied element-wise and returns an array of the same length) (Table A-2).
Table A-2 Example NumPy functions

| Function | Usage | Function | Usage |
|---|---|---|---|
| Sine | np.sin(0) -> 0.0 | Floor | np.floor(2.4) -> 2.0 |
| Cosine | np.cos(0) -> 1.0 | Ceiling | np.ceil(2.4) -> 3.0 |
| Tangent | np.tan(0) -> 0.0 | Round | np.round(2.4) -> 2.0 |
| Arcsine | np.arcsin(0) -> 0.0 | Exponential | np.exp(0) -> 1.0 |
| Arccosine | np.arccos(1) -> 0.0 | Natural Log | np.log(np.e) -> 1.0 |
| Arctangent | np.arctan(0) -> 0.0 | Base 10 Log | np.log10(100) -> 2.0 |
| Maximum | np.max([1, 2]) -> 2 | Square Root | np.sqrt(9) -> 3.0 |
| Minimum | np.min([1, 2]) -> 1 | Absolute Value | np.abs(-2.5) -> 2.5 |
| Mean | np.mean([1, 2]) -> 1.5 | Median | np.median([1, 2, 3]) -> 2.0 |
These functions are efficient and very helpful in obtaining mathematical derivations from arrays. For instance, we may implement the sigmoid function σ(x) = 1/(1 + e^(−x)) as sigmoid = lambda x: 1/(1 + np.exp(-x)). This function can work with both single scalar values and NumPy arrays.
Advanced NumPy Indexing
This set of simple NumPy indexing techniques should cover the most important and common array manipulations you’ll perform. However, it can be incredibly helpful to learn the syntax of more advanced NumPy indexing methods, which allow you to express more complex operations in a syntactically compact form.
NumPy colon-and-bracket indexing accepts a third parameter (in addition to the start and stop indices) indicating the step size. For instance, np.arange(10)[2:6:2] indexes every other element from index 2 up to (but not including) 6: [2, 4]. As expected, leaving the start and stop indices unspecified while providing a step size indexes the entire array with the provided step size: np.arange(10)[::2] yields [0, 2, 4, 6, 8].
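A short demonstration of step-size slicing:

```python
import numpy as np

arr = np.arange(10)
print(arr[2:6:2])   # [2 4]
print(arr[::2])     # [0 2 4 6 8]
print(arr[::-1])    # a step of -1 reverses: [9 8 7 6 5 4 3 2 1 0]
```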
If working with arrays containing a large number of axes, NumPy offers the ellipsis (...) to represent lack of an indexing specification for certain dimensions. For instance, if we want to index only the first and last dimensions of an array z with shape (5, 5, 5, 5, 5, 5) and leave the rest untouched, we could write something like z[1:4, :, :, :, :, 1:4] without ellipsis notation or equivalently write (in much more compact form) z[1:4, ..., 1:4].
NumPy arrays also support reassignment. Individual elements can be changed via arr[index] = new_value. Multiple elements can be reassigned: consider an array defined by arr = np.arange(6); replacing the third through fifth elements with arr[2:5] = np.arange(3) yields arr as [0, 1, 0, 1, 2, 5]. Moreover, the indices do not need to be consecutive: arr[::2] = np.arange(3) yields arr as [0, 1, 1, 3, 2, 5].
Reassignment can be dangerous if you’re not paying attention. Consider the following series of array manipulations (Listing A-1): we initialize an array of numbers from 0 to 9 (inclusive), set a new array copy to that array, and then reassign the first element of the first array to 10.
arr = np.arange(10)
copy = arr
arr[0] = 10
Listing A-1 Danger of reassignment
As expected, the contents of arr are [10, 1, 2, 3, 4, 5, 6, 7, 8, 9]. However, the contents of copy are also [10, 1, 2, 3, 4, 5, 6, 7, 8, 9]! When we set copy to arr, we’re not actually copying the contents of arr: we’re creating another reference to the original array’s location in memory. Thus, when a reassignment is made to arr, it also appears in copy. In order to prevent this linking, we must physically copy the array; this can be done with copy = np.copy(arr) or copy = arr.copy(). Note that, unlike with native Python lists, copy = arr[:] does not create a copy: slicing a NumPy array returns a view into the same memory, so reassignments would remain linked.
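A quick demonstration of the difference between a plain reference, a slice view, and a true copy:

```python
import numpy as np

arr = np.arange(10)
alias = arr             # another reference to the same memory
view = arr[1:]          # slices of NumPy arrays are views, not copies
real_copy = arr.copy()  # an independent copy (np.copy(arr) also works)

arr[1] = 99
print(alias[1])         # 99 -- the alias sees the change
print(view[0])          # 99 -- so does the slice view
print(real_copy[1])     # 1  -- the copy is unaffected
```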
Note that the indices used in colon-bracket syntax (start:stop:step) are only specifying a set of indices generated by a given set of rules, which means there is no reason we can’t specify our own custom indices. If we want the second, fourth, and sixth elements of an array, we can index it with subset = arr[[1, 3, 5]]. The double brackets may feel unnecessary at first, but think of the command as a shorthand for two lines of code: indices = [1, 3, 5] and subset = arr[indices]. For specialized indexing, you can programmatically generate your own index lists.
For certain specialized indexing operations, however, NumPy can help us with conditional indexing. For instance, if we want to retrieve all items in an array that are larger than 3 in value, we can call arr[arr > 3]. Recall that arr > 3 returns a Boolean array in which each element is either True if the corresponding index element in arr satisfies the condition of being larger than 3 and False otherwise. When we index an array with these element-wise Boolean specifications, NumPy includes an element of arr if the corresponding Boolean is True and does not if it is False.
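A minimal sketch of Boolean-mask indexing:

```python
import numpy as np

arr = np.array([1, 5, 2, 8, 3])
mask = arr > 3
print(mask)        # [False  True False  True False]
print(arr[mask])   # [5 8]

# Conditions combine with & and | (parentheses are required)
print(arr[(arr > 1) & (arr < 8)])   # [5 2 3]
```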
NumPy Data Types
NumPy allows values in NumPy arrays to be stored in multiple different forms, or data types. Here are some of the most important:
Boolean (np.bool_)
Unsigned integers (np.uint8, np.uint16, np.uint32, np.uint64)
Signed integers (np.int8, np.int16, np.int32, np.int64)
Floats (np.float16, np.float32, np.float64)
When you initialize an array, you can pass in the desired data type with the dtype parameter, for instance, np.array([-1, 0, 1, 2, 3], dtype=np.int8).
You can cast (“convert”) one data type into another using arr.astype(np.datatype). Consider the following arrays (Listing A-2).
arr1 = np.array([1,2,3])
arr2 = arr1.astype(np.uint8)
Listing A-2 Casting data types
Calling arr1.dtype yields dtype('int64'); calling arr2.dtype yields dtype('uint8'). When we first construct arr1, integers are represented by the np.int64 type; they are then cast as unsigned integers into arr2 with no effect on the contents. However, note that casting to a lower representation size can alter the values of the array; for instance, casting an array with the values [-1, -2, -3] to np.uint8 yields [255, 254, 253] (negative values wrap around modulo 2⁸ = 256). As another example, casting an array with value [1.123456789] to np.float16 yields the value [1.123].
Generally, you’ll cast to a lower representation size more often than to a higher one, usually in response to memory/storage constraints. Casting is also often utilized to prepare images for image processing libraries, which may require image data to be stored in np.uint8 type to guarantee values consist only of integers from 0 to 255.
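A short sketch of this casting behavior (the int64 default assumes a typical 64-bit platform; the default integer type can differ elsewhere):

```python
import numpy as np

arr = np.array([-1, -2, -3])
print(arr.dtype)                # int64 on most 64-bit platforms
print(arr.astype(np.uint8))     # values wrap modulo 2**8: [255 254 253]

# Casting floats to an integer type truncates toward zero
print(np.array([1.9, -1.9]).astype(np.int8))   # [ 1 -1]
```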
Function Application and Vectorization
Often, we would like to apply a function element-wise to a NumPy array when that function is not already supported by NumPy. (You always want to use built-in functions when they exist, for potential efficiency gains.) For instance, say we would like to graph the piecewise function f(x) = x²/25 for x < 0 and f(x) = sin(x) ⋅ x² for x ≥ 0 over −5 ≤ x ≤ 5. The array containing the x-axis values can be generated with inputs = np.linspace(-5, 5, 100). In this case, we are sampling 100 points from the function, which is high enough precision for our visualization purposes.
We can implement the function as follows (Listing A-3).
def f(x):
    if x < 0: return x**2/25
    else: return np.sin(x) * x**2
Listing A-3 A custom piecewise function
However, simply applying f(inputs) yields a ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_33/829457706.py in <module>
----> 1 f(np.linspace(-5, 5, 100))
/tmp/ipykernel_33/3998949136.py in f(x)
1 def f(x):
----> 2 if x < 0: return x**2/25
3 else: return np.sin(x) * x**2
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Our multipart function involves some relatively complex logic (i.e., if statements and comparisons), and thus applying the function directly fails. In this case, the ValueError stems from using if with multiple Boolean values; since some elements of the Boolean array formed by x < 0 are True and some are False, Python cannot decide whether to execute the code within the if or not – the truth value of the array is ambiguous. Python cannot quite tell that we want it to apply the function element-wise; we must communicate this explicitly.
One manual method is to use list comprehension and create a new array formed by applying the function individually to each element of inputs: outputs = np.array([f(element) for element in inputs]). The shorter but functionally (get it?) equivalent alternative is to use function vectorization. np.vectorize takes in a Python function and returns another function that applies the original function element-wise: outputs = np.vectorize(f)(inputs) or alternatively vectorized = np.vectorize(f) and outputs = vectorized(inputs) for a longer but perhaps more readable representation.
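Both approaches can be compared directly, reusing the piecewise f from Listing A-3:

```python
import numpy as np

def f(x):
    if x < 0: return x**2/25
    else: return np.sin(x) * x**2

inputs = np.linspace(-5, 5, 100)

# List comprehension and np.vectorize produce the same output array
manual = np.array([f(element) for element in inputs])
vectorized = np.vectorize(f)(inputs)
print(np.allclose(manual, vectorized))   # True
```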
Function vectorization is also convenient when we want to apply element-wise operations to multiple inputs. For instance, we may want to return True when the sum of elements across three input arrays is larger than 10 and False otherwise (Listing A-4).
def f(x, y, z):
    if x + y + z > 10: return True
    return False
Listing A-4 A sample multi-input function
We can apply the function as follows (Listing A-5).
x = np.arange(0, 5)
y = np.arange(7, 2, -1)
z = np.arange(-1, 9, 2)
np.vectorize(f)(x, y, z)
# array([False, False, False, True, True])
Listing A-5 Using function vectorization on a function with multiple inputs
Note that even though some have observed minor speedups by using np.vectorize, the function is “provided primarily for convenience, not for performance” (from the NumPy documentation website).
NumPy Array Application: Image Manipulation
Let’s use our knowledge of NumPy arrays to have some fun with image manipulation. The skimage.io.imread function can take in a URL of an image and return it as a NumPy array. Our sample image will be a landscape view of the New York City skyline (Listing A-6, Figure A-5).
from skimage import io
import matplotlib.pyplot as plt
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/2/2b/NYC_Downtown_Manhattan_Skyline_seen_from_Paulus_Hook_2019-12-20_IMG_7347_FRD_%28cropped%29.jpg/1920px-NYC_Downtown_Manhattan_Skyline_seen_from_Paulus_Hook_2019-12-20_IMG_7347_FRD_%28cropped%29.jpg'
image = io.imread(url)
plt.figure(figsize=(10, 5), dpi=400)
plt.imshow(image)
plt.show()
Listing A-6 Loading a sample image
Calling image.shape yields the tuple (770, 1920, 3). This means the image is 770 pixels high and 1920 pixels wide. The image is in color, and so by convention it has three channels corresponding to red, green, and blue (RGB). We can separate the image into its “color composition” by indexing each of the channels independently and displaying the two-dimensional slice in the corresponding color (Listing A-7, Figure A-6).
for i, color in enumerate(['Reds', 'Greens', 'Blues']):  # channel order is R, G, B
    plt.figure(figsize=(10, 5), dpi=400)
    plt.imshow(image[:,:,i], cmap=color)
    plt.show()
Listing A-7 Separating and visualizing the individual red, green, and blue color maps of the single image from Figure A-5
Say we want to “collapse” the three-dimensional image into a two-dimensional one by converting it from color to grayscale. One natural approach is to take the mean of channels for each pixel, which can be implemented with np.mean(image, axis=2) (Listing A-8, Figure A-7). Here, the axis parameter indicates that we are taking the mean along the third axis, indicated by 2 just as the third element of a tuple is indexed with 2.
plt.figure(figsize=(10, 5), dpi=400)
plt.imshow(np.mean(image, axis=2), cmap='gray')
plt.show()
Listing A-8 Visualizing a mean-based grayscale representation
The result is a pretty good grayscale representation! We can also produce a similar effect by taking the maximum value of each channel per pixel with np.max(image, axis=2), which produces a vintage-looking “overexposed” grayscale representation (Figure A-8). And taking the minimum value for each pixel instead with np.min(image, axis=2) yields – as one may expect – a generally darker grayscale representation (Figure A-9).
We can augment the original image by adding noise (Listing A-9, Figures A-10 and A-11). An array of normally distributed noise with the same shape as the original image can be generated with noise_vector = np.random.normal(0, 40, (770, 1920, 3)). In this case, we center the mean at 0 and use standard deviation 40. Recall that images are generally stored with numerical pixel values between 0 and 255. A larger standard deviation will yield larger visual noise, whereas a smaller one will yield less visible noise. Additionally, note that we need to cast the noisy image as an unsigned 8-bit integer (between 0 and 2⁸ − 1 = 255) because the noise vector is drawn from a continuous distribution, yielding pixel values that are not within the valid set of integers from 0 to 255 inclusive accepted for image display. Trying to display the image without casting the array as uint8 type will yield a bizarre, mostly blank canvas.
noise_vector = np.random.normal(0, 40, (770, 1920, 3))
altered_image = image + noise_vector
display_image = altered_image.astype(np.uint8)
Listing A-9 Generating a noisy image by adding noise randomly drawn from a normal distribution to the image from Figure A-5
We can also adjust the mean value of the normal distribution from which values for the noise matrix are sampled to generally influence the “feel” of the image overall (Figure A-12, Figure A-13).
The features of the image can also be enhanced or dimmed by multiplying all the values in the image by some constant – 0 ≤ k < 1 to dim the image and k > 1 to enhance it. This is known as contrast. Note that we need to similarly cast the altered image as an unsigned 8-bit integer because multiplying each value by a non-integer factor does not guarantee the integer outcome required for picture display (Listing A-10). Observe that minute differences form harsh, colorful boundaries in high-contrast images due to the exaggerated quantitative value difference (Figure A-14).
for factor in [0.2, 0.6, 1.5, 3, 8]:
    altered_image = image * factor
    display_image = altered_image.astype(np.uint8)
    plt.figure(figsize=(10, 5), dpi=400)
    plt.imshow(display_image)
    plt.show()
Listing A-10 Generating and visualizing different levels of contrast by multiplying values in the sample image by varying factors
Another familiar parameter in image editing tools is brightness, which can be adjusted by adding or subtracting the same value to or from all pixels in an image, thus uniformly increasing or decreasing the array values.
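Brightness adjustment can be sketched on a small synthetic array (the 2×2 “image” below is a hypothetical stand-in for the loaded skyline; the np.clip call is an added detail that keeps values from wrapping around the uint8 range):

```python
import numpy as np

# A tiny synthetic 2x2 RGB "image" standing in for the downloaded one
image = np.array([[[100, 150, 200], [0, 50, 100]],
                  [[250, 200, 150], [30, 60, 90]]], dtype=np.uint8)

# Brighten by 40; clip to [0, 255] so bright pixels saturate rather than wrap
brighter = np.clip(image.astype(np.int16) + 40, 0, 255).astype(np.uint8)
print(brighter[0, 0])   # [140 190 240]
print(brighter[1, 0])   # [255 240 190] -- 250 + 40 saturates at 255
```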
It’s common knowledge that the New York City skyline just isn’t complete without King Kong and Godzilla battling it out. Let’s load in a sample image of the scene (Listing A-11, Figure A-15).
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/f4/KK_v_G_trailer_%281962%29.png/440px-KK_v_G_trailer_%281962%29.png'
beasts = io.imread(url)
plt.figure(figsize=(10, 5), dpi=400)
plt.imshow(beasts)
plt.show()
Listing A-11 Loading and displaying a sample image of King Kong battling Godzilla
We’ll use a simple bitwise AND operation between two sample images to merge them. In order to do this, we first need to make sure that the images are the same size. One way to ensure equivalent array shapes is to resize the higher-resolution image (the New York City skyline in this case) to the shape of the lower-resolution image. This can be accomplished with the Python cv2 computer vision library (OpenCV), which offers the helpful function cv2.resize: resized = cv2.resize(original, (width, height)). Note that cv2.resize expects the target size as (width, height) – the reverse of NumPy’s (height, width) shape order.
The resulting merge (Listing A-12, Figure A-16) isn’t bad!
import cv2
# cv2.resize expects (width, height), so we reverse NumPy's shape order
merged = cv2.resize(image, (beasts.shape[1], beasts.shape[0])) & beasts
plt.figure(figsize=(10, 5), dpi=400)
plt.imshow(merged)
plt.show()
Listing A-12 Generating and visualizing a “merger” by using the bitwise AND operation
(See more fun image manipulation with convolutions in Chapter 4!)
We were able to do a lot simply by manipulating NumPy arrays with minimal help from additional libraries! A strong grasp of NumPy will prove helpful not only for manipulating data stored in array form but for working with almost any data type in the Python data science ecosystem, given NumPy’s ubiquity.
Pandas DataFrames
While NumPy arrays allow for the efficient storage of raw data from images to tables to text in array form, their generality can limit how efficiently we work with specific types of data. Perhaps the best-developed library for working with table-based data is Pandas, which is built around the DataFrame, a two-dimensional container for tabular data. (In fact, Pandas is built upon NumPy!) With Pandas, you can read and write from and to files, select data, filter data, and transform data. Pandas is an essential tool in the context of tabular data: at the time of this book’s writing, there simply is no other Python library as well-maintained and appropriate for effective tabular data manipulation.
Constructing Pandas DataFrames
To construct a Pandas DataFrame from scratch, pass a dictionary into the pd.DataFrame() constructor in which each key is a string representing the column name and the value is a list or array representing its values (Listing A-13, Figure A-17).
df = pd.DataFrame({'a':[1, 2, 3],
'b':[4, 5, 6],
'c':[7, 8, 9]})
Listing A-13 Generating a dummy DataFrame
If the lists provided for each column are not the same length, you will encounter a ValueError: All arrays must be of the same length error.
This method of constructing DataFrames is especially helpful when attempting to create small DataFrames, for instance, as a dummy table to test out manipulations or record and collect data for visualizations.
You can accomplish the same outcome by first initializing a blank DataFrame by passing no information into the constructor and then creating the columns one by one (Listing A-14).
df = pd.DataFrame()
df['a'] = [1, 2, 3]
df['b'] = [4, 5, 6]
df['c'] = [7, 8, 9]
Listing A-14 Initializing a dummy DataFrame via column creation/assignment
It should be noted here that this operation is very similar to NumPy array reassignment mechanics (Listing A-15). The bracket notation allows for the indexing of an element or collection of elements along a given object’s axes. Note, however, that an element or dimension of an array needs to already exist to be reassigned in NumPy, whereas in Pandas the DataFrame can be empty before assignment.
arr = np.zeros((3, 3))
arr[0] = [1, 2, 3]
arr[1] = [4, 5, 6]
arr[2] = [7, 8, 9]
Listing A-15 Analog in NumPy to the column assignment operation in Listing A-14
DataFrame columns can be indexed using brackets and the column’s name (Listing A-16). This returns a Series object, which can be thought of as a dictionary. In a dictionary, each key is associated with a value; in a Series, each index is associated with a value.
df['a']
'''
Returns:
0    1
1    2
2    3
Name: a, dtype: int64
'''
Listing A-16 Result of indexing a single column in a Pandas DataFrame
Thus, we can obtain the first item of the series indexed in Listing A-16 by df['a'][0], which returns 1.
DataFrames are more like a dictionary than a list because the index is explicit and can be modified, even though it is ordered like a list by default.
You’ll more commonly be reading data from a file. For instance, if you want to read the data in a comma-separated value file, use data = pd.read_csv(path). Depending on the organization of your .csv file, you may need to specify certain delimiters. Pandas has corresponding reading functions for Excel spreadsheets (pd.read_excel), JSON (pd.read_json), HTML tables (pd.read_html), SQL data (pd.read_sql), and many other file types. Correspondingly, you can export Pandas DataFrames in a desired supported form – for example, data.to_csv(path) or data.to_excel(path). See the IO tools page on the Pandas documentation for a full and up-to-date list of Pandas file reading and processing functionality: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html.
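As a minimal sketch of the write/read round trip (the filename demo.csv is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.to_csv('demo.csv', index=False)   # write; index=False omits the index column
loaded = pd.read_csv('demo.csv')     # read it back
print(loaded.equals(df))             # True
```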
Simple Pandas Mechanics
We can write a function to construct a multiplication table, represented in Pandas with a DataFrame (Listing A-17, Figure A-18). The multiplication table is a square DataFrame with both indices and columns containing the integers from 1 to n (inclusive); each element within the table is the product of the corresponding index and column coordinates. The function initializes a blank DataFrame with the desired index and column values and then iteratively fills in the desired elements using standard array logic.
def makeTable(n = 10):
    table = pd.DataFrame(index=range(1, n+1),
                         columns=range(1, n+1))
    for num1 in table.columns:
        for num2 in table.index:
            table.loc[num2, num1] = num1 * num2
    return table
Listing A-17 A function that generates a multiplication table of an arbitrary n × n size using Pandas value reassignment
Recall that you can index a column in a DataFrame with a bracket. When we call table[5], we obtain the following displayed Series. In our context, this returns all the multiples of 5 from 1 ⋅ 5 to 100 ⋅ 5:
1 5
2 10
3 15
4 20
5 25
...
96 480
97 485
98 490
99 495
100 500
Name: 5, Length: 100, dtype: object
Say we want to view multiples of 5, 10, and 15 all at once. Rather than passing in a single reference to a column, we pass in a list of column references: table[[5, 10, 15]] returns the DataFrame shown in Figure A-19.
We can also index rows with .loc. You can pass in a single row to index or a list of rows to index. table.loc[[5, 10, 15]] returns the DataFrame shown in Figure A-20.
Naturally, if you want to specify indices for both columns and rows, you can chain individual index commands together: table[[5, 10, 15]].loc[[5, 10, 15]]. However, the preferred way is to take advantage of .loc, which supports simultaneous row and column indexing in .loc[row, col] format and therefore is more efficient than chaining separate calls. The equivalent command to index both rows and columns with [5, 10, 15] would be table.loc[[5, 10, 15], [5, 10, 15]] (Figure A-21).
Note that by selecting certain indices, our new table does not have “standard” indices 0, 1, 2, .... The data.reset_index() method pops out the original index and replaces it with a fresh “standard” index axis (Figure A-22). To prevent the popping out of the old axis as a new column, specify drop=True as an argument in the reset_index() method (Figure A-23).
In general, to drop a column or set of columns, call data.drop(col, axis=1, inplace=True) or data.drop([col1, col2, ...], axis=1, inplace=True). The 1 axis represents the columns, whereas the 0 axis represents the rows. If you want to drop certain rows instead, set axis = 0. The inplace argument determines whether to execute the drop command on the current object or on a copy. If inplace is set to False, the original DataFrame will not be altered, but another DataFrame with the dropped data will be returned.
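A quick sketch of dropping columns and rows on a dummy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

print(df.drop('a', axis=1).columns.tolist())         # ['b', 'c']
print(df.drop(['a', 'b'], axis=1).columns.tolist())  # ['c']
print(df.drop(0, axis=0).index.tolist())             # [1, 2]

# With inplace=False (the default), df itself is unchanged
print(df.columns.tolist())                           # ['a', 'b', 'c']
```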
To index a range of columns or indices, Pandas also supports native Python slice indexing. However, unlike Python, Pandas includes the end index. For instance, the command table.loc[90:100, 5:100:3] includes all rows from indices 90 to 100 (inclusive) and all columns from 5 to 100 with step size 3 (Figure A-24).
Say we want to be a bit mischievous and mess up the multiplication table. We can reassign, for instance, the name of each column to a different random name. In order to do this, we need to give Pandas a mapping between an original column name and a new column name in the form of a dictionary (Listing A-18, Figure A-25). Then, call the .rename() method from the DataFrame and specify columns=dictionary_mapping. The .rename() method has the same inplace parameter as the .drop() command.
newCol = {}
nums = list(range(1, 101))
np.random.shuffle(nums)
for i in range(len(nums)):
    newCol[nums[i]] = nums[(i+1) % 100]
table = table.rename(columns=newCol)
Listing A-18 A “sabotage” renaming operation that randomly assigns each original column to a new column name. This is done by shuffling the column names and setting each column name to be renamed as the column name after it. The mod 100 operation allows for “wrapping around” (i.e., the very last element is renamed to the name of the very first element)
In machine learning and deep learning, we are often interested in the scaling property of computational structures. (For a deep learning example, see Chapter 4, “Why Do We Need Convolutions?”) In the case of generating multiplication tables, we may wonder how the storage needed to hold the Pandas DataFrame in memory scales as we increase the dimension of the table n (Listing A-19, Figure A-26).
import sys

import matplotlib.pyplot as plt
from tqdm import tqdm

# makeTable(n) is the multiplication table builder defined earlier
x = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
y = [sys.getsizeof(makeTable(n=n)) / 1000 for n in tqdm(x)]
plt.figure(figsize=(10, 5), dpi=400)
plt.plot(x, y, color='black')
plt.grid()
plt.xlabel('$n$')
plt.ylabel('KB')
plt.title(r'Storage Size for Pandas $n \cdot n$ Multiplication DataFrames')
plt.show()
Listing A-19. Plotting out the storage scaling of the current multiplication table function
The storage size grows roughly quadratically, as expected. Even so, the storage size for large values of n becomes very large. While we cannot alter the computational complexity of the multiplication-table-building algorithm, we can generally improve the scaling by recognizing and cutting redundancies in the table.
For one, the table is symmetric about the diagonal extending from the top-left corner to the bottom-right corner (i.e., a ⋅ b = b ⋅ a). Thus, slightly less than half of the table contains duplicate information. Let’s alter our table function to fill in only unique multiplication equations by starting each column’s fill at the row matching the current column value (Listing
A-20).
def makeHalfTable(n=10):
    # Initialize an n-by-n table of NaNs with labels 1..n on both axes
    table = pd.DataFrame(index=range(1, n + 1),
                         columns=range(1, n + 1))
    for num1 in table.columns:
        # Only fill rows at or below the diagonal (num2 >= num1)
        for num2 in table.index[num1 - 1:]:
            table.loc[num2, num1] = num1 * num2
    return table
Listing A-20. Adapting the multiplication table function to fill in only half of the multiplication table
We see that all the values that were not filled in contain NaN (Figure
A-27), which are left there from initialization (created in line 2 of Listing
A-20).
Perhaps surprisingly, the scaling of this half-table method is only negligibly better than the complete table (Figure
A-28).
However, if we replace all the np.nan values with 0s, we obtain substantial storage savings. By returning table.fillna(0) – which, as the name suggests, fills all NA/null/NaN values with 0 – we get a multiplication table generator with much more lightweight storage scaling (Figure
A-29).
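The change is a one-line addition to Listing A-20; a sketch of the lighter version (the makeHalfTableFilled name is hypothetical):

```python
import pandas as pd

def makeHalfTableFilled(n=10):
    # Initialize an n-by-n table of NaNs with labels 1..n on both axes
    table = pd.DataFrame(index=range(1, n + 1),
                         columns=range(1, n + 1))
    for num1 in table.columns:
        # Only fill rows at or below the diagonal (num2 >= num1)
        for num2 in table.index[num1 - 1:]:
            table.loc[num2, num1] = num1 * num2
    # Replace the remaining NaNs with 0s to shrink the stored table
    return table.fillna(0)
```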
One simplistic explanation is that np.nan is a “bulky higher-level object,” whereas 0 is a “primitive Python value”; thus, it makes sense intuitively that Python can handle storage of a large number of 0s more efficiently than a large number of np.nan’s. There are, of course, many more complex low-level details that contribute to the optimization of storage and efficiency. This demonstration, however, shows some quick high-level approaches that can be used to cut redundancy and improve scaling.
Advanced Pandas Mechanics
Pandas provides several functions with advanced functionality for manipulating the contents of DataFrames. Let’s create a dummy DataFrame to use throughout this section to illustrate the various manipulation
functions (Listing
A-21).
construct_dict = {'foo': ['A'] * 3 + ['B'] * 3,
                  'bar': ['I', 'II', 'III'] * 2,
                  'baz': range(1, 7)}
dummy_df = pd.DataFrame(construct_dict)
Listing A-21. Constructing a dummy DataFrame
The contents of such a DataFrame are displayed in Table
A-3.
Table A-3. Contents of the sample DataFrame created in Listing A-21
|   | foo | bar | baz |
|---|-----|-----|-----|
| 0 | A   | I   | 1   |
| 1 | A   | II  | 2   |
| 2 | A   | III | 3   |
| 3 | B   | I   | 4   |
| 4 | B   | II  | 5   |
| 5 | B   | III | 6   |
Pivot
A pivot operation projects two existing columns of the data onto the axes of a new pivoted table (the index and the columns). It then fills the pivoted table’s elements with values from one or more additional columns, matched to each combination of the first two. Consider the scheme in Figure
A-30: we set “foo” and “bar” to be the
index and
columns of the new pivoted table (respectively) and use “baz” to fill in the
values of the table. Because “baz” is 1 when “foo” is “A” and “bar” is “I”, the element at index “A” and column “I” is 1.
Pivoting can be implemented as df.pivot(index=..., columns=..., values=...). In Figure A-30, we use the command dummy_df.pivot(index='foo', columns="bar", values="baz").
If you pass in more than one column to the values parameter via a list (Listing
A-22), Pandas creates a multilevel column to accommodate different values for combinations of the index and column features (Figure
A-31).
mod_dummy_df = dummy_df.copy()
mod_dummy_df['baz2'] = range(101, 107)
mod_dummy_df.pivot(index='foo', columns='bar',
                   values=['baz', 'baz2'])
Listing A-22. Code to perform a pivot
If there are multiple entries for the same combination of index and column features, Pandas will (as one may expect) throw an error: "Index contains duplicate entries, cannot reshape".
Pivoting is a convenient operation for automatically finding values at the combination of two features.
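As a quick check, the pivot from Figure A-30 can be reproduced on the dummy DataFrame from Listing A-21:

```python
import pandas as pd

dummy_df = pd.DataFrame({'foo': ['A'] * 3 + ['B'] * 3,
                         'bar': ['I', 'II', 'III'] * 2,
                         'baz': range(1, 7)})

# 'foo' becomes the index, 'bar' the columns, 'baz' the cell values
pivoted = dummy_df.pivot(index='foo', columns='bar', values='baz')
```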
Melt
Melting can be thought of as an “unpivot.” It converts matrix-based data (i.e., the index and column are both “significant,” instead of the index serving as a counter) into list-based data, whereas pivots do the opposite. You can think of the operation as “melting” away the rigid, organized structure of a matrix into a primitive stream of data, like complex ice sculptures melt into elementary puddles of water.
Consider a melting operation with an
ID feature “baz” and
value features “foo” and “bar” (Figure
A-32). Two columns in the melted DataFrame are created: “variable” and “value”. “value” holds the value stored by the feature name referenced in the “variable” column in the original dataset. The “baz” ID column is used to keep track of which row the melted variable-value pairs belong to.
Melting can be implemented as df.melt(id_vars=["baz"], value_vars=["foo", "bar"]).
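Applied to the dummy DataFrame from Listing A-21, this melt might look like:

```python
import pandas as pd

dummy_df = pd.DataFrame({'foo': ['A'] * 3 + ['B'] * 3,
                         'bar': ['I', 'II', 'III'] * 2,
                         'baz': range(1, 7)})

# 'baz' is kept as an identifier; 'foo' and 'bar' are melted into
# variable/value pairs, one row per original cell
melted = dummy_df.melt(id_vars=['baz'], value_vars=['foo', 'bar'])
```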
Explode
Exploding is an operation that separates columns containing lists into a melted-style form. Consider the schematic in Figure
A-33: the original DataFrame contains a list [1, 2] in index 0, where “foo” has value “A”, “bar” has value “I”, and “baz” has value 1. After the list is exploded, the exploded DataFrame will contain two entries at index 0, with the same values for non-exploded columns (“foo”, “bar”, “baz”) but separate values corresponding to items in the list (1 in one row, 2 in the other).
It’s generally unlikely that you’ll encounter a raw dataset with lists as a column. However, knowing that the explode function exists can be helpful when you’re artificially constructing DataFrames; rather than writing code to create a specific organization of elements, just create a column with relevant list values and explode it. While the end result may be the same, it becomes much simpler for you to implement.
Exploding can be implemented as df.explode('column_name'). You can also pass in a list of column names to explode if there are multiple columns containing list elements.
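A minimal sketch of an explode, using a small hypothetical DataFrame with a list-valued column:

```python
import pandas as pd

df = pd.DataFrame({'foo': ['A', 'B'],
                   'bar': [[1, 2], [3, 4, 5]]})

# Each list element gets its own row; the non-exploded 'foo' values
# (and the original index labels) are repeated for each element
exploded = df.explode('bar')
```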
Stack
The stack operation converts a two-dimensional Pandas DataFrame into a Series with a multilevel index (displayed like a one-column table) by rearranging the elements into “vertical” form.
In the stacking operation visualized in Figure
A-34, the value at row 0 of column “foo” becomes the entry whose first-level index is 0 and whose second-level index is “foo”.
To access multilevel data, simply call index-based retrieval twice, like df.loc[0].loc['foo'].
The code to stack is simply df.stack().
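Applied to the dummy DataFrame from Listing A-21, a stack might look like (note that df.stack() technically returns a Series, which displays like a one-column table):

```python
import pandas as pd

dummy_df = pd.DataFrame({'foo': ['A'] * 3 + ['B'] * 3,
                         'bar': ['I', 'II', 'III'] * 2,
                         'baz': range(1, 7)})

# Each (row, column) cell becomes one entry of a Series whose
# two-level index is (original row label, original column name)
stacked = dummy_df.stack()
first_value = stacked.loc[0].loc['foo']  # value at row 0, column 'foo'
```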
Unstack
True to its name, the
unstacking operation acts as an inverse to the stack operation, converting a stacked-style DataFrame with multilevel indices into a standard two-dimensional DataFrame. Performing a stack followed by an unstack operation produces no change in the DataFrame, except for the existence of a “0” introduced during stacking (Figure
A-35).
The code to unstack is simply df.unstack().
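A quick sketch of the stack/unstack round trip on the dummy DataFrame from Listing A-21:

```python
import pandas as pd

dummy_df = pd.DataFrame({'foo': ['A'] * 3 + ['B'] * 3,
                         'bar': ['I', 'II', 'III'] * 2,
                         'baz': range(1, 7)})

# Stacking then unstacking recovers the original two-dimensional layout
round_trip = dummy_df.stack().unstack()
```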