Exploring NumPy

NumPy is a library built around the notion of numeric arrays: multidimensional, index-based (like a list) collections of data that (unlike a list) guarantee that the type of the stored values stays consistent and predefined, say, a 2-dimensional array of integers or a 1-dimensional array of floats. Its core is implemented in C, which lets it speed up computation, often by orders of magnitude, compared to base Python. The gap in performance is noticeable even on relatively small datasets and widens for large datasets and complex algorithms. NumPy is capable of handling millions of rows of data and is primarily bounded by the available memory (RAM), not the CPU.
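
As a minimal sketch of what "consistent and predefined type" means in practice, the snippet below builds the two kinds of arrays mentioned above and inspects their dtype attribute. The variable names (grid, prices, mixed) are made up for illustration, and the exact integer type reported depends on the platform.

```python
import numpy as np

# A 2-dimensional array of integers and a 1-dimensional array of floats
grid = np.array([[1, 2, 3], [4, 5, 6]])
prices = np.array([1.5, 2.0, 3.25])

print(grid.ndim, grid.dtype)      # number of dimensions and element type
print(prices.ndim, prices.dtype)

# Mixing ints and floats does not produce a "mixed" array:
# NumPy picks a single type that can hold every value.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)                # float64

# Assigning a float into an integer array keeps the array's type,
# so the fractional part is dropped.
grid[0, 0] = 7.9
print(grid[0, 0])                 # 7
```

Unlike a list, the array never ends up holding values of different types; the type is fixed when the array is created.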

Let's illustrate this difference in performance with an example. Imagine that we need to sum three lists of values element-wise. In pure Python, the code will look similar to this:

>>> A, B, C = [1,2,3,4,5]*1000, [2,3,4,5,6]*1000, [10,9,8,7,6]*1000

>>> %timeit result = [sum(row) for row in zip(A,B,C)]
635 µs ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now, let's do the same, using NumPy, as follows:

First, we convert all three lists into NumPy arrays using the np.array function (it takes any iterable as input). Here is the code:

import numpy as np

Anp = np.array(A)
Bnp = np.array(B)
Cnp = np.array(C)

Now, we sum them:

>>> %timeit result2 = Anp + Bnp + Cnp
4.67 µs ± 22.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> 4.67 / 635
0.00735

It takes less than 1% of the time it took in plain Python! Now, imagine what we'll gain on more complex operations, where the number of operations grows faster than the number of data points!
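
The %timeit command used above is an IPython/Jupyter magic and is not available in a plain Python script. As a rough sketch, the same comparison can be reproduced with the standard library's timeit module. The helper names pure_python and with_numpy are made up for this example, and the absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np

A, B, C = [1, 2, 3, 4, 5] * 1000, [2, 3, 4, 5, 6] * 1000, [10, 9, 8, 7, 6] * 1000
Anp, Bnp, Cnp = np.array(A), np.array(B), np.array(C)

def pure_python():
    # One Python-level iteration per element
    return [sum(row) for row in zip(A, B, C)]

def with_numpy():
    # Element-wise addition performed inside NumPy
    return Anp + Bnp + Cnp

# Both approaches should produce the same numbers
assert pure_python() == with_numpy().tolist()

# Average seconds per call over 200 runs each
t_py = timeit.timeit(pure_python, number=200) / 200
t_np = timeit.timeit(with_numpy, number=200) / 200
print(f"pure Python: {t_py * 1e6:.1f} µs, NumPy: {t_np * 1e6:.1f} µs")
```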

In the preceding example, we performed vectorized addition: the + symbol, in this case, represented element-wise addition of the arrays. For more complex operations, such as conditional (if/else) logic, we have to use the functions and methods built into NumPy, such as the numpy.where function for vectorized if/else operations. It is still possible to run custom Python functions on each cell, row, or column of an array but, most often, this code will be drastically slower than code that uses NumPy's native operations.
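
To make the numpy.where idea concrete, here is a small sketch that replaces negative values with zero, first with np.where and then with a custom Python function applied to each element through np.vectorize. The sample data and the function name keep_non_negative are invented for illustration.

```python
import numpy as np

values = np.array([3, -1, 4, -1, 5, -9])

# Vectorized if/else: keep non-negative values, replace negatives with 0
clipped = np.where(values >= 0, values, 0)
print(clipped)  # [3 0 4 0 5 0]

# The same rule as a plain Python function applied element by element.
# np.vectorize is a convenience wrapper around a Python-level loop,
# so it gives the same result but not the speed of native operations.
def keep_non_negative(x):
    return x if x >= 0 else 0

slow = np.vectorize(keep_non_negative)(values)
print(np.array_equal(clipped, slow))  # True
```

Both versions return the same array; the difference only shows up in running time as the data grows.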

This vectorized code requires a somewhat different way of thinking, as your code will most often operate on whole rows or columns of an array and not on single values. Therefore, to achieve good performance, it is usually best to avoid writing your own element-by-element code in pure Python, and explicit loops are generally a no-go. Instead, most problems can be reformulated in terms of standard operations that NumPy already provides and implements efficiently.
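
As an illustration of thinking in rows and columns, the sketch below computes per-row averages with a Python loop and then with a single NumPy call, using the axis argument to pick the direction. The scores table is hypothetical data made up for this example.

```python
import numpy as np

# Hypothetical data: 3 students (rows) x 4 test scores (columns)
scores = np.array([
    [70, 80, 90, 100],
    [60, 65, 70, 75],
    [88, 92, 96, 100],
])

# Loop-based version: one Python-level iteration per row
row_means_loop = [sum(row) / len(row) for row in scores]

# Vectorized versions: the axis argument chooses the direction
row_means = scores.mean(axis=1)   # one value per row (student)
col_means = scores.mean(axis=0)   # one value per column (test)

print(row_means)
print(col_means)
```

The loop and scores.mean(axis=1) give the same averages; the vectorized form simply states the operation once for the whole array.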

With the rise of neural networks and other computation-heavy algorithms, scientists and developers are pushing the boundaries of performance. One result is CuPy, a package that aims to be a drop-in replacement for NumPy that runs computations on the graphics card (GPU) instead of the CPU. Provided that your computer has a supported modern graphics card, it can achieve even more impressive performance, with little to no changes in the code over NumPy.

In this section, we got to know NumPy, a foundational package of the Python data science ecosystem. NumPy is built around the notion of multidimensional arrays of the same data type. With this, most mathematical operations and matrix transformations can be executed in vectorized form. This vectorized way of data processing is great for many kinds of data operations, but NumPy is designed primarily for numeric data. To work on a broader set of data types and have an easier, more human-friendly interface for tabular data, we'll move on to pandas.
