Chapter 3: High-Speed Scientific Computing Using NumPy

This chapter introduces us to NumPy, a high-speed Python library for matrix calculations. Most data science/algorithmic trading libraries are built upon NumPy's functionality and conventions.

In this chapter, we are going to cover the following key topics:

  • Introduction to NumPy
  • Creating NumPy n-dimensional arrays (ndarrays)
  • Data types used with NumPy arrays
  • Indexing of ndarrays
  • Basic ndarray operations
  • File operations on ndarrays

Technical requirements

The Python code used in this chapter is available in the Chapter03/numpy.ipynb notebook in the book's code repository.

Introduction to NumPy

Multidimensional heterogeneous arrays can be represented in Python using lists. A list is a 1D array, a list of lists is a 2D array, a list of lists of lists is a 3D array, and so on. However, this solution is complex, difficult to use, and extremely slow.

One of the primary design goals of the NumPy Python library was to introduce high-performance, scalable structured arrays and vectorized computations.

Most data structures and operations in NumPy are implemented in compiled C/C++ code, which is the main reason they are much faster than equivalent pure-Python loops.
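
To make the contrast with nested lists concrete, the following minimal sketch doubles every value of a small 2D structure twice: once with nested list comprehensions and once with a single vectorized NumPy expression. The data and variable names are illustrative and are not taken from the chapter's notebook.

```python
import numpy as np

# A 2D "array" built from nested Python lists (illustrative data)
nested = [[1, 2, 3], [4, 5, 6]]

# Pure Python: every element is visited explicitly
doubled_list = [[2 * value for value in row] for row in nested]

# NumPy: one vectorized expression applied to the whole array
doubled_array = 2 * np.array(nested)

print(doubled_list)            # [[2, 4, 6], [8, 10, 12]]
print(doubled_array.tolist())  # [[2, 4, 6], [8, 10, 12]]
```

Both approaches give the same values; the NumPy version simply hands the loop over to compiled code.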

Creating NumPy ndarrays

An ndarray is a high-performance, space-efficient data structure for multidimensional arrays.

First, we need to import the NumPy library, as follows:

import numpy as np

Next, we will start creating a 1D ndarray.

Creating 1D ndarrays

The following line of code creates a 1D ndarray:

arr1D = np.array([1.1, 2.2, 3.3, 4.4, 5.5]);

arr1D

This will give the following output:

array([1.1, 2.2, 3.3, 4.4, 5.5])

Let's inspect the type of the array with the following code:

type(arr1D)

This shows that the array is a NumPy ndarray, as can be seen here:

numpy.ndarray

We can easily create ndarrays of two dimensions or more.

Creating 2D ndarrays

To create a 2D ndarray, use the following code:

arr2D = np.array([[1, 2], [3, 4]]);

arr2D

The result has two rows and each row has two values, so it is a 2 x 2 ndarray, as illustrated in the following code snippet:

array([[1, 2],

       [3, 4]])

Creating any-dimension ndarrays

An ndarray can have an arbitrary number of dimensions. The following code creates an ndarray of 2 x 2 x 2 x 2 dimensions:

arr4D = np.array(range(16)).reshape((2, 2, 2, 2));

arr4D

The representation of the array is shown here:

array([[[[ 0,  1],

         [ 2,  3]],

        [[ 4,  5],

         [ 6,  7]]],

       [[[ 8,  9],

         [10, 11]],

        [[12, 13],

         [14, 15]]]])

NumPy ndarrays have a shape attribute that describes the ndarray's dimensions, as shown in the following code snippet:

arr1D.shape

The following snippet shows that arr1D is a one-dimensional array with five elements:

(5,)

We can inspect the shape attribute on arr2D with the following code:

arr2D.shape

As expected, the output describes it as being a 2 x 2 ndarray, as we can see here:

(2, 2)

In practice, there are certain matrices that are more frequently used, such as a matrix of 0s, a matrix of 1s, an identity matrix, a matrix containing a range of numbers, or a random matrix. NumPy provides support for generating these frequently used ndarrays with one command.

Creating an ndarray with np.zeros(...)

The np.zeros(...) method creates an ndarray populated with all 0s, as illustrated in the following code snippet:

np.zeros(shape=(2,5))

The output is all 0s, with dimensions being 2 x 5, as illustrated in the following code snippet:

array([[0., 0., 0., 0., 0.],

       [0., 0., 0., 0., 0.]])

Creating an ndarray with np.ones(...)

np.ones(...) is similar, but each value is assigned a value of 1 instead of 0. The method is shown in the following code snippet:

np.ones(shape=(2,2))

The result is a 2 x 2 ndarray with every value set to 1, as illustrated in the following code snippet:

array([[1., 1.],

       [1., 1.]])

Creating an ndarray with np.identity(...)

Matrix operations often require an identity matrix, which the np.identity(...) method creates, as illustrated in the following code snippet:

np.identity(3)

This creates a 3 x 3 identity matrix with 1s on the diagonals and 0s everywhere else, as illustrated in the following code snippet:

array([[1., 0., 0.],

       [0., 1., 0.],

       [0., 0., 1.]])

Creating an ndarray with np.arange(...)

np.arange(...) is the NumPy equivalent of Python's built-in range(...) function. It generates values from a start value, up to but not including an end value, with a given increment, and returns them as an ndarray, as shown here:

np.arange(5)

The ndarray returned is shown here:

array([0, 1, 2, 3, 4])

By default, values start at 0 and increment by 1.
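
As a short sketch of the non-default arguments, the following example passes an explicit start, stop, and step to np.arange(...). The specific values are chosen only for illustration.

```python
import numpy as np

# start=2, stop=10 (exclusive), step=3
stepped = np.arange(2, 10, 3)
print(stepped.tolist())     # [2, 5, 8]

# Unlike range(...), np.arange(...) also accepts non-integer steps
fractional = np.arange(0.0, 1.0, 0.25)
print(fractional.tolist())  # [0.0, 0.25, 0.5, 0.75]
```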

Creating an ndarray with np.random.randn(…)

np.random.randn(…) generates an ndarray of specified dimensions, with each element populated with random values drawn from a standard normal distribution (mean=0, std=1), as illustrated here:

np.random.randn(2,2)

The output is a 2 x 2 ndarray with random values, as illustrated in the following code snippet:

array([[ 0.57370365, -1.22229931],

       [-1.25539335,  1.11372387]])
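
Because the values are random, rerunning the code produces different numbers each time. If reproducible output is needed, one option is to seed the generator first; the sketch below assumes the legacy np.random.seed(...) function, and the seed value 42 is arbitrary.

```python
import numpy as np

# Seeding makes the sequence of random draws repeatable
np.random.seed(42)
first_draw = np.random.randn(2, 2)

# Resetting the same seed reproduces exactly the same values
np.random.seed(42)
second_draw = np.random.randn(2, 2)

print(np.array_equal(first_draw, second_draw))  # True
```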

Data types used with NumPy ndarrays

NumPy ndarrays are homogeneous; that is, each element in an ndarray has the same data type. This is different from Python lists, which can have elements with different data types (heterogeneous).

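The np.array(...) method accepts an explicit dtype= parameter that lets us specify the data type that the ndarray should use. Common data types used are np.int32, np.float64, np.float128, and np.bool_. Note that np.float128 is platform-dependent and is not available on Windows. The np.bool alias was removed in NumPy 1.24 and reintroduced in NumPy 2.0, so np.bool_ or Python's built-in bool is the safer spelling across versions.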

The primary reason to be aware of the various numeric types for ndarrays is memory usage: the more precision a data type provides, the more memory it requires. For certain operations, a smaller data type may be just enough.
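
The memory cost can be inspected directly. The following sketch compares the per-element size (itemsize) and total size (nbytes) of the same 1,000 values stored as float64 and float32; the array contents are illustrative.

```python
import numpy as np

values = np.arange(1000)

as_float64 = values.astype(np.float64)
as_float32 = values.astype(np.float32)

# itemsize is bytes per element; nbytes is bytes for the whole array
print(as_float64.itemsize, as_float64.nbytes)  # 8 8000
print(as_float32.itemsize, as_float32.nbytes)  # 4 4000
```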

Creating a numpy.float64 array

To create a 64-bit floating-point array, use the following code:

np.array([-1, 0, 1], dtype=np.float64)

The output is shown here (NumPy omits the dtype from the display because float64 is the default floating-point type):

array([-1.,  0.,  1.])

Creating a numpy.bool_ array

We can create an ndarray by converting specified values to the target type. In the following code example, we see that even though integer data values were provided, the resulting ndarray has dtype as bool, since the data type was specified to be np.bool_ (the np.bool alias is unavailable in NumPy 1.24 through 1.26):

np.array([-1, 0, 1], dtype=np.bool_)

The values are shown here:

array([ True, False,  True])

We observe that the integer values (-1, 0, 1) were converted to bool values (True, False, True). 0 gets converted to False, and all other values get converted to True.

ndarrays' dtype attribute

ndarrays have a dtype attribute to inspect the data type, as shown here:

arr1D.dtype

The output is a NumPy dtype object with a float64 value, as illustrated here:

dtype('float64')

Converting underlying data types of ndarray with numpy.ndarray.astype(...)

We can easily convert the underlying data type of an ndarray to any other compatible data type with the numpy.ndarray.astype(...) method. For example, to convert arr1D from np.float64 to np.int64, we use the following code:

arr1D.astype(np.int64).dtype

This reflects the new data type, as follows:

dtype('int64')

When numpy.ndarray.astype(...) converts floating-point values to an integer data type, it truncates the fractional part rather than rounding, as follows:

arr1D.astype(np.int64)

This converts arr1D to the following integer-valued ndarray:

array([1, 2, 3, 4, 5])

The original floating values (1.1, 2.2, …) are converted to their truncated integer values (1, 2, …).
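
Truncation discards the fractional part, so negative values move toward zero rather than down. The sketch below contrasts astype(...) with rounding first via np.round(...); the sample values are illustrative.

```python
import numpy as np

mixed = np.array([-1.7, -0.2, 0.2, 1.7])

# astype drops the fractional part (truncation toward zero)
truncated = mixed.astype(np.int64)
print(truncated.tolist())  # [-1, 0, 0, 1]

# Rounding before converting gives the nearest integers instead
rounded = np.round(mixed).astype(np.int64)
print(rounded.tolist())    # [-2, 0, 0, 2]
```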

Indexing of ndarrays

Array indexing refers to the way of accessing a particular array element or elements. In NumPy, all ndarray indices are zero-based—that is, the first item of an array has index 0. Negative indices are understood as counting from the end of the array.

Direct access to an ndarray's element

Direct access to a single ndarray's element is one of the most used forms of access.

The following code builds a 3 x 3 random-valued ndarray for our use:

arr = np.random.randn(3,3);

arr

The arr ndarray has the following elements:

array([[-0.04113926, -0.273338  , -1.05294723],

       [ 1.65004669, -0.09589629,  0.15586867],

       [ 0.39533427,  1.47193681,  0.32148741]])

Indexing with the single integer 0 selects along the first axis, as follows:

arr[0]

This gives us the first row of the arr ndarray, as follows:

array([-0.04113926, -0.273338  , -1.05294723])

We can access the element at the second column of the first row by using the following code:

arr[0][1]

The result is shown here:

-0.2733379996693689

ndarrays also support an alternative notation to perform the same operation, as illustrated here:

arr[0, 1]

It accesses the same element as before, as can be seen here:

-0.2733379996693689

The numpy.ndarray[index_0, index_1, … index_n] notation is more concise and is especially useful when accessing ndarrays with many dimensions.
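
As a small illustration with more dimensions, the following sketch rebuilds a 2 x 2 x 2 x 2 array like the arr4D created earlier and reads one element with both notations.

```python
import numpy as np

arr4D = np.arange(16).reshape((2, 2, 2, 2))

# Chained indexing: four separate indexing steps
chained = arr4D[1][0][1][0]

# Comma-separated indexing: a single indexing operation
combined = arr4D[1, 0, 1, 0]

print(int(chained), int(combined))  # 10 10
```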

Negative indices start from the end of the ndarray, as illustrated here:

arr[-1]

This returns the last row of the ndarray, as follows:

array([0.39533427, 1.47193681, 0.32148741])

ndarray slicing

While single ndarray access is useful, for bulk processing we require access to multiple elements of the array at once (for example, if the ndarray contains all daily prices of an asset, we might want to process only all Mondays' prices).

Slicing allows access to multiple ndarray records in one command. Slicing ndarrays also works similarly to slicing of Python lists.

The basic slice syntax is i:j:k, where i is the index of the first record we want to include, j is the stopping index, and k is the step.
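
The following sketch shows the i:j:k syntax on a 1D array. The prices array is hypothetical sample data; if it held one value per trading day, a step of 5 would pick one value per week.

```python
import numpy as np

# Hypothetical daily prices: 100, 101, ..., 109
prices = np.arange(100, 110)

# Start at index 1, stop before index 8, take every second element
print(prices[1:8:2].tolist())  # [101, 103, 105, 107]

# Omitted start and stop default to the whole array; step of 5
print(prices[::5].tolist())    # [100, 105]

# A negative step walks the array backward
print(prices[::-1][:3].tolist())  # [109, 108, 107]
```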

Accessing all ndarray elements after the first one

To access all elements after the first one, we can use the following code:

arr[1:]

This returns all the rows after the first one, as illustrated in the following code snippet:

array([[ 1.65004669, -0.09589629,  0.15586867],

       [ 0.39533427,  1.47193681,  0.32148741]])

Fetching all rows, starting from row 2 and columns 1 and 2

Similarly, to fetch all rows starting from the second one, and columns up to but not including the third one, run the following code:

arr[1:, :2]

This is a 2 x 2 ndarray as expected, as can be seen here:

array([[ 1.65004669, -0.09589629],

       [ 0.39533427,  1.47193681]])

Slicing with negative indices

More complex slicing notation that mixes positive and negative index ranges is also possible, as follows:

arr[1:2, -2:-1]

This is a less intuitive way of finding the slice of an element at the second row and at the second column, as illustrated here:

array([[-0.09589629]])

Slicing with no indices

Slicing with no indices yields the entire row/column. The following code generates a slice containing all elements on the third row:

arr[:][2]

The output is shown here:

array([0.39533427, 1.47193681, 0.32148741])

The following code generates a slice of the original arr ndarray:

arr[:][:]

The output is shown here:

array([[-0.04113926, -0.273338  , -1.05294723],

       [ 1.65004669, -0.09589629,  0.15586867],

       [ 0.39533427,  1.47193681,  0.32148741]])

Setting values of a slice to 0

Frequently, we will need to set certain values of an ndarray to a given value.

Let's generate a slice containing the second row of arr and assign it to a new variable, arr1, as follows:

arr1 = arr[1:2];

arr1

arr1 now contains the second row, as shown in the following code snippet:

array([[ 1.65004669, -0.09589629,  0.15586867]])

Now, let's set every element of arr1 to the value 0, as follows:

arr1[:] = 0;

arr1

As expected, arr1 now contains all 0s, as illustrated here:

array([[0., 0., 0.]])

Now, let's re-inspect our original arr ndarray, as follows:

arr

The output is shown here:

array([[-0.04113926, -0.273338  , -1.05294723],

       [ 0.        ,  0.        ,  0.        ],

       [ 0.39533427,  1.47193681,  0.32148741]])

We see that our operation on the arr1 slice also changed the original arr ndarray. This brings us to the most important point: ndarray slices are views into the original ndarrays, not copies.

It is important to remember this when working with ndarrays so that we do not inadvertently change something we did not mean to. This design is purely for efficiency reasons, since copying large ndarrays incurs large overheads.

To create a copy of an ndarray, we explicitly call the numpy.ndarray.copy(...) method, as follows:

arr_copy = arr.copy()

Now, let's change some values in the arr_copy ndarray, as follows:

arr_copy[1:2] = 1;

arr_copy

We can see the change in arr_copy in the following code snippet:

array([[-0.04113926, -0.273338  , -1.05294723],

       [ 1.        ,  1.        ,  1.        ],

       [ 0.39533427,  1.47193681,  0.32148741]])

Let's inspect the original arr ndarray as well, as follows:

arr

The output is shown here:

array([[-0.04113926, -0.273338  , -1.05294723],

       [ 0.        ,  0.        ,  0.        ],

       [ 0.39533427,  1.47193681,  0.32148741]])

We see that the original ndarray is unchanged since arr_copy is a copy of arr and not a reference/view to it.
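
When it is unclear whether two ndarrays share the same underlying data, np.shares_memory(...) can check it directly. The following is a minimal sketch with illustrative variable names.

```python
import numpy as np

arr = np.arange(9.0).reshape((3, 3))

row_view = arr[1:2]         # a slice: a view into arr
row_copy = arr[1:2].copy()  # an explicit, independent copy

print(np.shares_memory(arr, row_view))  # True
print(np.shares_memory(arr, row_copy))  # False

# Writing through the copy leaves the original untouched
row_copy[:] = -1
print(arr[1].tolist())  # [3.0, 4.0, 5.0]
```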

Boolean indexing

NumPy provides multiple ways of indexing ndarrays. NumPy arrays can be indexed by using conditions that evaluate to True or False. Let's start by regenerating an arr ndarray, as follows:

arr = np.random.randn(3,3);

arr

This is a 3 x 3 ndarray with random values, as can be seen in the following code snippet:

array([[-0.50566069, -0.52115534,  0.0757591 ],

       [ 1.67500165, -0.99280199,  0.80878346],

       [ 0.56937775,  0.36614928, -0.02532004]])

Let's look at the output of the following code, which really just calls the np.less(...) universal function (ufunc); that is, its result is identical to calling np.less(arr, 0):

arr < 0

This generates another ndarray of True and False values, where True means the corresponding element in arr was negative and False means the corresponding element in arr was not negative, as illustrated in the following code snippet:

array([[ True,  True, False],

       [False,  True, False],

       [False, False,  True]])

We can use that array as an index to arr to find the actual negative elements, as follows:

arr[(arr < 0)]

As expected, this fetches the following negative values:

array([-0.50566069, -0.52115534, -0.99280199, -0.02532004])

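We can combine multiple conditions with the & (and) and | (or) operators. Python's and and or keywords do not work on ndarrays since they are for scalars. Each condition must also be wrapped in parentheses because & and | bind more tightly than comparison operators. An example of the & operator is shown here: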

(arr > -1) & (arr < 1)

This generates an ndarray with the value True, where the elements are between -1 and 1 and False otherwise, as illustrated in the following code snippet:

array([[ True,  True,  True],

       [False,  True,  True],

       [ True,  True,  True]])

As we saw before, we can use that Boolean array to index arr and find the actual elements, as follows:

arr[((arr > -1) & (arr < 1))]

The following output is an array of elements that satisfied the condition:

array([-0.50566069, -0.52115534,  0.0757591 , -0.99280199,  0.80878346,

        0.56937775,  0.36614928, -0.02532004])
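
Boolean masks can also be combined with | (or), inverted with ~ (not), and used on the left-hand side of an assignment. The following sketch uses a small fixed array so that the results are reproducible; the names and thresholds are illustrative.

```python
import numpy as np

data = np.array([[-0.5, 0.1],
                 [1.7, -1.0]])

# Elements below -0.75 OR above 0.75
outside = (data < -0.75) | (data > 0.75)
print(data[outside].tolist())   # [1.7, -1.0]

# ~ inverts the mask: elements inside the band
print(data[~outside].tolist())  # [-0.5, 0.1]

# Assigning through a mask replaces only the selected elements
clipped = data.copy()
clipped[clipped < 0] = 0
print(clipped.tolist())         # [[0.0, 0.1], [1.7, 0.0]]
```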

Indexing with arrays

ndarray indexing also allows us to directly pass lists of indices of interest. Let's reuse the arr ndarray from the previous section, as follows:

arr

The output is shown here:

array([[-0.50566069, -0.52115534,  0.0757591 ],

       [ 1.67500165, -0.99280199,  0.80878346],

       [ 0.56937775,  0.36614928, -0.02532004]])

We can select the first and third rows, using the following code:

arr[[0, 2]]

The output is a 2 x 3 ndarray containing the two rows, as illustrated here:

array([[-0.50566069, -0.52115534,  0.0757591 ],

       [ 0.56937775,  0.36614928, -0.02532004]])

We can combine row and column indexing using arrays, as follows:

arr[[0, 2], [1]]

The preceding code gives us the second column of the first and third rows, as follows:

array([-0.52115534,  0.36614928])

We can also change the order of the indices passed, and this is reflected in the output. The following code picks out the third row followed by the first row, in that order:

arr[[2, 0]]

The output reflects the two rows in the order we expected (third row first; first row second), as illustrated in the following code snippet:

array([[ 0.56937775,  0.36614928, -0.02532004],

       [-0.50566069, -0.52115534,  0.0757591 ]])
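
Two details of indexing with arrays are easy to miss: row and column lists are paired element by element, and the result is a copy rather than a view. The sketch below illustrates both, using np.ix_(...) to select a full rows-by-columns block; the grid array is illustrative.

```python
import numpy as np

grid = np.arange(9).reshape((3, 3))

# Paired indices: picks elements (0, 0) and (2, 2)
paired = grid[[0, 2], [0, 2]]
print(paired.tolist())  # [0, 8]

# np.ix_ builds the cross product: rows 0 and 2 by columns 0 and 2
block = grid[np.ix_([0, 2], [0, 2])]
print(block.tolist())   # [[0, 2], [6, 8]]

# Unlike slices, indexing with arrays returns a copy
block[:] = -1
print(int(grid[0, 0]))  # 0
```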

Now that we have learned how to create ndarrays and about the various ways to retrieve the values of their elements, let's discuss the most common ndarray operations.

Basic ndarray operations

In the following examples, we will use an arr2D ndarray, as illustrated here:

arr2D

This is a 2 x 2 ndarray with values from 1 to 4, as shown here:

array([[1, 2],

       [3, 4]])

Scalar multiplication with an ndarray

Scalar multiplication with an ndarray has the effect of multiplying each element of the ndarray, as illustrated here:

arr2D * 4

The output is shown here:

array([[ 4,  8],

       [12, 16]])

Linear combinations of ndarrays

The following operation is a combination of scalar and ndarray operations, as well as operations between ndarrays:

2*arr2D + 3*arr2D

The output is what we would expect, as can be seen here:

array([[ 5, 10],

       [15, 20]])

Exponentiation of ndarrays

We can raise each element of the ndarray to a certain power, as illustrated here:

arr2D ** 2

The output is shown here:

array([[ 1,  4],

       [ 9, 16]])

Addition of an ndarray with a scalar

Addition of an ndarray with a scalar works similarly, as illustrated here:

arr2D + 10

The output is shown here:

array([[11, 12],

       [13, 14]])

Transposing a matrix

Finding the transpose of a matrix, which is a common operation, is possible in NumPy with the numpy.ndarray.transpose(...) method, as illustrated in the following code snippet:

arr2D.transpose()

This transposes the ndarray and outputs it, as follows:

array([[1, 3],

       [2, 4]])

Changing the layout of an ndarray

The np.ndarray.reshape(...) method allows us to change the layout (shape) of the ndarray to a compatible shape without changing its data.

For instance, to reshape arr2D from 2 x 2 to 4 x 1, we use the following code:

arr2D.reshape((4, 1))

The new reshaped 4 x 1 ndarray is displayed here:

array([[1],

       [2],

       [3],

       [4]])
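
A shape is compatible when it holds exactly the same number of elements. The following sketch shows NumPy inferring one dimension from a -1 placeholder, and the ValueError raised for an incompatible shape; the values array is illustrative.

```python
import numpy as np

values = np.arange(6)

# -1 asks NumPy to infer that dimension from the element count
two_rows = values.reshape((2, -1))
print(two_rows.shape)  # (2, 3)

# 6 elements cannot fill a 4 x 2 layout, so reshape raises an error
try:
    values.reshape((4, 2))
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("incompatible shape:", error)
```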

The following code example combines np.random.randn(...) and np.ndarray.reshape(...) to create a 3 x 3 ndarray of random values:

arr = np.random.randn(9).reshape((3,3));

arr

The generated 3 x 3 ndarray is shown here:

array([[ 0.24344963, -0.53183761,  1.08906941],

       [-1.71144547, -0.03195253,  0.82675183],

       [-2.24987291,  2.60439882, -0.09449784]])

Finding the minimum value in an ndarray

To find the minimum value in an ndarray, we use the following command:

np.min(arr)

The result is shown here:

-2.249872908111852

Calculating the absolute value

The np.abs(...) method, shown here, calculates the absolute value of an ndarray:

np.abs(arr)

The output ndarray is shown here:

array([[0.24344963, 0.53183761, 1.08906941],

       [1.71144547, 0.03195253, 0.82675183],

       [2.24987291, 2.60439882, 0.09449784]])

Calculating the mean of an ndarray

The np.mean(...) method, shown here, calculates the mean of all elements in the ndarray:

np.mean(arr)

The mean of the elements of arr is shown here:

0.01600703714906236

We can find the mean along the columns by specifying the axis= parameter, as follows:

np.mean(arr, axis=0)

This returns the following array, containing the mean for each column:

array([-1.23928958,  0.68020289,  0.6071078 ])

Similarly, we can find the mean along the rows by running the following code:

np.mean(arr, axis=1)

That returns the following array, containing the mean for each row:

array([ 0.26689381, -0.30554872,  0.08667602])
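
The axis= parameter is easier to follow on a small non-square array with fixed values. In this sketch, axis=0 collapses the rows (one result per column) and axis=1 collapses the columns (one result per row).

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

print(float(np.mean(table)))            # 3.5
print(np.mean(table, axis=0).tolist())  # [2.5, 3.5, 4.5]
print(np.mean(table, axis=1).tolist())  # [2.0, 5.0]
```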

Finding the index of the maximum value in an ndarray

Often, we're interested in finding where in an array its largest value is. The np.argmax(...) method finds the location of the maximum value in the ndarray, as follows:

np.argmax(arr)

This returns the index of the maximum value (2.60439882) in the flattened (row-major) array:

7

The np.argmax(...) method also accepts the axis= parameter to perform the operation row-wise or column-wise, as illustrated here:

np.argmax(arr, axis=1)

This finds the location of the maximum value on each row, as follows:

array([2, 2, 1], dtype=int64)
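
Without axis=, np.argmax(...) reports a position in the flattened array. One way to turn that into a (row, column) pair is np.unravel_index(...), as in this sketch with rounded illustrative values.

```python
import numpy as np

grid = np.array([[ 0.2, -0.5,  1.1],
                 [-1.7,  0.0,  0.8],
                 [-2.2,  2.6, -0.1]])

flat_index = np.argmax(grid)

# Convert the flat position back into row and column indices
row, col = np.unravel_index(flat_index, grid.shape)

print(int(flat_index), int(row), int(col))  # 7 2 1
```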

Calculating the cumulative sum of elements of an ndarray

To calculate the running total, NumPy provides the np.cumsum(...) method. The np.cumsum(...) method, illustrated here, finds the cumulative sum of elements in the ndarray:

np.cumsum(arr)

The output provides the cumulative sum after each additional element, as follows:

array([ 0.24344963, -0.28838798,  0.80068144, -0.91076403, -0.94271656,

       -0.11596474, -2.36583764,  0.23856117,  0.14406333])

Notice the difference between a cumulative sum and a sum. A cumulative sum is an array of a running total, whereas a sum is a single number.

Applying the axis= parameter to the cumsum method works similarly, as illustrated in the following code snippet:

np.cumsum(arr, axis=1)

This goes row-wise and generates the following array output:

array([[ 0.24344963, -0.28838798,  0.80068144],

       [-1.71144547, -1.743398  , -0.91664617],

       [-2.24987291,  0.35452591,  0.26002807]])

Finding NaNs in an ndarray

Missing or unknown values are often represented in NumPy using a Not a Number (NaN) value. For many numerical methods, these must be removed or replaced with an interpolation.

First, let's set the second row to np.nan, as follows:

arr[1, :] = np.nan;

arr

The new ndarray has the NaN values, as illustrated in the following code snippet:

array([[ 0.24344963, -0.53183761,  1.08906941],

       [        nan,         nan,         nan],

       [-2.24987291,  2.60439882, -0.09449784]])

The np.isnan(...) ufunc finds if values in an ndarray are NaNs, as follows:

np.isnan(arr)

The output is an ndarray with a True value where NaNs exist and a False value where NaNs do not exist, as illustrated in the following code snippet:

array([[False, False, False],

       [ True,  True,  True],

       [False, False, False]])
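
The mask returned by np.isnan(...) can then drive the removal or replacement mentioned above. The following is a minimal sketch on fixed sample data: it drops rows containing NaNs, replaces NaNs with 0.0 (an arbitrary fill value chosen for illustration), and uses np.nanmean(...) to average while ignoring NaNs.

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [np.nan, np.nan],
                 [3.0, 5.0]])

# A single NaN makes the ordinary mean NaN
print(np.mean(data))            # nan

# np.nanmean ignores NaN entries
print(float(np.nanmean(data)))  # 2.75

# Remove every row that contains at least one NaN
valid_rows = data[~np.isnan(data).any(axis=1)]
print(valid_rows.tolist())      # [[1.0, 2.0], [3.0, 5.0]]

# Replace NaNs in a copy using Boolean indexing
filled = data.copy()
filled[np.isnan(filled)] = 0.0
print(filled.tolist())          # [[1.0, 2.0], [0.0, 0.0], [3.0, 5.0]]
```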

Finding the truth values of x1>x2 of two ndarrays

Boolean ndarrays are an efficient way of obtaining indices for values of interest. Using Boolean ndarrays is far more performant than looping over the matrix elements one by one.

Let's build another arr1 ndarray with random values, as follows:

arr1 = np.random.randn(9).reshape((3,3));

arr1

The result is a 3 x 3 ndarray, as illustrated in the following code snippet:

array([[ 0.32102068, -0.51877544, -1.28267292],

       [-1.34842617,  0.61170993, -0.5561239 ],

       [ 1.41138027, -2.4951374 ,  1.30766648]])

Similarly, let's build another arr2 ndarray, as follows:

arr2 = np.random.randn(9).reshape((3,3));

arr2

The output is shown here:

array([[ 0.33189432,  0.82416396, -0.17453351],

       [-1.59689203, -0.42352094,  0.22643589],

       [-1.80766151,  0.26201455, -0.08469759]])

The np.greater(...) function is a binary ufunc that generates a True value when the left-hand-side value in the ndarray is greater than the right-hand-side value in the ndarray. This function can be seen here:

np.greater(arr1, arr2)

The output is an ndarray of True and False values as described previously, as we can see here:

array([[False, False, False],

       [ True,  True, False],

       [ True, False,  True]])

The > infix operator, shown in the following snippet, is a shorthand of numpy.greater(...):

arr1 > arr2

The output is the same, as we can see here:

array([[False, False, False],

       [ True,  True, False],

       [ True, False,  True]])

any and all Boolean operations on ndarrays

In addition to relational operators, NumPy supports additional methods for testing conditions on matrices' values.

The following code generates an ndarray containing True for elements that satisfy the condition, and False otherwise:

arr_bool = (arr > -0.5) & (arr < 0.5);

arr_bool

The output is shown here:

array([[ True, False, False],

       [False, False, False],

       [False, False,  True]])

The following numpy.ndarray.any(...) method returns True if any element is True and otherwise returns False:

arr_bool.any()

Here, we have at least one element that is True, so the output is True, as shown here:

True

Again, it accepts the common axis= parameter and behaves as expected, as we can see here:

arr_bool.any(axis=1)

Performed row-wise, the operation yields the following:

array([ True, False,  True])

The following numpy.ndarray.all(...) method returns True when all elements are True, and False otherwise:

arr_bool.all()

This returns the following, since not all elements are True:

False

It also accepts the axis= parameter, as follows:

arr_bool.all(axis=1)

Again, each row has at least one False value, so the output is False, as shown here:

array([False, False, False])

Sorting ndarrays

Searching a sorted ndarray (for example, with binary search) is faster than scanning every element of an unsorted one.

Let's generate a 1D random array, as follows:

arr1D = np.random.randn(10);

arr1D

The ndarray contains the following data:

array([ 1.14322028,  1.61792721, -1.01446969,  1.26988026, -0.20110113,

       -0.28283051,  0.73009565, -0.68766388,  0.27276319, -0.7135162 ])

The np.sort(...) method is pretty straightforward, as can be seen here:

np.sort(arr1D)

The output is shown here:

array([-1.01446969, -0.7135162 , -0.68766388, -0.28283051, -0.20110113,

        0.27276319,  0.73009565,  1.14322028,  1.26988026,  1.61792721])

Let's inspect the original ndarray to see if it was modified by the numpy.sort(...) operation, as follows:

arr1D

The following output shows that the original array is unchanged:

array([ 1.14322028,  1.61792721, -1.01446969,  1.26988026, -0.20110113,

       -0.28283051,  0.73009565, -0.68766388,  0.27276319, -0.7135162 ])

The following np.argsort(...) method returns the indices that would sort the array; that is, the i-th entry is the position in the original array of the i-th smallest element:

np.argsort(arr1D)

The output of this operation generates the following array:

array([2, 9, 7, 5, 4, 8, 6, 0, 3, 1])

NumPy ndarrays have the numpy.ndarray.sort(...) method as well, which sorts arrays in place. This method is illustrated in the following code snippet:

arr1D.sort()

np.argsort(arr1D)

After the call to sort(), we call numpy.argsort(...) to make sure the array was sorted, and this yields the following array that confirms that behavior:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
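
The indices from np.argsort(...) can be used to reorder an array, and a sorted array can be searched with np.searchsorted(...), which performs a binary search. The following sketch uses three illustrative values.

```python
import numpy as np

values = np.array([30, 10, 20])

# Indices that would sort the array
order = np.argsort(values)
print(order.tolist())          # [1, 2, 0]

# Indexing with those indices yields the sorted values
print(values[order].tolist())  # [10, 20, 30]

# Binary search for the insertion position of 20 in the sorted array
sorted_values = np.sort(values)
position = np.searchsorted(sorted_values, 20)
print(int(position))           # 1
```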

Searching within ndarrays

Finding indices of elements where a certain condition is met is a fundamental operation on an ndarray.

First, we start with an ndarray with consecutive values, as illustrated here:

arr1 = np.array(range(1, 11));

arr1

This creates the following ndarray:

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

We create a second ndarray based on the first one, except this time the values in the second one are multiplied by 1000, as illustrated in the following code snippet:

arr2 = arr1 * 1000;

arr2

Then, we know arr2 contains the following data:

array([ 1000,  2000,  3000,  4000,  5000,  6000,  7000,  8000,  9000,

       10000])

We define another ndarray that contains 10 True and False values randomly, as follows:

cond = np.random.randn(10) > 0;

cond

The values in the cond ndarray are shown here:

array([False, False,  True, False, False,  True,  True,  True, False,

        True])

The np.where(...) method allows us to select values from one ndarray or another, depending on the condition being True or False. The following code will generate an ndarray with a value picked from arr1 when the corresponding element in the cond array is True; otherwise, the value is picked from arr2:

np.where(cond, arr1, arr2)

The returned array is shown here:

array([1000, 2000,    3, 4000, 5000,    6,    7,    8, 9000,   10])
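
When np.where(...) is called with only a condition, it returns the indices where the condition holds instead of selecting between two arrays. A minimal sketch, using the same consecutive values as arr1:

```python
import numpy as np

arr1 = np.arange(1, 11)

# With a single argument, np.where returns a tuple of index arrays,
# one per dimension; a 1D input gives a one-element tuple
(indices,) = np.where(arr1 > 7)

print(indices.tolist())        # [7, 8, 9]
print(arr1[indices].tolist())  # [8, 9, 10]
```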

File operations on ndarrays

Most NumPy arrays are read in from files and, after processing, written back out to files.

File operations with text files

The key advantages of text files are that they are human-readable and compatible with any custom software.

Let's start with the following random array:

arr

This array contains the following data:

array([[-0.50566069, -0.52115534,  0.0757591 ],

       [ 1.67500165, -0.99280199,  0.80878346],

       [ 0.56937775,  0.36614928, -0.02532004]])

The numpy.savetxt(...) method saves the ndarray to disk in text format.

The following example uses a fmt='%0.2lf' format string and specifies a comma delimiter:

np.savetxt('arr.csv', arr, fmt='%0.2lf', delimiter=',')

Let's inspect the arr.csv file written out to disk in the current directory, as follows:

!cat arr.csv

The comma-separated values (CSV) file contains the following data:

-0.51,-0.52,0.08

1.68,-0.99,0.81

0.57,0.37,-0.03

The numpy.loadtxt(...) method loads an ndarray from a text file into memory. Here, we explicitly specify the delimiter=',' parameter, as follows:

arr_new = np.loadtxt('arr.csv', delimiter=',');

arr_new

And the ndarray read in from the text file contains the following data:

array([[-0.51, -0.52,  0.08],

       [ 1.68, -0.99,  0.81],

       [ 0.57,  0.37, -0.03]])
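
Note that the values read back are rounded to two decimals, because that is all the format string wrote. The following sketch makes the loss visible with a round trip through a temporary file; the file location and sample values are illustrative.

```python
import os
import tempfile

import numpy as np

original = np.array([[0.123456, 1.5],
                     [-2.0, 3.14159]])

# Write to and read from a temporary directory to avoid leaving files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "arr.csv")
    np.savetxt(path, original, fmt="%.2f", delimiter=",")
    restored = np.loadtxt(path, delimiter=",")

# The round trip is not exact...
print(np.array_equal(original, restored))           # False

# ...but it is accurate to within the rounding error of two decimals
print(np.allclose(original, restored, atol=0.005))  # True
```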

File operations with binary files

Binary files are far more efficient for computer processing—they save and load more quickly and are smaller than text files. However, their format may not be supported by other software.

The numpy.save(...) method stores ndarrays in a binary format, as illustrated in the following code snippet:

np.save('arr', arr)

!cat arr.npy

The arr.npy file is binary, so its contents are mostly unreadable when printed and are not reproduced here.

The numpy.save(...) method automatically assigns the .npy extension to binary files it creates.

The numpy.load(...) method, shown in the following code snippet, is used for reading binary files:

arr_new = np.load('arr.npy');

arr_new

The newly read-in ndarray is shown here:

array([[-0.50566069, -0.52115534,  0.0757591 ],

       [ 1.67500165, -0.99280199,  0.80878346],

       [ 0.56937775,  0.36614928, -0.02532004]])

Another advantage of binary file formats is that data is stored at full precision, which matters for floating-point values; text files lose precision whenever the format string writes fewer digits than the values contain.

Let's check if the old arr ndarray and the newly read-in arr_new array match exactly, by running the following code:

arr == arr_new

This will generate the following array, containing True if the elements are equal and False otherwise:

array([[ True,  True,  True],

       [ True,  True,  True],

       [ True,  True,  True]])

So, we see that each element matches exactly.
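
The same check can be written as a self-contained round trip. This sketch saves a random array to a temporary .npy file (the path is illustrative) and uses np.array_equal(...) to reduce the element-wise comparison to a single True or False.

```python
import os
import tempfile

import numpy as np

original = np.random.randn(3, 3)

# Save and load inside a temporary directory
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "arr.npy")
    np.save(path, original)
    restored = np.load(path)

# Binary storage preserves every bit of the float64 values
print(np.array_equal(original, restored))  # True
print(restored.dtype == original.dtype)    # True
```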

Summary

In this chapter, we have learned how to create matrices of any dimension in Python, how to access the matrices' elements, how to calculate basic linear algebra operations on matrices, and how to save and load matrices.

Working with NumPy matrices is a principal operation for any data analysis since vector operations are machine-optimized and thus are much faster than operations on Python lists—usually between 5 and 100 times faster. Backtesting any algorithmic strategy typically consists of processing enormous matrices, and then the speed difference can translate to hours or days of saved time.

In the next chapter, we introduce the second most important library for data analysis: Pandas, built upon NumPy. Pandas provides support for data manipulations based upon DataFrames (a DataFrame is the Python version of an Excel worksheet; that is, a two-dimensional data structure where each column has its own type).
