Array routines

In this section, we will deal with most operations on arrays. We will classify them into four main categories:

  • Routines to create new arrays
  • Routines to manipulate a single array
  • Routines to combine two or more arrays
  • Routines to extract information from arrays

The reader will surely realize that some operations of this kind can be carried out by methods, which once again shows the flexibility of Python and NumPy.

Routines to create arrays

We have previously seen the command to create an array and store it to a variable A. Let's take a look at it again:

>>> A=numpy.array([[1,2],[2,1]])

The complete syntax, however, reads as follows:

array(object,dtype=None,copy=True,order=None, subok=False,ndmin=0)

Let's go over the options. The object argument is simply the data we use to initialize the array; in the previous example, the object is a 2 x 2 square matrix, and the result is stored in the variable A. We may impose a datatype with the dtype option. If copy is True, the returned object will be a copy of the data; if False, a copy is made only when necessary, for example, if dtype is different from the datatype of object (NumPy 2.0 and later raise an error in that situation, and offer copy=None for the copy-if-needed behavior). By default, the arrays are stored following a C-style (row-major) ordering. If the user prefers to store the array following the column-major memory style of FORTRAN, the order='F' option should be used. The subok option is very subtle; if True, subclasses of ndarray are passed through; if False, the returned array is forced to be a base-class ndarray. And finally, the ndmin option indicates the minimum number of dimensions that the returned array should have. If not offered, this is computed from object.
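As a minimal sketch of these options (the variable names below are illustrative and do not come from the text), the following lines exercise dtype, order, ndmin, and copy:

```python
import numpy as np

data = [[1, 2], [2, 1]]

A = np.array(data, dtype=float)   # impose a floating-point datatype
F = np.array(data, order='F')     # column-major (Fortran-style) memory layout
V = np.array([1, 2, 3], ndmin=2)  # force at least two dimensions: shape (1, 3)

C = np.array(A, copy=True)        # an independent copy of A
C[0, 0] = 99.0                    # modifying the copy leaves A untouched

print(A.dtype, F.flags['F_CONTIGUOUS'], V.shape, A[0, 0])
```

The flags attribute is a convenient way to check which memory layout an array actually uses.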

A set of special arrays can be obtained with commands such as zeros, ones, empty, identity, and eye. The names of these commands are quite informative:

  • zeros creates an array filled with zeros.
  • ones creates an array filled with ones.
  • empty returns an array of required shape without initializing its entries.
  • identity creates a square matrix with dimensions indicated by a single positive integer n. The entries are filled with zeros, except the diagonal, which is filled with ones.

The eye command is very similar to identity. It also constructs diagonal arrays but, unlike identity, eye allows specifying a diagonal offset from the main one, and it can produce rectangular arrays as well. In the following lines of code, we use the zeros, ones, and identity commands:

>>> Z=numpy.zeros((5,5), dtype=int)
>>> U=numpy.ones((2,2), dtype=int)
>>> I=numpy.identity(3, dtype=int)

In the first two cases, we indicated the shape of the array (as a Python tuple of positive integers) and the optional datatype imposition.

The syntax for eye is as follows:

numpy.eye(N,M=None,k=0,dtype=float)

The integers N and M indicate the shape of the array (if M is omitted, it defaults to N), and the integer k indicates the index of the diagonal to populate.

An index k=0 (the default) points to the traditional diagonal; a positive index refers to upper diagonals and negative to lower diagonals. To illustrate this point, the following example shows how to create a 4 x 4 sparse matrix with nonzero elements on the first upper and subdiagonals:

>>> D=numpy.eye(4,k=1) + numpy.eye(4,k=-1)
>>> print (D)

The output is shown as follows:

[[ 0.  1.  0.  0.]
 [ 1.  0.  1.  0.]
 [ 0.  1.  0.  1.]
 [ 0.  0.  1.  0.]]
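The rectangular case is worth a quick look as well. The following sketch (with illustrative variable names) builds a 2 x 4 array with ones on the first upper diagonal, and checks that eye with default arguments agrees with identity:

```python
import numpy as np

# A 2 x 4 array with ones on the first upper diagonal (k=1)
R = np.eye(2, 4, k=1, dtype=int)
print(R)

# With M omitted and k=0, eye produces the same array as identity
same = np.array_equal(np.eye(3), np.identity(3))
print(same)
```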

Using the previous four commands together with basic slicing, it is possible to create even more complex arrays very simply. We propose the following challenge.

Using exclusively the previous definitions of U and I together with an eye array, how would the reader create a 5 x 5 array A of float values with fives at the four entries (0, 0), (0, 1), (1, 0), and (1, 1); sixes along the remaining entries of the diagonal; and threes in the two other corners? The solution to this question can be addressed by issuing the following set of commands:

>>> A=3.0*(numpy.eye(5,k=4) + numpy.eye(5,k=-4))
>>> A[0:2,0:2]=5*U; A[2:5,2:5]=6*I
>>> print (A)

The output is shown as follows:

[[ 5.  5.  0.  0.  3.]
 [ 5.  5.  0.  0.  0.]
 [ 0.  0.  6.  0.  0.]
 [ 0.  0.  0.  6.  0.]
 [ 3.  0.  0.  0.  6.]]

The flexibility of creating an array in NumPy is even more clear using the fromfunction command. For instance, if we require a 4 x 4 array where each entry reflects the product of its indices, we may use the lambda function (lambda i,j: i*j) in the fromfunction command, as follows:

>>> B=numpy.fromfunction( (lambda i,j: i*j), (4,4), dtype=int)
>>> print (B)

The output is shown as follows:

[[0 0 0 0]
 [0 1 2 3]
 [0 2 4 6]
 [0 3 6 9]]

A very important tool for dealing with arrays is the concept of masking. Masking is based on the idea of selecting those indices whose corresponding entries satisfy a given condition. For example, in the array B shown in the previous example, we can mask all zero-valued entries with the B==0 command, as follows:

>>> print (B==0)

The output is shown as follows:

[[ True  True  True  True]
 [ True False False False]
 [ True False False False]
 [ True False False False]]

Now, how would the reader update B so that all zeros are replaced by the sum of the squares of their corresponding indices?

Multiplying a mask by a second array of the same shape offers a new array in which each entry is either zero (if the corresponding entry in the mask is False), or the entry of the second array (if the corresponding entry in the mask is True):

>>> B += numpy.fromfunction((lambda i,j:i*i+j*j), (4,4), dtype=int)*(B==0)
>>> print (B)

The output is shown as follows:

[[0 1 4 9]
 [1 1 2 3]
 [4 2 4 6]
 [9 3 6 9]]
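The same update can also be written with Boolean indexing, where the mask selects the entries to overwrite. The following is a sketch under the same setup; the helper array S is an illustrative name, not part of the original example:

```python
import numpy as np

B = np.fromfunction(lambda i, j: i * j, (4, 4), dtype=int)
S = np.fromfunction(lambda i, j: i * i + j * j, (4, 4), dtype=int)

mask = (B == 0)      # True where B has a zero
B[mask] = S[mask]    # overwrite only the selected entries, in place
print(B)
```

Both arrays are created with dtype=int so that the assignment does not need to convert floating-point values back to integers.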

Note that, at each step, we have created a new array filled with Boolean values of the same size as the original array. This isn't a big deal in these toy examples, but when handling large datasets, allocating too much memory could seriously slow down our computations and exhaust the memory of our system. Among the commands to create arrays, there are two in particular, putmask and where, which facilitate the management of resources internally, thus speeding up the process.

Note, for example, that when we look for all odd-valued entries in B, the resulting mask has 16 entries, although only eight of them are of interest:

>>> print (B%2!=0)

The output is shown as follows:

[[False  True False  True]
 [ True  True False  True]
 [False False False False]
 [ True  True False  True]]

The numpy.where() command helps us gather those entries more efficiently. Let's take a look at the following command:

>>> numpy.where(B%2!=0)

The output is shown as follows:

(array([0, 0, 1, 1, 1, 3, 3, 3], dtype=int32),array([1, 3, 0, 1, 3, 0, 1, 3], dtype=int32))

If we desire to change those entries (all odd) to, say, their squares plus one, we can use the numpy.putmask() command instead, and better manage the memory at the same time. The following is a sample code for the numpy.putmask() command:

>>> numpy.putmask( B, B%2!=0, B**2+1)
>>> print (B)

The output is shown as follows:

[[ 0  2  4 82]
 [ 2  2  2 10]
 [ 4  2  4  6]
 [82 10  6 82]]

Note how the putmask procedure updates the values of B, without the explicit need to make a new assignment.
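When the original array must be preserved, the three-argument form of numpy.where offers a non-destructive alternative: it returns a new array that takes values from one source where the condition holds and from another elsewhere. The following sketch starts from the values of B before the putmask call; the name C is illustrative:

```python
import numpy as np

B = np.array([[0, 1, 4, 9],
              [1, 1, 2, 3],
              [4, 2, 4, 6],
              [9, 3, 6, 9]])

# Odd entries become their square plus one; even entries are kept
C = np.where(B % 2 != 0, B**2 + 1, B)
print(C)
print(B[0, 3])   # B itself is unchanged
```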

There are three additional commands that create arrays in the form of meshes. The arange and linspace commands create uniformly spaced values between two numbers. In arange, we specify the spacing between elements; in linspace, we specify the desired number of elements in the mesh. The logspace command creates values that are uniformly spaced on a logarithmic scale; its first two arguments are the base-10 logarithms of the endpoints. The user could think of these outputs as the support of univariate functions.

The following is a sample code for the numpy.arange() command:

>>> L1=numpy.arange(-1,1,0.3)
>>> print (L1)

The output for the preceding lines of code is shown as follows:

[-1.  -0.7 -0.4 -0.1  0.2  0.5  0.8]

The following is a sample code for the numpy.linspace() command:

>>> L2=numpy.linspace(-1,1,4)
>>> print (L2)

The output is shown as follows:

[-1.         -0.33333333  0.33333333  1.        ]

The following is an example for the numpy.logspace() command:

>>> L3= numpy.logspace(-1,1,4)
>>> print (L3)

The output for the preceding lines of code is shown as follows:

[  0.1          0.46415888   2.15443469  10.        ]
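The relationship between the three commands can be checked directly. The following sketch uses different arguments from the examples above: it shows that arange excludes the stop value, that linspace includes both endpoints, and that logspace amounts to raising 10 to the values produced by linspace:

```python
import numpy as np

L1 = np.arange(-1, 1, 0.5)     # step given; the stop value 1 is excluded
L2 = np.linspace(-1, 1, 5)     # number of points given; both endpoints included
L3 = np.logspace(-1, 1, 5)     # exponents run over linspace(-1, 1, 5)

# logspace is equivalent to raising 10 to the values of linspace
matches = np.allclose(L3, 10.0 ** L2)
print(len(L1), len(L2), matches)
```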

Finally, meshgrid, mgrid, and ogrid create two-dimensional grids containing the elements of two given one-dimensional arrays of dimensions n and m. They accomplish this by repeating the values of each array as necessary (explicitly in meshgrid and mgrid, and implicitly, through broadcasting, in ogrid). The user could think of these outputs as the support of functions of two variables.

The first of these routines, meshgrid, accepts only arrays as input. The other two routines, mgrid and ogrid, accept only indexing objects (for example, slices). The difference between these last two is a matter of memory allocation; while mgrid allocates full arrays with all the data, ogrid only creates arrays of shapes n x 1 and 1 x m, from which the corresponding mgrid output can be recovered by broadcasting.

Let's take a look at the following meshgrid command:

>>> print (numpy.meshgrid(L2,L3))

The output is shown as follows:

(array([[-1.        , -0.33333333,  0.33333333,  1.        ],
       [-1.        , -0.33333333,  0.33333333,  1.        ],
       [-1.        , -0.33333333,  0.33333333,  1.        ],
       [-1.        , -0.33333333,  0.33333333,  1.        ]]),
 array([[  0.1       ,   0.1       ,   0.1       ,   0.1       ],
       [  0.46415888,   0.46415888,   0.46415888,   0.46415888],
       [  2.15443469,   2.15443469,   2.15443469,   2.15443469],
       [ 10.        ,  10.        ,  10.        ,  10.        ]]))

Let's take a look at the following mgrid command:

>>> print (numpy.mgrid[0:5,0:5])

The output is shown as follows:

[[[0 0 0 0 0]
  [1 1 1 1 1]
  [2 2 2 2 2]
  [3 3 3 3 3]
  [4 4 4 4 4]]

 [[0 1 2 3 4]
  [0 1 2 3 4]
  [0 1 2 3 4]
  [0 1 2 3 4]
  [0 1 2 3 4]]]

Let's take a look at the following ogrid command:

>>> print (numpy.ogrid[0:5,0:5])

The output is shown as follows:

[array([[0],
       [1],
       [2],
       [3],
       [4]]), array([[0, 1, 2, 3, 4]])]
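To see how the compact output of ogrid relates to the full arrays of mgrid, the following sketch (with illustrative variable names) broadcasts the former into the latter, and evaluates the same function of two variables on both grids:

```python
import numpy as np

gx, gy = np.mgrid[0:5, 0:5]   # two full 5 x 5 arrays
ox, oy = np.ogrid[0:5, 0:5]   # shapes (5, 1) and (1, 5)

# Broadcasting the open grid recovers the full one
bx, by = np.broadcast_arrays(ox, oy)
recovered = np.array_equal(bx, gx) and np.array_equal(by, gy)

# Both grids give the same values for f(x, y) = x**2 + y**2
f_full = gx**2 + gy**2
f_open = ox**2 + oy**2
print(ox.shape, oy.shape, recovered, np.array_equal(f_full, f_open))
```

In arithmetic expressions, the open grid is expanded only when the result is computed, which is where the memory saving comes from.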

We would like to finish the subsection on the creation of arrays by showing one of the most useful routines for image processing and differential equations: the tile command. Its syntax is very simple, and is shown as follows:

tile(A, reps)

This routine presents a very effective method of tiling an array A following some repetition pattern reps (a tuple, a list, or another array) to create larger arrays. The following checkerboards exercise shows its potential.

Start with two small binary arrays, B=numpy.ones((3,3)) and checker2by2=numpy.zeros((6,6)), and create a checkerboard using tile and as few operations as possible.

Let's perform some operations using these commands:

>>> checker2by2[0:3,0:3]=checker2by2[3:6,3:6]=B
>>> numpy.tile(checker2by2,(4,4))

The output is too long to be shown here. Please refer to the How to open IPython Notebooks section in Chapter 1, Introduction to SciPy, to run the IPython Notebook corresponding to this chapter.
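A smaller instance of the same idea fits on the page. The following sketch (the names cell and board are illustrative) tiles a 2 x 2 cell twice vertically and three times horizontally:

```python
import numpy as np

cell = np.array([[1, 0],
                 [0, 1]])
board = np.tile(cell, (2, 3))   # 2 copies down, 3 copies across
print(board)
```

The result has shape (4, 6), with the entries of cell alternating along rows and columns.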

Routines for the combination of two or more arrays

On occasion, we need to combine the data of two or more arrays together to solve a specific problem. The core NumPy libraries contain extremely efficient routines to carry out these computations, and we urge the reader to get familiar with them. They are designed to keep both memory usage and computational complexity low. Most relevant are the routines that operate on arrays as if they were matrices. These include matrix products (outer, inner, dot, vdot, tensordot, cross, and kron), array correlations (correlate and convolve), array stacking (concatenate, vstack, hstack, column_stack, row_stack, and dstack), and array comparison (allclose).

If you are well-versed in linear algebra, you will surely enjoy the matrix products included in NumPy. We will postpone their usage and analysis until we cover the SciPy module on linear algebra in Chapter 3, SciPy for Linear Algebra.

An excellent use for correlation of arrays is basic pattern matching. For instance, in the following example, the first array (text) contains an image of a paragraph extracted from the Wikipedia page about Don Quixote, while the second array (letterE) contains an image of the letter e, which is actually a subarray obtained from the text array and represents the pattern to be matched.

First, we load the text image and perform some preprocessing on it in order to bring the image to the right format (a grayscale approximation with values rescaled to the range from -1 to 1), so that this naive approach to pattern matching performs better. We do this by executing the following lines of code in a Python console (note that scipy.ndimage.imread was removed in SciPy 1.2, so recent installations need another image reader, such as imageio.imread):

>>> import scipy.ndimage
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> text = scipy.ndimage.imread('Chap_02_text_image.png')
>>> text = np.mean(text.astype(float)/255,-1)*2-1

Second, the pattern for the letter e is identified:

>>> letterE = text[37:53,275:291]

Next, thresholding the correlation of both arrays at a fraction of its maximum value offers the location of all the e letters contained in the array text:

>>> corr = scipy.ndimage.correlate(text,letterE)
>>> eLocations = (corr >= 0.95 * corr.max())

The x coordinates of the positions in the image where the pattern was found are as follows:

>>> CorrLocIndex = np.where(eLocations==True)
>>> x=CorrLocIndex[1]
>>> x 

The output is shown as follows:

array([ 283,  514,  583,  681,  722,  881,  929, 1023,   64,  188,  452, 
        504,  892,  921, 1059, 1087, 1102, 1133,  118,  547,  690, 1066, 
       1110,  330,  363,  519,  671,  913,  951, 1119,  120,  292,  441, 
        516,  557,  602,  649,  688,  717,  747,  783,  813,  988, 1016, 
        250,  309,  505,  691,  769,  876,  904, 1057,  224,  289,  470, 
        596,  626,  780, 1027,  112,  151,  203,  468,  596,  751,  817, 
        867,  203,  273,  369,  560,  599,  888, 1111,  159,  221,  260, 
        352,  427,  861,  901, 1034, 1146,  325,  506,  558]) 

The y coordinates of the positions in the image where the pattern was found are as follows:

>>> y=CorrLocIndex[0] 
>>> y 

The output is shown as follows:

array([ 45,  45,  45,  45,  45,  45,  45,  45,  74,  74,  74,  74,  74, 
        74,  74,  74,  74,  74, 103, 103, 103, 103, 103, 132, 132, 132, 
       132, 132, 132, 132, 161, 161, 161, 161, 161, 161, 161, 161, 161, 
       161, 161, 161, 161, 161, 190, 190, 190, 190, 190, 190, 190, 190, 
       219, 219, 219, 219, 219, 219, 219, 248, 248, 248, 248, 248, 248, 
       248, 248, 277, 277, 277, 277, 277, 277, 277, 306, 306, 306, 306, 
       306, 306, 306, 306, 306, 335, 335, 335]) 
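Since the image file may not be at hand, the same thresholding idea can be tried on synthetic data. The following sketch is not the example from the text: it plants a small hypothetical 3 x 3 pattern twice in a background of -1 values, and recovers the centers of both copies with scipy.ndimage.correlate:

```python
import numpy as np
import scipy.ndimage

# Hypothetical 3 x 3 pattern with values in {-1, 1}
pattern = np.array([[ 1, -1,  1],
                    [-1,  1, -1],
                    [ 1, -1,  1]], dtype=float)

# Background of -1 values with two copies of the pattern planted in it
image = -np.ones((8, 12))
image[1:4, 2:5] = pattern     # copy centered at row 2, column 3
image[4:7, 8:11] = pattern    # copy centered at row 5, column 9

# The correlation is largest where the window matches the pattern exactly
corr = scipy.ndimage.correlate(image, pattern)
rows, cols = np.where(corr >= 0.95 * corr.max())
print(rows, cols)
```

The reported positions are the centers of the matching windows, not their upper-left corners.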

There are 86 elements, which is in fact the total number of occurrences of the letter e in the text image, as can be verified by counting them. Whether the matching has been done correctly can be verified graphically by superposing each pair (x,y) of the pattern on the text image, as follows:

>>> thefig=plt.figure()
>>> plt.subplot(211)
<matplotlib.axes._subplots.AxesSubplot object at 0x7fb9b2390110>
>>> plt.imshow(text, cmap=plt.cm.gray, interpolation='nearest')
<matplotlib.image.AxesImage object at 0x7fb9b1f29410>
>>> plt.axis('off')

The output for plt.axis() is shown as follows:

(-0.5, 1199.5, 359.5, -0.5) 

Now, let's move further in the code:

>>> plt.subplot(212) 
<matplotlib.axes._subplots.AxesSubplot object at 0x7fb9b1f29890>  
>>> plt.imshow(text, cmap=plt.cm.gray, interpolation='nearest') 
<matplotlib.image.AxesImage object at 0x7fb9b1637e10> 
>>> plt.autoscale(False) 
>>> plt.plot(x,y,'wo',markersize=10) 
[<matplotlib.lines.Line2D object at 0x7fb9b1647290>] 
>>> plt.axis('off') 

The output for plt.axis() is shown as follows:

(-0.5, 1199.5, 359.5, -0.5) 

Finally, in the following show() command, we display a figure that superposes each pair (x,y) of the pattern on the text image:

>>> plt.show() 

This results in the following screenshot (the first image is the text and the next is the text where all occurrences of the letter e have been marked with white circles):


A few words about stacking operations: we have a basic concatenation routine, concatenate, which joins a sequence of arrays together along a predetermined axis. Of course, all arrays in the sequence must have the same shape, except possibly along the axis of concatenation. The rest of the stack operations are syntactic sugar for special cases of concatenate: vstack to glue arrays vertically, hstack to glue arrays horizontally, dstack to glue arrays in the third dimension, and so on.
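The following sketch, with illustrative variable names, compares the shapes produced by the stacking routines and checks two of them against the corresponding concatenate calls:

```python
import numpy as np

a = np.ones((2, 3))
b = np.zeros((2, 3))

v = np.vstack((a, b))   # same as concatenate along axis 0: shape (4, 3)
h = np.hstack((a, b))   # same as concatenate along axis 1: shape (2, 6)
d = np.dstack((a, b))   # stacked along a new third axis: shape (2, 3, 2)

same_v = np.array_equal(v, np.concatenate((a, b), axis=0))
same_h = np.array_equal(h, np.concatenate((a, b), axis=1))
print(v.shape, h.shape, d.shape, same_v, same_h)
```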

Another impressive set of routines are set operations. They allow the user to handle one-dimensional arrays as if they were sets and perform the Boolean operations of intersection (intersect1d), union (union1d), set difference (setdiff1d), and set exclusive or (setxor1d). These set operations return sorted arrays of unique values. Note that it is also possible to test, element by element, whether the entries of one array belong to a second array (in1d).
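A short sketch of these set routines on two small arrays follows. The membership test uses numpy.isin, which recent NumPy versions recommend over in1d; the variable names are illustrative:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 3])
b = np.array([3, 4, 5, 6])

common = np.intersect1d(a, b)   # values present in both arrays
either = np.union1d(a, b)       # values present in at least one array
only_a = np.setdiff1d(a, b)     # values of a that are not in b
one_of = np.setxor1d(a, b)      # values present in exactly one array
member = np.isin(a, b)          # element-wise membership of a in b
print(common, either, only_a, one_of, member)
```

Note that the repeated value 3 in a appears only once in the results of the set operations, while the membership test keeps the shape of a.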

Routines for array manipulation

There is a sequence of splitting routines, designed to break up arrays into smaller arrays, in any given dimension—array_split, split (both are basic splitting along the indicated axis), hsplit (horizontal split), vsplit (vertical split), and dsplit (in the third axis). Let's illustrate these with a simple example:

>>> import numpy 
>>> B = numpy.ones((3,3)) 
>>> checker2by2 = numpy.zeros((6,6)) 
>>> checker2by2[0:3,0:3] = checker2by2[3:6,3:6] = B 
>>> print(checker2by2)

The output is shown as follows:

[[ 1.  1.  1.  0.  0.  0.]
 [ 1.  1.  1.  0.  0.  0.]
 [ 1.  1.  1.  0.  0.  0.]
 [ 0.  0.  0.  1.  1.  1.]
 [ 0.  0.  0.  1.  1.  1.]
 [ 0.  0.  0.  1.  1.  1.]]

Now, let's perform the vertical split:

>>> numpy.vsplit(checker2by2,3)

The output is shown as follows:

[array([[ 1.,  1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  0.,  0.,  0.]]),
 array([[ 1.,  1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  1.]]),
 array([[ 0.,  0.,  0.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  1.,  1.,  1.]])]
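The difference between split and array_split shows up when the array cannot be divided evenly. In the following sketch (with illustrative names), array_split accepts seven elements in three parts, while split raises an error:

```python
import numpy as np

v = np.arange(7)

parts = np.array_split(v, 3)   # uneven split: sizes 3, 2, 2
sizes = [len(p) for p in parts]

try:
    np.split(v, 3)             # 7 elements cannot be divided into 3 equal parts
    split_failed = False
except ValueError:
    split_failed = True

print(sizes, split_failed)
```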

Applying a Python function on an array usually means applying the function to each element of the array. Note how the NumPy function sin works on an array, for example:

>>> a=numpy.array([-numpy.pi, numpy.pi])
>>> print (numpy.vstack((a, numpy.sin(a))))

The output is shown as follows:

[[ -3.14159265e+00   3.14159265e+00]
 [ -1.22464680e-16   1.22464680e-16]]

Note that the sin function was computed on each element of the array.

This works provided the function has been properly vectorized (which is the case with numpy.sin). Notice the behavior with non-vectorized Python functions. Let's define such a function for computing, for each value of x, the maximum between x and 100 without using any routine from the NumPy libraries:

# function max100
>>> def max100(x):
...     if x <= 100:
...         return 100
...     return x

If we try to apply this function to the preceding array, the system raises an error, as follows:

>>> max100(a)

The output is an error which is shown as:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

We need to explicitly indicate to the system when we desire to apply one of our functions to arrays, as well as scalars. We do that with the vectorize routine, as follows:

>>> numpy.vectorize(max100)(a)

The output is shown as follows:

array([100, 100])
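For this particular function, NumPy already offers a vectorized counterpart, numpy.maximum, so vectorize is not strictly needed. The following sketch shows the equivalent computation; since vectorize essentially loops over the elements in Python, the built-in routine is usually the faster choice on large arrays:

```python
import numpy as np

a = np.array([-np.pi, np.pi])
m = np.maximum(a, 100)   # element-wise maximum of each entry and 100
print(m)
```

Unlike the vectorize call, the result here keeps the floating-point datatype of a.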

For our benefit, the NumPy libraries provide a great deal of already-vectorized mathematical functions. Some examples are round_ (to round the elements of an array to a desired number of decimal places), fix (to round the elements of an array toward zero), and angle (to provide the angle of the elements of an array, provided they are complex numbers), as well as any basic trigonometric (sin, cos, tan, sinc), exponential and hyperbolic (exp, exp2, sinh, cosh), and logarithmic functions (log, log10, log2).

We also have mathematical functions that treat the array as an output of multidimensional functions, and offer relevant computations. Some useful examples are diff (to emulate differentiation along any specified dimension, by performing discrete differences), gradient (to compute the gradient of the corresponding function), or cov (for the covariance of the array).
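As a small illustration, the following sketch applies diff and gradient to samples of the function x**2 taken at unit spacing (the sample values are chosen here for illustration):

```python
import numpy as np

y = np.array([0., 1., 4., 9., 16.])   # samples of x**2 at x = 0, 1, 2, 3, 4

d = np.diff(y)        # differences between consecutive samples
g = np.gradient(y)    # central differences inside, one-sided at the ends
print(d, g)
```

Note that diff returns one element fewer than its input, whereas gradient keeps the original length.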

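Sorting is also possible with the msort and sort_complex routines: msort sorts an array along its first axis (it is equivalent to numpy.sort(a, axis=0)), while sort_complex sorts a complex array using the real parts first, and then the imaginary parts.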

Routines to extract information from arrays

Most of the routines to extract information are statistical in nature. They include average (which, when no weights are supplied, acts exactly as the mean method), median (to compute the statistical median of the array on any of its dimensions, or the array as a whole), and computation of histograms (histogram, histogram2d, and histogramdd, depending on the dimensions of the array). The other important set of routines in this category deals with the concept of bins for arrays of dimension one. This is more easily explained by means of examples. Take the array A=numpy.array([5,1,1,2,1,1,2,2,10,3,3,4,5]). The unique command finds the unique values in the array and presents them sorted:

>>> numpy.unique(A)

The output is shown as follows:

array([ 1, 2, 3, 4, 5, 10])

For arrays such as A, in which all the entries are nonnegative integers, we can visualize the array A as a sequence of eleven bins labeled with numbers from 0 to 10 (the maximum value in the array). Each bin with label n contains the number of n's in the array:

>>> numpy.bincount(A)

The output is shown as follows:

array([0, 4, 3, 2, 1, 2, 0, 0, 0, 0, 1])
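The two commands are closely related. As the following sketch suggests, asking unique for the counts of each value (through its return_counts option) gives exactly the nonzero bins reported by bincount:

```python
import numpy as np

A = np.array([5, 1, 1, 2, 1, 1, 2, 2, 10, 3, 3, 4, 5])

values, counts = np.unique(A, return_counts=True)
bins = np.bincount(A)

# The counts reported by unique are the bins of bincount at those values
agree = np.array_equal(counts, bins[values])
print(values, counts, agree)
```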

For arrays where some of the elements are not numbers (nan), NumPy has a set of routines that mimic methods to extract information, but disregard the conflicting elements—nanmax, nanmin, nanargmax, nanargmin, nansum, and so on:

>>> A=numpy.fromfunction((lambda i,j: (i+1)*(-1)**(i*j)), (4,4))
>>> print (A)

The output is shown as follows:

[[ 1.  1.  1.  1.]
 [ 2. -2.  2. -2.]
 [ 3.  3.  3.  3.]
 [ 4. -4.  4. -4.]]

Let's see the effect of log2 on array A:

>>> B=numpy.log2(A)
__main__:1: RuntimeWarning: invalid value encountered in log2
>>> print (B)

The output is shown as follows:

[[ 0.         0.         0.         0.       ]
 [ 1.               nan  1.               nan]
 [ 1.5849625  1.5849625  1.5849625  1.5849625]
 [ 2.               nan  2.               nan]]

Let's take a look at the sum and nansum commands in the following line of code:

>>> numpy.sum(B), numpy.nansum(B)

The output is shown as follows:

(nan, 12.339850002884624)
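The other nan-aware routines behave in the same spirit. The following sketch rebuilds the array B, silencing the warning with numpy.errstate, and then counts the nan entries and extracts the extreme values among the remaining ones:

```python
import numpy as np

A = np.fromfunction(lambda i, j: (i + 1) * (-1) ** (i * j), (4, 4))
with np.errstate(invalid='ignore'):   # silence the warning for log2 of negatives
    B = np.log2(A)

n_nan = int(np.isnan(B).sum())        # number of nan entries
print(n_nan, np.nanmin(B), np.nanmax(B), np.nansum(B))
```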