Chapter 4. NumPy Core and Libs Submodules

After covering so many NumPy ufuncs in the previous chapter, I hope you still remember the very core of NumPy, which is the ndarray object. We are going to cover the last important attribute of ndarray: strides, which will give you the full picture of its memory layout. Also, it's time to show you that NumPy arrays can deal not only with numbers but also with various types of data; we will talk about record arrays and date-time arrays. Lastly, we will show how to read/write NumPy arrays from/to files, and start to do some real-world analysis using NumPy.

The topics that will be covered in this chapter are:

  • The core of NumPy arrays: memory layout
  • Structured arrays (record arrays)
  • Date-time in NumPy arrays
  • File I/O in NumPy arrays

Introducing strides

Strides are the indexing scheme in NumPy arrays, and indicate the number of bytes to jump to find the next element. We all know the performance improvements of NumPy come from a homogeneous multidimensional array object with fixed-size items, the numpy.ndarray object. We've talked about the shape (dimensions) of the ndarray object, the data type, and the order (C-style row-major arrays and Fortran-style column-major arrays). Now it's time to take a closer look at strides.

Let's start by creating a NumPy array and changing its shape to see the differences in the strides.

  1. Create a NumPy array and take a look at the strides:

          In [1]: import numpy as np
          In [2]: x = np.arange(8, dtype=np.int8)
          In [3]: x
          Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7], dtype=int8)
          In [4]: x.strides
          Out[4]: (1,)
          In [5]: str(x.data)
          Out[5]: '\x00\x01\x02\x03\x04\x05\x06\x07'

    A one-dimensional array x is created and its data type is np.int8, which means each element in the array is an 8-bit integer (1 byte each, a total of 8 bytes). The strides represent the tuple of bytes to step in each dimension when traversing the array. In the previous example there is one dimension, so we obtain the tuple (1,): each element is 1 byte from its predecessor. When we print out x.data, we get the Python buffer object pointing to the start of the data, which runs from \x00 to \x07 in this example.

  2. Change the shape and see the stride change:
          In [6]: x.shape = 2, 4
          In [7]: x
          Out[7]:
          array([[0, 1, 2, 3],
                 [4, 5, 6, 7]], dtype=int8)
          In [8]: x.strides
          Out[8]: (4, 1)
          In [9]: str(x.data)
          Out[9]: '\x00\x01\x02\x03\x04\x05\x06\x07'
          In [10]: x.shape = 1, 4, 2
          In [11]: x.strides
          Out[11]: (8, 2, 1)
          In [12]: str(x.data)
          Out[12]: '\x00\x01\x02\x03\x04\x05\x06\x07'

    Now we change the dimensions of x to 2 by 4 and check the strides again. We can see they change to (4, 1), which means the elements in the first dimension are four bytes apart, so the array needs to jump four bytes to reach the next row, while the elements in the second dimension are still 1 byte apart, jumping one byte to reach the next column. Let's print out x.data again, and we can see that the memory layout of the data remains the same; only the strides change. The offset arithmetic behind this is sketched right after these steps. The same behavior occurs when we change the shape to be three-dimensional: a 1 by 4 by 2 array. (What if our arrays are constructed in the Fortran-style order? How will the strides change as the shape changes? Try creating a column-major array and doing the same exercise to check this out.)

  3. So now we know what a stride is, and its relationship to an ndarray object, but how can strides improve our NumPy experience? Let's do some stride manipulation to get a better sense of this (a further example of direct stride manipulation follows these steps): here are two arrays with the same content but different strides:
          In [13]: x = np.ones((10000,)) 
          In [14]: y = np.ones((10000 * 100, ))[::100] 
          In [15]: x.shape, y.shape 
          Out[15]: ((10000,), (10000,)) 
          In [16]: x == y 
          Out[16]: array([ True,  True,  True, ...,  True,  True,  True], dtype=bool)
    
  4. We create two NumPy arrays, x and y, and compare them; we can see that the two arrays are equal: they have the same shape and all their elements are one. But the two arrays actually differ in memory layout. Let's simply use the flags attribute you learned about in Chapter 2, The NumPy ndarray Object, to check the two arrays' memory layout:
          In [17]: x.flags 
          Out[17]: C_CONTIGUOUS : True 
                   F_CONTIGUOUS : True 
                   OWNDATA : True 
                   WRITEABLE : True 
                   ALIGNED : True 
                   UPDATEIFCOPY : False 
     
          In [18]: y.flags 
          Out[18]: C_CONTIGUOUS : False 
                   F_CONTIGUOUS : False 
                   OWNDATA : False 
                   WRITEABLE : True 
                   ALIGNED : True 
                   UPDATEIFCOPY : False 
    
  5. We can see that the x array is contiguous in both the C and the Fortran order while y is not. Let's check the strides for the difference:
          In [19]: x.strides, y.strides 
          Out[19]: ((8,), (800,)) 
    

    Array x is stored contiguously, so in its single dimension each element is eight bytes from the next (the default dtype of numpy.ones is a 64-bit float); however, y is created by taking every 100th element of an array of 10000 * 100 elements, so its indexing scheme in the memory layout is not contiguous.

  6. Even though x and y have the same shape, each element in y is 800 bytes away from its neighbor. When you use the NumPy arrays x and y, you might not notice a difference in indexing, but the memory layout does affect performance. Let's use the %timeit magic in IPython to check this:
          In [20]: %timeit x.sum()
          100000 loops, best of 3: 13.8 µs per loop
          In [21]: %timeit y.sum()
          10000 loops, best of 3: 25.9 µs per loop
    
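To make the stride arithmetic from steps 1 and 2 concrete, here is a minimal sketch (continuing the session above): the byte offset of element [i, j] is i times the first stride plus j times the second stride, which is exactly how NumPy locates an element in the underlying buffer:

          In [22]: x = np.arange(8, dtype=np.int8).reshape(2, 4)
          In [23]: x.strides
          Out[23]: (4, 1)
          In [24]: i, j = 1, 2
          In [25]: i * x.strides[0] + j * x.strides[1]  # byte offset of x[1, 2]
          Out[25]: 6
          In [26]: x[1, 2]  # the buffer holds the bytes 0 to 7, so byte 6 holds the value 6
          Out[26]: 6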
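Stride manipulation can also be done directly. The following is a minimal sketch using numpy.lib.stride_tricks.as_strided to build overlapping sliding windows over a small array without copying any data; the shape and strides here are chosen by hand, and as_strided does no bounds checking, so incorrect values can read memory outside the buffer:

          In [27]: from numpy.lib.stride_tricks import as_strided
          In [28]: a = np.arange(6, dtype=np.int8)
          In [29]: as_strided(a, shape=(4, 3), strides=(1, 1))  # each row starts 1 byte later
          Out[29]:
          array([[0, 1, 2],
                 [1, 2, 3],
                 [2, 3, 4],
                 [3, 4, 5]], dtype=int8)

Because every row is just a different view into the same six bytes, the windows share memory with a; no element is copied.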

Typically, with a fixed cache size, as the stride gets larger, the hit rate (the fraction of memory accesses that find data in the cache) gets lower, while the miss rate (the fraction of memory accesses that have to go to main memory) gets higher. The cache hit time and miss time together make up the average data access time. Let's look at our example again from the cache perspective: array x, with its smaller strides, is faster than y with its larger strides. The reason for the difference in performance is that the CPU pulls data from main memory into its cache in blocks, and smaller strides mean fewer transfers are needed. See the following figure for details, where the red line represents the size of the CPU cache, and the blue boxes represent the memory layout containing the data.

It's obvious that if x and y each require 100 blue boxes of data, the cache time required for x will be less.

Cache and the x and y arrays in the memory layout
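If a strided view such as y will be traversed many times, it can pay to copy it into contiguous memory first. Here is a minimal sketch using numpy.ascontiguousarray, continuing the session above:

          In [30]: y_contig = np.ascontiguousarray(y)  # copies y into a new C-contiguous buffer
          In [31]: y_contig.strides
          Out[31]: (8,)
          In [32]: y_contig.flags['C_CONTIGUOUS']
          Out[32]: True

Summing y_contig should now perform like x, since its elements are once again eight bytes apart; the price is a one-time copy of the data.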
