Structured arrays

Structured arrays, or record arrays, are useful when you need to keep closely related data of different types together while still performing computations on it. For example, when you process incident data where each incident contains geographic coordinates and an occurrence time, you can easily look up the associated location and timepoint of any computed result for further visualization. NumPy provides powerful capabilities to create such arrays of records, letting multiple data types live in one NumPy array. One NumPy principle still has to be honored, however: the data type within each field (you can think of a field as a column in the records) must be homogeneous. Here are some simple examples that show how it works:

In [20]: x = np.empty((2,), dtype = ('i4,f4,a10')) 
In [21]: x[:] = [(1,0.5, 'NumPy'), (10,-0.5, 'Essential')] 
In [22]: x 
Out[22]: 
array([(1, 0.5, 'NumPy'), (10, -0.5, 'Essential')], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', 'S10')]) 

In the previous example, we created a one-dimensional record array using numpy.empty() and specified the data types for the elements: the first element is i4 (a 32-bit signed integer, where i stands for a signed integer and 4 means 4 bytes, like np.int32), the second is a 32-bit float (f stands for float, again 4 bytes), and the third is a string of length at most 10. We assign the values to the defined array following the data type order we specified.

You can see the print-out of x, which now contains three different types of records, and we also get default field names in dtype: f0, f1, and f2. Of course, you may specify your own field names, as we'll show in the following examples.

One thing to note here is the printed data type: there is a < in front of i4 and f4, and < stands for little-endian byteorder (the least significant byte sits at the lowest memory address):

In [23]: x[0] 
Out[23]: (1, 0.5, 'NumPy') 
In [24]: x['f2'] 
Out[24]: 
array(['NumPy', 'Essential'], dtype='|S10') 

The way we retrieve data remains the same: we use the index to obtain a record. Moreover, we can use a field name to obtain the values of a particular field, so in the previous example we used f2 to obtain the string field. In the following example, we create a view of x, named y, and see how it interacts with the original record array:

In [25]: y = x['f0'] 
In [26]: y 
Out[26]: array([ 1, 10]) 
In [27]: y[:] = y * 10 
In [28]: y 
Out[28]: array([ 10, 100]) 
In [29]: y[:] = y + 0.5 
In [30]: y 
Out[30]: array([ 10, 100]) 
In [31]: x 
Out[31]: 
array([(10, 0.5, 'NumPy'), (100, -0.5, 'Essential')], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', 'S10')]) 

Here, y is a view of field f0 in x. In record arrays, the usual characteristics of NumPy arrays still apply. When you multiply by the scalar 10, it is applied to the whole of y (the broadcasting rule), and the data type is always honored. After the multiplication, we add 0.5 to y, but since the data type of field f0 is a 32-bit integer, the result is still [10, 100]. Also, because y is a view of f0 in x, the two share the same memory block, so when we print x after the calculation on y, we find that the values in x have changed as well.

Before we go further into the record arrays, let's sort out how to define a record array. The easiest way is as shown in the previous example, where we initialize a NumPy array and use the string argument to specify the data type of fields.

There are many forms of string argument that NumPy can accept (see http://docs.scipy.org/doc/numpy/user/basics.rec.html for details); the most commonly used codes are the following:

Data types      Representation
b1              Bytes
i1, i2, i4, i8  Signed integers with 1, 2, 4, and 8 bytes
u1, u2, u4, u8  Unsigned integers with 1, 2, 4, and 8 bytes
f2, f4, f8      Floats with 2, 4, and 8 bytes
c8, c16         Complex numbers with 8 and 16 bytes
a<n>            Fixed-length strings of length n
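As a quick check of these type codes, you can build a dtype from a comma-separated string and inspect the resulting fields. The following is a minimal sketch; the field names f0, f1, and so on are NumPy's defaults:

```python
import numpy as np

# Build a structured dtype from the shorthand codes in the table above:
# u2 = unsigned 2-byte integer, f8 = 8-byte float, c16 = 16-byte complex,
# a5 = fixed-length string of 5 bytes.
dt = np.dtype('u2, f8, c16, a5')

print(dt.names)           # default field names: ('f0', 'f1', 'f2', 'f3')
print(dt['f1'])           # float64
print(dt['f3'].itemsize)  # 5
```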

You may also prefix the string argument with a repeat count or a shape to define the dimensions of a field, but it is still considered just one field in the record array. Let's try using a shape as the prefix in the following example:

In [32]: z = np.ones((2,), dtype = ('3i4, (2,3)f4')) 
In [32]: z 
Out[32]: 
array([([1, 1, 1], [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]), 
       ([1, 1, 1], [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])], 
      dtype=[('f0', '<i4', (3,)), ('f1', '<f4', (2, 3))]) 

In the previous example, field f0 is a one-dimensional array of size 3 and f1 is a two-dimensional array with shape (2, 3). Now we are clear about the structure of a record array and how to define one. You might be wondering whether the default field names can be changed to something more meaningful for your analysis. Of course they can! This is how:

In [33]: x.dtype.names 
Out[33]: ('f0', 'f1', 'f2') 
In [34]: x.dtype.names = ('id', 'value', 'note') 
In [35]: x 
Out[35]: 
array([(10, 0.5, 'NumPy'), (100, -0.5, 'Essential')], 
      dtype=[('id', '<i4'), ('value', '<f4'), ('note', 'S10')]) 

By assigning the new field names back to the names attribute of the dtype object, we can have our customized field names. Alternatively, you can do this when you initialize the record array by using a list of tuples, or a dictionary. In the following examples, we are going to create two identical record arrays with customized field names, first using a list and then using a dictionary:

In [36]: list_ex = np.zeros((2,), dtype = [('id', 'i4'), ('value', 'f4', (2,))]) 
In [37]: list_ex 
Out[37]: 
array([(0, [0.0, 0.0]), (0, [0.0, 0.0])], 
      dtype=[('id', '<i4'), ('value', '<f4', (2,))]) 
In [38]: dict_ex = np.zeros((2,), dtype = {'names':['id', 'value'], 'formats':['i4', '2f4']}) 
In [39]: dict_ex 
Out[39]: 
array([(0, [0.0, 0.0]), (0, [0.0, 0.0])], 
      dtype=[('id', '<i4'), ('value', '<f4', (2,))]) 

In the list example, we supply a tuple of (field name, data type, shape) for each field. The shape argument is optional; you may also fold the shape into the data type argument. When using a dictionary to define the fields, there are two required keys (names and formats), and each key takes an equally sized list of values.
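For instance, the shape can be given either as a third tuple element or folded into the format string as a repeat count; both of the following dtypes are equivalent (a small sketch):

```python
import numpy as np

# Shape as a separate tuple element ...
dt_a = np.dtype([('id', 'i4'), ('value', 'f4', (2,))])
# ... or folded into the format string as a repeat count.
dt_b = np.dtype([('id', 'i4'), ('value', '2f4')])

print(dt_a == dt_b)  # True
```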

Before we go on to the next section, we are going to show you how to access multiple fields in your record array all at once. The following example still uses the array x that we created at beginning with a customized field: idvalue, and note:

In [40]: x[['id', 'note']] 
Out[40]: 
array([(10, 'NumPy'), (100, 'Essential')], 
      dtype=[('id', '<i4'), ('note', 'S10')]) 

You may find this example too simple. If so, try creating a new record array from a real-life example containing the country name, population, and rank, using the data from Wikipedia: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population. This will be more fun!
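A sketch of that exercise might look like the following. The country names and figures here are purely illustrative placeholders, not the real Wikipedia numbers:

```python
import numpy as np

# Hypothetical countries and population figures -- look up the real
# values on the Wikipedia page above before doing any actual analysis.
countries = np.zeros((3,), dtype=[('country', 'a20'),
                                  ('population', 'i8'),
                                  ('rank', 'i4')])
countries[:] = [('CountryA', 1000000, 1),
                ('CountryB', 500000, 2),
                ('CountryC', 250000, 3)]

# Multi-field access works just as it did for x[['id', 'note']].
print(countries[['country', 'rank']])
```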

Dates and time in NumPy

Dates and times are important when you are doing time series analytics, from something as simple as accumulating daily visitors in a museum to something as complicated as trending regression for a crime forecast. Starting from NumPy 1.7, the NumPy core supports datetime types (though they are still experimental and might be subject to change). In order to differentiate it from the datetime object in Python, the data type is called datetime64.

This section will cover numpy.datetime64 creation, time delta arithmetic, and the conversion between units and the Python datetime. Let's create a numpy.datetime64 object by using an ISO string:

In [41]: x = np.datetime64('2015-04-01') 
In [42]: y = np.datetime64('2015-04') 
In [43]: x.dtype, y.dtype 
Out[43]: (dtype('<M8[D]'), dtype('<M8[M]')) 

x and y are both numpy.datetime64 objects and are constructed from an ISO 8601 string (the universal date format; for details see https://en.wikipedia.org/wiki/ISO_8601). But the input string for x contains a day unit while the string for y does not. When creating a datetime64, NumPy automatically selects the unit from the form of the input string, so when we print the dtype of both x and y, we can see that x has the unit D, which stands for days, while y has the unit M, for months. The < is again the byteorder, here little-endian, and M8 is the short notation for datetime64 (implemented on top of np.int64). The default date units supported by numpy.datetime64 are years (Y), months (M), weeks (W), and days (D), while the time units are hours (h), minutes (m), seconds (s), and milliseconds (ms).
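For example, including a time component in the ISO string makes NumPy pick a time unit automatically (a quick sketch):

```python
import numpy as np

# A date-only string gives a date unit; adding a time part switches to
# a time unit (here minutes, because the string stops at the minutes).
d = np.datetime64('2015-04-01')
t = np.datetime64('2015-04-01T12:30')

print(d.dtype)  # datetime64[D]
print(t.dtype)  # datetime64[m]
```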

Of course we can specify the units when we create the array and also use the numpy.arange() method to create the sequence of the array. See the following examples:

In [44]: y = np.datetime64('2015-04', 'D') 
In [45]: y, y.dtype 
Out[45]: (numpy.datetime64('2015-04-01'), dtype('<M8[D]')) 
In [46]: x = np.arange('2015-01', '2015-04', dtype = 'datetime64[M]') 
In [47]: x 
Out[47]: array(['2015-01', '2015-02', '2015-03'], dtype='datetime64[M]') 

However, it's not allowed to specify a time unit when the ISO string only contains date units. A TypeError will be triggered, since conversion between date units and time units requires a choice of time zone and the particular time of day on a given date:

In [48]: y = np.datetime64('2015-04-01', 's') 
TypeError: Cannot parse "2015-04-01" as unit 's' using casting rule 'same_kind' 

Next, we are going to do a subtraction of two numpy.datetime64 arrays, and you will see that the broadcasting rules are still valid as long as the date/time units between two arrays are convertible. We use the same array x created earlier and create a new y for the following example:

In [49]: x 
Out[49]: array(['2015-01', '2015-02', '2015-03'], dtype='datetime64[M]') 
In [50]: y = np.datetime64('2015-01-01') 
In [51]: x - y 
Out[51]: array([ 0, 31, 59], dtype='timedelta64[D]') 

Interestingly enough, the result of subtracting y from x is [0, 31, 59], no longer dates, and the dtype has changed to timedelta64[D]. Because NumPy doesn't have a physical quantities system in its core, the timedelta64 data type was created to complement datetime64. In the previous example, [0, 31, 59] is the number of days from 2015-01-01 to each element of x, the unit being days (D). You may also do arithmetic between datetime64 and timedelta64, as shown in the following examples:

In [52]: np.datetime64('2015') + np.timedelta64(12, 'M') 
Out[52]: numpy.datetime64('2016-01') 
In [53]: np.timedelta64(1, 'W') / np.timedelta64(1, 'D') 
Out[53]: 7.0 
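Related to this, a timedelta64 can also be cast between compatible units with astype (a brief sketch):

```python
import numpy as np

week = np.timedelta64(1, 'W')

# Casting to a finer unit preserves the duration.
print(week.astype('timedelta64[D]'))  # 7 days
print(week.astype('timedelta64[h]'))  # 168 hours

# The underlying integer count is available by casting to int.
print(week.astype('timedelta64[h]').astype(int))  # 168
```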

In the last part of this section, we are going to talk about the conversion between numpy.datetime64 and the Python datetime. Although the datetime64 object inherits many traits from a NumPy array, there are still benefits to using the Python datetime object (such as the date and year attributes, isoformat, and more), or vice versa. For example, you may have a list of datetime objects and want to convert them to numpy.datetime64 for arithmetic or other NumPy ufuncs. In the following example, we convert the existing datetime64 array x to a list of Python datetime objects in two ways:

In [54]: x 
Out[54]: array(['2015-01', '2015-02', '2015-03'], dtype='datetime64[M]') 
In [55]: x.tolist() 
Out[55]: 
[datetime.date(2015, 1, 1), 
 datetime.date(2015, 2, 1), 
 datetime.date(2015, 3, 1)] 
In [56]: [element.item() for element in x] 
Out[56]: 
[datetime.date(2015, 1, 1), 
 datetime.date(2015, 2, 1), 
 datetime.date(2015, 3, 1)] 

We can see that numpy.datetime64.tolist() and numpy.datetime64.item() with a for loop achieve the same goal: converting the array to a list of Python datetime objects. Needless to say, we all know which is the preferred method for the conversion (if you don't know the answer, have a quick look at Chapter 3, Using NumPy Arrays). On the other hand, if you already have a list of Python datetime objects and want to convert it to a NumPy datetime64 array, simply use the numpy.array() function.
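A short sketch of that reverse direction, going from a list of Python date objects to a datetime64 array:

```python
import datetime
import numpy as np

dates = [datetime.date(2015, 1, 1),
         datetime.date(2015, 2, 1),
         datetime.date(2015, 3, 1)]

# numpy.array() converts the date objects directly; an explicit dtype
# lets you choose the unit (months here, to mirror the array x above).
arr = np.array(dates, dtype='datetime64[M]')
print(arr)  # ['2015-01' '2015-02' '2015-03']
```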

File I/O and NumPy

Now that we can perform NumPy array computation and manipulation and know how to construct a record array, it's time to do some real-world analysis by reading files into a NumPy array and outputting the result array to a file for further analysis.

We should talk about reading files first and then exporting them, but here we are going to reverse the process: we create a record array first and then output it to a CSV file. Then we read the exported CSV file back into a NumPy record array and compare it with our original one. The sample array we're going to create will contain an id field of consecutive integers, a value field of random floats, and a date field of type datetime64[D]. This exercise uses all the knowledge you gained from the previous sections and chapters. Let's start creating the record array:

In [57]: id = np.arange(1000) 
In [58]: value = np.random.random(1000) 
In [59]: day = np.random.random_integers(0, 365, 1000) * np.timedelta64(1,'D') 
In [60]: date = np.datetime64('2014-01-01') + day 
In [61]: rec_array = np.core.records.fromarrays([id, value, date], names='id, value, date', formats='i4, f4, a10') 
In [62]: rec_array[:5] 
Out[62]: 
rec.array([(0, 0.07019801437854767, '2014-07-10'), 
       (1, 0.4863224923610687, '2014-12-03'), 
       (2, 0.9525277614593506, '2014-03-11'), 
       (3, 0.39706873893737793, '2014-01-02'), 
       (4, 0.8536589741706848, '2014-09-14')], 
      dtype=[('id', '<i4'), ('value', '<f4'), ('date', 'S10')]) 

We first create three NumPy arrays representing the fields we need: id, value, and date. When creating the date field, we combine numpy.datetime64 with a random NumPy array of size 1000 to simulate random dates in the range from 2014-01-01 to 2014-12-31 (365 days).

Then we use the numpy.core.records.fromarrays() function to merge the three arrays into one record array and assign the names (field name) and the formats (data type). One thing to notice here is that the record array doesn't support the numpy.datetime64 object, so we stored it in the array as a date/time string with a length of 10.

If you are using Python 3, you will find the prefix b added to the front of the date/time strings in the record array, such as b'2014-09-25'. b here stands for "bytes literals", meaning the string contains only ASCII characters (all string types in Python 3 are Unicode, which is one major change between Python 2 and 3). Therefore, in Python 3, converting an object (datetime64) to a string adds the prefix to distinguish it from the normal string type. However, it doesn't affect what we are going to do next, exporting the record array to a CSV file:

In [63]: np.savetxt('./record.csv', rec_array, fmt='%i,%.4f,%s') 

We use the numpy.savetxt() function to handle the export, specifying the exported file location as the first argument, then the array, and the format through the fmt argument. We have three fields with three different data types and we want a , between each field in the CSV file; if you prefer another delimiter, replace the commas in the fmt argument. We also get rid of redundant digits in the value field by writing only four digits after the decimal point, using %.4f. Now you may go to the file location we specified in the first argument to check the CSV file. Open it in a spreadsheet program and you will see rows such as 0,0.0702,2014-07-10.


Next, we are going to read the CSV file into a record array and use the value field to generate a mask field, named mask, which indicates whether the value is larger than or equal to 0.75. Then we will append the new mask field to the record array. Let's read the CSV file first:

In [64]: read_array = np.genfromtxt('./record.csv', dtype='i4,f4,a10', delimiter=',', skip_header=0) 
In [65]: read_array[:5] 
Out[65]: 
array([(0, 0.07020000368356705, '2014-07-10'), 
       (1, 0.486299991607666, '2014-12-03'), 
       (2, 0.9524999856948853, '2014-03-11'), 
       (3, 0.3971000015735626, '2014-01-02'), 
       (4, 0.8536999821662903, '2014-09-14')], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', 'S10')]) 

We use numpy.genfromtxt() to read the file into a NumPy record array. The first argument is still the location of the file we want to access, and the dtype argument is optional; if we don't specify it, NumPy determines the type of each column individually from its contents. Since we know the data well, it's recommended to specify the dtype every time you read a file.

The delimiter argument is also optional; by default, any consecutive whitespace acts as the delimiter. Here, however, we used "," for the CSV file. The last optional argument we use in this call is skip_header. Although we don't have field names on top of the records in the file, NumPy provides the functionality to skip a number of lines at the beginning of the file.

Other than skip_header, the numpy.genfromtxt() function supports 22 more operation parameters to fine-tune the array, such as defining missing and filling values. For more details, please refer to http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html.
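As an illustration of two of those parameters, missing_values and filling_values, here is a small sketch that reads from an in-memory buffer rather than a file:

```python
from io import BytesIO

import numpy as np

# Two rows; the second row is missing its float value.
data = BytesIO(b"1,0.5\n2,\n")

# Blank fields count as missing by default; filling_values tells
# genfromtxt what to substitute, keyed here by column index.
arr = np.genfromtxt(data, dtype='i4,f4', delimiter=',',
                    filling_values={1: -1.0})

print(arr)  # the gap in column 1 is filled with -1.0
```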

Now that the data has been read into the record array, you will find that the second field shows more than the four digits after the decimal point that we specified when exporting the CSV. The reason is that we use f4 as its data type when reading it in: the extra digits are filled in by NumPy, but the four valid digits remain the same as in the file. You may also notice we lost the field names, so let's specify them:

In [66]: read_array.dtype.names = ('id', 'value', 'date') 

The last part of this exercise is to create a mask variable whose values record whether the value field is larger than or equal to 0.75. We then append the new mask array to read_array as a new column:

In [68]: mask = read_array['value'] >= 0.75 
In [69]: from numpy.lib.recfunctions import append_fields 
In [70]: read_array = append_fields(read_array, 'mask', data=mask, dtypes='i1') 
In [71]: read_array[:5] 
Out[71]: 
masked_array(data = [(0, 0.07020000368356705, '2014-07-10', 0)
 (1, 0.486299991607666, '2014-12-03', 0)
 (2, 0.9524999856948853, '2014-03-11', 1)
 (3, 0.3971000015735626, '2014-01-02', 0)
 (4, 0.8536999821662903, '2014-09-14', 1)],
      dtype = [('id', '<i4'), ('value', '<f4'), ('date', 'S10'), ('mask', 'i1')])

numpy.lib.recfunctions can only be accessed when you import it directly, and the append_fields() function lives in that module. Appending to a record array is as simple as appending to a NumPy array: the first argument is the base array; the second is the name of the new field, mask, along with the data associated with it; and the last argument is the data type. Because append_fields() returns a masked array by default, the result is printed as a masked_array, but we can still see the new field added to read_array, and its values reflect the threshold (>= 0.75) on the value field. This is just the beginning of how to hook up NumPy arrays with your data files. Now it's time to do some real analysis with your data!
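If you would rather get a plain structured array back instead of a masked array, append_fields() accepts a usemask argument. A short sketch, using small made-up data in place of the CSV contents:

```python
import numpy as np
from numpy.lib.recfunctions import append_fields

# Hypothetical stand-in for the data read from record.csv.
base = np.array([(1, 0.9), (2, 0.3)],
                dtype=[('id', 'i4'), ('value', 'f4')])
mask = base['value'] >= 0.75

# usemask=False returns an ordinary structured array, not a masked array.
flat = append_fields(base, 'mask', data=mask.astype('i1'),
                     dtypes='i1', usemask=False)

print(flat.dtype.names)  # ('id', 'value', 'mask')
print(flat['mask'])      # [1 0]
```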
