netCDF4 is the fourth version of the netCDF library, implemented on top of HDF5 (Hierarchical Data Format, designed to store and organize large amounts of data), which makes it possible to manage extremely large and complex multidimensional data. The greatest advantage of netCDF4 is that it is a completely portable file format with no limit on the number or size of data objects in a collection; files can be appended to and archived. Many scientific research organizations use it for data storage, and Python offers an interface for reading and creating this file format.
You can download and install the module from its official documentation page at http://unidata.github.io/netcdf4-python/, or clone it from its GitHub repository at https://github.com/Unidata/netcdf4-python. It's not included in the standard Python scientific distributions; it depends on NumPy and can be built with Cython (this is recommended but not required).
For the following example, we are going to use a sample netCDF4 file from the Unidata website at http://www.unidata.ucar.edu/software/netcdf/examples/files.html; we will use the climate system model example file, sresa1b_ncar_ccsm3-example.nc.
First, we will use the netCDF4 module to explore the dataset a bit and extract the values we need for further analysis:
In [1]: import netCDF4 as nc
In [2]: dataset = nc.Dataset('sresa1b_ncar_ccsm3-example.nc', 'r')
In [3]: variables = [var for var in dataset.variables]
In [4]: variables
Out[4]: ['area', 'lat', 'lat_bnds', 'lon', 'lon_bnds', 'msk_rgn',
         'plev', 'pr', 'tas', 'time', 'time_bnds', 'ua']
We imported the Python netCDF4 module and used the Dataset() function to read the sample netCDF4 file. The 'r' argument opens the file in read-only mode; we can instead pass 'a' to append to an existing file or 'w' to create a new one. Then, we obtained all the variables stored in the dataset and saved them to a list called variables (note that the variables attribute returns a Python dictionary mapping variable names to variable objects), and printed them out. Next, let's take a closer look at one of the variables:
In [5]: precipitation = dataset.variables['pr']
In [6]: precipitation.standard_name
Out[6]: 'precipitation_flux'
In [7]: precipitation.missing_value
Out[7]: 1e+20
In [8]: precipitation.ndim
Out[8]: 3
In [9]: precipitation.shape
Out[9]: (1, 128, 256)
In [10]: precipitation[:, 1, :10]
Out[10]:
array([[  8.50919207e-07,   8.01471970e-07,   7.74396426e-07,
          7.74230614e-07,   7.47181844e-07,   7.21426375e-07,
          7.19294349e-07,   6.99790974e-07,   6.83397502e-07,
          6.74683179e-07]], dtype=float32)
In the preceding example, we picked the variable named pr and saved it to precipitation. netCDF4 is a self-describing file format: you can create and access any user-defined attribute stored on a variable. The most common one is standard_name, which tells us that this variable represents the precipitation flux. We checked another commonly used attribute, missing_value, which represents the no-data value stored in the netCDF4 file. Then, we printed the number of dimensions of the precipitation variable through its ndim attribute and its shape through the shape attribute. Lastly, we extracted the first 10 columns of row 1; to do this, we simply index the variable as we would a NumPy array.
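The same slicing semantics can be tried without the sample file. The following sketch uses a synthetic NumPy array standing in for the actual precipitation variable (same (1, 128, 256) shape and the same 1e+20 no-data sentinel), and shows how the missing_value can be honored with a masked array before computing statistics:

```python
import numpy as np

# Synthetic stand-in for the (1, 128, 256) precipitation variable
rng = np.random.default_rng(0)
pr = rng.random((1, 128, 256)).astype(np.float32)

# Mark one cell as "no data" using the same sentinel as the file
missing_value = 1e20
pr[0, 1, 3] = missing_value

# Same indexing as precipitation[:, 1, :10]: row 1, first 10 columns
row = pr[:, 1, :10]
print(row.shape)  # (1, 10)

# Mask the sentinel so it does not pollute statistics
masked = np.ma.masked_values(pr, missing_value)
print(masked[0, 1, :10].count())  # 9 valid cells out of 10
```

Masking before aggregation matters: without it, a single 1e+20 sentinel would dominate any mean or sum taken over the grid.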
Next, we are going to cover the basics of creating a netCDF4 file and storing a three-dimensional NumPy ndarray as a variable:
In [11]: import numpy as np
In [12]: time = np.arange(10)
In [13]: lat = 54 + np.random.randn(8)
In [14]: lon = np.random.randn(6)
In [15]: data = np.random.randn(480).reshape(10, 8, 6)
First, we prepared a three-dimensional ndarray (data) to store in the netCDF4 file; the data spans three dimensions: time (time, size 10), latitude (lat, size 8), and longitude (lon, size 6). In netCDF4, time is not a datetime object but the number of time units (these can be seconds, hours, days, and so on) elapsed since a defined start time (both are specified in the units attribute, which we will explain later). Now we have all the data we want to store in the file, so let's build the netCDF structure:
In [16]: output = nc.Dataset('test_output.nc', 'w')
In [17]: output.createDimension('time', 10)
In [18]: output.createDimension('lat', 8)
In [19]: output.createDimension('lon', 6)
In [20]: time_var = output.createVariable('time', 'f4', ('time',))
In [21]: time_var[:] = time
In [22]: lat_var = output.createVariable('lat', 'f4', ('lat',))
In [23]: lat_var[:] = lat
In [24]: lon_var = output.createVariable('lon', 'f4', ('lon',))
In [25]: lon_var[:] = lon
We initialized the netCDF4 file by specifying the file path and the w write mode. Then, we built the structure using createDimension() to specify the dimensions: time, lat, and lon. Each dimension has a variable to represent its values, just like the scales of an axis. Next, we are going to save the three-dimensional data to the file:
In [26]: var = output.createVariable('test', 'f8', ('time', 'lat', 'lon'))
In [27]: var[:] = data
The creation of a variable always starts with the createVariable() method, which takes the variable name, the variable datatype, and the dimensions associated with it. The second step is to assign an ndarray of the same shape to the declared variable. Now that we have all the data stored in the file, we can set attributes to help describe the dataset. The following example uses the time variable to show how to specify attributes:
In [28]: time_var.standard_name = 'Time'
In [29]: time_var.units = 'days since 2015-01-01 00:00:00'
In [30]: time_var.calendar = 'gregorian'
So, now that the time variable has units and a calendar associated with it, the stored time values will be interpreted as dates based on the units and calendar we specified; the same mechanism applies to any variable. When the creation of the netCDF4 file is done, the last step is to close the file connection:
In [31]: output.close()
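The 'days since ...' encoding we attached above is just an offset from a start date. The netCDF4 module provides a num2date() helper that performs this conversion for you (honoring the calendar attribute), but the underlying arithmetic can be sketched with the standard library alone, using the same units string as in the example:

```python
from datetime import datetime, timedelta

# The units string we attached to the time variable
units = 'days since 2015-01-01 00:00:00'
start = datetime.strptime(units.split('since')[1].strip(),
                          '%Y-%m-%d %H:%M:%S')

# The raw values stored in the file: 0, 1, ..., 9
time_values = range(10)

# Convert each offset into an actual date
dates = [start + timedelta(days=float(v)) for v in time_values]
print(dates[0])   # 2015-01-01 00:00:00
print(dates[-1])  # 2015-01-10 00:00:00
```

Note that this simple sketch only handles the Gregorian calendar; netCDF data can also use calendars such as 'noleap' or '360_day', which is why num2date() is preferred in practice.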
The preceding code shows the usage of the Python netCDF4 API to read and create netCDF4 files. The module doesn't include any scientific computation routines (which is why it isn't bundled with Python scientific distributions); its focus is file I/O, which can be the very first or the very last stage of your research and analytics.