In the previous chapter, we generalized our import function by adding the capability to import more types of data supported by OGR.
Now, we will improve it again, make it handle some errors, make it compatible with our objects, and add two new capabilities. We will also convert the data in order to produce uniform objects.
To achieve our goals, we will analyze what kind of information is stored in the files that we want to open. We will use OGR to inspect the files and return some information that may help us with the data conversion.
First, let's alter our open_vector_file function, allowing it to handle incorrect paths and filenames, which are a very common source of errors. Perform the following steps:
Go to the utils folder and open the geo_functions.py file.

Make sure that the following import statements are at the beginning of the file:

    # coding=utf-8

    import ogr
    import osr
    import gdal
    import os
    from pprint import pprint

Edit the open_vector_file function via the following code:

    def open_vector_file(file_path):
        """Opens a vector file compatible with OGR, gets the first
        layer, and returns the OGR datasource.

        :param str file_path: The full path to the file.
        :return: The OGR datasource.
        """
        datasource = ogr.Open(file_path)
        # Check if the file was opened.
        if not datasource:
            if not os.path.isfile(file_path):
                message = "Wrong path."
            else:
                message = "File format is invalid."
            raise IOError(
                'Error opening the file {} {}'.format(
                    file_path, message))
        layer = datasource.GetLayerByIndex(0)
        print("Opening {}".format(file_path))
        print("Number of features: {}".format(
            layer.GetFeatureCount()))
        return datasource
In this step, we added a verification to check whether the file was correctly opened. If the file doesn't exist or if there are any other problems, OGR will be silent and the datasource will be empty. So, if the datasource is empty (None), we will know that something went wrong and perform another verification to see whether there was a mistake with the file path or something else happened. In either case, the program will raise an exception, preventing it from continuing with bad data.
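The two-branch diagnosis described above can be tried without OGR. The following is a minimal sketch, assuming a hypothetical helper named explain_open_failure that is not part of geo_functions.py; it only reproduces the path check that runs after ogr.Open has returned an empty datasource.

```python
import os
import tempfile


def explain_open_failure(file_path):
    """Return a short reason why a file could not be opened.

    Hypothetical helper: it mirrors the check in open_vector_file
    but never calls OGR itself.
    """
    if not os.path.isfile(file_path):
        return "Wrong path."
    return "File format is invalid."


# A file name inside a fresh, empty directory cannot exist.
missing = os.path.join(tempfile.mkdtemp(), "missing.gpx")
print(explain_open_failure(missing))

# An existing file that OGR failed to read falls into the other branch.
with tempfile.NamedTemporaryFile(suffix=".txt") as existing:
    print(explain_open_failure(existing.name))
```

Keeping the message separate from the raise statement makes it easy to reuse the same diagnosis in log output.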
Before the open_vector_file function, add the get_datasource_information function with the following code:

    def get_datasource_information(datasource, print_results=False):
        """Gets information about the first layer in the datasource.

        :param datasource: An OGR datasource.
        :param bool print_results: True to print the results on
          the screen.
        """
        info = {}
        layer = datasource.GetLayerByIndex(0)
        bbox = layer.GetExtent()
        info['bbox'] = dict(xmin=bbox[0], xmax=bbox[1],
                            ymin=bbox[2], ymax=bbox[3])
        srs = layer.GetSpatialRef()
        if srs:
            info['epsg'] = srs.GetAttrValue('authority', 1)
        else:
            info['epsg'] = 'not available'
        info['type'] = ogr.GeometryTypeToName(layer.GetGeomType())
        # Get the attribute names.
        info['attributes'] = []
        layer_definition = layer.GetLayerDefn()
        for index in range(layer_definition.GetFieldCount()):
            info['attributes'].append(
                layer_definition.GetFieldDefn(index).GetName())
        # Print the results.
        if print_results:
            pprint(info)
        return info
Here, we use a number of OGR's methods and functions to get information from the datasource and its first layer. This information is put in a dictionary, which is returned by the function. If print_results=True, the dictionary is printed with the pprint function (pretty print), which tries to print Python objects in a more human-friendly way.
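To see what pprint does with a dictionary of this shape, here is a small sketch that uses hand-written sample values (a shortened attribute list, not read from any file). pformat returns the same text that pprint prints, which makes the result easy to inspect.

```python
import ast
from pprint import pformat, pprint

# Sample values typed by hand for illustration only.
info = {
    'type': 'Point',
    'epsg': '4326',
    'bbox': {'xmin': -79.3536, 'xmax': -73.44602,
             'ymin': 40.70558, 'ymax': 44.7475},
    'attributes': ['ele', 'time', 'name', 'desc'],
}

# Keys are printed in sorted order and long structures are wrapped.
pprint(info)

# pformat builds the same text without printing it.
text = pformat(info)
```

Because the formatted text is a valid Python literal, it can be parsed back with ast.literal_eval when a quick round trip is useful.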
Now, edit the if __name__ == '__main__': block at the end of the file, as follows:

    if __name__ == "__main__":
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        datasource = open_vector_file("../../data/geocaching.gpx")
        info = get_datasource_information(
            datasource, print_results=True)
There is a new element here: gdal.PushErrorHandler('CPLQuietErrorHandler'). Geocaching files normally contain features with empty date fields. When OGR finds this situation, it prints a warning message. This could get pretty annoying when we have a lot of features. This command tells OGR/GDAL to suppress these messages so that we can have a clean output with only what we want to see.
Running the code gives the following output:

    Opening ../../data/geocaching.gpx
    Number of features: 130
    {'attributes': ['ele', 'time', 'magvar', 'geoidheight', 'name',
                    'cmt', 'desc', 'src', 'url', 'urlname', 'sym',
                    'type', 'fix', 'sat', 'hdop', 'vdop', 'pdop',
                    'ageofdgpsdata', 'dgpsid'],
     'bbox': {'xmax': -73.44602,
              'xmin': -79.3536,
              'ymax': 44.7475,
              'ymin': 40.70558},
     'epsg': '4326',
     'type': 'Point'}

    Process finished with exit code 0
The attributes key of the dictionary contains the field names that could be read from the data. Every feature in our GPX file (that is, every point) contains this set of attributes. The bbox key is the bounding box of the data, that is, the coordinates of the upper-left and lower-right corners of the rectangle that encloses the geographical extent of the data. The epsg key contains the code for the coordinate system of the data. Finally, type is the type of geometry identified by OGR.
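As a short illustration of how the bbox values relate to the two corners, the sketch below uses a hypothetical helper named bbox_corners together with the extent values from the output shown earlier. OGR's GetExtent returns the values in the order (min x, max x, min y, max y), which is why the function above maps them to named keys.

```python
def bbox_corners(bbox):
    """Return the upper-left and lower-right corners as (x, y) pairs.

    Hypothetical helper; bbox is a dictionary with the xmin, xmax,
    ymin and ymax keys built by get_datasource_information.
    """
    upper_left = (bbox['xmin'], bbox['ymax'])
    lower_right = (bbox['xmax'], bbox['ymin'])
    return upper_left, lower_right


# Extent values taken from the geocaching.gpx output.
bbox = {'xmin': -79.3536, 'xmax': -73.44602,
        'ymin': 40.70558, 'ymax': 44.7475}
upper_left, lower_right = bbox_corners(bbox)
print(upper_left)
print(lower_right)
```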
Take a look at the attributes (field names) found by OGR in the previous example; we have a name, a description, and the time. We have some technical data about the GPS solution (pdop, hdop, sat, fix, and many more) and some other fields, but none of them contains in-depth information about the geocache.
In order to take a look at what information the GPX file contains that OGR is not displaying, let's open it in PyCharm:
In your geopy project, go to the data folder.

Locate geocaching.gpx. To open it, either drag and drop it in the editor area or double-click on the filename.

PyCharm will open it for editing but won't recognize the file format and will display it in a single color; so, let's inform it that this is an XML file.

Right-click on the geocaching.gpx file. In the menu, select Associate with File Type, and a window with a list will pop up. Select XML Files and then click on the OK button.

Now, the contents of the GPX file should appear with colors differentiating the various elements of the Extensible Markup Language (XML). PyCharm is also capable of recognizing the file structure, as it does with Python. Let's take a look via the following steps:
Since we can't access these attributes directly with OGR, we will program an alternative. The objective is to read this information and flatten it into a single level of key/value pairs in a dictionary. GPX files are XML files, so we can use an XML parser to read them. The choice here is the xmltodict package; it simply converts the XML file into a Python dictionary, which is easy to manipulate because we are already familiar with dictionaries. Now, perform the following steps:
Import xmltodict at the beginning of the geo_functions.py file with the following code:

    # coding=utf-8

    import xmltodict
    import ogr
    import osr
    import gdal
    import os
    from pprint import pprint
Create a new function before open_vector_file and add the following code:

    def read_gpx_file(file_path):
        """Reads a GPX file containing geocaching points.

        :param str file_path: The full path to the file.
        """
        with open(file_path) as gpx_file:
            gpx_dict = xmltodict.parse(gpx_file.read())
        print("Waypoint:")
        print(gpx_dict['gpx']['wpt'][0].keys())
        print("Geocache:")
        print(gpx_dict['gpx']['wpt'][0]['geocache'].keys())
Now, edit the if __name__ == '__main__': block to test the code:

    if __name__ == "__main__":
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        read_gpx_file("../../data/geocaching.gpx")

Running the code gives the following output:

    Waypoint:
    [u'@lat', u'@lon', u'time', u'name', u'desc', u'src', u'url',
     u'urlname', u'sym', u'type', u'geocache']
    Geocache:
    [u'@status', u'@xmlns', u'name', u'owner', u'locale', u'state',
     u'country', u'type', u'container', u'difficulty', u'terrain',
     u'summary', u'description', u'hints', u'licence', u'logs',
     u'geokrety']

    Process finished with exit code 0
With the print(gpx_dict['gpx']['wpt'][0].keys()) statement, we obtained the value of gpx and then that of wpt, which is a list. Then, we got the keys of the first element of this list and printed them.

Next, through print(gpx_dict['gpx']['wpt'][0]['geocache'].keys()), we got the value of geocache and also printed its keys.
Look at the output and note that it matches what we saw when we were exploring the GPX file structure in PyCharm. The structure is now available as a dictionary, including the tags' attributes, which are represented in the dictionary with an @ prefix.
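The @ convention can be imitated with the standard library alone. The sketch below does not use xmltodict; it relies on xml.etree.ElementTree and a hypothetical element_to_dict helper, and it parses a tiny hand-written sample without namespaces rather than the real geocaching.gpx file. Repeated child tags would overwrite each other here, so treat it only as an illustration of the key naming.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration; not taken from geocaching.gpx.
SAMPLE = """<gpx>
  <wpt lat="43.0" lon="-79.0">
    <name>Sample cache</name>
    <geocache status="Available">
      <difficulty>2</difficulty>
    </geocache>
  </wpt>
</gpx>"""


def element_to_dict(element):
    """Convert an element into a dict, prefixing XML attributes with @."""
    result = {}
    for key, value in element.attrib.items():
        result['@' + key] = value
    for child in element:
        if len(child) or child.attrib:
            # The child has its own structure, so convert it recursively.
            result[child.tag] = element_to_dict(child)
        else:
            result[child.tag] = child.text
    return result


root = ET.fromstring(SAMPLE)
wpt = element_to_dict(root.find('wpt'))
print(sorted(wpt.keys()))
print(sorted(wpt['geocache'].keys()))
```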
Now that we have a nice and easy way to handle the dictionary of the GPX file, let's extract and flatten the relevant information and make the function return it. Edit the read_gpx_file function, as follows:

    def read_gpx_file(file_path):
        """Reads a GPX file containing geocaching points.

        :param str file_path: The full path to the file.
        """
        with open(file_path) as gpx_file:
            gpx_dict = xmltodict.parse(gpx_file.read())
        output = []
        for wpt in gpx_dict['gpx']['wpt']:
            geometry = [wpt.pop('@lat'), wpt.pop('@lon')]
            # If geocache is not on the dict, skip this wpt.
            try:
                geocache = wpt.pop('geocache')
            except KeyError:
                continue
            attributes = {'status': geocache.pop('@status')}
            # Merge the dictionaries.
            attributes.update(wpt)
            attributes.update(geocache)
            # Construct a GeoJSON feature and append to the list.
            feature = {
                "type": "Feature",
                "geometry": {
                    "type": "Point",
                    "coordinates": geometry},
                "properties": attributes}
            output.append(feature)
        return output
Note that here, we used the dictionary's pop method; this method returns the value of a given key and removes the key from the dictionary. The objective is to end up with two dictionaries that contain only attributes (properties), so that they can be merged into a single dictionary of attributes; the merging is done with the update method.

Some waypoints don't have the geocache key; when this happens, we catch the exception and skip the point.
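The pop and update steps can be tried on plain dictionaries. This is a minimal sketch with a hypothetical flatten_waypoint helper and hand-written sample waypoints; it uses pop with a default value instead of the try/except shown in the function, and it works on copies so that the input is left untouched.

```python
def flatten_waypoint(wpt):
    """Return (geometry, attributes), or None without a geocache key.

    Hypothetical helper that mirrors the loop body of read_gpx_file.
    """
    wpt = dict(wpt)  # Copy, so that the caller's dictionary is kept.
    geometry = [wpt.pop('@lat'), wpt.pop('@lon')]
    geocache = wpt.pop('geocache', None)
    if geocache is None:
        return None
    geocache = dict(geocache)
    attributes = {'status': geocache.pop('@status')}
    # Keys set by a later update overwrite keys set by an earlier one.
    attributes.update(wpt)
    attributes.update(geocache)
    return geometry, attributes


# Hand-written sample data for illustration only.
with_cache = {'@lat': '43.0', '@lon': '-79.0', 'name': 'Outer name',
              'geocache': {'@status': 'Available', 'name': 'Inner name'}}
without_cache = {'@lat': '44.0', '@lon': '-78.0', 'name': 'Plain point'}

geometry, attributes = flatten_waypoint(with_cache)
print(geometry)
print(attributes)
print(flatten_waypoint(without_cache))
```

Note that when both dictionaries share a key, such as name, the value from the geocache dictionary wins because it is merged last.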
Finally, the information is combined in a dictionary with a GeoJSON-like structure. You can do this as follows:
Edit the if __name__ == '__main__': block using the following code:

    if __name__ == "__main__":
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        points = read_gpx_file("../../data/geocaching.gpx")
        print(points[0]['properties'].keys())

Running the code gives the following output:

    ['status', u'logs', u'locale', u'terrain', u'sym', u'geokrety',
     u'difficulty', u'licence', u'owner', u'urlname', u'desc',
     u'@xmlns', u'src', u'container', u'name', u'url', u'country',
     u'description', u'summary', u'state', u'time', u'hints', u'type']

    Process finished with exit code 0
That's great! Now, all the geocache attributes are contained inside the properties of the feature.
We have a read_gpx_file function that returns a list of features as dictionaries and an open_vector_file function that returns an OGR datasource. We also have a get_datasource_information function that returns the information that we need about the file.
Now, it's time to combine these functions in order to be able to read multiple types of data (GPX, Shapefiles, and many more). To do this, we will change the open_vector_file function so that it can make decisions depending on the file format and convert the data in order to always return the same structure. Perform the following steps:
Make sure that the functions in geo_functions.py are in the correct order; if not, rearrange them to be in this order:

    def read_gpx_file(file_path):

    def get_datasource_information(datasource, print_results=False):

    def open_vector_file(file_path):

    def create_transform(src_epsg, dst_epsg):

    def transform_geometries(datasource, src_epsg, dst_epsg):

    def transform_points(points, src_epsg=4326, dst_epsg=3395):
Add the read_ogr_features function before open_vector_file, as follows:

    def read_ogr_features(layer):
        """Converts OGR features from a layer into dictionaries.

        :param layer: OGR layer.
        """
        features = []
        layer_defn = layer.GetLayerDefn()
        layer.ResetReading()
        # Named geometry_type to avoid shadowing the built-in type.
        geometry_type = ogr.GeometryTypeToName(layer.GetGeomType())
        for item in layer:
            attributes = {}
            for index in range(layer_defn.GetFieldCount()):
                field_defn = layer_defn.GetFieldDefn(index)
                key = field_defn.GetName()
                value = item.GetFieldAsString(index)
                attributes[key] = value
            feature = {
                "type": "Feature",
                "geometry": {
                    "type": geometry_type,
                    "coordinates": item.GetGeometryRef().ExportToWkt()},
                "properties": attributes}
            features.append(feature)
        return features
Edit the open_vector_file function via the following code:

    def open_vector_file(file_path):
        """Opens a vector file compatible with OGR or a GPX file.
        Returns a list of features and information about the file.

        :param str file_path: The full path to the file.
        """
        datasource = ogr.Open(file_path)
        # Check if the file was opened.
        if not datasource:
            if not os.path.isfile(file_path):
                message = "Wrong path."
            else:
                message = "File format is invalid."
            raise IOError('Error opening the file {} {}'.format(
                file_path, message))
        metadata = get_datasource_information(datasource)
        file_name, file_extension = os.path.splitext(file_path)
        # Check if it's a GPX and read it if so.
        if file_extension in ['.gpx', '.GPX']:
            features = read_gpx_file(file_path)
        # If not, use OGR to get the features.
        else:
            features = read_ogr_features(
                datasource.GetLayerByIndex(0))
        return features, metadata
Finally, edit the if __name__ == '__main__': block, as follows:

    if __name__ == "__main__":
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        points, metadata = open_vector_file(
            "../../data/geocaching.shp")
        print(points[0]['properties'].keys())
        points, metadata = open_vector_file(
            "../../data/geocaching.gpx")
        print(points[0]['properties'].keys())

Running the code gives the following output:

    ['src', 'dgpsid', 'vdop', 'sat', 'name', 'hdop', 'url', 'fix',
     'pdop', 'sym', 'ele', 'ageofdgpsd', 'time', 'urlname', 'magvar',
     'cmt', 'type', 'geoidheigh', 'desc']
    ['status', u'logs', u'locale', u'terrain', u'sym', u'geokrety',
     u'difficulty', u'licence', u'owner', u'urlname', u'desc',
     u'@xmlns', u'src', u'container', u'name', u'url', u'country',
     u'description', u'summary', u'state', u'time', u'hints', u'type']

    Process finished with exit code 0
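The format decision in open_vector_file compares the file extension with '.gpx' and '.GPX'. A possible variation, sketched here with a hypothetical is_gpx helper that is not part of the chapter's code, lowercases the extension first so that mixed-case names such as .Gpx are also recognized.

```python
import os


def is_gpx(file_path):
    """Return True when the file extension is .gpx in any letter case.

    Hypothetical helper; only the file name is inspected, not the
    file contents.
    """
    extension = os.path.splitext(file_path)[1]
    return extension.lower() == '.gpx'


for path in ["../../data/geocaching.gpx",
             "../../data/geocaching.shp",
             "TRACK.GPX"]:
    print("{} -> {}".format(path, is_gpx(path)))
```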
So far, we have defined the Geocache class; it has the latitude and longitude properties and a method to return this pair of coordinates. PointCollection is a collection of geocaches. We also have the open_vector_file function that returns a list of dictionaries representing features.
Now, we will reach a higher level of abstraction by implementing the process of importing data into the PointCollection class by making use of the open_vector_file function. Perform the following steps:
Open the models.py file and edit the imports at the beginning of the file with the following code:

    # coding=utf-8

    import gdal
    import os
    from pprint import pprint
    from utils.geo_functions import open_vector_file
Now, let's make PointCollection automatically import a file when it's instantiated. Go to the models.py file, change your class __init__ method, and add the import_data and _parse_data methods, as follows:

    class PointCollection(object):
        def __init__(self, file_path=None):
            """This class represents a group of vector data."""
            self.data = []
            self.epsg = None
            if file_path:
                self.import_data(file_path)

        def import_data(self, file_path):
            """Opens a vector file compatible with OGR and parses
            the data.

            :param str file_path: The full path to the file.
            """
            features, metadata = open_vector_file(file_path)
            self._parse_data(features)
            self.epsg = metadata['epsg']
            print("File imported: {}".format(file_path))

        def _parse_data(self, features):
            """Transforms the data into Geocache objects.

            :param features: A list of features.
            """
            for feature in features:
                geom = feature['geometry']['coordinates']
                attributes = feature['properties']
                cache_point = Geocache(geom[0], geom[1],
                                       attributes=attributes)
                self.data.append(cache_point)
Now, adapt the Geocache class to receive and store the attributes. Replace it with the following code:

    class Geocache(object):
        """This class represents a single geocaching point."""
        def __init__(self, lat, lon, attributes=None):
            self.lat = lat
            self.lon = lon
            self.attributes = attributes

        @property
        def coordinates(self):
            return self.lat, self.lon
The attributes argument is a keyword argument. Keyword arguments that have a default value are optional, and the default is the value defined after the equals sign.
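The effect of the default value can be seen by creating one object with the keyword argument and one without it. The sketch below repeats the Geocache class so that it runs on its own; the coordinates and the attribute dictionary are made-up sample values.

```python
class Geocache(object):
    """This class represents a single geocaching point."""
    def __init__(self, lat, lon, attributes=None):
        self.lat = lat
        self.lon = lon
        self.attributes = attributes

    @property
    def coordinates(self):
        return self.lat, self.lon


# Omitting the keyword argument leaves it at its default, None.
plain_point = Geocache(43.0, -79.0)
print(plain_point.attributes)

# Passing it by name overrides the default.
named_point = Geocache(43.0, -79.0, attributes={'name': 'Sample cache'})
print(named_point.attributes)
print(named_point.coordinates)
```

Using None as the default, rather than an empty dictionary, avoids sharing one mutable default object between instances.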
Since there is currently no standardized data format for geocaching, we will store all the attributes that are read from the source file unchanged.
In Python, you are not obliged to define in advance which properties a class instance will have; the properties can be added during the code's execution. However, it's good practice to define them in the __init__ method because it avoids mistakes, such as trying to access undefined properties. PyCharm can track these properties and warn you about typos. It also serves as documentation.
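A small sketch makes the difference concrete. The two classes below, Declared and Undeclared, are hypothetical and exist only for this illustration: reading a property that was never assigned raises AttributeError, whereas a property initialized in __init__ always exists.

```python
class Declared(object):
    """Defines its properties up front in __init__."""
    def __init__(self):
        self.epsg = None


class Undeclared(object):
    """Defines nothing; properties can only be added later."""
    pass


declared = Declared()
print(declared.epsg)

undeclared = Undeclared()
try:
    undeclared.epsg
except AttributeError as error:
    # The property does not exist until something assigns it.
    message = str(error)
    print(message)

# Properties can still be added while the code is running.
undeclared.epsg = '4326'
print(undeclared.epsg)
```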
Edit the PointCollection class and add a method that shows some information for us, as follows:

    #...
        def describe(self):
            print("SRS EPSG code: {}".format(self.epsg))
            print("Number of features: {}".format(len(self.data)))
Finally, edit the if __name__ == '__main__' block via the following lines of code:

    if __name__ == '__main__':
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        vector_data = PointCollection("../data/geocaching.gpx")
        vector_data.describe()

Running the code gives the following output:

    File imported: ../data/geocaching.gpx
    SRS EPSG code: 4326
    Number of features: 112

    Process finished with exit code 0
Now that our data is in the form of a PointCollection containing Geocache objects, merging data from multiple files or multiple PointCollection instances should be easy. Perform the following steps:
Edit the if __name__ == '__main__' block of the models.py file with the following code:

    if __name__ == '__main__':
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        vector_data = PointCollection("../data/geocaching.gpx")
        vector_data.describe()
        vector_data.import_data("../data/geocaching.shp")
        vector_data.describe()

Running the code gives the following output:

    File imported: ../data/geocaching.gpx
    SRS EPSG code: 4326
    Number of features: 112
    File imported: ../data/geocaching.shp
    SRS EPSG code: None
    Number of features: 242

    Process finished with exit code 0
Now, extend the PointCollection class so that we can merge the content of two instances. Go to the PointCollection class and add the __add__ method just after the __init__ method via the following code:

    class PointCollection(object):
        def __init__(self, file_path=None):
            """This class represents a group of vector data."""
            self.data = []
            self.epsg = None
            if file_path:
                self.import_data(file_path)

        def __add__(self, other):
            self.data += other.data
            return self
Similar to the __init__ method, the __add__ method is one of Python's magic methods. These methods are not called directly; they are automatically called when something specific happens. The __init__ method is called when the class is instantiated, and the __add__ method is called when the plus (+) operator is used. So, to merge the data of two PointCollection instances, we just need to sum them. Here's what we need to do for this:
Edit the if __name__ == '__main__': block, as follows:

    if __name__ == '__main__':
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        my_data = PointCollection("../data/geocaching.gpx")
        my_other_data = PointCollection("../data/geocaching.shp")
        merged_data = my_data + my_other_data
        merged_data.describe()

Running the code gives the following output:

    File imported: ../data/geocaching.gpx
    File imported: ../data/geocaching.shp
    SRS EPSG code: 4326
    Number of features: 242

    Process finished with exit code 0
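The __add__ method in this chapter extends self.data and returns self, so the left operand also grows when two collections are summed. If that side effect is not wanted, one possible alternative is to build a new object. The sketch below is a simplified stand-in for PointCollection, not the chapter's class: it takes a plain list instead of a file path so that it runs without OGR, and the string items are made up.

```python
class PointCollection(object):
    """Simplified stand-in that holds items in a list."""
    def __init__(self, data=None):
        self.data = list(data or [])

    def __add__(self, other):
        # Build a new collection so that neither operand is changed.
        merged = PointCollection()
        merged.data = self.data + other.data
        return merged


first = PointCollection(['cache 1', 'cache 2'])
second = PointCollection(['cache 3'])
merged = first + second

print(len(merged.data))
print(len(first.data))
```

Which behavior is preferable depends on whether the original collections are still needed after the merge.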