Importing geocaching data

In the previous chapter, we generalized our import function by adding the capability to import more types of data supported by OGR.

Now, we will improve it again, make it handle some errors, make it compatible with our objects, and add two new capabilities. We will also convert the data in order to produce uniform objects.

To achieve our goals, we will analyze what kind of information is stored in the files that we want to open. We will use OGR to inspect the files and return some information that may help us with the data conversion.

First, let's alter our open_vector_file function so that it handles incorrect paths and filenames, which is a very common error. Perform the following steps:

  1. Go to the utils folder and open the geo_functions.py file.
  2. Add the following import statements at the beginning of the file:
    # coding=utf-8
    
    import ogr
    import osr
    import gdal
    import os
    from pprint import pprint
  3. Now, edit the open_vector_file function via the following code:
    def open_vector_file(file_path):
        """Opens a vector file compatible with OGR,
         gets the first layer and returns the OGR datasource.
    
        :param str file_path: The full path to the file.
        :return: The ogr datasource.
        """
        datasource = ogr.Open(file_path)
        # Check if the file was opened.
        if not datasource:
            if not os.path.isfile(file_path):
                message = "Wrong path."
            else:
                message = "File format is invalid."
            raise IOError(
                'Error opening the file {}\n{}'.format(
                    file_path, message))
        layer = datasource.GetLayerByIndex(0)
        print("Opening {}".format(file_path))
        print("Number of features: {}".format(
            layer.GetFeatureCount()))
        return datasource

    In this step, we added a check to verify that the file was opened correctly. If the file doesn't exist or if there is any other problem, OGR stays silent and ogr.Open returns None instead of a datasource. So, if the datasource is None, we know that something went wrong and perform a second check to see whether the file path was wrong or something else happened. In either case, the function raises an exception, which prevents the program from continuing with bad data.

  4. Now, we will add another function to print some information about the datasource for us. After the open_vector_file function, add the get_datasource_information function with the following code:
    def get_datasource_information(datasource, print_results=False):
        """Gets information about the first layer in the datasource.
    
        :param datasource: An OGR datasource.
        :param bool print_results: True to print the results on
          the screen.
        """
        info = {}
        layer = datasource.GetLayerByIndex(0)
        bbox = layer.GetExtent()
        info['bbox'] = dict(xmin=bbox[0], xmax=bbox[1],
                            ymin=bbox[2], ymax=bbox[3])
        srs = layer.GetSpatialRef()
        if srs:
            info['epsg'] = srs.GetAttrValue('authority', 1)
        else:
            info['epsg'] = 'not available'        
        info['type'] = ogr.GeometryTypeToName(layer.GetGeomType())
        # Get the attributes names.
        info['attributes'] = []
        layer_definition = layer.GetLayerDefn()
        for index in range(layer_definition.GetFieldCount()):
            info['attributes'].append(
                layer_definition.GetFieldDefn(index).GetName())
        # Print the results.
        if print_results:
            pprint(info)
        return info

    Here, we use a number of OGR's methods and functions to get information from the datasource and its first layer. This information is put in a dictionary, which is returned by the function. If print_results is True, the dictionary is printed with the pprint (pretty print) function, which tries to print Python objects in a more human-friendly way.

  5. Now, to test our code, edit the if __name__ == '__main__': block at the end of the file, as follows:
    if __name__ == "__main__":
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        datasource = open_vector_file("../../data/geocaching.gpx")
        info = get_datasource_information(
            datasource, print_results=True)

    There is a new element here: gdal.PushErrorHandler('CPLQuietErrorHandler'). Geocaching files normally contain features with empty date fields. When OGR finds this situation, it prints a warning message. This could get pretty annoying when we have a lot of features. This command tells OGR/GDAL to suppress these messages so that we can have a clean output with only what we want to see.

  6. To run the code, press Alt + Shift + F10 and select geo_functions. You should get the following output showing the information that is collected:
    Opening ../../data/geocaching.gpx
    Number of features: 130
    {'attributes': ['ele',
                    'time',
                    'magvar',
                    'geoidheight',
                    'name',
                    'cmt',
                    'desc',
                    'src',
                    'url',
                    'urlname',
                    'sym',
                    'type',
                    'fix',
                    'sat',
                    'hdop',
                    'vdop',
                    'pdop',
                    'ageofdgpsdata',
                    'dgpsid'],
     'bbox': {'xmax': -73.44602,
              'xmin': -79.3536,
              'ymax': 44.7475,
              'ymin': 40.70558},
     'epsg': '4326',
     'type': 'Point'}
    
    Process finished with exit code 0

    Note

    The attributes key of the dictionary contains the field names that could be read from the data. Every feature in our GPX file (that is, every point) contains this set of attributes. The bbox key holds the bounding box of the data, that is, the minimum and maximum x and y coordinates of the rectangle that encloses the geographical extent of the data. The epsg key contains the code of the coordinate system of the data. Finally, type is the type of geometry identified by OGR.
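
As a quick illustration of how the bbox values relate to the corners of the rectangle, the sketch below takes the values printed above and derives the four corners. The helper name bbox_corners is hypothetical and is not part of geo_functions.py; it only assumes a dictionary with the xmin, xmax, ymin, and ymax keys built by get_datasource_information.

```python
def bbox_corners(bbox):
    """Return the four corners of a bounding box as (x, y) pairs.

    `bbox` is a dict with xmin, xmax, ymin and ymax keys, like the
    one built by get_datasource_information. Hypothetical helper.
    """
    return {
        'upper_left': (bbox['xmin'], bbox['ymax']),
        'upper_right': (bbox['xmax'], bbox['ymax']),
        'lower_left': (bbox['xmin'], bbox['ymin']),
        'lower_right': (bbox['xmax'], bbox['ymin']),
    }


# Values copied from the output above.
bbox = {'xmin': -79.3536, 'xmax': -73.44602,
        'ymin': 40.70558, 'ymax': 44.7475}
corners = bbox_corners(bbox)
print(corners['upper_left'])
print(corners['lower_right'])
```

Since the data uses EPSG 4326, x is the longitude and y is the latitude, so the upper-left corner is the north-west one.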

Reading GPX attributes

Take a look at the attributes (field names) found by OGR in the previous example; we have a name, a description, and the time. We have some technical data about the GPS solution (pdop, hdop, sat, fix, and many more) and some other fields, but none of them contains in-depth information about the geocache.

In order to take a look at what information the GPX file contains that OGR is not displaying, let's open it in PyCharm:

  1. In your geopy project, go to the data folder.
  2. Locate geocaching.gpx. To open it, either drag and drop it in the editor area or double-click on the filename.

    PyCharm will open it for editing but won't recognize the file format and will display it in a single color; so, let's inform it that this is an XML file.

  3. Right-click on the geocaching.gpx file. In the menu, select Associate with File Type, and a window with a list will pop up. Select XML Files and then click on the OK button.

Now, the contents of the GPX file should appear with colors differentiating the various elements of the Extensible Markup Language (XML). PyCharm is also capable of recognizing the file structure, as it does with Python. Let's take a look via the following steps:

  1. Press Alt + 7 or navigate to the View | Tool Windows | Structure menu.
  2. This is the GPX file structure. Note that after some initial tags, it contains all the waypoints. Click on the arrow to the left of any waypoint to expand it. Then, locate the waypoint's geocache tag and expand it too.
  3. As you can see, the geocaching point contains much more information than OGR is capable of reading, including the status attribute of the geocache tag.
  4. Before we proceed, explore the file to get familiar with its notation. Click on some of the tags and look at the code editor to see the contents.

Since we can't access these attributes directly with OGR, we will program an alternative. The objective is to read this information and flatten it in a single level of key/value pairs in a dictionary. GPX files are XML files, so we can use an XML parser to read them. The choice here is the xmltodict package; it will simply convert the XML file into a Python dictionary, making it easier to manipulate as we are very familiar with dictionaries. Now, perform the following steps:

  1. Add the import of xmltodict at the beginning of the geo_functions.py file, as follows:
    # coding=utf-8
    
    import xmltodict
    import ogr
    import osr
    import gdal
    import os
    from pprint import pprint
  2. Create a new function before open_vector_file and add the following code:
    def read_gpx_file(file_path):
        """Reads a GPX file containing geocaching points.
    
        :param str file_path: The full path to the file.
        """
        with open(file_path) as gpx_file:
            gpx_dict = xmltodict.parse(gpx_file.read())
        print("Waypoint:")
        print(gpx_dict['gpx']['wpt'][0].keys())
        print("Geocache:")
        print(gpx_dict['gpx']['wpt'][0]['geocache'].keys())
  3. Now, edit the if __name__ == '__main__': block to test the code:
    if __name__ == "__main__":
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        read_gpx_file("../../data/geocaching.gpx")
  4. Run the code again with Shift + F10 and look at the results:
    Waypoint:
    [u'@lat', u'@lon', u'time', u'name', u'desc', u'src', u'url', u'urlname', u'sym', u'type', u'geocache']
    Geocache:
    [u'@status', u'@xmlns', u'name', u'owner', u'locale', u'state', u'country', u'type', u'container', u'difficulty', u'terrain', u'summary', u'description', u'hints', u'licence', u'logs', u'geokrety']
    
    Process finished with exit code 0

With the print(gpx_dict['gpx']['wpt'][0].keys()) statement, we obtained the value of gpx and then that of wpt, which is a list. Then, we got the keys of the first element of this list and printed them.

Next, through print(gpx_dict['gpx']['wpt'][0]['geocache'].keys()), we got the value of geocache and also printed its keys.

Look at the output and note that it's the same thing that we did when we were exploring the GPX file structure in PyCharm. The structure is now available as a dictionary, including the tag's properties, which are represented in the dictionary with an @ symbol.
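
To make the navigation concrete, here is a small hand-written dictionary shaped like the one returned above. The values are invented placeholders; only the key layout follows the printed output. It shows that keys starting with @ come from tag attributes, while the other keys come from child tags.

```python
# Hand-written stand-in for the parsed GPX; all values are made up.
gpx_dict = {
    'gpx': {
        'wpt': [
            {
                '@lat': '41.0',
                '@lon': '-74.0',
                'name': 'Example cache',
                'geocache': {
                    '@status': 'Available',
                    'difficulty': '2',
                    'terrain': '3',
                },
            },
        ],
    },
}

first_wpt = gpx_dict['gpx']['wpt'][0]
# Keys that start with '@' came from XML attributes.
xml_attributes = sorted(k for k in first_wpt if k.startswith('@'))
# The remaining keys came from child tags.
child_tags = sorted(k for k in first_wpt if not k.startswith('@'))

print(xml_attributes)
print(child_tags)
print(first_wpt['geocache']['@status'])
```

Keep in mind that xmltodict builds a list for wpt only when the tag occurs more than once. A file with a single waypoint would yield a plain dictionary at that position, so code meant for arbitrary files may need to handle both cases.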

Now that we have a nice and easy way to handle the dictionary of the GPX file, let's extract and flatten the relevant information and make the function return it. Edit the read_gpx_file function, as follows:

def read_gpx_file(file_path):
    """Reads a GPX file containing geocaching points.

    :param str file_path: The full path to the file.
    """
    with open(file_path) as gpx_file:
        gpx_dict = xmltodict.parse(gpx_file.read())
    output = []
    for wpt in gpx_dict['gpx']['wpt']:
        geometry = [wpt.pop('@lat'), wpt.pop('@lon')]
        # If geocache is not on the dict, skip this wpt.
        try:
            geocache = wpt.pop('geocache')
        except KeyError:
            continue
        attributes = {'status': geocache.pop('@status')}
        # Merge the dictionaries.
        attributes.update(wpt)
        attributes.update(geocache)
        # Construct a GeoJSON feature and append to the list.
        feature = {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": geometry},
            "properties": attributes}
        output.append(feature)    
    return output

Note that here, we used the dictionary's pop method; this method returns the value of a given key and removes the key from the dictionary. The objective is to have two dictionaries only with attributes (properties) that can be merged into a single dictionary of attributes; the merging is done with the update method.

Some waypoints don't have the geocache key; when this happens, we catch the exception and skip the point.
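
The effect of pop and update can be seen in isolation with a tiny, made-up waypoint. This is only a sketch with placeholder values; it mirrors the steps inside read_gpx_file for a single item.

```python
# Made-up waypoint with the same layout as one item of the wpt list.
wpt = {'@lat': '41.0', '@lon': '-74.0', 'name': 'WP01',
       'geocache': {'@status': 'Available', 'name': 'Example cache'}}

# pop returns the value and removes the key in one step.
geometry = [wpt.pop('@lat'), wpt.pop('@lon')]
geocache = wpt.pop('geocache')
attributes = {'status': geocache.pop('@status')}

# update copies keys over; on a clash the later call wins.
attributes.update(wpt)
attributes.update(geocache)

print(geometry)
print(sorted(attributes.items()))
```

Both dictionaries contain a name key here, and the value from geocache replaces the one from the waypoint because it is merged last. The same applies to any key shared by the waypoint and its geocache tag in the real file, such as name and type.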

Finally, the information is combined in a dictionary with a GeoJSON-like structure. You can test this as follows:

  1. Edit the if __name__ == '__main__': block using the following code:
    if __name__ == "__main__":
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        points = read_gpx_file("../../data/geocaching.gpx")
        print(points[0]['properties'].keys())
  2. Run the code, and you will see the following output:
    ['status', u'logs', u'locale', u'terrain', u'sym', u'geokrety', u'difficulty', u'licence', u'owner', u'urlname', u'desc', u'@xmlns', u'src', u'container', u'name', u'url', u'country', u'description', u'summary', u'state', u'time', u'hints', u'type']
    
    Process finished with exit code 0

That's great! Now, all the geocache attributes are contained inside the properties of the feature.
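
The features built here are only GeoJSON-like: the coordinates are stored as [lat, lon] strings, whereas the GeoJSON specification expects numbers in [longitude, latitude] order. If the features ever need to be exported, a conversion along the following lines could be applied. The function name to_strict_geojson is hypothetical and is not used elsewhere in this chapter.

```python
import json


def to_strict_geojson(feature):
    """Return a copy of a point feature with numeric [lon, lat]
    coordinates. Hypothetical helper, not part of geo_functions.py.
    """
    lat, lon = feature['geometry']['coordinates']
    return {
        'type': 'Feature',
        'geometry': {'type': 'Point',
                     'coordinates': [float(lon), float(lat)]},
        'properties': dict(feature['properties']),
    }


# Made-up feature with the layout produced by read_gpx_file.
sample = {'type': 'Feature',
          'geometry': {'type': 'Point',
                       'coordinates': ['41.0', '-74.0']},
          'properties': {'status': 'Available'}}
converted = to_strict_geojson(sample)
print(json.dumps(converted, sort_keys=True))
```

The original feature is left untouched, so the rest of the chapter can keep relying on the [lat, lon] order.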

Returning the homogeneous data

We have a read_gpx_file function that returns a list of features as dictionaries and an open_vector_file function that returns an OGR datasource. We also have a get_datasource_information function that returns the information that we need about the file.

Now, it's time to combine these functions in order to be able to read multiple types of data (GPX, Shapefiles, and many more). To do this, we will change the open_vector_file function so that it can make decisions depending on the file format and convert the data in order to always return the same structure. Perform the following steps:

  1. First, make sure that the functions inside geo_functions.py are in the correct order; if not, rearrange them to be in this order:
    def read_gpx_file(file_path):
    
    def get_datasource_information(datasource, print_results=False):
    
    def open_vector_file(file_path):
    
    def create_transform(src_epsg, dst_epsg):
    
    def transform_geometries(datasource, src_epsg, dst_epsg):
    
    def transform_points(points, src_epsg=4326, dst_epsg=3395):
  2. Now, add a new function to transform OGR features into dictionaries as we did with the GPX file. This function can be inserted anywhere before open_vector_file, as follows:
    def read_ogr_features(layer):
        """Convert OGR features from a layer into dictionaries.
    
        :param layer: OGR layer.
        """
        features = []
        layer_defn = layer.GetLayerDefn()
        layer.ResetReading()
        geometry_type = ogr.GeometryTypeToName(layer.GetGeomType())
        for item in layer:
            attributes = {}
            for index in range(layer_defn.GetFieldCount()):
                field_defn = layer_defn.GetFieldDefn(index)
                key = field_defn.GetName()
                value = item.GetFieldAsString(index)
                attributes[key] = value
            feature = {
                "type": "Feature",
                "geometry": {
                    "type": geometry_type,
                    "coordinates": item.GetGeometryRef().ExportToWkt()},
                "properties": attributes}
            features.append(feature)
        return features
  3. Now, edit the open_vector_file function via the following code:
    def open_vector_file(file_path):
        """Opens a vector file compatible with OGR or a GPX file.
        Returns a list of features and information about the file.
    
        :param str file_path: The full path to the file.    
        """    
        datasource = ogr.Open(file_path)
        # Check if the file was opened.
        if not datasource:
            if not os.path.isfile(file_path):
                message = "Wrong path."
            else:
                message = "File format is invalid."
            raise IOError('Error opening the file {}\n{}'.format(
                file_path, message))
        metadata = get_datasource_information(datasource)
        file_name, file_extension = os.path.splitext(file_path)
        # Check if it's a GPX and read it if so.
        if file_extension.lower() == '.gpx':
            features = read_gpx_file(file_path)
        # If not, use OGR to get the features.
        else:
            features = read_ogr_features(
                datasource.GetLayerByIndex(0))
        return features, metadata
  4. Just to make sure that everything is fine, let's test the code by opening two different file types. Edit the if __name__ == '__main__': block, as follows:
    if __name__ == "__main__":
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        points, metadata = open_vector_file(
            "../../data/geocaching.shp")
        print(points[0]['properties'].keys())
        points, metadata = open_vector_file(
            "../../data/geocaching.gpx")
        print(points[0]['properties'].keys())
  5. Run the code and take a look at the following output:
    ['src', 'dgpsid', 'vdop', 'sat', 'name', 'hdop', 'url', 'fix', 'pdop', 'sym', 'ele', 'ageofdgpsd', 'time', 'urlname', 'magvar', 'cmt', 'type', 'geoidheigh', 'desc']
    ['status', u'logs', u'locale', u'terrain', u'sym', u'geokrety', u'difficulty', u'licence', u'owner', u'urlname', u'desc', u'@xmlns', u'src', u'container', u'name', u'url', u'country', u'description', u'summary', u'state', u'time', u'hints', u'type']
    
    Process finished with exit code 0
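
One difference remains between the two branches: read_gpx_file stores the coordinates as a [lat, lon] list, while read_ogr_features stores a WKT string such as POINT (-74.0 41.0). Code that indexes the coordinates therefore behaves differently depending on the source. As a minimal sketch, assuming the layer contains only 2D points, the WKT text could be turned into the same [lat, lon] layout. The function name wkt_point_to_lat_lon is hypothetical.

```python
def wkt_point_to_lat_lon(wkt):
    """Parse a 2D WKT point such as 'POINT (-74.0 41.0)' and
    return [lat, lon] as floats. Hypothetical helper; it does not
    handle other geometry types or empty points.
    """
    name, _, rest = wkt.partition('(')
    if name.strip().upper() != 'POINT':
        raise ValueError('Not a WKT point: {}'.format(wkt))
    x, y = rest.rstrip(') ').split()[:2]
    # WKT lists x (longitude) first, then y (latitude).
    return [float(y), float(x)]


print(wkt_point_to_lat_lon('POINT (-74.0 41.0)'))
```

When OGR is available, the same values can be read directly from the geometry with its GetY() and GetX() methods instead of parsing text.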

Converting the data into Geocache objects

So far, we have defined the Geocache class; it has the latitude and longitude properties and a method to return this pair of coordinates. PointCollection is a collection of geocaches. We also have the open_vector_file function that returns a list of dictionaries representing features.

Now, we will reach a higher level of abstraction by implementing the process of importing data into the PointCollection class by making use of the open_vector_file function. Perform the following steps:

  1. Open your models.py file and edit the imports at the beginning of the file, as follows:
    # coding=utf-8
    
    import gdal
    import os
    from pprint import pprint
    from utils.geo_functions import open_vector_file
  2. Now, let's make PointCollection automatically import a file when it's instantiated. Go to the models.py file, change the __init__ method of your class, and add the import_data and _parse_data methods, as follows:
    class PointCollection(object):
        def __init__(self, file_path=None):
            """This class represents a group of vector data."""
            self.data = []
            self.epsg = None
    
            if file_path:
                self.import_data(file_path)
    
        def import_data(self, file_path):
            """Opens a vector file compatible with OGR and parses
             the data.
    
            :param str file_path: The full path to the file.
            """
            features, metadata = open_vector_file(file_path)
            self._parse_data(features)
            self.epsg = metadata['epsg']
            print("File imported: {}".format(file_path))
    
        def _parse_data(self, features):
            """Transforms the data into Geocache objects.
    
            :param features: A list of features.
            """
            for feature in features:
                geom = feature['geometry']['coordinates']
                attributes = feature['properties']
                cache_point = Geocache(geom[0], geom[1],
                                       attributes=attributes)
                self.data.append(cache_point)
  3. Now, we will just need to adapt the Geocache class to receive and store the attributes. Replace it with the following code:
    class Geocache(object):
        """This class represents a single geocaching point."""
    
        def __init__(self, lat, lon, attributes=None):
            self.lat = lat
            self.lon = lon
            self.attributes = attributes      
    
        @property
        def coordinates(self):
            return self.lat, self.lon

The attributes argument is a keyword argument. Keyword arguments are optional, and the default value is the value defined after the equal sign.
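
The following sketch repeats the Geocache constructor from above and calls it with and without the optional argument. The coordinate and attribute values are invented for illustration.

```python
class Geocache(object):
    """Same constructor as in models.py, repeated to run standalone."""

    def __init__(self, lat, lon, attributes=None):
        self.lat = lat
        self.lon = lon
        self.attributes = attributes


# attributes omitted: the default (None) is used.
plain = Geocache(41.0, -74.0)
# attributes passed by name.
detailed = Geocache(41.0, -74.0, attributes={'status': 'Available'})

print(plain.attributes)
print(detailed.attributes)
```

Using None as the default, rather than an empty dictionary, is a common choice in Python because a mutable default value is created once and shared by every call that omits the argument.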

Since there is currently no standard data format for geocaching, we will store all the attributes that are read from the source file unchanged.

In Python, you are not obliged to define in advance which properties a class instance will have; properties can be added during the code's execution. However, it's good practice to define them in the __init__ method because it avoids mistakes, such as trying to access undefined properties. PyCharm can track these properties and warn you about typos. It also serves as documentation.
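
A short experiment shows both sides of this. The Point class below is a throwaway example that is unrelated to the project classes.

```python
class Point(object):
    """Throwaway class used only for this illustration."""

    def __init__(self):
        self.lat = None  # Declared up front, so it always exists.


point = Point()
print(point.lat)

# Attributes can also be attached later...
point.lon = -74.0
print(point.lon)

# ...but reading one that was never set raises AttributeError.
try:
    point.elevation
except AttributeError as error:
    caught = True
    print("Caught: {}".format(error))
else:
    caught = False
```

Because lat is assigned in __init__, every instance has it from the start, while lon exists only on instances where some later code happened to set it.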

  1. Before we test the code, edit the PointCollection class and add a method that shows some information for us, as follows:
    #...
        def describe(self):
            print("SRS EPSG code: {}".format(self.epsg))
            print("Number of features: {}".format(len(self.data)))
  2. In order to test your code, edit the if __name__ == '__main__' block via the following lines of code:
    if __name__ == '__main__':
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        vector_data = PointCollection("../data/geocaching.gpx")
        vector_data.describe()
  3. Now, run the code. You should see the following output:
    File imported: ../data/geocaching.gpx
    SRS EPSG code: 4326
    Number of features: 112
    
    Process finished with exit code 0

Merging multiple sources of data

Now that our data is in the form of PointCollection containing Geocache objects, merging data from multiple files or multiple PointCollection data should be easy. Perform the following steps:

  1. Let's make another test. First, we will see whether we can import multiple files. Edit the if __name__ == '__main__' block of the models.py file, as follows:
    if __name__ == '__main__':
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        vector_data = PointCollection("../data/geocaching.gpx")
        vector_data.describe()
        vector_data.import_data("../data/geocaching.shp")
        vector_data.describe()
  2. Run the code again. Now, you should see the number of features increase after you import another file, as follows:
    File imported: ../data/geocaching.gpx
    SRS EPSG code: 4326
    Number of features: 112
    File imported: ../data/geocaching.shp
    SRS EPSG code: None
    Number of features: 242
    
    Process finished with exit code 0
  3. Let's implement something very elegant. We will add a magic method to our PointCollection class so that we can merge the content of two instances.
  4. Edit the PointCollection class and add the __add__ method just after the __init__ method via the following code:
    class PointCollection(object):
        def __init__(self, file_path=None):
            """This class represents a group of vector data."""
            self.data = []
            self.epsg = None
    
            if file_path:
                self.import_data(file_path)
        
        def __add__(self, other):
            self.data += other.data
            return self

Similar to the __init__ method, the __add__ method is one of Python's magic methods. These methods are not called directly; they are automatically called when something specific happens. The __init__ method is called when the class is instantiated, and the __add__ method is called when the plus (+) operator is used. So, to merge the data of two PointCollection instances, we just need to sum them. Here's what we need to do for this:

  1. Edit the if __name__ == '__main__': block, as follows:
    if __name__ == '__main__':
        gdal.PushErrorHandler('CPLQuietErrorHandler')
        
        my_data = PointCollection("../data/geocaching.gpx")
        my_other_data = PointCollection("../data/geocaching.shp")
        merged_data = my_data + my_other_data    
        merged_data.describe()
  2. Then, run the code and take a look at the results:
    File imported: ../data/geocaching.gpx
    File imported: ../data/geocaching.shp
    SRS EPSG code: 4326
    Number of features: 242
    
    Process finished with exit code 0
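
Note that the __add__ method above appends to self.data and returns self, so my_data itself grows when the sum is evaluated. If that side effect is unwanted, one possible variation builds a new collection instead. The sketch below uses a stripped-down class named SimpleCollection (hypothetical, without the file import) so that it can run on its own.

```python
class SimpleCollection(object):
    """Stripped-down stand-in for PointCollection (hypothetical)."""

    def __init__(self, data=None, epsg=None):
        self.data = list(data or [])
        self.epsg = epsg

    def __add__(self, other):
        # Build a new object so neither operand is modified.
        return SimpleCollection(self.data + other.data, self.epsg)


first = SimpleCollection(['a', 'b'], epsg='4326')
second = SimpleCollection(['c'])
merged = first + second

print(len(first.data))
print(len(merged.data))
```

In this variation the result keeps the EPSG code of the left operand, which is an arbitrary choice made for the sketch; the chapter's version likewise keeps the code of the left operand, since it returns that operand itself.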