Importing the data

We are now ready to import our four sets of data into the DISTAL database. We will be using the techniques discussed in Chapter 3, Python Libraries for Geo-Spatial Development and Chapter 5, Working with Geo-Spatial Data in Python to read the data from these data sets, and then insert them into the database using the techniques we discussed in Chapter 6,

Let's work through each of these files in turn.

World Borders Dataset

The World Borders Dataset consists of a Shapefile containing the outline of each country along with a variety of metadata, including the country's name in Latin-1 character encoding. We can import this directly into our countries table using the following Python code for MySQL:

import os.path
import MySQLdb
import osgeo.ogr

connection = MySQLdb.connect(user="...", passwd="...")
cursor = connection.cursor()

cursor.execute("USE distal")
cursor.execute("DELETE FROM countries")
cursor.execute("SET GLOBAL max_allowed_packet=50000000")

srcFile = os.path.join("DISTAL-data", "TM_WORLD_BORDERS-0.3",
                       "TM_WORLD_BORDERS-0.3.shp")
shapefile = osgeo.ogr.Open(srcFile)
layer = shapefile.GetLayer(0)

for i in range(layer.GetFeatureCount()):
    feature = layer.GetFeature(i)
    name = feature.GetField("NAME").decode("Latin-1")
    wkt = feature.GetGeometryRef().ExportToWkt()

    cursor.execute("INSERT INTO countries (name,outline) " +
                   "VALUES (%s, PolygonFromText(%s, 4326))",
                   (name.encode("utf8"), wkt))

connection.commit()

The only unusual thing here is the SET GLOBAL max_allowed_packet instruction. This command (which works with MySQL versions 5.1 and later) allows us to insert larger geometries into the database. If you are using an earlier version of MySQL, you will have to edit the my.cnf file and set this variable manually before running the program.

Notice that we are following the recommended best practice of associating the spatial reference with the polygon. In most cases, we will be dealing with unprojected coordinates on the WGS84 datum (SRID 4326), though stating this explicitly can save us some trouble when we come to dealing with data that uses other spatial references.

Here is what the equivalent code would look like for PostGIS:

import os.path
import psycopg2
import osgeo.ogr

connection = psycopg2.connect("dbname=... user=...")
cursor = connection.cursor()

cursor.execute("DELETE FROM countries")

srcFile = os.path.join("DISTAL-data", "TM_WORLD_BORDERS-0.3",
                       "TM_WORLD_BORDERS-0.3.shp")
shapefile = osgeo.ogr.Open(srcFile)
layer = shapefile.GetLayer(0)

for i in range(layer.GetFeatureCount()):
    feature = layer.GetFeature(i)
    name = feature.GetField("NAME").decode("Latin-1")
    wkt = feature.GetGeometryRef().ExportToWkt()

cursor.execute("INSERT INTO countries (name,outline) " +
                   "VALUES (%s, ST_PolygonFromText(%s, " +
                   "4326))", (name.encode("utf8"), wkt))

connection.commit()

And for SpatiaLite:

import os, os.path
from pysqlite2 import dbapi2 as sqlite
import osgeo.ogr

db = sqlite.connect("distal.db")
db.enable_load_extension(True)
db.execute('SELECT load_extension("...")')
cursor = db.cursor()

cursor.execute("DELETE FROM countries")

srcFile = os.path.join("DISTAL-data", "TM_WORLD_BORDERS-0.3",
                       "TM_WORLD_BORDERS-0.3.shp")
shapefile = osgeo.ogr.Open(srcFile)
layer = shapefile.GetLayer(0)

for i in range(layer.GetFeatureCount()):
    feature = layer.GetFeature(i)
    name = feature.GetField("NAME").decode("Latin-1")
    wkt = feature.GetGeometryRef().ExportToWkt()

    cursor.execute("INSERT INTO countries (name,outline) " +
                   "VALUES (?, ST_PolygonFromText(?, " +
                   "4326))", (name, wkt))

db.commit()

Note

SpatiaLite doesn't know about UTF-8 encoding, so in this case we store the country names directly as Unicode strings.

GSHHS

The GSHHS shoreline database consists of five separate Shapefiles defining the land/water boundary at five different resolutions. For the DISTAL application, we want to import the four levels of GSHHS data (coastline, lake, island-in-lake, pond-in-island-in-lake) at full resolution. We can directly import these Shapefiles into the shorelines table within our DISTAL database.

For MySQL, we use the following code:

import os.path
import MySQLdb
import osgeo.ogr

connection = MySQLdb.connect(user="...", passwd="...")
cursor = connection.cursor()

cursor.execute("USE distal")
cursor.execute("DELETE FROM shorelines")
cursor.execute("SET GLOBAL max_allowed_packet=50000000")

for level in [1, 2, 3, 4]:
    srcFile = os.path.join("DISTAL-data", "GSHHS_shp", "f",
                           "GSHHS_f_L" + str(level) + ".shp")
    shapefile = osgeo.ogr.Open(srcFile)
    layer = shapefile.GetLayer(0)

    for i in range(layer.GetFeatureCount()):
        feature = layer.GetFeature(i)
        wkt = feature.GetGeometryRef().ExportToWkt()

        cursor.execute("INSERT INTO shorelines " +
                       "(level,outline) VALUES " +
                       "(%s, PolygonFromText(%s, 4326))",
                       (level, wkt))

    connection.commit()

Note that this might take a minute or two to complete as we are importing more than 180,000 polygons into the database.

The equivalent code for PostGIS would look like this:

import os.path
import psycopg2
import osgeo.ogr

connection = psycopg2.connect("dbname=... user=...")
cursor = connection.cursor()

cursor.execute("DELETE FROM shorelines")

for level in [1, 2, 3, 4]:
    srcFile = os.path.join("DISTAL-data", "GSHHS_shp", "f",
                           "GSHHS_f_L" + str(level) + ".shp")
    shapefile = osgeo.ogr.Open(srcFile)
    layer = shapefile.GetLayer(0)

    for i in range(layer.GetFeatureCount()):
        feature = layer.GetFeature(i)
        wkt = feature.GetGeometryRef().ExportToWkt()

        cursor.execute("INSERT INTO shorelines " +
                       "(level,outline) VALUES " +
                       "(%s, ST_PolygonFromText(%s, 4326))",
                       (level, wkt))

    connection.commit()

And using SpatiaLite:

import os.path
from pysqlite2 import dbapi2 as sqlite
import osgeo.ogr

db = sqlite.connect("distal.db")
db.enable_load_extension(True)
db.execute('SELECT load_extension("...")')
cursor = db.cursor()

cursor.execute("DELETE FROM shorelines")

for level in [1, 2, 3, 4]:
    srcFile = os.path.join("DISTAL-data", "GSHHS_shp", "f",
                           "GSHHS_f_L" + str(level) + ".shp")
    shapefile = osgeo.ogr.Open(srcFile)
    layer = shapefile.GetLayer(0)

    for i in range(layer.GetFeatureCount()):
        feature = layer.GetFeature(i)
        wkt = feature.GetGeometryRef().ExportToWkt()

        cursor.execute("INSERT INTO shorelines " +
                       "(level,outline) VALUES " +
                       "(?, ST_PolygonFromText(?, 4326))",
                       (level, wkt))

    db.commit()

US placename data

The list of US placenames is stored in the large text file you downloaded named NationalFile_YYYYMMDD.txt (where YYYYMMDD is a timestamp). This is a pipe-delimited file, meaning that each column is separated by a | character such as this:

FEATURE_ID|FEATURE_NAME|FEATURE_CLASS|...|DATE_EDITED
399|Agua Sal Creek|Stream|AZ|…|02/08/1980
400|Agua Sal Wash|Valley|AZ|…|02/08/1980
…

The first line contains the names of the various fields. Notice the third field in each line, labelled FEATURE_CLASS. This tells us what type of feature we are dealing with, in this case a Stream or a Valley. There are a lot of features we don't need for the DISTAL application, for example the names of bays, beaches, bridges, oilfields, and so on. In fact, there is only one feature class we are interested in: Populated Place.

Each feature includes the latitude and longitude of the associated place, in the 10th and 11th columns, respectively. According to the documentation, these coordinates use the NAD83 datum rather than the WGS84 datum used by the other data we are importing. Unprojected lat/long coordinates in the NAD83 datum have an SRID value of 4269.

One way of approaching all this would be to create a temporary database table, import the entire NationalFile_YYYYMMDD.txt file into it, extract the features with our desired feature classes, translate them from NAD83 to WGS84, and finally insert the features into our places table. However, this approach has two disadvantages:

  • It would take a long time to insert all 2+ million features into the database, when we only want a small percentage of these features in our places table
  • MySQL doesn't support on-the-fly transformation of geometries, so we would have to read the geometry from the database, convert it into an OGR Geometry object, transform the geometry using OGR, and then convert it back to WKT format for adding back into the database

To avoid all this, we'll take a slightly different approach:

  1. Extract all the features from the file.
  2. Ignore features with the wrong feature class.
  3. Use pyproj to convert from NAD83 to WGS84.
  4. Insert the resulting features directly into the places table.

With the exception of this final step, this approach is completely independent of the database. This means that the same code can be used regardless of which database you are using:

import os.path
import pyproj

srcProj = pyproj.Proj(proj='longlat', ellps='GRS80',
                      datum='NAD83')
dstProj = pyproj.Proj(proj='longlat', ellps='WGS84',
                      datum='WGS84')

f = file(os.path.join("DISTAL-data",
                      "NationalFile_YYYYMMDD.txt"), "r")
heading = f.readline() # Ignore field names.
for line in f.readlines():
    parts = line.rstrip().split("|")
    featureName = parts[1]
    featureClass = parts[2]
    lat = float(parts[9])
    long = float(parts[10])

    if featureClass == "Populated Place":
        long,lat = pyproj.transform(srcProj, dstProj,
                                    long, lat)
        ...
f.close()

Note

Strictly speaking, the above code is being somewhat pedantic. We are using pyproj to transform coordinates from NAD83 to WGS84. However, the data we are importing is all within the United States, and these two datums happen to be identical for points within the United States. Because of this, pyproj won't actually change the coordinates at all. But, we will do this anyway, following the recommended practice of knowing the spatial reference for our data and transforming when necessary—even if that transformation is a no-op at times.

We can now add the database-specific code to add the feature into our places table. For MySQL, this would involve the following:

import MySQLdb
connection = MySQLdb.connect(user="...", passwd="...")
cursor = connection.cursor()
cursor.execute("USE distal")
...
        cursor.execute("INSERT INTO places " +
                       "(name, position) VALUES (%s, " +
                       "GeomFromWKB(Point(%s, %s), 4326))",
                       (featureName, long, lat))
...
connection.commit()

Note that our INSERT statement creates a new Point object out of the translated latitude and longitude values, and then uses GeomFromWKB() to assign an SRID value to the geometry. The result is stored into the position column within the places table.

The same code using PostGIS would look like this:

import psycopg2
connection = psycopg2.connect("dbname=... user=...")
cursor = connection.cursor()
cursor.execute("SET NAMES 'utf8'")
...
        cursor.execute("INSERT INTO places " +
                       "(name, position) VALUES (%s, " +
                       "ST_MakePoint(%s,%s, 4326)",
                       (featureName, long, lat))
...
connection.commit()

Because the PostGIS function ST_MakePoint()allows us to specify an SRID value directly, we don't need to use GeomFromWKB to add the SRID after the geometry has been created.

Finally, the SpatiaLite version would look like this:

from pysqlite2 import dbapi2 as sqlite
db = sqlite.connect("distal.db")
db.enable_load_extension(True)
db.execute('SELECT load_extension("...")')
cursor = db.cursor()
...
        cursor.execute("INSERT INTO places " +
                       "(name, position) VALUES "
                       "(?, MakePoint(?, ?, 4326))",
                       (featureName, long, lat))
...
db.commit()

Worldwide placename data

The list of non-US placenames is stored in the geonames_dd_dms_date_YYYYMMDD file you downloaded earlier. This is a tab-delimited text file in UTF-8 character encoding, and will look something like this:

RC  UFI     ... FULL_NAME_ND_RG  NOTE       MODIFY_DATE
1  -1307834 ... Pavia                       1993-12-21
1  -1307889 ... Santa Anna       gjgscript  1993-12-21
...

As with the US placename data, there are many more features here than we need for the DISTAL application. Since we are only interested in the official names for towns and cities, we need to filter this data in the following way:

  • The FC (Feature Classification) field tells us what type of feature we are dealing with. We want features with an FC value of P (populated place).
  • The NT (Name Type) field tells us the status of this feature's name. We want names with an NT value of N (approved name).
  • The DSG (Feature Designation Code) field tells us the type of feature, in more detail than the FC field. A full list of all the feature designation codes can be found at http://geonames.nga.mil/ggmagaz/feadesgsearchhtml.asp. We are interested in features with a DSG value of PPL (populated place), PPLA (administrative capital), or PPLC (capital city).

There are also several different versions of each placename; we want the full name in normal reading order, which is in the field named FULL_NAME_RO. Knowing this, we can write some Python code to extract the features we want from the file:

f = file(os.path.join("DISTAL-data",
                      "geonames_dd_dms_date_YYYYMMDD.txt"),
         "r")
heading = f.readline() # Ignore field names.
for line in f.readlines():
    parts = line.rstrip().split("	")
    lat = float(parts[3])
    long = float(parts[4])
    featureClass = parts[9]
    featureDesignation = parts[10]
    nameType = parts[17]
    featureName = parts[22]

    if (featureClass == "P" and nameType == "N" and
        featureDesignation in ["PPL", "PPLA", "PPLC"]):
        ...
f.close()

Now that we have the name, latitude, and longitude for each of the features we want, we can re-use the code from the previous section to insert these features into the database. For example, for MySQL we would do the following:

import MySQLdb
connection = MySQLdb.connect(user="...", passwd="...")
cursor = connection.cursor()
cursor.execute("USE distal")
...
        cursor.execute("INSERT INTO places " +
                       "(name, position) VALUES (%s, " +
                       "GeomFromWKB(Point(%s, %s), 4326))",
                       (featureName, long, lat))
...
connection.commit()

Tip

Because we are dealing with worldwide data here, the lat/long values already use the WGS84 datum, so there is no need to translate the coordinates before adding them to the database.

If you are using PostGIS or SpatiaLite, simply copy the equivalent code from the previous section. Note that, because there are over two million features we want to add to the database, it can take several minutes for this program to complete.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.88.54