Back in chapter 3, you learned how to iterate through all of the features in a layer and use attribute values for each one to determine if it was interesting. You’ve got easier ways to throw out features that you don’t want, however, and that’s where filters come in. With filters you can easily select features that match specific criteria, such as all animal GPS locations from a certain day or all crabapple trees from a city tree inventory. Filters also let you limit features by spatial extent, so you could limit your crabapple trees to a specific neighborhood, or GPS locations to those within a kilometer of an animal feeding station. Filtering your data like this makes it easy to extract or process only the features you’re interested in. I’ve used these techniques to extract features such as city boundaries for a single county from a larger dataset, or to extract highways and freeways from road datasets, while ignoring the smaller residential roads.
You can also use SQL queries to join attribute tables together from different layers. For example, if you had a layer containing all of the locations of your store franchises and each feature had an attribute denoting the city that the store was in, then you could join this layer with one containing cities. If the city layer contained demographic information for each city, then that data would be associated with the store data, and you could easily compare demographics between stores.
SQL is short for Structured Query Language, although you’ll rarely see it written out like that. If you’ve used a relational database, then you’ve probably used SQL, even if you didn’t realize it. For example, if you build a graphical query in Microsoft Access, it still builds a SQL query behind the scenes, and you can see it if you switch to SQL View. SQL is featured more prominently in other database software such as PostgreSQL.
If you need to limit the features by values contained in one or more attribute fields, then you want an attribute filter. To set one of these filters, you need to come up with a conditional statement that’s much like the WHERE clause in a SQL statement. You compare the value of an attribute field to another value, and then all features where that comparison is true are returned. The standard logical operators, such as =, !=, <>, >, <, >=, and <=, allow you to use statements such as the following:
You can probably guess what these comparisons do; they all test for equality or inequality. Notice that if you’re comparing strings, you need to put quotes around the string values, but they can be either single or double. Make sure they’re different from the quotes you use to surround the entire query string, or else you’ll end your string prematurely and get a syntax error. Don’t use quotes with numbers, because that turns them into string values, and you won’t get the comparison you were expecting. Another thing you might have noticed is that you use a single equal sign to test for equality, which isn’t the way programming languages typically work. But that’s the way SQL does things, so who are we to argue? In addition, if you want to test if something doesn’t equal another value, you can use either != or <>.
You can also combine statements using AND or OR:
'(Population > 25000) AND (Population < 50000)' '(Population > 50000) OR (Place_type = "County Seat")'
The first of these selects features with a population value greater than 25,000 but less than 50,000. The second selects features that either have a population greater than 50,000 or are county seats (or both).
Conditions can be negated using NOT, and NULL is used to indicate a null or no data value in the attribute table:
'(Population < 50000) OR NOT (Place_type = "County Seat")' 'County NOT NULL'
That first example selects features that either have a population less than 50,000 or aren’t county seats. Again, a feature will be selected if it meets one or both of those conditions. The second example selects features that have a value for the County attribute.
If you want to check if a value is between two other values, you can use BETWEEN instead of two different comparisons joined with AND. For example, the following two statements are equivalent, and both select features with a population between 25,000 and 50,000:
'Population BETWEEN 25000 AND 50000' '(Population > 25000) AND (Population < 50000)'
You have an easy way to check if a value is equal to one of several different values. Once again, both of these select features where the Type_code value is 4, 3, or 7:
'Type_code IN (4, 3, 7)' '(Type_code = 4) OR (Type_code = 3) OR (Type_code = 7)'
This also works for strings:
'Place_type IN ("Populated Place", "County Seat")'
Last, you can compare strings using the normal logical operators (a is less than c), or you can do fancier, case-insensitive, string matching using LIKE. This allows you to use wildcards to match any character in a string. An underscore matches any single character and a percent sign matches any number of characters. Table 5.1 shows examples, and this is how you’d use them:
'Name LIKE "%Seattle%"'
Pattern |
Matches |
Doesn’t match |
---|---|---|
_eattle | Seattle | Seattle WA |
Seattle% | Seattle, Seattle WA | North Seattle |
%Seattle% | Seattle, Seattle WA, North Seattle | Tacoma |
Sea%le | Seattle | Seattle WA |
Sea_le | Seatle (note misspelling) | Seattle |
If you want to read more about the SQL syntax available in OGR, check out the online documentation at http://www.gdal.org/ogr_sql.html and http://www.gdal.org/ogr_sql_sqlite.html. But for now, let’s see how to put this newfound information to use. It will definitely be more fun if you fire up a Python interactive window for testing this out, because you can use the VectorPlotter class to interactively draw your selections. After configuring an interactive vector plotter, open the global data folder and grab the low-resolution countries layer:
>>> ds = ogr.Open(r'D:osgeopy-dataglobal') >>> lyr = ds.GetLayer('ne_50m_admin_0_countries')
Then plot out the features, but be patient if it takes it a few seconds to draw the output shown in figure 5.1, since it has a fair amount of data to plot. Remember that setting fill=False tells it to draw only country outlines.
>>> vp.plot(lyr, fill=False)
Now inspect the layer attributes by printing out the names of the first few features:
>>> pb.print_attributes(lyr, 4, ['name'], geom=False) FID name 0 Aruba 1 Afghanistan 2 Angola 3 Anguilla 4 of 241 features
Notice that the feature IDs (FIDs) are in order and also the fact that there are 241 features in the layer. Now find out how many of those are in Asia by using an attribute filter. To do this, pass a conditional statement to SetAttributeFilter:
>>> lyr.SetAttributeFilter('continent = "Asia"') 0 >>> lyr.GetFeatureCount() 53
Now the layer thinks it has only 53 features. The zero that got spit out when you called SetAttributeFilter means that the query executed successfully. Now that you have selected the countries in Asia, try drawing them in yellow; your result should look like figure 5.2:
>>> vp.plot(lyr, 'y')
Many graphics in this book are best viewed in color. The eBook versions display the color graphics, so they should be referred to as you read. To get your free eBook in PDF, ePub, and Kindle formats, go to https://www.manning.com/books/geoprocessing-with-python to register your print book.
You can look a little more closely at what’s happening with the filter by printing attributes for the first few features:
>>> pb.print_attributes(lyr, 4, ['name'], geom=False) FID name 1 Afghanistan 7 United Arab Emirates 9 Armenia 17 Azerbaijan 4 of 53 features
Huh. Now you’re missing a bunch of FIDs. That’s because those features aren’t in Asia, so they’re ignored while iterating through the layer. Getting features by specific FID doesn’t honor the filter, however, because features aren’t truly being deleted, and therefore the FID values don’t change. You can prove it to yourself by getting a feature or two using FIDs:
>>> lyr.GetFeature(2).GetField('name') 'Angola'
You can see from this that even though Angola doesn’t show up when you iterate through the filtered layer, it’s still there. It should be obvious to you now that looping through a filtered layer using specific FIDs is a bad idea and you won’t get the desired results. Instead, you need to iterate through the layer using a for loop.
If you set another attribute filter, it doesn’t create a subset of the currently filtered features. Instead, the new filter is applied to the entire layer. To illustrate this, try applying a new filter that selects the countries in South America, and then draw them in blue, which results in the shading you see in figure 5.3.
>>> lyr.SetAttributeFilter('continent = "South America"') >>> vp.plot(lyr, 'b')
You can, however, use both attribute and spatial filters together to refine your results, and you’ll see an example of that in the next section. To clear out the attribute filter and get all 241 features back, simply pass None to SetAttributeFilter:
>>> lyr.SetAttributeFilter(None) >>> lyr.GetFeatureCount() 241
Removing the filter also resets the current feature back to the beginning, as if you had just opened the layer.
Spatial filters let you limit the features by spatial extent rather than attribute value. These filters can be used to select features within another geometry or inside a bounding box. For example, if you had a dataset of global cities with no attribute indicating the country that the cities are in, but you had another dataset with the same spatial reference system that contained the boundary of Germany, you could use a spatial filter to select the German cities.
The geometries or coordinates used for spatial filtering must use the same spatial reference system as the layer you’re trying to filter. Why is this? Pretend for a moment that you have a layer that uses a Universal Transverse Mercator (UTM) spatial reference system. Coordinates in that layer would be large numbers, much different than the latitude and longitude values we’re all familiar with. This means that they wouldn’t align if plotted on top of each other, and they’d appear to have non-overlapping spatial extents. For example, the UTM easting and northing coordinates for the capitol building in Salt Lake City, UT, are approximately 425045 and 4514422, but the corresponding longitude and latitude are -111.888 and 40.777. Those coordinates are awfully different from each other, and they wouldn’t overlay on each other unless one of them was transformed to the same spatial reference system as the other.
Try selecting cities in Germany using the natural earth shapefiles. After setting up a vector plotter in an interactive window, open the folder data source and get the countries layer. Then use an attribute filter to limit the countries to Germany and grab the corresponding feature and geometry:
>>> ds = ogr.Open(r'D:osgeopy-dataglobal') >>> country_lyr = ds.GetLayer('ne_50m_admin_0_countries') >>> vp.plot(country_lyr, fill=False) >>> country_lyr.SetAttributeFilter('name = "Germany"') >>> feat = country_lyr.GetNextFeature() >>> germany = feat.geometry().Clone()
You can assume, in this case, that the attribute filter will return one and only one feature, so using GetNextFeature will get the first and only feature in the filtered results. Then you grab the geometry and clone it so that you can use the geometry even after the feature is removed from memory. Oh, and you also plot the world countries before applying the filter so that you have context for the cities later on. Now open the populated places layer and plot all cities (see figure 5.4) as yellow dots:
>>> city_lyr = ds.GetLayer('ne_50m_populated_places') >>> city_lyr.GetFeatureCount() 1249 >>> vp.plot(city_lyr, 'y.')
The call to GetFeatureCount indicates there are 1,249 city features in the full layer. Now try applying a spatial filter by passing the germany geometry that you got earlier to SetSpatialFilter, and then plot the resulting cities as large dots:
>>> city_lyr.SetSpatialFilter(germany) >>> city_lyr.GetFeatureCount() 5 >>> vp.plot(city_lyr, 'bo')
Now the layer claims to have only five features, so five cities fall within the German boundary polygon. You can also see from your plot that the circles fall in the correct geographical area. You can use the Zoom to rectangle tool on the bottom of the plot window to zoom in on Germany if you’d like (figure 5.5).
Geometry objects have a Clone function, which makes a copy of the object. Why would you want to use this? When you get a geometry from a feature, that geometry is still associated with that feature. If that feature is then deleted (or the variable is populated with a different feature), then the geometry is no longer useable. In fact, if you try to use it, Python will crash instead of spit out an error. This problem is easy to solve, however, by cloning the geometry. Now you can store a copy of the feature or geometry that’s no longer associated with other objects and will live on even if the parent objects disappear. Want to see this in action? Try this in an interactive window:
In this example, the geom variable holds a Geometry object that’s still owned by the Feature object stored in the feat variable, but the geom_clone variable holds a geometry that has been disassociated from that feature. After you populate the feat variable with a different feature, you can still use the geom_clone geometry, but not the object stored in the geom variable, because you no longer have a handle to the feature that it came from.
Incidentally, this is related to why all of these examples would also cause Python to crash: feat = ogr.Open(fn, 0).GetLayer(0).GetNextFeature() # or lyr = ogr.Open(fn, 0).GetLayer(0) feat = lyr.GetNextFeature() # or ds = ogr.Open(fn, 0) lyr = ds.GetLayer(0) del ds feat = lyr.GetNextFeature()
In each case, the data source has gone out of scope or been deleted before you try to use the layer. But the layer is associated with the data source and becomes unusable once the data source is gone, the same way a geometry becomes unusable if its parent feature disappears. You should never close your data source if you still need access to the layer.
As promised, you now get to combine a spatial and an attribute query. Further refine your selection by finding the cities with a population over 1,000,000, and draw them as the squares shown in figure 5.6:
>>> city_lyr.SetAttributeFilter('pop_min > 1000000') >>> city_lyr.GetFeatureCount() 3 >>> vp.plot(city_lyr, 'rs')
Judging from these results, there are three German cities with populations of more than 1,000,000 people. Figure 5.6 shows the output plot zoomed in on Germany so you can see these features. But what if you decide that you want to know how many cities exist in the entire world with a population that large? All you have to do is remove the spatial filter by passing None to SetSpatialFilter. Note that the attribute filter will still be in effect. Go ahead and try it, drawing the results as triangles:
>>> city_lyr.SetSpatialFilter(None) >>> city_lyr.GetFeatureCount() 246 >>> vp.plot(city_lyr, 'm^', markersize=8)
And now you know where the largest cities in the world are (figure 5.7).
You’re not completely out of luck if you’d like to filter features spatially but don’t have a geometry to use. You can also use a rectangular extent by providing the minimum and maximum x and y coordinates:
SetSpatialFilterRect(minx, miny, maxx, maxy)
You can use this to select the countries that fall within the box shown in figure 5.8. Again, start by plotting all of the countries:
>>> vp.clear() >>> country_lyr.SetAttributeFilter(None) >>> vp.plot(country_lyr, fill=False)
Now plug in the bounding coordinates shown in figure 5.8:
>>> country_lyr.SetSpatialFilterRect(110, -50, 160, 10) >>> vp.plot(country_lyr, 'y')
Now you should have a plot that looks similar to figure 5.9, with Australia and a few surrounding countries shaded in.
To clear a spatial filter, whether it was created with a geometry or a bounding box, pass None to SetSpatialFilter. You can’t clear the filter using SetSpatialFilterRect.
If you’re familiar with SQL, or are willing to learn, you can create more-complicated queries and do fun stuff using the ExecuteSQL function on a data source. This function applies to a data source instead of a layer because it allows you to use multiple layers if desired. It requires a SQL query and can optionally use a geometry as a spatial filter. In addition, you can also specify a different SQL dialect, but more on that later. Here’s the signature:
ExecuteSQL(statement, [spatialFilter], [dialect])
This function is different from the filtering functions in that it returns a new layer containing the result set rather than only filtering features out of the existing layer. Let’s look at a few examples using this technique, starting with a simple one that returns global countries sorted by population in descending order:
>>> ds = ogr.Open(r'D:osgeopy-dataglobal') >>> sql = '''SELECT ogr_geom_area as area, name, pop_est ... FROM 'ne_50m_admin_0_countries' ORDER BY POP_EST DESC''' >>> lyr = ds.ExecuteSQL(sql) >>> pb.print_attributes(lyr, 3) FID Geometry area name pop_est 41 MULTIPOLYGON 950.9810937547769 China 1338612970.0 98 MULTIPOLYGON 278.3474038553223 India 1166079220.0 226 MULTIPOLYGON 1115.1781907153158 United States 313973000.0 3 of 241 features
As you can see from these results, the three most populous countries in the world are China, India, and the United States, in that order. The query returns each country’s name and population attributes because you request them in the SQL statement. You also use the special ogr_geom_area field to get the area of each geometry (table 5.2), and the FID and geometry itself are returned automatically. This example uses the default OGR SQL dialect because shapefiles don’t have any built-in SQL support.
Field |
Returns |
---|---|
FID | The feature ID. |
OGR_GEOMETRY | An OGR geometry type constant (see table 3.1). This is especially useful for data formats that support multiple geometry types in one layer. |
OGR_GEOM_WKT | The well-known text (WKT) representation of the feature’s geometry. |
OGR_GEOM_AREA | The area of the feature’s geometry. Returns zero for geometries with no area (for example, points or lines). |
OGR_STYLE | The style string for the feature, if it exists. Very few applications use this. |
If you’re querying a data source that has its own SQL support, that native SQL version will be used. For example, if you have the SQLite driver, you could get the same information from the natural_earth_50m.sqlite database using the SQLite version of SQL. This dialect also allows you to limit the number of returned features, so you could limit the result set to the three countries with the highest populations:
>>> ds = ogr.Open(r'D:osgeopy-dataglobal atural_earth_50m.sqlite') >>> sql = '''SELECT geometry, area(geometry) AS area, name, pop_est ... FROM countries ORDER BY pop_est DESC LIMIT 3''' >>> lyr = ds.ExecuteSQL(sql) >>> pb.print_attributes(lyr) FID Geometry area name pop_est 0 MULTIPOLYGON 950.9810937547769 China 1338612970.0 1 MULTIPOLYGON 278.3474038553223 India 1166079220.0 2 MULTIPOLYGON 1115.1781907153158 United States 313973000.0 3 of 3 features
This time you could print attributes for the entire layer, because only three features are returned. You should also notice that now you use the area function instead of a special field name, and if you don’t rename it with the AS area syntax, then it would be called area(geometry) instead. You also have to specifically request the geometry because the SpatiaLite engine doesn’t return the geometry by default.
You can also use ExecuteSQL to join attributes from multiple layers. Take a look at this code and see if you can figure out what it’s doing:
The first thing to notice is that you use the ne_50m_populated_places and ne_50m_admin_0_countries shapefiles and rename them to pp and c, respectively. You do this by putting the alias directly after the layer name. This isn’t necessary, of course, but does make your SQL a lot shorter because those layer names are pretty long. You also link these two layers together by using a join, which allows you to link tables using a shared attribute. Here you use a LEFT JOIN to keep all records in the table on the left (populated places), and if a matching record exists in the table on the right (countries), then you’ll also get data from that record. But how does it figure out what matches? That’s where the ON clause comes in. For each feature in pp, it takes the adm0_a3 attribute value and tries to find a feature in the countries layer that has the same value for its adm0_a3 field. See figure 5.10 for an illustration.
Now that you know what tables the data are coming from, go back to the beginning of the SQL statement and look at what attribute fields are being requested. You ask for the NAME and POP_MIN fields from the populated places layer, as well as the NAME and POP_EST fields from the countries layer. Because the fields from the two layers have the same names, it makes sense to rename them so that you can tell what’s what. Last, you use a WHERE clause to limit the results to features that represent capital cities (adm0cap = 1).
This technique is handy if you want to see related data from multiple layers at the same time. Without this, you could query city populations and country populations separately, but now you can see the country’s population right beside the city’s. To see this, look at the layer returned by this query:
pb.print_attributes(lyr, 3, geom=False) FID city city_pop country country_pop 7 Vatican City 832 Vatican 832.0 48 San Marino 29000 San Marino 30324.0 51 Vaduz 5342 Liechtenstein 34761.0 3 of 200 features
I didn’t print the geometry column because it wouldn’t fit comfortably on the page, but because this uses the OGR SQL dialect, the geometry is returned automatically. But which one: the city or the country? It’s the city, because that’s the main table being used in the join, and corresponding country information is returned only if it existed for a city. You could plot the layer to prove it to yourself if you’d like.
Now check out a similar example using the SQLite dialect, but still shapefile data sources (you could use a SQLite database, of course, but I want to prove that the SQLite dialect will work with other data source types). See if you can spot the differences:
ds = ogr.Open(r'D:osgeopy-dataglobal') sql = '''SELECT pp.name AS city, pp.pop_min AS city_pop, c.name AS country, c.pop_est AS country_pop FROM ne_50m_populated_places pp LEFT JOIN ne_50m_admin_0_countries c ON pp.adm0_a3 = c.adm0_a3 WHERE pp.adm0cap = 1 AND c.continent = "South America"''' lyr = ds.ExecuteSQL(sql, dialect='SQLite') pb.print_attributes(lyr, 3)
The most obvious difference is the inclusion of the dialect parameter to the ExecuteSQL function. But you also add one thing to the SQL that doesn’t work with the OGR dialect. This time the results are limited to cities in South America by checking the value of the continent field in the countries layer. The OGR dialect doesn’t support using fields from the joined table in the WHERE clause, so the only attributes allowed would be ones from the populated places layer. Also, because you need to specifically request geometries if you want them when using the SQLite dialect, no geometries are returned by this particular query. You could add them in by specifying pp.geometry along with the other fields.
If your version of OGR was built with SpatiaLite support (not only SQLite), you can also manipulate geometries within your SQL. Be warned that this could take a while, depending on what you try to do. As an example, if you have SpatiaLite support, try merging all of the counties in California into one big geometry. Start with drawing the individual counties so you have something to compare your results with:
>>> ds = ogr.Open(r'D:osgeopy-dataUS') >>> sql = 'SELECT * FROM countyp010 WHERE state = "CA"' >>> lyr = ds.ExecuteSQL(sql) >>> vp.plot(lyr, fill=False)
This will draw a map of the counties in California, as shown in figure 5.11A. Now try using the SpatiaLite st_union function to merge all of the county polygons into one, as shown in figure 5.11B:
>>> sql = 'SELECT st_union(geometry) FROM countyp010 WHERE state = "CA"' >>> lyr = ds.ExecuteSQL(sql, dialect='SQLite') >>> vp.plot(lyr, 'w')
Geometry operations also work with data sources that have their own native SQL flavor and the ability to perform geometry manipulations. SpatiaLite and PostGIS are two obvious examples of this. For example, this is how you’d do the same thing with a PostGIS data source:
conn_str = 'PG:host=localhost user=chrisg password=mypass dbname=geodata' ds = ogr.Open(conn_str) sql = "SELECT st_union(geom) FROM us.counties WHERE state = 'CA'" lyr = ds.ExecuteSQL(sql) vp.plot(lyr)
Don’t worry if you want to perform operations like this but aren’t using PostGIS or SpatiaLite, because you’ll learn how to do it without databases in the next chapter.
Remember back in chapter 3 when you copied all of the capital cities in a global shapefile into a new shapefile? You looped through each feature in the shapefile, checked the appropriate attribute, and copied the feature if it was a capital city. This whole process can be made much easier if the features you want can be selected with filters. Do you remember the CopyLayer method that was introduced in section 4.2.4? As a reminder, it copies an existing layer into a new data source. How do you think you could use this to do something similar to the code back in listing 3.3, but much easier? Think about this problem for a minute and then look at the next example:
ds = ogr.Open(r'D:osgeopy-dataglobal', 1) in_lyr = ds.GetLayer('ne_50m_populated_places') in_lyr.SetAttributeFilter("FEATURECLA = 'Admin-0 capital'") out_lyr = ds.CopyLayer(in_lyr, 'capital_cities2')
Here the call to CopyLayer makes a copy of in_lyr in the ds data source. In this case, it happens to be the same data source as the original layer, but it could be any data source. Because you’ve already set an attribute filter on in_lyr, only the filtered features are copied. That’s certainly easier than checking each one.
If you only want certain attributes, you could use a layer created using ExecuteSQL. Write a SQL query that pulls out the attributes you want and copy the results to a new layer:
sql = """SELECT NAME, ADM0NAME FROM ne_50m_populated_places WHERE FEATURECLA = 'Admin-0 capital'""" in_lyr2 = ds.ExecuteSQL(sql) out_lyr2 = ds.CopyLayer(in_lyr2, 'capital_cities3')
It should be obvious by now that you can simplify your life by taking advantage of filters and the ExecuteSQL function whenever possible.
3.12.123.189