Designing a row key to store geographic events in Accumulo

The Armed Conflict Location Event Data (ACLED) dataset is a collection of individual events that occurred across a wide range of geographic areas. This recipe will show how we can leverage Accumulo's sorted key ordering to group ACLED event records into geographic ranges. Furthermore, each geographic range will be subgrouped in the descending order of event occurrence. Specifically, the code in this recipe shows the generation logic that we can turn around and use to build ACLED keys from our records. To verify that the key generator works as expected, we will build and run unit tests with some sample row data.

Getting ready

To run the unit tests, you will need TestNG (testng-jdk15.jar) on the environment classpath. Some basic familiarity with the TestNG testing API will help make sense of the unit tests.

This recipe makes use of a specific type of quadtree data structure that is useful for grouping geospatial data into indexed ranges. It will help to have some familiarity with a Z-order curve (a.k.a. Morton curve) to build this type of quadtree for use over 2D geospatial data.

How to do it...

Follow these steps to implement a geospatial, reverse chronological row key generator:

  1. Open the Java IDE editor of your choice.
  2. Create the package example.accumulo and create the interface RowIDGenerator.java with the following content:
    package examples.accumulo;
    
    import javax.security.auth.login.Configuration;
    import java.io.IOException;
    
    public interface RowIDGenerator {
    public String getRowID(String[] parameters)
                throws IllegalArgumentException;
    }
  3. Under the same package example.accumulo, create a class named ACLEDRowIDGenerator.java with the following content:
    package examples.accumulo;
    
    import java.text.DateFormat;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.regex.Pattern;
    
    public class ACLEDRowIDGenerator implements RowIDGenerator {
    
        private DateFormat dateFormat = new 
                                   SimpleDateFormat("yyyy-MM-dd");
        private static final Pattern decimalPattern = 
                                   Pattern.compile("[.]");
  4. We write the method getRowID() to take a list of String[] parameters.
        @Override
        public String getRowID(String[] parameters)
                throws IllegalArgumentException {
            if(parameters.length != 3)
                throw new IllegalArgumentException("Required: 
                                                 {lat, lon, dtg}")
            StringBuilder builder = new StringBuilder();
            builder.append(getZOrderedCurve(parameters[0], 
                                            parameters[1]));
            builder.append("_");
            builder.append(getReverseTime(parameters[2]));
            return builder.toString();
        }
  5. We add the public method getZOrderedCurve() to build the geospatial portion of our rowID. The public accessibility will help with unit testing:
        public String getZOrderedCurve(String lat, String lon)
                throws IllegalArgumentException {
            StringBuilder builder = new StringBuilder();
            lat = cleanAndValidatePoint(lat);
            lon = cleanAndValidatePoint(lon);
            int ceiling = Math.max(lat.length(), lon.length());
            for (int i = 0; i < ceiling; i++) {
                if(lat.length() <= i) {
                    builder.append("0");
                } else {
                    builder.append(lat.charAt(i));
                }
                if(lon.length() <= i) {
                    builder.append("0");
                } else {
                    builder.append(lon.charAt(i));
                }
            }
            return builder.toString();
        }
  6. The private method cleanAndValidatePoint()will help validate and sanitize lat/lon points into an appropriate form for Z-order shuffling:
        private String cleanAndValidatePoint(String point)
                throws IllegalArgumentException {
    
            String[] pointPieces = decimalPattern.split(point);
            if(pointPieces.length > 2) {
                throw new IllegalArgumentException("Malformed 
                                                point: " + point);
            }
            String integralStr = null;
            int integral = 0;
            try {
                //offset any negative integral portion
                integral = Integer.parseInt(pointPieces[0]) + 90;
                if(integral > 180 | integral < 0) {
                   throw new IllegalArgumentException("Invalid integral: " + integral + " for point: " + point);
                }
                integralStr = "" + integral;
                if(pointPieces.length > 1)
                    integralStr += 
                                 Integer.parseInt(pointPieces[1]);
                if(integral < 10)
                    integralStr = "00" + integralStr;
                else if (integral >= 10 && integral < 100)
                    integralStr = "0" + integralStr;
                return  integralStr;
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Point: " + 
                    point + " contains non-numeric characters");
            }
        }
  7. The public method getReverseTime() helps build the timestamp portion of the row key. The public accessibility will help with unit testing:
        public long getReverseTime(String dateTime)
                throws IllegalArgumentException {
            Date date = null;
            try {
                date = dateFormat.parse(dateTime);
            } catch (ParseException e) {
                throw new IllegalArgumentException(dateTime + 
                                   "Could not be parsed to a " +
                      "valid date with the supplied DateFormat " + dateFormat.toString());
            }
            return Long.MAX_VALUE - date.getTime();
        }
    }
  8. In the package examples.accumulo, create a TestNG unit test class named ValidatingKeyGenTest.java with the following content:
    package examples.accumulo;
    
    import org.apache.hadoop.hbase.thrift.generated.IllegalArgument;
    import org.testng.annotations.BeforeClass;
    import org.testng.annotations.Test;
    import static org.testng.Assert.*;
    
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    
    public class ValidatingKeyGenTest {
    
        private ACLEDRowIDGenerator keyGen;
        private SimpleDateFormat dateFormatter = new 
                                SimpleDateFormat("yyyy-MM-dd");
  9. Use the @BeforeClassannotation to create an instance of ACLEDRowIDGenerator.
        @BeforeClass
        public void setup() {
            keyGen = new ACLEDRowIDGenerator();
        }
  10. Add the validZOrder() unit test method.
        @Test
        public void validZOrder() {
            try {
                // +90 = 123.22,134.55
                String zpoint = keyGen.getZOrderedCurve("33.22", 
                            "44.55");
                assertEquals(zpoint, "1123342525");
    
                // +90 = 123, 134.55
                zpoint = keyGen.getZOrderedCurve("33", "44.55"); 
                assertEquals(zpoint, "1123340505");
    
                // +90 = 123.55, 134
                zpoint = keyGen.getZOrderedCurve("33.55", "44"); 
                assertEquals(zpoint, "1123345050");
    
    
                // +90 = 123.1234, 134.56
                zpoint = 
                keyGen.getZOrderedCurve("33.1234","44.56");
                assertEquals(zpoint, "11233415263040");
    
                // +90 = 00.11, 134.56
                zpoint = keyGen.getZOrderedCurve("-90.11", 
                                                 "44.56"); 
                assertEquals(zpoint, "0103041516");
    
                // +90 = 005.11, 134.56
                zpoint = keyGen.getZOrderedCurve("-85.11", 
                                                 "44.56"); 
                assertEquals(zpoint, "0103541516");
    
                // +90 = 011.11, 134.56
                zpoint = keyGen.getZOrderedCurve("-79.11", 
                                                 "44.56"); 
                assertEquals(zpoint, "0113141516");
    
                // +90 = 095, 134.56
                zpoint = keyGen.getZOrderedCurve("5", "44.56");
                assertEquals(zpoint, "0193540506");
    
            } catch (Exception e) {
                fail("EXCEPTION fail: " + e.getMessage());
            }
    
        }
  11. Add the invalidZOrder() unit test method.
        @Test
        public void invalidZOrder() {
            String zpoint = null;
            try {
                zpoint = keyGen.getZOrderedCurve("98.22", 
                                                 "33.44");
                fail("Should not parse. Too big an integral 
                      value.");
            } catch (IllegalArgumentException e) {
                assertTrue(e.getMessage().contains("invalid integral"));
            }
    
            try {
                zpoint = keyGen.getZOrderedCurve("78.22", 
                                                 "-91.44");
                fail("Should not parse. Too big an integral 
                      value.");
            } catch (IllegalArgumentException e) {
                assertTrue(e.getMessage().contains("invalid integral"));
            }
    
            try {
                zpoint = 
           keyGen.getZOrderedCurve("332.22.33","33.44.33.22");
                fail("Should not parse. Too many split values.");
            } catch (IllegalArgumentException e) {
                assertTrue(e.getMessage().contains("Malformed 
                                                    point"));
            }
    
            try {
                zpoint = keyGen.getZOrderedCurve("33.22a", 
                                                 "33.33");
                fail("Should not parse. Contains bad characters.");
            } catch (IllegalArgumentException e) {
                assertTrue(e.getMessage().contains("contains non-numeric characters"));
            }
    
            try {
                zpoint = keyGen.getZOrderedCurve("33.22",
                                                 "3c.33");
                fail("Should not parse. Contains bad characters.");
            } catch (IllegalArgumentException e) {
                assertTrue(e.getMessage().contains("contains non-numeric characters"));
            }
        }
  12. Add the testValidReverseTime() unit test method.
        @Test
        public void testValidReverseTime() {
            String dateStr = "2012-05-23";
            long reverse = keyGen.getReverseTime(dateStr);
            try {
                Date date = dateFormatter.parse(dateStr);
                assertEquals(reverse, (Long.MAX_VALUE – 
                             date.getTime()));
            } catch (ParseException e) {
                fail(e.getMessage());
            }
        }
  13. Add the testInvalidReverseTime() unit test method.
        @Test
        public void testInvalidReverseTime() {
            try {
                long reverse = keyGen.getReverseTime("201a-22-
                                                      22");
                fail("Should not reverse invalid date for 
                      DateFormat");
            } catch (IllegalArgumentException e) {
                assertTrue(e.getMessage().contains("could not be parsed to a valid date with the supplied DateFormat"));
            }
        }
  14. Add the testFullKey() unit test method.
        @Test
        public void testFullKey() {
            try {
                String dateStr = "2012-03-13";
                Date date = dateFormatter.parse(dateStr);
                long reverse = Long.MAX_VALUE - date.getTime();
    
                // +90 = 123.55, 156.77
                String key = keyGen.getRowID(new String[]{"33.55", "66.77", dateStr});
                assertEquals(key, "1125365757_" + reverse);
            } catch (ParseException e) {
                fail(e.getMessage());
            } catch (IllegalArgumentException e) {
                fail(e.getMessage());
            }
        }
    
    }
  15. Run the unit tests in your environment. Every test should pass.

How it works...

This code will serve as the basis for generating geospatial/reverse chronological row keys. It exists as an independent component outside of any code that loads data to Accumulo. It is designed specifically to build row keys that will sort in a very particular order once persisted to an Accumulo table.

First, we define a general interface RowIDGenerator.java that could be re-used to build different key generator implementations. All implementing classes must fulfill a simple contract for getRowID(). It takes an array of arbitrary strings and returns a single string representing the rowID. Should any errors occur, throw an IllegalArgumentException exception. The class ACLEDRowIDGenerator.java requires an array of at least three strings for input. We then start to build the Z-order structure necessary for the rowID strategy.

The getZOrderedCurve() method takes lat and lon strings as arguments. Building an effect quadtree using the lat/lon point requires the points to adhere to strict formatting guidelines, thus before we shuffle the points, we must validate and format the points using the function cleanAndValidatePoint().

The function cleanAndValidatePoint() first separates the integral portion to the left-hand side of the decimal from the fractional portion to the right-hand side of the decimal. A point is not required to contain a decimal portion, but it must at least contain an integral portion. Additionally, there should not be multiple fraction portions. Therefore, we throw an IllegalArgumentException exception if splitting the point on decimal does not return a one- or two-element array. Moving on, we offset each point by +90 to avoid negative numbers, which would otherwise corrupt the Z-order interpretation. If after applying the offset we contain a point with integral greater than 180 or less than 0, we can conclude that the point either started at a number greater than 90 or a number less than -90. Both these conditions flag an invalid point, and we throw an IllegalArgumentException exception indicating such. If after these checks our point is still considered valid, it is time to start formatting it for proper Z-order interpretation. Depending on the length of the point, we want to zero-pad the beginning such that the integral portion is always of length 3. This will make more sense when we examine how getZOrderCurve() uses the result. If applicable, add the fractional portion back to the reformatted string representation without the decimal place. If at any time we get a NumberFormatException exception, throw an IllegalArgumentException exception.

Once both latitude and longitude points have been properly formatted, we are ready to start shuffling the numbers to build the quadtree. As our loop control variable, we'll take the greater of the two lengths when comparing latitude and longitude, and use that as our loop's max variable. As we cycle through i going from 0 to max, we start with lat and print the ith character, followed by the ith character of lon. Should we reach the length of lat before lon, or vice versa, print 0 for an interleaved spot in the iteration. This helps generate a consistent key for a given lat/lon pair no matter the discrepancy in precision between the latitude and longitude (that is, lat/lon: 1.23/4.56789 can be interpreted as 1.23000/4.56780).

The general idea is to interleave the points such that the most significant digits are arranged in the left-to-right order. Accumulo will maintain the sorted order of our geospatial keys using lexicographical byte-order sorting. This means that the points pertaining to similar geographic regions will be arranged contiguously for effective range scans. We can quickly look up the given points for a particular lat/lon bounding region by building the offset Z-order representation of both the lower and bound bounding parameters and setting the start and end key ranges to the lower and upper bounding parameters respectively. For example, searching for all of the points between lat/lon: 30.1/60.2 and 40.8/70.9 would produce 120.1/150.2 and 130.8/160.9 (offset +90). The Z-order representation would thus have the lower-bound (start-key) value of 11250012 and an upper-bound (end-key) value of 11360089. This is why it is critical to zero-pad the integral portion of the lat/lon points. Without doing so, the application would incorrectly place 1.23 near 10.3 in the table, since the Z-order shuffle for both points would yield a row key that started with 1.

The geospatial portion is only half of our rowID. When storing ACLED event data, we would like to arrange events that lie in similar lat/lon regions in reverse chronological order. The function getReverseTime() achieves this by appending a reverse timestamp for the given item to the already calculated Z-order curve, separated by an underscore token. This allows us to use the same table in Accumulo to further restrict queries by temporal ranges (that is, 100 most recent, last 3 months', and so on). Events with the exact same lat/lon values will sort the records in the Accumulo table in the ascending order, but more recent events when converted to milliseconds from epoch will have larger long values. To counter this, we subtract the long value from the maximum possible long value. If incoming date strings do not match the simple date format yyyy-MM-dd, throw an exception.

The resulting keys take the form of zOrderPoint_reverseTimestamp.

The unit tests are designed to test the error handling of getZOrderCurve() and getReverseTime() as well as validate the expected output. We run this suite of tests to perform a stress test on our rowID generator before using it to load new ACLED event records into our Accumulo table.

There's more...

The rowID generation strategy listed in this recipe is designed to accommodate lat/lon geospatially bound queries with an optional time restriction for events. While this sounds very open-ended, there really is no one-size-fits-all solution for rowIDs when it comes to BigTable-designed columnar datastores. Depending on the types of queries you wish to perform over Accumulo, your rowID strategy might differ entirely. The following are a few sections that further expand on the design choices made in this recipe.

Lexicographic sorting of keys

Accumulo arranges key-value pairs stored in a table in the lexicographical sorted order of the key contents. This means that keys are arranged in terms of their respective byte contents, which does not always conform to an expected natural ordering pattern. For example, consider that we wanted to persist the sequence {1,2,10} as rowIDs. The lexicographic order would sort 10 after 1, but before 2, which is not what we expected for our sequence. This recipe circumvents this limitation by using zero-padding points to create a fixed-length string representation where the byte sorted order matches the expected natural ordering. Zero-padding the mentioned sequence produces 01, 02, and 10; which, when sorted lexicographically, maintains the sequence 01, 02, and 10.

This technique plays a key role in the previous recipe. Without using fixed-length points, the Z-order shuffle of the points 1.23, 9.88 and 10.23, 9.88 would order them closer in an overall sorted order of the dataspace than they technically belong. The Z-order representation would produce 192838 and 19082830 respectively, which gives an inaccurate appearance of the two points being close together. In this recipe, the offset of +90 means that no point can exceed 180, implying a maximum integral length of three digits. By zero-padding every integral out to three characters (001.23 instead of 1.23, 010.23 instead of 10.23, and so on), the left-to-right digit ordering of the rowID more accurately reflects the point separation.

Z-order curve

Z-order curve is a technique to generate quadtrees that represent a flattened, 2-dimensional view of geospatial data. A more in-depth explanation can be found in Wikipedia at http://en.wikipedia.org/wiki/Z-order_curve

Specifically, this recipe uses the technique to produce rowIDs that are flexible for range queries involving lat/lon points where the precision of the upper/lower bounding parameters can vary. The left-to-right placement of significant digits in the rowID means that a shorter Z-order queryID will match on more rows that begin with the supplied queryID pattern than would a longer queryID pattern. Take, for example, the lat/lon bounded query 30.1/40.2 and 50.7/60.8; when interleaved, this produces a start-key of 340012 and an end-key of 560078. However, the same table could be used for a more precise bounded range query such as 30.123/40.234 and 50.789/60.891, which yields start keys 3400122334 and 5600788991. The former, less verbose start- or end-key range will return more rows than the latter, which is what you would expect.

See also

  • Using MapReduce to bulk import geographic event data into Accumulo
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.202.61