Limiting query results using the regex filtering iterator

This recipe uses Accumulo's built-in RegExFilter class to return only the key-value pairs whose src qualifier holds a particular source value. The filtering is distributed across the TabletServers that host the acled table.

Getting ready

This recipe is easiest to test on a pseudo-distributed Hadoop cluster with Accumulo 1.4.1 and ZooKeeper 3.3.3 installed. The shell script in this recipe assumes that ZooKeeper is running on the host localhost on port 2181; change this to suit your environment. The Accumulo installation's bin folder needs to be on your environment path.

For this recipe, you'll need to create an Accumulo instance named test with the user root and the password password.

To see the filtered results from this recipe, you will need to complete the Using MapReduce to bulk import geographic event data into Accumulo recipe listed earlier in this chapter. This will give you some sample data to experiment with.

How to do it...

Follow these steps to use the Regex filtering iterator:

  1. Open your Java IDE of choice. You will need to configure the Accumulo core and Hadoop classpath dependencies.
  2. Create a build template that produces a JAR file named accumulo-examples.jar.
  3. Create the package examples.accumulo and add the class SourceFilterMain.java with the following content:
    package examples.accumulo;
    
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.user.RegExFilter;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;
    
    import java.util.Map;
    
    public class SourceFilterMain {
    
        public static final String TEST_TABLE = "acled";
    
        public static final Text COLUMN_FAMILY = new Text("cf");
        public static final Text SRC_QUAL = new Text("src");
  4. The main() method handles argument parsing and querying with the filter:
        public static void main(String[] args) throws Exception {
            if(args.length < 5) {
                System.err.println("usage: <src> <instance name> "
                    + "<user> <password> <zookeepers>");
                System.exit(1);
            }
            String src = args[0];
            String instanceName = args[1];
            String user = args[2];
            String pass = args[3];
            String zooQuorum = args[4];
            ZooKeeperInstance ins = new 
                    ZooKeeperInstance(instanceName, zooQuorum);
            Connector connector = ins.getConnector(user, pass);
            Scanner scan = connector.createScanner(TEST_TABLE, 
                           new Authorizations());
            scan.fetchColumn(COLUMN_FAMILY, SRC_QUAL);
            IteratorSetting iter = new IteratorSetting(15, 
                             "regexfilter", RegExFilter.class);
            iter.addOption(RegExFilter.VALUE_REGEX, src);
            scan.addScanIterator(iter);
            int count = 0;
            for(Map.Entry<Key, Value> row : scan) {
              System.out.println("row: " + 
                             row.getKey().getRow().toString());
              count++;
            }
            System.out.println("total rows: " + count);
        }
    }
  5. Save and build the JAR file accumulo-examples.jar.
  6. In the base working folder where accumulo-examples.jar is located, create a new shell script named run_src_filter.sh with the following commands. Be sure to change ACCUMULO_LIB, HADOOP_LIB, and ZOOKEEPER_LIB to match your local paths:
    ACCUMULO_LIB=/opt/cloud/accumulo-1.4.1/lib/*
    HADOOP_LIB=/Applications/hadoop-0.20.2-cdh3u1/*:/Applications/hadoop-0.20.2-cdh3u1/lib/*
    ZOOKEEPER_LIB=/opt/cloud/zookeeper-3.4.2/*
    java -cp "$ACCUMULO_LIB:$HADOOP_LIB:$ZOOKEEPER_LIB:accumulo-examples.jar" examples.accumulo.SourceFilterMain \
     'Panafrican News Agency' \
     test \
     root \
     password \
     localhost:2181
  7. Save and run the script. You should see 49 rows returned for the source Panafrican News Agency.
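
The source argument in the script is handed to the filter as a regular expression, so a source name containing characters such as parentheses or periods would be read as regex syntax rather than literal text. The following is a minimal sketch, using only java.util.regex, of quoting a literal name before passing it on. The class name SourceRegexQuoting, the helper literalRegex, and the sample source name are illustrative assumptions and are not part of the recipe's code.

```java
import java.util.regex.Pattern;

public class SourceRegexQuoting {

    // Wraps a literal source name so that regex metacharacters
    // lose their special meaning.
    static String literalRegex(String source) {
        return Pattern.quote(source);
    }

    public static void main(String[] args) {
        // Hypothetical source name containing regex metacharacters.
        String source = "Reuters (Africa)";

        // Unquoted: the parentheses form a group, so the literal text does not match.
        System.out.println(Pattern.matches(source, source));               // prints false

        // Quoted: the whole name is treated as literal text.
        System.out.println(Pattern.matches(literalRegex(source), source)); // prints true
    }
}
```

In SourceFilterMain, the quoted string could be supplied in place of src when calling iter.addOption, if literal matching is what you want.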

How it works...

The program takes the parameters required to connect to the Accumulo table acled, plus an additional parameter for the source value to filter on. We set up a Scanner instance with empty authorizations, restrict it to the cf:src column, and configure an IteratorSetting of type RegExFilter at priority 15 to do the regex comparison on the TabletServer. Our regex is a simple direct match on the supplied source argument.
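
To make the "direct match" behavior concrete, here is a small local sketch that uses java.util.regex only and no Accumulo classes. It assumes the filter compares the pattern against the entire value, which is what the description above suggests; the class name FullMatchDemo and the helper fullMatch are hypothetical.

```java
import java.util.regex.Pattern;

public class FullMatchDemo {

    // Returns true only when the regex covers the entire value,
    // mirroring the whole-value comparison assumed for the filter.
    static boolean fullMatch(String regex, String value) {
        return Pattern.compile(regex).matcher(value).matches();
    }

    public static void main(String[] args) {
        String value = "Panafrican News Agency";

        System.out.println(fullMatch("Panafrican News Agency", value)); // exact text: true
        System.out.println(fullMatch("Panafrican", value));             // prefix only: false
        System.out.println(fullMatch("Panafrican.*", value));           // prefix plus wildcard: true
    }
}
```

Under that assumption, a partial source name would need an explicit wildcard such as .* to match.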

We then iterate over the result set and print out the row ID for any matching key-value pairs. At the end, we print a tally of how many key-value pairs were found matching that source.

The responsibility for filtering key-value pairs based on the value is distributed across the TabletServers that hold tablets for the acled table. The client sees only the rows that match the filter and can begin processing them immediately.
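
The division of labor can be pictured with a toy simulation. This is only a sketch under stated assumptions: plain Java lists stand in for tablets, the sample values are made up, and none of the names (TabletFilterSketch, filterTablet, scan) come from Accumulo. Each stand-in tablet filters its own values, and the client collects only what survives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TabletFilterSketch {

    // Filters one tablet's values locally, as a server-side iterator would.
    static List<String> filterTablet(List<String> tabletValues, Pattern pattern) {
        List<String> kept = new ArrayList<>();
        for (String value : tabletValues) {
            if (pattern.matcher(value).matches()) {
                kept.add(value);
            }
        }
        return kept;
    }

    // Gathers the matching values from every tablet; this is all the client receives.
    static List<String> scan(List<List<String>> tablets, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> results = new ArrayList<>();
        for (List<String> tablet : tablets) {
            results.addAll(filterTablet(tablet, pattern));
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up sample values split across two stand-in tablets.
        List<List<String>> tablets = Arrays.asList(
            Arrays.asList("Panafrican News Agency", "Reuters"),
            Arrays.asList("BBC", "Panafrican News Agency"));

        List<String> results = scan(tablets, "Panafrican News Agency");
        System.out.println("entries returned to client: " + results.size()); // prints 2
    }
}
```

Of the four stored values, only two cross the simulated network boundary, which is the saving the server-side filter provides.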

See also

  • Using MapReduce to bulk import geographic event data into Accumulo
  • Enforcing cell-level security on scans using Accumulo