HBase filters

As the name suggests, a filter extracts only the required data and discards the rest. HBase provides a good number of filters that we can use in get and scan operations to fetch only the needed data, avoiding the cost of scanning data that is not required.

HBase filters are a powerful feature that can greatly improve the efficiency of working with data stored in tables. The two read operations in HBase, Get and Scan, support direct access to data and the use of a start and end key, respectively. We can further limit the data retrieved by adding limiting selectors to the query: column families, column qualifiers, timestamps, timestamp ranges, and version numbers.

We can represent HBase filter usage as shown in the following diagram: the filters we specify in a get or scan are shipped to the different RegionServers through RPC calls and compared against the data locally, so only matching results travel back to the client:

HBase filters

Types of filters

Now, let's look at the different types of filters and their uses. Before that, we will look at the operators on which filters depend for comparison:

Operator type

Description

BitComparator.BitwiseOp

This performs the bitwise comparison. The following are the enum constants:

  • AND (and)
  • OR (or)
  • XOR (xor)

You can read more on this operator at http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BitComparator.BitwiseOp.html.
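As a rough illustration, the byte-wise comparison idea behind BitComparator can be sketched in plain Java. This is not HBase's actual implementation; the class, enum, and method names below are made up for illustration, and the sketch assumes the value matches when at least one byte of the bitwise result is non-zero:

```java
public class BitCompareSketch {
  enum BitwiseOp { AND, OR, XOR }

  // Sketch: apply the bitwise op byte by byte against a mask and
  // treat the value as a match if any resulting byte is non-zero.
  static boolean matches(byte[] value, byte[] mask, BitwiseOp op) {
    if (value.length != mask.length) {
      return false; // differing lengths never match
    }
    for (int i = 0; i < value.length; i++) {
      int b;
      switch (op) {
        case AND: b = value[i] & mask[i]; break;
        case OR:  b = value[i] | mask[i]; break;
        default:  b = value[i] ^ mask[i]; break;
      }
      if (b != 0) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    byte[] flags = {0b0101};
    byte[] mask  = {0b0100};
    System.out.println(matches(flags, mask, BitwiseOp.AND)); // true: bit 2 is set
    System.out.println(matches(flags, new byte[]{0b0010}, BitwiseOp.AND)); // false
  }
}
```

This kind of comparator is handy when a column stores bit flags and we only want rows where certain flags are set.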

CompareFilter.CompareOp

This is a generic comparison operator used by comparison filters. It works on byte[] values and supports operators such as equal, greater, and not equal. The following are the enum constants:

  • EQUAL
  • GREATER
  • GREATER_OR_EQUAL
  • LESS
  • LESS_OR_EQUAL
  • NO_OP
  • NOT_EQUAL

You can read more on this operator at http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html.
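To make the semantics concrete, the way a comparison filter turns a comparator result into a pass/fail decision can be sketched in plain Java. This is only an illustrative sketch (the class and helper names are invented, and HBase's internal logic differs in detail); the byte comparison mimics the unsigned lexicographic ordering of Bytes.compareTo:

```java
public class CompareOpSketch {
  enum CompareOp { LESS, LESS_OR_EQUAL, EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, GREATER, NO_OP }

  // Unsigned lexicographic comparison, the ordering Bytes.compareTo uses.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xFF, y = b[i] & 0xFF;
      if (x != y) return x - y;
    }
    return a.length - b.length;
  }

  // Sketch of mapping the comparator result onto the chosen operator.
  static boolean passes(CompareOp op, byte[] value, byte[] reference) {
    int cmp = compareBytes(value, reference);
    switch (op) {
      case LESS:             return cmp < 0;
      case LESS_OR_EQUAL:    return cmp <= 0;
      case EQUAL:            return cmp == 0;
      case NOT_EQUAL:        return cmp != 0;
      case GREATER_OR_EQUAL: return cmp >= 0;
      case GREATER:          return cmp > 0;
      default:               return false; // NO_OP: never passes
    }
  }

  public static void main(String[] args) {
    // "shashwat" sorts after "shash", so GREATER_OR_EQUAL passes.
    System.out.println(passes(CompareOp.GREATER_OR_EQUAL,
        "shashwat".getBytes(), "shash".getBytes())); // true
  }
}
```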

Filter.ReturnCode

These are the return codes a filter can emit for a cell. The following are the enum constants:

  • INCLUDE: This includes the cell
  • INCLUDE_AND_NEXT_COL: This includes the cell and seeks to the next column
  • NEXT_COL: This skips the current column and moves to the next one
  • NEXT_ROW: This skips the rest of the current row and moves to the next row
  • SEEK_NEXT_USING_HINT: This seeks to the next key, which the filter provides as a hint
  • SKIP: This skips the current cell

You can read more on this operator at http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.ReturnCode.html.
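How these return codes drive a scan can be sketched with a toy loop in plain Java. This models only a subset of the codes and uses invented names; it is not HBase code, just an illustration of how a RegionServer reacts to a filter's decisions per cell:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ReturnCodeSketch {
  enum ReturnCode { INCLUDE, NEXT_COL, NEXT_ROW, SKIP }

  // A toy cell: row key plus column qualifier.
  record Cell(String row, String column) {}

  // Toy scan loop: ask the filter for a return code per cell and
  // include, skip, or abandon the rest of the row accordingly.
  static List<Cell> scan(List<Cell> cells, Function<Cell, ReturnCode> filter) {
    List<Cell> out = new ArrayList<>();
    String skipRow = null;
    for (Cell c : cells) {
      if (c.row().equals(skipRow)) continue; // NEXT_ROW took effect
      switch (filter.apply(c)) {
        case INCLUDE:  out.add(c); break;
        case NEXT_ROW: skipRow = c.row(); break;
        default:       break; // SKIP / NEXT_COL: drop this cell
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Cell> cells = List.of(
        new Cell("r1", "a"), new Cell("r1", "b"),
        new Cell("r2", "a"), new Cell("r2", "b"));
    // Include column "a"; abandon a row once any other column is seen.
    List<Cell> kept = scan(cells, c -> c.column().equals("a")
        ? ReturnCode.INCLUDE : ReturnCode.NEXT_ROW);
    System.out.println(kept.size()); // 2: column "a" of each row
  }
}
```

The point of NEXT_ROW and SEEK_NEXT_USING_HINT is that the server can skip whole blocks of data rather than testing every cell, which is what makes server-side filtering cheaper than client-side filtering.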

FilterList.Operator

These are the operators for combining more than one filter in a filter list. The following are the enum constants:

  • MUST_PASS_ALL (AND)
  • MUST_PASS_ONE (OR)

You can read more on this operator at http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.Operator.html.
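The two operators behave like logical AND and OR over the member filters, which can be sketched in plain Java using predicates as stand-ins for filters (the class and method names here are illustrative, not HBase's):

```java
import java.util.List;
import java.util.function.Predicate;

public class FilterListSketch {
  // MUST_PASS_ALL behaves like AND over the member filters.
  static <T> boolean mustPassAll(List<Predicate<T>> filters, T value) {
    return filters.stream().allMatch(f -> f.test(value));
  }

  // MUST_PASS_ONE behaves like OR over the member filters.
  static <T> boolean mustPassOne(List<Predicate<T>> filters, T value) {
    return filters.stream().anyMatch(f -> f.test(value));
  }

  public static void main(String[] args) {
    List<Predicate<String>> filters = List.of(
        s -> s.startsWith("row"),   // stand-in for a PrefixFilter
        s -> s.endsWith("42"));     // stand-in for a value check
    System.out.println(mustPassAll(filters, "row42")); // true
    System.out.println(mustPassOne(filters, "row1"));  // true
    System.out.println(mustPassAll(filters, "row1"));  // false
  }
}
```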

We have seen the operators that are used in combination with filters, and we will see them in use in the example code. Now, let's go through the list of available filters:

Filter types

Description

BinaryComparator

This comparator performs a lexicographic binary comparison against the given byte array, using Bytes.compareTo(byte[], byte[]).

Have a look at the following example:

SingleColumnValueFilter colValFilterBinary = new SingleColumnValueFilter(Bytes.toBytes("detail"), Bytes.toBytes("name"), CompareFilter.CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes("shash")));

BinaryPrefixComparator

This is a binary comparator that compares lexicographically against the given byte array, but only up to the length of the supplied prefix.

BitComparator

This filter comparator performs the given bitwise operation on each of the bytes with the given byte array.

ByteArrayComparable

This is the base class for byte array comparators.

ColumnCountGetFilter

This filter returns only the first N columns of each row; it is intended for use with Get operations.

ColumnPaginationFilter

This is based on the ColumnCountGetFilter; it takes two arguments, limit and offset, and returns a page of columns from each row, which is useful for pagination.
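The limit/offset behavior can be sketched in plain Java over a row's sorted column qualifiers (an illustrative sketch with invented names, not HBase code):

```java
import java.util.List;

public class ColumnPaginationSketch {
  // Sketch of limit/offset paging: from a row's sorted columns,
  // skip `offset` of them and keep at most `limit` of the remainder.
  static List<String> page(List<String> sortedColumns, int limit, int offset) {
    int from = Math.min(offset, sortedColumns.size());
    int to = Math.min(from + limit, sortedColumns.size());
    return sortedColumns.subList(from, to);
  }

  public static void main(String[] args) {
    List<String> cols = List.of("c1", "c2", "c3", "c4", "c5");
    System.out.println(page(cols, 2, 0)); // [c1, c2]
    System.out.println(page(cols, 2, 2)); // [c3, c4]
    System.out.println(page(cols, 2, 4)); // [c5]
  }
}
```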

ColumnPrefixFilter

This filter is used to get keys with columns that match a specified prefix.

ColumnRangeFilter

This filter selects columns whose qualifiers fall between a minimum and a maximum column, each of which can be marked inclusive or exclusive.

CompareFilter

This is a generic filter used to filter by comparison.

DependentColumnFilter

This filter adds intercolumn timestamp matching: only cells whose timestamps match an entry in a designated reference column are retained.

FamilyFilter

This filter is based on column families.

Filter

This is the interface for row and column filters, which can be directly applied within RegionServer.

FilterList

Using this, we can implement a logical comparison. It is an ordered list of other filters combined with an operator that specifies how many of them must pass for a cell to be included. The following are the comparison operators:

  • FilterList.Operator.MUST_PASS_ALL (AND)
  • FilterList.Operator.MUST_PASS_ONE (OR)

FirstKeyOnlyFilter

This filter returns only the first KeyValue from each row.

FirstKeyValueMatchingQualifiersFilter

This filter scans a row until it finds a KeyValue matching one of the specified column qualifiers, and then skips to the next row.

FuzzyRowFilter

This filter matches rows whose keys fit a fuzzy pattern, in which some byte positions are fixed and others can take any value.

InclusiveStopFilter

This filter stops a scan at the given row, including that row in the results (a scan's normal stop row is exclusive).

KeyOnlyFilter

This filter will only return the key component of each KeyValue.

MultipleColumnPrefixFilter

This is used to select keys with columns that match one of several given prefixes.

NullComparator

This comparator tests whether a given value is null (empty).

PageFilter

This filter limits the number of rows returned, for paging through results. Note that the filter is applied independently on each RegionServer, so the client can receive more rows than the requested page size and may need to trim the results itself.

ParseConstants

This holds a set of constants related to parsing filter strings used by ParseFilter.

ParseFilter

This allows users to specify a filter via a string.

PrefixFilter

This passes results whose row keys start with the specified prefix.

QualifierFilter

This is a filter based on column qualifiers.

RandomRowFilter

This filter includes each row with a given probability.

RegexStringComparator

This comparator matches values against a regular expression.

RowFilter

This is used to filter based on the row key.

SingleColumnValueExcludeFilter

This checks a single column value, but does not return the tested column.

SingleColumnValueFilter

This is used to filter rows based on the value of a single column.

SkipFilter

This is a wrapper filter that skips an entire row if any one of its cells fails the wrapped filter's check.

SubstringComparator

This comparator checks whether a given substring occurs in a value; the comparison is case-insensitive.

TimestampsFilter

This filter returns only cells whose timestamp is in a specified list of timestamps.

ValueFilter

This filter is based on column values.

WhileMatchFilter

This is a wrapper filter that returns results as long as the wrapped filter keeps matching; as soon as it fails once, the entire scan stops.

So, we have seen the list of filters that can be used with the read operations, Get and Scan, to filter out unnecessary data and fetch only what is required. The following is a sample code that uses a filter in a scan:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;
import static org.apache.hadoop.hbase.util.Bytes.toBytes;

public class FilterExample {
  public static void main(String[] arguments) throws IOException {
    Configuration config = HBaseConfiguration.create();
    HTable hbaseTableObj = new HTable(config, "logTable");
    Scan scanObj = new Scan();
    // Keep only cells whose value contains the substring "shash"
    scanObj.setFilter(new ValueFilter(CompareOp.EQUAL, new SubstringComparator("shash")));
    ResultScanner resultScannerObj = hbaseTableObj.getScanner(scanObj);
    for (Result result : resultScannerObj) {
      byte[] value = result.getValue(toBytes("ColFamily"), toBytes("columnName"));
      System.out.println(Bytes.toString(value));
    }
    resultScannerObj.close();
    hbaseTableObj.close();
  }
}

The following example shows how we can use not just a single filter but a combination of many; this is done using a filter list:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FilterList.Operator;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;
import static org.apache.hadoop.hbase.util.Bytes.toBytes;

public class ExampleOfFilterList {
  public static void main(String[] arguments) throws IOException {
    Configuration config = HBaseConfiguration.create();
    HTable hbaseTableObj = new HTable(config, "logTable");
    Scan scanObj = new Scan();
    // Both filters must pass: return only keys (no values),
    // and only the first KeyValue of each row
    FilterList filterListObj = new FilterList(Operator.MUST_PASS_ALL);
    filterListObj.addFilter(new KeyOnlyFilter());
    filterListObj.addFilter(new FirstKeyOnlyFilter());
    scanObj.setFilter(filterListObj);
    ResultScanner resultScannerObj = hbaseTableObj.getScanner(scanObj);
    for (Result result : resultScannerObj) {
      byte[] value = result.getValue(toBytes("colFamName"), toBytes("colName"));
      System.out.println("Value found: " + Bytes.toString(value));
    }
    resultScannerObj.close();
    hbaseTableObj.close();
  }
}