Apache Blur is a distributed search engine that works with Apache Hadoop. It differs from traditional big data systems in that it provides relational data model-like storage on top of HDFS. Apache Blur does not use Apache Solr; however, it uses the Apache Lucene APIs. Blur provides fast data ingestion using MapReduce and advanced search features such as faceted search, fuzzy search, pagination, and wildcard search.
Apache Blur provides a row-based data model (similar to an RDBMS), in which each row has a unique row ID. Each record should have a unique record ID, a row ID, and a column family. A column family is a logical group of columns. For example, the personal information column family will have columns such as name, companies with which the person works, and contact information. The following figure shows how Apache Blur works closely with Apache Hadoop:
Apache Blur uses Hadoop to store its indexes in a distributed manner. It uses Thrift APIs for all interprocess communication. Blur Shard Server is responsible for managing shards, their availability, and so on, by using Apache ZooKeeper. Blur Controller provides a single point of access to the Apache Blur cluster.
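To make the row-based data model and the sharded indexes described above more concrete, here is a minimal Python sketch. It is not Blur code and does not use Blur's Thrift API. The class names Row and Record, the shard_for helper, the CRC32-modulo placement, and all sample values are assumptions chosen for illustration; Blur's actual partitioning logic may differ.

```python
from dataclasses import dataclass, field
import zlib


@dataclass
class Record:
    record_id: str   # unique within its row
    family: str      # column family, for example "personalinfo"
    columns: dict    # column name -> value


@dataclass
class Row:
    row_id: str      # unique across the table
    records: list = field(default_factory=list)


def shard_for(row_id: str, shard_count: int) -> int:
    """Pick a shard for a row using a stable hash (illustrative assumption)."""
    return zlib.crc32(row_id.encode("utf-8")) % shard_count


# Hypothetical sample data: one row holding records from two column families.
row = Row("row-1")
row.records.append(
    Record("rec-1", "personalinfo",
           {"personname": "Hrishikesh", "company": "ExampleCorp"})
)
row.records.append(
    Record("rec-2", "education",
           {"degree": "BE", "school": "Example University", "year": "2000"})
)

shard = shard_for(row.row_id, 4)
print(f"{row.row_id} has {len(row.records)} records and maps to shard {shard} of 4")
```

The point of the sketch is that all records of a row share one row ID, so a placement rule based only on the row ID keeps a row's records together on the same shard.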
The current version of Apache Blur (0.2.3) works with Hadoop 1.x and 2.x. However, 2.x is not yet validated for scalability. We will set up Apache Blur with Apache Hadoop 1.2.1, load Hadoop with data, index it using Apache Blur, and search for it:
hrishi@nova:~$ tar -xvzf apache-blur-<version>-hadoop1-bin.tar.gz
Run the start-all.sh command.
Take the sample file (education-info.csv) provided in the blur folder of this book, and load it in Hadoop DFS with the following command. This CSV file contains sample data with pre-seeded row IDs and record IDs. In case you do not have these, you can provide -A (to generate row IDs) and -a (to generate record IDs):
hrishi@nova:~/hadoop $ ./bin/hadoop dfs -copyFromLocal blur/education-info.csv hdfs://<ip-address>:<port>/education/sample
When creating the table, -c indicates the number of shards to be created. You will find the details of all shell commands at https://incubator.apache.org/blur/docs/0.2.3/using-blur.html#shell_table_commands.
hrishi@nova:~/blur $ ./bin/blur csvloader -t educationinfo -c localhost:40010 -I localhost:9000/education/sample -d education degree school year -d personalinfo personname company phone
Blur (default)> query educationinfo personalinfo.personname:Hrishikesh
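The relationship between the -d column definitions given to csvloader and the family.column:value form of the query can be sketched in plain Python. This is an in-memory illustration only, not how Blur indexes or searches. The CSV layout (family, row ID, record ID, then values), the sample values, and the helper names load_records, parse_query, and search are all assumptions. Exact string matching stands in for Lucene's analyzed search.

```python
import csv
import io

# Column names per family, mirroring the -d options passed to csvloader.
FAMILIES = {
    "education": ["degree", "school", "year"],
    "personalinfo": ["personname", "company", "phone"],
}

# ASSUMPTION: each line is family,rowid,recordid,value1,value2,...
# The values below are made up for illustration.
SAMPLE = io.StringIO(
    "personalinfo,row-1,rec-1,Hrishikesh,ExampleCorp,555-0100\n"
    "education,row-1,rec-2,BE,Example University,2000\n"
)


def load_records(lines):
    """Turn CSV lines into dicts whose fields are keyed by 'family.column'."""
    records = []
    for family, row_id, record_id, *values in csv.reader(lines):
        columns = dict(zip(FAMILIES[family], values))
        records.append({
            "rowid": row_id,
            "recordid": record_id,
            "fields": {f"{family}.{name}": value
                       for name, value in columns.items()},
        })
    return records


def parse_query(query):
    """Split 'family.column:value' into (field, value)."""
    field_name, _, value = query.partition(":")
    return field_name, value


def search(records, query):
    """Return the row IDs that have a record with an exact field match."""
    field_name, value = parse_query(query)
    return sorted({r["rowid"] for r in records
                   if r["fields"].get(field_name) == value})


records = load_records(SAMPLE)
print(search(records, "personalinfo.personname:Hrishikesh"))
```

The sketch shows why the family name in the query must match the family name used at load time: the field being searched is the combination of column family and column name.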