Distributed search using Apache Blur

Apache Blur is a distributed search engine that works with Apache Hadoop. It differs from traditional big data systems in that it provides relational-model-like storage on top of HDFS. Apache Blur does not use Apache Solr; however, it consumes the Apache Lucene APIs. Blur provides fast data ingestion using MapReduce, and advanced search features such as faceted, fuzzy, and wildcard searches, as well as pagination.

Apache Blur provides a row-based data model (similar to an RDBMS), with unique row IDs. Each record has a unique record ID, a row ID, and a column family. A column family is a group of logically related columns. For example, a personal information column family will have columns such as the person's name, the companies the person works with, and contact information. The following figure shows how Apache Blur works closely with Apache Hadoop:
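To make the row/record model concrete, the following sketch builds a small CSV in the layout that Blur's CSV loader expects when row and record IDs are pre-seeded: row ID, record ID, column family, then that family's columns. The file name and all values here are illustrative, not the book's actual sample data:

```shell
# Hypothetical sample data (names and values are made up for illustration):
# rowid, recordid, columnfamily, then that family's columns.
cat > /tmp/education-info-sample.csv <<'EOF'
row1,rec1,education,MSc,Stanford,2004
row1,rec2,personalinfo,Hrishikesh,Nova Inc,555-0100
row2,rec3,education,BE,Pune University,1999
EOF

# Records that share a rowid (row1 here) belong to the same Blur row,
# even though they carry different column families.
cut -d',' -f1,3 /tmp/education-info-sample.csv
```

Note how the two records of `row1` group an `education` record and a `personalinfo` record under one logical row.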

Distributed search using Apache Blur

Apache Blur uses Hadoop to store its indexes in a distributed manner, and it uses Thrift APIs for all interprocess communication. The Blur Shard Server is responsible for managing shards and their availability by using Apache ZooKeeper, while the Blur Controller provides a single point of access to the Apache Blur cluster.

Setting up Apache Blur with Hadoop

The current version of Apache Blur (0.2.3) works with Hadoop 1.x and 2.x; however, 2.x has not yet been validated for scalability. We will set up Apache Blur with Apache Hadoop 1.2.1, load data into Hadoop, index it using Apache Blur, and search it:

  1. Apache Blur can be downloaded directly from the site http://incubator.apache.org/blur/. Download the Hadoop1 binary.
  2. Unzip the binary in your user folder with the following command:
    hrishi@nova:~$ tar -xvzf apache-blur-<version>-hadoop1-bin.tar.gz
    
  3. Now, download Apache Hadoop 1.2.1 from the following site: http://www.apache.org/dyn/closer.cgi/hadoop/common/.
  4. Now, set up Hadoop as a single node or a cluster with the help of the Apache documentation (link: https://hadoop.apache.org/docs/r1.2.1/#Getting+Started) (you will also find the Hadoop 1.x setup in the previous edition of this book).
  5. Once Apache Hadoop is set up, you can start the Hadoop cluster with the start-all.sh command.
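As a quick sanity check, the cluster start and verification can be sketched as follows (the installation path is illustrative; `start-all.sh` and `jps` are standard in a Hadoop 1.x environment):

```
hrishi@nova:~/hadoop $ ./bin/start-all.sh
hrishi@nova:~/hadoop $ jps
# Expect NameNode, DataNode, SecondaryNameNode, JobTracker, and
# TaskTracker entries in the jps output before proceeding.
```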
  6. Start Blur from the command line as shown in the following screenshot:
    Setting up Apache Blur with Hadoop
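This startup step can be sketched roughly as follows; the `start-all.sh` script name is an assumption based on the scripts shipped in the Blur binary's bin folder, so check your distribution:

```
hrishi@nova:~/blur $ ./bin/start-all.sh
# This should bring up the ZooKeeper, controller, and shard server
# processes; verify with jps before creating tables.
```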
  7. Take the CSV file (education-info.csv) provided in the blur folder of this book, and load it into the Hadoop DFS with the following command. This CSV file contains sample data with pre-seeded row IDs and record IDs. In case your data does not have these, you can pass -A (to generate row IDs) and -a (to generate record IDs):
    hrishi@nova:~/hadoop $ ./bin/hadoop dfs -copyFromLocal blur/education-info.csv hdfs://<ip-address>:<port>/education/sample
    
  8. We are going to index this file in Apache Blur, but first, we need to create a table. This can be done in various ways. We will do it through the blur shell:
    Setting up Apache Blur with Hadoop

    In this case, -c indicates the number of shards to be created. You will find the details of all shell commands at https://incubator.apache.org/blur/docs/0.2.3/using-blur.html#shell_table_commands.
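For reference, a minimal shell session might look as follows. The exact option names should be confirmed against the shell command documentation linked above; the shard count of 11 is purely illustrative:

```
hrishi@nova:~/blur $ ./bin/blur shell
blur> create -t educationinfo -c 11
blur> list
```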

  9. Now, create the indexes in Blur by using the CSV loader. The following screenshot shows how you can load the data into Blur:
    hrishi@nova:~/blur $ ./bin/blur csvloader -t educationinfo -c localhost:40010 -i localhost:9000/education/sample -d education degree school year -d personalinfo personname company phone
    
    Setting up Apache Blur with Hadoop
  10. Once your table is populated, you can simply run a query on Blur to check for matches:
    Blur (default)> query educationinfo personalinfo.personname:Hrishikesh
    
    Setting up Apache Blur with Hadoop
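Because Blur consumes the Lucene APIs, the query shell accepts standard Lucene query syntax, so the wildcard and fuzzy searches mentioned earlier take their usual Lucene forms. The queries below are illustrative examples against the table built in the steps above:

```
Blur (default)> query educationinfo personalinfo.personname:Hrish*
Blur (default)> query educationinfo personalinfo.personname:Hrishikesh~
Blur (default)> query educationinfo education.degree:MSc AND education.year:2004
```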