Distributed search using Apache Blur

Apache Blur is a distributed search engine that works with Apache Hadoop. It differs from traditional big data systems in that it provides relational-model-like storage on top of HDFS. Apache Blur does not use Apache Solr; however, it consumes the Apache Lucene APIs. Blur provides fast data ingestion using MapReduce, and advanced search features such as faceted, fuzzy, and wildcard searches, as well as pagination.

Apache Blur provides a row-based data model (similar to an RDBMS), with unique row IDs. Each record has a unique record ID, a row ID, and a column family. A column family is a group of logically related columns. For example, a personal information column family will have columns such as the person's name, the companies the person works with, and contact information. The following figure shows how Apache Blur works closely with Apache Hadoop:
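To make the row/record model concrete, the following sketch builds a small CSV in the layout that Blur's CSV loader expects when row and record IDs are pre-seeded: row ID, record ID, column family, then that family's columns. The file name and all values here are illustrative, not the book's actual sample data:

```shell
# Hypothetical sample data (names and values are made up for illustration):
# rowid, recordid, columnfamily, then that family's columns.
cat > /tmp/education-info-sample.csv <<'EOF'
row1,rec1,education,MSc,Stanford,2004
row1,rec2,personalinfo,Hrishikesh,Nova Inc,555-0100
row2,rec3,education,BE,Pune University,1999
EOF

# Records that share a rowid (row1 here) belong to the same Blur row,
# even though they carry different column families.
cut -d',' -f1,3 /tmp/education-info-sample.csv
```

Note how the two records of `row1` group an `education` record and a `personalinfo` record under one logical row.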

Distributed search using Apache Blur

Apache Blur uses Hadoop to store its indexes in a distributed manner, and it uses Thrift APIs for all interprocess communication. The Blur Shard Server is responsible for managing shards and their availability by using Apache ZooKeeper, while the Blur Controller provides a single point of access to the Apache Blur cluster.

Setting up Apache Blur with Hadoop

The current version of Apache Blur (0.2.3) works with Hadoop 1.x and 2.x; however, 2.x has not yet been validated for scalability. We will set up Apache Blur with Apache Hadoop 1.2.1, load data into Hadoop, index it using Apache Blur, and search it:

  1. Apache Blur can be downloaded directly from the site http://incubator.apache.org/blur/. Download the Hadoop1 binary.
  2. Unzip the binary in your user folder with the following command:
    hrishi@nova:~$ tar -xvzf apache-blur-<version>-hadoop1-bin.tar.gz
    
  3. Now, download Apache Hadoop 1.2.1 from the following site: http://www.apache.org/dyn/closer.cgi/hadoop/common/.
  4. Now, set up Hadoop as a single node or a cluster with the help of the Apache documentation (link: https://hadoop.apache.org/docs/r1.2.1/#Getting+Started) (you will also find the Hadoop 1.x setup in the previous edition of this book).
  5. Once Apache Hadoop is set up, you can start the Hadoop cluster with the start-all.sh command.
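As a quick sanity check, the cluster start and verification can be sketched as follows (the installation path is illustrative; `start-all.sh` and `jps` are standard in a Hadoop 1.x environment):

```
hrishi@nova:~/hadoop $ ./bin/start-all.sh
hrishi@nova:~/hadoop $ jps
# Expect NameNode, DataNode, SecondaryNameNode, JobTracker, and
# TaskTracker entries in the jps output before proceeding.
```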
  6. Start Blur from the command line as shown in the following screenshot:
    Setting up Apache Blur with Hadoop
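This startup step can be sketched roughly as follows; the `start-all.sh` script name is an assumption based on the scripts shipped in the Blur binary's bin folder, so check your distribution:

```
hrishi@nova:~/blur $ ./bin/start-all.sh
# This should bring up the ZooKeeper, controller, and shard server
# processes; verify with jps before creating tables.
```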
  7. Take the CSV file (education-info.csv) provided in the blur folder of this book, and load it into the Hadoop DFS with the following command. This CSV file contains sample data with pre-seeded row IDs and record IDs. In case your data does not have these, you can pass -A (to generate row IDs) and -a (to generate record IDs):
    hrishi@nova:~/hadoop $ ./bin/hadoop dfs -copyFromLocal blur/education-info.csv hdfs://<ip-address>:<port>/education/sample
    
  8. We are going to index this file in Apache Blur, but first, we need to create a table. This can be done in various ways. We will do it through the blur shell:
    Setting up Apache Blur with Hadoop

    In this case, -c indicates the number of shards to be created. You will find the details of all shell commands at https://incubator.apache.org/blur/docs/0.2.3/using-blur.html#shell_table_commands.
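For reference, a minimal shell session might look as follows. The exact option names should be confirmed against the shell command documentation linked above; the shard count of 11 is purely illustrative:

```
hrishi@nova:~/blur $ ./bin/blur shell
blur> create -t educationinfo -c 11
blur> list
```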

  9. Now, create the indexes in Blur by using the CSV loader. The following screenshot shows how you can load the data into Blur:
    hrishi@nova:~/blur $ ./bin/blur csvloader -t educationinfo -c localhost:40010 -i localhost:9000/education/sample -d education degree school year -d personalinfo personname company phone
    
    Setting up Apache Blur with Hadoop
  10. Once your table is populated, you can simply run a query on Blur to check for matches:
    Blur (default)> query educationinfo personalinfo.personname:Hrishikesh
    
    Setting up Apache Blur with Hadoop
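Because Blur consumes the Lucene APIs, the query shell accepts standard Lucene query syntax, so the wildcard and fuzzy searches mentioned earlier take their usual Lucene forms. The queries below are illustrative examples against the table built in the steps above:

```
Blur (default)> query educationinfo personalinfo.personname:Hrish*
Blur (default)> query educationinfo personalinfo.personname:Hrishikesh~
Blur (default)> query educationinfo education.degree:MSc AND education.year:2004
```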