Chapter 10. Scaling Solr Capabilities with Big Data

In today's world, organizations produce gigabytes of information every day from the many applications their employees actively use. The data sources range from application databases, online social media, mobile devices, and system logs to sensors in factory operational subsystems. With such huge, heterogeneous data, it becomes a challenge for IT teams to process it all together and provide analytics on top of it. In addition, the volume of this information grows exponentially. With data of such volume and variety, standard data-processing applications struggle with large datasets, and traditional distributed systems cannot handle this Big Data. In this chapter, we look at how Apache Solr, together with other distributed systems, can be used to handle Big Data.

We have already seen some information about NOSQL databases and the CAP theorem in Chapter 2, Getting Started with Apache Solr. NOSQL databases can be classified into multiple types: key-value stores, columnar stores, document-oriented stores, graph databases, and so on. In a key-value store, the data is stored as keys and values; the key is a unique identifier for each data unit, and the value is the actual data unit (document). There are further subtypes of this store: hierarchical, tabular, volatile (in-memory), and persistent (on-disk). Unlike standard relational database models, NOSQL databases are designed to support very large data volumes such as Big Data. Recently, Gartner (http://www.gartner.com/newsroom/id/2304615) published an executive program survey report, which reveals that Big Data and analytics are among the top 10 business priorities for CIOs; similarly, analytics and BI top the list of CIOs' technology priorities.

Big Data presents three major concerns for any organization: storing Big Data, accessing or querying it, and analyzing it. Apache Hadoop provides an excellent implementation framework for organizations looking to solve these problems. Similarly, other software provides efficient storage of and access to Big Data, such as Apache Cassandra and the R statistical environment. In this chapter, we intend to explore the possibilities of Apache Solr working with Big Data. We have already seen scaling search with SolrCloud in the previous chapters. In this chapter, we will focus on the following topics:

  • Apache Solr and HDFS
  • Using Katta for Big Data search
  • Solr 1045 patch: map-side indexing
  • Solr 1301 patch: reduce-side indexing
  • Apache Solr and Cassandra
  • Advanced Analytics with Solr

Apache Hadoop is designed to work in a completely distributed manner. The Apache Hadoop ecosystem comprises two major components, which are as follows:

  • The MapReduce framework
  • Hadoop Distributed File System (HDFS)

The MapReduce framework splits the input data into small chunks that are processed in parallel by mappers; their intermediate output is then passed to reducers, which aggregate it and produce the final outcome. HDFS, in turn, manages how the datasets are stored across the Hadoop cluster. Apache Hadoop can be set up in a single-node (pseudo-distributed) configuration or as a multi-node cluster. Apache Solr can be integrated with the Hadoop ecosystem in different ways; let's look at each of them.
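
Before looking at the Solr integration, a quick way to see the MapReduce flow in action is to run the word-count job that ships with the Hadoop distribution. The following commands are a minimal sketch; they assume a running Hadoop 2.x tarball installation with $HADOOP_HOME set (the setup steps are covered in the next section), and the /wordcount input and output paths are purely illustrative:

    # Create an input directory in HDFS and load some sample text (here, the Hadoop config files)
    $ $HADOOP_HOME/bin/hdfs dfs -mkdir -p /wordcount/input
    $ $HADOOP_HOME/bin/hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /wordcount/input

    # Run the bundled word-count example: mappers emit (word, 1) pairs, reducers sum the counts
    $ $HADOOP_HOME/bin/hadoop jar \
        $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /wordcount/input /wordcount/output

    # Inspect the reducer output, which is written back to HDFS
    $ $HADOOP_HOME/bin/hdfs dfs -cat /wordcount/output/part-r-00000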

Apache Solr and HDFS

Apache Solr can utilize HDFS to store its indices on the Hadoop filesystem; note that this integration does not use the MapReduce framework for indexing. The following diagram shows the interaction pattern between Solr and HDFS. You can read more details about Apache Hadoop at http://hadoop.apache.org/docs/r2.4.0/.

Apache Solr and HDFS

Let's understand how this can be done:

  1. To start with, the first and most important task is to set up Apache Hadoop on your machine (single-node configuration) or to set up a Hadoop cluster. You can download the latest Hadoop tarball or ZIP from http://hadoop.apache.org. The newer generation of Hadoop uses the next-generation MapReduce framework, also known as YARN.
  2. Based on the requirement, you can set up a single node (http://hadoop.apache.org/docs/r<version>/hadoop-project-dist/hadoop-common/SingleCluster.html) or a cluster setup (http://hadoop.apache.org/docs/r<version>/hadoop-project-dist/hadoop-common/ClusterSetup.html).
  3. Typically, you will need to set up the Hadoop environment variables and modify the various configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, slaves, and so on). Once this is done, start (or restart) the Hadoop cluster.
  4. Once Hadoop is set up, verify the installation by accessing http://host:port/cluster (the YARN ResourceManager web UI); you should see the Hadoop cluster status, as shown in the following screenshot:
    Apache Solr and HDFS
  5. Now, using the following HDFS commands, create directories in HDFS to keep your Solr index and Solr transaction logs:
    $ $HADOOP_HOME/bin/hdfs dfs -mkdir /solr
    $ $HADOOP_HOME/bin/hdfs dfs -mkdir /solr-logs
    

    These commands create the directories under the root (/) folder of HDFS. You can verify them by running the following command:

    $ $HADOOP_HOME/bin/hdfs dfs -ls /
    Found 2 items
    drwxr-xr-x   - hrishi supergroup          0 2014-05-11 11:29 /solr
    drwxr-xr-x   - hrishi supergroup          0 2014-05-11 11:27 /solr-logs
    

    You may also browse the directory structure by accessing http://<host>:50070/ (the NameNode web interface).

  6. Once the directories are created, the next step is to point Apache Solr at HDFS. This can be done by passing JVM arguments that select HdfsDirectoryFactory and set the HDFS paths. If you are running Solr on Jetty, you can run the following command:
    java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.data.dir=hdfs://<host>:19000/solr -Dsolr.updatelog=hdfs://<host>:19000/solr-logs -jar start.jar
    

    You can validate Solr on HDFS by accessing the Solr admin UI. Consider the following screenshot:

    Apache Solr and HDFS
  7. If you are using Apache SolrCloud, you should instead point solr.hdfs.home to your HDFS directory and leave the data and update log directory settings at their defaults; Solr creates the appropriate directories under solr.hdfs.home automatically:
    java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.home=hdfs://<host>:19000/solrhdfs -jar start.jar
    
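Once Solr starts with HdfsDirectoryFactory, a simple way to confirm that the index really lives on HDFS is to post a few sample documents and then list the Solr directories in HDFS. The following sketch assumes a Solr 4.x example setup; the core name (collection1), the sample documents under example/exampledocs, and the port numbers are illustrative and may differ in your environment:

    # Index the sample documents that ship with the Solr example
    $ cd <solr-home>/example/exampledocs
    $ java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

    # The index segments and transaction logs should now appear under the HDFS directories
    $ $HADOOP_HOME/bin/hdfs dfs -ls -R /solr
    $ $HADOOP_HOME/bin/hdfs dfs -ls -R /solr-logs

If the integration is working, you should see Lucene segment files under /solr and transaction log files under /solr-logs rather than under Solr's local data directory.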