Scaling Solr through Storm

Apache Storm is a distributed real-time computation framework that can process huge volumes of data as it arrives. Storm was recently adopted by Apache as an incubating project, and its development now continues as Apache Storm. You can read more about Apache Storm's features at http://storm.incubator.apache.org/.

Apache Storm can be used to process massive streams of data in a distributed manner, which makes it well suited to time-sensitive analytics. With Apache Solr and Storm together, organizations can process big data in real time. Consider, for example, an industrial plant whose systems continuously emit raw data: that stream can be processed as it arrives to facilitate real-time analytics, such as identifying the most problematic systems or looking for recent errors and failures. Apache Solr and Storm can work together to perform this kind of processing on big data in real time.

Apache Storm runs in a cluster mode where multiple nodes participate in performing computation in real time. It supports two types of nodes: a master node, which runs a daemon called Nimbus, and worker nodes, each of which runs a daemon called the Supervisor. As the names suggest, Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures, whereas each supervisor listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it. Apache Storm uses ZooKeeper to perform all the coordination between Nimbus and the supervisors. Data in Apache Storm is read as a stream, which is simply a sequence of tuples of name-value pairs:

{id: 1748, author_name: "hrishi", full_name: "Hrishikesh Karambelkar"}

Apache Storm uses the concepts of spouts and bolts, and all work is executed as an Apache Storm topology. The following screenshot shows a Storm topology for a word count example:

(Screenshot: a word count Storm topology)

Spouts are data inputs; this is where data arrives in the Storm cluster. Bolts process the streams piped into them; they can be fed data from spouts or from other bolts. Bolts can thus form a chain of processing, with each bolt performing a unit task. This concept is similar to MapReduce, which we will discuss in the following chapters.
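The flow of tuples from a spout through a chain of bolts can be sketched in plain Python. This is a hypothetical, single-process simulation (real spouts and bolts run in parallel across the cluster), using made-up sensor readings in the spirit of the plant example above:

```python
# Hypothetical sketch of tuples flowing from a spout through chained bolts.
def reading_spout():
    """Spout: emits raw sensor-reading tuples into the topology."""
    for reading in [12, 87, 45, 91, 3]:
        yield {"value": reading}

def threshold_bolt(stream):
    """First bolt: a unit task that keeps only readings above a threshold."""
    for tup in stream:
        if tup["value"] > 50:
            yield {"value": tup["value"], "alert": True}

def collect_bolt(stream):
    """Second bolt: consumes the previous bolt's output stream."""
    return [tup["value"] for tup in stream]

# Bolts chained one after another, each fed by the component before it:
alerts = collect_bolt(threshold_bolt(reading_spout()))
print(alerts)  # prints [87, 91]
```

In a real topology, each stage would run as many parallel tasks, with Storm routing tuples between them.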

Getting along with Apache Storm

Let's install Apache Storm and try out a simple word count example:

  1. You will require ZooKeeper first, since both Nimbus and the supervisor depend on it. You can download it from http://zookeeper.apache.org/ and unzip it somewhere. Copy zoo.cfg from the book's codebase, or rename zoo_sample.cfg to zoo.cfg in ZooKeeper's conf directory.
  2. Start the ZooKeeper:
    $ bin/zkServer.sh start
    
  3. Make sure ZooKeeper is running. Now, download Apache Storm from http://storm.incubator.apache.org/downloads.html.
  4. Unzip it, and go to the $STORM_HOME/conf folder. Edit storm.yaml and set the correct Nimbus host. You can use the configuration file provided along with the book. If you are running in a cluster environment, nimbus.host needs to point to the correct master. In this configuration, you may also list multiple ZooKeeper servers for failover.
  5. Now, set JAVA_HOME and STORM_HOME:
    $ export STORM_HOME=/home/hrishi/storm
    $ export JAVA_HOME=/usr/share/jdk
    
  6. Start the master in a separate terminal by running:
    $ $STORM_HOME/bin/storm nimbus
    
  7. Start workers on machines by calling:
    $ $STORM_HOME/bin/storm supervisor
    
  8. Start the web interface by running:
    $ $STORM_HOME/bin/storm ui
    
  9. Now, access the web user interface by pointing your browser to http://localhost:8080. A screen similar to the following screenshot should now be visible:
    (Screenshot: the Storm web user interface)
  10. Now that the Storm cluster is working fine, let's try a simple word count example from https://github.com/nathanmarz/storm-starter. You can download the source and compile it yourself, or take a pre-compiled jar from the book's source code repository.
  11. You also need to install Python on the instances where Apache Storm is running in order to run this example; you can download and install it from https://www.python.org/. Once Python is installed and added to the PATH environment variable, run the following command to start the word count topology:
    $ bin/storm jar storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology WordCount -c nimbus.host=<host>
    
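The storm.yaml file edited in step 4 is plain YAML. A minimal single-node sketch might look like the following (the hostnames and directory here are placeholders, not values from the book's configuration file):

```yaml
# Minimal storm.yaml sketch -- hostnames and paths are placeholders.
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"     # listing several servers provides failover
nimbus.host: "master.example.com"
storm.local.dir: "/var/storm"
```

For a single-machine trial, pointing both entries at localhost is enough.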

In the word count example, you will find the different classes mapped to the different topology roles: a spout that emits sentences, a bolt that splits each sentence into words, and a bolt that keeps a running count per word.
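A rough, hypothetical Python rendering of those roles follows. The real storm-starter classes (RandomSentenceSpout, SplitSentence, and WordCount) are Java, with SplitSentence implemented as a multilang bolt backed by a Python script, which is why step 11 requires Python:

```python
from collections import Counter

# Hypothetical single-process sketch of the word count roles; the real
# storm-starter classes run distributed across the Storm cluster.

class RandomSentenceSpout:
    """Role: spout -- the source that emits sentence tuples."""
    def emit(self):
        yield {"sentence": "the quick brown fox"}
        yield {"sentence": "the lazy dog"}

class SplitSentenceBolt:
    """Role: bolt -- splits each sentence tuple into word tuples."""
    def execute(self, tup):
        for word in tup["sentence"].split():
            yield {"word": word}

class WordCountBolt:
    """Role: bolt -- keeps a running count per word."""
    def __init__(self):
        self.counts = Counter()
    def execute(self, tup):
        self.counts[tup["word"]] += 1

spout, splitter, counter = RandomSentenceSpout(), SplitSentenceBolt(), WordCountBolt()
for sentence in spout.emit():
    for word in splitter.execute(sentence):
        counter.execute(word)
print(counter.counts["the"])  # prints 2
```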
