In today's world, organizations produce gigabytes of information every day from the various applications their employees actively use. The data sources range from application software databases, online social media, mobile devices, and system logs to sensors in factory operational subsystems. With such huge, heterogeneous data, it becomes a challenge for IT teams to process it all together and provide data analytics. In addition, the size of this information is growing exponentially. With such volume, variety, and veracity, using standard data-processing applications to deal with large datasets becomes a challenge, and traditional distributed systems cannot handle this Big Data. In this chapter, we intend to look at how Apache Solr, together with other distributed systems, can help address the problem of handling Big Data.
We have already seen some information about NoSQL databases and the CAP theorem in Chapter 2, Getting Started with Apache Solr. NoSQL databases can be classified into multiple types: key-value stores, columnar storage, document-oriented storage, graph databases, and so on. In a key-value store, data is stored as key-value pairs: the key is a unique identifier for each data unit, and the value is the actual data unit (document). This store has further subtypes: hierarchical, tabular, volatile (in-memory), and persistent (disk-based). Unlike standard relational database models, NoSQL databases provide support for heavy data storage such as Big Data. Gartner (http://www.gartner.com/newsroom/id/2304615) recently published an executive program survey report, which reveals that Big Data and analytics are among the top 10 business priorities for CIOs; similarly, analytics and BI top the list of CIOs' technical priorities.
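To make the key-value model concrete, here is a minimal Python sketch of a volatile (in-memory) key-value store; it is an illustration of the concept only, not tied to any particular NoSQL product, and the class and key names are invented for the example:

```python
# Minimal in-memory key-value store illustrating the NoSQL concept:
# each document (value) is addressed by a unique key.
class KeyValueStore:
    def __init__(self):
        self._data = {}          # volatile (in-memory) storage

    def put(self, key, value):
        self._data[key] = value  # the key uniquely identifies the data unit

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:1001", {"name": "Alice", "role": "analyst"})
print(store.get("user:1001")["name"])  # Alice
```

A persistent variant would write the dictionary to disk instead of keeping it in memory; the lookup-by-unique-key access pattern stays the same.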
Big Data presents three major concerns for any organization: the storage of Big Data, data access or querying, and data analytics. Apache Hadoop provides an excellent implementation framework for organizations looking to solve these problems. Similarly, other software provides efficient storage of and access to Big Data, such as Apache Cassandra and the R statistical environment. In this chapter, we intend to explore the possibilities of Apache Solr working with Big Data. We have already seen scaling search with SolrCloud in the previous chapters. In this chapter, we will be focusing on the following topics:
Apache Hadoop is designed to work in a completely distributed manner. The Apache Hadoop ecosystem comprises two major components: the MapReduce framework for distributed processing and the Hadoop Distributed File System (HDFS) for distributed storage.
The MapReduce framework splits the input data into small chunks that are forwarded to mappers, followed by reducers that combine the intermediate results and produce the final outcome. Similarly, HDFS manages how the datasets are stored across the Hadoop cluster. Apache Hadoop can be set up on a single node (pseudo-distributed mode) or in a fully distributed cluster configuration. Apache Solr can be integrated with the Hadoop ecosystem in different ways. Let's look at each of them.
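The split/map/reduce flow described above can be sketched in a few lines of Python. This is a conceptual simulation of word counting run entirely in memory, with no Hadoop cluster involved; the function names and sample input are invented for the illustration:

```python
from collections import defaultdict

def mapper(chunk):
    # Map phase: emit a (word, 1) pair for every word in the input chunk.
    return [(word, 1) for word in chunk.split()]

def reducer(word, counts):
    # Reduce phase: combine all pairs for one word into a single total.
    return word, sum(counts)

# Input data split into small chunks, as HDFS would split a large file.
chunks = ["big data big", "data analytics"]

# Shuffle phase: group the mappers' output by key before reducing.
grouped = defaultdict(list)
for chunk in chunks:
    for word, count in mapper(chunk):
        grouped[word].append(count)

result = dict(reducer(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'analytics': 1}
```

In a real Hadoop job, the mappers and reducers run as separate tasks on different nodes and the shuffle happens over the network; the logical phases, however, are exactly these three.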
Apache Solr can utilize HDFS to index and store its indices on the Hadoop system; it does not use a MapReduce-based framework for indexing. The following diagram shows the interaction pattern between Solr and HDFS. You can read more details about Apache Hadoop at http://hadoop.apache.org/docs/r2.4.0/.
Let's understand how this can be done:
First, set up your Apache Hadoop cluster by editing its configuration files (yarn-site.xml, hdfs-site.xml, masters, slaves, and so on). Once it is set up, restart the Hadoop cluster. Now access http://host:port/cluster; you would see the Hadoop cluster status, as shown in the following screenshot:

Next, create the directories that will hold the Solr index and transaction logs on HDFS:

$ $HADOOP_HOME/bin/hdfs dfs -mkdir /solr
$ $HADOOP_HOME/bin/hdfs dfs -mkdir /solr-logs
This call creates the directories under the root (/) folder of HDFS. You can verify them by running the following command:

$ $HADOOP_HOME/bin/hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - hrishi supergroup          0 2014-05-11 11:29 /solr
drwxr-xr-x   - hrishi supergroup          0 2014-05-11 11:27 /solr-logs
You may also browse the directory structure through the NameNode web interface at http://<host>:50070/.
Now start Solr with the HDFS directory factory, pointing the data directory and the update log at the HDFS paths created earlier:

java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.data.dir=hdfs://<host>:19000/solr -Dsolr.updatelog=hdfs://<host>:19000/solr-logs -jar start.jar
You can validate Solr on HDFS by accessing the Solr admin UI. Consider the following screenshot:
Alternatively, you can set solr.hdfs.home to your HDFS directory while keeping the data and log directory settings as they are on the local machine:

java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.home=hdfs://<host>:19000/solrhdfs -jar start.jar
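Instead of passing system properties on the command line, the same settings can also be made permanent in solrconfig.xml. The following is a sketch, assuming a Solr 4.x-era configuration and the same host and port used above; adjust the HDFS URL to match your NameNode:

```xml
<!-- solrconfig.xml: store the index on HDFS instead of the local disk -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://<host>:19000/solrhdfs</str>
</directoryFactory>

<!-- Use the HDFS-aware lock implementation with this directory factory -->
<indexConfig>
  <lockType>hdfs</lockType>
</indexConfig>
```

With this in place, Solr instances can be started without the -D flags shown earlier, which is convenient when many cores share the same HDFS-backed configuration.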