Chapter 4. Big Data Search Using Hadoop and Its Ecosystem

Some time ago, Gartner (http://www.gartner.com/newsroom/id/2304615) published an executive program survey report, which revealed that big data and analytics are among the top 10 business priorities for CIOs; analytics and BI also top the list of CIOs' technical priorities. Big data presents three major concerns for any organization: storage of big data, data access or querying, and data analytics. Apache Hadoop provides an excellent implementation framework for organizations looking to solve these problems. Other software addresses parts of the same problem, such as Apache Cassandra for efficient storage and access, and R for statistical analysis. In this chapter, we will explore the possibilities of Apache Solr in working with big data.

We have already discussed scaling search with SolrCloud in the previous chapters. In this chapter, we will focus on the following topics:

  • Understanding NoSQL
  • Working with Solr HDFS Connector
  • Big data Search using Katta
  • Solr 1045 Patch: Map Side Indexing
  • Solr 1301 Patch: Reduce Side Indexing
  • Distributed Search using Apache Blur
  • Apache Solr and Cassandra
  • Scaling Solr through Storm
  • Advanced Analytics with Solr

Understanding NoSQL

Traditional relational databases require users to define a strict data structure, and they use an SQL-based querying mechanism. NoSQL databases, rather than confining users to predefined data structures, provide an open database in which users can store any kind of data and retrieve it by running queries that are not SQL based. In an enterprise, data is generated by all the software used in day-to-day operations. This data comes in different formats, and bringing it in for big data processing requires a storage system that is flexible enough to accommodate varying data models. NoSQL databases are, by design, well suited for such storage.
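The flexible data model described above can be illustrated with a minimal sketch. The `DocumentStore` class, its `put` and `find` methods, and the sample records are all hypothetical names invented for this illustration; they do not correspond to the API of any particular NoSQL product. The sketch only shows the idea of storing records with different fields side by side and retrieving them without SQL.

```python
from typing import Any, Callable, Dict, List


class DocumentStore:
    """Toy in-memory, schema-less store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def put(self, doc_id: str, doc: Dict[str, Any]) -> None:
        # No schema check: each document may carry its own set of fields.
        self._docs[doc_id] = dict(doc)

    def find(self, predicate: Callable[[Dict[str, Any]], bool]) -> List[str]:
        # Query with an arbitrary predicate instead of an SQL statement.
        return sorted(
            doc_id for doc_id, doc in self._docs.items() if predicate(doc)
        )


# Records from three imaginary enterprise systems, each with its own shape.
store = DocumentStore()
store.put("crm-1", {"type": "customer", "name": "Asha", "city": "Pune"})
store.put("log-7", {"type": "weblog", "url": "/search", "status": 500})
store.put("mail-3", {"type": "email", "subject": "Invoice", "attachments": 2})

# Documents that lack a field are simply skipped by the predicate.
server_errors = store.find(lambda d: d.get("status", 0) >= 500)
named_records = store.find(lambda d: "name" in d)
```

A relational table would need a column for every field across all three sources, or a separate table per source; here each record keeps only the fields it actually has.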

Note

The CAP theorem, also known as Brewer's theorem, concerns consistency in distributed systems. It states that a distributed system cannot guarantee all three of the following at the same time:

  • Consistency: Every client sees the most recently updated data state.
  • Availability: Every request receives a response, even if some nodes have failed.
  • Partition tolerance: The system continues to function even when network failures cut off communication among nodes.

Because all three cannot be guaranteed together, most databases focus on achieving any two. You can read more about the CAP theorem at http://en.wikipedia.org/wiki/CAP_theorem.

One of the primary objectives of NoSQL is horizontal scaling. A horizontally scaled system is spread across many nodes, so it must tolerate network partitions (the P in the CAP theorem), and it does so at the cost of either Consistency or Availability. As we have seen, data models for NoSQL differ completely from those of relational databases. With a flexible data model, developers can quickly integrate a NoSQL database and bring in large volumes of data from different data sources. This makes NoSQL databases well suited to big data storage, since big data requires different types of data to be brought together under one umbrella.
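The trade-off between Consistency and Availability during a network partition can be sketched with a toy model. Everything here is an assumption made for illustration: the `Replica` and `TinyCluster` classes and the `"CP"`/`"AP"` mode names are hypothetical, and real databases use far more elaborate replication protocols. The sketch only shows what is given up in each case.

```python
class Replica:
    """One copy of a single value (hypothetical toy model)."""

    def __init__(self) -> None:
        self.value = None
        self.version = 0


class TinyCluster:
    """Two replicas that favor either consistency ("CP") or availability ("AP")."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when the replicas cannot reach each other

    def write(self, replica: Replica, value: str) -> bool:
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: replicas stay identical, availability is lost.
                return False
            # Accept the write locally: the request succeeds, replicas diverge.
            replica.value = value
            replica.version += 1
            return True
        # No partition: apply the write to both replicas.
        version = max(replica.version, other.version) + 1
        for r in (replica, other):
            r.value = value
            r.version = version
        return True


cp = TinyCluster("CP")
cp.write(cp.a, "x")
cp.partitioned = True
cp_accepted = cp.write(cp.a, "y")   # rejected while partitioned

ap = TinyCluster("AP")
ap.write(ap.a, "x")
ap.partitioned = True
ap_accepted = ap.write(ap.a, "y")   # accepted, but only replica a sees it
```

In the CP cluster both replicas still hold the same value after the partition, but the client's write was refused. In the AP cluster the write succeeded, but the two replicas now disagree until they are reconciled.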

In addition to a flexible schema, NoSQL offers scalability and high performance, which are again among the most important factors to consider when working with big data.
