Advanced analytics with Solr

Apache Solr provides excellent search capabilities over metadata. With the help of integrations, it is also possible to go beyond search and faceting. As the search industry moves into its next generation, the expectation that search should go beyond basic keyword matching has led to software such as Apache Solr, which provides an excellent browsing and filtering experience along with basic analytical capabilities. However, for many organizations, this is not sufficient; they would like to bring business intelligence and analytics capabilities on top of their search engine. Today, it is possible to complement Apache Solr with such advanced analytical capabilities. In this section, we will look at integrating Solr with R.

R is an open source language and environment for statistical computing and graphics. More information about R can be found at http://www.r-project.org/. The development of R started in the early 1990s as an alternative to SAS, SPSS, and other proprietary statistical environments. R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. There are an estimated 2 million R users worldwide, and it is widely taught in universities. Many corporate analysts know and use R. R and its open source packages provide capabilities such as:

  • Linear and non-linear modeling
  • Classical statistical tests
  • Time-series analysis
  • Spatial statistics
  • Classification, clustering, and other capabilities
  • Matrix arithmetic, with scalar, vector, matrix, list, and data frame (table) structures
  • Extensive library functions (more than 2000) for different graphs/charts

Integrating R with Solr provides organizations with access to these extensive library functions, so that they can perform data analysis on Solr outputs.
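As a rough illustration of the data frame, arithmetic, and modeling capabilities listed above, here is a minimal sketch in base R. The country labels and numbers are made up for illustration and do not come from any Solr index.

```r
# Hypothetical per-country figures; made-up numbers for illustration only
df <- data.frame(country    = c("A", "B", "C", "D"),
                 candidates = c(120, 80, 200, 40),
                 openings   = c(30, 22, 51, 9))

# Vectorized arithmetic: candidates per opening, computed for every row at once
ratio <- df$candidates / df$openings

# Linear modeling: regress openings on candidates
model <- lm(openings ~ candidates, data = df)
print(coef(model))
```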

Integrating Solr and R

Since R is an analytical engine, it can work on top of Apache Solr and analyze Solr results directly. R can be installed through installers (.exe, .rpm, or binary packages) downloaded from the CRAN mirrors (http://cran.r-project.org/mirrors.html) for *nix, Windows, or Mac OS. R can connect to Apache Solr over HTTP through the curl library, which is available in R as the RCurl package. R also provides a package called solr, which exposes Solr capabilities for searching over user data, extracted content, and so on. To enable R with Solr, open the R console and install the solr package, for example by running install.packages('solr').

Now, load the package so that you can fire a test search on your Solr server:

> library(solr)
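A search against Solr is ultimately an HTTP GET on the select handler, which is the kind of request an RCurl-based connection issues. The following is a minimal sketch, in base R only, of how such a request URL can be put together. The helper build_solr_url is hypothetical (it is not part of the solr package), and the local URL is an assumption about where your Solr instance runs; q, rows, and wt are standard Solr request parameters.

```r
# Build a Solr select URL by hand (base R only, no network access needed).
# build_solr_url is a hypothetical helper, not part of any package.
build_solr_url <- function(base, params) {
  # Percent-encode each value so characters such as ':' and '*' are URL-safe
  encoded <- vapply(params,
                    function(v) URLencode(as.character(v), reserved = TRUE),
                    character(1))
  paste0(base, "?", paste(names(params), encoded, sep = "=", collapse = "&"))
}

base_url  <- "http://localhost:8983/solr/select"  # assumed local Solr instance
query_url <- build_solr_url(base_url, list(q = "*:*", rows = 2, wt = "json"))
print(query_url)
```

With a Solr server running at that address, the resulting URL could then be fetched, for example with RCurl's getURL(), to confirm that the server responds.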

To test analytics, let us take a simple use case. Assume that a multinational job recruitment firm uses Apache Solr to index candidate resumes, with facets such as technical capabilities and country. Using Apache Solr, the firm would like to decide which countries to focus on for a certain technology (say, Solr), so it wants to classify countries by the resource pool currently available for Apache Solr. R provides various clustering algorithms, which group data into clusters based on shared characteristics. One of the most widely used is K-means clustering (more information is available at http://en.wikipedia.org/wiki/K-means_clustering). The kmeans() function ships with R itself; to plot the resulting clusters with clusplot(), install the cluster package by calling:

> install.packages('cluster')

After installing the cluster package, get the per-country counts using the solr package and process them for K-means. Run the following R script on the console to get the cluster information:

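> library(cluster)   # provides clusplot()
> library(solr)      # provides solr_group()
> url <- 'http://localhost:8983/solr/select'
> # Group matching resumes by country; numFound holds the count per group
> response1 <- solr_group(q='*:Solr', group.field='Country', rows=10, group.limit=1, base=url)
> # One-column matrix of candidate counts, one row per country
> m2 <- matrix(response1$numFound, byrow=TRUE)
> rownames(m2) <- response1$groupValue
> colnames(m2) <- 'Available Workforce'
> set.seed(42)       # kmeans() picks random starting centers; fix the seed for repeatable clusters
> fit <- kmeans(m2, 2)
> clusplot(m2, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, xlab="Workforce", ylab="Cluster", main="K-Means Cluster")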

Once you run the clusplot() function, you should be able to get a graphical representation of the cluster as shown in the following screenshot:

[Screenshot: cluster plot titled "K-Means Cluster" produced by clusplot()]

The cluster plot in this screenshot demonstrates how Apache Solr search analytics can be used for further advanced analytics using the R statistical language.
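To try the clustering step without a running Solr server, the following sketch applies kmeans() to a hard-coded one-column matrix. The country names and counts are made up for illustration and stand in for the numFound values that Solr would return; nstart = 10 is an added choice to make the result less sensitive to the random starting centers.

```r
# Hypothetical candidate counts per country; made-up numbers that stand in
# for the numFound values returned by Solr.
counts <- c(India = 420, USA = 390, Germany = 45, Brazil = 30, Japan = 25, UK = 360)

# One-column matrix with one row per country
m <- matrix(counts, ncol = 1,
            dimnames = list(names(counts), "Available Workforce"))

set.seed(1)                                 # kmeans() uses random starting centers
fit <- kmeans(m, centers = 2, nstart = 10)  # try 10 random starts, keep the best

print(fit$cluster)   # cluster label assigned to each country
print(fit$centers)   # mean workforce of each cluster

# Countries that fall into the cluster with the larger mean workforce
big <- which.max(fit$centers[, 1])
focus <- rownames(m)[fit$cluster == big]
print(focus)
```

The countries in the larger-mean cluster are the ones with the deepest resource pool, which is the kind of grouping the recruitment firm in the use case is after.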
