Over the last few years, open source search software has emerged as a very sound option for organizations looking for search applications for websites, specialist ecommerce sites, search-based applications, and enterprise search. At present, it is impossible to get any market information on the extent to which open source search platforms are being used. It seems likely that the majority of current implementations are for website search and for specialized service applications such as LinkedIn and Twitter.
However, organizations are increasingly considering using open source search software for internal intranet and other enterprise search requirements, and this is likely to accelerate quite rapidly now that most of the larger-scale commercial search software vendors have been acquired and integrated into enterprise suites (see Chapter 5). From a search strategy perspective, it is important that the strategy takes into account the potential introduction of open source applications, even if the incumbent application is SharePoint or a standalone commercial search application.
It is far too easy for an organization, and its IT team, to work on the basis that going open source will result in an effective solution in a very short period of time for very little expenditure. Under some circumstances, an open source solution could meet those requirements. However, if the requirements are not fully considered in advance and allowance made for changes even within the development project, let alone post-implementation, then the results will be disappointing and embarrassing.
The initial step toward open source software was taken by Richard Stallman in 1983 with his GNU Manifesto. GNU, a recursive acronym for “GNU’s Not Unix,” is the name of the complete Unix-compatible software system that he set out to write and to give away free to everyone who could use it. The term “open source software” itself dates from 1998, in the wake of Netscape’s release of the code for its browser. The Open Source Initiative was set up in the same year and drafted a definition of open source code that remains the standard.
The Apache Software Foundation (ASF), one of the major players in open source software, was set up in 1999. It is a US-based, membership-run, not-for-profit corporation with the objective of ensuring that Apache projects continue to exist beyond the participation of individual volunteers. Membership is open to people who have demonstrated a commitment to collaborative open source software development through sustained participation and contributions within the ASF’s projects. An individual is awarded membership after nomination and approval by a majority of the existing ASF members. Individual Apache projects are, in turn, governed directly by Project Management Committees (PMCs), which are made up of individuals who have shown merit and leadership within those projects.
The Lucene open source search code was written by Doug Cutting in 1999. He had worked on search projects at a number of IT companies, including Apple and Xerox PARC, went on to work for Yahoo!, and is currently on the executive team of Cloudera. The name Lucene is Doug Cutting’s wife’s middle name. Cutting also developed the Nutch web crawler and, more recently, Hadoop. He transferred the rights to Lucene to the Apache Software Foundation in 2001, and it became a top-level Apache project in 2005; Nutch followed it into the Apache stable.
Solr was created in 2004 by Yonik Seeley at CNET Networks as an in-house project to add search capability to the company website. It adds functionality on top of Lucene such as hit highlighting, faceted search and filtering, caching, geospatial search, and a good web administration interface. Whereas Solr can be installed and used by non-programmers, using Lucene directly requires programming expertise.
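Much of this functionality is exposed as HTTP query parameters on Solr’s /select endpoint. The Python sketch below simply assembles such a request URL; the host, collection, and field names are hypothetical, though the parameter names (q, fq, facet, facet.field, hl, hl.fl) are standard Solr ones.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, text, facet_fields=(), highlight_fields=(), filters=()):
    """Assemble a Solr /select URL from standard query parameters."""
    params = [("q", text), ("wt", "json")]
    for fq in filters:
        # fq (filter query) restricts results and is cached separately from q
        params.append(("fq", fq))
    if facet_fields:
        params.append(("facet", "true"))
        for field in facet_fields:
            params.append(("facet.field", field))
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical collection and field names, for illustration only
url = build_solr_query(
    "http://localhost:8983/solr/products",
    "laptop",
    facet_fields=["brand", "price_range"],
    highlight_fields=["description"],
    filters=["inStock:true"],
)
print(url)
```

Sent to a running Solr instance, a request like this would return matching documents together with per-field facet counts and highlighted snippets.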
Yonik Seeley, along with Grant Ingersoll and Erik Hatcher, went on to launch Lucid Imagination (later renamed LucidWorks) in 2008. Solr was given top-level project status by the Apache Foundation in 2007; the Lucene and Solr projects were merged in 2010 and are now generally referred to as Apache Lucene/Solr. Seeley left LucidWorks in 2013 to set up Heliosearch, which offers a variant of Solr.
There is a common misconception that Lucene and Solr are “supported by a global team of developers.” In some senses that is correct, but without good quality control and a shared development roadmap, the functionality of the code bases would quickly get out of control. Quality management is the responsibility of committers, who are the only developers permitted to make changes to the main code base. Noncommitters may also submit patches, but these will not become part of the main code base until a committer has reviewed them. The Apache Foundation provides information on the responsibilities of committers, which gives a sense of how the process works. There is also a Lucene/Solr project website that lists the Lucene/Solr committers, of whom there are currently around 50.
Shay Banon released the first version of Elasticsearch in February 2010, based on work that he had been undertaking on search software development since 2004. During 2013 and 2014, Elasticsearch became an increasingly popular alternative to Lucene/Solr. It is not an Apache Foundation application, though it can be downloaded under the Apache 2 license and uses Lucene code. In 2015, the company behind it was renamed Elastic.
The latest entrant into the market is Heliosearch, founded by Yonik Seeley, the original developer of Solr, together with Joel Bernstein and Erick Erickson, both formerly with LucidWorks.
Underpinning all three platforms is Java, a general-purpose computer programming language that was developed by Sun Microsystems to provide a platform-independent application development environment. Sun moved Java to an open source license in 2006–2007, and was subsequently acquired by Oracle in 2010; Oracle has maintained the open source approach. The current version is Java SE 8, which was released early in 2014.
The development of both Lucene and Solr has been very rapid over the last few years. Although many of the version releases have delivered bug fixes and minor enhancements rather than fundamental upgrades, the move to version 4.0 was a major change and added SolrCloud, which allows easier scaling of Solr. Over the same period, a commercial application would perhaps have been upgraded once or twice. The pace of releases indicates a responsiveness to user requirements, but it may be quite challenging for IT teams accustomed to the upgrade pattern of commercial enterprise applications. Version 5.0 of Solr was released in 2015.
Elasticsearch continues to be available as a free download, but in 2013 a commercial company was set up by Shay Banon and his colleagues, very much along the lines of LucidWorks. In February 2013, the company, which is based in the Netherlands, gained a $24 million Series B round of funding from Index Ventures, Benchmark Capital, and SV Angel. The funds were to be used to improve its ability to meet and support increasing customer demand for Elasticsearch. Elasticsearch is often deployed as part of the “ELK Stack,” as the company also offers Logstash (a data pipeline management application) and Kibana (a visualization application) as a tightly integrated trio of applications. The company also offers a security management application (Shield) and a management dashboard application (Marvel). This is where commerce meets open source, as Marvel and Shield are available through a range of subscription-based support packages.
In March 2015, Elasticsearch was renamed Elastic and acquired Found, a Norwegian company that offered hosted and managed search services based on Elasticsearch, putting the investment funding to good use. At that time, Elastic reported that there had been over 20 million downloads of Elasticsearch.
LucidWorks gained a $10M investment in 2010, and in October 2013, announced a further investment from In-Q-Tel, though there was no information about the scale of the investment. At that time, LucidWorks claimed that it was delivering 9,000 downloads a day and had 4,000 customers.
Both the leading commercial suppliers of support and development services are therefore being funded through venture capital investment. Venture capital firms want to see a return on their investment, and working out the value of these two businesses is going to be a challenge when the fundamental intellectual property is open source and free. In effect, both companies are providing consulting services to support the open source applications.
The structure of the open source search business is quite complex. There are many options, at least for Lucene and Solr, that an organization could adopt to implement an application. One of the major advantages of selecting an open source approach is plugging into a large and growing community of developers and users and being able to take advantage of their experiences. There is a vast ecosystem of related projects (plug-ins, filters, administration systems, interfaces, etc.), many formal and informal events that may include local networking groups, and a huge amount of written material. Choosing open source may also make it easier to employ good developers, as often these people prefer open source as a development method. At present, LucidWorks, Elasticsearch, and Heliosearch seem to be struggling to develop a marketing proposition, unsure about whether they are targeting IT departments and the development community or business managers looking for search solutions.
The Java code base of open source search applications is complex and cannot be just picked up from a book or a training course. Many IT departments have very limited experience of using Java as a development platform, and even in large organizations there may only be a few employees with an adequate level of expertise. Both Lucene/Solr and Elasticsearch provide rich and detailed APIs for building search applications, and it is unlikely that developers will need to modify the core software itself. The issue is not so much finding software development expertise as having the expertise in information retrieval to translate requirements into code, and then continuing to support the evolution of the application. Search development is not a project but business as usual.
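The data structure at the heart of all of these applications is the inverted index. The toy Python sketch below illustrates the principle only (it is not how Lucene is actually implemented): each term maps to the set of documents containing it, and a multi-term query intersects those sets.

```python
from collections import defaultdict

class TinyInvertedIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Naive whitespace tokenization; Lucene's analyzers also handle
        # stemming, stopwords, synonyms, and much more.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: intersect the posting sets of all query terms.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings[terms[0]])
        for term in terms[1:]:
            result &= self.postings[term]
        return result

idx = TinyInvertedIndex()
idx.add(1, "open source search software")
idx.add(2, "enterprise search strategy")
idx.add(3, "open source licensing")
print(idx.search("open source"))  # documents 1 and 3 contain both terms
```

The information retrieval expertise referred to above lies in everything this sketch leaves out: analysis chains, relevance ranking, and scaling the index across servers.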
There are a number of small companies that specialize in developing open source search solutions. These companies have usually been working with open source solutions for some time and have gained a significant amount of practical experience along the way. Most of these companies will work with Lucene, Solr, and Elasticsearch, and some may employ committers to these projects.
There are a small number of search implementation companies that are able to support a range of commercial and open source search applications. Examples in Europe include Flax, Search Technologies (which also has an extensive operation in the United States), Raytion, Findwise, and France Labs. Search implementation companies based in the United States include Open Source Connections, TNR Global, and Sematext. As with independent developers, some of these companies may employ committers.
LucidWorks (formerly Lucid Imagination) is probably best regarded as a special case of a search implementation company. It claims to have more committers for Lucene and Solr than any other company and over the last few years has worked hard to raise the profile of open source search. There is a substantial amount of information on the LucidWorks website but nothing at all about the pricing of the various support packages. The website also lists apps, such as connectors and language management applications, some of which are proprietary to LucidWorks and others of which come from specialist software companies such as Search Technologies, Raytion, and Raritan. This highlights the fact that open source applications do often need to incorporate additional software components.
A number of companies have gone one stage further than LucidWorks in that they do not offer open source development services but have developed proprietary applications that make extensive use of core open source applications. These include Attivio, IntraFind, and PolySpot. These applications fall somewhere between commercial search applications and open source applications. Perhaps the most striking example is IBM, which launched its OmniFind Yahoo! Edition in 2006 as a free download. At that time, OmniFind was a proprietary application. Although not apparent at the time, this was the start of IBM moving OmniFind onto an Apache Lucene code base.
The open source search community practices what it preaches and works together in an open and cooperative way. There are now many Meetup groups, an increasing number of books and training courses, and hackathons to test out novel approaches to solutions and problems. This is in complete contrast with the commercial vendors, who tended to be secretive not only about the software architecture but even about how to get the best out of the software.
Open source search applications are primarily being implemented on a standalone basis. The integration of these applications with other enterprise systems, especially when these systems use proprietary software, presents some challenges. The major IT vendors recognize that open source software is here to stay, but they also have to continue to deliver dividends to shareholders.
The question inevitably arises as to which of Solr and Elasticsearch is the “best” application, given that both are based on Lucene. As with so much else around search, the answer is that “it depends.” For some time, Solr was the only game in town, but the arrival of Elasticsearch has resulted in some useful competition between protagonists on both sides to highlight strengths (less so weaknesses!) and to ensure that each remains a leading-edge search application.
There is no point in conducting a feature-by-feature comparison. First, there are too many features to compare, and second, it is how the development team makes use of the features that is important. For any given search requirement, a development team using either Solr or Elasticsearch will be able to provide a very good solution. It comes down to having a very clear view of what the requirement is and then feeling reassured that the development team has the experience to deliver a robust solution that not only meets the current requirement but is also scalable and extensible for the likely roadmap of future requirements.
The current situation is along the following lines:
Otis Gospodnetic, the founder and CEO of Sematext, explores the similarities and differences in some detail in his blog post “Solr vs. Elasticsearch—How to Decide?”.
The first critical success factor for an open source search implementation is “strategy before specification.” The danger with open source search development is that because there is no initial outlay on a license fee, the project slips under the strategy radar. If the organization is committing to spend perhaps £500K on a commercial product, then quite a number of departments and managers are going to be involved in the procurement and the sign-off. An open source search project can probably be funded out of the IT development budget, especially if the costs are being taken in two different financial reporting periods.
It is therefore all too easy to see a small-scale open source search application as a quick fix, perhaps for an intranet or a website, with a longer-term intention (but no plan) to widen the scale of the project at a later stage. There are two implications. The first is that it may be necessary to rewrite some of the initial code, or to add in new modules, to cope with the change in requirements. The second is that the overall project may end up costing more than if the requirements had been fully developed at the outset.
In addition, open source search applications need just the same level of care, attention, and support as any other search application, and so once the software is installed, the support costs are not likely to be any lower than for a commercial product.
Although an increasing number of organizations either have a search strategy or are in the process of developing one, the majority still do not. This is especially important in organizations that are planning to move to SharePoint 2013. Although SharePoint 2013 provides a very powerful and flexible search application for content managed within SharePoint itself, there are issues around the extent to which it can be used to search non-SharePoint applications and repositories.
An important aspect of any search implementation is a well-grounded security strategy. Extending an open source application used for a website or a web-based service, where either all the information is in the public domain or access is limited to subscribers who reach a complete collection through a password, is very different from managing the complex and often inconsistent content security issues common in organizations of all sizes. The comment that security access is “managed by Active Directory” is easy to make and much more difficult to implement than the IT team might expect, especially where there is a current or potential requirement to provide access to users who are not employees of the business and so do not have an AD identity.
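To make the problem concrete, the hypothetical Python sketch below shows document-level security trimming: each indexed document carries an access control list of groups, and results are filtered against the groups the current user belongs to. The document ids and group names are invented; in practice, the group lookup (for example, against Active Directory) and pushing the filter into the query itself are where the real complexity lies.

```python
def trim_results(results, user_groups):
    """Keep only documents whose ACL shares at least one group with
    the user. Illustrative only; production systems usually build
    this into the query as a filter clause rather than post-filtering,
    so that facet counts and result paging remain correct."""
    return [doc for doc in results if user_groups & set(doc["acl"])]

# Hypothetical documents and group names
results = [
    {"id": "hr-001", "acl": ["hr", "management"]},
    {"id": "eng-042", "acl": ["engineering"]},
    {"id": "all-007", "acl": ["everyone"]},
]
visible = trim_results(results, {"engineering", "everyone"})
print([doc["id"] for doc in visible])  # ['eng-042', 'all-007']
```

A user without an AD identity has no group memberships to intersect against, which is exactly why the external-user requirement mentioned above is so awkward.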
The search strategy also has to take into account the IT strategy. An open source search development may be the first major open source project that the IT department has managed, and in addition it may need to develop through a different project management methodology. As the project proceeds, it is quite likely that the scope of the project may change as the functionality and ease of development open up opportunities that were not in the original specification. Open source development is best managed through an Agile project management methodology, with a good grasp of what a minimum viable solution might look like.
It is important to note the difference between fitness to specification and fitness to purpose. This is especially important with a search application because there are no workflow analyses that can be used as the basis for requirements. Implementing any search application without a search manager in post is extremely risky. Furthermore, it is not until the application has indexed all the required content and is being used in production that issues about search performance arise. Search applications need to be easy to modify should production experience show that there are issues to be addressed. This is as true of commercial as of open source search applications.
The challenges faced by the search team in evaluating potential suppliers of open source search development expertise are very similar to the challenges faced by web managers a few years ago when open source CMSs (e.g., Drupal and Joomla) started to appear on the market. With a commercial product, it is possible to meet several customers who are running the current version of the software and gain useful information not only about the functionality of the product but also the way in which the product was implemented and then supported.
Open source applications are usually custom built; even products such as LucidWorks may well include some additional apps to meet specific requirements. To a much greater extent, companies will need to evaluate the skills of the development team rather than the list of functions and features. For some companies, there may be little or no experience of selecting open source development teams, especially if most of the enterprise applications are from IBM, Microsoft, or Oracle.
Some of the factors that need to be taken into consideration in evaluating developers are listed in the following table. However, this list is not comprehensive and it needs the right set of skills on the purchaser side to understand the significance of the replies.
Previous experience | What projects has the team undertaken that they feel are most similar to your project, and why? What lessons have they learned from these projects that will be of relevance? How close was the final cost to the initial budget?
Development skills | How do the developers keep up to date with developments in the software? What training have they had, either externally or internally? What networks do they participate in, and how far are they, in network distance terms, from committers? What processes does the company have for assessing development quality? If required, will the development team have the experience to integrate the search application with other enterprise applications?
Documentation | Ask to see examples of the code documentation and user manuals that have been provided to other clients.
Project methodology | Search development needs to be carried out on an Agile basis. Does the team have a best practices document on its development methodology? How will this methodology integrate with the organization’s own project management methodology (e.g., Agile and Waterfall do not work well together)? What skills and support are the team expecting from you at various stages of the project? Does the team use a project management application where all project documents, comments, schedules, and issues are located? How are risks identified and managed?
Licenses | Although the core search applications may be downloaded under Apache Foundation licenses, there may be additional applications (e.g., document filters) that are subject to a separate license.
Team vulnerability | Search development skills are in short supply. To what extent are the skills of members of the team backed up by other team members?
Many books have been written about software project management. This section highlights four issues that can make a difference to the outcome of a search project.
It is very important to agree with all stakeholders what the test regime is going to be. There has to be a clear view on which tests can be run at small scale and which (such as fully loaded files on a production environment) cannot be undertaken until quite late in the project. Extrapolating production performance from small-scale tests of only part of the code base is extremely difficult, even if the development team has worked on a very similar project in the past.
A risk register is not a sign that the project will go off the rails but a methodology to determine when it is about to do so, and what actions need to be taken. It is usually reasonably easy to determine the risks of a project but much more difficult to set out what the early warning signs might be of the risk being about to emerge. Good risk management is about anticipating risks and reducing their impact by early diagnosis and action rather than waiting for the situation to arise and then having a solution ready.
A search project is not an IT project. It brings no benefits to IT other than potentially reducing the IT budget. The benefits lie with the business, and so the project manager has to manage the interests of the IT team, major business stakeholders, and the employees (or customers) who will actually use the application. Add in an Agile development environment and the still small installed base of open source search projects (and therefore of project experience), and finding and retaining the right project manager will be challenging.
In an Agile development environment, it is very easy to find that the documentation does not reflect the current status of the implementation. The documentation must keep pace with the project development, in addition to being written in a language that every member of the project team understands. One of the new words in Solr development is “shard,” and this has nothing at all to do with a certain very tall office building in London. Many SharePoint project teams maintain a glossary of technical terms so that everyone knows what a “list” and a “library” are in real terms. An open source search project will benefit from a similar glossary.
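The term deserves a one-line illustration: a shard is one slice of an index held on one node, and each document is routed to a shard by a stable hash of its id, so the same id always lands in the same place. The Python sketch below shows the principle only; SolrCloud and Elasticsearch use murmur3-based hashing rather than CRC32, but the idea is the same.

```python
import zlib

def route_to_shard(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard via a stable hash of its id.
    Illustrative only; the real platforms use murmur3-based routing."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# The same id always routes to the same shard, so re-indexing a
# document overwrites the earlier copy instead of duplicating it.
print(route_to_shard("doc-42", 4) == route_to_shard("doc-42", 4))  # True
```

A glossary entry along these lines saves a project team from conflating shards with replicas, backups, or partitioned content sources.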
No search application development can be seen as a one-off project. Even with the most capable of development teams, issues can arise very quickly after implementation as an unusual file format is discovered or a crawl needs further optimization. From the beginning of the engagement with the development team, there should be a very strong focus on how development support is going to be continued, including deciding which upgrades should be implemented and when.
Certainly there is an increasing number of companies productizing open source search, but they are very small in IT vendor terms, often have limited support outside of their home territory, and are highly dependent on the skills of a small development team.
Hadoop is another application developed by Doug Cutting, in this case while he was at Yahoo!, and is now available through the Apache Foundation. Hadoop is not a search application for unstructured information; it is a distributed system for storing and processing very large data arrays, and its ability to ingest data may outperform relational databases.
As a result, it has become widely used by businesses that want to collect and manage very large amounts of data, such as text from social media sites, sensor logs, and GPS-based location information, without some of the performance limitations of traditional solutions.
It is likely that many large organizations with extensive data collections are experimenting with or using Hadoop to manage access to these collections. Assuming that Hadoop is an enterprise search application is not realistic, though there are certainly text mining applications for which it is well suited. Additionally, there are various integrations between Hadoop and search platforms such as Lucene/Solr and Elasticsearch, some sponsored by very well-funded companies such as Cloudera.
A detailed discussion of Hadoop falls outside the scope of this chapter; it is included here purely as background information.
Over the last few years, the Lucene-based Solr and Elasticsearch search applications have developed into very flexible and powerful solutions. So far, they have mainly been used either for website search or for specialized search applications where very high performance is called for, along with the ability of a development team to create customized applications. Their use in enterprise applications has been slower to develop.
There are many different ways to manage the development of these applications, ranging from in-house development to buying a packaged solution with perhaps some proprietary modules. Using downloaded versions of Lucene, Solr, and Elasticsearch will certainly result in savings in license fees, but the costs of development should not be underestimated, and need to be viewed over perhaps a three-year period of development and support.
Clinton Gormley and Zachary Tong, Elasticsearch: The Definitive Guide (Sebastopol, CA: O’Reilly, 2015).
Doug Turnbull and John Berryman, Relevant Search (Greenwich, CT: Manning Publications, 2015).
Packt Publishing has an extensive list of books on Apache Lucene and Solr.