Enterprise search solutions work on multiple data sources. Integrating these data sources with a search engine is an important aspect of search solution design. The data sources can vary from applications to databases and web crawlers. Much like the ETL tools of a data warehouse, data loading in search solutions goes through similar cycles. The important question here is how, and which, data should be loaded into the search engine. In this section, we will look at different approaches to integrating data sources with a search engine. Often, during the design of a search engine, architects end up using one or more of these data integration patterns. Most application integrations fall under one of the following patterns. Each of them offers unique benefits and drawbacks; architects have to choose the best-suited pattern based on the landscape.
This type of data integration is used in most cases. It is applicable where the system landscape is already established and the applications provide access to their information through a standard interface or filesystem. Enterprise search is capable of reading data that an application exposes through its interfaces. This interface can be web-service based, API based, or based on any other protocol. The following schematic shows the interaction between an application/database and enterprise search:
A data request is sent to the application's interface from enterprise search through a trigger. This trigger can be a scheduled event, or it can be an event raised by the application itself, signifying a change in the state of its data. When this data is requested, it is important for enterprise search to remember the last state of the data so that an incremental request can be sent to the interface, and only the changed data returned by the application. When the data is received by enterprise search, it is transformed into a format from which enterprise search can create a new index and persist it in its own repository. This transformation is generally predefined.
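The delta-request cycle described above can be sketched as follows. This is a minimal, hypothetical pull-based connector, not an API of any particular search engine: it remembers a watermark timestamp from the last successful import so that each subsequent run requests only the records that changed since then.

```python
from datetime import datetime, timezone

# Hypothetical pull-based connector: it tracks the timestamp of the last
# successful import so that each run requests only changed records (a delta).
class DeltaImporter:
    def __init__(self, fetch_changed_since):
        # fetch_changed_since(ts) -> list of records modified after ts;
        # this stands in for the application's query interface.
        self.fetch_changed_since = fetch_changed_since
        self.last_run = datetime.min.replace(tzinfo=timezone.utc)

    def run(self):
        now = datetime.now(timezone.utc)
        records = self.fetch_changed_since(self.last_run)
        # Advance the watermark only after the fetch succeeds, so a
        # failed run is retried over the same window next time.
        self.last_run = now
        return records

# Usage: simulate an application interface holding timestamped records.
data = [
    {"id": 1, "modified": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "modified": datetime(2023, 6, 1, tzinfo=timezone.utc)},
]
importer = DeltaImporter(lambda ts: [r for r in data if r["modified"] > ts])
first = importer.run()   # first run acts as a full import: both records
second = importer.run()  # next run is a delta: nothing has changed since
```

The key design point is that the watermark lives on the search side, so the application interface can stay stateless and simply answer "what changed after this timestamp?"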
Apache Solr provides predefined handlers to deal with various such data types. For example, DataImportHandler can connect to a relational database and collect data through either a full import or a delta import. This data is then transformed into indexes and persisted in the Solr repository.
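A minimal DataImportHandler configuration for the relational case might look like the following sketch. The JDBC URL, table, and column names are hypothetical; the `query`, `deltaQuery`, and `deltaImportQuery` attributes and the `${dataimporter.*}` variables are part of DataImportHandler's documented configuration.

```xml
<dataConfig>
  <!-- The JDBC connection details below are placeholders -->
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/inventory"
              user="solr" password="secret"/>
  <document>
    <entity name="product"
            query="SELECT id, name, price FROM products"
            deltaQuery="SELECT id FROM products
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, name, price FROM products
                              WHERE id = '${dataimporter.delta.id}'">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="price" name="price"/>
    </entity>
  </document>
</dataConfig>
```

Here `query` drives the full import, while `deltaQuery` identifies changed rows since the last index time and `deltaImportQuery` fetches each of them, mirroring the full/delta cycle described above.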
Data import by enterprise search provides the following advantages over other approaches:
This approach poses the following challenges during implementation:
In this integration pattern, the applications are capable of extending themselves to work as agents that connect to an enterprise search solution. This pattern is applicable when the enterprise system landscape is going through changes, and applications can accommodate the additional logic of pushing data to enterprise search on their own. Since data is pushed as and when new information arrives on the application side, this approach does not require polling or event-based mechanisms. Enterprise search hosts an interface to which the application agents can pass this information directly. The following schematic describes this integration pattern:
There are two major variations of applications pushing data to enterprise search. In the first, applications simply push data in some format, for example XML, and the data transformation takes place in enterprise search. Enterprise search then converts the data into indexes through its own transformation logic. Typically, the supported configurations are published up front by the search software and are part of its standard documentation library.
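The first push variant can be sketched as below: an application-side agent posts documents in a standard format and lets the search engine do the transformation. The `/solr/<core>/update` endpoint and the JSON body are standard Solr; the host, core name, and document fields are assumptions for illustration.

```python
import json
from urllib.request import Request, urlopen


def build_update_request(solr_base, core, docs, commit=True):
    """Build an HTTP request that pushes JSON documents to Solr's
    standard /update handler. The base URL and core name are
    deployment-specific assumptions; the endpoint itself is standard."""
    url = f"{solr_base}/solr/{core}/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"})


# Hypothetical documents pushed by the application agent.
docs = [
    {"id": "42", "title": "Enterprise search integration patterns"},
]
req = build_update_request("http://localhost:8983", "products", docs)

# The network call is only attempted against a running Solr server:
# urlopen(req)
```

Because Solr maps the incoming JSON fields to its schema itself, the agent stays thin: it only needs to know the update endpoint and the agreed field names.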
In the second approach, the application owns the responsibility for the complete transformation and pushes ready-made indexes/data to the search interface or directly to the search repository. Apache Solr supports this approach by providing different types of extractors for data; it provides extractors for the CSV, JSON, and XML formats. This data integration pattern offers the following benefits:
This approach has the following drawbacks:
The middleware-based integration pattern is a combination of the earlier patterns. It is applicable when enterprise search does not provide any mechanism to pull data of a certain type, and the application likewise does not provide any mechanism to push data to the search engine. In such cases, architects end up creating their own applications or agents that read information from applications on the trigger of an event. This event can be a scheduled event, or it can originate from the application itself. The following schematic describes the interaction cycles between the middleware, the application, and the search application:
Once a request is sent to the application interface, it returns the required data. This data is then transformed by the intermediate agent, or mediator, and passed to the search application through another interface call.
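The mediator's transformation step can be sketched as a simple mapping function. The CRM record layout and the target field names below are hypothetical, chosen only to show the mediator flattening an application-specific structure into the document shape the search engine expects.

```python
def to_search_document(crm_record):
    """Middleware transformation: map a hypothetical CRM record into
    the flat field layout assumed for the search engine's schema."""
    return {
        # Prefix the ID so documents from different sources cannot collide.
        "id": f"crm-{crm_record['customer_id']}",
        "title": crm_record["company"],
        # Flatten nested notes into one searchable text field.
        "body": " ".join(note["text"] for note in crm_record.get("notes", [])),
        "source": "crm",
    }


# Usage: a record as it might come back from the application interface.
record = {
    "customer_id": 1001,
    "company": "Acme Corp",
    "notes": [{"text": "renewal due"}, {"text": "premium tier"}],
}
doc = to_search_document(record)
```

Keeping this mapping inside the mediator means neither the application nor the search engine needs to change when the field layout on either side evolves.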
Apache Solr provides an interface through which applications or middleware can upload information to the server. This approach can be used when the request is for a nonconventional data source, for example an ERP or CRM application, that supports neither of the previous two patterns.
The middleware-based integration approach has the following benefits:
This approach has the following drawbacks:
In the next section, we will look at a case study where all these integration patterns are used in the design of a search engine.