Data integration patterns for search

Enterprise search solutions work across multiple data sources, so integrating these sources with the search engine is an important aspect of search solution design. The sources can range from applications to databases and web crawlers. Much like the ETL tools of a data warehouse, data loading in a search solution goes through similar extract, transform, and load cycles; the important decisions are which data should be loaded into the search engine and how. In this section, we will look at different approaches to integrating data sources with a search engine. During the design of a search solution, architects often end up using more than one of these data integration patterns. Most application integrations fall under one of the following patterns; each offers unique benefits and drawbacks, and architects have to choose the best-suited pattern for their landscape.

Data import by enterprise search

This is the most commonly used type of data integration. It is applicable where the system landscape is already established and the applications expose their information through a standard interface or a filesystem. Enterprise search reads the data that an application exposes through its interfaces; these interfaces can be web service based, API based, or based on any other protocol. The following schematic shows the interaction between an application/database and enterprise search:

Data import by enterprise search

A data request is sent to the application's interface from enterprise search through a trigger. This trigger can be a scheduled event, or it can be an event raised by the application itself, signifying a change in the state of its data. When this data is requested, it is important for enterprise search to remember the last state of the data, so that an incremental request can be sent to the interface and only the changed records are returned by the application. When data is received by enterprise search, it is transformed into a format from which enterprise search can create a new index and persist it in its own repository. This transformation is generally predefined.
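
This cycle can be sketched in a few lines of Python. The sketch below is a minimal illustration, not a production importer: the application endpoint, its updated_since parameter, the record field names, and the Solr core name (app_docs) are all hypothetical; only the Solr /update call is a standard interface.

    import pathlib
    import requests

    APP_API = "http://app.example.com/api/records"  # hypothetical application interface
    SOLR_UPDATE = "http://localhost:8983/solr/app_docs/update?commit=true"
    CHECKPOINT = pathlib.Path("last_sync.txt")      # last state of the data read

    def incremental_pull():
        # Read the last-read pointer so only changed records are requested.
        since = CHECKPOINT.read_text().strip() if CHECKPOINT.exists() else "1970-01-01T00:00:00Z"
        records = requests.get(APP_API, params={"updated_since": since}, timeout=30).json()
        if not records:
            return

        # Predefined transformation: map application fields to the search schema.
        docs = [{"id": r["id"], "title": r["name"], "body": r["description"]} for r in records]
        requests.post(SOLR_UPDATE, json=docs, timeout=30).raise_for_status()

        # Persist the new pointer so the next trigger performs an incremental read.
        CHECKPOINT.write_text(max(r["updated_at"] for r in records))

    if __name__ == "__main__":
        incremental_pull()  # the trigger: run from a scheduler such as cron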

Apache Solr provides predefined handlers to deal with many such data sources. For example, DataImportHandler can connect to a relational database and collect data through either a full import or a delta import. This data is then transformed into indexes and persisted in the Solr repository.
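
For illustration, DataImportHandler commands can be triggered over HTTP. The sketch below assumes a core named products (hypothetical) with the handler registered at the /dataimport path in solrconfig.xml; full-import, delta-import, and status are standard DataImportHandler commands.

    import requests

    # Assumed core name; DataImportHandler is registered at /dataimport
    # in this core's solrconfig.xml.
    DIH = "http://localhost:8983/solr/products/dataimport"

    def full_import():
        # Rebuild the entire index from the relational source.
        requests.get(DIH, params={"command": "full-import", "clean": "true"}, timeout=30)

    def delta_import():
        # Fetch only the rows changed since the last run, as defined by the
        # deltaQuery configured in the DIH data config.
        requests.get(DIH, params={"command": "delta-import", "clean": "false"}, timeout=30)

    def import_status():
        # Check whether an import is currently idle or busy.
        return requests.get(DIH, params={"command": "status", "wt": "json"}, timeout=30).json()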

Data import by enterprise search provides the following advantages over other approaches:

  • No or minimal impact on the current system landscape: applications are not affected by the introduction of enterprise search, all the configuration is done on the enterprise search side, and applications do not require any code changes
  • Enterprise search becomes the single entity that manages all the integrations, making the configuration easy for administrators to maintain

This approach offers the following challenges during implementation:

  • Each application provides a different way of communicating. Legacy applications often expose non-standard integration interfaces, so this kind of integration can require a complex event-based mechanism to work with the applications.
  • While data can be read through the integration, the search engine has to maintain the last-read pointer for the data to ensure that the same records are not read twice.
  • Sometimes, it becomes difficult to provide real-time search capabilities on this system. This applies when the data import runs at fixed intervals (a polling-schedule-based approach); in such cases, search is only near real time.

Applications pushing data

In this integration pattern, the applications are capable of extending themselves to act as agents that connect to the enterprise search solution. The pattern is applicable when the enterprise system landscape is going through changes and the applications can accommodate the additional logic of pushing their own data to enterprise search. Since data is pushed as and when new information arrives on the application side, this pattern does not require polling- or event-based mechanisms. Enterprise search hosts an interface to which the application agents can pass this information directly. The following schematic describes this integration pattern:

Applications pushing data

There are two major types of applications pushing data to enterprise search. In the first, applications simply push data in some format, for example, XML, and the data transformation takes place in enterprise search, which converts the data into indexes through its own transformation logic. Typically, the supported formats are published upfront by the search software and are part of its standard documentation library.

In the second approach, the application owns the responsibility for the complete transformation and pushes ready-to-index data to the search interface, or even directly to the search repository. Apache Solr supports this approach by providing extractors for the CSV, JSON, and XML formats of information, as the sketch below illustrates.
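
As a rough sketch of both push styles, the snippet below sends raw JSON documents for Solr-side transformation, and then an index-ready CSV that already matches the search schema. The collection name (orders), the fields, and the calling hook are hypothetical; the /update/json/docs handler and the CSV update handler are standard Solr interfaces.

    import requests

    SOLR = "http://localhost:8983/solr/orders"  # hypothetical collection

    def push_json(docs):
        # First style: push raw JSON and let Solr transform it into indexes.
        requests.post(f"{SOLR}/update/json/docs?commit=true",
                      json=docs, timeout=30).raise_for_status()

    def push_csv(csv_text):
        # Second style: the application has already produced index-ready
        # rows that match the search schema.
        requests.post(f"{SOLR}/update?commit=true",
                      data=csv_text,
                      headers={"Content-Type": "application/csv"},
                      timeout=30).raise_for_status()

    # Called from the application's save hook, so the index is updated
    # as soon as new information arrives on the application side.
    push_json([{"id": "o-1001", "status": "shipped"}])
    push_csv("id,status\no-1002,pending\n")

This data integration pattern offers the following benefits: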

  • Since data import is no longer one of enterprise search's responsibilities, the focus remains on searching and managing the indexes. Enterprise search becomes faster because data transformation is no longer handled by the search engine.
  • The search engine does not require complex modules such as an event-based data reader or a scheduler, making the overall search solution much simpler to configure and use.
  • Enterprise search can provide real-time search capabilities, because data is synced from the data source itself as soon as it changes.

This approach has the following drawbacks:

  • The search engine's characteristics and schema are exposed to the outside world, and applications carry the burden of keeping the search engine in sync with changes made to the application data
  • The impact on the system landscape is significant; every system integrated with enterprise search takes on a dependency on it

Middleware-based integration

The middleware-based integration pattern is a combination of the earlier patterns. It is applicable when enterprise search does not provide any mechanism to pull data of a certain type, and the application does not provide any mechanism to push data to the search engine either. In such cases, architects end up creating their own applications or agents that read information from the applications on the trigger of an event; this event can be scheduled, or it can come from the application itself. The following schematic describes the interaction cycles between the middleware, the application, and the search application:

Middleware-based integration

Once a request is sent to the application interface, it returns the required data. This data is then transformed by the intermediate agent, or mediator, and passed to the search application through another interface call.

Apache Solr provides an interface through which applications or middleware can upload information to the server. This approach can be used for nonconventional data sources that support neither of the previous two patterns, for example, ERP and CRM applications.
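
A mediator of this kind can be sketched as follows. The ERP endpoint, its field names, and the invoices core are hypothetical; the agent simply reads from the application on a trigger, transforms the data, and forwards it to Solr's standard /update interface.

    import requests

    ERP_API = "http://erp.example.com/api/invoices"  # hypothetical ERP interface
    SOLR = "http://localhost:8983/solr/invoices/update?commit=true"

    def sync_once():
        # 1. Send the request to the application interface and read the data.
        invoices = requests.get(ERP_API, timeout=30).json()

        # 2. Transform inside the mediator, so neither the ERP nor the
        #    search engine needs any knowledge of the other.
        docs = [{"id": i["invoice_no"],
                 "customer": i["customer_name"],
                 "amount": i["total"]} for i in invoices]

        # 3. Pass the transformed data to the search application.
        requests.post(SOLR, json=docs, timeout=30).raise_for_status()

    if __name__ == "__main__":
        sync_once()  # the trigger: a schedule or an event from the application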

The middleware-based integration approach has the following benefits:

  • Due to the presence of the middleware, the current system landscape can continue to function unaware of the introduction of a new enterprise search engine.
  • Similarly, applications and enterprise search do not carry dependencies on each other, as they never interact directly.
  • Many middleware tools are available in the market, ranging from open source to commercially supported products, so existing products can be reused instead of bearing the full cost of developing such middleware from scratch.

This approach has the following drawbacks:

  • The creation of mediators/middleware introduces more applications for administrators to manage.
  • Agents work with a store-and-forward methodology, which adds one more stage between the source and the target data. This can hurt performance when the volume of data being moved is huge, and it increases the points of failure in the overall landscape.

In the next section, we will look at a case study where all these integration patterns are used in the design of a search engine.
