6.5. Application Architecture: Billions and Billions of InfoBytes

Logically, Codexa's application is partitioned into a set of application services, including

  • Data acquisition, the processes whereby the system gathers data of interest to Codexa clients from the Internet

  • Data distribution, a messaging/event queuing service to provide the guaranteed data delivery that is critical to the Codexa system

  • Data evaluation and classification, which evaluates and rates incoming data based on customer-specific criteria

  • The KnowledgeMQ, a user-defined querying/reporting engine through which users interact with Codexa's application services

  • Reporting, a user-defined data-formatting engine that enables users to specify logical information association and data delivery mechanisms (bulk file, cell phone, pager, Web page, and real-time alerts that leverage applets, InfoBus, and JMS).

In the interests of flexibility, each of these components is designed to be independent of the platform, vendor-specific tools, security protocols, infrastructure, and even other Codexa components. This modular architecture is necessary because the Codexa system must enable client expertise to be applied at any level of the application, including client systems integration and asynchronous information delivery (see Figure 6.4).

Figure 6.4. Codexa Application-Services Architecture


6.5.1. Data Acquisition

Active data acquisition is one of the core technologies created by Codexa. The data acquisition module is responsible for acquiring data from a wide variety of Web data sources, including e-mail publications, news sites, message boards, closed-caption TV broadcasts, and corporate Web sites. Traditional sources include news, earnings, company information, SEC, and price feeds. After retrieving the data from these disparate sources, this module utilizes Jini technology to distribute the processing necessary to strip the data of extraneous information and format it into a consistent, extensible, readable XML-based model. This complete process is referred to as “harvesting.”
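To make the harvesting step concrete, the following minimal sketch strips markup from a raw page and wraps the result in a simple XML envelope. The class, method, and element names here are hypothetical, and the real harvesters distribute this work across Jini services; this shows only the core strip-and-format idea.

    import java.util.regex.Pattern;

    public class HarvestSketch {

        // Matches any markup tag so it can be stripped from the raw page.
        private static final Pattern TAGS = Pattern.compile("<[^>]+>");

        // Strip markup, collapse whitespace, and escape the text for XML.
        static String clean(String rawHtml) {
            String text = TAGS.matcher(rawHtml).replaceAll(" ")
                              .trim().replaceAll("\\s+", " ");
            return text.replace("&", "&amp;")
                       .replace("<", "&lt;")
                       .replace(">", "&gt;");
        }

        // Wrap a cleaned item in a simple XML model keyed by its source URL.
        static String toXml(String sourceUrl, String rawHtml) {
            return "<dataItem source=\"" + sourceUrl + "\"><content>"
                 + clean(rawHtml) + "</content></dataItem>";
        }

        public static void main(String[] args) {
            String page = "<html><body><p>ACME beats earnings estimates.</p></body></html>";
            System.out.println(toXml("http://news.example.com/acme", page));
        }
    }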

As of this publication, the data acquisition module brings in approximately 200,000 to 300,000 data items on a slow day and up to 500,000 items on a busy day. To accommodate peak loads, the system must be able to handle millions of items a day, some as small as e-mail messages, some as large as 300-page SEC filings.

Extracted data is stored in the Codexa system in the form of nested XML documents, accessed through the JNDI tree. As data is stored, the data model portion of the JNDI tree becomes, in essence, a map of the relevant portions of the Internet: XML documents representing the data found at a URL are reachable through XML documents representing the URLs themselves, and so on. Notification of harvested data is published into the messaging system for evaluation and classification.
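A minimal sketch of this storage pattern, using illustrative context names rather than Codexa's actual tree layout, binds a harvested document under a path derived from its source URL and looks it up along the same path:

    import javax.naming.Context;
    import javax.naming.InitialContext;
    import javax.naming.NamingException;

    public class JndiStoreSketch {
        public static void main(String[] args) throws NamingException {
            // The naming provider is configured externally (jndi.properties).
            Context ctx = new InitialContext();

            // Build intermediate contexts mirroring the source URL's structure.
            Context harvest = ctx.createSubcontext("harvest");
            Context site = harvest.createSubcontext("news.example.com");

            // Bind the harvested XML document at its place in the map.
            String xml = "<dataItem source=\"http://news.example.com/acme\"/>";
            site.bind("acme", xml);

            // A consumer walks the same path to retrieve the document.
            String doc = (String) ctx.lookup("harvest/news.example.com/acme");
            System.out.println(doc);
        }
    }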

6.5.2. Data Distribution

The guaranteed distribution of data is central to the functionality of the Codexa Service. It is also essential for easing scalability issues, because it allows many operations to be separated into components that work in parallel but independently of one another. The Service requires a messaging/event queuing service that enables components distributed anywhere to communicate effectively with each other. A messaging service must be available to every server and every client in the system. Codexa wrapped and extended the functionality of Softwired's iBus//Message Server to satisfy this requirement.

Codexa has a few functional requirements beyond those of the average messaging service.

  • A programmable interface for client/user communication

  • A network-centric client-side “lite” (small footprint) version for embedding in applications

  • Functionality that supports the JavaBean property-change event notification model

  • Support for a complete finite state machine model, including state management and transition management

The initial version of the Codexa KnowledgeMQ (the querying/reporting engine) implements JMS as a publish/subscribe message queuing system with which subscribers register interest in “topics,” represented as locations in the JNDI hierarchy. This basic implementation includes support for distributed transaction-oriented messaging and is compliant with the X/Open XA specification for distributed transaction processing (DTP). The transactional view of a message begins when the message producer creates the message and ends when all message consumers have received the given message.

For consistency, all KnowledgeMQ messaging is structured as XML. Topic names for publish/subscribe-based messaging are built from the element and attribute node names in the documents' hierarchical structure, based on Codexa's implementation of the JNDI CompositeName specification.
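As a small illustration of that naming scheme (the element names here are invented), a topic name can be composed and decomposed with javax.naming.CompositeName:

    import javax.naming.CompositeName;
    import javax.naming.InvalidNameException;

    public class TopicNameSketch {
        public static void main(String[] args) throws InvalidNameException {
            // Element/attribute node names become the components of the topic name.
            CompositeName topic = new CompositeName("market/sector/technology/ACME");
            System.out.println("components: " + topic.size());          // 4
            System.out.println("leaf: " + topic.get(topic.size() - 1)); // ACME
        }
    }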

KnowledgeMQ is a complete implementation of JMS, supporting the full architecture defined in the JMS specification. In its simplest form, the JMS architecture dictates the use of a ConnectionFactory that creates a specific type of Connection to a messaging server. The Connection is then used to access a Session, and a Session is the factory for its MessageConsumer and MessageProducer references: a MessageConsumer subscribes to messages through a Session, and a MessageProducer publishes messages through a Session.

With the addition of property-change monitoring, Codexa's KnowledgeMQ becomes a rules-based message distribution system, because property-change listeners have the ability to veto changes based on a set of rules and to enforce an explicit model of state and transition. For example, the Codexa Service might have a base evaluation rule that multiple instances of certain keywords in a data item signal that a company is in legal trouble. If a client's rules say that some of those keywords aren't significant, the item might be judged unimportant, and propagation would be stopped before notification is sent to that client's users. This implies patterns similar to deterministic finite automata (DFAs), in which rules are substituted for the regular expression (RE) and the KnowledgeMQ supports the propagation of state transitions, including notions such as state concatenation, state alternation, and Kleene closure.
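The following sketch walks that ConnectionFactory-to-producer/consumer chain using the standard javax.jms publish/subscribe interfaces. The JNDI names (“TopicConnectionFactory”, “harvest/news”) are illustrative, not KnowledgeMQ's actual configuration:

    import javax.jms.*;
    import javax.naming.InitialContext;

    public class PubSubSketch {
        public static void main(String[] args) throws Exception {
            InitialContext jndi = new InitialContext();

            // The factory creates a specific type of Connection to the server.
            TopicConnectionFactory factory =
                (TopicConnectionFactory) jndi.lookup("TopicConnectionFactory");
            TopicConnection connection = factory.createTopicConnection();
            Topic topic = (Topic) jndi.lookup("harvest/news");

            // A Session is the factory for consumers and producers; here one
            // session consumes asynchronously...
            TopicSession subSession =
                connection.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
            TopicSubscriber subscriber = subSession.createSubscriber(topic);
            subscriber.setMessageListener(
                msg -> System.out.println("evaluation notice received"));
            connection.start();

            // ...and a second, transacted session publishes. The message is not
            // visible to consumers until the transaction commits.
            TopicSession pubSession =
                connection.createTopicSession(true, Session.AUTO_ACKNOWLEDGE);
            TopicPublisher publisher = pubSession.createPublisher(topic);
            publisher.publish(pubSession.createTextMessage("<dataItem/>"));
            pubSession.commit();
        }
    }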

6.5.3. Data Evaluation and Classification

Data evaluation is one of the fundamental benefits the Codexa Service offers its clients. The Data Services include a set of evaluators: message consumer objects that add one or more evaluations to a harvested item. An evaluation is simply a determination that an item has a certain property (for example, that a message board posting contains a certain keyword). In addition, users can configure their own evaluators, so evaluator types proliferate constantly.
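A minimal evaluator might look like the following sketch. The KeywordEvaluator name and the plain-text handling are hypothetical; real evaluators attach structured evaluations to the item's XML model rather than printing them.

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    public class KeywordEvaluator implements MessageListener {

        private final String keyword;

        public KeywordEvaluator(String keyword) {
            this.keyword = keyword.toLowerCase();
        }

        // Called by JMS for each harvested item published to the topic.
        public void onMessage(Message message) {
            try {
                String item = ((TextMessage) message).getText();
                if (item.toLowerCase().contains(keyword)) {
                    // An evaluation: a determination that the item has a property.
                    System.out.println("evaluation: item mentions '" + keyword + "'");
                }
            } catch (JMSException e) {
                e.printStackTrace();
            }
        }
    }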

The Data Services perform two basic types of evaluation and classification.

  1. Dynamic classification. Classification is a subset of evaluation wherein an item is judged to be a member of a particular group (for example, it is relevant to a specific organization, a market sector, or an organization's business model).

  2. Dynamic constraints. Constraints are data-validation modules. A constraint can veto the further propagation of a message based on its interpretation of the validity of the data. For example, if the system cannot classify a data item, a constraint removes all references to that item, and it is removed from the PCA. (A minimal constraint sketch follows this list.)
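A constraint maps naturally onto the JavaBean vetoable-change model described earlier. In this minimal sketch, the property name "classification" is illustrative: when an item arrives with no classification, the constraint vetoes the change and propagation stops.

    import java.beans.PropertyChangeEvent;
    import java.beans.PropertyVetoException;
    import java.beans.VetoableChangeListener;

    public class ClassificationConstraint implements VetoableChangeListener {

        public void vetoableChange(PropertyChangeEvent evt)
                throws PropertyVetoException {
            if ("classification".equals(evt.getPropertyName())
                    && evt.getNewValue() == null) {
                // Vetoing the change halts further propagation of the item.
                throw new PropertyVetoException("unclassifiable item", evt);
            }
        }
    }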

6.5.4. KnowledgeMQ and Filters

The KnowledgeMQ, with its user-defined Knowledge Filters, is the front end of the Codexa Service, enabling users to interact with the application services in sophisticated ways. In the same vein as some commercial ad-hoc reporting tools, the KnowledgeMQ is a user-defined querying/reporting engine, except that it supports fuzzy logic-based queries, as well as standard reporting of data that is maintained by the system's real-time components. For example, a user can assign weight to data item attributes, then define a time scale for a weighted collection of events. When the aggregate weight of a collection of evaluations achieves the user's specified threshold within the time constraints, an alert is sent to the user's client process.
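A sliding-window sketch of that weighted alerting follows; the window size, threshold, and class names are all hypothetical stand-ins for user-defined Knowledge Filter settings:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class ThresholdAlertSketch {

        private static final long WINDOW_MS = 60L * 60 * 1000; // user-defined time scale
        private static final double THRESHOLD = 10.0;          // user-defined threshold

        // One weighted evaluation with its arrival time.
        private static final class Evaluation {
            final long timestampMs;
            final double weight;
            Evaluation(long timestampMs, double weight) {
                this.timestampMs = timestampMs;
                this.weight = weight;
            }
        }

        private final Deque<Evaluation> window = new ArrayDeque<Evaluation>();
        private double aggregate;

        // Record an evaluation; alert if the window's weight crosses the threshold.
        public void onEvaluation(long nowMs, double weight) {
            // Drop evaluations that have aged out of the time window.
            while (!window.isEmpty()
                    && nowMs - window.peekFirst().timestampMs > WINDOW_MS) {
                aggregate -= window.removeFirst().weight;
            }
            window.addLast(new Evaluation(nowMs, weight));
            aggregate += weight;
            if (aggregate >= THRESHOLD) {
                System.out.println("ALERT: aggregate weight " + aggregate
                        + " within the time window");
            }
        }
    }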

Knowledge Filters are the user-defined “rules” that the KnowledgeMQ applies. The KnowledgeMQ uses the JavaBeans component architecture for rules definition: each rule set is a standard XML document that follows the standard “bean customizer” pattern.

The core architectural paradigm for the KnowledgeMQ is that of a real-time engine in which all communication takes place through a standard interface using a standard communications protocol. Through the realization of this design pattern, the KnowledgeMQ becomes a scalable component accessed by server-side objects acting as proxies for organizations, organizational units, and users. Reports can then be delivered through the KnowledgeMQ.

The KnowledgeMQ offers a default set of reports, such as message volume ratios, earnings whispers, and market manipulation attempts. In addition, it can return alerts informing a client of events in which that client has registered interest, of new data available for reports the client has subscribed to, or of new information on a topic of interest to that client.

6.5.5. Reporting

Once data is fully classified, it is stored in the RDBMS, where it can be used for statistical analysis and reporting. Codexa clients access the reporting EJBs through secured Java servlets and JavaServer Pages (JSP).
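A sketch of that access path, with a hypothetical ReportService business interface standing in for Codexa's reporting beans (the actual home-interface and lookup details depend on the EJB version and deployment), might look like this:

    import java.io.IOException;
    import javax.naming.InitialContext;
    import javax.naming.NamingException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ReportServlet extends HttpServlet {

        // Hypothetical business interface exposed by a reporting EJB.
        public interface ReportService {
            String buildReport(String userName);
        }

        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // Container-managed security gates access before this point;
            // reject any request that arrives unauthenticated.
            if (req.getUserPrincipal() == null) {
                resp.sendError(HttpServletResponse.SC_FORBIDDEN);
                return;
            }
            try {
                ReportService reports = (ReportService)
                    new InitialContext().lookup("java:comp/env/ejb/ReportService");
                resp.setContentType("text/xml");
                resp.getWriter().println(
                    reports.buildReport(req.getUserPrincipal().getName()));
            } catch (NamingException e) {
                throw new ServletException(e);
            }
        }
    }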
