CHAPTER 14
Data-Centric and Other Emerging Technologies

There is a very good chance you will be implementing one of the following emerging technologies, perhaps for a project already underway, or one in the near future.

For each technology, there is a good case to be made that the Data-Centric approach can improve the success of these initiatives.

Big data

Big Data is an approach to data management that is primarily about leaving data where it is, as it is, and sending programs (functions) to the data, rather than trying to homogenize and centralize the data (as would be done in a relational database or a data warehouse).

The classic Big Data approach relies on some sort of parallelizable architecture such as MapReduce or Hadoop.56 Forrester pointed out that the implementation of Big Data brought with it three (and ultimately many more) “Vs”: Volume, Velocity, and Variety.

  • Volume is the sheer amount of data. Big Data technology was designed to handle vast amounts of data, up to exabytes of data.
  • Velocity is the alliterative term for latency; it refers to how long a query or process takes to complete. While Big Data technology could handle huge amounts of data, the approach it took introduced delay. Dispatching thousands of programs, each solving part of a problem, and then reassembling the results adds a lot of setup time, which limits big data’s use in real-time applications.
  • Variety. Big Data started life with simple structures (weblogs, for instance), but what rapidly followed was a great increase in the variety of data formats. This became the central challenge.

Since the initial hype of big data, the pundits have been adding more “Vs,” including:

  • Veracity (How do you know it’s accurate?)
  • Value (Can it be applied?)
  • Variability (Different sources can be wildly different.)

The net of all this is that data scientists spend 60-70% of their time “wrangling data.” A big advantage of combining data-centrism with big data is that the data is pre-organized and pre-linked. Data-centric architecture employs a simple, elegant model. Because of its simplicity, it is easy to conform existing big data to the model, which helps in finding the relevant data.

Data lakes

Data warehouses were born of the need to report from many different systems. The data warehouse approach was to create a single data model, tuned for common reporting needs, and to populate it with data from many diverse systems. This population process is called ETL (extract, transform, and load). Extract is the process of getting data out of the source system, transform is the process of conforming it to the target data model, and load is the process of writing it efficiently into the warehouse.
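The three steps can be sketched in a few lines. This is a minimal illustration; the source fields, target fields, and in-memory “warehouse” are all hypothetical.

```python
# Minimal ETL sketch: extract rows from a "source system" (here, a list of
# dicts), transform them to the warehouse's target model, and load them.
# All names (cust_nm, customer_name, etc.) are illustrative, not real schemas.

def extract(source_rows):
    """Extract: pull raw records out of the source system."""
    return list(source_rows)

def transform(row):
    """Transform: conform a source record to the target warehouse model."""
    return {
        "customer_name": row["cust_nm"].strip().title(),
        "revenue_usd": round(float(row["rev"]), 2),
    }

def load(warehouse, rows):
    """Load: write the conformed rows into the warehouse."""
    warehouse.extend(rows)

source = [{"cust_nm": "  acme corp ", "rev": "1200.5"}]
warehouse = []
load(warehouse, [transform(r) for r in extract(source)])
print(warehouse)  # [{'customer_name': 'Acme Corp', 'revenue_usd': 1200.5}]
```

The rigidity described above comes from the transform step: every new source requires hand-built conformance logic like this.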

While a great idea in principle, it has hardened over time. In most organizations, getting a new dataset into the data warehouse is now a multi-month effort, and sponsors have grown tired.

The confluence of frustration with the data warehouse and big data envy led to the “data lake.” The data lake came into existence when people realized that instead of ETL we could just take the source data, more or less as it is, and drop it into a semi-structured data store, and let the data scientists figure it out on consumption.

This, of course, has all the promises and problems of big data, most importantly that data scientists have to work out what the data means at analytics time. Once again, most of the time is spent finding and understanding the data, and less time is spent on analytics. A further problem in most corporate settings is that the business analysts who could be taught the user-friendly data analytic tools, such as Tableau, Cognos, and QlikView, are now mostly shut out of participating because of the high bar of skills needed to access, understand, and work with data in the data lake.

Once again, the Data-Centric approach offers a simple model that the data lake can be mapped to, greatly reducing the time spent on finding and understanding the data. We believe that over time there will be more analytics tools designed for business analysts that can directly address this environment.

Cloud

Almost everyone is moving to the cloud now. On the surface this is neither pro-Data-Centric nor anti-Data-Centric, but it does represent change and opportunity. Many firms recognize that they are making a strategic investment by porting to the cloud. Most can be convinced, at least partially, that the move should be coupled with application rationalization, and a reduction in licensing fees.

Some firms see the move to the cloud as an opportunity to revamp their integration strategy. The most extreme expression of this view occurs when a company moves directly to Data-Centric methods, thereby vastly improving data integration. However, we believe the stampede to the cloud is moving too rapidly to allow something as measured as designing a shared ontology and migrating virtually all of a firm’s functionality to it.

Another approach with growing popularity is to use the newer messaging protocols of the cloud, such as Kafka, to provide the excuse to rethink the approach to message-based integration. It is possible to piggyback Data-Centric messaging onto a new initiative like this. By redefining the idea of a canonical message model in semantic and Data-Centric terms, a seed has been planted.

The semantic/canonical model has immediate value in its own right. It drastically decouples applications from each other and ends point-to-point integration. At the same time, its existence provides a safe springboard for re-implementing portions of, or even entire applications.

NLP

Natural Language Processing has been around for a long time. For most of that history its highest value was in “named entity recognition” and “topic identification.” Named Entity Recognition is the ability to scan unstructured text and find people, organizations, places, dates, events, and products. These are the “named entities.” They are fairly easy to recognize and are somewhat useful to extract.

The next phase of NLP concentrated on identifying “topics,” in other words, what an article is about. These became automated “tags” to put on an article.

NLP has also been used to detect “sentiment” in a chunk of text. Is this comment positive or negative? Is the author angry or upset? NLP can help answer these questions.

Only recently has NLP come to the point where it can really contribute to a Data-Centric architecture. The first way is through the harvesting of assertions as triples from documents. As we mentioned, finding the named entities in a document is pretty straightforward. For quite some time, NLP has been able to find people and organizations in documents. Now, though, NLP can identify the relations between them.

For instance, when an NLP system scans a court report, it will easily identify the pertinent lawyers, law firms, plaintiffs, and defendants in the case. The latest cutting-edge NLPs go beyond this, identifying and describing the relationships between the parties (e.g., which lawyer represented which party). With a bit of configuration, this extract process can be conformed to whichever ontology a firm has adopted as the center of its Data-Centric strategy. Suddenly there is a whole new source for data.
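A minimal sketch of this harvesting, using a simple regular expression in place of a real NLP model, and a hypothetical ontology namespace for the extracted relation:

```python
import re

# Toy relation extractor: harvest "represented" relations from text as
# triples conforming to a (hypothetical) core ontology. Real NLP systems
# use trained models; the regex here only illustrates the output shape.

ONT = "https://example.com/ont/"  # hypothetical ontology namespace

def harvest_triples(text):
    triples = []
    for lawyer, party in re.findall(
            r"(\w[\w ]*?) represented (\w[\w ]*?)[\.,]", text):
        triples.append((lawyer.strip(), ONT + "represented", party.strip()))
    return triples

report = "Jane Roe represented Acme Corp. John Doe represented Widget Inc."
print(harvest_triples(report))
```

The point is the last step: the extracted relations are emitted as triples in the firm's shared vocabulary, ready to land in the triple store alongside data from structured sources.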

The second area where NLP is helping with Data-Centric architecture is extending the core ontology. Some NLP systems are getting good at detecting linguistic clues that suggest classes and/or taxonomic extensions to an ontology. For instance, when an NLP engine encounters a phrase such as “you’ll see many types of coniferous trees, such as pine and spruce” it can infer that pine is either a subclass of coniferous tree, or that taxonomically it is a narrower concept.
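This kind of linguistic clue (a classic “Hearst pattern”) can be sketched with a regex. In SKOS, skos:broader links a concept to its broader concept, so (pine, skos:broader, coniferous trees) reads “pine has broader concept coniferous trees.” Real NLP engines use many such patterns backed by trained models; this single pattern is illustrative only.

```python
import re

# Sketch of an ontology-extension hint: the pattern
# "types of <C>, such as <X> and <Y>" suggests X and Y are narrower
# concepts than C.

def suggest_narrower(text):
    m = re.search(r"types of ([\w ]+?), such as ([\w ,]+)", text)
    if not m:
        return []
    broader = m.group(1).strip()
    terms = re.split(r",\s*|\s+and\s+", m.group(2).strip())
    # Each suggestion: (narrower term, skos:broader, broader term)
    return [(t, "skos:broader", broader) for t in terms if t]

sentence = "you'll see many types of coniferous trees, such as pine and spruce"
print(suggest_narrower(sentence))
```

In practice such suggestions would be reviewed by an ontologist before being committed to the core model.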

A third area where we are seeing activity is using NLP to interpret a question asked in unstructured English, translating that to SPARQL, executing it, and returning the results as the answer to the question. So far, we are seeing this primarily in text questions in chatbots and narrow domains, but we expect the domains to broaden and expect to be seeing this capability against spoken questions soon.
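A toy illustration of the translation step, handling exactly one hardcoded question shape; the ex: namespace and property names are invented for the example, and production systems use trained parsers rather than a template:

```python
import re

# Toy natural-language-to-SPARQL translator for one narrow question
# shape: "who works for <ORG>?".

def question_to_sparql(question):
    m = re.match(r"who works for (.+)\?", question.strip(), re.IGNORECASE)
    if not m:
        raise ValueError("question shape not recognized")
    org = m.group(1).strip()
    return (
        "PREFIX ex: <https://example.com/ont/>\n"
        "SELECT ?person WHERE {\n"
        "  ?person ex:worksFor ?org .\n"
        f'  ?org ex:name "{org}" .\n'
        "}"
    )

print(question_to_sparql("Who works for Acme Corp?"))
```

The returned SPARQL would then be executed against the triple store and the bindings returned as the answer.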

Rule-based systems

Most enterprises eventually need to employ rule-based systems for some of the complex decisions they make on a routine basis. There is an entire industry that caters to this need.

What someone realized at some point was that many complex business decisions were either being delegated to humans or hardcoded in procedural code. This was a ripe field for rule-based systems, which can codify the components of the decision process in a non-procedural way (i.e., as simple rules that can be combined in ways that often could not have been predicted).

What we have observed is that the “Business Rule” industry has essentially created subroutines for key business applications. Take, for example, a complex application like insurance underwriting. The core system takes in data (e.g., demographics about the person being insured or their claims history) and then applies “rules” to the data to arrive at a decision. Because of the way these systems have been architected, the rule system is handed a set of data, grinds through its rules to generate an answer, and returns that answer to the main program, which then determines what to do with the advice.

This makes the rule system totally subservient to the vocabulary and API of the calling application. The rule system ends up being captured by the system that invokes it, which makes it very un-portable. Further, the rule system can take no action on its own; it merely returns its conclusion to the calling program, and typically whatever it concluded has no future beyond that single call.

A rule system in a Data-Centric architecture is a peer. This cuts two ways. First, the rule system need not be invoked by any particular process (it can be triggered, or it can run periodically). More importantly, rather than just returning an answer to the program that called it, a Data-Centric rule engine persists any conclusion it comes to as conforming triples in the triple store it is attached to. A rule engine that calculates economic order quantities would not be a subroutine to a purchasing application; it would be an independent process that periodically puts data about economic order quantities back into the Data-Centric store. It might calculate, based on purchasing and freight charges, that pink pearl erasers should be purchased in lots of 100. This data would be in the triple store, attached to pink pearl erasers. Another rule system might independently determine when an order should be placed based on demand patterns, inventory levels, and the new economic order size.
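A minimal sketch of such a peer rule engine, using the classic economic-order-quantity formula EOQ = sqrt(2 × annual demand × order cost / holding cost) and an in-memory set of tuples in place of a real triple store. All of the ex: names and the cost figures are hypothetical.

```python
from math import sqrt

# Sketch of a Data-Centric rule engine as a peer: it reads demand and
# cost facts from the store, applies the EOQ formula, and persists its
# conclusion back into the store as a new triple, rather than returning
# an answer to a calling program.

def eoq_rule(store):
    """Run periodically; writes ex:economicOrderQty triples to the store."""
    facts = {}
    for s, p, o in store:
        facts.setdefault(s, {})[p] = o
    for item, f in facts.items():
        if {"ex:annualDemand", "ex:orderCost", "ex:holdingCost"} <= f.keys():
            qty = round(sqrt(2 * f["ex:annualDemand"] * f["ex:orderCost"]
                             / f["ex:holdingCost"]))
            store.add((item, "ex:economicOrderQty", qty))

store = {
    ("ex:PinkPearlEraser", "ex:annualDemand", 1200),
    ("ex:PinkPearlEraser", "ex:orderCost", 50),
    ("ex:PinkPearlEraser", "ex:holdingCost", 12),
}
eoq_rule(store)
print(("ex:PinkPearlEraser", "ex:economicOrderQty", 100) in store)  # True
```

With these illustrative numbers, the engine concludes that pink pearl erasers should be bought in lots of 100, and that conclusion now lives in the store where any other process (such as the reordering rule) can use it.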

There is a great role for business rule systems in the Data-Centric architecture, but not as most of them are currently configured.

Machine learning

Machine Learning (ML) has captured a great deal of attention of late. Systems that can learn a game in the morning and be world-class competitors that same day capture the imagination.

ML works by processing huge training datasets. Its algorithms pore over the data and, using mostly statistical techniques, mimic certain aspects of human thought. But most firms don’t have massive datasets, and when they do, the data is very rarely annotated well. This is one area where it can help to embrace Semantic Technology. There is a subdiscipline of ML called transfer learning,57 which allows ML to be much more effective with smaller datasets.

The basic idea of transfer learning is rather than use brute force statistical approaches for every problem, sometimes we can use what we’ve learned in a related domain or problem. Humans do this all the time. We are finding that conforming a dataset to an elegant Data-Centric model reduces the degree of complexity that an ML algorithm needs to deal with. It is complexity that drives the need for larger datasets. Additionally, the core data model provides a narrowing and reduction in the hypotheses possible, further reducing the need for large training sets.

The number one reason that established firms are reluctant to adopt ML is its “black box-ness.” While ML has had a number of notable successes, how it comes to its conclusions is, in almost all implementations, opaque. Very few ML systems can introspect or explain how they came to their conclusions. This has led to the dark side of ML: algorithms that reinforce stereotypes, that institutionalize bias, and the famous Microsoft chatbot that rapidly became a profane misogynist.

The Semantic Technology aspect of Data-Centric has a rich history with the “explain function.” It was called “proof” in the standards, referring to the mathematical discipline of showing your work as “proof.” In Semantic Technology, all inferences are backed up with an explain function that describes how the system came to that conclusion. We think this rich history can be grafted onto the machine learning issues around helping people introspect the models.

Microservices

The Microservices Architecture movement marries the original goals of object orientation with the web’s RESTful architecture.

The original idea of object-oriented architecture was that there would be many self-contained objects coordinating with each other to get work done. Unfortunately, virtually all implementations of this idea required the interacting objects to be compiled into a single program. A single program of coordinated, self-contained objects is still monolithic, and many of the early object-oriented systems were exactly that.

While object orientation was fading in importance (although it never really went away, and most development now is object-oriented), the web was in the ascendant. The web is built on a very decentralized and loosely coupled model. Reverse engineering why the web was so successful in his 2000 Ph.D. dissertation,58 Roy Fielding described the web’s primary architectural style as REST, which stands for REpresentational State Transfer. Essentially, the web is a bunch of loosely connected endpoints that adhere to a very small number of methods, such as GET, PUT, and POST.

The microservices movement recognized that many of the fine-grained functions that object orientation wanted to share could be shared at a much broader scale. Instead of limiting the sharing to modules that were compiled together, microservices exposed them as RESTful dynamic calls that anyone (typically within an enterprise) could invoke.

We believe that in a well-designed architecture, there will be many dozens to hundreds (but not thousands) of microservices that could be consumed by any use case or process. A service might be “send a message,” “convert an address to geocodes,” or “convert a measure from one unit to another.”
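As a sketch, the core of the last of those services might be nothing more than a pure function plus a conversion table. The table here is a small, illustrative subset; a real deployment would put this logic behind a RESTful GET endpoint.

```python
# Core logic of a "convert a measure from one unit to another"
# microservice, shown as a pure function. Each unit maps to its
# equivalent in meters, so any pair converts through that base unit.

TO_METERS = {"m": 1.0, "km": 1000.0, "ft": 0.3048, "mi": 1609.344}

def convert_length(value, from_unit, to_unit):
    if from_unit not in TO_METERS or to_unit not in TO_METERS:
        raise ValueError("unsupported unit")
    return value * TO_METERS[from_unit] / TO_METERS[to_unit]

print(convert_length(2, "km", "m"))   # 2000.0
```

Because the function is stateless, it is exactly the kind of fine-grained capability that can be shared enterprise-wide rather than compiled into each application.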

These services could play a key role in a Data-Centric architecture. The impressive thing is that there are relatively few of them that would cover most of the landscape of services needed for an enterprise.

Kafka

Kafka is an open source messaging approach that is gaining considerable momentum in enterprises these days.

Many architects are looking at it as one more chance to seize an opportunity that slipped by in the SOA/ESB trend of the last couple of decades. SOA (Service-Oriented Architecture) and ESB (Enterprise Service Bus) were more or less synonyms. Each advocated a peer-to-peer relationship between applications, mediated through messages managed in queues.

At the time (the 1990s and early 2000s), it was felt that it would be unworkable to have enterprise applications communicate directly with each other through cross-network APIs. Instead, it was believed that a system of managed queues was needed. Changes in one system could be packaged into a message and shipped off to any other system that wished to subscribe. The publishing application need not know how many other applications were interested in (and therefore subscribing to) the changes it published. Each consuming application had a queue and consumed the messages at its leisure. The publishing application was unaware of, and certainly was not going to wait on, the eventual downstream consumption of the messages.

This was actually a pretty good architecture, and many products were sold that supported it. The reason the concept did not take hold was that instead of creating a common model and common messages that all applications were obligated to conform to, most enterprises took the lazy way out and allowed each application to define its own messages for the ESB. They took the APIs they already had, expressed in their own local dialects and structures, and published them to the ESB. This meant that any consumer of messages on the ESB had to understand and deal with however many endpoints were publishing similar information. While this homogenized the delivery format, it did nothing for the unconstrained variety that existed in the many source applications. Most firms eventually gave up on SOA and ESB because it did not deliver on its promise; due to this lack of discipline, it never addressed the data variety problem.

Kafka is the new approach to message-based coordination between applications. It offers many things that SOA and ESB did not. It is free, stream-based, and much more in line with internet approaches, including a compressed format (Avro) that combines self-describing data with smaller payloads.

We believe that Kafka is an opportunity to get inter-application messaging right, in the same way that SOA and ESB were. Yet most Kafka implementations will fall far short of their potential: they will work the way SOA and ESB implementations worked, and they will similarly fail to reduce inter-application complexity.

The big opportunity with Data-Centric methodology and Kafka is to use the core model as the lingua franca for the Kafka messages. Firms that base their Kafka messages on a shared core model will find the elusive benefits.
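A minimal sketch of the lingua-franca idea: every published message must conform to one canonical schema rather than an application's local dialect. The event and field names below are hypothetical, and a real deployment would express the schema in Avro against a schema registry rather than as a Python dict.

```python
# Sketch of canonical-message validation: a message conforms only if it
# uses exactly the shared core model's fields and types. Contrast the
# conforming message with one in a local application dialect.

CANONICAL_PERSON_EVENT = {"eventType": str, "personId": str,
                          "occurredAt": str}

def conforms(message, schema=CANONICAL_PERSON_EVENT):
    return (set(message) == set(schema)
            and all(isinstance(message[k], t) for k, t in schema.items()))

good = {"eventType": "PersonHired", "personId": "ex:p42",
        "occurredAt": "2020-01-15T09:00:00Z"}
bad = {"EMP_NO": "42", "ACTN_CD": "HIR"}  # a local application dialect
print(conforms(good), conforms(bad))  # True False
```

Enforcing a gate like this at publish time is the discipline that the ESB era skipped.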

Internet of things

The Internet of Things (IoT) is the term for the next wave of connectivity, in which every smart device has its own IP address. There are smart controllers in factories, cars, and appliances, but most of them are not connected to the internet. They are primarily connected to very local networks, if they are on any sort of network at all.

This is changing rapidly. One of the more visible examples is devices such as the Nest thermostat, now owned by Google, which make some home environment management functions available via the internet. This is poised to explode. Projections are all over the place (billions to trillions of connected devices), but the consensus seems to be that there are already billions of devices connected and that in the next 3-4 years there will be tens of billions.59

The variety and complexity of these devices is staggering, and runaway data complexity is therefore a real possibility. The good news is that many standards organizations are working on standardizing the interfaces. The bad news is that many standards are evolving, which will, at least in the medium term, contribute to complexity.

Individual organizations can take control of this complexity, without waiting, by adopting some Data-Centric approaches. While there is an almost endless number of device types that will be connected, there is a fairly small number of things they will do. IoT devices are sensors, actuators, or both. That is, they can either read out some value in the real world or respond to an internet-directed signal to change something in the real world. There are many dozens of things a sensor could detect, but depending on your organization, you may only be interested in a small subset.

The Nest, for instance, is a sensor that primarily detects temperature. Its primary actuator turns a furnace or air conditioner on. It has additional sensor capabilities, such as detecting motion, and additional actuators that can be custom implemented, such as those that can turn on a light or open a window.

In an enterprise setting, users may be more interested in greater detail (including electric current fluctuations, humidity, or the presence of various other chemicals). Actuators will break circuits, open gates or doors, or send alerts.

Instead of seeing the IoT as a whole new system, the Data-Centric approach sees how it fits in with and extends the broader business model. A Data-Centric approach to IoT recognizes that the devices in the IoT network are the same devices that are in the asset inventory. The buildings in the IoT are the same buildings that are in the facilities systems, HR systems, and many others. In other words, Data-Centric development is an opportunity to prevent new “silo-ization” of data.
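The two device roles, and the reuse of shared identities from the rest of the business model, can be sketched as follows. All names here are illustrative.

```python
# Sketch of the small number of things IoT devices do: read a value
# (Sensor) or change something (Actuator). Each device carries the same
# identifier it has in the asset inventory, so no new silo is created.

class Sensor:
    def __init__(self, asset_iri, quantity, read_fn):
        self.asset_iri = asset_iri   # same IRI as in the asset inventory
        self.quantity = quantity     # e.g., "temperature"
        self._read = read_fn

    def read(self):
        """Return the observation as a triple-like tuple."""
        return (self.asset_iri, "ex:observes-" + self.quantity, self._read())

class Actuator:
    def __init__(self, asset_iri, act_fn):
        self.asset_iri = asset_iri
        self._act = act_fn

    def actuate(self, command):
        return self._act(command)

thermostat = Sensor("ex:Building7-Thermostat", "temperature", lambda: 21.5)
furnace = Actuator("ex:Building7-Furnace", lambda cmd: f"furnace {cmd}")
print(thermostat.read())
print(furnace.actuate("on"))
```

Because the observations come out as triples keyed by shared identifiers, they land directly in the same store as the facilities and asset data.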

Smart contracts

Blockchain is all the rage now. Unfortunately, we haven’t found a Data-Centric use case in the most visible of the blockchain implementations: the cryptocurrency marketplace. There may be a play there; it just hasn’t occurred to us yet.

What has occurred to us is another use case for blockchain: smart contracts. A smart contract is one in which key provisions of the contract are coded in a way that can be executed automatically when the stipulated events occur.

In the absence of a shared data model, the participants in a smart contract are reduced to agreeing on what some code will do. By creating a shared model of contract provisions, the participants can define the specifics of the contract parametrically (essentially a model-driven contract), commit it to the blockchain, and have the consequences of the contract executed when the triggers occur.
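A sketch of one parametric, model-driven provision and a generic executor. The field names are invented for the example, and a real smart contract would commit the provision to a blockchain and post the payment to a real ledger, not an in-memory list.

```python
# A provision is just data conforming to a shared model of contract
# provisions; a generic executor fires it when the stipulated event
# occurs. No bespoke code per contract.

provision = {
    "trigger": {"event": "DeliveryConfirmed", "shipment": "ex:shp-881"},
    "action": {"pay": "ex:SupplierCo", "amountUsd": 10000},
}

def execute_if_triggered(provision, event, ledger):
    """Generic executor: compare the event to the provision's trigger."""
    if event == provision["trigger"]:
        ledger.append(provision["action"])
        return True
    return False

ledger = []
fired = execute_if_triggered(
    provision,
    {"event": "DeliveryConfirmed", "shipment": "ex:shp-881"},
    ledger)
print(fired, ledger)
```

The participants agree on the model and the parameters, not on opaque code, which is what makes the contract auditable by both sides.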

Chapter Summary

New technology will continue its relentless march. If we are not intentional about it, new technology initiatives will compete for funding and attention that could be applied to moving the Data-Centric agenda forward.

Luckily, pretty much all the trending new technology initiatives can be turbocharged with Semantic Technology and Data-Centric principles. Our recommended strategy is not to compete with these initiatives for funding but to use their funding to advance the Data-Centric initiative.

Depending on when you are reading this, there may be newer emerging technologies, but a lot of the same thought process should still apply. Depending on which initiative is getting funded, you may have slightly different strategies.

Some initiatives, such as the move to the cloud, are major projects that will touch many or most of your applications. Most firms want to use the move to improve their overall economics, and they know, at some level, that a mere “lift and shift” from on-premises to the cloud isn’t where most of the value will be gained. The act of rewriting applications is a great opportunity to get at least some of them right in the transition.

Other technology initiatives gain direct benefit from Data-Centric methods; combining them with Data-Centric practices will generally lower their overall cost. Big data and data lakes benefit from adding Data-Centric approaches and thereby aiding data scientists and analysts in finding and organizing the data they are dealing with. Rule-based systems, coupled with Data-Centric, benefit directly by reducing the number of terms that the rules must be written to; NLP and Machine Learning benefit by creating targets for their extraction and by reducing the size of the training sets needed. Smart contracts and IoT benefit from the rationalization of contract terms and reducing the multitude of sensors and actuators to a few simple types. Finally, message and API-related technologies, like Kafka and microservices, benefit from reducing the surface area of their interfaces.
