CHAPTER 10
Case Studies with Semantic Technology

Before we move on to explore additional facets of the Data-Centric revolution, we’ll take this opportunity to present a few more case studies. These examples will help clarify and illuminate the important relationship between Semantic Technology and Data-Centric systems.

Garlik

Garlik, a UK-based startup, was founded in 2005 to provide services that protect consumers from identity theft and fraud. According to Steve Harris, Garlik was Data-Centric from Day Zero: from the get-go, they employed a single, simple, shared model of all their data.

Steve was the CTO of Garlik from its early days to its ultimate sale to Experian. Because the company was Data-Centric from the start, Garlik was one of the few companies with no legacy systems to deal with. Unlike the subjects of the other case studies here, they had no “negative inertia” or status quo to overcome.

I asked Steve how he had found his clarity and certainty regarding the Data-Centric approach. Prior to Garlik, Steve had been a researcher at the University of Southampton. During his ten years there he developed two insights that have served him well.

The first came from his work on research projects. He observed that there are two primary approaches to building the analytics a research project needs: start with the required analytics and develop a dataset to support them, or start with a fundamental model of the research domain. What he discovered was that while the initial effort was comparable in both approaches, only the latter withstood the inevitable changes that come mid-project.

The second insight followed from the first: when he came across the semantic web, he became convinced of the role it could play in enabling the Data-Centric approach.

In 2005, the opportunity to apply these insights in a commercial venture presented itself. The founders secured funding, and their initial concept was a Business-to-Consumer (B2C) identity theft protection product called Data Patrol. It allowed people to subscribe to a service that would detect whether any of their personal data had been compromised or was otherwise available anywhere on the web, including the dark web and other sites known for trafficking in stolen identity data.

Around this time, I attended a presentation by Steve’s boss, the founding CEO of Garlik, Tom Ilube. Tom had previously been the CIO of Egg, one of the early online banks in the UK. In that capacity, he learned a great deal about how traditional IT systems work. Tom had a very interesting observation on the efficacy of using Semantic Technology to implement a Data-Centric approach, captured in a Q&A session at the Semantic Technology Conference, May 20-24, 2007.

What we’re finding at the moment, in terms of today’s benefit, is that it is enabling us to harvest and extract information in a very focused way. So, if we understand more about the semantics of what we are interested in, we can focus our effort more on what we harvest in a much tighter way than we otherwise would have been able to. Having these explicit and flexible models I believe—I have to do a little bit more work on this, but I do know—that it is absolutely true, when I was running the technology in the banking environment [Egg], every time anyone came along and said we want to add a new field, or add any new data to our databases, it was a huge issue, it was a huge conversation. The database crew would all get together and put it off six months because no one wanted to mess with their databases, because if you did mess with your databases all your applications got screwed up and all the rest of it. So, it was a really big ‘leave this alone.’ Where in this environment, where I am now [Garlik] we’re sort of hungry for new information sources when we find some new information source, we figure out how to extract the information we’re interested in, and we throw it at our ‘smushing’ bit of the architecture, that puts it into the database and we’re away. It just doesn’t seem to be anything like the issue that it was in the relational database environment.

Tom Ilube, CEO, Garlik

In the early days of Garlik, triple stores were still in their infancy. So, in addition to building their application, the Garlik team took on the additional task of building their own standards-compliant triple store, “4store,” followed by “5store” (the inside joke being that all triple stores are actually built around tuples of at least four parts: subject, predicate, object, and named graph). 4store was one of the first triple stores to cross the billion-triple threshold, and 5store routinely processed tens of billions of triples in support of the Data Patrol product.
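
To make the inside joke concrete, here is a minimal sketch of the quad model using the open-source Python library rdflib (chosen here purely for illustration; Garlik’s 4store exposed the same model through SPARQL rather than through this API). Every statement carries a fourth element naming the graph it belongs to, which is what lets a store keep each harvested source separately addressable:

    # Minimal sketch of why "triple" stores really hold quads.
    # Uses the open-source rdflib library, not Garlik's 4store.
    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")
    ds = Dataset()

    # Each harvested source can land in its own named graph...
    feed = ds.graph(URIRef("http://example.org/graphs/darkweb-feed"))

    # ...while the statement itself is still subject / predicate / object.
    feed.add((EX.record42, EX.mentionsEmail, Literal("alice@example.com")))

    # Iterating the dataset yields all four parts: s, p, o, and graph.
    for s, p, o, g in ds.quads((None, None, None, None)):
        print(s, p, o, g)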

Playing into Steve’s observation that the Data-Centric approach really comes into its own in times of change, Garlik found that offering subscriptions for this type of service directly to consumers was pretty much a non-starter. They shifted to a B2B2C model and refocused on anti-fraud rather than identity theft; their primary customers became banks and other financial institutions. The banks in turn offered this as a service to their customers.

This was a classic pivot for a young, lean company. Huge parts of the application had to be changed: to integrate with each bank’s systems, to present a different look and feel, to process new classes of data, and to adopt the equivalent of a multi-tenant environment. Again, the Data-Centric approach was up to the task.

Garlik was a major supplier of privacy management services to the banking industry when, on December 23, 2011, it was acquired by Experian, the consumer credit company. Those of us involved with Semantic Technology assumed this was a way for Experian to get a foothold in this emerging technology and approach.

It seems, however, that gaining a foothold was not the primary motivation. While Experian has kept the Data Patrol product and architecture, offering it to clients through an API, the acquisition was apparently more about acquiring Garlik’s customer base than about making Experian’s thousands of systems Data-Centric. Though Experian was at first incredulous at the claim that new data sources could be onboarded in days rather than months, its sheer size seemed to prevent it from embracing the approach.

This is regrettable but understandable. As reported in Software Wasteland, most large companies have thousands of applications and have established practices, policies, and culture for dealing with them, despite their diseconomy. At the time of the acquisition, Experian was 500 times larger than Garlik, and it was highly unlikely that the tail would wag the dog.

After the acquisition, Steve left Experian and founded another startup, Aistemos. While Aistemos is in a very different business and has a different technology stack, Steve is testament to the old adage, “A mind, once expanded by a new idea, never returns to its original dimensions” (attributed to Oliver Wendell Holmes). Steve has taken his conviction for data-centrism with him to his new startup.

Montefiore

The Montefiore Medical Center is a 134-year-old integrated health care network based in the Bronx, NY. It is a $7 billion-a-year enterprise and one of the 50 largest employers in New York State; it comprises 11 hospitals, including the Albert Einstein teaching hospital, and treats 3 million patients. Montefiore has a long tradition of technical innovation and advancement. They were one of the first institutions to adopt an Electronic Medical Record (EMR) system (in the 1980s) and are leaders in applying patient-centered outcomes to their medical interventions.

History

The Data-Centric system at Montefiore, currently called the Patient-centered Analytic Learning Machine (PALM), had its roots in a research laboratory at the University of Texas Health Science Center at Houston. The central character in this story is Dr. Parsa Mirhaji, whom you could call a “triple threat”: MD, PhD, and ontologist. In 2009, Dr. Mirhaji established a research laboratory funded by the Telemedicine and Advanced Technology Research Center (TATRC), whose mission was to bring Semantic Technology to the field of medicine. During his tenure there, Dr. Mirhaji built ontologies and a semantic architecture for large-scale health data fusion and integration. His team built several point solutions for various medical and population health applications, demonstrating the power of semantics in disaster preparedness (during Hurricane Katrina), critical care (e.g., major trauma and transfusions), and clinical research. They received three patents and developed considerable intellectual property.

Dr. Mirhaji moved to New York and took on the position of Director of the Center for Health Data Innovations at Montefiore. In parallel with running the informatics practice, he took on the construction of a strategic information management plan for Montefiore. One of the plan’s tenets was the central role of data and the need for its integration and management as an enterprise asset. The plan laid the foundation for governance, architecture, competency, and a methodology to evaluate and measure the maturity of information management at Montefiore. One of its strategic aspects was to prioritize incremental and iterative progress (agile development) over large ‘bet the farm’ implementations. The plan also encouraged leveraging alternate funding sources and early, multi-stakeholder engagement wherever possible.

One such opportunity presented itself in 2014, when the Patient-Centered Outcomes Research Institute (PCORI) put out to bid a system that could consolidate, integrate, and aggregate patient data from all the major New York City health care providers. The project would be a great proving ground for many of the principles articulated in the strategic plan. It had to be implemented rapidly (there was an 18-month window). The volumes were considerable: six major medical centers, 20 million patient records, and 300 million encounters. And it had to navigate the competition that exists between health care institutions: they must cooperate to deliver coordinated care, yet they compete for a limited pool of resources and funds.

The project was a big success. It turned out to be a great learning lab for inter-organizational governance issues, such as how to share information without violating privacy or inter-organizational strategic sensitivities. Processing the 20 million patient records revealed that 10 million were duplicates. This is perhaps not as odd as it first sounds, given the proximity of the six health centers. A patient matching and deduplication algorithm had to be built across the entire datascape.
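
Montefiore’s actual matching algorithm has not been published, so the following is only a minimal sketch, with invented field names, of the “blocking” shape such deduplication algorithms typically take. It shows why 20 million records can be grouped into candidate matches without 20 million squared pairwise comparisons:

    # Illustrative sketch only: Montefiore's actual matching algorithm is
    # not published. Real patient matching weighs many more fields and
    # handles typos, transposed dates, and changed surnames probabilistically.
    from collections import defaultdict

    def blocking_key(rec):
        # Crude normalization: lowercase the surname, keep the birth date.
        return (rec["last"].strip().lower(), rec["dob"])

    def find_duplicate_groups(records):
        # "Blocking": only compare records that share a key, avoiding an
        # O(n^2) pairwise comparison across 20 million records.
        blocks = defaultdict(list)
        for rec in records:
            blocks[blocking_key(rec)].append(rec)
        # Any block with more than one record is a candidate duplicate set.
        return [grp for grp in blocks.values() if len(grp) > 1]

    patients = [
        {"last": "Rivera", "dob": "1961-04-02", "mrn": "A-100"},
        {"last": "rivera ", "dob": "1961-04-02", "mrn": "B-207"},  # same person, second hospital
        {"last": "Chen", "dob": "1988-11-30", "mrn": "A-318"},
    ]
    for group in find_duplicate_groups(patients):
        print([rec["mrn"] for rec in group])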

The success of this project encouraged senior management to invest further in internal capabilities. Intel became interested in the project and provided material and engineering support. The PCORI project itself was executed with traditional relational technology, as the other health centers had not yet embraced Semantic Technology or graph databases. Internally and in parallel, Parsa’s team was building out its own semantic database.

First, they unified medical knowledge (no small feat). There are hundreds of sources of medical knowledge, many of them freely available. Unfortunately, each has its own coding system (a mapping sketch follows the list):

  • ICD-9 (now ICD-10) codifies diseases.
  • CPT codifies medical procedures.
  • MeSH provides a thesaurus-style interface to published medical reports and articles.
  • LOINC provides standardized nomenclature for tests and findings.
  • SNOMED cross-references many of these with anatomy and provides rich relationships between many of the nomenclature systems.
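
To give a flavor of what unifying these coding systems means in practice, here is a minimal sketch using the open-source Python library rdflib and the W3C SKOS mapping vocabulary. The URIs and code values are illustrative assumptions, not Montefiore’s actual knowledge base:

    # Hedged sketch: the URIs and codes below are illustrative, not
    # Montefiore's actual knowledge base. SKOS mapping properties are a
    # standard W3C way to cross-reference coding systems in RDF.
    from rdflib import Graph, Namespace
    from rdflib.namespace import SKOS

    ICD10 = Namespace("http://example.org/icd10/")
    SNOMED = Namespace("http://example.org/snomed/")
    LOINC = Namespace("http://example.org/loinc/")

    g = Graph()
    # "This ICD-10 code and this SNOMED concept denote the same condition."
    g.add((ICD10["J96.0"], SKOS.exactMatch, SNOMED["65710008"]))
    # "This LOINC test is related evidence for that same concept."
    g.add((LOINC["19994-3"], SKOS.relatedMatch, SNOMED["65710008"]))

    # Once the mappings are data, applications can follow them mechanically
    # instead of each hard-coding its own translation table.
    for icd_code, _, snomed_code in g.triples((None, SKOS.exactMatch, None)):
        print(icd_code, "maps to", snomed_code)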

At this point, Montefiore became “knowledge-centric.” They had one of the most complete and cross-referenced bodies of medical knowledge extant. The accumulating knowledge base made it possible to begin bidding on grants that would have been difficult or impossible to win without a pre-organized body of knowledge. These grants are extremely competitive and carry very strenuous performance criteria. Montefiore bid on and won a key grant to predict respiratory failure and the need for prolonged intubation. Intubation is the insertion of a breathing tube through the mouth to maintain an open airway; “prolonged” means over 48 hours. It is an expensive and dangerous procedure, uncomfortable for the patient, and often involves complications. It also often saves patients’ lives.

The ability to more accurately determine if prolonged intubation is necessary is crucial. The earlier and more accurately it can be predicted, the more time clinicians have to prepare for it or to prevent it.

Combining their medical knowledge with excerpts from their patient database gave Montefiore the table stakes to receive an award to build a real-time analytic system to predict respiratory failure and prolonged intubation up to 48 hours in advance, and to provide clinicians with a patient-specific checklist of actions and orders that would improve outcomes and the experience of both patients and clinicians. The analytic system was built entirely on principles of semantic data integration and interpretation and has been live across Montefiore Medical Center hospitals since January 2017.

They have recently added a sepsis early-recognition model. Sepsis in this context is the development of internal infection, usually as a byproduct of medical procedures and hospitalization, and it is a leading cause of iatrogenic death. In addition to saving lives (generally a good thing), the sepsis module is saving resources. Before the sepsis model was built, Montefiore had 18 people dedicated to sepsis reporting (there are strict and exhaustive regulatory requirements on sepsis reporting from many agencies, including CMS and the NY State Department of Health). Despite the number of people involved (and perhaps because of it), sepsis reporting ran an average of nine months behind. Much of the delay was in the back and forth between the hospital and the regulators over irregularities or discrepancies in the reported information.

After the implementation of the new sepsis model, which operates in near real time, Montefiore received a letter from the NY State Department of Health informing them that, for the first time since the inception of the program, the reports had zero discrepancies. Many of the 18 people working on sepsis reporting have been redeployed to other projects.

Current and future initiatives

Montefiore has now loaded the lifetime records of all its patients into the system, which was once called the Semantic Data Lake but is now known as the Patient-centered Analytic Learning Machine (PALM).

The PALM system bumps near real-time patient data against the medical knowledge database, looking for anomalous conditions that might otherwise be overlooked. When a pattern is recognized, an alert is sent to the EMR system. As with all other health care institutions, providers experience alert overload, so PALM tries to be judicious and smart with its signaling. Currently, it alerts and tracks, with the intention, over time, of learning which alerts are acted on or ignored, and which provide the greatest improvement in patient outcomes or provider acceptance. This incremental and automated learning is another feature of PALM.
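
PALM’s internals are proprietary, so what follows is only a conceptual sketch of what such an alerting loop amounts to: evaluate knowledge-driven patterns against near real-time patient data, and suppress alert types that feedback shows clinicians routinely ignore. All names and thresholds here are invented:

    # Conceptual sketch only: PALM's actual architecture is proprietary,
    # and all names and thresholds here are invented.
    def evaluate_patterns(patient, patterns, alert_stats, min_action_rate=0.2):
        alerts = []
        for pattern in patterns:
            if not pattern["matches"](patient):
                continue
            # Learn from feedback: once enough alerts of this type have
            # been sent, suppress those that are rarely acted on.
            stats = alert_stats.get(pattern["name"], {"sent": 0, "acted_on": 0})
            if stats["sent"] > 50 and stats["acted_on"] / stats["sent"] < min_action_rate:
                continue
            alerts.append(pattern["name"])
        return alerts

    patterns = [{
        "name": "possible-sepsis",
        # Toy rule; real models fuse labs, vitals, and curated knowledge.
        "matches": lambda p: p["temp_c"] > 38.3 and p["heart_rate"] > 90,
    }]
    print(evaluate_patterns({"temp_c": 39.1, "heart_rate": 104}, patterns, alert_stats={}))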

The Montefiore team is starting to move into its next phase, which will take the system from an analytics-only platform to one that actively participates in the care delivery process. Some of the initial use cases include managing patient flow through the emergency department, managing out-patient appointments, and scheduling. The team is also working on anticipating admissions and triaging (and staffing accordingly), preventing readmissions, automating the review of billing claims, detecting fraud, and more complex clinical applications that require the fusion of different modalities of data (images + free text + structured EMR data) for learning and inferencing.

As I mention in Software Wasteland, sometimes the world serves up “natural experiments” that are the closest thing we get in politics, business, and social systems to the controlled experiments of the hard sciences. In this case, the natural experiment was Montefiore’s efforts versus an effort undertaken at MD Anderson.

Montefiore’s project has averaged a team of six over five years, supplemented with consultants and engineering help from Franz Inc. (maker of the AllegroGraph triple store). It seems unlikely to me that the total cost over the five years exceeded $10 million. More interestingly, the net cost was far less when you consider the grants that funded much of the work, as well as the in-kind investments in the endeavor by Franz, Cisco, and Intel.

MD Anderson launched the much-publicized “Watson goes to Medical School” project. Like medical school for humans, this was not inexpensive: the total tab for Watson’s matriculation was $62 million. Apparently, Watson flunked out of medical school, as the project was cancelled with nothing implemented.

We don’t know whether the MD Anderson project attempted a Data-Centric approach and failed. We do know that Montefiore adopted a hybrid “Data-Centric plus incremental” approach. Montefiore underspent MD Anderson by at least a factor of ten, and they out-delivered by an even greater margin: by my assessment, they achieved much of what MD Anderson set out to do and have an implemented system to show for it.

Chapter Summary

These case studies, coupled with many more that we are familiar with but were not able to get permission to publish, tell us several things. The first is an existence proof: it is possible to build a Data-Centric enterprise with Semantic Technology.

The second thing these case studies teach us is that leaning into Semantic Technology can help the mental transition that is needed to embrace radical simplicity.

These case studies also demonstrate how semantics and graph databases make the evolution of data structures simple and seamless.
