Chapter 1. Introduction to Healthcare Data

Healthcare data is an exciting vertical for data science, and there are tremendous opportunities. However, there are also a lot of misconceptions about working with healthcare data. Those with extensive experience in enterprise environments tend to underestimate the complexity, often comparing healthcare real-world data projects to enterprise integrations. This is not to say that a typical enterprise data project is simple or easy. One of the major differences is how and why the data were captured relative to the actual work being done.

In nearly every industry, the use of data today is a function of engineered systems. In other words, the majority of the data are generated by software systems rather than collected and entered by humans. For example, in advertising and marketing analytics, the data are generated by websites that track clicks and impressions.

This chapter will walk you through some of the nuances and complexities of healthcare data. Much of this complexity is a reflection of the delivery of healthcare itself — it is just really complicated!

For those with a traditional IT background or who have worked in large companies dealing with complex data issues, we will start with a little discussion of the enterprise mindset and how you might frame healthcare data. After this, we will dive into a broader view of the complexities of healthcare data. Once this foundation has been set, you will get a broad overview of common sources of healthcare data.

The Enterprise Mindset

The data science industry has had many successes — both from companies using data science as well as from those creating new data science methods. When leveraging data science, most organizations have the benefit of following the traditional enterprise mindset. Information and data architects within the organization can sit down together, discuss the various sources of data and intended use cases, then craft an overarching information model and architecture.

Part of this process typically involves getting various stakeholders together into a single room to agree on how best to define individual nuggets of data or information. Until recently, this has been the approach that most companies have taken when trying to build data warehouses. The challenge in healthcare is that the sources of data operate in disconnected silos. When a patient enters the healthcare system, they typically do so via their primary care physician, urgent care, or the emergency department.

Naturally, one might say we should start here in order to create the information model that will be used to represent healthcare data. After all, nearly everything else flows downstream from the moment a patient makes an appointment or shows up to urgent care. Insurance companies or governments will need to reimburse hospitals for providing care; physicians will prescribe medications and companion diagnostics from the biopharma industry.

So, information architects can start by defining the idea of a patient and all of the associated data elements, such as demographics, medical history, and medication prescription history. However, there are already potential issues when defining an idea as “simple” as a patient once one starts to look at the healthcare industry overall. How does an insurance company think of patients? At least in the United States, insurance companies typically think of people as covered lives, not patients. While some may be quick to say that insurance companies are dehumanizing people and thinking only of statistics, it largely comes down to the operational aspects of tracking benefits.

While the person who seeks care is the patient, they may be a beneficiary of someone else who is not the patient. For example, if a child goes to urgent care for some stitches after falling at the playground, they are the patient as far as the clinic is concerned. However, to the insurance company, there are two people that need to be tracked — the child as well as the parent or guardian who is the insurance policy holder. The insurance company must track these two individuals as different people for the claim, though they are obviously related.

The potential of graphs

The above example highlights how graphs can be particularly useful when dealing with complex data. A single claim can be created as a node that is connected to two other nodes, one each for the child and their parent/guardian.

Even this relatively “simple” example can become complex quickly, and there will be similar examples throughout this book that highlight the hidden complexities of data.
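The claim-and-two-people example can be sketched as a tiny property graph using plain Python dictionaries. The node identifiers, names, and edge labels here are all hypothetical, chosen only to illustrate the idea of a claim node connected to two distinct person nodes.

```python
# A minimal sketch of the claim-as-graph idea using plain Python
# dictionaries; node identifiers and edge labels are hypothetical.
nodes = {
    "claim:1001":    {"type": "Claim", "amount": 350.00},
    "person:child":  {"type": "Person", "name": "Jamie"},
    "person:parent": {"type": "Person", "name": "Alex"},
}

# Directed edges: (source, relationship, target)
edges = [
    ("claim:1001", "PATIENT", "person:child"),
    ("claim:1001", "POLICY_HOLDER", "person:parent"),
    ("person:parent", "GUARDIAN_OF", "person:child"),
]

def neighbors(node_id, relationship):
    """Return the targets of all edges of a given type leaving node_id."""
    return [t for (s, r, t) in edges if s == node_id and r == relationship]

# The same claim resolves to two different people depending on whether
# we follow the PATIENT or the POLICY_HOLDER relationship.
print(neighbors("claim:1001", "PATIENT"))        # ['person:child']
print(neighbors("claim:1001", "POLICY_HOLDER"))  # ['person:parent']
```

A dedicated graph database would add indexing, traversal languages, and persistence, but the underlying model is the same: entities as nodes, relationships as labeled edges.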

Those who are veterans in the data space will likely see the parallels between the above discussion and challenges in setting up data warehouses. While an organization may be united in its overall product, each business unit or department within the organization may have very different views of a customer or user. Anecdotally, it can take upwards of 18 months to get all of the stakeholders to agree on a common understanding of concepts that make up an organization’s enterprise information architecture.

A relatively new approach that is starting to gain some traction is a data mesh.1 The data mesh approach is not a technological solution or silver bullet when dealing with increasingly complex data challenges. Instead, it offers an alternative approach and perspective when designing data architectures. At the center is the idea of a data product, combining product thinking with data ownership and governance.

The Complexity of Healthcare Data

Data meshes are just one example of the changing landscape when it comes to data — there have already been evolutions from databases to data warehouses to data lakes — and the evolution will continue. This is, however, an example of the shifting thinking around data, highlighting a key aspect of healthcare data.

Organizations and their leadership are starting to realize that there is a lot of potentially useful data that live outside of traditional systems such as the databases and applications that support sales or manufacturing. As organizations struggle to keep up with an increasingly heterogeneous data landscape, ideas such as the data mesh are invented to try to address the shortcomings of existing approaches.

In healthcare, however, these complexities have always been a rate-limiting factor. Administrators are not just starting to see value in data and, as a result, trying to find scalable and repeatable ways to link data from the various service lines within a hospital. Nor are pharmaceutical companies just realizing that there is a tremendous amount of value in claims data from a medical affairs perspective.

Consequently, healthcare has been struggling to find ways to bring disparate data sources together, while still balancing critical issues such as security and privacy, governance, and (most importantly from the perspective of data scientists) normalization and harmonization of the data. Chapter 3 will go into more detail, but data scientists typically think of normalization in the context of statistics or machine learning algorithms (e.g., min-max scaling, zero mean and unit variance, etc.). While that definition is certainly true and applicable to healthcare data, there is an additional element of normalization.

Sources of Healthcare Data

Electronic Health Records

When people think about healthcare data, one of the most common sources that comes to mind is the electronic health record (EHR). You have probably heard of Epic2 or Cerner3, two of the most common EHRs used in the United States. There are many other commercial EHR vendors as well as open source projects, some focused on the United States and others more international. While I will list a few common ones, the main focus of this section is to provide you with an introduction to the data typically captured in an EHR and how we need to approach it as data engineers and data scientists.

Electronic Health Record (EHR) vs. Electronic Medical Record (EMR)

You may be wondering — what is the difference between an EHR and an EMR? NextGen, an ambulatory EHR,4 differentiates an EMR as a patient’s record in a single institution and an EHR as a comprehensive record across multiple sources.5 The Office of the National Coordinator for Health IT (ONC) provides a slightly more specific definition: an EHR is a comprehensive record from all of the clinicians involved in a patient’s care.6

In practice, you may find various definitions of EHRs versus EMRs. One common distinction is that an EMR is primarily used by clinicians as a replacement for the paper chart within a single institution, while an EHR is a more comprehensive record of a patient and may include data from multiple sources.

Despite this distinction, it has been my experience that the terms are used interchangeably to mean the same thing. As I will continue to mention over and over (especially in Chapter 3), it is important to make sure that everyone is using the same definition. When you are discussing EHRs or EMRs, be sure to clarify whether you are discussing records from a single institution or from multiple institutions. Throughout this book, I will use the term EHR to mean a patient’s clinical record from one or more sources, but will also highlight situations where more specificity is necessary.

EHRs and Data Harmonization

If I had to choose a single theme to describe this book, it would be this idea of data harmonization. Data harmonization and interoperability are closely related and often used interchangeably. In this book, I use interoperability to describe the sharing of healthcare data from a transactional perspective. In other words, how can we share data about a patient as part of the patient journey or care process. For example, when a patient is admitted to the Emergency Department, how do we transfer information about their visit back to their primary care physician? Or, if the patient is referred to a specialist, how is their data sent to their new physician?

Figure 1-1. Interoperability vs. Harmonization

On the other hand, I use data harmonization to refer to the process of integrating data from multiple sources in such a way that as much of the underlying context is preserved and meaning of the data is normalized across different datasets. For example, as a pharmaceutical company, I may want to combine a claims dataset from a payer with a dataset from a hospital EHR. How can I ensure that the medications in both datasets are normalized such that a simple query such as “I want all patients who received a platinum-based chemotherapy” returns the correct patients from the EHR dataset and the correct claims from the payer dataset?

Both interoperability and data harmonization are big challenges in healthcare. There is also a lot of overlap in the underlying issues and associated solutions. For example, whether one is transmitting a list of a patient’s medication history from one hospital to the next, or trying to combine two datasets, the solution may be to link the medications to a standard coding system such as RxNorm7 or National Drug Codes.8
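As a hypothetical sketch of what linking to a standard coding system buys us, the snippet below normalizes free-text medication names from an EHR extract and a claims extract to shared codes before querying for platinum-based chemotherapy. The code values are illustrative placeholders, not verified RxNorm identifiers, and a real pipeline would use a terminology service rather than a hand-built lookup table.

```python
# Hypothetical mapping from local medication strings to a shared code.
# The CODE_* values are placeholders, not real RxNorm RxCUIs.
LOCAL_TO_STANDARD = {
    "cisplatin 50mg inj":  "CODE_CISPLATIN",
    "CISPLATIN (IV)":      "CODE_CISPLATIN",
    "carboplatin":         "CODE_CARBOPLATIN",
    "metformin 500 mg tab": "CODE_METFORMIN",
}

# The codes that make up our "platinum-based chemotherapy" concept.
PLATINUM_CODES = {"CODE_CISPLATIN", "CODE_CARBOPLATIN", "CODE_OXALIPLATIN"}

ehr_meds = [("patient-a", "cisplatin 50mg inj"),
            ("patient-b", "metformin 500 mg tab")]
claims_meds = [("claim-1", "CISPLATIN (IV)"),
               ("claim-2", "carboplatin")]

def platinum_hits(records):
    """Return the ids of records whose medication normalizes to a platinum agent."""
    return [rid for rid, raw in records
            if LOCAL_TO_STANDARD.get(raw) in PLATINUM_CODES]

# The same conceptual query now works against both datasets.
print(platinum_hits(ehr_meds))     # ['patient-a']
print(platinum_hits(claims_meds))  # ['claim-1', 'claim-2']
```

The point is that once both datasets speak the same code system, a single concept-level query replaces dataset-specific string matching.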

For the purpose of this book, the key issue is that how you interpret the data largely depends on why the data were collected, who collected the data, and any workflow or user experience limitations during the collection process — what I collectively refer to as context. As data engineers and data scientists, we do not always have access to all of the necessary context and we oftentimes need to make a best guess. However, it is important that we at least ask ourselves how the data might have been affected.

One of the most commonly highlighted examples of the challenges with data harmonization of EHR data is the use of ICD codes within a patient’s clinical record. ICD codes9 are a common method for tracking diagnoses within a patient’s clinical record — at least this is the high-level intention and assumption.

However, to dig a little deeper, we need to ask a few questions:

  1. How are the ICD codes actually used?

  2. Why ICD codes (versus any other type of code or coding system)?

  3. Who assigns the codes to a patient’s clinical record?

These questions may seem obvious, but they remind us that not every organization approaches data the same way. In the United States, ICD codes in an EHR are typically used as part of the billing process and are entered into the system by medical billing specialists, not physicians or nurses. This impacts our analyses because we need to keep in mind that the codes may be used to justify other parts of an insurance claim or to satisfy internal reporting requirements, and may not be intended to accurately document a patient’s medical history.

Example 1-1. Diabetes Screening Scenario

A patient may go to their primary care physician for a routine checkup. During the encounter, a physician suspects that their patient may have diabetes and decides to order a hemoglobin A1c test. When the billing specialist reviews this encounter and prepares a claim for submission to the insurance company, they add the ICD-10 code of Z13.1, “Encounter for screening for diabetes mellitus.” At the same time, there is an internal policy to also code such patients with the code R73.03, “Prediabetes,” which is used to feed a dashboard for subsequent follow up with educational and other patient engagement materials.

As data scientists, we come along months or years later and are attempting to generate a cohort of patients who are prediabetic (but not yet diabetic) and subsequently diagnosed as diabetic. We know that there is the perfect ICD-10 code (R73.03) and proceed to write a SQL query to find all patients who have that code. However, what if our patient above had an A1c result of 5.5% (completely within normal limits) and the physician was just being overly cautious? In this scenario, our SQL query would erroneously return this patient as part of our prediabetic cohort.

To further highlight the potential complexity, let’s say our clinic is currently participating in a local program with the county’s department of public health on a diabetes screening campaign. The physicians in our clinic are now ordering prediabetes screenings at a much higher rate than physicians in other counties in the state. If we continue to rely on this “perfect” ICD-10 code of R73.03, it will appear as if this particular county has an extremely high rate of prediabetics who are never subsequently diagnosed with diabetes. This could be interpreted to mean that the campaign is extremely successful in preventing diabetes.

Given the example above, what is the solution? Typically, in this sort of a situation where we have an objective definition of “prediabetes,” we can just add a threshold to our SQL query, triggering off the A1c result. For example, the threshold for prediabetes may be 5.7-6.4% so we can update the WHERE clause of our SQL query accordingly.

However, the threshold at which point a person is diagnosed as diabetic is not uniform across all hospitals, clinics, or health systems. For example, while our clinic uses 6.4%, another clinic in the county or state may use a threshold of 6.7%. As data scientists and data engineers, we must then query for the raw lab result and apply additional filtering later on in our process.
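To make the querying idea concrete, here is a minimal sketch using an in-memory SQLite table with made-up rows. It filters on the raw A1c result rather than the R73.03 code, and keeps the thresholds as parameters since, as noted above, they can vary by site.

```python
# Sketch: query the measured A1c value instead of trusting the ICD code.
# Table layout and rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lab_results (
        patient_id TEXT, test_name TEXT, result_pct REAL
    )
""")
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?)",
    [
        ("p1", "hba1c", 5.5),  # coded R73.03, but within normal limits
        ("p2", "hba1c", 6.0),  # genuinely in the prediabetic range
        ("p3", "hba1c", 7.1),  # already diabetic
    ],
)

# Keep the range as parameters so each site's threshold can be swapped in.
low, high = 5.7, 6.4
rows = conn.execute(
    "SELECT patient_id FROM lab_results "
    "WHERE test_name = 'hba1c' AND result_pct BETWEEN ? AND ?",
    (low, high),
).fetchall()
print([r[0] for r in rows])  # ['p2']
```

Note that patient p1, who carries the “perfect” R73.03 code in the earlier example, is correctly excluded once we filter on the measured value.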

While the example above may seem unnecessarily complex, it is actually simpler than many, especially those in specialties such as oncology or neurology where the diseases themselves are not as well understood. I will continue to share similar examples and how the use of graph databases can make our job as data engineers and data scientists easier and more efficient, particularly through the lens of reproducibility.

Claims Data

Claims data capture the financial side of healthcare delivery. While this is commonly associated with the US healthcare system given our reliance on private insurers, countries with national health systems also track similar data (essentially treating the government as the payer). You may hear the term “payer” used to describe insurance companies and I will also use that term throughout this book.

On the surface, we might assume that the data contained in a claim (e.g., diagnoses, medications, procedures, etc.) correspond to the data contained in the patient’s record in an electronic health record. For example, if you go to your doctor’s office for a routine checkup and they do a blood draw and order a panel of lab tests, we should expect the records of these in both the EHR as well as the claim to correspond to one another.

For very simple situations, this may be true. However, there are often serious discrepancies between claims and EHR data. You may have heard the term “upcoding,” which is often used to describe the process by which clinics and hospitals (collectively referred to as “providers”) submit codes fraudulently to increase payments.

For example, a provider may have seen a patient for a short visit where only routine screening services were performed, which should have been coded using 99212 (the evaluation of an established patient with at least 2 of: a problem-focused history, a problem-focused examination, and straightforward medical decision making). However, the clinic knows that it would get reimbursed more if it submits code 99215 (an evaluation requiring at least 2 of: a comprehensive history, a detailed examination, and medical decision making of high complexity). While this is a very clear example of fraudulent upcoding, not all situations where codes appear to be inflated are fraudulent.
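The “at least 2 of 3 elements” rule above can be sketched as a simple documentation check. This is a deliberately simplified caricature of E/M coding — real coding rules are far more involved — and the element names are shorthand, not official CPT terminology.

```python
# Simplified sketch of the "at least 2 of 3 elements" idea; the element
# names are shorthand for illustration, not official CPT terminology.
LEVEL_REQUIREMENTS = {
    "99212": {"problem_focused_history", "problem_focused_exam",
              "straightforward_decision_making"},
    "99215": {"comprehensive_history", "detailed_exam",
              "high_complexity_decision_making"},
}

def supports_level(code, documented_elements):
    """A level is supportable if at least 2 of its 3 required elements are documented."""
    required = LEVEL_REQUIREMENTS[code]
    return len(required & documented_elements) >= 2

# A short routine visit, billed at the highest level:
documented = {"problem_focused_history", "problem_focused_exam"}
billed = "99215"

if not supports_level(billed, documented):
    print(f"flag: billed {billed} not supported by documentation")
```

A retrospective check like this can only flag visits for review; as the next example shows, an apparently inflated code is not automatically fraudulent.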

Another example may be when a provider adds a particular diagnosis code which appears to increase the complexity of a patient (as discussed in Example 1-1). In this situation, a code was attached to the claim in order to justify associated lab tests. From the perspective of retrospective analysis, this may initially appear to be upcoding since the patient was not actually diabetic.

Either way, this highlights that claims data may be inaccurate in a clinical sense given the processes behind claims adjudication and reimbursement, or even fraudulent upcoding. The key takeaway is that we, as data engineers and data scientists, must really understand the nuances underlying the claims data, and not assume that it is accurate for our particular use case.

Self-Insured Employers

Most larger companies in the United States are self-insured or self-funded — this means they take on the financial risk instead of an insurance company. They typically contract with third-party administrators (e.g., UnitedHealthcare, Anthem, etc.) to handle claims processing. In these situations, there is yet another source of influence on how and why data are collected, which then directly impacts any downstream analyses.

Clinical / Disease Registries

Clinical and disease registries are typically used to collect data prospectively given a specific set of criteria (often referred to as a study protocol). Many of the same data harmonization challenges exist in registries as they do with EHR data. However, one of the key differences is that the primary intention behind a disease registry is to collect data for later analysis (e.g., clinical research, population health, public health surveillance).

We discuss this in a bit more detail later, but it highlights the biggest difference between EHR/claims data and most other data collected in healthcare — whether or not the data were collected for the purpose of later analysis. EHRs and claims data are collected primarily to transact the business of healthcare. When a physician captures data in the EHR, they are trying to document such that they (or another clinician) can provide the patient with the appropriate care. When a medical coding specialist assigns ICD-10 codes, they are helping the clinic submit reimbursement claims. The intention is not to collect data for data science.

Registry data, on the other hand, are collected so that data analysts and data scientists can use the data to derive insights about populations of patients. Instead of needing to do a deep dive into a particular hospital’s workflow to understand the context and nuance of the data, you would refer to the study protocol instead. This becomes particularly beneficial when working with data from multiple institutions or data collection points. In a registry, all data collection sites use the same study protocol and attempt to collect data as uniformly as possible. In contrast, the local influences on EHR and claims data will vary between clinics, insurance companies, and even employers, as discussed above.

Clinical Trials Data

Clinical trials data is likely the “cleanest” of all healthcare data since there are significant financial and regulatory incentives in place. The success of a clinical trial and approval by regulatory authorities hinge on having clean data and robust analyses. As a result, pharmaceutical companies and clinical/contract research organizations (CROs) dedicate significant resources to data collection, cleaning, processing, and analysis.

Additionally, there are clearly defined standards (e.g., CDISC10) around clinical trials data since regulatory agencies have clearly defined submission requirements. While such standards help decrease some of the challenges when harmonizing trials data, they do not solve all of the challenges.

Other Data

Electronic health records, registries, claims data, and clinical trials data are the typical categories of data that come to mind when talking about healthcare data. However, there are many other sources of data that are being used at all levels of healthcare, from digital health startups to hospitals and biopharma companies.

There are far too many to list here but the approaches presented throughout this book are just as applicable. Whether you are dealing with patient reported outcomes, radiology reports, diagnostic data, or anything else, approach it as you would any of the sources discussed above.

Data Collection and How that Affects Data Scientists

Retrospective vs. Prospective Studies

Prospective Studies

The term prospective is adapted from its use when describing clinical research studies — highlighting the relationship between when the study starts and when the final outcome is measured1. In prospective studies, a study protocol is put in place and data are collected. Data continue to be collected per the study protocol until the end of the study. Analysis of the data may start immediately (even while the study is ongoing) or may start after the data have been locked and no additional data are collected.

One of the key points to consider with prospective studies is that the criteria for data collection and how the data are collected are explicit and influenced by the purpose of the study. For example, take a study that seeks to identify clinical signs associated with impending death in patients with advanced cancer2. This was set up as a prospective study, and the protocol dictated that 52 physical signs be documented every 12 hours, starting with the patient’s admission.

In the study above, and as with most prospective studies, decisions about the underlying format/data types and the meaning (referred to as semantics) of each data element are determined up front. Those involved with the collection, management, and analysis of the data can all refer back to the study protocol for the intended semantics.

The concepts of prospective studies and primary data are often conflated given their frequent association. Data collected in the context of prospective studies are typically considered primary data because they are collected for the purpose of the study3. However, though data may have been collected in a prospective study, they could be used as secondary data in a follow-up study. Continuing with the cancer study referenced above, if researchers took that data and wanted to look for correlations between various bedside clinical signs and various medications, this would be considered secondary use of the data. That is, the data are being used and analyzed for reasons other than why they were originally collected.

So, while data from prospective studies are usually considered primary data, they may also be considered secondary data — this distinction between primary and secondary use depends entirely on the question(s) being asked of the data, relative to how and why the data were initially collected.

Retrospective Studies

Historically, prospective studies were the major mechanism for gathering healthcare data for analysis, often in the form of clinical research. However, as with most industries, data are being collected more and more frequently — in a variety of forms whether through electronic health records, digital health tools, or even in clinical and disease registries. Consequently, people are looking to these data to find insights and are essentially conducting retrospective studies.

Retrospective studies are those where the outcome is already known (e.g., we already know the overall survival of all patients in the data set) and data are collected from existing sources or memory. As a result, retrospective studies typically involve secondary use of data. This is where things can get confusing between prospective studies and retrospective studies, and primary data and secondary data.

One common example of a retrospective study involving secondary use of data is the extraction of data from electronic health records (EHRs)4 — a researcher may want to look at the relative overall survival of cancer patients on a particular medication (e.g., bevacizumab5) relative to standard chemotherapy alone. Instead of constructing a prospective study, the researcher decides to extract data on a subset of patients who match the inclusion/exclusion criteria. Though the data have already been collected in the EHR, the researcher is retrospectively analyzing previously collected data for their study.

The example above also highlights secondary use of data. The data were originally collected in the EHR for the purposes of patient care or billing, but are now being used to compare the efficacy of traditional chemotherapy regimens and those that include the addition of bevacizumab.

Generally speaking, from the perspective of the data, whether a study is prospective or retrospective is less important than whether it is a primary or secondary use (and collection) of data. As data engineers or data scientists, it is important to consider how and why the data were collected since this directly impacts the wrangling (cleaning, processing, normalization, and harmonization) of data.

It is usually easier to wrangle data that have been collected for the specific study being conducted. Data types and formats have been decided; there is an established common understanding of the data elements and how they are supposed to be collected. This does not ensure that the data are in fact clean, but it does decrease the data wrangling challenges.

In the case of secondary use of data, the data were collected for a variety of different (and sometimes conflicting) reasons so it is not always clear how the data should be cleaned and processed. For example, one common misconception is that a list of ICD-10 codes within a patient’s record in the EHR is a good source for identifying patients with a particular diagnosis. While ICD-10 codes are commonly used to track diagnoses in a variety of datasets, it is important to understand the context of the use of ICD-10 codes in many (though not all) EHRs.

Take a patient who comes into their primary care provider’s office for a routine checkup and the physician orders a hemoglobin A1c (HbA1c) test to rule out diabetes. That is, the physician feels their patient may have diabetes and is attempting to validate this hypothesis. They will put in the order for the test and continue with the visit. However, somewhere behind the scenes, someone responsible for medical billing also tags the patient’s record with the ICD-10 code of E13, indicating “other specified diabetes mellitus.”

Why did they do this? Perhaps this allows the hospital administration to track why particular tests are being ordered, or this allows insurance companies to identify erroneous test orders. The insurance company may have a policy that says, “HbA1c tests are approved only for patients having or suspected of having diabetes.” In order to validate incoming claims, the insurance company has pushed the burden onto hospitals. Existing diabetic patients will already have a corresponding ICD-10 code and will pass the validation. However, a patient who has not been diagnosed with diabetes will fail this test and the claim will be kicked back to the hospital. So, in order to pass the validation, the hospital codes the patient as having diabetes.

In this example, is the patient diabetic? Perhaps. Perhaps not. Until the result of the HbA1c test is examined, there is no way for a data scientist to know if this is a diabetic patient or not (and whether or not to include this patient in the cohort).
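One way to operationalize this caution is to require a confirming lab value before trusting the code, as in the sketch below. The function name is hypothetical, and the 6.5% cutoff is a commonly cited diagnostic threshold — but, as discussed earlier in this chapter, thresholds can vary between institutions, so it is kept as a parameter.

```python
# Sketch: cross-check the ICD-10 code against the measured HbA1c before
# including a patient in a diabetic cohort. The 6.5% default is a commonly
# cited threshold; real sites may use different cutoffs.
DIABETES_ICD10_PREFIXES = {"E08", "E09", "E10", "E11", "E13"}  # diabetes mellitus family

def include_in_diabetic_cohort(icd_codes, hba1c_pct, threshold=6.5):
    """Require both a diabetes code and a confirming lab value."""
    has_code = any(code[:3] in DIABETES_ICD10_PREFIXES for code in icd_codes)
    return has_code and hba1c_pct is not None and hba1c_pct >= threshold

# The patient from the example: coded E13 to satisfy claim validation,
# but the test may later come back normal.
print(include_in_diabetic_cohort(["E13"], 5.4))   # False: code only, normal result
print(include_in_diabetic_cohort(["E13"], 7.2))   # True: code plus confirming result
print(include_in_diabetic_cohort(["E13"], None))  # False: result not yet available
```

The check does not resolve the ambiguity — a patient with a pending result is simply excluded — but it makes the assumption explicit instead of burying it in the cohort definition.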

1 How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.

2 Epic Systems Corporation.

3 Cerner Corporation.

4 “Ambulatory” is being used to mean “outpatient,” distinguishing it from an in-patient or hospital EHR.



7 RxNorm.

8 National Drug Codes.

9 ICD Codes.

10 CDISC.
