Chapter 13: Machine learning for optimizing healthcare resources

Abdalrahman Tawhid; Tanya Teotia; Haytham Elmiligi    Computing Science Department, Thompson Rivers University, Kamloops, BC, Canada

Abstract

Optimizing healthcare resources is a major goal for any healthcare administrator. The significance of such a goal becomes very clear during pandemics. Access to such resources at the right time affects the quality of healthcare services and provides alternative treatments that can save patients’ lives. Achieving this goal becomes more challenging when there is a large number of patients with chronic diseases. Machine learning algorithms provide a very promising solution that can help healthcare administrators make the right decision at the right time. Machine learning models can predict the progress of pandemics, classify a patient with well-defined symptoms as contagious or not, and predict the number of patients who will be hospitalized in the future. This chapter shows how to utilize machine learning algorithms to create models that can predict some of the key issues in healthcare systems. The discussion in the chapter relates to the COVID-19 pandemic and highlights the solutions offered by machine learning in such scenarios. The chapter also highlights the significance of feature engineering and its impact on the accuracy of machine learning models. The chapter ends with two case studies. The first case study shows how to build a model that predicts the number of diabetic patients who will visit certain hospitals in a specific geographic location in future years. The second case study analyzes health records during the COVID-19 pandemic.

Keywords

Chronic diseases; Public Health Agency of Canada; Diabetic patients; Cross-validation

1: Introduction

Researchers and health administrators are always trying to explore the best options to optimize access to healthcare resources. Access to such a vital service becomes even more significant during pandemics. In January 2020, when the COVID-19 virus began to spread across different countries, healthcare systems all over the world underwent a stress test of whether they had well-functioning emergency plans to deal with the COVID-19 pandemic. At the early stages of the pandemic, hospitals suffered from a shortage of ventilators, which are used to help COVID-19 patients breathe if they have breathing difficulties. Consequently, researchers started developing machine learning models to predict the healthcare resources that would be needed in the upcoming days. Although these models were not 100% accurate, they greatly helped decision makers analyze the situation and obtain a rough figure of the resources needed during different phases of the pandemic.

Resource management during pandemics becomes more challenging when a large portion of the population has chronic diseases. A chronic disease, in general, is a disease that lasts 3 months or more. Chronic diseases generally cannot be prevented by vaccines or cured by medication, nor do they just disappear. There are many different chronic diseases, such as diabetes, kidney failure, heart disease, and high blood pressure. A study conducted in New York City, United States, that analyzed data of 5700 COVID-19 patients found that nearly all of these patients had at least one major chronic health condition, and 88% of them had at least two chronic health conditions (Richardson et al., 2020).

Machine learning is the process of teaching the machine, that is, the computer, how to find common patterns in a set of data and draw conclusions from correlations in the data without being explicitly programmed to reach that result (Behera and Das, 2017). There are different types of machine learning algorithms that take different approaches to understanding the datasets and building the predictive models. Machine learning algorithms can be used for classification or regression. In classification, the algorithm assigns the test pattern to a specific class. In regression, the algorithm tries to predict a specific number based on the given test pattern. Machine learning problems can sometimes be formulated as classification and/or regression problems.

Machine learning algorithms help in finding correlations between different attributes related to certain events. Hence, not only can they be used to optimize healthcare resources through predictive models, but they can also be used to provide exit strategies for pandemic situations after a long lockdown. During the COVID-19 pandemic, governments started looking for an exit strategy to protect their citizens from the COVID-19 virus while opening the economy after a long lockdown. In April 2020, Prof. Kostas Kostarelos at the University of Manchester suggested dealing with the COVID-19 pandemic more like a chronic disease (Kostarelos, 2020). In such a case, analysts would adopt a care model usually applied to cancer patients in order to provide a constructive way to reopen the economy while dealing with the virus. Some experts even suggested treating COVID-19 as a chronic disease because many COVID-19 patients tested positive again after recovery (Lan et al., 2020). However, there was no scientific evidence up to the time of writing this chapter that COVID-19 is a chronic disease that will keep attacking the body over and over again even after the patient has recovered.

Nevertheless, the COVID-19 pandemic highlighted the significance of machine learning models in optimizing healthcare resources, predicting the number of patients who need specific care, understanding the risk factors for patients with chronic diseases, and building models that help decision makers decide when and how to return to normal life.

In this chapter, we explore how machine learning techniques could be used to help hospital administrators manage their staff and resources efficiently. Machine learning is a process used to find hidden patterns in large batches of data that can be used to teach the machine how to classify or predict certain numbers. Machine learning depends on effective data collection and warehousing as well as algorithms and computer processing. The goal of this chapter is to discuss various techniques to analyze dataset features and to develop an efficient model to predict different parameters.

2: The state of the art

Prior to the COVID-19 pandemic, most machine learning studies that examined risk factors for hospital admissions focused primarily on creating datasets for a specific disease or condition, a single hospital site, or even a specific patient in the population of interest. In contrast, pandemic situations require collaborative datasets that share information from multiple sites to come up with an accurate prediction. The accuracy of traditional forecasting largely depends on the availability of admission data on which to base predictions and estimates of uncertainty. In outbreaks of epidemics, and specifically during the COVID-19 pandemic, there was no data at all at the beginning, and data only started to grow as time passed, making predictions widely uncertain. Although data scientists tried to come up with several models to predict the pandemic peak in each country, the accuracy of these models was not high due to the lack of data.

Based on our analysis of research papers published in this area, we managed to narrow our findings into three main categories:

  1. The first category is related to resource management. This category includes research that utilizes machine learning algorithms to create models that aid in the prediction of hospital admissions in general or to the emergency department, which can cause overcrowding. We found many papers published prior to as well as during the COVID-19 pandemic.
  2. The second category of interest encompasses research work that relates to the COVID-19 pandemic and its impact on people’s health. This includes models that forecast mortality and infection rates among different age groups, genders, and ethnic groups.
  3. The third category is related to exploring possible exit strategies to help the economy of each country bounce back after a long shutdown period due to the COVID-19 pandemic. This includes developing models to predict the peak time of the pandemic, the estimated time before finding a vaccine, the optimum time to lower the emergency level across the country, when to open airports to international flights, etc.

The following sections discuss the approaches taken to address these three different categories in more detail.

2.1: Resource management

Several researchers have explored applying machine learning algorithms to create predictive models that address a specific problem related to a single site or country. For example, research conducted by Michael LaMantia used triage data to create a model that estimates the probability of admission for elderly patients (LaMantia et al., 2010). This work was based on a dataset of 4873 patient visits to the hospital. Triage is the process of sorting and filtering patients based on priority; it aims to determine a patient’s acuity level in order to facilitate timely and effective care before their condition worsens. The work managed to predict the number of admissions in a total population of visits by elderly patients. Regression modeling was performed to identify the variables that are most significant in predicting the probability of admission. The study concluded that these variables are: age, heart rate, triage score, chief complaint, and diastolic blood pressure. However, this work only applies to elderly patients, so it does not provide the broad representation needed for health administrators to allocate their resources properly.

Another example is a study published by Weissman et al. (2020) to predict hospital capacity needs during the COVID-19 pandemic. The study used the COVID-19 Hospital Impact Model for Epidemics (CHIME), a modified SIR model that computes the number of people infected with COVID-19 in a closed population over time. Using records of patients with COVID-19, the CHIME model estimated that it would be 31–53 days before the demand for more resources exceeded existing hospital capacity. The study identifies the needed resources in terms of the total capacity for hospital beds and the number of ICU beds and ventilators in the best-case and worst-case scenarios. Such a study is significant in identifying the resources that will be needed during a pandemic, helping healthcare administrators plan ahead and make the most suitable decisions to optimize their resources.

A third example is the research conducted by Giacomo Grasselli in Italy, which was one of the major hotspots during the COVID-19 pandemic. Grasselli used historical data to predict the number of patients who would be admitted to the intensive care unit over a period of 2 weeks (Grasselli et al., 2020). This work is relevant only to that specific hospital and its admissions. The benefit of this methodology is the ability to use real-time data specific to that hospital. The downside, however, is that the forecast does not take into account factors driving higher infection rates, and the predictions are very region specific.

A final example is a study completed by Chen et al. In their work, the authors utilized data from the Tongji Hospital in China and applied five machine learning approaches: logistic regression, partial least-squares regression, elastic net, random forest, and bagged flexible discriminant analysis to select the features and predict outcomes in severe COVID-19 patients (Chen and Liu, 2020). The authors validated their results by using the area under the receiver operating characteristic curve (AUROC) to compare the models’ performance. They also tested their results on 64 patients admitted to the hospital with COVID-19. The main focus was predicting the number of mortalities among the admitted patients. For a methodology that allows healthcare administrations to allocate resources properly, a prediction of infection rates would be a much more useful target to work with. Also, focusing on a single hospital in a large country is not enough to give a proper representation of the virus activity; validation of such methods needs to be established on subjects outside that city.

2.2: Impact on people’s health

The second category discusses the research work that used machine learning algorithms to predict COVID-19’s impact on people’s health and well-being.

Before looking at the opportunity to forecast different trends of the pandemic, researchers started to consider what the global impact of the COVID-19 pandemic would be. Answering this question requires accurate prediction of the spread of confirmed cases all over the world as well as analysis of the number of deaths and recoveries in each country.

Forecasting, however, requires ample and reliable historical data, which was a limiting factor as the virus was fairly new at the time of the study. Moreover, forecasts are influenced by the reliability of the data, vested interests, and which variables are being predicted. Psychological factors also play a significant role in how the population perceives and reacts to the danger caused by the disease and the fear that it may affect their own communities. Therefore, the main challenge that faced data scientists at that time was how to predict the impact on people’s health using machine learning algorithms. The stakes were high because the reports generated by these algorithms were used by world leaders to make critical decisions that impact the local as well as the international economy.

A study by Fotios Petropoulos used statistical data from Johns Hopkins University and analyzed preexisting graphical representations of trends to forecast future representations of the graph (Petropoulos and Makridakis, 2020). Petropoulos used exponential smoothing models, which can capture a variety of trend and seasonal forecasting patterns such as mortality, confirmed cases, and recovered cases in 10-day intervals. One of the benefits of this model is that it takes the most recent observations into account and weights them accordingly. For example, observations from a 4-month period in 2018 would likely be weighted differently from observations in the same period of 2020. The exponential smoothing method takes this into account (Avercast, 2020).
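
To make the idea concrete, the sketch below fits a trended exponential smoothing model to a daily series of cumulative confirmed cases. It is a minimal illustration, assuming statsmodels is available; the case counts are invented and the damped-trend setting is an illustrative choice, not the configuration used in the cited study.

```python
# A minimal exponential smoothing sketch; the case counts are illustrative.
import numpy as np
from statsmodels.tsa.holtwinters import Holt

# Hypothetical cumulative confirmed cases for 20 consecutive days.
cases = np.array([10, 14, 21, 30, 44, 60, 82, 110, 148, 195,
                  250, 320, 405, 500, 610, 740, 890, 1060, 1250, 1470],
                 dtype=float)

# Holt's method captures a (damped) trend; recent observations receive
# exponentially larger weights than older ones.
model = Holt(cases, damped_trend=True).fit(optimized=True)
print(model.forecast(10))  # a 10-day-ahead forecast, matching the intervals above
```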

In another study, completed by Li, the author focused on analyzing the existing data of the Hubei epidemic situation, established a corresponding model, and then carried out a simulation. Through the simulation, she studied the main factors affecting the spread of COVID-19, such as the basic reproduction number, the incubation period, and the average number of days to recovery (Li et al., 2020). This was done using a Gaussian distribution. The Gaussian distribution, also known as the normal distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. A normal distribution has two parameters: the mean and the standard deviation. An analysis of graphical representations is useful for short-term expectations of certain data, but it is limited by its reliance on region-specific data.

2.3: Exit strategies

How did a health crisis translate into an economic crisis? Why did the spread of COVID-19 bring the global economy to its knees? When would COVID-19 reach its peak? The answer to these questions lies in the two main ways in which COVID-19 stifled economic activity. First, the spread of the virus forced social distancing, which led to the shutdown of financial markets, corporate offices, businesses, and events. Second, the exponential rate at which the virus was spreading, and the heightened uncertainty about how bad the situation could get, led to the closure of airports and a new look at consumption and investment among consumers, investors, and even international trade partners. Preparing to open the economy once again was a very critical decision. Therefore, decision makers needed a trusted methodology to help accurately forecast the peak and termination times of the pandemic. In those circumstances, having such a methodology was an important factor in minimizing the risk of further losses, whether economic or physical.

For example, a study conducted by Zahiri utilized statistical data collected from Iran and its published infection rates. The study applied the SIR model to forecast the infection rate, peak times, termination times, and other parameters (Zahiri et al., 2020). An SIR model is an epidemiological model that, in this study, computed the theoretical number of people infected with a contagious illness in a closed population over time. The name of this class of models derives from the fact that they involve coupled equations relating the number of susceptible people (S), the number of people infected (I), and the number of people who have recovered (R) (Smith and Moore, 2004). This model may be useful in interpreting data for a specific country and under specific conditions. However, it was difficult to generalize this model and adopt it in other countries. The disease has thus far been unpredictable, and this model also did not account for the new wave of infections that could occur if airports and businesses reopened.
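
The coupled S, I, R equations are straightforward to simulate. The sketch below integrates them with SciPy under illustrative parameters; the population size, transmission rate, and recovery rate are assumptions for demonstration, not values from the cited studies.

```python
# A minimal SIR sketch; beta, gamma, and the population size are illustrative.
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma, n):
    s, i, r = y
    ds = -beta * s * i / n              # susceptible -> infected
    di = beta * s * i / n - gamma * i   # infected -> recovered
    dr = gamma * i
    return ds, di, dr

n = 1_000_000                  # closed population size
y0 = (n - 1, 1, 0)             # one initial infection
t = np.linspace(0, 180, 181)   # days
s, i, r = odeint(sir, y0, t, args=(0.3, 0.1, n)).T

print("Peak infections on day", int(t[i.argmax()]), "with", int(i.max()), "cases")
```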

The airline business was another major industry impacted by the disease. In March 2020, a financial analysis study reported that Canadian airports were confronting the prospect of 2 billion dollars in losses over the following few months as borders shut and planes stayed parked due to the COVID-19 pandemic (Times Colonist, 2020). This major loss encouraged airport facilities to reopen to avoid further losses. But it is important to have methods that enable us to accurately forecast the optimum time to open airports. In our analysis of the literature, this aspect seemed to be missing, with the focus primarily being on mortality and infection rates.

3: Machine learning for health data analysis

The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning (a subset of artificial intelligence) plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data and records, and the treatment of chronic diseases.

As mentioned previously, a proper and reliable dataset is imperative to achieve accurate results. Data quality relies on four factors:

  1. accuracy
  2. completeness
  3. validity
  4. timeliness

Accuracy refers to how well the data describe the real-world conditions they aim to describe. Inaccurate data create clear problems, as they can cause the machine learning algorithm to come to incorrect conclusions. The actions administrators take based on those conclusions might not have the effects they expect because they are based on inaccurate data.

Completeness encompasses whether the dataset has gaps: everything that was supposed to be collected was successfully collected. For example, if a customer skipped several questions on a survey, the data they submitted would not be complete. If the data are incomplete, we might have trouble gathering accurate insights from them.

Validity refers to how the data are collected rather than to the data themselves. Data are valid if they are in the right format, of the correct type, and fall within the right range. If data do not meet these criteria, we might run into trouble organizing and analyzing them.

The final factor is timeliness, which refers to how recently the event the data represent occurred. In most cases, data should be recorded as soon as possible after the real-world event, because data typically become less useful and less accurate as time passes. Data that reflect recent events, such as the COVID-19 pandemic, are even more useful in application to the current reality. Therefore, selecting the proper dataset is a crucial factor in any research work. The data used in this work are a secondary dataset, derived from the Public Health Agency of Canada (2015). The raw data covered over 10 different chronic diseases, with records for each year from 2000 to 2011. Within each year, there were several categories: gender, 18 different age groups, prevalent cases, mortality, hospital visits, general practitioner visits, specialist visits, and much more. Two problems arose with this dataset. First, the dataset only provided data for 2000–11, which left us with only one available option: to create a general formula able to predict the years we do not have. Second, the dataset could not be used in the data mining tool as is; the data had to be preprocessed first.
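
As a concrete illustration of that preprocessing step, the sketch below converts nominal attributes to numeric values and drops incomplete rows with pandas. The column names and values are hypothetical stand-ins, not the actual PHAC schema.

```python
# A preprocessing sketch; the column names are hypothetical, not the PHAC schema.
import pandas as pd

df = pd.DataFrame({
    "disease": ["Diabetes", "Diabetes", "Diabetes", "Asthma"],
    "gender": ["M", "F", "F", "M"],
    "age_group": ["20-24", "25-29", "20-24", "30-34"],
    "hospital_visits": [120, 150, None, 80],   # None marks a gap in the records
})

df = df[df["disease"] == "Diabetes"]                 # keep one disease
df["gender"] = df["gender"].map({"M": 0, "F": 1})    # nominal -> numeric
df["age_group"] = df["age_group"].astype("category").cat.codes
df = df.dropna(subset=["hospital_visits"])           # drop incomplete rows
print(df)
```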

4: Feature selection techniques

In data mining applications, feature selection plays a vital role by removing irrelevant and redundant features from the dataset. This reduces the dimensionality of the dataset as well as the computational time by (1) optimizing the learning process and (2) improving the performance of machine learning algorithms (Guyon and Elisseeff, 2003; Xue et al., 2016).

Feature selection techniques can be categorized into three categories:

  1. filter (Guyon and Elisseeff, 2003; Kohavi and John, 1997);
  2. wrapper (Kohavi and John, 1997); and
  3. embedded (Guyon and Elisseeff, 2003; Blum and Langley, 1997).

Fig. 1 shows our categorization of feature selection techniques used in the previous work. In the following sections, we discuss each approach briefly and provide examples from the literature.

Fig. 1 Feature selection techniques.

4.1: Filter approach

In the filter approach, a ranking-based criterion is used to rank all features in the dataset. Then, a threshold value is set and all features below the threshold are removed (Chandrashekar and Sahin, 2014). The ranking methods involved in this process are known as filter methods because they filter out irrelevant features from the dataset before it is evaluated by a classification algorithm. The features that are essential to construct an optimum subset are identified as highly relevant features (Yu and Liu, 2004). It is also important to ensure that there are no redundant features in the dataset, to reduce the effect of the curse of dimensionality (Chandrashekar and Sahin, 2014). The redundancy of two features can be assessed by computing the correlation between their values. The following sections discuss the major filter-based selection methods.

4.1.1: Correlation based

Correlation criteria are mainly based on the hypothesis that “good feature subsets contain features that are highly correlated with the class, but uncorrelated to other features” (Hall, 2000). There are two types of correlation measures: one is based on feature-feature correlation, whereas the other is based on feature-class correlation. The merit of a feature subset S consisting of N features can be calculated as (Chen et al., 2006; Vanaja and Kumar, 2014):

$$\mathrm{Merit}_{S_N} = \frac{N\,\bar{r}_{fc}}{\sqrt{N + N(N-1)\,\bar{r}_{ff}}} \tag{1}$$

where $\bar{r}_{fc}$ represents the average feature-class correlation, $\bar{r}_{ff}$ represents the average feature-feature correlation, and N represents the number of features in that particular subset S. The feature-class and feature-feature correlations can be calculated by various correlation coefficient criteria. One of the most common criteria is based on the Pearson correlation coefficient (Guyon and Elisseeff, 2003; Chandrashekar and Sahin, 2014):

$$R(i) = \frac{\mathrm{Cov}(x_i, Y)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(Y)}} \tag{2}$$

where $x_i$ represents the ith attribute of the dataset, Y represents the class label, and Var() and Cov() represent the variance and covariance. Correlation-based feature selection (CFS) is one of the most common filter approaches applied to feature selection problems (Halakou, 2013; Balagani et al., 2011; Kumar and Zhang, 2005).
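
As a small illustration of Eq. (2) used as a ranking criterion, the sketch below scores each feature by its absolute Pearson correlation with the class and keeps those above a threshold. The synthetic data and the threshold of 0.2 are illustrative assumptions.

```python
# Correlation-based feature ranking per Eq. (2); the threshold is illustrative.
import numpy as np

def pearson_rank(X, y):
    """Absolute Pearson correlation of each column of X with the class label y."""
    scores = []
    for i in range(X.shape[1]):
        cov = np.cov(X[:, i], y)[0, 1]
        scores.append(abs(cov / np.sqrt(X[:, i].var(ddof=1) * y.var(ddof=1))))
    return np.array(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(float)

scores = pearson_rank(X, y)
selected = np.where(scores > 0.2)[0]   # filter step: keep features above threshold
print(scores.round(3), "-> selected features:", selected)
```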

4.1.2: Information based

Information-based feature selection is a measure that indicates the difference between prior and expected posterior uncertainty (Guyon and Elisseeff, 2003; Chandrashekar and Sahin, 2014; Liu and Setiono, 1996; Molina et al., 2002). The uncertainty of a feature Y can be calculated by Shannon’s definition of entropy (Chandrashekar and Sahin, 2014)

$$H(Y) = -\sum_{y} p(y)\log\big(p(y)\big) \tag{3}$$

and the conditional entropy of a random variable Y given variable X is computed by (Chandrashekar and Sahin, 2014)

$$H(Y|X) = -\sum_{x}\sum_{y} p(x,y)\log\big(p(y|x)\big) \tag{4}$$

The difference between Eqs. (3) and (4) describes the mutual information with respect to a random variable between Y and X (Chandrashekar and Sahin, 2014):

$$MI(Y,X) = H(Y) - H(Y|X) \tag{5}$$

where MI is the mutual information that measures the mutual dependence between two random variables, and

$$MI(Y,X)\;\begin{cases} = 0, & X \text{ and } Y \text{ are independent of each other,} \\ > 0, & X \text{ and } Y \text{ are dependent on each other.} \end{cases}$$

When X and Y are independent of each other, this results in less information gain and more uncertainty. When X and Y are dependent on each other, this results in more information gain and less uncertainty.

MI measures the distance between two probability distributions, whereas correlation criteria, such as Pearson correlation coefficient, measure the degree of correlation based on linear relationship between two random variables.
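
In practice, estimators of Eq. (5) are readily available. The sketch below uses scikit-learn's mutual information estimator on a synthetic dataset in which only one feature carries class information; the data are illustrative.

```python
# Information-based filtering with an estimator of the MI in Eq. (5).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 1] > 0).astype(int)      # only feature 1 is informative

mi = mutual_info_classif(X, y, random_state=0)
print(mi.round(3))                 # feature 1 should score highest; MI >= 0
```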

4.1.3: Consistency based

Consistency-based feature selection measures the consistency of features with respect to the class value. The objective of using this measure is to find a subset of features leading to zero inconsistencies. Two instances with the same feature values but belonging to two different classes indicate the occurrence of an inconsistent pattern in the dataset (Dash et al., 2000). The inconsistency count of a pattern is the number of times the same pattern is found in the dataset minus the largest number of its occurrences belonging to a single class. The inconsistency rate is calculated by dividing the total sum of the inconsistency counts of all patterns found in a particular feature subset by the total number of patterns in the dataset with respect to that selected feature subset (Dash et al., 2000).
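
The inconsistency rate just described is simple to compute directly, as the sketch below shows on a tiny hand-made dataset.

```python
# Inconsistency rate: for each distinct pattern over a feature subset,
# count = occurrences - largest single-class count; rate = total / dataset size.
from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        pattern = tuple(row[j] for j in subset)
        groups[pattern][label] += 1
    total = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return total / len(X)

X = [[0, 1], [0, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]
print(inconsistency_rate(X, y, subset=[0, 1]))  # 0.25: one (0, 1) row disagrees
```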

4.1.4: Distance based

Distance-based criteria are divided into two subcategories: (1) classical distance-based measures and (2) probabilistic distance-based measures, also known as divergence measures.

Classical distance-based measures are used to find similarities between two feature vectors. When the distance value is small, the two vectors are considered similar, whereas when the distance value is large, they become less similar. The Relief/ReliefF feature selection algorithm is an example of a classical distance-based measure, which uses Euclidean and Manhattan distance measures in high-dimensional feature space (Robnik-Šikonja and Kononenko, 2003). For a randomly selected instance, the Relief algorithm considers two neighboring feature vectors: one belonging to the same class, called the nearest hit, and one belonging to a different class, called the nearest miss.

The underlying principle is that a feature is considered more relevant when the distance between the instance and its nearest miss along that feature is large, whereas a feature is considered less relevant when the distance between the instance and its nearest hit along that feature is large (Molina et al., 2002). Each feature is assigned a weight based on its relevance. Similarly, the ReliefF algorithm selects an instance randomly but searches for the k-nearest neighbors belonging to the same class, called nearest hits, as well as the k-nearest neighbors belonging to each of the different classes, called nearest misses, and computes their averages (Molina et al., 2002).
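
A minimal sketch of the Relief weight update follows: weights grow with the per-feature distance to the nearest miss and shrink with the distance to the nearest hit. The dataset is synthetic and the number of iterations is an arbitrary choice.

```python
# A minimal Relief sketch; the data and iteration count are illustrative.
import numpy as np

def relief(X, y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                  # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))   # nearest same-class
        miss = np.argmin(np.where(y != y[i], d, np.inf))  # nearest other-class
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_iter
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
print(relief(X, y).round(3))   # feature 0 should receive the largest weight
```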

Even though filter methods, in general, are used as a preprocessing step to reduce the dimensionality of a dataset and overcome overfitting, many researchers have concluded that they do not provide the best accuracy compared to wrapper methods, as they ignore the dependency of the feature subset on the learning algorithm (Kohavi and John, 1997).

4.2: Wrapper approach

In the previous section, we discussed the filter methodology which uses an independent measure as an evaluator for subset evaluation. In this section, we discuss wrapper methodology, which uses a learning algorithm as a feature subset evaluator (Kohavi and John, 1997). This methodology uses a learning algorithm as a black box and its performance as an objective function for evaluating a selected feature subset (Chandrashekar and Sahin, 2014; Kohavi and John, 1997).

Various search techniques are used to generate different feature subsets, which are evaluated using a machine learning algorithm until the optimal subset is found. These methods perform better at defining the optimal subset that is best suited to a learning algorithm. One of the key advantages of the wrapper-based feature selection approach is that it does not ignore the dependency of the selected feature subset on the overall performance of the learning algorithm (Kohavi and John, 1997). Therefore, the performance of this approach is usually superior. However, since it uses a search technique and a learning algorithm together for subset selection and evaluation, it tends to be slower and computationally expensive.

In order to optimize the search time, different types of search algorithms have been used in the literature. Various wrapper-based experiments have been carried out by varying different search techniques and learning algorithm as a subset evaluation measure.

4.2.1: Classic search algorithms

In wrapper-based feature selection, sequential algorithms are used as a search technique to execute the process of feature selection sequentially. In other words, features are selected one after another in succession. They can be further divided into two basic categories, namely sequential forward selection and sequential backward selection (Chandrashekar and Sahin, 2014). In this section, we study some search algorithms that are based on the sequential search technique.

Greedy search

In wrapper-based feature selection, the greedy selection algorithms are simple and straightforward search techniques. They iteratively make “nearsighted” decisions based on the objective function and hence are good at finding a local optimum. However, they often fail to provide globally optimal solutions for large problems. Traditionally, they are divided into two categories: (1) greedy forward selection (GFS) and (2) greedy backward elimination (GBE).

The GFS algorithm starts with an empty set and, at each iteration, adds one feature to the subset until a locally optimal solution is achieved, whereas the GBE algorithm starts from the complete set of features and iteratively removes one feature until a locally optimal solution is achieved. A minimal sketch of GFS follows.
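
The sketch below wraps a learning algorithm (logistic regression here, an illustrative choice) and uses cross-validated accuracy as the objective function; in the synthetic data, features 0 and 4 are the informative ones.

```python
# Greedy forward selection (GFS) wrapping a learner; data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = (X[:, 0] - X[:, 4] > 0).astype(int)

selected, best = [], 0.0
while len(selected) < X.shape[1]:
    gains = {}
    for f in set(range(X.shape[1])) - set(selected):
        cols = selected + [f]
        gains[f] = cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()
    f, score = max(gains.items(), key=lambda kv: kv[1])
    if score <= best:          # stop at a local optimum
        break
    selected, best = selected + [f], score

print("selected features:", selected, "cv accuracy:", round(best, 3))
```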

Best first search

Best first search is a search technique that explores the nodes in a graph with a heuristic evaluation function (Kohavi and John, 1997). In feature selection, best first search uses this evaluation function to score every candidate feature and selects the one that provides the best “score” first. Two lists keep track of the visited (CLOSED) and unvisited (OPEN) nodes.

This algorithm can be further divided into two subcategories: (1) best forward selection (BFS) and (2) best backward selection (BBS) (Darabseh and Namin, 2015). BFS starts with an empty set and at each iteration, adds one feature to the subset and BBS starts from a complete set of features and iteratively removes one feature from the subset.

4.2.2: Metaheuristic algorithms

Genetic algorithms

Genetic algorithm (GA) is a probabilistic search technique based on the evolutionary idea of natural selection, which mimics the process of evolution. The algorithm starts by initializing a population of chromosomes, typically represented by binary strings, although other encodings are possible. In feature selection, the 1s and 0s of a binary string represent feature selection and rejection. At each iteration, these chromosomes are evaluated by a fitness function, which scores the evolving generations of chromosomes. Pairs of chromosomes are selected at random to reproduce based on the score assigned by the fitness function.

Operations such as mutation and crossover are performed on the selected pair of chromosomes to create the next generation, new chromosome, or an offspring. Mutation involves modifying a chromosome, whereas a crossover involves merging two chromosomes of the present generation to create an offspring. After n generations, this algorithm converges to the best set of chromosomes, which represent the optimal or suboptimal feature subset in a feature selection problem.
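
The sketch below is a compact GA for feature selection along these lines. Every parameter (population size, number of generations, mutation rate) and the choice of cross-validated accuracy as the fitness function are illustrative assumptions.

```python
# A compact GA sketch for feature selection; bit i means "keep feature i".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

def fitness(chrom):
    if not chrom.any():
        return 0.0
    return cross_val_score(LogisticRegression(),
                           X[:, chrom.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1]))            # initial population
for _ in range(15):                                        # generations
    scores = np.array([fitness(c) for c in pop])
    parents = pop[np.argsort(scores)[-10:]]                # selection
    cut = rng.integers(1, X.shape[1])                      # single-point crossover
    children = np.concatenate([parents[:5, :cut], parents[5:, cut:]], axis=1)
    flips = rng.random(children.shape) < 0.05              # mutation
    children = np.where(flips, 1 - children, children)
    pop = np.concatenate([parents, children])

best = pop[np.argmax([fitness(c) for c in pop])]
print("best subset:", np.flatnonzero(best))
```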

Particle swarm optimization

Particle swarm optimization (PSO) is a technique based on the paradigm of swarm intelligence. The intelligent behavior is inspired by the social behavior of animals such as birds and fish (Liu et al., 2011). The algorithm starts by initializing a swarm of particles, where each particle represents a prospective solution to the optimization problem. At each iteration, these particles are evaluated by a fitness function.

Each particle has a position and a velocity, which describe its movement in the search space. In every iteration, the position and velocity of every particle are updated based on its personal best and the global best value. Each particle in the swarm has a memory of its personal best, known as the pbest value, and of the common global best, gbest, obtained so far by any particle in the swarm. Each particle accelerates toward its personal best and the global best position to reach the level of intelligence. Each particle’s new velocity is calculated from its current velocity, the distance from its personal best, and the distance from the common global best position. After n iterations, all particles converge to the best solution.
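
The canonical update just described can be written in a few lines. The sketch below minimizes a simple quadratic function; the inertia weight w and acceleration coefficients c1 and c2 are the usual illustrative settings, not values tied to any cited study.

```python
# A sketch of the canonical PSO velocity/position update.
import numpy as np

def pso(f, dim=2, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=5):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))   # positions
    v = np.zeros_like(x)                         # velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()]                # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[pbest_val.argmin()]
    return g

print(pso(lambda p: (p ** 2).sum()))   # converges near the origin
```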

Ant colony optimization

Ant colony optimization (ACO) is a probabilistic metaheuristic search technique that mimics the process of ants foraging for food. It is used to search for an optimal path in a graph between a source and a destination. The algorithm starts with all ants selecting random paths, where each path represents a prospective solution to the optimization problem. When ants start moving in search of food, they deposit pheromone, a chemical material, on the path. An ant moving in a random direction might encounter a previously laid pheromone trail and decide, probabilistically, to follow it. More ants following the same path increases the pheromone deposited on it, thus reinforcing the probability of the path being followed. In a graph, the shortest path is learned via pheromone trails, but pheromone trails gradually decrease by evaporation. At each iteration, an ant reaches a node and selects the next path based on a transition probability until its overall path converges to an effective path (Kashef and Nezamabadi-Pour, 2015). In feature selection, all features are considered as nodes of a graph and ACO is used to find the optimal path, which provides the optimal feature subset.

Artificial bee colony

Artificial bee colony (ABC) is a stochastic search technique based on swarm intelligence, which mimics the foraging behavior of honey bee swarms. In this algorithm, each candidate solution represents the position of a food source in the search space, and the nectar amount (quality) of the food source is used as a fitness evaluator (Schiezaro and Pedrini, 2013). It involves three groups of bees: employed bees, onlookers, and scouts (Yavuz and Aydin, 2016). The number of employed bees is equal to the number of food sources. Employed bees leave the hive in search of a food source and collect information about the nectar amounts of the other food sources in the vicinity of the discovered one. Once they return to their hive, they perform a dance through which they communicate the location and quality of the explored food source to the onlookers. Onlookers recruit a new food source based on the information provided by the employed bees, using a selection probability proportional to the nectar amount, and abandon food sources of low fitness value. Once an onlooker picks a new food source to explore, it becomes an employed bee. Once an employed bee’s food source is abandoned, it becomes a scout bee, which performs a random search for a food source in the search space. This process is repeated until the optimal food source is found.

Grey wolf optimization

Grey wolf optimization (GWO) is a metaheuristic algorithm inspired by the leadership and hunting behavior of grey wolves (Mirjalili et al., 2014). The algorithm classifies a population of possible solutions into four types of wolves: α, β, δ, and ω. The four types are ordered by solution fitness, which means that α is considered the best solution and ω the worst. The new generation is created by updating the wolves in each of these four groups. This update is based on the three best solutions obtained from α, β, and δ in the previous generation.

Artificial immune system algorithms

Several algorithms that mimic the artificial immune system (AIS) have been published in the literature (Watkins and Boggess, 2002). There are two main approaches considered in the proposed AIS algorithms. The first approach is negative selection, in which the algorithm’s main task is to define whether an instance belongs to the trained model or not. This approach has been widely used for anomaly detection (Jinquan et al., 2009). The second approach is positive selection, in which the algorithm’s main task is to identify each one of the training instances as a detector and assign a radius to it. The matching phase will then examine the test instance and check if it belongs to any one of the detectors’ zones.

Gravitational search optimization

Gravitational search optimization (GSO) uses a collection of masses as a representation of candidate solutions and applies Newtonian physics to create the next generation based on the law of gravity and the notion of mass interactions (Rashedi et al., 2009). In GSO, the relative distances and masses of the candidate solutions play the major role in attracting these candidate solutions to each other in the search space. Although the authors claim that mass interaction provides an effective way of communication between the possible solutions to transfer information through the gravitational force, this concept was questioned in a later study by Gauci et al. (2012), who spotted a fundamental inconsistency in the mathematical formulation of GSO and showed that the distance between possible solutions was not taken into account in creating the next generation. Hence, GSO cannot be considered to be based on the laws of gravity.

4.3: Embedded approach

The embedded methodology performs feature selection as part of the training process. In comparison to the wrapper approach, these methods integrate selection into the learning process and thereby lower the computational cost (Guyon and Elisseeff, 2003; Chandrashekar and Sahin, 2014). During the learning phase of model construction, they identify the features that best fit the model based on different independent measures; they then use the learning algorithm to select the final feature subset that provides the best performance. Decision trees such as C4.5 and random forest are some of the commonly used embedded methods in classification. In regression analysis, embedded methods such as ridge regression, elastic net, and LASSO perform feature weighting based on different regularization models, shrinking feature coefficients toward, or exactly to, zero (Tibshirani, 1994; Zou and Hastie, 2005; Yang et al., 2015).
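
The sketch below shows this embedded behavior with LASSO: as part of fitting the model, the L1 penalty drives the coefficients of irrelevant features exactly to zero, so the surviving features are the selected subset. The synthetic data and the penalty strength alpha are illustrative.

```python
# Embedded selection with LASSO; alpha and the data are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)    # alpha controls the penalty strength
print(model.coef_.round(2))           # features 0 and 4 should remain nonzero
print("selected:", np.flatnonzero(model.coef_))
```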

5: Machine learning classifiers

This section explores possible classification solutions. We first explain the difference between using one-class classification (OCC) versus multiclass classification. Following that, we study the feasibility of using supervised versus unsupervised learning algorithms to identify hidden patterns in health datasets.

5.1: One-class vs. multiclass classification

OCC or unary classification is different from binary/multiclass classification, as it trains a model with data objects belonging to only one class, called the target class, in the presence of no or limited outlier distribution. The outliers in the data objects are identified by error measurements of the feature values compared to other target class objects. OCC provides a solution for classifying new data by defining a decision boundary around the target class, such that it only accepts the target class objects while minimizing the probability of accepting outlier objects.

In the literature, many machine learning applications, such as outlier detection, novelty detection, and anomaly detection, originated from a similar concept (Ritter and Gallegos, 1997; Bishop, 1993; Pauwels and Ambekar, 2011). Several algorithms have been developed based on SVM and ELM to address OCC problems (Tax and Duin, 2004; Schölkopf et al., 1999; Gautam and Tiwari, 2016).
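
As a small illustration, the sketch below trains scikit-learn's one-class SVM on target-class data only and then classifies new points as inliers or outliers; the data and the nu parameter are illustrative.

```python
# One-class classification: train on the target class only, flag the rest.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
target = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # target class only
occ = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(target)

print(occ.predict([[0.1, -0.2]]))   # +1: accepted as target class
print(occ.predict([[6.0, 6.0]]))    # -1: rejected as an outlier
```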

In binary/multiclass classification, a model is trained to classify data objects into two or more classes, where each object is assigned only one class label. The trained model then classifies new data by defining a decision boundary around each class, such that it only accepts the associated class objects while minimizing the probability of accepting other class objects. In a classification problem, the associated (optimal) class membership of a data object is predicted based on the probabilistic distribution of its association with each class.

5.2: Supervised vs. unsupervised learning

Supervised machine learning trains a model with an input X which represents a set of data objects and a labeled output Y which represents a categorical or continuous target value to construct a hypothesis function which can be later used to predict the output value for new data objects. Supervised learning algorithms address both classification and regression problems in machine learning. It is different from unsupervised learning, as the predictive model learns from the input as well as its corresponding output value.

In unsupervised learning, a model constructs a hypothesis based on input data X with no labeled output. Clustering is a conventional unsupervised machine learning algorithm, which groups observations to identify hidden patterns in the input data.

6: Case studies

6.1: Experimental setup

6.2: Case study 1: Diabetes data analysis

Diabetes is a disease that occurs when blood glucose reaches a very high level. The food we consume gets converted into blood glucose, which is our primary source of energy. Insulin, a hormone produced by the pancreas, helps the glucose from food get into our cells to be utilized as energy. Sometimes our bodies do not make enough insulin or do not use insulin well; the glucose then stays in our blood and cannot reach our cells (Health Information, 2016). According to the Public Health Agency of Canada, one in seven Canadians is affected by the disease, and by 2050 about one in three will be afflicted (Taylor, 2016). The condition is progressing very rapidly, resulting in an overflow of hospital visits. Hospitals are experiencing capacity and resource issues that profoundly affect performance. Aside from hospital visits, there would be an overflow of clinic visits to general and specialist physicians. It would be beneficial to have a reliable method for predicting how many diabetic patients are expected to be hospitalized and, in more general terms, to predict future trends in the healthcare industry.

Data mining is a method of data analysis that allows analysts to uncover hidden patterns in datasets. Several software programs are currently used for data analytics tasks, such as Weka, IBM Watson Analytics, and Alteryx. In this section, we used Weka to conduct our analysis. Weka is a collection of machine learning algorithms for data mining tasks. It can either be applied directly to a dataset or called from separate Java code. Weka features include machine learning, data mining, preprocessing, classification, regression, clustering, association rules, experiments, and more. Weka is written in Java and was developed at the University of Waikato in New Zealand.

In our case study, we downloaded raw datasets from the Public Health Agency of Canada. Although the data are huge and contain information on many prominent chronic diseases, our main focus is diabetes. Our choice to focus on diabetes in this case study was based on how prominent the disease is in today’s society and the significant growth rate associated with it. Our analysis targets predicting the number of diabetic patients who are expected to visit a specific hospital 1 year ahead. The dataset contained information on visits to hospitals for 2000–11 throughout the country. The list of features includes the patients’ age group, number of incident cases per year, number of GP visits per year, number of prevalent cases per year, and mortality per year. To create a generalized model that would be accepted and useful to healthcare administrators, we created six different models for 2006–11, which effectively model the changes and fluctuations established throughout the years. In the data processing step, we converted all data types into strictly numerical values using the NominaltoNumeric filter package supported by the Weka software. After analyzing the dataset correlations, we decided to use the following algorithms to predict the number of diabetic patients who are expected to visit a specific hospital 1 year ahead: linear regression, decision trees, random forest, Naïve Bayes, support vector machine (SVM), and sequential minimal optimization (SMO).

  •  Linear regression is a machine learning algorithm based on supervised learning. It performs a regression task. Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and forecasting (Gupta, 2018). Different regression models differ based on the kind of relationship between dependent and independent variables under consideration and the number of independent variables being used.
  •  Decision tree algorithm belongs to the family of supervised learning algorithms (Lior and Oded, 2005). Unlike other supervised learning algorithms, the decision tree algorithm can be used for solving regression and classification problems too. The goal of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.
  •  Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model’s prediction.
  •  Naïve Bayes is a classification technique based on Bayes theorem with an assumption of independence among predictors (Ray, 2020). Naïve Bayes model is easy to build and particularly useful for very large datasets. Along with simplicity, Naïve Bayes is known to, sometimes, outperform other highly sophisticated classification methods.
  •  SVM is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used for classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best differentiates the two classes.
  •  SMO is a method of decomposition by which an optimization problem of multiple variables is decomposed into a series of subproblems, each optimizing an objective function of a small number of variables, typically only one, while all other variables are treated as constants that remain unchanged in the subproblem. A hedged Python sketch of this workflow follows this list.
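
Our experiments used Weka, but the workflow translates directly to other toolkits. The sketch below is a hedged Python analogue: scikit-learn's SVR solves the support-vector regression problem (its libsvm backend uses an SMO-type decomposition internally), and the feature columns are hypothetical stand-ins for the PHAC attributes listed earlier, with synthetic values.

```python
# A hedged analogue of the Weka workflow; features and values are synthetic.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
# Columns stand in for: age group code, incident cases, GP visits,
# prevalent cases, mortality (all hypothetical).
X = rng.normal(size=(500, 5))
y = 100 + 40 * X[:, 1] + 25 * X[:, 3] + rng.normal(scale=5, size=500)

model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0))
model.fit(X[:400], y[:400])
print("held-out R^2:", round(model.score(X[400:], y[400:]), 3))
```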

Table 1 shows the performance evaluation of different prediction models trained with all features using different machine learning algorithms available within the Weka software. SMO has the best results in terms of accuracy as well as execution time. The great advantage of the SMO approach is that it does not need a quadratic programming solver; the subproblems can instead be solved analytically. As a consequence, it does not need to store a huge matrix, which can cause problems with machine memory. Moreover, SMO uses several heuristics to speed up the computation, which is evident in the execution time. The SMO algorithm is, therefore, used throughout this project.

Table 1

Overview of performance of different algorithms.
Classifier             Average accuracy (%)    Execution time
Linear regression      85.9                    12 min
SVM                    88.6                    6 s
Naïve Bayes            79                      2 s
SVM                    90.3                    3.2 s
SMO                    96.56                   1.3 s
Random forest          88.3                    15 s
Decision trees         70.1                    2 s

Fig. 2 shows an example model produced by the SMO algorithm to predict the number of diabetic patients who are expected to visit the hospital in the year 2006 based on records of the previous 5 years.

Fig. 2 An example model produced by the SMO algorithm to predict the number of diabetic patients who are expected to visit the hospital in 2006 based on records of the previous 5 years.

To evaluate the performance of a regression algorithm, we consider the following factors:

  •  Root mean square error (RMSE) is one of the standard ways to measure the error of a model in predicting quantitative data.

    $$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - x_i)^2} \tag{6}$$

    where n represents the number of predictions being evaluated, and (y_i − x_i) is the difference between the predicted value and the actual value, which is then squared. The summation over all values of the evaluation set is expressed by the sigma.
  •  Mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction.

    $$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - x_i\right| \tag{7}$$

    where n again represents the number of predictions being evaluated. |y_i − x_i| again represents the difference between the predicted value and the actual value, but here it is taken as an absolute value and cannot be negative.

Both MAE and RMSE express the average model prediction error in units of the variable of interest. Both metrics range from 0 to ∞ and are indifferent to the direction of errors. They are negatively oriented scores, which means that lower values indicate better results.
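
Computing Eqs. (6) and (7) is a one-liner each; the sketch below does so for a handful of illustrative predictions.

```python
# Computing Eqs. (6) and (7); the patient counts are illustrative.
import numpy as np

actual = np.array([120, 135, 150, 160])
predicted = np.array([118, 140, 149, 166])

rmse = np.sqrt(np.mean((predicted - actual) ** 2))   # Eq. (6)
mae = np.mean(np.abs(predicted - actual))            # Eq. (7)
print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```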

Although the Weka software provided the RMSE and MAE values for each model, it is still crucial to validate the models statistically, which was done in Microsoft Excel. We calculated the difference between the actual value of a given year and the predicted value for the same year. We were able to calculate the absolute error, average accuracy, and residuals. Using these values, we can manually calculate RMSE and MAE. We were also able to create a graphical representation of the error and accuracy. Fig. 3 shows a graphical representation of the accuracy (top) and the error (bottom) for the 2011 model when calculated manually.

Fig. 3 A graphical representation of the accuracy (top) and the error (bottom) for the 2011 model when calculated manually.

In this case study, we created an individual model to predict the number of diabetic patients who are expected to visit the hospital in a specific year based on the previous 5-year records. Since we have records from 2006 to 2011, we were able to create six different models, one to predict each year. Our goal was to create a more generalized model that can be used for any year based on the previous 5-year records. Creating such a generalized model while maintaining a low error rate was a challenging task. The normalized coefficients that were multiplied by the features were close in range; therefore, the method of choice was to average these coefficients over all the years and use the averaged values as the coefficients of the generalized model.
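
The sketch below illustrates that averaging step; the per-year coefficient values are invented for demonstration and are not the actual model coefficients.

```python
# Averaging per-year coefficients into a generalized linear model;
# the yearly values below are illustrative placeholders.
import numpy as np

# Rows: models for 2006-11; columns: coefficients of the five features.
yearly_coefs = np.array([
    [0.42, 0.31, 0.18, 0.22, 0.05],
    [0.45, 0.29, 0.20, 0.21, 0.06],
    [0.43, 0.30, 0.17, 0.23, 0.05],
    [0.44, 0.32, 0.19, 0.20, 0.04],
    [0.41, 0.30, 0.18, 0.22, 0.06],
    [0.44, 0.31, 0.19, 0.21, 0.05],
])

general = yearly_coefs.mean(axis=0)   # coefficients of the generalized model
print(general.round(3))

# Prediction with the generalized model is then a dot product with the features.
example_features = np.array([1.0, 0.8, 1.2, 0.9, 1.1])
print(float(np.dot(general, example_features)))
```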

Once the generalized model is created, a cross-validation process is done to calculate the RMSE, MAE, error rates, and accuracy of this new model. We tested the new model and compared it with the individual models. Fig. 4 shows the accuracy (top) and the error (bottom) for the 2011 model when using the generalized model.

Fig. 4 A graphical representation of the accuracy (top) and the error (bottom) for the generalized model when tested to predict the number of patients in 2011.

The average accuracy of the generalized model across the separate years is 93%, compared to an average accuracy of 96% using each individualized model. When working with these types of scenarios, we would expect the accuracy to drop much further when creating a general model. However, in our case, the model still shows an acceptable performance and could be used in later years.

7: Case study 2: COVID-19 data analysis

On December 31, 2019, a cluster of cases of pneumonia was reported in Wuhan, China, and the cause was confirmed as a new coronavirus that had not previously been identified in humans. This virus is now known as COVID-19 (Novel Coronavirus, 2020). Confirmed cases of COVID-19 have since been identified in many countries, including Canada. The situation is evolving every day; new information becomes available daily, and a clearer picture forms as this information is analyzed by researchers in provincial, national, and international health agencies. Confirmed cases, recovered cases, and the mortality rate vary on a daily basis. Machine learning algorithms are used to identify possible patterns of the spread of the disease and to predict the number of patients who will be hospitalized. However, it is important to understand that COVID-19 was a new disease at the time of writing this chapter, and there were very limited datasets to help analysts predict the numbers accurately. One of the main sources is the open-source dataset by Johns Hopkins University, which is used throughout this section of the study. The data provide a snapshot of confirmed cases, recovered cases, and the mortality rate for each individual country (COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, 2020). It is important to highlight that this dataset is constantly being updated. Due to the limited size of the available data records, analysts were only able to predict accurate numbers on a weekly basis. In this study, we developed a regression model to predict the number of confirmed cases for each country.

The first step, as in the previous case study, involved preprocessing the dataset to prepare it for use in the Weka software. We started by using the data of all cases reported in the previous 60 days to create our model. However, we noticed that the records were not completely accurate during the first few days of the pandemic, so we focused on the data records from when countries started to conduct regular tests on citizens to detect the coronavirus. Since the original dataset included every country worldwide whether it contained cases or not, we removed all countries that did not have cases or whose data were insufficient or inconclusive, as they would interfere with the accuracy of our prediction. Our main objective was to create a model that would produce an accurate prediction of confirmed cases for each day in the week ahead. In this case study, we used linear regression, SMO, multilayer perceptron, random forest, and finally locally weighted learning (LWL); a minimal sketch of the workflow follows the algorithm descriptions below.

  •  Multilayer perceptron (MLP) is a class of feed-forward artificial neural network (ANN). MLP utilizes a supervised learning technique called backpropagation for training (Nicholson, 2019). Its multiple layers and nonlinear activations distinguish MLP from a linear perceptron, and it can distinguish data that are not linearly separable.
  •  LWL is a class of function approximation techniques, where a prediction is made using an approximated local model around the current point of interest. The goal of function approximation and regression is to find the underlying relationship between input and output (Rahul, 2019). In a supervised learning problem, training data, where each input is associated with one output, are used to create a model that predicts values close to the true function.
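
The sketch below illustrates the simplest version of this workflow: fitting a linear trend to one country's cumulative-case series and projecting it a week ahead. The counts are synthetic stand-ins, not actual JHU CSSE values.

```python
# Short-term forecasting sketch; the case counts are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.arange(60).reshape(-1, 1)   # days since monitoring began
cases = 50 * np.arange(60) + np.random.default_rng(9).normal(0, 40, 60).cumsum()

model = LinearRegression().fit(days, cases)            # fit the rising trend
next_week = model.predict(np.arange(60, 67).reshape(-1, 1))
print(next_week.round(0))                              # 1-week-ahead confirmed cases
```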

Table 2 presents the results of applying different algorithms on the dataset to predict the number of confirmed cases 1 week ahead. Linear regression achieved the best performance because it attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

Table 2

A performance summary of different algorithms applied on the COVID-19 dataset.
Classifier               Average accuracy (%)    Execution time (s)
Linear regression        93.67                   0.03
LWL                      67.12                   1.85
Multilayer perceptron    70.34                   0.12
SMO                      84.59                   0.08
Random forest            46.47                   31

It is important to note that this prediction works well during the rising period of the pandemic, but it does not predict the peak or when the confirmed-cases curve will start to go down. This model was used solely as a short-term prediction model to predict data only 1 week ahead. Other models are used to predict the peak point and when the pandemic will be over.

To test our model, we compared the predicted results for the past week during the period of writing this chapter to the actual numbers that were reported by the WHO. This allowed us to compare our results to currently existing data, and confirm the accuracy of our forecast. Table 3 summarizes the results and reports the error rate.

Table 3

The accuracy and error rate for the dates May 25, 2020 to May 31, 2020.
Date            Accuracy (%)    Error rate (%)    Execution time (s)
May 25, 2020    93.7115         6.2885            0.03
May 26, 2020    94.536          5.464             0.04
May 27, 2020    96.278          3.723             0.04
May 28, 2020    92.88           7.124             0.03
May 29, 2020    95.746          4.254             0.03
May 30, 2020    95.7477         4.253             0.02
May 31, 2020    93.599          6.401             0.03
Average         94.643          5.357             0.0314

Note: The average accuracy of all the individualized models is 94.6%.

Fig. 5 presents the accuracy and error rate when predicting the number of confirmed cases in each country for May 31, 2020. The accuracy and error rate percentages vary based on the model’s response to each country’s dataset. As shown in the figure, the model performs well for countries that have enough historical records to help it predict future values.

Fig. 5 The accuracy and error rate when predicting the number of confirmed cases in each country for May 31, 2020.

At the time of writing this chapter, COVID-19 was a serious health threat, and the situation was evolving daily. The risk varies between and within communities. Having the ability to develop predictive models allows data analysts to report accurate and efficient forecasts that will assist healthcare administrators in preparing accommodations, resources, and other variables essential in the fight against this pandemic. We can also apply these methods to aid in the reopening of the economy, airlines, and other sectors affected by the disease.

8: Summary and future directions

The value of machine learning in health care is its ability to learn from huge datasets beyond the scope of human capability and then uncover hidden patterns in the data. In healthcare applications, machine learning can provide clinical insights that aid physicians in planning and providing care, ultimately leading to better outcomes, lower costs of care, and increased patient satisfaction. Health care needs to move from thinking of machine learning as a futuristic concept to seeing it as a real-world tool that can be deployed today. If machine learning is to have a role in health care, then we must take an incremental approach: healthcare administrators should start utilizing data analytics and use predictive models as essential elements in the decision-making process. This chapter discussed and explored opportunities to apply machine learning methods in the healthcare sector. We plan to extend our analysis to other datasets in the healthcare sector and study the correlation between the quality of healthcare services and the social and economic factors within the community.

There are two possible directions that we plan to investigate. First, exploring the correlation between the rate of disease spread during a pandemic and government spending on public education. Second, analyzing the possibility of creating travel bubbles between countries without enforcing a quarantine period on travelers, and how that affects public safety in each country.
