Chapter 23: Machine learning in precision medicine

Dipankar Sengupta    PGJCCR, Queens University Belfast, Belfast, United Kingdom

Abstract

In recent years, the healthcare industry has made great advances with the inclusion of poly-omics data alongside the data from traditional heuristic methods. Analyzing this growing multidimensional data is helping to build knowledge and a better understanding of disease, i.e., changes or variations from the macro (cellular) to the micro (genes/proteins/metabolites) level, which can be correlated with the patient’s phenotype. This is particularly encouraging for domains like cancer biology, where the underlying causes are complex and heterogeneous. Deciding on an appropriate treatment is also challenging for such diseases, as a one-standard-treatment strategy does not fit all patients. Targeted therapies have tried to address this, but there are still patient strata who either do not respond or have disease recurrence within a few years.

Precision medicine is trying to address these challenges with data-driven approaches, combining the expertise of bioinformatics, clinical science, and machine learning to support informed decisions, the generation of relevant knowledge, and its clinical translation. This chapter gives an overview of how machine learning is used in precision medicine and discusses its potential use in detection, diagnosis, prognosis, risk assessment, prediction of therapy response, and the discovery of new biomarkers and drug candidates.

Keywords

Biomarker; Cancer; Diagnosis; Machine learning; Omics data; Precision medicine; Prognosis

1: Precision medicine

Sir William Osler observed, “Variability is the law of life, and as no two faces are the same, so no two bodies are alike, and no two individuals react alike and behave alike under the abnormal conditions which we know as a disease” (Osler, 1903). This aptly summarizes the challenge of understanding the underlying mechanism of a disease, its regulation, and its treatment in medicine and the associated sciences. Diseases like cancer involve complex underlying causes among biological processes ranging from the molecular to the cellular level. The genetic variations involved may not always map one-to-one to the patient phenotype. Different genetic aberrations may result in similar or different phenotypes, making it difficult to recognize the specific cause of the disease (Kim et al., 2016; Mardinoglu and Nielsen, 2012). Therefore, a treatment used against a particular disease often does not work for all patients (Fig. 1).

Fig. 1
Fig. 1 The majority of current treatment protocols follow a one-treatment-for-all approach, i.e., the treatment is designed for a standard patient based on the knowledge base of the particular disease. In this scenario, some patients benefit, whereas there are sub-groups in the population who get no relief or may have adverse side-effects.

In 1999, Francis Collins, one of the pioneers of the Human Genome Project, laid out the foundation for precision medicine (Collins, 1999). The following year, he explained at a news conference how the completion of the Human Genome Project would accelerate precision medicine, leading to the complete transformation of therapeutic medicine (Wade, 2010). This gave the base concept for improving patient care: planning the treatment based on the maximum information gained for a patient along with the existing knowledge base of the disease. In the last 20 years or so, the study of genomes and other omics disciplines (proteomics, metabolomics, transcriptomics, etc.) has advanced rapidly and is being routinely adopted alongside clinical examinations for decision-making.

Precision medicine can best be described as “an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person” (Ashley, 2015; Collins and Varmus, 2015; Hodson, 2016). While the patient-doctor relationship remains at the core, new biomedical information may add substantially to what can be learned from observable signs and symptoms (König et al., 2017). Thus, precision medicine implies harnessing this wide array of patient data, including clinical, poly-omics (genomic, transcriptomic, proteomic, metabolomic, epigenomic, etc.), and lifestyle information (Pinho, 2017) (Fig. 2). This gives the opportunity to identify sub-populations that vary in their prognosis and treatment response, by understanding the differences in their biology for that particular disease (Uddin et al., 2019).

Fig. 2
Fig. 2 Precision medicine aims to harness a patient’s clinical data along with poly-omics, environment, and lifestyle data to develop treatment strategies for sub-cohorts that benefit them without adverse side-effects. The aim is to deliver the right treatment to a patient by identifying the differences in underlying biology.

The major contributor to this has been the rapid advancement of genomic and computational technologies. With the cost of genetic tests plunging, new targeted therapies are making it increasingly possible to prevent or treat illnesses based on an individual patient’s characteristics (Love-Koh et al., 2018; Weil, 2018). This gives clinicians, bioinformaticians, and biomedical and associated researchers the opportunity to develop new approaches for detection, diagnosis, prognosis, and other applications by analyzing a wide range of biomedical data, including molecular, genomic, cellular, clinical, behavioral, physiological, and environmental features. For example, in “precision oncology,” i.e., precision medicine for cancer, a few of the obstacles currently being addressed are: the unexplained drug resistance observed in certain patient cohorts, the genomic heterogeneity of tumors, risk assessment and means for monitoring responses, tumor recurrence, and the drug combinations to be used for treatment (Collins and Varmus, 2015).

Machine learning combines the strengths of computer science, mathematics, and statistics to provide the computational capability to learn from data. It can therefore serve as a platform that helps precision medicine learn from massive clinical, poly-omics, and other datasets. In this chapter, we review and discuss the basic applications of machine learning to heterogeneous datasets comprising poly-omics and clinical data, focusing on the methods in use and their opportunities in precision medicine. In the following sections, we first discuss machine learning in brief. We then discuss the use of machine learning in precision medicine, with example applications showing its potential use in detection and diagnosis, prognosis, therapy response, and the discovery of new biomarkers and drug candidates. We conclude by looking into a few other challenges that precision medicine could help address.

2: Machine learning

As modern computing evolved in the mid-20th century, the idea of “thinking machines” was prompted by the advent and evolution of the “electronic or digital computer” (Turing, 1950). The aim was to apply this computational capacity to elucidating patterns and inferences from datasets, which is difficult to achieve with conventional statistical methods. In 1959, Arthur Samuel coined the term “Machine Learning,” explaining how, given only a minimal set of parameters, a computer can be programmed so that it learns to play a better game of checkers than the person who wrote the program (Samuel, 1959). Since the 1960s, machine learning has been at the forefront of enabling learning from data. In the 1990s, Tom Mitchell framed the focus of machine learning as developing a computer program which “is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” (Mitchell, 1997). A simple example would be a computer program that predicts what kind of television program a person likes based on parameters like age, gender, occupation, and geographical location. Similarly, based on a customer’s bank transaction history, a program could identify and/or predict fraudulent transactions. In both scenarios, the computer program is considered to learn from the available data, and its prediction performance is expected to improve as more data are made available to it.
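Mitchell's definition can be made concrete with a small sketch. The example below is illustrative only: the synthetic one-dimensional data, the nearest-mean classifier, and all function names are assumptions made for this sketch and do not come from the cited works. The task T is classifying a point into one of two classes, the experience E is a set of labeled training points, and the performance measure P is accuracy on held-out data.

```python
import random

def make_sample(rng):
    """Draw one labeled point: class 0 is centered at 0.0, class 1 at 3.0."""
    label = rng.randint(0, 1)
    return rng.gauss(3.0 * label, 1.0), label

def fit_centroids(train):
    """Experience E: estimate one mean per class from labeled examples."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        # If a class is missing from a tiny sample, fall back to the overall mean.
        source = values or [x for x, _ in train]
        means[label] = sum(source) / len(source)
    return means

def predict(means, x):
    """Task T: assign x to the class whose mean is nearest."""
    return min(means, key=lambda label: abs(x - means[label]))

def accuracy(means, data):
    """Performance measure P: fraction of correct predictions."""
    return sum(predict(means, x) == y for x, y in data) / len(data)

rng = random.Random(0)
test_set = [make_sample(rng) for _ in range(2000)]
pool = [make_sample(rng) for _ in range(2000)]

# Train on increasingly large slices of the pool and report P each time.
for n in (2, 10, 100, 2000):
    print(n, round(accuracy(fit_centroids(pool[:n]), test_set), 3))
```

With very few training points the estimated class means can be poor, so accuracy tends to be lower and less stable. As the training slice grows, accuracy typically settles close to the best value achievable for these overlapping classes.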

Machine learning aims to provide learning algorithms (computer programs) that can be applied to input data to predict the corresponding output values, identify patterns and trends within an acceptable performance range, and improve with experience (Fig. 3).

Fig. 3
Fig. 3 Developing a machine learning model involves several steps: preparing the data (data cleaning, selection of features), building models using different learning techniques, and validating and selecting a model by analyzing its performance on a test dataset.
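The steps summarized in Fig. 3 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a recommended pipeline: the synthetic records, the 5% missing-value rate, the class-mean gap used to rank features, and the nearest-centroid models are all choices made for this example.

```python
import random

rng = random.Random(42)

def make_record():
    """Synthetic record: feature 0 carries signal, features 1 and 2 are noise."""
    label = rng.randint(0, 1)
    features = [rng.gauss(2.5 * label, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    if rng.random() < 0.05:                      # simulate a missing measurement
        features[rng.randrange(3)] = None
    return features, label

raw = [make_record() for _ in range(600)]

# Step 1a, data cleaning: drop records with missing values.
clean = [(f, y) for f, y in raw if None not in f]
train, valid, test = clean[:300], clean[300:420], clean[420:]

def class_mean(data, j, label):
    values = [f[j] for f, y in data if y == label]
    return sum(values) / len(values)

# Step 1b, feature selection: rank features by the gap between class means,
# using the training split only.
def gap(j):
    return abs(class_mean(train, j, 1) - class_mean(train, j, 0))

ranked = sorted(range(3), key=gap, reverse=True)

# Step 2, model building: nearest-centroid classifiers on different feature sets.
def fit(data, cols):
    centroids = {label: [class_mean(data, j, label) for j in cols] for label in (0, 1)}
    return cols, centroids

def predict(model, f):
    cols, centroids = model
    def dist(label):
        return sum((f[j] - c) ** 2 for j, c in zip(cols, centroids[label]))
    return min((0, 1), key=dist)

def accuracy(model, data):
    return sum(predict(model, f) == y for f, y in data) / len(data)

candidates = {"top feature only": fit(train, ranked[:1]),
              "all features": fit(train, ranked)}

# Step 3, validation and selection: choose on the validation split,
# then report once on the held-out test split.
best = max(candidates, key=lambda name: accuracy(candidates[name], valid))
print(best, round(accuracy(candidates[best], test), 3))
```

Keeping the test split untouched until the final line is the design point here: feature ranking and model selection only ever see the training and validation data, so the reported test accuracy is not inflated by those choices.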

Depending on how the algorithm learns, machine learning can be broadly categorized into supervised, unsupervised, and semisupervised learning (Ayodele, 2010; Brownlee, 2016; Maetschke et al., 2014). Supervised learning can be defined as using a sample of labeled data (i.e., data with known outputs) to approximate a function that maps inputs to outputs (examples: classification or regression-based learning) (Caruana et al., 2008; Caruana and Niculescu-Mizil, 2006). Unsupervised learning can be defined as learning from data that does not have any labeled outputs; it therefore aims to infer the natural structure present within a set of data points (example: clustering) (Celebi and Aydin, 2016; Ghahramani, 2004). In semisupervised learning, the unlabeled data points are labeled using knowledge learned from a few labeled data points (example: self-training or label propagation) (Sinha, 2014; Xu et al., 2012). Reinforcement learning, in which an agent learns from reward feedback rather than labels, is usually treated as a separate paradigm.
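The difference between the three categories can be illustrated on one toy dataset. The sketch below is an assumption-laden example rather than a reference implementation: the eight data points are made up, the supervised model is a nearest-mean classifier, the unsupervised method is a two-cluster k-means, and labeling unlabeled points from a few labeled ones is shown with simple self-training, a technique chosen for this sketch.

```python
def mean(values):
    return sum(values) / len(values)

def centroids(xs, ys):
    """Mean of the points in each group (groups are given by ys)."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}

def nearest(cents, x):
    """Group whose centroid lies closest to x."""
    return min(cents, key=lambda c: abs(x - cents[c]))

xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]   # made-up measurements
ys = [0, 0, 0, 0, 1, 1, 1, 1]                    # known outputs (labels)

# Supervised: every training point has a label; learn a mapping x -> y.
model = centroids(xs, ys)
print("supervised prediction for 3.5:", nearest(model, 3.5))

# Unsupervised: no labels; infer two groups from the data alone (k-means, k=2).
def two_means(points, iterations=10):
    cents = {0: min(points), 1: max(points)}     # needs two distinct values
    assignment = []
    for _ in range(iterations):
        assignment = [nearest(cents, x) for x in points]
        cents = centroids(points, assignment)
    return assignment

print("clusters:", two_means(xs))

# Semisupervised: only two points are labeled (None marks unlabeled points).
def self_train(points, partial):
    labels = list(partial)
    while None in labels:
        known = [(x, y) for x, y in zip(points, labels) if y is not None]
        cents = centroids([x for x, _ in known], [y for _, y in known])
        # Label the unlabeled point the current model is most confident about.
        i = min((k for k, y in enumerate(labels) if y is None),
                key=lambda k: min(abs(points[k] - c) for c in cents.values()))
        labels[i] = nearest(cents, points[i])
    return labels

print("self-trained labels:", self_train(xs, [0, None, None, None, 1, None, None, None]))
```

On data this cleanly separated, all three approaches recover the same grouping. The practical differences appear when labels are scarce, noisy, or expensive to obtain, which is common for clinical outcomes.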

3: Machine learning in precision medicine

Technological advancements aiding the development of new omics-based diagnostics and therapeutics have the potential to create an unprecedented ability to detect, prevent, plan treatment for, and monitor diseases. Cancer is one such heterogeneous disease with complex biological relationships involving key molecules across the omics space, some of which are uncharacterized while others have unknown context-specific functions. In almost all cancer types, the available treatments are beneficial only for a patient sub-population, whereas for others they may have adverse effects or bring no improvement. Considering genetic variations at the granularity of an individual can therefore help in better understanding the disease and facilitate better patient management. Thus, the need of the hour is technological interventions that can guide individualized prevention, diagnostics, and therapeutics, leading to improved outcomes for all.

In the last 15–20 years, with the completion of the Human Genome Project and advances in omics-based technologies, the volume of data has been ever-increasing. For example, The Cancer Genome Atlas (TCGA) program, a joint initiative of the National Cancer Institute, USA and the National Human Genome Research Institute, USA, commenced in 2006 and has since generated ~2.5 PB (petabytes) of genomic, epigenomic, transcriptomic, proteomic, and clinical data for different cancer types. For clinical advancement and research, the data from this program have been made publicly available via the Genomic Data Commons data portal [https://portal.gdc.cancer.gov/], which handles genomic, epigenomic, transcriptomic, and clinical data (Akbani et al., 2014; Liu et al., 2018; Weinstein et al., 2013); and The Cancer Proteome Atlas portal [https://www.tcpaportal.org/], which manages the protein expression datasets (Li et al., 2013). These poly-omics data are complex in nature and, along with clinical data, provide an opportunity for exploring new insights via data-driven approaches (Filipp, 2019). Learning from these data can have manifold applications in diagnostics, prognostics, and the discovery of new biomarkers or drug candidates. Spurred by advances in computer technologies, machine learning for precision medicine has therefore been a growing area of interest (Handelman et al., 2018; Holzinger, 2014). Supporting these objectives are initiatives like Project Data Sphere [https://data.projectdatasphere.org/projectdatasphere/html/home], which promotes the development of new cancer therapies by providing an open-access platform for sharing and analyzing patient-level data, giving access to more than 150 datasets (Green et al., 2015; Hede, 2013). It also offers an opportunity to collaborate in research programs using machine learning and big data analytics for oncology (Fig. 4).

Fig. 4
Fig. 4 Studying and analyzing poly-omics data along with clinical data and existing knowledge will help in better understanding how a disease is regulated in each patient. This would help in improving existing patient care and management.

Machine learning can be used to analyze omics (genome, epigenome, transcriptome, proteome, metabolome, and microbiome) data together with clinical data and prior knowledge to infer relationships, find patterns, or decipher causal associations, giving insight into pleiotropy, complex interactions, and context-specific behavior. Machine learning models can be trained on the multidimensional poly-omics datasets, along with clinical and other relevant data, to find the relevant genotypic structures, which can subsequently be mapped to the observed phenotype. Such a model may then be used for diagnostic (predicting risk or stage of disease), prognostic (the chances of the patient being treated successfully), and other outcomes for individual patients based on their characteristics (Fig. 5).

Fig. 5
Fig. 5 Novel insights from the application of machine learning to poly-omics and clinical datasets may help bridge genotype–phenotype relationships, for example through the identification of novel biomarkers that can be used for diagnostic, prognostic, or predictive purposes against a disease.
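Before a model can be trained on poly-omics and clinical data together, the per-patient tables have to be brought into a single feature matrix. The sketch below shows one simple way of doing this, often called early integration, where features from all data types are joined on a shared patient identifier. The patient IDs, feature names, values, and phenotype labels are all hypothetical, and real studies would additionally need normalization and handling of missing data.

```python
def integrate(**tables):
    """Join per-patient feature tables on patient ID (early integration).

    Only patients present in every table are kept, and each feature name is
    prefixed with its table name so that names cannot collide.
    """
    shared = set.intersection(*(set(rows) for rows in tables.values()))
    merged = {}
    for patient_id in sorted(shared):
        row = {}
        for name, rows in tables.items():
            for feature, value in rows[patient_id].items():
                row[f"{name}:{feature}"] = value
        merged[patient_id] = row
    return merged

# Hypothetical toy data; identifiers, feature names, and values are made up.
clinical = {"P1": {"age": 61, "stage": 2},
            "P2": {"age": 48, "stage": 1},
            "P3": {"age": 70, "stage": 3}}
expression = {"P1": {"GENE_A": 2.3, "GENE_B": 0.1},
              "P3": {"GENE_A": 0.4, "GENE_B": 1.9}}
methylation = {"P1": {"CpG_1": 0.82}, "P2": {"CpG_1": 0.15}, "P3": {"CpG_1": 0.40}}

merged = integrate(clinical=clinical, expression=expression, methylation=methylation)

# Genotype-side input: one row per patient, one column per feature.
columns = sorted(next(iter(merged.values())))
matrix = [[merged[p][c] for c in columns] for p in merged]

# Phenotype-side output: hypothetical labels for the patients that were kept.
phenotype = {"P1": "responder", "P3": "non-responder"}
labels = [phenotype[p] for p in merged]

print(columns)
print(matrix, labels)
```

Patient P2 is dropped because no expression profile is available for that patient. Whether to drop such patients or to impute the missing data type is a study-specific decision that this sketch does not address.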

In comparison to the current process of treatment based on symptom-based classification (Fig. 1), this would give clinicians the opportunity to tailor interventions to patients (Fig. 2).

Globally, various research groups and companies like Google and IBM are already exploring the opportunities of machine learning-centered advances for precision medicine. Recently, there have been ground-breaking studies illustrating what these approaches can achieve, for example, the prognosis of lung cancer patients based on 9879 histopathological and image-based features, distinguishing short-term from long-term survivors (Yu et al., 2016). In the following sub-sections, we elaborate on applications of machine learning in precision medicine for detection and diagnosis, prognosis, and the discovery of new biomarkers and drug candidates.

3.1: Detection and diagnosis of a disease

In precision medicine, data heterogeneity is a major challenge in the development of early diagnostic applications. Machine learning can help address this, as it assists in extracting relevant knowledge from clinical and omics-based datasets, such as disease-specific clinical-molecular signatures or population-specific group patterns. One illustrative example is the recent development of a classifier for skin lesions (skin cancer) using a single CNN (convolutional neural network), with performance comparable to that of dermatologists (Esteva et al., 2017). Detection and diagnosis of a disease lay the groundwork for clinicians to plan and provide a targeted treatment with minimal or no side-effects, taking into account the patient’s clinical history and medications. In the past five years, numerous machine learning-based efforts have been made toward a better understanding of diseases, facilitating predictive diagnosis in cancer, cardiac arrhythmia, gastroenterology, ophthalmology, and other areas. The resulting genotype–phenotype association analysis would help transform clinical management through early diagnosis and patient stratification, and thus support decision-making on the selection among available drug treatments, treatment alterations, and prognosis, providing personalized care to each patient.

In the past two decades, omics-based technologies have advanced remarkably, with a tremendous impact on the understanding of complex diseases like cancer. With the growing data complexity, machine learning is helping to gain insights that support the development of computational tools for the early diagnosis of different cancer types. This matters because, in diseases like cancer, an early diagnosis improves a patient’s chances of survival. Leukemia, a hematological malignancy, has a high occurrence and prevalence rate, and its early diagnosis is a key challenge. To address this, diagnostic applications have been developed based on CNNs, SVMs (support vector machines), hybrid hierarchical classifiers, and other pattern-based approaches (Salah et al., 2019). Similar studies have explored different supervised machine learning approaches for breast cancer diagnosis, primarily based on histopathological images and mammograms (Gardezi et al., 2019; Yassin et al., 2018). Population-specific diagnostic predictors have also been developed, considering the genetic and physiological differences among human ethnicities. For example, models based on SVM, Least Squares Support Vector Machine (LS-SVM), Artificial Neural Network (ANN), and Random Forest (RF) have been used to detect prostate cancer in Chinese populations using prebiopsy data (Wang et al., 2018).

3.2: Prognosis of a disease

Disease prognosis can only be performed after a medical diagnosis has been made. It primarily focuses on the prediction of susceptibility (i.e., risk assessment), recurrence, and survival of a patient with respect to a particular disease. In machine learning terms, susceptibility is the problem of predicting the likelihood of developing a disease prior to its occurrence; recurrence is the problem of predicting the likelihood of the disease returning after treatment; and survival is the problem of predicting an outcome after diagnosis in terms of life expectancy, survivability, and disease progression. To address these challenges, prognostic predictions need to consider factors besides the clinical diagnosis. For example, in cancer, a prognosis usually involves varied subsets of biomarkers along with clinical factors, including the age and general health of the patient, the location and type of cancer, and the grade and size of the tumor. Combining genetic factors, like somatic mutations in carcinoma genes, the expression of specific tumor proteins, and the environment of tumor cells, with the clinical data increases the robustness of cancer prognosis predictions (Cochran, 1997; Edge and Compton, 2010; Fielding et al., 1992; Gress et al., 2017). In 2016, the American Joint Committee on Cancer (AJCC) identified the need for personalized probabilistic predictions and described the necessary characteristics, setting guidelines to help in developing prognostic applications (Kattan et al., 2016).

The majority of prognostic applications built with machine learning use supervised learning, which is based on conditional probabilities. Cruz and Wishart have thoroughly reviewed the different machine learning methods used in cancer prognosis (Cruz and Wishart, 2006). In general, artificial neural networks (Rumelhart et al., 1986), decision trees (Quinlan, 1986), genetic algorithms (Sastry et al., 2005), linear discriminant analysis (Duda et al., 2001), and nearest neighbor algorithms (Barber and Barber, 2012) have been the most frequently used algorithms for this purpose. However, for a particular cancer type, the diagnostic accuracy of such models is important for their adoption in clinical settings. For example, a meta-analysis performed to determine the diagnostic accuracy of machine learning algorithms for breast cancer risk prognosis showed an SVM-based model to be the best performing one (Nindrea et al., 2018).

In recent years, deep learning has proven to be an effective method for uncovering novel findings from heterogeneous datasets. DeepSurv, a framework that combines deep learning with the Cox proportional hazards survival method and is designed for modeling interactions between a patient’s covariates and treatment effectiveness, has been used for treatment recommendations (Katzman et al., 2018). Using a similar combination of algorithms, GDP (Group lasso regularized Deep learning for cancer Prognosis) was developed, which uses clinical and poly-omics data for survival prediction (Xie et al., 2019). A recent study demonstrated the significance of genetic factors in prognosis (Ming et al., 2019). It showed a remarkable difference in the predictive accuracy for breast cancer risk prognosis between geographically distinct populations. For the US-based population, a combination of adaptive boosting with random forest gave the best performance, whereas for the Swiss-based population it was adaptive boosting with a Markov chain Monte Carlo generalized linear mixed model.
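The Cox survival method mentioned above in connection with DeepSurv and GDP scores a model by how well its predicted risks order the observed events. The sketch below computes the standard negative log partial likelihood of a Cox model for a given set of risk scores. It is a minimal illustration of that objective only: the function name and the toy data are assumptions, and the network architectures, regularization, and optimization used in the cited frameworks are not reproduced here.

```python
import math

def cox_negative_log_partial_likelihood(times, events, scores):
    """Negative log partial likelihood of a Cox model.

    times  : observed follow-up time per patient
    events : 1 if the event (e.g., death) was observed, 0 if censored
    scores : predicted log-risk per patient (higher means higher risk)
    Ties are handled Breslow-style: the risk set of patient i contains
    every patient whose time is >= times[i].
    """
    loss = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients contribute only through risk sets
        risk_set = [math.exp(s) for t, s in zip(times, scores) if t >= t_i]
        loss -= scores[i] - math.log(sum(risk_set))
    return loss

# Toy cohort: three patients, all with observed events at times 1, 2, and 3.
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]

uninformative = cox_negative_log_partial_likelihood(times, events, [0.0, 0.0, 0.0])
concordant = cox_negative_log_partial_likelihood(times, events, [2.0, 1.0, 0.0])
discordant = cox_negative_log_partial_likelihood(times, events, [0.0, 1.0, 2.0])

# Scores that rank earlier events as higher risk should give the lowest loss.
print(concordant < uninformative < discordant)
```

In a deep learning setting, the scores would be the outputs of a network applied to each patient's covariates, and this loss would be minimized with respect to the network weights. A linear Cox model is the special case where the score is a weighted sum of the covariates.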

3.3: Discovery of biomarkers and drug candidates

The definition of a biomarker has evolved with time, and it may be best defined as a “characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacological responses to a therapeutic intervention” (Atkinson et al., 2001). In healthcare, the development and use of biomarkers have influenced many aspects of how diseases are understood and how the corresponding patients are managed. Therefore, the role of biomarkers in the development of precision medicine provides a strategic opportunity for technological developments to improve healthcare (Slikker, 2018). Machine learning in these multidimensional data settings can accelerate the early stages of the biomarker discovery process, inform the development of biomarker-driven therapeutic strategies, and give insights to enable better planning of patient treatment pathways. Translating candidate biomarkers into clinical applications is laborious and very costly, with a high attrition rate. A study by pharmaceutical and diagnostic industry experts suggests that an average of $100 M is spent on developing and commercializing a new biomarker-based technology, with 55% of the amount used in the initial phases associated with candidate identification, development, and validation (Graaf et al., 2018). Therefore, it is of utmost importance to select the candidate biomarker that has the best chance of successful adoption in the clinical setting.

In the last 5–10 years, many cancer clinical studies have explored biomarker discovery in relation to intra- and inter-tumor heterogeneity, as well as clonal and sub-clonal evolution in response to different treatments in tumor samples, to devise novel individualized and adaptive management strategies (Collins et al., 2017). For example, in lung cancer, attempts have been made to identify biomarkers by integrating different types of molecules into a biomarker panel along with the patient clinical data, in order to differentiate lung nodules into noncancer, cancer with poor survival probability, and cancer with higher survival probability subtypes (Vargas and Harris, 2016). There have also been studies investigating tissue-specific biomarkers of human muscle with an age prediction task using ElasticNet, Support Vector Machines, k-Nearest Neighbors, Random Forests, and feed-forward neural networks (Mamoshina et al., 2018).

This process may also support drug discovery, as it could help to identify endotypes, i.e., groups of patients having the same underlying cause for a disease. The process benefits from leveraging the vast available collection of patient-level data and the knowledge gained from advanced poly-omics analysis, thus reducing the need for wet-lab experiments. The key feature is capturing patient heterogeneity along with the underlying biology, leading to patient stratification. This provides an opportunity for early identification of responder patients, helping in the design of effective clinical trials. Overall, this would help to reduce the patient attrition rate, especially in phase 2 and 3 trials. Companies like Deep Genomics (a Canada-based start-up) have already begun developing platforms using machine learning to lessen the amount of costly trial and error in drug discovery (Home | Deep Genomics, n.d.).

4: Future opportunities

The application of machine learning in precision medicine presents the distinctive challenge of making a model clinically interpretable so that it can be adopted in clinical settings. Computationally, this poses a unique research challenge of striking a balance between the performance and the interpretability of the model and its results. For example, a deep learning model may be able to effectively distinguish a diseased from a healthy person, but clinically it will be imperative to know which features or parameters these predictions are based on, and why.
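One common, model-agnostic way to ask which features a prediction model relies on is permutation importance: shuffle one feature across patients and measure how much the model's accuracy drops. The sketch below illustrates the idea under stated assumptions. The synthetic data, the fixed decision rule standing in for a trained black-box model, and all names are hypothetical, and this is only one of several interpretability techniques, not a method prescribed by the chapter.

```python
import random

rng = random.Random(7)

def make_row():
    """Synthetic patient: feature 0 separates the classes, feature 1 is noise."""
    label = rng.randint(0, 1)
    return [rng.gauss(2.5 * label, 1.0), rng.gauss(0.0, 1.0)], label

data = [make_row() for _ in range(1000)]

def model(features):
    """Stand-in for a trained black-box model (a fixed, hypothetical rule)."""
    return 1 if features[0] + 0.1 * features[1] > 1.25 else 0

def accuracy(rows):
    return sum(model(f) == y for f, y in rows) / len(rows)

def permutation_importance(rows, column):
    """Accuracy lost when one feature column is shuffled across patients."""
    shuffled = [f[column] for f, _ in rows]
    rng.shuffle(shuffled)
    permuted = []
    for (f, y), value in zip(rows, shuffled):
        g = list(f)            # copy so the original rows stay untouched
        g[column] = value
        permuted.append((g, y))
    return accuracy(rows) - accuracy(permuted)

importance = [permutation_importance(data, j) for j in range(2)]
print([round(v, 3) for v in importance])
```

A large drop for feature 0 and a drop near zero for feature 1 indicate that the model's predictions depend almost entirely on feature 0. Such a summary says which inputs matter to the model, but not why they matter biologically, so it complements rather than replaces clinical validation.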

Also, from what is known about precision medicine, it is clear that tailoring patient care requires the domain to explore varied data dimensionalities. Patient care depends on multiple factors, including clinical (patient history, observed features, diagnostic measures, etc.), genetic (genome, proteome, metabolome, etc.), and, less clearly understood but deserving in-depth study, wellness (behavior, emotional problems, stress, social support, etc.), along with lifestyle and environmental factors (smoking, alcohol, drugs, malnutrition, etc.). These additional factors may help in understanding why some people, even with a disease genotype, do not develop a disease, or why a sub-group shows a better prognosis. The role of a few factors, like smoking (active and passive), is well established in the triggering, expression, and progression of cancer, cardiovascular, and other diseases. But could emotions like “happiness or hope” also play a role that can be associated with the genetic or clinical expression of a disease? Only time and further research in this domain can answer such questions, but it certainly gives an opportunity for an integrative approach to explore and analyze these data dimensionalities (Fig. 6).

Fig. 6
Fig. 6 Precision medicine aims to bring together an individual’s different data types with the aim of providing better patient care. Treatment can thus be planned based on a patient’s detailed history, which includes clinical and genetic information as well as information on lifestyle, wellbeing, and the environmental factors the individual is exposed to (pre- and postnatal).

5: Conclusions

Owing to advances in computational power, theoretical understanding, and an ever-increasing amount of data, the last decade has witnessed widespread applications of machine learning in every major field of human society, including medicine and healthcare. In the 21st century, heterogeneous and complex diseases like cancer, cardiovascular diseases, and rare genetic disorders are among the major and pressing medical problems. The holistic approach offered by precision medicine will likely become increasingly important in order to provide effective treatment and management strategies for such diseases. Machine learning is currently a vital tool in precision medicine and is almost certain to remain so in the future. It can effectively handle massive poly-omics datasets and help solve a number of analytical problems within precision medicine, in many cases better than conventional statistical methods. As discussed in this chapter, the state-of-the-art algorithms and analytical techniques offered by machine learning have shown a wide range of applications in precision medicine. Machine learning thus presently offers a diverse and effective tool set for precision medicine research, a tool set that will grow and improve in the future.
