Chapter 3: Data types and resources

Stephanie Kay Ashenden a; Sumit Deswal b; Krishna C. Bulusu c; Aleksandra Bartosik d; Khader Shameer e
a Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, United Kingdom
b Genome Engineering, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
c Bioinformatics and Data Science, Translational Medicine, Oncology R&D, AstraZeneca, Cambridge, United Kingdom
d Clinical Data and Insights, Biopharmaceuticals R&D, AstraZeneca, Warsaw, Poland
e AI and Analytics, Data Science and Artificial Intelligence, Biopharma R&D, AstraZeneca, Gaithersburg, MD, United States

Abstract

Recent innovation in the field of machine learning has been enabled by the confluence of three advances: the rapid expansion of affordable computing power in the form of cloud computing environments, the accelerating pace of infrastructure associated with large-scale data collection, and rapid methodological advancements, particularly improvements in neural network architectures. Development and adoption of these advances have lagged in the health care domain, largely because of restrictions around the public use of data and the siloed nature of these datasets with respect to providers, payers, and clinical trial sponsors.

Keywords

Data; Omics; FAIR; Big data; SMILES; InChI

Notes on data

Recent innovation in the field of machine learning has been enabled by the confluence of three advances: the rapid expansion of affordable computing power in the form of cloud computing environments, the accelerating pace of infrastructure associated with large-scale data collection, and rapid methodological advancements, particularly improvements in neural network architectures. Development and adoption of these advances have lagged in the health care domain, largely because of restrictions around the public use of data and the siloed nature of these datasets with respect to providers, payers, and clinical trial sponsors.

There are many different types of data that are relevant to drug discovery and development, each with its own uses, advantages, and disadvantages. The type of data needed for a task depends on a clear understanding of the task at hand. With an increasing amount of data being made available, new challenges continue to arise in integrating (with a purpose), using, and comparing these data. Comparing data is important for capturing a more complete picture of a disease, which is often complex in nature.1 One approach is to ensure that the data are FAIR, meaning Findable, Accessible, Interoperable, and Reusable. A generic workflow for data FAIRification has been published previously2 and comprises seven core steps: identify the objective, analyze the data, analyze the metadata, define a semantic model for the data (and metadata), make the data (and metadata) linkable, host the data somewhere, and then assess the data.2 Other key considerations include assigning licenses and combining with other FAIR data.3

Data integration has been discussed by Zitnik and co-authors.1 There are different integration stages, namely early, intermediate, and late integration.1 Early integration involves transforming the datasets into a single representation, which can then be used as input to a machine learning algorithm. In intermediate integration, many datasets are analyzed and the representations that are shared between them are learnt. In late integration, each dataset has its own model built, and these models are combined by building a further model on the predictions of the previous models.1

Zitnik and co-authors1 also discuss the many challenges in integrating data, such as the sparseness and complexity of biomedical data. The authors note that the data are often biased and/or incomplete. For example, databases containing data manually extracted from papers may be limited to a certain number of journals. ChEMBL4, 5 routinely extracts data from seven journals but also includes data from other journals beyond these seven.6 The authors also note that machine learning can be used for data integration.

However, there are other concerns with integrating data beyond the technical difficulties. Sharing and privacy concerns, especially in relation to clinical data, are a key consideration in the pharmaceutical industry. Nevertheless, sharing clinical trial data is important for improving scientific innovation.7 To this end, attempts have been made to improve clinical data sharing policies and practices,8 but only a small number of companies met such measures, with many failing to share data by a specified deadline or to report all data requests.8 Other approaches include MELLODDY,9 which aims to accelerate drug discovery by allowing pharmaceutical companies to collaborate. MELLODDY notes that huge amounts of data are generated during the drug discovery process, and its hypothesis is that working across data types and partners will improve the predictive power and understanding of models in the drug discovery process.9 The large collection of small molecules with known activities can be used to enhance predictive machine learning models without exposing proprietary information.9

Such large volumes of data are known as big data. In the medical field, this may include omics data, clinical trial data, and data collected in electronic health records. The data can vary in how structured they are and can be valuable resources for information mining and machine learning projects.10 Ishwarappa and Anuradha11 discussed the five Vs of big data and explained that they correspond to:

  •  Volume (the amount of data)
  •  Velocity (how rapidly the data is generated and processed)
  •  Value (what the data can bring)
  •  Veracity (the quality of the data)
  •  Variety (the structure and types of data)

Different types of data are used for different types of analysis and enable a variety of questions to be answered. Below, we discuss some of the key types of data that may be used.

Omics data

Omics studies aim to understand various organisms at the molecular level by studying specific components such as genes or proteins in both experimental and computational ways.12 Such omics include genomics (the study of genes), proteomics (the study of proteins), metabolomics (the study of metabolites), and transcriptomics (concerned with mRNA), as well as more niche omics such as lipidomics and glycomics (Fig. 1). The rise of omics data owes much to technical advances in areas such as sequencing, microarrays, and mass spectrometry,13 and omics data can be used throughout the drug discovery pipeline, for example, for identifying and validating novel drug targets13 and for understanding and interpreting genetic variations in patients for personalized medicine.13, 14

Fig. 1 Branches of the omics studies.

Bioinformatic techniques are used throughout omics studies to analyze the resulting data, make sense of them, and derive hypotheses and conclusions. A wide variety of omics data types are available, as are databases containing useful information that can be exploited throughout the drug discovery process. Below, we summarize the different omics methods and include some of the key databases.

Genomics

Genomics is concerned with understanding the genes within a genome (the human genome is estimated to contain 20,000–25,000 genes) and how those genes interact with each other and with environmental factors.15 Specifically, genomics is concerned with interactions between loci and alleles, as well as other key phenomena such as epistasis (the effect of gene-gene interactions), pleiotropy (the effect of a single gene on multiple traits), and heterosis.

Libbrecht and Noble published an article on the applications of machine learning in genetics and genomics.16 The authors discuss the different uses of supervised, semisupervised, unsupervised, generative, and discriminative approaches to modeling, as well as the uses of machine learning with genetic data.16 They explain that machine learning algorithms can use a wide variety of genomic data and can learn to identify particular elements and patterns in a genetic sequence.16 Furthermore, machine learning can be used to annotate genes in terms of their functions and to understand the mechanisms behind gene expression.16

Transcriptomics

The transcriptome is the set of RNA transcripts that the genome produces in certain situations.17 Transcriptomics signals aid in understanding drug target adverse effects.18

Methods such as RNA-Seq are used to profile the transcriptome. RNA-Seq can detect transcripts from organisms whose genomic sequence has not yet been determined, and it has a low background signal.19 It can be used to understand differential gene expression.20 RNA-Seq is supported by next-generation sequencing, which allows for large numbers of readouts.20

Transcriptomic data have been used in machine learning algorithms, for example, in a machine learning diagnostic pipeline for endometriosis in which supervised learning approaches were applied to RNA-seq as well as enrichment-based DNA methylation datasets.21 Another use has been the development of GERAS (Genetic Reference for Age of Single-cell), which is based on single-cell transcriptomes; Singh and co-authors explain that it can assign individual cells to chronological stages, which can help in understanding premature aging.22 Transcriptomic data have also been used alongside machine learning algorithms (random forest in this case) to aid in the diagnosis and classification of growth hormone deficiency.

Metabolomics and lipidomics

Metabolomics and lipidomics are concerned with the metabolome and the lipidome, respectively. Metabolomics allows for the understanding of the metabolic status and biochemical events observed in a biological, or cellular, system.13 Approaches in metabolomics include the identification and quantification of known metabolites; the profiling, or quantification, of larger lists of metabolites (either identified or unknown compounds); and metabolic fingerprinting, which is used to compare samples to a sample population to observe differences.23 Metabolomics has been combined with machine learning to identify weight gain markers (again, Random Forest algorithms were used).24 Sen and co-authors have shown that deep learning has been applied to metabolomics in various areas such as biomarker discovery and metabolite identification (amongst others).25

Lipids are grouped into eight categories: fatty acyls, glycerolipids, glycerophospholipids, sphingolipids, saccharolipids, polyketides, sterol lipids, and prenol lipids.26 They are important in cellular functions, are complex in nature, and change under different physiological, pathological, and environmental conditions.27 Lipidomics has been used to show tissue-specific fingerprints in rats,26 has shown potential in risk prediction and therapeutic studies,28 and can be used throughout the drug discovery process.27 Fan and co-authors used machine learning with lipidomics by developing SERRF (Systematic Error Removal using Random Forest), which aids in the normalization of large-scale untargeted lipidomics.29

Proteomics

Proteomics is concerned with the study of proteins. A proteome can refer to the proteins at any level, for example, at the species level (such as all the proteins in the human species) or within a system or organ. One of the major difficulties with proteomics is that the proteome changes between cells and across time.30 Questions of interest may include understanding the protein expression level in a cell or identifying the proteins modulated by a drug. Key areas of proteomic study include protein identification, protein structure, and the analysis of posttranslational modifications.

Typically, a proteomic experiment is broken down into three key steps: separation of the proteins from their source, such as a tissue; acquisition of the protein structural information; and, finally, database utilization.31 Experimental procedures to separate a protein from its source involve electrophoresis, in which the proteins appear as lines on a gel, separated by their molecular weight.31 They are visualized by staining the gel, followed by acquiring an image of the gel. The proteins can be removed from the gel to be digested and put through a mass spectrometer. Sequencing is often completed by mass spectrometry methods, which involve ionization of the sample, analysis of the mass, peptide fragmentation, and detection, ultimately leading to database utilization. A typical global proteomics experiment involves profiling several compounds to determine changes in particular proteins. By analyzing the observed abundance of the proteins across different treatment channels, it is possible to observe treatment effects.

Swan and co-authors published applications of machine learning using proteomic data. The authors note that MS-derived proteomic data can be used in machine learning either directly using the mass spectral peaks or the identified proteins and can be used to identify biomarkers of disease as well as classifying samples.32 Gessulat and co-authors developed Prosit, a deep neural network that predicts the chromatographic retention time as well as the fragment ion intensity of peptides.33

Chemical compounds

Compounds are often represented in a computer-readable form. The ChemmineR34 package for R35 and the RDKit36 package, available in KNIME37 or Python (https://www.python.org/), provide example compounds for analysis.

SDF format

SDF (structure data file) formats were developed by Molecular Design Limited (MDL) and are used to store chemical information such as structure. The first section, the header block, contains general information about the compound, including its name, its source, and any relevant comments. It is followed by the counts line, which has 12 fixed-length fields; the first two give the number of atoms and bonds described in the compound. Hydrogens are often left implicit and can be added based on valence information.38 The second block is the atom block, in which atom information is encoded, and the third is the bond block, in which bond information is encoded. In the atom block, each line corresponds to an individual atom. The first three fields of each line give the atom's position as x-y-z coordinates.38 These are typically followed by the atom symbol, and the rest of the line holds specific information such as charge.38 The bond block likewise has one line per bond: the first two fields index the bonded atoms, the third field indicates the type of bond, and the fourth refers to the bond stereochemistry.38
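The block layout described above can be illustrated with a small parser. This is a minimal sketch, assuming the fixed-width V2000 molfile layout (three-character count and bond fields, ten-character coordinate fields, atom symbol in columns 32 to 34); the ethanol record and the function name `parse_mol_block` are hypothetical and are not taken from any toolkit.

```python
# Hypothetical ethanol record in the fixed-width V2000 layout (hydrogens implicit).
MOL_BLOCK = "\n".join([
    "ethanol",                                   # header line 1: compound name
    "  illustrative-example",                    # header line 2: source
    "heavy atoms only",                          # header line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",   # counts line: 3 atoms, 2 bonds
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",                              # bond: atom 1 - atom 2, single
    "  2  3  1  0",                              # bond: atom 2 - atom 3, single
    "M  END",
])


def parse_mol_block(text):
    """Split a V2000 mol block into its header, atom block and bond block."""
    lines = text.splitlines()
    header = {"name": lines[0].strip(), "comment": lines[2].strip()}

    # Counts line: the first two 3-character fields are atom and bond counts.
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # Three 10-character coordinate fields, then the atom symbol.
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append({"symbol": line[31:34].strip(), "xyz": (x, y, z)})

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Atom indices (1-based), bond type, bond stereo: 3 characters each.
        first, second, order, stereo = (int(line[i:i + 3]) for i in (0, 3, 6, 9))
        bonds.append({"atoms": (first, second), "order": order, "stereo": stereo})

    return header, atoms, bonds


header, atoms, bonds = parse_mol_block(MOL_BLOCK)
print(header["name"], [atom["symbol"] for atom in atoms], bonds)
```

In practice a cheminformatics toolkit would normally be used to read SDF files; the sketch only makes the fixed-width layout concrete.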

InChI and InChI Key format

InChI39 is a nonproprietary line notation, or 1D structural representation, that aims to be a canonical identifier for structures (and thus is suitable for cross-database comparisons).40 Owing to the uniqueness of InChI, it has been used to derive canonical SMILES (described later), known as InChIfied SMILES.41 The InChI key is a hashed and condensed version of the full InChI string.

InChI was developed by the International Union of Pure and Applied Chemistry (IUPAC) along with the National Institute of Standards and Technology (NIST). It is continually updated by the InChI Trust. InChI captures a wide variety of compound information, including but not limited to stereochemistry, charge, and bond connectivity.

InChI keys were developed to allow compounds to be searched for, as the full InChI is too long for this purpose. An InChI key contains 27 characters (including two hyphens). The first 14 characters correspond to the connectivity information. After a hyphen come eight characters that encode other chemical information about the structure. These are followed directly by two single characters giving the type of InChI (standard or nonstandard) and its version and then, after a second hyphen, a final character giving the protonation state of the compound.
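The layout of an InChI key can be made concrete by splitting one into its parts. This is a minimal sketch, assuming the standard layout of 14 letters, a hyphen, 8 letters plus a type flag (S for standard, N for nonstandard) and a version letter, a hyphen, and one protonation letter; the function name `split_inchikey` is hypothetical, and the example key is that of ethanol.

```python
import re

# 14 connectivity letters, 8 further letters, type flag, version, protonation.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Return the named parts of an InChI key, or raise ValueError."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChI key: {key!r}")
    connectivity, other_layers, kind, version, protonation = match.groups()
    return {
        "connectivity": connectivity,   # hash of the connectivity information
        "other_layers": other_layers,   # hash of the remaining layers
        "standard": kind == "S",        # True for a standard InChI
        "version": version,
        "protonation": protonation,     # "N" denotes a neutral species
    }


parts = split_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")  # ethanol
print(parts)
```

Because the first block depends only on connectivity, comparing that block alone is a common way to group stereoisomers of the same skeleton.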

An InChI key has an almost zero chance of being shared by two separate molecules. It was estimated that if 75 databases each held 1 billion structures, there would be one instance of two molecules having the same InChI key. Despite this, an example of a “collision” was identified between two compounds with different formulae and no stereochemistry.42–44 The estimated rarity of collisions was tested experimentally, and the results suggested that a longer hash would probably be needed if uniqueness were required.45

SMILES and SMARTS format

The simplified molecular-input line-entry system (SMILES) is one of the most commonly used line notations.46–48 SMILES are based on molecular graph theory, where the nodes of a graph are the atoms and the edges are the bonds.46, 47 Generic SMILES do not give details on the chirality or the isotopic nature of the structure; SMILES that do include this information are known as isomeric SMILES.49

One problem with SMILES is that a single structure can be represented by multiple different SMILES strings; it is therefore recommended to use canonicalized structures to prevent one compound being identified as several because of the different representations used. Daylight gives an example of the ways that the SMILES string CCO can be written, including OCC, [CH3][CH2][OH], C-C-O, and C(O)C.49

Daylight gives an in-depth explanation of the rules for generating and understanding SMILES strings,49 and the common rules are summarized here. SMILES use atomic symbols for atoms, with aliphatic carbons written with a capital C and aromatic carbons with a lowercase c. Square brackets are used to describe abnormal valences and must include any attached hydrogens, as well as a number of + or - signs to indicate the charge. If these are absent, zero hydrogens and zero charge are assumed. To specify an isotope, the atomic symbol is preceded by its atomic mass, as in [12C] or [13C]. Hydrogens are often omitted when writing SMILES strings; they can be specified implicitly (through normal valence assumptions), explicitly by count (within brackets), or as explicit atoms themselves, [H]. Bonds are represented by -, =, #, or : to depict single, double, triple, or aromatic bonds, respectively. Alternatively, atoms may be placed next to each other, with the assumption that either a single or an aromatic bond joins them. To indicate the direction of bonds around a double bond, \ and / are used. Branches are written within parentheses (which can be nested), and cyclic structures use a digit to mark the ring bond that is broken to write the ring linearly, as in C1CCCCC1. Disconnected structures are separated by a period. Tetrahedral centers are indicated by @ (neighbors are listed anticlockwise) or @@ (neighbors are listed clockwise) after the chiral atom. Many specific features of compounds, such as tautomerization, chirality, and shape, need to be specified explicitly in SMILES notation.
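The encoding rules above can be seen by breaking a SMILES string into its symbols. This is a minimal sketch of a regular-expression tokenizer, assuming only bracket atoms, the common organic-subset atoms, bond symbols, branches, ring-closure digits, and the period; it does not validate chemistry, and the function name `tokenize_smiles` is hypothetical.

```python
import re

# Order matters: bracket atoms and two-letter halogens come before single letters.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atom, e.g. [13C], [OH], [Na+]
    r"|Br|Cl"            # two-letter aliphatic atoms
    r"|[BCNOPSFI]"       # one-letter aliphatic atoms
    r"|[bcnops]"         # aromatic atoms
    r"|%\d{2}|\d"        # ring-closure labels
    r"|[-=#:/\\().]"     # bonds, bond direction, branches, disconnection
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise ValueError on unknown symbols."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


print(tokenize_smiles("C1CCCCC1"))      # cyclohexane: ring opened and closed by 1
print(tokenize_smiles("[13CH3]C(=O)O")) # isotope label, branch, double bond
print(tokenize_smiles("F/C=C/F"))       # directional bonds around a double bond
```

A tokenizer like this is only a first step; deciding whether two strings describe the same molecule still requires canonicalization with a cheminformatics toolkit.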

Extending SMILES is the SMARTS49 notation, which is designed to aid substructure searching. SMARTS extends atoms and bonds with special symbols that allow for generalized matching, for example, * to match any atom or ~ to match any bond. Many of these rules follow the logical operators of programming languages, such as the use of an exclamation mark to denote NOT; for example, [!C] matches any atom that is not an aliphatic carbon.

Daylight describes the difference between SMARTS and SMILES as SMARTS describing patterns and SMILES describing molecules. In addition, SMILES are valid SMARTS.

Fingerprint format

The role of a molecular descriptor is to capture similarities and differences between compounds in a chosen dataset. Molecular descriptors range in dimensionality (0D, 1D, 2D, 3D, and 4D). A molecular fingerprint is an example of a 1D descriptor. It is a binary string in which each bit corresponds to a substructure or other predefined pattern.50 The patterns are defined before a model is trained to avoid overfitting on sparse or small datasets. If a specified pattern is found in a molecule, the corresponding bit in the binary string is set to “1”; otherwise, it is set to “0.”51

Examples of fingerprints are ECFP4 (extended connectivity for high dimensional data, up to four bonds), FCFP4 (functional class-based, extended connectivity), MACCS (166 predefined MDL keys), MHFP6 (for circular structures), Bayes affinity fingerprints (bioactivity and similarity searching), PubChemFP (for existence of certain substructures), and KRFP (from the 5-HT 5A dataset to classify between active or inactive compounds). Sometimes it is better to create custom fingerprints than to rely on predefined ones.52

Essentially, the features of the molecules (such as the presence of a particular atom) are extracted and hashed, and then the corresponding bits are set.53 A wide range of fingerprints is available, as summarized in Table 1.
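The extract, hash, and set-bit sequence can be sketched in a few lines. This is a toy illustration, assuming that features have already been extracted as strings; the feature names, the 64-bit length, and the use of CRC32 as the hash are all assumptions made for the example and do not correspond to any published fingerprint.

```python
import zlib


def hashed_fingerprint(features, n_bits=64):
    """Fold string features into a fixed-length list of 0/1 bits."""
    bits = [0] * n_bits
    for feature in features:
        # Hash the feature to an integer, then fold it into the bit range.
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[index] = 1  # different features may collide on the same bit
    return bits


# Hypothetical features for ethanol: atoms and short paths written as strings.
ethanol_features = ["C", "O", "C-C", "C-O", "C-C-O"]
fingerprint = hashed_fingerprint(ethanol_features)
print("".join(str(bit) for bit in fingerprint))
```

Shorter bit vectors save memory but increase the chance that two unrelated features set the same bit, which is the usual trade-off when choosing a fingerprint length.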

Table 1

Examples of different types of molecular fingerprints.
Name | Notes
MACCS54 | Substructure keys
Morgan55, 56 | Circular fingerprints
Extended-Connectivity Fingerprints (ECFP)55 | ECFP#, where # is a number denoting the circle diameter, typically between 0 and 6
Daylight57 | Path fingerprints that encode the substructure
Signature58, 59 | Topological descriptor
MHFP660, 61 | For circular structures
Bayes affinity fingerprints62 | Bioactivity and similarity searching
PubChemFP63 | For existence of certain substructures
KRFP (Klekota Roth fingerprint)64 | Substructure keys

The performance and prediction accuracy of a machine learning model depend on the quality of the data and on the descriptors and fingerprints chosen. For instance, fingerprint-based descriptors such as ECFP or MACCS are recommended for active substances with functional groups located in meta or para positions.65 For genotoxicity prediction, Support Vector Machine (SVM) models perform best with PubChemFPs. However, the authors recommend combining Random Forest (RF) and MACCS fingerprints.66

Extended-connectivity fingerprints (ECFPs) are topological, circular fingerprints designed for structure-activity modeling.55 They are related to Morgan fingerprints but differ in their algorithm. The ECFP algorithm is well documented55 and is summarized here. Each atom is assigned an identifier, which is then iteratively updated to capture information about neighboring atoms. Finally, any duplicate identifiers are removed (so the same feature is only represented once). Rather than a bit vector, ECFCs record a count of features.67

In comparison to the ECFP algorithm, which has a predetermined number of iterations, the Morgan algorithm56 continues to iterate until uniqueness is achieved. This process is described by Rogers and Hahn55 in their extended-connectivity fingerprints paper, where they explain that, in the Morgan algorithm, the atom identifiers do not depend on the atoms' original numbering: invariant atom information is first encoded into an initial identifier, and later iterations use the identifiers from the previous iteration. Essentially, the Morgan algorithm iterates through each atom and captures information about all possible paths through the atom, given a predetermined radius size.68 Morgan fingerprints were designed to address molecular isomorphism55 and are often used for comparing molecular similarity. The identifiers are hashed into a bit vector of predetermined length. The iterative process visits each atom identifier in a compound and updates the information about it. For example, at iteration 0, only information about the atom itself (and its bonds) is captured, whereas as the iterations increase, information about the atom's neighbors, and then their neighbors, is included.
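The iterative update of atom identifiers can be sketched on a small molecular graph. This is a simplified illustration and not the published ECFP or Morgan procedure: the initial invariant is just the element symbol and the number of heavy-atom neighbors, CRC32 stands in for the hash function, and duplicates are removed only by identifier value. The function name `circular_identifiers` is hypothetical.

```python
import zlib


def _hash(value):
    """Deterministic integer hash of a tuple (stand-in for the real hash)."""
    return zlib.crc32(repr(value).encode("utf-8"))


def circular_identifiers(symbols, bonds, radius=2):
    """Collect atom identifiers over `radius` rounds of neighbor updates."""
    neighbors = {index: [] for index in range(len(symbols))}
    for first, second in bonds:
        neighbors[first].append(second)
        neighbors[second].append(first)

    # Iteration 0: only information about the atom itself.
    current = [
        _hash((symbol, len(neighbors[index])))
        for index, symbol in enumerate(symbols)
    ]
    features = set(current)

    for _ in range(radius):
        # Each new identifier combines the atom's previous identifier with
        # the sorted previous identifiers of its neighbors, so the result
        # does not depend on how the atoms were numbered.
        current = [
            _hash((current[index], tuple(sorted(current[n] for n in neighbors[index]))))
            for index in range(len(symbols))
        ]
        features.update(current)  # the set keeps each identifier only once

    return features


# Ethanol heavy atoms, written with two different atom numberings.
numbering_a = circular_identifiers(["C", "C", "O"], [(0, 1), (1, 2)])
numbering_b = circular_identifiers(["O", "C", "C"], [(0, 1), (1, 2)])
print(numbering_a == numbering_b)
```

Sorting the neighbor identifiers before hashing is the design choice that makes the result independent of atom order; folding the resulting identifiers into a bit vector would be a separate final step.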

Two other popular fingerprints are MACCS keys and Daylight fingerprints. The Molecular ACCess System (MACCS) keys are a predefined set of 166 substructures.54 A problem with the MACCS keys is that no publication defines what each of the 166 substructures is. Generally, when citing, individuals refer to a paper discussing the re-optimization of MDL keys.54, 69 Daylight fingerprints are a form of path fingerprint: they enumerate the paths of a graph and translate them into a bit vector.70 Signature fingerprints are not binary and are based on extended valence sequence.58 They are topological descriptors that also describe the connectivity of the atoms within a compound.71

Other descriptors

A molecular descriptor can be derived from experimental data or calculated theoretically. Examples of nonfingerprint molecular descriptors include reactivity, shape, binding properties, atomic charges, molecular orbital energies, frontier orbital densities, molar refractivity, polarization, charge transfer, dipole moment, electrophilicity,72 molecular and quantum dynamics.73

Molecular descriptors are generated with the use of tools, for example, PaDEL-Descriptor,74 OpenBabel,75 RDKit,36, 53 CDKit,76 and E-Dragon.77

Structural 2D descriptors perform well in models handling binary information, such as classification and class probability estimation models,78 and in association rule learning.79–81 No universal descriptor works best with every prediction model. However, various descriptor types can be combined as input data for a model to achieve higher performance.

There are various commercial and open-source software packages, databases, and servers that use molecular descriptors to predict toxic endpoints: OECD QSAR Toolbox,82 Derek Nexus,83 FAF-Drugs4,84 eTOXsys,85 TOXAlerts,86 Schrödinger’s CombiGlide87–89 Predictor, Leadscope Hazard Expert,90 VEGA,91 and METEOR.83 ChemBench,92 ChemSAR,93 ToxTree,94 Lazar,95 admetSAR,96 Discovery Studio,97 and Pipeline Pilot98 are ML-based tools. For more detailed information, please refer to the review on computational methods in HTC by Hevener, 2018.99

Furthermore, descriptors can also be calculated for protein structures. Local descriptors have been shown to aid in the characterization of amino acid neighborhoods.100 The tool ProtDCal calculates numerical sequence- and structure-based descriptors of proteins.101 In another publication, the authors developed a sequence descriptor (in matrix form) alongside a deep neural network that could be used for predicting protein-protein interactions.102

Similarity measures

Comparing the similarity of two compounds is a common task. Different similarity metrics are summarized in Table 2. Similarity can be rephrased in terms of the distance between the compounds, which evaluates how different two compounds are. For fingerprint-based similarity calculations, the Tanimoto index is a popular method.110 A study compared several of these metrics for molecular fingerprints. It identified the Tanimoto index, Dice index, Cosine coefficient, and Soergel distance as the best and recommended that Euclidean and Manhattan distances not be used on their own.110

Table 2

Table of different similarity metrics.
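Name | Equation | Equation information
Tanimoto/Jaccard103, 104 | $T_{a,b} = \frac{N_c}{N_a + N_b - N_c}$ | N = number of attributes in objects a and b; C = intersection set
Tversky105 | $\mathrm{similarity}_{A,B} = \frac{AB}{\alpha A + \beta B + AB}$ | α = weighs the contribution of the first reference molecule; the similarity measure is asymmetric106
Dice107 | $\mathrm{similarity}_{A,B} = \frac{2AB}{A + B + 2AB}$ | AB is bits present in both A and B
Manhattan106 | $\mathrm{similarity}_{A,B} = \frac{A + B}{A + B + AB + !A!B}$ | The more similar the fingerprints, the lower the similarity score (acting more like a distance measure)106
Euclidean distance106 | $\mathrm{Dist}_{A,B} = \sqrt{\frac{AB + !A!B}{A + B + AB + !A!B}}$ | !A!B represents the bits that are absent in both A and B
Cosine108, 109 | $\mathrm{similarity}_{x,y} = \cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$ | x = compound x; y = compound y
In the Tversky, Dice, Manhattan, and Euclidean rows, A and B are taken to denote the bits set only in A and only in B, respectively.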
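Several of these metrics can be computed directly from the sets of bits that are switched on in two fingerprints. This is a minimal sketch, assuming each fingerprint is given as a Python set of on-bit indices and that the counts are defined as bits only in A, bits only in B, and bits in both; the function names and example fingerprints are hypothetical.

```python
import math


def bit_counts(a, b):
    """Return (only in a, only in b, in both) for two sets of on-bits."""
    return len(a - b), len(b - a), len(a & b)


def tanimoto(a, b):
    only_a, only_b, both = bit_counts(a, b)
    total = only_a + only_b + both
    return both / total if total else 0.0


def dice(a, b):
    only_a, only_b, both = bit_counts(a, b)
    total = only_a + only_b + 2 * both
    return 2 * both / total if total else 0.0


def tversky(a, b, alpha, beta):
    # alpha weights the bits unique to a, beta those unique to b;
    # the measure is asymmetric whenever alpha != beta.
    only_a, only_b, both = bit_counts(a, b)
    total = alpha * only_a + beta * only_b + both
    return both / total if total else 0.0


def cosine(a, b):
    # For binary vectors the dot product is the number of shared on-bits
    # and each norm is the square root of the number of on-bits.
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5}
print(tanimoto(fp_a, fp_b), dice(fp_a, fp_b), cosine(fp_a, fp_b))
print(tversky(fp_a, fp_b, alpha=1.0, beta=0.0))
```

With alpha = beta = 1 the Tversky index reduces to the Tanimoto coefficient, and with alpha = beta = 0.5 it reduces to the Dice index, which is a convenient consistency check.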

One reason for comparing the similarity of compounds is that, in combinatorial library design, chemists may reject compounds that have a Tanimoto coefficient ≥ 0.85 to another compound already chosen from the library.111 This ensures structural diversity within the library. A study using Daylight fingerprints and Tanimoto similarity found that there was only a 30% chance that two highly similar compounds were both active, likely because of differences in target interactions.111
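The rejection rule used in library design can be sketched as a greedy filter. This is a minimal illustration, assuming fingerprints are given as sets of on-bit indices and that compounds are considered in the order supplied; the compound names, fingerprints, and the function name `select_diverse` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_diverse(fingerprints, threshold=0.85):
    """Keep a compound only if it is below `threshold` to every kept compound."""
    chosen = []
    for name, bits in fingerprints:
        if all(tanimoto(bits, kept_bits) < threshold for _, kept_bits in chosen):
            chosen.append((name, bits))
    return [name for name, _ in chosen]


library = [
    ("cpd_a", set(range(20))),      # first compound is always kept
    ("cpd_b", set(range(19))),      # Tanimoto 0.95 to cpd_a, so rejected
    ("cpd_c", set(range(10, 30))),  # Tanimoto 1/3 to cpd_a, so kept
]
print(select_diverse(library))  # ['cpd_a', 'cpd_c']
```

A greedy pass like this depends on the input order; sorting the candidates by some priority before filtering is one simple way to control which member of a similar pair survives.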

$$\text{Similarity} = \frac{1}{1 + \text{distance}} \tag{1}$$

Eq. (1) is used for calculating the similarity of two compounds.
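The conversion in Eq. (1), similarity = 1 / (1 + distance), can be sketched as follows. The choice of Euclidean distance between two equal-length bit vectors is an assumption made for the example; any non-negative distance could be used in its place, and the function names are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def similarity_from_distance(distance):
    """Map a non-negative distance onto a similarity in the interval (0, 1]."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


fp_x = [1, 0, 1, 1]
fp_y = [1, 1, 0, 1]
d = euclidean_distance(fp_x, fp_y)
print(d, similarity_from_distance(d))
```

Identical compounds have distance 0 and therefore similarity 1, and the similarity decays towards 0 as the distance grows.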

QSAR with regards to safety

QSAR studies involve pattern discovery, predictive analysis, association analysis, regression, and classification models that integrate information from various biological, physical, and chemical predictors. QSAR relies on the assumption that chemical molecules sharing similar properties possess similar safety profiles.81 A QSAR model establishes a relationship between a set of predictors and a biological activity (e.g., binding affinity or toxicity). Biological properties correlate with the size and shape of a molecule, the presence of specific bonds or chemical groups, lipophilicity, and electronic properties. Biological activity can be quantified, for example, as the minimal concentration of a drug required to cause a response. According to the Organization for Economic Co-operation and Development (OECD) guidelines, a QSAR model should have (a) a defined endpoint; (b) an unambiguous algorithm; (c) a defined domain of applicability; (d) appropriate measures of goodness-of-fit, robustness, and predictivity; and (e) a mechanistic interpretation.112

The main advantages of QSAR modeling are feature interpretability, high predictability, and the diversity of available molecular descriptors. QSAR enables the calculation of biological activity and significantly reduces the number of molecules that need to be synthesized and tested in vitro. The QSAR method has some limitations, though. Developing a model with high predictive power and high statistical significance requires large datasets as well as a preselection of predictors. Additionally, it is not always possible to deduce the human dose, duration of treatment, or exposure without the use of animal data. Furthermore, not all structurally similar molecules exert a similar influence in vivo. Thus, an experienced human expert should define the applicability domain and the scope of interpretation of a QSAR prediction.

The QSAR approach dates to 1962, when Hansch assumed independence of the features that influence bioactivity and developed a linear regression model. In the Hansch model (Eq. 2), the authors estimated the logarithm of the reciprocal of the concentration (C) using the octanol/water partition coefficient (π) and the Hammett constant (σ):

$$\log\frac{1}{C} = 4.08\pi - 2.14\pi^{2} + 2.78\sigma + 3.36 \tag{2}$$

Eq. (2) represents Hansch Model.
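Eq. (2) can be evaluated directly for given descriptor values. This is a minimal sketch, taking the quadratic term in π as negative (the parabolic form of the Hansch model); the function name and the input values are hypothetical and are not measurements for any real compound.

```python
def hansch_log_inverse_c(pi, sigma):
    """Evaluate log(1/C) from the Hansch model coefficients in Eq. (2)."""
    return 4.08 * pi - 2.14 * pi ** 2 + 2.78 * sigma + 3.36


def optimal_pi():
    """Value of pi that maximizes log(1/C), from setting d/dpi to zero."""
    return 4.08 / (2 * 2.14)


# Illustrative descriptor values only.
print(hansch_log_inverse_c(pi=1.0, sigma=0.0))
print(hansch_log_inverse_c(pi=0.5, sigma=0.2))
print(optimal_pi())
```

Because the coefficient of the squared term is negative, predicted activity rises with π only up to an optimum and then falls, whereas the positive coefficient of σ means activity increases monotonically with σ in this model.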

A positive coefficient of a descriptor suggests a positive correlation between a specific toxicity endpoint and that descriptor; a negative coefficient is linked to a negative correlation.113 Two years later, in 1964, the Free-Wilson method, based on regression analysis, was developed, in which the chemical structure is used as a single variable.114 In the 1980s and 1990s, linear regression was applied to develop toxicity prediction models with both single and multiple molecular properties as variables.

Approaches such as linear regression analysis and multivariate analysis perform well for single molecular properties prediction. However, currently, it is possible to generate many more types of molecular descriptors (1D to 4D) than it was 40 years ago, which leads to more high-dimensional datasets.81 Hence, nowadays advanced nonlinear techniques have become more popular in toxicity prediction.

In certain cases, a large number of input features, that is, dimensions (e.g., molecular descriptors), may result in decreased machine learning model performance. This phenomenon is referred to as the curse of dimensionality: as the number of dimensions grows, sample density decreases and the dataset becomes sparse. As a result, the model may overfit, which means that it learns too much about each individual data point. To ensure that the model generalizes well, preselection of the most relevant features may be indispensable. This process is called dimensionality reduction of the n-dimensional feature space. The most common dimensionality reduction methods include Least Absolute Shrinkage and Selection Operator (LASSO), Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), Linear Discriminant Analysis (LDA),115 Multidimensional Scaling (MDS), Recursive Feature Elimination (RFE), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Sequential Floating Forward Selection (SFFS).116
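The loss of sample density with increasing dimensionality can be illustrated numerically. This is a toy sketch, assuming points drawn uniformly at random from a unit hypercube rather than real molecular descriptors; the function name, the number of points, and the chosen dimensions are arbitrary.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each random point to its closest other point."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, point in enumerate(points):
        total += min(
            math.dist(point, other)
            for j, other in enumerate(points)
            if j != i
        )
    return total / n_points


# The same number of samples spread over more and more dimensions.
for n_dims in (2, 20, 100):
    print(n_dims, round(mean_nearest_neighbor_distance(50, n_dims), 3))
```

With the sample size held fixed, the nearest neighbor of each point moves further away as dimensions are added, which is why reducing or selecting features before modeling can help a model generalize.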

Data resources

There are many resources available for data analytics, both commercial and open. Many of these resources can be used for multiple tasks. The sections below cover many of the key resources used in drug discovery; however, it is worth noting that as more data are created and gaps are identified in available resources, new resources will be developed.

Toxicity related databases

As a result of the application of high throughput screening (HTS) and the development of novel chemical and biological research techniques in the 21st century, the number of publicly available repositories is growing rapidly. This enables the integration of siloed information and the discovery of less evident side effects that result from synergistic effects and complex drug-drug interactions. In this section, we present an overview of existing data sources related to toxicogenomics, organ toxicity, binding affinity, biochemical pathways, bioactivity, molecular interactions, gene-disease linkage, histopathology, oxidative stress, protein-protein interactions, metabolomics, transcriptomics, proteomics, and epigenomics (Table 3).

Table 3

An overview of toxicity, chemical and multiomics databases useful in the computational evaluation of safety.
| Database name | Data type | Description | Reference |
| --- | --- | --- | --- |
| CEBS (Chemical Effects in Biological Systems) | Adverse events | Biology-focused database of chemical effects that focuses on systems toxicology | 117 |
| IntSide | Adverse events | Chemical and biological side effects database | 118 |
| MetaADEDB | Adverse events | Integrates CTD, OFFSIDES, and SIDER, focusing on ADE-drug occurrences | 119 |
| OFFSIDES | Adverse events | ADRs reported during clinical trials before drug approval | 119 |
| SIDER | Adverse events | Contains information on marketed medicines and their recorded adverse drug reactions | 120 |
| BioGRID | Molecular interactions | Genetic and protein interactions | 121 |
| Biomodels | Molecular interactions | Rate-related interactions | 122 |
| Bioplex | Molecular interactions | Immunopurification and mass spectrometry-based protein interaction database | 123 |
| HAPPI-2 | Molecular interactions | Protein-protein interactions with a confidence score | 124 |
| HPRD | Molecular interactions | Historic, no longer updated database of manually curated | 125 |
| IntAct | Molecular interactions | Open-source database system and analysis tools for molecular protein interaction data | 126 |
| InWeb_IM | Molecular interactions | Protein-protein interaction datasets with orthological predictions | 127 |
| mentha | Molecular interactions | A taxonomy browser of interactions from publications and databases | 128 |
| NRF2Ome | Molecular interactions | Manually curated human oxidative stress and NRF2 response specific database | 129 |
| OmniPath | Molecular interactions | Manually curated human signaling database | 130 |
| SignaLink2 | Molecular interactions | Manually curated signaling database with regulations and predicted interactions | 131 |
| Signor | Molecular interactions | Manually curated pathway interactions with directions and signs | 132 |
| STRING | Molecular interactions | Curated databases using text mining interactions in different species | 133 |
| KEGG (Kyoto Encyclopedia of Genes and Genomes) | Pathways | The database focuses on high-level functions of biological systems from molecular-level information | 134 |
| MSigDB (Molecular Signatures Database) | Pathways | A collection of annotated gene sets for use with GSEA (Gene Set Enrichment Analysis) software | 135 |
| Pathway Commons | Pathways | Free online database of pathways, bundled with open-source data analysis tools | 136, 137 |
| PharmGKB (The Pharmacogenomics Knowledgebase) | Pathways | Manually curated collection of PGx information from the primary literature | 138, 139 |
| Qiagen IPA (Ingenuity Pathway Analysis) | Pathways | A commercial pathway analysis tool, capable of complex analysis and prediction of downstream effects | 140 |
| Reactome | Pathways | Free online database of pathways, mostly focused on human biology | 141, 142 |
| The Gene Ontology Resource | Pathways | Database of pathways from molecular to organism-level for multiple species, focusing on the function of the genes and gene products. Datapoints have annotations on multiple levels of specificity | 143 |
| WikiPathways | Pathways | Community-curated collection of pathways with links to other sources and pathway databases | 144 |
| admetSAR | Toxicity-molecule associations | An online tool for the prediction of chemical ADMET properties | 96, 145 |
| BindingDB | Toxicity-molecule associations | A database of measured binding affinities, interactions of protein drug targets and small | 146 |
| CTD (Comparative Toxicogenomics Database) | Toxicity-molecule associations | Literature-based, manually curated associations between chemicals, gene products, phenotypes, diseases, and environmental exposures | 147 |
| ChEMBLdb | Toxicity-molecule associations | An EMBL manually curated chemical database with bioactivity data | 5 |
| ChemProt | Toxicity-molecule associations | A compilation of chemical-protein-disease annotation resources for studying systems pharmacology of a small molecule from molecular to clinical levels | 148, 149 |
| DSSTox | Toxicity-molecule associations | A subset of ACToR related to toxicity | 150 |
| eChemPortal | Toxicity-molecule associations | An aggregator of chemical hazard and risk information | 151 |
| PKKB (Pharmaco Kinetics Knowledge Base) | Toxicity-molecule associations | High-quality data for experimental ADMET properties | 152 |
| PubChem | Toxicity-molecule associations | An aggregator of chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations | 153 |
| SuperToxic | Toxicity-molecule associations | Compounds and toxicity information | 154 |
| T3DB (Toxic Exposome Database) | Toxicity-molecule associations | Toxins data combined with target information | 155 |
| Tox21 (Toxicology in the 21st century) | Toxicity-molecule associations | Toxicity data for commercial chemicals, pesticides, food additives/contaminants, and medical compounds | 156 |
| ToxBank Data Warehouse | Toxicity-molecule associations | An aggregator of data for systemic toxicity | 157 |
| ToxCast Database (invitroDB) | Toxicity-molecule associations | HTS assay target information, study design information and quality | 158 |
| TOXNET | Toxicity-molecule associations | An aggregator of several toxicity databases, integrated into PubMed in 2019 | 159 |
| TTD (Therapeutic Targets Database) | Toxicity-molecule associations | Protein and nucleic acid targets, diseases, pathways | 160 |
| ECOTOX (Ecotoxicology Database) | Toxicity-molecule associations, adverse events | Adverse effects of single chemical stressors related to aquatoxicity | 161 |
| DrugBank | Toxicity-molecule associations, biological activity | A bioinformatics and cheminformatics resource on drug targets and properties of drugs | 162 |
| STITCH | Toxicity-molecule associations, pathways | Metabolic pathways, binding experiments, crystal structures, and drug-target relationships | 163 |
| Connectivity Map | Transcriptomics | Human cancer cell lines treated with various perturbants, Affymetrix GeneChip Human Genome | 164 |
| DrugMatrix | Transcriptomics | Rat liver, kidney, heart and thigh muscle from Affymetrix GeneChip Rat Genome | 165 |
| LINCS L1000 | Transcriptomics | Microscopy data, transcripts from L1000 database | 166 |
| Open TG-GATEs | Transcriptomics | Histopathology and clinical chemistry data for rat liver, kidney, heart and thigh muscle, Affymetrix GeneChip Rat Genome | 167 |
| GEO (Gene Expression Omnibus) | Functional genomics data | Contains array and sequence-based data | 168 |
| ArrayExpress | Functional genomics data | Experimental data from high-throughput functional genomic tests | 169 |
| UniProt KnowledgeBase | Protein sequences and functional information | Database is split into two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, which respectively reflect whether the data are manually annotated and reviewed or not | 170–172 |
| Protein Data Bank | Protein information | Contains information about the “3D shapes of proteins, nucleic acids and complex assemblies” | 173, 174 |
| PRIDE | Proteomics | Repository of MS derived proteomics data | 175–177 |
| ProteomeDB | Proteomics | Aims to aid in the identification of the proteome | 178 |
| GnomAD (Genome Aggregation Database) | Sequencing data | Exome and genome sequencing data that has been combined from large-scale sequencing projects | 179 |
| WITHDRAWN | Withdrawn drugs | Contains withdrawn and discontinued drugs | 180 |
| DISGeNET | Target-disease information | Target-disease relationships | 181 |
| Open Targets | Target-disease information | Target-disease relationships | 182 |
| Clinical Pharmacology and British Pharmacology Society Guide to Pharmacology Database | Target and ligand information | Resource on targets and ligands | 183 |
| SuperTarget | Target-drug information | Target-drug information | 184 |
| GOSTAR | Target compound database | Manually curated target-compound database from literature and patents | 185 |
| SureChEMBL | Patent data | Open-source patent data | 186 |


TOXNET187 is an aggregator of other toxicity-related databases covering breastfeeding and drugs, developmental toxicology literature, drug-induced liver injury, household product safety, and animal testing alternatives. TOXNET has been available via PubMed since December 2019. ToxCast158 and ECOTOX161 are two databases created by the US Environmental Protection Agency. They contain high-throughput and high-level cell response data related to the toxicity and environmental impact of over 1800 chemicals, consumer products, and food and cosmetic additives. Tox21156 is a collaborative database of several US federal agencies that aggregates toxicology data on commercial chemicals, pesticides, food additives, contaminants, and medical compounds. ToxBank Data Warehouse157 stores systemic pharmacology information and additionally integrates into models predicting repeated-dose toxicity. PubChem153 and DrugBank162 are not purely toxicology databases; however, they collect bioactivity and biomolecular interaction data as well as clinical and patent information, respectively. The ChEMBL188 and CTD147 databases contain manually curated data on interactions between chemical molecules and genes or proteins, as well as on chemical-disease and gene-disease relationships. Various online public resources are devoted to drug side effects: SIDER,120 OFFSIDES,189 and CEBS.66 These data are integrated with pathway-focused sites, for example, KEGG,190 PharmGKB,138 and Reactome,141 which are curated and peer-reviewed pathway databases. Table 3 lists the main resources; however, it is not exhaustive.

A large amount of molecular omics data is available in the public domain, which allows data to be reused and exchanged between experiments. High-dimensional and noisy biological signals used in, for example, differential gene expression, gene co-expression networks, compound protein-protein interaction networks, signature matching, and organ toxicity analysis often require a standardized ontology as well as manual data curation before they can be used to train a model.18 However, the following public databases offer relatively high-quality data. DrugMatrix191 contains in vivo rat liver, kidney, heart, and thigh muscle data from the Affymetrix GeneChip Rat Genome 230 2.0 Array and GE Codelink, and Open TG-GATEs167 contains rat liver and kidney data. The latter also contains human and rat in vitro hepatocyte data as well as histopathology, blood chemistry, and clinical chemistry data. Toxicity data for five human cancer cell lines derived from the Affymetrix GeneChip Human Genome U133A Array are stored in the Connectivity Map.164 Microscopy images of up to 77 cell lines treated with various chemical compounds, together with gene expression data, can be found in the Library of Integrated Network-Based Cellular Signatures L1000 (LINCS) dataset.164

Many of the resources above have multiple applications. A wide variety of resources are available for proteomic studies from the EBI, including the UniProt KnowledgeBase (UniProtKB) and PRIDE.192 UniProt provides freely accessible resources of protein data such as protein sequences and functional information; UniProtKB is one of these resources.170–172 It is split into two sections: UniProtKB/Swiss-Prot, which is manually annotated and reviewed, and UniProtKB/TrEMBL, which is computationally annotated and not reviewed. Because the TrEMBL section is computationally annotated, EBI states that there is high annotation coverage of the proteome.172 These data can be used to find evidence for protein function or subcellular location.172 Finally, PRIDE includes protein and peptide identifications (such as details of posttranslational modifications) alongside evidence from mass spectrometry.175–177

The growth in the number of data repositories and databases has been fueled by the large amount of proteomic data generated.175 The Protein Data Bank is concerned with structural protein information, such as the 3D shape of the protein, and is maintained by the RCSB.173, 174

To deal with this growth in data, the HUPO Proteomics Standards Initiative193 (HUPO-PSI) was developed to ensure the universal adoption of stable data formats, which has resulted in the aggregation of proteomic data.175 According to HUPO-PSI, these standards were developed “to facilitate data comparison, exchange and verification”.193 However, the initiative does not address the quality of the data and the issues that it brings.

Other key resources include WITHDRAWN,180 which contains information about withdrawn and discontinued drugs, and DISGeNET181 and Open Targets182 for target-disease relationships. The Clinical Pharmacology and British Pharmacology Society Guide to Pharmacology Database183 also contains target information and information about a variety of ligands. GOSTAR185 and SureChEMBL194 both contain target-compound information from patents, with GOSTAR also containing such information from the literature.

Drug safety databases

To monitor, systematically review, and enable data-driven decisions on drug safety, the WHO Collaborating Monitoring Centre in Uppsala195 and National Competent Authorities (NCAs) maintain several databases dedicated to the collection of safety signals (Fouretier et al., 2016). The largest and oldest are WHO VigiBase (1968), EU Eudravigilance, FDA FAERS, and VAERS, but most countries have established their own databases supported by Geographical Information Systems (GIS).196 Geolocalization allows these databases to be used to detect both global and local trends. Table 4 presents an overview of the largest publicly accessible databases covering postmarketing surveillance, unsolicited reporting, and solicited reporting from clinical trials.

Table 4

An overview of publicly available resources on adverse drug reactions maintained by WHO, national competent authorities (NCAs), and scientific institutions.
| Database name | Organization | Reporters | Content |
| --- | --- | --- | --- |
| VigiBase | Uppsala Monitoring Centre, WHO | MAHs, HCPs, consumers or any regional center | Twenty million ICSRs from 125 member states and 28 associate members on medicinal product-related suspected adverse events; postmarketing spontaneous serious and nonserious case ICSRs, sometimes clinical trials, literature. Related tools: WHO-ART, MedDRA, WHO ICD, WHODrug, VigiSearch, VigiLyze, VigiMine, VigiAccess |
| Eudravigilance | EMA | MAH, NCAs, EEA sponsors of clinical trials | 14.5 million ICSRs; Clinical Trial Module (EVCTM); Post-Authorization Module (EVPM). Related tools: EVDAS, adrreports.eu, MedDRA |
| FAERS | FDA | MAH, HCPs, consumers | Over 19 million postmarketing surveillance adverse event reports related to medications. Causality analysis not required for submission. Related tools: Sentinel Initiative, FAERS Public Dashboard, AERSMine, OpenVigil |
| VAERS | FDA, CDC | MAH, HCPs, consumers | 700,000 postmarketing surveillance adverse event reports related to vaccines, including unverified reports, misattribution, underreporting, and inconsistent data quality. Related tools: empirical Bayes and data mining tools built-in |
| Adverse Event Reports for Animal Drugs and Devices | FDA, Center for Veterinary Medicine | Veterinary professionals, consumers | Voluntary AE submission; the database contains postmarketing surveillance adverse event reports related to animal drugs, including drugs, supplements, vitamins |
| Yellow Card | MHRA, Commission on Human Medicines | HCP, hospital and community pharmacists, members of the public | ICSRs on medicines, OTCs, vaccines, herbal preparations and unlicensed medicines, e-cigarettes, counterfeit drug reports, defective medicinal products. Interactive Drug Analysis Profiles (iDAPs) can be downloaded for each drug. Related tools: Android app, built-in analytics |
| Canada Vigilance Adverse Reaction | Health Canada | HCP, MAHs | Clinical and postmarket surveillance SAE reports on prescription and nonprescription medications; natural health products; biologics (includes biotechnology products, vaccines, fractionated blood products, human blood and blood components, as well as human cells, tissues and organs); radiopharmaceuticals; and disinfectants and sanitizers with disinfectant claims |
| DAEN - medicines | Australian Department of Health TGA | HCPs, MAHs, members of public, therapeutic goods industry | ADR reports on adverse events related to medicines and vaccines used in Australia |
| LAREB | Netherlands Pharmacovigilance Centre Lareb | HCPs, community pharmacists, members of the public | Downloadable reports with preprocessed data and literature related to ADR reporting in the Netherlands |
| PROTECT | EMA and partners | None | PROTECT ADR database is a downloadable Excel file listing all MedDRA preferred terms or low-level terms for adverse drug reactions (ADRs), text mined from the Summary of Product Characteristics (SPC) of medicinal products authorized in the EU; automated mapping of ADR terms, fuzzy text matching, expert review |
| SIDER | EMBL | None | Postmarket surveillance data extracted from public documents, package leaflets and Summaries of Product Characteristics; includes side effect frequency, drug and side effect classifications, links to drug target relations; top-down database |


Databases contain both solicited and unsolicited data.

The majority of unsolicited resources are unstructured, fragmentary, and unstandardized, and they suffer from the presence of confounders. Although WHO, ICH, and NCAs have made a considerable standardization effort, the quality of ADR reports varies across countries.197 Additional curation of the data is indispensable because the databases contain duplicates and missing data points and have high sample variance.
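A first curation pass of the kind described above can be sketched in a few lines of Python. The example removes reports that are identical after simple text normalization and counts missing values per field. The record schema (`drug`, `reaction`, `country`, `onset_date`), the example records, and the function names are all hypothetical; real spontaneous-report duplicates are rarely exact matches, so this is only an illustrative starting point.

```python
from collections import Counter

# Hypothetical report schema; real databases use far richer, coded fields.
REQUIRED_FIELDS = ("drug", "reaction", "country", "onset_date")

# Made-up example reports, including one near-duplicate and two incomplete records.
EXAMPLE_REPORTS = [
    {"drug": "DrugA", "reaction": "Nausea", "country": "SE", "onset_date": "2019-03-01"},
    {"drug": " druga ", "reaction": "nausea", "country": "se", "onset_date": "2019-03-01"},
    {"drug": "DrugB", "reaction": "Rash", "country": "PL", "onset_date": None},
    {"drug": "DrugB", "reaction": "", "country": "PL", "onset_date": "2019-05-12"},
]


def normalise(value):
    """Trim and lower-case strings so trivially different spellings compare equal."""
    return value.strip().lower() if isinstance(value, str) else value


def deduplicate(reports, key_fields=REQUIRED_FIELDS):
    """Keep the first report for each combination of normalised key fields."""
    seen, unique = set(), []
    for report in reports:
        key = tuple(normalise(report.get(field)) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(report)
    return unique


def missing_field_counts(reports, fields=REQUIRED_FIELDS):
    """Count, per field, how many reports lack a usable value."""
    counts = Counter()
    for report in reports:
        for field in fields:
            if normalise(report.get(field)) in (None, ""):
                counts[field] += 1
    return counts


if __name__ == "__main__":
    unique = deduplicate(EXAMPLE_REPORTS)
    print(f"{len(EXAMPLE_REPORTS)} reports, {len(unique)} after de-duplication")
    print("missing values per field:", dict(missing_field_counts(unique)))
```

Counting missing values after de-duplication, rather than before, avoids inflating the missingness estimate with repeated copies of the same incomplete report.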

Furthermore, cases in which patients were administered drugs as intended and no ADR occurred are naturally not reported. From the perspective of data analysis and machine learning model development, their absence from a dataset results in class imbalance, survivorship bias, and high numbers of false positives in predictions.198 Thus, one cannot calculate the rate of occurrence for the whole population based on spontaneous resources only. Otherwise, the risk of false-positive reporting for certain medicines may be artificially elevated.199 Moreover, statistical significance in a model does not always mean clinical relevance. A majority of patients might be statistically likely to respond better to certain medications. However, some atypical side effects may occur that lower the quality of life of a small number of patients and hence outweigh the benefits.
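The class imbalance mentioned above can be quantified directly from the labels. The sketch below counts the classes in a hypothetical label vector (1 for a reported ADR, 0 for no ADR) and derives inverse-frequency class weights, one common mitigation that is shown here only as an illustration and is not taken from the cited sources. The function names and the 95:5 split are assumptions, and reweighting does not recover the missing denominator of exposed patients.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical dataset: 95 reports with an ADR and only 5 without.
    labels = [1] * 95 + [0] * 5
    print("imbalance ratio:", imbalance_ratio(labels))
    print("class weights:", inverse_frequency_weights(labels))
```

With these weights, each class contributes equally to a weighted loss, so the few no-ADR cases are not drowned out by the many ADR reports during training.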

Finally, longitudinal patient medical history may not always be easily retrieved, and thus it is challenging to verify reported information as well as to establish causality as defined in the ICH E2A guideline.200 Reports submitted to spontaneous reporting system (SRS) databases are subjective and often contain records that are inconsistent with the original medical documentation.

Key public data-resources for precision medicine

This section describes many completed and ongoing efforts to generate large-scale datasets from cell lines, patients and healthy volunteers. These datasets are a necessary asset that will be used to generate novel AI/ML-based models to guide precision medicine.

Resources for enabling the development of computational models in oncology

Beginning with the characterization of the NCI60 cell lines for predicting drug sensitivity, there have been a large number of large-scale studies generating genomics, proteomics, functional genomics, or drug sensitivity datasets that can be used to predict the sensitivity of cancer cells to a targeted agent (Table 5). Among them, the Cancer Cell Line Encyclopedia (CCLE) project by the Broad Institute is one of the most comprehensive. In its first round in 2012, CCLE included gene expression, copy number, and mutation profile data for 947 cell lines, and pharmacological profiles for 24 anticancer drugs in 479 of the cell lines. In 2019, the project was extended to include data on RNA sequencing (RNAseq; 1019 cell lines), whole-exome sequencing (WES; 326 cell lines), whole-genome sequencing (WGS; 329 cell lines), reverse-phase protein array (RPPA; 899 cell lines), reduced representation bisulfite sequencing (RRBS; 843 cell lines), microRNA expression profiling (954 cell lines), and global histone modification profiling (897 cell lines) for CCLE cell lines. In addition, the abundance of 225 metabolites was measured for 928 cell lines. An additional project from Genentech profiled gene expression, mutations, gene fusions, and expression of nonhuman sequences in 675 human cancer cell lines. The MD Anderson Cell Lines Project (MCLP) characterized the proteome of human cancer cell lines. Two resources that include drug sensitivity data are Genomics of Drug Sensitivity in Cancer (GDSC) from the Sanger Institute and the Cancer Therapeutics Response Portal (CTRP) from the Broad Institute.208, 218, 219 By generating and publishing expression data that indicate how cells respond to various genetic and environmental stressors, the LINCS project from the NIH helps to build a more detailed understanding of cell pathways.164, 220

Table 5

Key public data resources in oncology for enabling the development of computational models and knowledge discovery.
| Resource | Biological material | Omics readout (#cell lines) | Weblink | Last update | Reference |
| --- | --- | --- | --- | --- | --- |
| NCI-60 | 60 cancer cell lines | Drug sensitivity (> 100,000 compounds), SNV, CNV, RNAseq, DNA methylation | https://discover.nci.nih.gov/cellminer/ | | 201 |
| McDermott et al. | 500 cell lines | Drug sensitivity (14 kinase inhibitors) | | 2007 | 202 |
| GSK | 311 cell lines | Drug sensitivity (19 compounds) | | 2010 | |
| CCLE | ~1000 cancer cell lines | WES (326), WGS (329), RNAseq (1019), methylation (RRBS, 843), RPPA (899), microRNA profiling (954), global histone modifications (897), drug sensitivity (24 compounds, 479) and metabolic profiling for 225 metabolites (928) | https://portals.broadinstitute.org/ccle | May 2019 | 203–205 |
| GDSC | 1001 cell lines, 453 compounds | Transcription (microarray), methylation (Infinium HumanMethylation450 BeadChip arrays), drug sensitivity | https://www.cancerrxgene.org/ | July 2019 | 206, 207 |
| CTRP | 481 compounds across 860 cancer cell lines | Drug sensitivity | http://portals.broadinstitute.org/ctrp.v2/ | | 208 |
| Genentech | 675 human cancer cell lines | RNA-seq and SNP array analysis | https://www.nature.com/articles/nbt.3080 | 2015 | 209 |
| Connectivity Map | Nine cell lines | 1,319,138 L1000 profiles from 42,080 perturbagens (19,811 small molecule compounds, 18,493 shRNAs, 3,462 cDNAs, and 314 biologics), corresponding to 25,200 biological entities (19,811 compounds, shRNA and/or cDNA against 5075 genes, and 314 biologics) for a total of 473,647 signatures | http://www.lincsproject.org/; https://clue.io/cmap | 2017 | 164, 210 |
| MCLP | Cell lines | RPPA | https://tcpaportal.org/mclp/#/ | | 211 |
| Cheng et al. | 15 HPV− and 11 HPV+ HNSCC cell lines | Whole exome sequencing and RNA-seq | | Oct 2018 | 212 |
| TCGA | > 11,000 primary cancer and matched normal samples spanning 33 cancer types | Genomic, methylation (Infinium HumanMethylation450 BeadChip arrays), transcriptomic and proteomics (RPPA) | https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga | | 213 |
| ICGC | 86 cancer projects across 22 sites, ~25,000 patients | Genome sequencing | https://icgc.org/ | | 214 |
| TCPA | 8167 tumor samples | RPPA | https://tcpaportal.org/tcpa/ | | |
| CPTAC | Cancer patients, primary tumor and the adjacent tissue. 45 total studies for 10 tissue types, resulting in a total of 2696 samples | Phosphoproteomics, proteomics, transcriptomics, SCNA, mutations | https://cptac-data-portal.georgetown.edu/ | May 2019 | 215, 216 |
| Dong-Gi Mun et al. | Paired tumor and adjacent normal tissues, as well as blood samples, from 80 patients with EOGCs under 45 years of age | Exome sequencing, RNA-seq, global proteome, phosphoproteome and glycoproteome | | | 217 |


Abbreviations: NCI60, National Cancer Institute collection of 60 cell lines; CCLE, Cancer Cell Line Encyclopedia; GDSC, Genomics of Drug Sensitivity in Cancer; COSMIC, Catalogue of Somatic Mutations in Cancer; TCGA, The Cancer Genome Atlas; MCLP, MD Anderson Cell Lines Project; CPTAC, Clinical Proteomic Tumor Analysis Consortium; RPPA, reverse phase protein array.

Although cancer cell line data are crucial for many insights, and some large-scale experiments such as CRISPR functional genomics screens can only be done in cell lines, primary data on patients are vital for understanding and modeling human disease.

Several large consortia and projects have taken up the challenge of characterizing tumor samples at the genomic, epigenomic, and proteomic levels. Prominent among them is The Cancer Genome Atlas (TCGA), which has sequenced and characterized more than 11,000 patient samples across 33 cancer types.213 The International Cancer Genome Consortium (ICGC) is another consortium, comprising several national projects to sequence cancer samples.214 The Cancer Proteome Atlas (TCPA) performed RPPA analysis on more than 8000 samples (Table 5), and the Clinical Proteomic Tumor Analysis Consortium (CPTAC), launched in 2011 by the NCI, pioneered the integrated proteogenomic analysis of colorectal, breast, and ovarian cancer.215 These efforts revealed new insights into these cancer types, such as the identification of proteomic-centric subtypes, the prioritization of driver mutations, and an understanding of cancer-relevant pathways through posttranslational modifications. The CPTAC has produced proteomics data sets for tumor samples previously analyzed by the TCGA program.

Key genomic/epigenomic resources for therapeutic areas other than oncology

Numerous large-scale data generation projects are ongoing outside the oncology domain. Some of them are summarized in Table 6.

Table 6

Large genomic datasets for nononcology applications.
| Resource | Biological material | Omics readout | Weblink | Last update | Reference |
| --- | --- | --- | --- | --- | --- |
| GWAS Catalog | Human primary tissue/cells | DNA | https://www.ebi.ac.uk/gwas/ | Every week | 221 |
| Expression Atlas | Multiple species and tissues | RNA | https://www.ebi.ac.uk/gxa/home | August 2020 | 222 |
| ClinVar | Human primary tissue/cells | DNA | https://www.ncbi.nlm.nih.gov/clinvar/ | | 223 |
| OMIM | Human primary tissue/cells | DNA | https://www.omim.org/ | Every day | |
| ENCODE | Cell lines, primary cells, cell free samples, tissue | Epigenetic profiling | https://www.encodeproject.org/ | August 2019 | 224 |
| Human Cell Atlas | Human primary tissue/cells | Single cell sequencing | https://www.humancellatlas.org/ | | 225, 226 |
| Single Cell Portal | Collection of studies on single cells (288 so far) | Single cell sequencing | https://portals.broadinstitute.org/single_cell | August 2020 | |
| GTEx Portal | 54 nondiseased tissue sites across nearly 1000 individuals | Primarily for molecular assays including WGS, WES, and RNA-Seq. Remaining samples are available from the GTEx Biobank. The GTEx Portal provides open access to data including gene expression, QTLs, and histology images | https://gtexportal.org/home/ | August 2019 | 227 |
| PsychENCODE | Human brain samples and organoids | DNA, RNA and epigenetics profiling | http://www.psychencode.org/ | December 2018 | 228 |
| NIAGADS | 61 datasets, > 59,000 samples | Genotypic data for the study of genetics of late-onset Alzheimer’s disease | https://www.niagads.org/ | February 2019 | |
| ADSP | The Alzheimer’s Disease Sequencing Project | DNA | https://www.niagads.org/adsp/ | November 2018 | 229 |
| ADNI | Alzheimer’s Disease Neuroimaging Initiative, > 800 subjects | Clinical, genetic, MRI image, PET image, biospecimen | http://adni.loni.usc.edu/ | | 230 |
| AutDB | Resource for exploring the impact of genetic variations associated with autism spectrum disorders (ASD) | Human Gene, which annotates all ASD-linked genes and their variants; Animal Model, which catalogs behavioral, anatomical and physiological data from rodent models of ASD; Protein Interaction (PIN), which builds interactomes from direct relationships of protein products of ASD genes; and Copy Number Variant (CNV), which catalogs deletions and duplications of chromosomal loci identified in ASD | http://autism.mindspec.org/autdb | Quarterly | 231 |
| NDAR | National Database for Autism Research | Genetics, behavioral data | https://nda.nih.gov/ | November 2018 | 232 |
| NIMH Data Archive (NDA) | NDA is a collection of data repositories including the Research Domain Criteria Database (RDoCdb), the National Database for Clinical Trials related to Mental Illness (NDCT) and the NIH Pediatric MRI Repository (PedsMRI) | | https://nda.nih.gov/ | August 2019 | |


Resources for accessing metadata and analysis tools

Accessing and analyzing raw sequencing data can be quite cumbersome for most biologists. Resources that present analyzed or easy-to-grasp data on genetic alterations, as well as pathway-level analyses, are therefore very helpful. Several such resources can be used directly for hypothesis generation and verification. Some of these are listed in Table 7.

Table 7

Resources for accessing metadata and analysis tools.
| Database | Content | Omics readout | Weblink | Last update | Reference |
| --- | --- | --- | --- | --- | --- |
| COSMIC | Tumor samples and > 1000 cell lines | Expert curated database of somatic mutations | https://cancer.sanger.ac.uk/cosmic | V92, August 2020 | 233 |
| Cancer DepMap | Data from CCLE, TCGA, GDSC, RNAi and CRISPR screens | Genomics, proteomics, RNAi/CRISPR screens and drug sensitivity | https://depmap.org/portal/ | Every 90 days | |
| Cell Model Passports | Cell lines and organoids | Mutations, expression, CNV, methylation, fusions, drug response, CRISPR score | https://cellmodelpassports.sanger.ac.uk/passports | | 234 |
| cBioPortal | The portal hosts a total of 263 cancer studies including CCLE and TCGA data | Mutations, CNV, RNAseq, RPPA | http://www.cbioportal.org/ | | 235 |
| TCGA-CDR | Clinical data resource for high quality survival outcome analytics | Survival data | See reference | | 236 |
| MSigDB | Annotated gene sets for use with GSEA | Gene sets | http://software.broadinstitute.org/gsea/msigdb/index.jsp | | 237 |
| Enrichr | Annotated gene sets for use with GSEA | Gene sets | https://amp.pharm.mssm.edu/Enrichr/ | | 238 |
| GeneMANIA | Hypothesis generation regarding function of a gene | Multiple omics-based data | https://genemania.org/ | | 239 |
| L1000CDS2 | LINCS L1000 characteristic direction signature search engine | Finds consensus L1000 small molecule signatures that match user input signatures | https://amp.pharm.mssm.edu/L1000CDS2/#/index | | 240 |
| Geneshot | Ranking genes based on text mining | Literature, expression data | https://amp.pharm.mssm.edu/geneshot/ | | 241 |


Fig. 2 recapitulates progress at the data generation frontier, which includes drug screening in cell lines, functional genomics (RNAi and CRISPR) screens, detailed characterization of cell lines, and finally exome or whole-genome sequencing of patients and healthy volunteers. Some of these data have already been used with AI/ML-based approaches to identify novel synthetic lethality pairs, predict drug IC50 values, or even predict clinical outcomes.207, 242, 243 By designing an AI algorithm to analyze CT scan images, researchers have created a radiomic signature that defines the level of lymphocyte infiltration of a tumor and provides a predictive score for the efficacy of immunotherapy in the patient.244 In another study, gene expression profiling was performed on needle biopsy specimens from the livers of 216 patients with hepatitis C-related early-stage cirrhosis who were prospectively followed up for a median of 10 years. Evaluation of a 186-gene signature used to predict outcomes of patients with hepatocellular carcinoma showed that this signature is also associated with outcomes of patients with hepatitis C-related early-stage cirrhosis.245 Recently, whole-genome sequencing was used to accurately predict profiles of susceptibility to first-line antituberculosis drugs.246

Fig. 2 Historic resources for clinical trials.

Table 8 lists selected examples of historical data sets, potential methods to analyze them, and their respective applications in biopharma. The recent innovation in the field of AI has been enabled primarily by the confluence of rapid advances in affordable computing power in the form of cloud computing, infrastructure to process and manage large-scale data sets, and architectures and methodologies such as neural networks.

Table 8

Selected examples of historical data sets, potential methods to analyze them, and their respective applications in biopharma.

| Examples | Data type | Data and methods | Applications in biopharma |
| --- | --- | --- | --- |
| National Biomedical Imaging Archive (NBIA); GenomeRNAi | Imaging data | Image preprocessing and analyses, data annotation, data extraction, segmentation, deep learning, computer vision | Clinical or cellular phenotyping, patient stratification and disease subclassification |
| TCGA; dbGaP | Genomic data | Variant calling, annotation, structural variants, differential expression | Diagnosis, disease subtyping, therapeutic matching, clinical trial matching |
| UK Biobank; BioMe Biobank | Biobanks and electronic health records | Clinical trajectory estimation, biomarker-based modeling | Predict risk of diseases, real world evidence modeling |
| ClinicalTrials.gov; AACT Database | Clinical trials databases | Clinical trial protocols, performance metrics, patient population summaries | Predictive modeling of clinical trial metrics |


References

1 Zitnik M. Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Inf Fusion. 2019;50:71–91.

2 Jacobsen A. A generic workflow for the data FAIRification process. Data Intell. 2020;2:56–65.

3 FAIRification process - GO FAIR. Available at: https://www.go-fair.org/fair-principles/fairification-process/ [Accessed 11 August 2020].

4 ChEMBL. Available at: https://www.ebi.ac.uk/chembl/ [Accessed 5 September 2018].

5 Gaulton A. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012;40:D1100–D1107.

6 ChEMBL data questions - ChEMBL interface documentation. Available at: https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-data-questions [Accessed 11 August 2020].

7 The evolving role of clinical trial data sharing. Available at: https://pharmaphorum.com/views-and-analysis/clinical-trial-data-sharing/ [Accessed 4 September 2020].

8 Miller J., Ross J.S., Wilenzick M., Mello M.M. Sharing of clinical trial data and results reporting practices among large pharmaceutical companies: cross sectional descriptive study and pilot of a tool to improve company practices. BMJ. 2019;366:l4127.

9 MELLODDY. Available at: https://www.melloddy.eu/ [Accessed 4 September 2020].

10 Rouse M., Botelho B., Bigelow S. Big data. Search Data Management. Available at: https://searchdatamanagement.techtarget.com/definition/big-data. 2020.

11 Ishwarappa, Anuradha J. A brief introduction on big data 5Vs characteristics and Hadoop technology. Procedia Comput Sci. 2015;48:319–324.

12 Horgan R.P., Kenny L.C. ‘Omic’ technologies: genomics, transcriptomics, proteomics and metabolomics. Obstet Gynaecol. 2011;13:189–195.

13 Paananen J., Fortino V. An omics perspective on drug target discovery platforms. Brief Bioinform. 2019 bbx122.

14 Simon R., Roychowdhury S. Implementing personalized cancer genomics in clinical trials. Nat Rev Drug Discov. 2013;12:358–369.

15 A brief guide to genomics. Available at: https://www.genome.gov/about-genomics/fact-sheets/A-Brief-Guide-to-Genomics [Accessed 14 October 2019].

16 Libbrecht M.W., Noble W.S. Machine learning applications in genetics and genomics. Nat Rev Genet. 2015;16:321–332.

17 Transcriptomics - Latest research and news | Nature. Available at: https://www.nature.com/subjects/transcriptomics [Accessed 14 July 2020].

18 Alexander-Dann B. Developments in toxicogenomics: understanding and predicting compound-induced toxicity from gene expression data. Mol Omics. 2018;14:218–236.

19 Wang Z., Gerstein M., Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63.

20 Transcriptomics today: Microarrays, RNA-seq, and more | Science | AAAS. Available at: https://www.sciencemag.org/features/2015/07/transcriptomics-today-microarrays-rna-seq-and-more [Accessed 14 July 2020].

21 Akter S. Machine learning classifiers for endometriosis using transcriptomics and methylomics data. Front Genet. 2019;10:766.

22 Singh S.P. Machine learning based classification of cells into chronological stages using single-cell transcriptomics. Sci Rep. 2018;8:17156.

23 Roessner U., Bowne J. What is metabolomics all about? BioTechniques. 2009;46:363–365.

24 Dias-Audibert F.L. Combining machine learning and metabolomics to identify weight gain biomarkers. Front Bioeng Biotechnol. 2020;8:.

25 Sen P. Deep learning meets metabolomics: a methodological perspective. Brief Bioinform. 2020;doi:10.1093/bib/bbaa204.

26 Pradas I. Lipidomics reveals a tissue-specific fingerprint. Front Physiol. 2018;9:1165.

27 Yang K., Han X. Lipidomics: techniques, applications, and outcomes related to biomedical sciences. Trends Biochem Sci. 2016;41:954–969.

28 Meikle P.J., Wong G., Barlow C.K., Kingwell B.A. Lipidomics: potential role in risk prediction and therapeutic monitoring for diabetes and cardiovascular disease. Pharmacol Ther. 2014;143:12–23.

29 Fan S. Systematic error removal using random forest for normalizing large-scale untargeted lipidomics data. Anal Chem. 2019;91:3590–3596.

30 What is proteomics? | EMBL-EBI Train online. Available at: https://www.ebi.ac.uk/training/online/course/proteomics-introduction-ebi-resources/what-proteomics [Accessed 8 October 2019].

31 Graves P.R., Haystead T.A.J. Molecular biologist’s guide to proteomics. Microbiol Mol Biol Rev. 2002;66:39–63.

32 Swan A.L., Mobasheri A., Allaway D., Liddell S., Bacardit J. Application of machine learning to proteomics data: classification and biomarker identification in postgenomics biology. OMICS A J Integr Biol. 2013;17:595–610.

33 Gessulat S. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat Methods. 2019;16:509–518.

34 Cao Y., Charisi A., Cheng L.-C., Jiang T., Girke T. ChemmineR: a compound mining framework for R. Bioinformatics. 2008;24:1733–1734.

35 R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020. https://www.R-project.org/.

36 Landrum G. RDKit: open-source cheminformatics. https://www.rdkit.org.

37 Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B. KNIME: The Konstanz Information Miner. In: Studies in Classification, Data Analysis, and Knowledge Organization. Springer; 2007.

38 What is the correct format for compounds in SDF or MOL files? - Progenesis SDF studio. Available at: http://www.nonlinear.com/progenesis/sdf-studio/v0.9/faq/sdf-file-format-guidance.aspx [Accessed 18 October 2019].

39 Heller S.R., McNaught A., Pletnev I., Stein S., Tchekhovskoi D. InChI, the IUPAC international chemical identifier. J Cheminform. 2015;7:.

40 Heller S., McNaught A., Stein S., Tchekhovskoi D., Pletnev I. InChI—the worldwide chemical structure identifier standard. J Cheminform. 2013;5:.

41 O’Boyle N.M. Towards a Universal SMILES representation—a standard method to generate canonical SMILES based on the InChI. J Cheminform. 2012;4:22.

42 chem-bla-ics: InChIKey collision: the DIY copy/pastables. Available at: https://chem-bla-ics.blogspot.com/2011/09/inchikey-collision-diy-copypastables.html?_sm_au_=iHHRkrfFZLWsZNV6 [Accessed 16 September 2019].

43 An InChIkey collision is discovered and NOT based on stereochemistry. ChemConnector blog. Available at: http://www.chemconnector.com/2011/09/01/an-inchikey-collision-is-discovered-and-not-based-on-stereochemistry/ [Accessed 16 September 2019].

44 Willighagen E.L. InChIKey collision: the DIY copy/pastables. 2011.

45 Pletnev I. InChIKey collision resistance: an experimental testing. J Cheminform. 2012;4:.

46 Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28:31–36.

47 Weininger D., Weininger A., Weininger J.L. SMILES. 2. algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci. 1989;29:97–101.

48 Weininger D. SMILES. 3. DEPICT. Graphical depiction of chemical structures. J Chem Inf Comput Sci. 1990;30:237–243.

49 Daylight theory: SMARTS - a language for describing molecular patterns. Daylight Chemical Information Systems, Inc; 2012. Available at: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html [Accessed 8 September 2018].

50 Yang H. Evaluation of different methods for identification of structural alerts using chemical Ames mutagenicity data set as a benchmark. Chem Res Toxicol. 2017;30:1355–1364.

51 Cammarata A., Menon G.K. Pattern recognition. Classification of therapeutic agents according to pharmacophores. J Med Chem. 1976;19:739–748.

52 Wu Y., Wang G. Machine learning based toxicity prediction: from chemical structural description to transcriptome analysis. Int J Mol Sci. 2018;19:2358.

53 Landrum G. Fingerprints in the RDKit. RDKit UGM 2012: fingerprints in the RDKit. Available at: https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf. 2012.

54 Durant J.L., Leland B.A., Henry D.R., Nourse J.G. Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci. 2002;42:1273–1280.

55 Rogers D., Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50:742–754.

56 Morgan H.L. The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J Chem Doc. 1965;5:107–113.

57 Daylight theory: fingerprints. Available at: https://www.daylight.com/dayhtml/doc/theory/theory.finger.html [Accessed 16 September 2019].

58 Faulon J.L., Visco D.P., Pophale R.S. The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. J Chem Inf Comput Sci. 2003;43:707–720.

59 Faulon J.L., Churchwell C.J., Visco D.P. The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. J Chem Inf Comput Sci. 2003;43:721–734.

60 GitHub - reymond-group/mhfp: Molecular MHFP fingerprints for cheminformatics applications. Available at: https://github.com/reymond-group/mhfp [Accessed 9 October 2020].

61 Probst D., Reymond J.L. A probabilistic molecular fingerprint for big data settings. J Cheminform. 2018;10:.

62 Bender A. ‘Bayes affinity fingerprints’ improve retrieval rates in virtual screening and define orthogonal bioactivity space: when are multitarget drugs a feasible concept? J Chem Inf Model. 2006;46:2445–2456.

63 Wang Y. PubChem BioAssay: 2017 update. Nucleic Acids Res. 2017;45:D955–D963.

64 Klekota J., Roth F.P. Chemical substructures that enrich for biological activity. Bioinformatics. 2008;24:2518–2525.

65 Banerjee P., Siramshetty V.B., Drwal M.N., Preissner R. Computational methods for prediction of in vitro effects of new chemical structures. J Cheminform. 2016;8:.

66 Fan D. In silico prediction of chemical genotoxicity using machine learning methods and structural alerts. Toxicol Res (Camb). 2018;7:211–220.

67 O’Boyle N.M., Sayle R.A. Comparing structural fingerprints using a literature-based similarity benchmark. J Cheminform. 2016;8:.

68 How to choose bits and radius during circular fingerprint calculation in RDKit? Available at: https://www.researchgate.net/post/How_to_choose_bits_and_radius_during_circular_fingerprint_calculation_in_RDKit [Accessed 18 September 2019].

69 Dalke A. MACCS key 44. Available at: http://www.dalkescientific.com/writings/diary/archive/2014/10/17/maccs_key_44.html. 2019.

70 Fingerprint generation - Toolkits - Python. Available at: https://docs.eyesopen.com/toolkits/python/graphsimtk/fingerprint.html#section-fingerprint-path [Accessed 5 February 2020].

71 Alvarsson J. Ligand-based target prediction with signature fingerprints. J Chem Inf Model. 2014;54:2647–2653.

72 Dhawan A., Kwon S. In vitro toxicology. Int J Toxicol. 2017;doi:10.1080/10915810305079.

73 Yang H., Sun L., Li W., Liu G., Tang Y. Identification of nontoxic substructures: a new strategy to avoid potential toxicity risk. Toxicol Sci. 2018;165:396–407.

74 Yap C.W. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011;32:1466–1474.

75 O’Boyle N.M. Open Babel: an open chemical toolbox. J Cheminform. 2011;3:.

76 Steinbeck C. The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci. 2003;43:493–500.

77 Tetko I.V. Virtual computational chemistry laboratory—design and description. J Comput Aided Mol Des. 2005;19:453–463.

78 Hewitt M., Enoch S.J., Madden J.C., Przybylak K.R., Cronin M.T.D. Hepatotoxicity: a scheme for generating chemical categories for read-across, structural alerts and insights into mechanism(s) of action. Crit Rev Toxicol. 2013;43:537–558.

79 Borgelt C., Berthold M.R. Mining molecular fragments: finding relevant substructures of molecules. In: 2002 IEEE International Conference on Data Mining. ICDM; 2002:51–58. IEEE Comput. Soc, 2002. https://doi.org/10.1109/ICDM.2002.1183885.

80 Venkatapathy R., Wang N.C.Y. Developmental toxicity prediction. In: Reisfeld B., Mayeno A.N., eds. Computational toxicology. vol. 930. Humana Press; 2013:305–340.

81 Raies A.B., Bajic V.B. In silico toxicology: computational methods for the prediction of chemical toxicity. Wiley Interdiscip Rev Comput Mol Sci. 2016;6:147–172.

82 Gómez-Jiménez G. The OECD principles for (Q)SAR models in the context of knowledge discovery in databases (KDD). Adv Protein Chem Struct Biol. 2018;113:85–117.

83 Marchant C.A., Briggs K.A., Long A. In silico tools for sharing data and knowledge on toxicity and metabolism: Derek for Windows, Meteor, and Vitic. Toxicol Mech Methods. 2008;18:177–187.

84 Lagorce D., Sperandio O., Baell J.B., Miteva M.A., Villoutreix B.O. FAF-Drugs3: a web server for compound property calculation and chemical library design. Nucleic Acids Res. 2015;43:W200–W207.

85 Sanz F. Integrative modeling strategies for predicting drug toxicities at the eTOX project. Mol Inform. 2015;34:.

86 Sushko I., Salmina E., Potemkin V.A., Poda G., Tetko I.V. ToxAlerts: a web server of structural alerts for toxic chemicals and compounds with potential adverse reactions. J Chem Inf Model. 2012;52:2310–2316.

87 CombiGlide 2.5 User Manual. Library; 2009.

88 Friesner R.A. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem. 2004;47:1739–1749.

89 Halgren T.A. Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J Med Chem. 2004;47:1750–1759.

90 Amberg A. Principles and procedures for handling out-of-domain and indeterminate results as part of ICH M7 recommended (Q)SAR analyses. Regul Toxicol Pharmacol. 2019;102:53–64.

91 Benfenati E., Manganaro A., Gini G. VEGA-QSAR: AI inside a platform for predictive toxicology. In: CEUR workshop proceedings, vol. 1107; CEUR-WS; 2013:21–28.

92 Capuzzi S.J. Chembench: a publicly accessible, integrated cheminformatics portal. J Chem Inf Model. 2017;57:105–108.

93 Dong J. ChemSAR: an online pipelining platform for molecular SAR modeling. J Cheminform. 2017;9:.

94 Patlewicz G., Jeliazkova N., Safford R.J., Worth A.P., Aleksiev B. An evaluation of the implementation of the Cramer classification scheme in the Toxtree software. SAR QSAR Environ Res. 2008;19:495–524.

95 Maunz A. Lazar: a modular predictive toxicology framework. Front Pharmacol. 2013;4:.

96 Cheng F. AdmetSAR: a comprehensive source and free tool for assessment of chemical ADMET properties. J Chem Inf Model. 2012;52:3099–3105.

97 Kemmish H., Fasnacht M., Yan L. Fully automated antibody structure prediction using BIOVIA tools: validation study. PLoS One. 2017;12:e0177923.

98 Vellay S.G.P., Latimer N.E.M., Paillard G. Interactive text mining with Pipeline Pilot: a bibliographic web-based tool for PubMed. Infect Disord Drug Targets. 2009;9:366–374.

99 Hevener K.E. Computational toxicology methods in chemical library design and high-throughput screening hit validation. Methods Mol Biol. 2018;1800:275–285.

100 Hvidsten T.R., Kryshtafovych A., Fidelis K. Local descriptors of protein structure: a systematic analysis of the sequence-structure relationship in proteins using short- and long-range interactions. Proteins Struct Funct Bioinform. 2009;75:870–884.

101 Ruiz-Blanco Y.B., Paz W., Green J., Marrero-Ponce Y. ProtDCal: a program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins. BMC Bioinform. 2015;16:.

102 Wang X., Wu Y., Wang R., Wei Y., Gui Y. A novel matrix of sequence descriptors for predicting protein-protein interactions from amino acid sequences. PLoS One. 2019;14:e0217312.

103 Segaran T. Programming collective intelligence: building smart Web 2.0 applications. Sebastopol, CA: O’Reilly Media; 2007.

104 Discussion of similarity metrics - Jaccard/Tanimoto coefficient. Available at: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/tani.html [Accessed 19 September 2019].

105 Tversky A. Features of similarity. Psychol Rev. 1977;84:327–352.

106 Similarity measures - Toolkits - Python. Available at: https://docs.eyesopen.com/toolkits/python/graphsimtk/measure.html [Accessed 6 February 2020].

107 Dice L.R. Measures of the amount of ecologic association between species. Ecology. 1945;26:297–302.

108 Tan P.-N., Steinbach M., Karpatne A., Kumar V. Introduction to data mining. Pearson Addison Wesley; 2006.

109 Discussion of similarity metrics - Cosine similarity.

110 Bajusz D., Rácz A., Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform. 2015;7:.

111 Martin Y.C., Kofron J.L., Traphagen L.M. Do structurally similar molecules have similar biological activity? J Med Chem. 2002;45:4350–4358.

112 Burello E. Review of (Q)SAR models for regulatory assessment of nanomaterials risks. NanoImpact. 2017;8:48–58.

113 Topliss J.G. A manual method for applying the Hansch approach to drug design. J Med Chem. 1977;20:463–469.

114 Craig P.N. Comparison of the Hansch and Free-Wilson approaches to structure-activity correlation. In: Van Valkenburg W., ed. Biological correlations - the Hansch approach. vol. 114. American Chemical Society; 1974:115–129.

115 Cover T., Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13:21–27.

116 Idakwo G. A review of feature reduction methods for QSAR-based toxicity prediction. In: Hong H., ed. Advances in computational toxicology. vol. 30. Springer International Publishing; 2019:119–139.

117 Waters M. CEBS—chemical effects in biological systems: a public data repository integrating study design and toxicity data with microarray and proteomics data. Nucleic Acids Res. 2008;36:D892–D900.

118 Juan-Blanco T., Duran-Frigola M., Aloy P. IntSide: a web server for the chemical and biological examination of drug side effects. Bioinformatics. 2015;31:612–613.

119 Cheng F. Adverse drug events: database construction and in silico prediction. J Chem Inf Model. 2013;53:744–752.

120 Kuhn M., Letunic I., Jensen L.J., Bork P. The SIDER database of drugs and side effects. Nucleic Acids Res. 2016;44:D1075–D1079.

121 Stark C. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34:D535–D539.

122 Juty N. BioModels: content, features, functionality, and use. CPT Pharmacometr Syst Pharmacol. 2015;4:e3.

123 Huttlin E.L. The BioPlex network: a systematic exploration of the human interactome. Cell. 2015;162:425–440.

124 Chen J.Y., Pandey R., Nguyen T.M. HAPPI-2: a comprehensive and high-quality map of human annotated and predicted protein interactions. BMC Genomics. 2017;18:182.

125 Peri S. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003;13:2363–2371.

126 Hermjakob H. IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004;32:D452–D455.

127 Li T. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat Methods. 2016;14:61–64.

128 Calderone A., Castagnoli L., Cesareni G. Mentha: a resource for browsing integrated protein-interaction networks. Nat Methods. 2013;10:690–691.

129 Türei D. NRF2-ome: an integrated web resource to discover protein interaction and regulatory networks of NRF2. Oxidative Med Cell Longev. 2013;2013:.

130 Türei D., Korcsmáros T., Saez-Rodriguez J. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat Methods. 2016;13:966–967.

131 Fazekas D. SignaLink 2—a signaling pathway resource with multi-layered regulatory networks. BMC Syst Biol. 2013;7:.

132 Perfetto L. SIGNOR: a database of causal relationships between biological entities. Nucleic Acids Res. 2016;44:D548–D554.

133 Szklarczyk D. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47:D607–D613.

134 Kanehisa M., Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.

135 Liberzon A. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1:417–425.

136 Rodchenkov I. Pathway commons 2019 update: integration, analysis and exploration of pathway data. Nucleic Acids Res. 2020;48:D489–D497.

137 Cerami E.G. Pathway commons, a web resource for biological pathway data. Nucleic Acids Res. 2011;39:D685–D690.

138 Barbarino J.M., Whirl-Carrillo M., Altman R.B., Klein T.E. PharmGKB: a worldwide resource for pharmacogenomic information. Wiley Interdiscip Rev Syst Biol Med. 2018;10:e1417.

139 Thorn C.F., Klein T.E., Altman R.B. PharmGKB: the pharmacogenomics knowledge base. Methods Mol Biol. 2013;1015:311–320.

140 Yu J., Gu X., Yi S. Ingenuity pathway analysis of gene expression profiles in distal nerve stump following nerve injury: Insights into wallerian degeneration. Front Cell Neurosci. 2016;10:.

141 Croft D. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39:D691–D697.

142 Reactome | EMBL-EBI Train online. Available at: https://www.ebi.ac.uk/training/online/course/proteomics-introduction-ebi-resources/proteomics-resources-ebi/reactome [Accessed 10 October 2019].

143 Carbon S. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 2019;47:D330–D338.

144 Slenter D.N. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 2018;46:D661–D667.

145 Yang H. AdmetSAR 2.0: web-service for prediction and optimization of chemical ADMET properties. Bioinformatics. 2019;35:1067–1069.

146 Gilson M.K. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016;44:D1045–D1053.

147 Davis A.P. The Comparative Toxicogenomics Database: update 2019. Nucleic Acids Res. 2019;47:D948–D954.

148 Taboureau O. ChemProt: a disease chemical biology database. Nucleic Acids Res. 2011;39:D367–D372.

149 Kringelum J. ChemProt-3.0: a global chemical biology diseases mapping. Database (Oxford). 2016 bav123.

150 Richard A.M., Williams C.L.R. Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat Res Fundam Mol Mech Mutagen. 2002;499:27–52.

151 Austin T., Denoyelle M., Chaudry A., Stradling S., Eadsforth C. European chemicals agency dossier submissions as an experimental data source: refinement of a fish toxicity model for predicting acute LC50 values. Environ Toxicol Chem. 2015;34:369–378.

152 Douguet D. Data sets representative of the structures and experimental properties of FDA-approved drugs. ACS Med Chem Lett. 2018;9:204–209.

153 Kim S. PubChem substance and compound databases. Nucleic Acids Res. 2016;44:D1202–D1213.

154 Schmidt U. SuperToxic: a comprehensive database of toxic compounds. Nucleic Acids Res. 2009;37:D295–D299.

155 Wishart D. T3DB: the toxic exposome database. Nucleic Acids Res. 2015;43:D928–D934.

156 Thomas R.S. The US Federal Tox21 Program: a strategic and operational plan for continued leadership. ALTEX. 2018;35:163–168.

157 Kohonen P. The ToxBank data warehouse: supporting the replacement of in vivo repeated dose systemic toxicity testing. Mol Inform. 2013;32:47–63.

158 Richard A.M. ToxCast chemical landscape: paving the road to 21st century toxicology. Chem Res Toxicol. 2016;29:1225–1251.

159 Wexler P. TOXNET: an evolving web resource for toxicology and environmental health information. Toxicology. 2001;157:3–10.

160 Chen X., Ji Z.L., Chen Y.Z. TTD: therapeutic target database. Nucleic Acids Res. 2002;30:412–415.

161 Kostich M.S. Aquatic concentrations of chemical analytes compared to ecotoxicity estimates. Sci Total Environ. 2017;579:.

162 Wishart D.S. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46:D1074–D1082.

163 Kuhn M., von Mering C., Campillos M., Jensen L.J., Bork P. STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res. 2008;36:D684–D688.

164 Subramanian A. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell. 2017;171:1437–1452.e17.

165 Barel G., Herwig R. Network and pathway analysis of toxicogenomics data. Front Genet. 2018;9:.

166 Musa A., Tripathi S., Dehmer M., Emmert-Streib F. L1000 viewer: a search engine and Web interface for the LINCS data repository. Front Genet. 2019;10:.

167 Igarashi Y. Open TG-GATEs: a large-scale toxicogenomics database. Nucleic Acids Res. 2015;43:D921–D927.

168 Clough E., Barrett T. The gene expression omnibus database. Methods Mol Biol. 2016;1418:93–110.

169 Athar A. ArrayExpress update—from bulk to single-cell expression data. Nucleic Acids Res. 2019;47:D711–D715.

170 Apweiler R. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011;39:D214–D219.

171 UniProt. Available at: https://www.uniprot.org/ [Accessed 10 October 2019].

172 UniProtKB | EMBL-EBI Train online. Available at: https://www.ebi.ac.uk/training/online/course/proteomics-introduction-ebi-resources/proteomics-resources-ebi/uniprotkb [Accessed 10 October 2019].

173 RCSB PDB: homepage. Available at: http://www.rcsb.org/ [Accessed 10 October 2019].

174 Berman H.M. The protein data bank. Nucleic Acids Res. 2000;28:235–242.

175 Vizcaíno J.A. A guide to the Proteomics Identifications Database proteomics data repository. Proteomics. 2009;9:4276–4283.

176 PRIDE | EMBL-EBI Train online. Available at: https://www.ebi.ac.uk/training/online/course/proteomics-introduction-ebi-resources/proteomics-resources-ebi/pride [Accessed 10 October 2019].

177 PRIDE archive. Available at: https://www.ebi.ac.uk/pride/archive/ [Accessed 10 October 2019].

178 Schmidt T. ProteomicsDB. Nucleic Acids Res. 2018;46:D1271–D1281.

179 gnomAD. Available at: https://gnomad.broadinstitute.org/ [Accessed 5 August 2020].

180 Siramshetty V.B. WITHDRAWN—a resource for withdrawn and discontinued drugs. Nucleic Acids Res. 2016;44:D1080–D1086.

181 DisGeNET - a database of gene-disease associations. Available at: https://www.disgenet.org/ [Accessed 26 July 2020].

182 Home - open targets. Available at: https://www.opentargets.org/ [Accessed 26 July 2020].

183 Home | IUPHAR/BPS Guide to PHARMACOLOGY. (2015). Available at: https://www.guidetopharmacology.org/ [Accessed 31 July 2020].

184 SuperTarget. Available at: http://insilico.charite.de/supertarget/ [Accessed 26 July 2020].

185 Excelra | Data science to empower life science innovation. Available at: https://www.gostardb.com/about-gostar.jsp [Accessed 5 April 2018].

186 Search - SureChEMBL. Available at: https://www.surechembl.org/search/ [Accessed 31 July 2020].

187 Fonger G.C., Stroup D., Thomas P.L., Wexler P. Toxnet: a computerized collection of toxicological and environmental health information. Toxicol Ind Health. 2000;16:4–6.

188 Gaulton A. The ChEMBL database in 2017. Nucleic Acids Res. 2017;45:D945–D954.

189 Tatonetti N.P., Ye P.P., Daneshjou R., Altman R.B. Data-driven prediction of drug effects and interactions. Sci Transl Med. 2012;4:125ra31.

190 Kanehisa M. The KEGG database. Novartis Found Symp. 2002;247:91–103, 119–128, 244–252.

191 Römer M., Backert L., Eichner J., Zell A. ToxDBScan: large-scale similarity screening of toxicological databases for drug candidates. Int J Mol Sci. 2014;15:19037–19055.

192 Proteomics resources at the EBI | EMBL-EBI Train online. Available at: https://www.ebi.ac.uk/training/online/course/proteomics-introduction-ebi-resources/proteomics-resources-ebi [Accessed 10 October 2019].

193 HUPO-PSI Working groups and Outputs | HUPO proteomics standards initiative. Available at: http://www.psidev.info/ [Accessed 10 October 2019].

194 Search - SureChEMBL. Available at: https://www.surechembl.org/search/ [Accessed 4 August 2017].

195 Wilson A.M., Thabane L., Holbrook A. Application of data mining techniques in pharmacovigilance. Br J Clin Pharmacol. 2004;57:127–134.

196 Duggirala H.J. Use of data mining at the Food and Drug Administration. J Am Med Inform Assoc. 2016;23:428–434.

197 Xu Z., Kass-Hout T., Anderson-Smits C., Gray G. Signal detection using change point analysis in postmarket surveillance. Pharmacoepidemiol Drug Saf. 2015;24:663–668.

198 Perner P., Bichindaritz I., Salvetti O. Advances in data mining applications in medicine, web mining, marketing, image and signal mining; proceedings. In: 6th Industrial Conference on Data Mining, Leipzig. Springer; 2006.

199 Ventola C.L. Big data and pharmacovigilance: data mining for adverse drug events and interactions. P T A Peer-Review J Formul Manag. 2018;43:340–351.

200 Basile A.O., Yahi A., Tatonetti N.P. Artificial intelligence for drug toxicity and safety. Trends Pharmacol Sci. 2019;40:624–635.

201 Reinhold W.C. CellMiner: a web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the NCI-60 cell line set. Cancer Res. 2012;72:3499–3511.

202 McDermott U. Identification of genotype-correlated sensitivity to selective kinase inhibitors by using high-throughput tumor cell line profiling. Proc Natl Acad Sci U S A. 2007;104:19936–19941.

203 Barretina J. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483:603–607.

204 Ghandi M. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature. 2019;569:503–508.

205 Li H. The landscape of cancer cell line metabolism. Nat Med. 2019;25:850–860.

206 Garnett M.J. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature. 2012;483:570–575.

207 Iorio F. A landscape of pharmacogenomic interactions in cancer. Cell. 2016;166:740–754.

208 Basu A. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell. 2013;154:1151–1161.

209 Klijn C. A comprehensive transcriptional portrait of human cancer cell lines. Nat Biotechnol. 2015;33:306–312.

210 Lamb J. The Connectivity Map: a new tool for biomedical research. Nat Rev Cancer. 2007;7:54–60.

211 Li J. Characterization of human cancer cell lines by reverse-phase protein arrays. Cancer Cell. 2017;31:225–239.

212 Cheng H. Genomic and transcriptomic characterization links cell lines with aggressive head and neck cancers. Cell Rep. 2018;25:1332–1345.e5.

213 Hutter C., Zenklusen J.C. The cancer genome atlas: creating lasting value beyond its data. Cell. 2018;173:283–285.

214 International Cancer Genome Consortium. International network of cancer genome projects. Nature. 2010;464:993–998.

215 Rudnick P.A. A description of the clinical proteomic tumor analysis consortium (CPTAC) common data analysis pipeline. J Proteome Res. 2016;15:1023–1032.

216 Zhang H. Integrated proteogenomic characterization of human high-grade serous ovarian cancer. Cell. 2016;166:755–765.

217 Mun D.G. Proteogenomic characterization of human early-onset gastric cancer. Cancer Cell. 2019;35:111–124.e10.

218 Rees M.G. Correlating chemical sensitivity and basal gene expression reveals mechanism of action. Nat Chem Biol. 2016;12:109–116.

219 Seashore-Ludlow B. Harnessing connectivity in a large-scale small-molecule sensitivity dataset. Cancer Discov. 2015;5.

220 Stathias V. LINCS Data Portal 2.0: next generation access point for perturbation-response signatures. Nucleic Acids Res. 2020;48:D431–D439.

221 Buniello A. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47:D1005–D1012.

222 Papatheodorou I. Expression Atlas update: from tissues to single cells. Nucleic Acids Res. 2020;48:D77–D83.

223 Landrum M.J. ClinVar: Public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42:D980–D985.

224 Sloan C.A. ENCODE data at the ENCODE portal. Nucleic Acids Res. 2016;44:D726–D732.

225 Regev A. The Human Cell Atlas. eLife. 2017;6.

226 Rozenblatt-Rosen O., Stubbington M.J.T., Regev A., Teichmann S.A. The Human Cell Atlas: from vision to reality. Nature. 2017;550:451–453.

227 Mele M. Human genomics. The human transcriptome across tissues and individuals. Science. 2015;348:660–665.

228 Sestan E. Revealing the brain’s molecular architecture. Science. 2018;362:1262–1263.

229 Beecham G.W. The Alzheimer’s Disease Sequencing Project: Study design and sample selection. Neurol Genet. 2017;3:e194.

230 Lambert J.C. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat Genet. 2013;45:1452–1458.

231 Pereanu W. AutDB: a platform to decode the genetic architecture of autism. Nucleic Acids Res. 2018;46:D1049–D1054.

232 Hall D., Huerta M.F., McAuliffe M.J., Farber G.K. Sharing heterogeneous data: the national database for autism research. Neuroinformatics. 2012;10:331–339.

233 Forbes S.A. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 2017;45:D777–D783.

234 van der Meer D. Cell Model Passports—a hub for clinical, genetic and functional datasets of preclinical cancer models. Nucleic Acids Res. 2019;47:D923–D929.

235 Gao J. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6:pl1.

236 Liu J. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell. 2018;173:400–416.e11.

237 Liberzon A. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–1740.

238 Chen E.Y. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. 2013;14.

239 Warde-Farley D. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010;38:W214–W220.

240 Duan Q. L1000CDS(2): LINCS L1000 characteristic direction signatures search engine. NPJ Syst Biol Appl. 2016;2.

241 Lachmann A. Geneshot: search engine for ranking genes from arbitrary text queries. Nucleic Acids Res. 2019;47:W571–W577.

242 Jerby-Arnon L. Predicting cancer-specific vulnerability via data-driven detection of synthetic lethality. Cell. 2014;158:1199–1209.

243 Behan F.M. Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature. 2019;568:511–516.

244 Sun R. A radiomics approach to assess tumour-infiltrating CD8 cells and response to anti-PD-1 or anti-PD-L1 immunotherapy: an imaging biomarker, retrospective multicohort study. Lancet Oncol. 2018;19:1180–1191.

245 Hoshida Y. Prognostic gene expression signature for patients with hepatitis C-related early-stage cirrhosis. Gastroenterology. 2013;144:1024–1030.

246 Allix-Beguec C. Prediction of susceptibility to first-line tuberculosis drugs by DNA sequencing. N Engl J Med. 2018;379:1403–1415.
