1 Introduction

Abstract:

Recent progress in omics technologies, including genomics, transcriptomics and proteomics, offers an unprecedented opportunity to measure the molecular activity that underlies biomedical effects. Challenges remain in transforming these measurements into solid knowledge about the human body in both health and disease. Bioinformatics is the science and practice of using computer technology to handle the data needed to address complex biomedical problems. The reconstruction of complex biomedical systems can be seen as an inverse problem, the solution of which usually relies on assumptions derived from prior knowledge. Thus, an adaptive, two-way approach to biomedical investigation, proceeding both top-down and bottom-up, is briefly introduced here and further elaborated in subsequent chapters. Practical tools that are useful for intermediate steps during top-down and bottom-up investigations will also be introduced.

Key words

adaptive

inverse problem

ill-posed problem

omics

systems level analysis

1.1 Complex systems: From uncertainty to predictability

Driven by curiosity, a person starts to explore their environment immediately after birth, using their eyes, ears and touch. At this stage, these sensors are still not fully developed. Seemingly chaotic and voluminous data are captured by the senses, and need to be interpreted and understood. As a person develops and gains experience, they learn to formulate postulates and empirical rules, and to adapt and refine such rules to fit newly encountered empirical data. This is a natural adaptive and iterative process. Hypotheses that have been validated, and widely accepted by the community, then become knowledge. This new knowledge in turn provides new angles on observations and triggers new hypotheses for future investigation. Scientific knowledge has thus accumulated iteratively over time, exerting a tremendous impact on the course of human history.

The subject that triggers the most curiosity is ourselves. How can we maintain and restore our health, both physically and mentally? After centuries of inquiry, many facets of the human body still remain elusive. Established knowledge has been obtained from well-designed, hypothesis-testing experiments, the cornerstones of biomedical investigation. For example, we may postulate that a certain serum protein is associated with the onset of a particular disease. This association can then be detected by experiments carefully designed to compare two specific clinical conditions (diseased vs. non-diseased) in a controlled setting (e.g. the same age group).

Obviously, a high-quality hypothesis prior to an experiment is a necessary condition for a giant leap in biomedical knowledge. Such a high-quality hypothesis often comes from novel insights derived from earlier empirical observations. But how can we achieve novel insights in the first place?

1.1.1 Theoretical issue: Ill-posed inverse problems

Challenges during the course of biomedical investigations, either biological or technical, can often be seen as inverse problems. Examples include:

■ the identification of driver genes of cancer progression;

■ the assembly of genomic sequences from DNA sequencing reads;

■ the reconstruction of peptide sequences from tandem MS spectra;

■ the detection of homologous proteins based on amino acid sequences.

An inverse problem is the opposite of a direct problem, which is to generate observable data from underlying rules. An inverse problem tries to reconstruct the underlying rules from the data. Thus, such a problem is generally more difficult to solve than the corresponding direct problem. Usually, assumptions based on prior domain knowledge will be helpful in solving an inverse problem. As will be seen in subsequent chapters, a conceptual framework of domain knowledge is essential to guide the analysis of omics data and unveil biological mechanisms behind human health and disease.
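To make the idea concrete, the following minimal sketch (in Python; the forward model, the variable names and the regularization weight are illustrative assumptions, not taken from the text) shows how a prior assumption can stabilize an otherwise ill-posed linear inverse problem:

# Minimal sketch: an ill-posed linear inverse problem y = A x with far fewer
# observations than unknowns. The naive solution is unstable; adding a prior
# assumption (a small-norm solution, i.e. Tikhonov/ridge regularization) makes
# the reconstruction well behaved. All names (A, y, lam) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 100))               # forward model: rules -> data
x_true = np.zeros(100)
x_true[:5] = 1.0                             # underlying "rules" to be recovered
y = A @ x_true + 0.01 * rng.normal(size=20)  # direct problem: generate observations

lam = 1.0                                    # strength of the prior assumption
x_hat = np.linalg.solve(A.T @ A + lam * np.eye(100), A.T @ y)
print(np.round(x_hat[:5], 2))                # estimates of the informative coefficients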

1.1.2 Theoretical issue: Essence of biology

Biological systems have three general organizational characteristics (Hartwell et al., 1999):

■ modularity;

■ repeated use of the same molecular machinery;

■ robustness.

Biological activities operate through molecular modules; that is, some modules are activated while others are deactivated, whether in homeostasis or in response to outside stimuli. This modularity is the theoretical basis for module level analysis.

Biological systems evolved through the de novo invention of new machinery and the repeated use of existing molecular machinery, either by complete duplication or with slight adaptation, to provide redundant (duplicated) mechanisms or to fulfil new functions. This is the theoretical basis of a variety of homology searches on genomic sequences and protein structures.

Similar biological functions may be performed by multiple molecular modules. The redundant design is a way to provide biological robustness (and stability) to the entire system.

1.1.3 Theoretical issue: Adaptive learning from biomedical data

The evolutionary point of view offers a powerful tool not only for conceptualizing the world, but also for deriving mathematical models of biological systems. Instead of scanning through the infinite search space of an ill-posed inverse problem, an adaptive model may be used to generate sub-optimal solutions quickly, which are then partly refined to better fit the data, in the same way that biological systems repeatedly reuse existing machinery for new functions.
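As a rough illustration of this adaptive strategy (a hypothetical scoring function and toy gene names; not a method from this book), a simple local-refinement loop might look like:

# Minimal sketch of adaptive refinement: start from an existing (sub-optimal)
# solution and repeatedly make small local changes, keeping those that fit the
# data at least as well, instead of scanning the whole search space.
import random

def fit_score(solution, data):
    # Hypothetical goodness of fit: data coverage minus a small complexity penalty.
    coverage = sum(1 for d in data if d in solution) / len(data)
    return coverage - 0.05 * len(solution)

def adapt(solution, data, candidates, n_iter=1000, seed=0):
    random.seed(seed)
    best, best_score = set(solution), fit_score(solution, data)
    for _ in range(n_iter):
        trial = set(best)
        elem = random.choice(candidates)
        trial.symmetric_difference_update({elem})   # add or drop one element
        score = fit_score(trial, data)
        if score >= best_score:
            best, best_score = trial, score
    return best, best_score

data = ["geneA", "geneB", "geneC"]          # toy observations to be explained
candidates = data + ["geneD", "geneE"]      # existing "machinery" to reuse
print(adapt({"geneD"}, data, candidates))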

1.2 Harnessing omics technology

Life is complex, carrying rich and densely packed information. Genetic messages not only instruct the function of cells, tissues, organs and the whole body, but can also be passed down from cell to cell and from generation to generation. Modern biomedical studies have been empowered by an arsenal of information-rich omics technologies and platforms, particularly in the exploratory phases of an investigation. These technologies include, for example, next-generation sequencing, oligonucleotide microarrays, mass spectrometry and time-lapse microscopy. They offer deep assays of multiple genes and proteins in clinical samples, with constantly improving precision and detection limits. These technologies are also useful for the exploration of molecular mechanisms behind complex human physiology and pathology. They can lead toward the discovery of biomarkers and biosignatures for the prediction of disease onset or treatment response, a central theme of personalized medicine.

One important feature of omics technology is panoramic exploration. The panoramic exploration of biology at the DNA, RNA and protein levels is called genomics, transcriptomics and proteomics, respectively. This new feature enables new ways of study design such as genome-wide association studies (WTCCC, 2007) and genome-wide expression profiling (Eisen et al., 1998).

The high throughput nature of omics instruments inevitably creates a large amount of raw data in a short time frame. This creates great challenges in data analysis and interpretation. Levels of data complexity escalate as we ascend the ladder of the central dogma of molecular biology from DNA, RNA and proteins to protein interaction networks. Mathematically, the number of possible interactions grows combinatorially with the number of basic elements. For example, the number of pairwise interactions among 21,000 genes is C(21000, 2), which is on the order of 10^8, not to mention interactions involving more than two genes. Such large amounts of information require innovative management of data storage, organization and interpretation. Moreover, an adaptive strategy may be required to explore the large search space and reveal biological secrets. Informatics methodologies are thus in great demand to unleash the full power of these instruments for basic biomedical science and clinical applications alike.
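For concreteness, the pairwise count quoted above can be verified with a one-line calculation (a quick sketch using Python's standard library):

# The number of pairwise interactions among ~21,000 genes:
# C(21000, 2) = 21000 * 20999 / 2 = 220,489,500, i.e. on the order of 10^8.
from math import comb

print(comb(21_000, 2))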

1.2.1 Theoretical issues: Non-linearity and emergence

Evidence shows that humans are more comfortable with a linear way of thinking about cause and effect. Unfortunately, linear rules are poorly suited to characterizing non-linear effects. Non-linear thinking is required to comprehend biological systems, which are mostly non-linear. The current design of the human body has been polished by past evolutionary events to adapt to the environment, an adaptation subject to non-linear interactions with multiple species and environmental factors. All biomedical phenomena seem uncertain at first sight, and empirical data are critical to support any potential theory.

The non-linearity of the data in particular requires that we let the data speak for themselves and let the rules emerge. Instead of imposing strong subjective ideas, we should sit back and listen carefully to what the data have to say. We can rely on the help of the computer through the use of adaptive models, which evolve and “fit” to the data.

1.3 Bioinformatics: From theory to practice

Bioinformatics is an evolving and exciting multidisciplinary area featuring the integration, analysis and presentation of complex, high-dimensional omics data. The goal is to decipher complex biological systems. Computational technologies are crucial for integrating the heterogeneous bits and pieces into an integral whole; otherwise, even very important data may seem miscellaneous and insignificant. It is also crucial to transform the data into a more accessible and comprehensible format. It is a sophisticated process with multiple levels of analysis (the platform, contrast, module and systems levels) to reveal precious insights from the vast amounts of data generated by advanced technology.

Bioinformatics tools must be constantly updated alongside the advancement of biomedical science, as old problems are solved and new challenges continue to appear. The bioinformatics tools for cutting-edge research have most likely not yet been developed. A good appreciation of the essence of bioinformatics is required in order to develop adequate tools. That is why in this book we devote 20% of the content to theory and 80% to practical issues. Fortunately, tools do not need to be constructed from scratch; many can be established by the augmentation of previous ones.

1.3.1 Multiple level data analysis

This book addresses the essence and technical details of bioinformatics in five consecutive chapters. Chapters 2, 3 and 4 cover genomics, transcriptomics and proteomics, respectively, the three common "omics". These chapters will introduce types of study design, the major instruments available, and practical skills to condense a deluge of data to address specific biomedical questions. Specific bioinformatics tasks are categorized into four distinct levels: the platform level, the contrast level, the module level and the systems level.

Platform level analysis is usually vendor specific, due to the variety of data formats and numerical distributions generated by various platforms. Adequate software and corresponding technical manuals have often been provided by the platform vendors for this level of analysis. In addition, a great deal of open-source software is also available with extensive testing and solid performance benchmarks by pioneering genomic centers. We will briefly discuss and provide literature references for this valuable software.

Contrast level analysis aims to detect resemblances or differences in genomic or expression patterns across different physical traits or clinical conditions. The major goal is to identify genes or biological entities, amongst all the genes in the organism, which play key roles in the biological effects under study.
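As a minimal, illustrative sketch of a contrast level comparison (gene names and expression values are made up; real analyses would use the platform-specific methods described in later chapters), a per-gene two-group test might look like:

# Minimal sketch of a contrast level analysis: compare the expression of each
# gene between two clinical conditions with a two-sample t-test. All data are
# made up; real studies would also apply multiple-testing correction.
import numpy as np
from scipy import stats

expression = {
    "GENE1": (np.array([5.1, 4.8, 5.3, 5.0]), np.array([7.9, 8.2, 7.5, 8.0])),
    "GENE2": (np.array([3.2, 3.0, 3.4, 3.1]), np.array([3.1, 3.3, 3.0, 3.2])),
}

for gene, (diseased, control) in expression.items():
    t, p = stats.ttest_ind(diseased, control)
    print(f"{gene}: t = {t:.2f}, p = {p:.4f}")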

Module level analysis handles groups of genes rather than individual genes, in light of a perception that biological systems operate in modules. Modules may manifest as co-expressed genes or interacting proteins. Different molecular modules are activated whilst others are deactivated in response to internal changes or external stimuli. The activated proteins may interact with each other to fulfill their functions.
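A minimal sketch of this idea follows (hypothetical data; a simple correlation threshold stands in for the more sophisticated module-detection methods discussed in later chapters):

# Minimal sketch of module level analysis: group genes whose expression
# profiles are highly correlated across samples. Data and threshold are
# illustrative only.
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=10)                       # a shared, module-level signal
profiles = {
    "geneA": base + 0.1 * rng.normal(size=10),   # co-expressed with geneB
    "geneB": base + 0.1 * rng.normal(size=10),
    "geneC": rng.normal(size=10),                 # unrelated gene
}

genes = list(profiles)
for i in range(len(genes)):
    for j in range(i + 1, len(genes)):
        r = np.corrcoef(profiles[genes[i]], profiles[genes[j]])[0, 1]
        if abs(r) > 0.8:                          # simple co-expression threshold
            print(f"{genes[i]} and {genes[j]} belong to the same module (r = {r:.2f})")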

Finally, systems level analysis features an extensive joint analysis of two or more types of assays (e.g. DNA + RNA), which can render deeper cause-and-effect insights into biomedical systems.
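As one hedged example of such a joint analysis (all values are made up for illustration), correlating DNA copy number with RNA expression per gene can point to expression changes driven by genomic alterations:

# Minimal sketch of a systems level (multi-assay) analysis: for each gene,
# correlate DNA copy number with RNA expression across the same samples.
# A strong positive correlation suggests copy-number-driven expression.
import numpy as np

copy_number = {"GENE1": np.array([2, 2, 3, 4, 4, 5]),
               "GENE2": np.array([2, 3, 2, 2, 3, 2])}
expression  = {"GENE1": np.array([1.0, 1.2, 1.8, 2.5, 2.4, 3.1]),
               "GENE2": np.array([1.2, 1.0, 0.9, 1.3, 1.1, 1.0])}

for gene in copy_number:
    r = np.corrcoef(copy_number[gene], expression[gene])[0, 1]
    print(f"{gene}: copy number vs. expression r = {r:.2f}")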

Chapters 5 and 6 cover systems biomedical science (SBMS) and its clinical applications. SBMS is a further extension and abstraction of the systems level analysis covered in the previous chapters, aiming to adequately integrate heterogeneous data to draw novel conclusions from a holistic point of view. The major emphasis is on an adaptive top-down and bottom-up model for the exploration of uncertain space and for capturing the implicit information from the data. Chapter 6 addresses the challenges to transform knowledge from exploratory investigations into clinical use. Case studies are provided in all main chapters, demonstrating how the aforementioned techniques were used flexibly in an integral fashion to unveil novel insights.

Although we categorize the content into genomics, transcriptomics and proteomics, many contrast level and module level analyses can be shared across platforms, with slight adaptations, as long as the underlying assumptions (e.g. data normality) are met.

1.4 Take home messages

■ General or specific bioinformatics tasks are often ill-posed inverse problems, requiring prior knowledge or conceptual frameworks to solve them.

■ Adaptive models are important for finding solutions in an infinite search space.

1.5 References

Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA. 1998; 95:14863–14868.

Hartwell, L.H., Hopfield, J.J., Leibler, S., Murray, A.W. From molecular to modular cell biology. Nature. 1999; 402:C47–C52.

WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature. 2007; 447:661–678.
