Introduction

Mark Forster and Lee Harland

Summer 2012

Sharing the source code of software used in life science research has a history spanning many decades. An early example is the Quantum Chemistry Program Exchange (QCPE), founded by Richard Counts at Indiana University in 1962 with the aim of distributing the codes then available in the domain of quantum chemistry; it later came to cover other areas of chemical science such as kinetics and NMR spectroscopy. Even in the 1980s, code distribution over networks was not common, so programs were typically distributed on cartridge tapes or other media usable by the workstations of the era. Although some codes may have been available only in binary form, it was normal to make Fortran or C source code available for compilation on the receiving platform.

For code to be ‘open source’, the license that covers the code must allow users to study, modify and redistribute new versions of it. This is often referred to as libre, or ‘free as in speech’. It is often the case that the source code is also available at zero cost, which may be referred to as gratis, or ‘free as in beer’. In English, the single word ‘free’ is used to cover both of these distinct freedoms; for a more detailed explanation see the Wikipedia entry ‘Gratis versus libre’. Software that carries these freedoms may be described as Free/Libre Open Source Software, abbreviated FOSS or FLOSS. A fuller description of the philosophy and aims of the free software movement can be found at the website of the Free Software Foundation (http://www.fsf.org), an organisation founded in 1985 by Richard M. Stallman that aims to protect and promote the rights of computer users with regard to software tools. The open source philosophy fits well with the sharing of data and methods implicit in scientific research; the floss4science website (http://www.floss4science.com/) supports efforts to apply open source tools and principles in scientific computing.

The best-known and most widely used example of open source software is perhaps the Linux operating system (OS). This became available in the early 1990s with Linus Torvalds’s creation of the Linux kernel for 386-based PCs, combined with the GNU tool chain (C compiler, linker, etc.) produced by the Free Software Foundation. Today this software stack, together with its relatives and derivatives, is pervasive, running on devices ranging from set-top boxes, smartphones and embedded devices through to the majority of the world’s largest supercomputers. Open source is ‘powering’ not only life science research but also the digital lives of countless millions of people.

It is beyond the scope of this editorial to cover the landscape of open source licenses in great detail; the Open Source Initiative website provides definitions and a listing of approved licenses (http://opensource.org/). It is, however, important to understand the distinction between permissive and restrictive open source licenses, as well as the idea of copyleft as opposed to copyright. Copyright covers the rights of the author over his/her work and is a long-established and well-understood legal concept. Copyleft seeks to establish a license that maintains, in any derivative such as a new version of computer source code, the freedoms given to the work by the original author. Some open source licenses are restrictive in that they require the source code of derivative works also to be made available; an example is the GNU General Public License version 2 (GPLv2). This seeks to ensure that code made available as open source cannot later be closed and made proprietary. Other licenses are very permissive in not requiring the source code of derivative works; one example is the BSD license. For further details on these and other licenses, see the web page http://opensource.org/licenses/alphabetical. Whatever license is used, the intent of creators of open source is to freely share their work with a wider audience, promoting innovation and collaboration.

A key aspect of scientific research is that experimental methods, analysis techniques and results should be both open to scrutiny and reproducible. The issue of reproducibility in computational science has recently been discussed (Peng, Science 2011;334:1226–7; Ince et al., Nature 2012;482:485–8). Open source tools and resources support this need and ethos in a very direct way, as the code is manifestly open for inspection and recompilation by those with the technical background and interest to do so, directly supporting the reproducibility of computational data analysis. The free availability of code that can be shared among researchers without restrictive licenses also allows analysis methods to be more widely distributed. Many of the challenges facing life science researchers are common ones, whether they work in the academic, governmental or industrial domains, or in the pharmaceutical, biotechnology, plant science/agrochemical or other life science sectors. There is a growing awareness (Barnes et al., Nature Reviews Drug Discovery 2009;8:701–8) that pre-competitive collaborations are an effective and perhaps necessary response to the challenges facing the pharmaceutical and other sectors. As research and IT budgets remain static or shrink while data volumes expand greatly, these new ways of working are central to finding cost-effective solutions. Open source development performed in a collaborative manner, feeding code back to the open source community, is very consistent with this new paradigm. The benefits of industrial collaboration exist not only in the domain of software tools but also in the data vocabularies that describe much of our industrial research data; a recent publication discusses this further (Harland et al., Drug Discovery Today 2011;16:940–7). Conferences and meetings that help to link the open source and industrial communities will be of great benefit in promoting this environment of pre-competitive collaboration. In 2011 a Wellcome Trust funded meeting on ‘molecular informatics open source software’ was organised by Forster, Steinbeck and Field (http://tinyurl.com/MIOSS2011WTCC) and drew together many key developers, software experts and industrial and academic users seeking common solutions in the domains of chemical sciences, drug discovery, protein structure and related areas. This meeting created a possibly unique forum for linking the several threads of the open source community, building opportunities for future collaboration. Future meetings on this and other life science research areas would be valuable.

Importantly, not all free software is open source; some authors choose a model that gains the broadest use of the software (by making it free) while retaining control (and perhaps competitive advantage) over the source code. These models often morph into ‘freemium’ models, where basic functionality is free but more specialised or computationally intensive functionality requires payment. As eloquently described by Chris Anderson (Free: How Today’s Smartest Businesses Profit by Giving Something for Nothing, Random House Business Publishing), this and many other models for ‘free’ have proliferated thanks to the rapid growth of the internet. In addition to splitting free versus paid customers based on functionality needs, a division is often made based on non-profit status. This therefore offers an additional mechanism for consumers to gain easy access to often quite complex software that traditionally might have meant significant expense.

A question sometimes heard in connection with free and open source software is ‘What about support?’. Although many open source codes can be obtained at low or no cost, albeit with no obligation of support from the developers, there is often a thriving community of developers and other users who can answer questions of varying complexity through web-based forums. Such support may sometimes be more responsive and insightful than the help mechanisms of purely commercial, closed source offerings. In addition, new business models have emerged that offer professional paid support for code that is fully open source. Perhaps the best-known example is Red Hat, which offers support and update packages to customers for a fee. This business model has grown to such an extent that in 2012 Red Hat became the first open source-focused company to reach one billion dollars in revenue. These ‘professional open source’ business models are increasingly used within the life science domain, offering researchers the ‘best of both worlds’.

This book seeks to describe many examples where free and open source software tools and other resources can be, or have been, applied to practical problems in applied life science research. In researching the book we wanted to identify real-life case studies that address the business and technology challenges faced by industrial research today. As we received the first drafts from the chapter authors, it became clear that the book was also providing a ‘window’ on the software and decision-making practices within this sector. As is often the case with large commercial organisations, much of the innovation and creativity within is never really observed outside of the company (fire)walls. We hope that the book helps to lift this veil somewhat, allowing those on the ‘outside’ to understand the strategies companies have taken, and the reasons why. Many examples demonstrate successful deployment of open source solutions in an ‘enterprise’, organisation-wide context, challenging perceptions to the contrary. These successes reflect confidence both in the capabilities of the software stack and in the internal and external support mechanisms available for problem resolution. A further conscious decision was to mix contributions by tool consumers with chapters by those who produce these tools. We felt it was very important to show both sides of the story, and to understand the rationale behind developers’ decisions to ‘give away’ the systems developed through their hard work.

Our coverage starts with use-cases in the areas of laboratory data management, including chemical informatics and beyond. The first chapter illustrates how free and open source software has become woven into the fabric of industrial research. Claus Stie Kallesøe provides a detailed study of the development rationale and benefits of an end-to-end enterprise research data system in use within a major pharmaceutical company and based almost entirely on open source software technology. His approach and rationale make a compelling case for the value of free and open source software in today’s industrial research environment. The field of chemical toxicology is important to researchers in any company with a focus on small-molecule or other bioactive molecule discovery. A powerful and flexible infrastructure for interactive predictive chemical toxicology has been developed by Egon Willighagen and co-authors; their chapter explains how the Bioclipse client and OpenTox infrastructure combine to meet the needs of scientists in this area. The creation of flexible and efficient tools for the processing and dissemination of chemical information is covered by Aileen Day and co-authors, who have pioneered new methods in the field of academic publishing and freely share this knowledge through their chapter. Open source tools also find application in the fields of agrochemical discovery and plant breeding. Mark Earll gives us insight into the use of open source tools for mass spectrometry data processing, with specific application to metabolite identification. Key application examples and methodology are described by Rob Lind, who covers the use of the powerful ImageJ free software for image processing. Some software tools have very wide applicability and can cover a range of application areas. Thorsten Meinl, Michael Berthold and Bernd Jagla explain the use of the KNIME workflow toolkit in chemical information processing as well as next-generation sequence analysis. Finally in this area, Susanna Sansone, Philip Rocca-Serra and co-authors describe the ISA (Investigation-Study-Assay) tools, which allow flexible and facile capture of metadata for diverse experimental data sets, supporting the standardisation and sharing of information among researchers.

The book then covers case studies in genomic data analysis and bioinformatics, often part of the early stages of a life science discovery pipeline. The field of computational genomics has been well served by the release of the ‘GenomicTools’ toolkit, covered in the chapter by Aristotelis Tsirigos and co-authors. The multifaceted nature of ‘omics data in life science research demands a flexible and efficient data repository with a portal-type interface suitable for a range of end-users; Ketan Patel, Misha Kapushesky and David Dean share their experience and knowledge by describing the creation of the ‘Gene Expression Atlas’. Industrial uses are also discussed: Jolyon Holdstock shares the experience of creating an ‘omics data repository specifically aimed at meeting the needs of a small biotechnology organisation. A similar small-organisation experience is shared by Dan MacLean and Michael Burrell, who show how a core bioinformatics platform can be created and supported through the use of FLOSS. Although the institution in question is within academia, it is still an ‘applied’ research facility, and the chapter will be of great interest to those charged with building similar support functions in small biotechnology settings.

There are numerous mature and capable open source projects in the domain of information and knowledge management, including some interesting, industry-centric takes on the social networking phenomenon. Craig Bruce and Martin Harrison give insight into the internal development of ‘Design Tracker’, a highly capable support tool for research hypothesis tracking and project team working. Next, Ben Gardner and Simon Revell describe a particularly successful story regarding the use of wikis and social media tools in creating collaboration environments in a multinational industrial research context. In the following chapter, by Nick Brown and Ed Holbrook, applications for searching and visualising an array of different pharmaceutical research data are described; the open source text-indexing system Lucene is central to many of the system’s capabilities. Finally, document management, publication and linking to underlying research data is an important area of life science research. As described in the chapter by Steve Pettifer, Terri Attwood, James Marsh and Dave Thorne, the Utopia Documents application provides a ‘next-generation’ tool for accessing the scientific literature. Not only does this chapter present some exciting technology, it also describes an interesting business model around the support of ‘free’ software for commercial customers.

Our next section concerns the use of semantic technologies, which have grown in popularity in recent years and are seen as critical to success in the ‘big data’ era. To start, Laurent Alquier explains how the knowledge-sharing capabilities of a wiki can be combined with the structure of semantic data models, and describes how the Semantic MediaWiki software has been used in the creation of an enterprise-wide research encyclopaedia. Lee Harland and co-authors follow on the Semantic MediaWiki theme, detailing the creation and capabilities of intelligent systems to manage drug-target and disease knowledge within a large pharmaceutical environment. Although the MediaWiki-based software demonstrates some very useful semantic capabilities, the more fundamental technologies of RDF and triple-stores have the potential to transform the use of biological and chemical data. To illustrate exactly this point, David Wild covers the development of the Chem2Bio2RDF semantic framework, a system that provides a unique environment for answering drug-discovery questions. This is followed by Ola Bildtsen and colleagues, who describe the ‘TripleMap’ tool, which connects semantic web and collaboration technologies in a visual, user-friendly environment.

A critical and highly expensive stage in the pharmaceutical development pipeline is that of clinical trials. This area is highly regulated, and software may need complex validation in order to become accepted. Kirk Elder and Brian Ellenberger describe the use of open source tools for ‘extreme scale’ clinical analytics, with a detailed discussion of tools for business intelligence, statistical analysis and efficient large-scale data storage. In a highly regulated environment such as the pharmaceutical domain, software tools must be validated along with the other procedures used for data generation and processing; David Stokes explains the needs and procedures for achieving validation and regulatory compliance for open source software. The final chapter in the book is written by Simon Thornber and considers the economic factors related to employing FLOSS in developing, deploying and supporting applications within an industrial environment. Simon’s chapter asks the question ‘is free really free?’ and considers potential future business models for FLOSS in industry. Its central theme of ‘hosted open source’ is reminiscent of the successful Red Hat model, and the chapter argues that this model can drive increasing industry adoption of certain tools while also promoting pre-competitive collaborations.

Our own experiences of using FLOSS in industry show that this paradigm is thriving. Its applications span scientific domains and industry sectors, something reflected in the diverse range of chapters and use-cases provided here. Use of open source is likely to grow as pre-competitive initiatives, including those between industry and the public sector, are further developed. We consider that industry has much to offer the open source community, perhaps contributing directly to, or leading, the development of some toolkits. In some companies a climate of internalisation and secrecy may still prevail, but as knowledge is shared and the benefits of open source participation are made clear, one may expect the philosophy to spread and become more embedded. Even without contributing code or becoming involved in projects, simple use of FLOSS may support the movement by encouraging the use of open data standards and formats in preference to closed binary formats (or ‘blobs’).

What is perhaps most important is that the ready availability of FLOSS tools, and the ability to inspect and extend them, provide a clear opportunity for the life science industry to ‘try things’. This reduces the entry barrier to new ideas and methods and stimulates prototyping and innovation. Industry may not always contribute directly to a particular project (although in many cases it does); however, growing industry awareness of FLOSS, increasing usage and participation bode well for the future, a mutually beneficial interaction that can catalyse and invigorate the life science research sector. We hope that readers of this book get a true sense of this highly active ecosystem, and enjoy the ability to ‘peek inside’ the operations of industry and other groups. We are eternally grateful to all of the authors for their hard work in putting each story together. We also thank Enoch Huang for contributing the foreword to this work, offering an important perspective as both a consumer and a producer of FLOSS tools. Finally, we are grateful to the publishers and all who have helped us bring this unique volume to a wide audience.
