Chapter 6

International Collaboration and the Changing Digital World – Opportunities and Constraints

Matthew Dovey, Jisc

The remarkable growth in the power of IT and computer systems over the last few decades has had significant impact on the types of research problems that can be addressed, and on how research can be undertaken in distributed and cross-disciplinary teams. These new types and methods of research are supported by various technology e-infrastructures (network, computational and data) and online research environments. This chapter discusses these new technologies and the changes, potential and challenges, which these provide to the research community and practice.

Keywords

International collaboration; data; digital world; modelling; simulation; virtual research environments; video conferencing; high performance computing; high throughput computing; grid; cloud; e-infrastructures; open science; open data; virtual reality

6.1 Introduction

The remarkable growth in the power of IT and computer systems over the last few decades has had significant impact on both the way research can be undertaken and the nature of the research problems which can be addressed. In the book ‘The Fourth Paradigm: Data-Intensive Scientific Discovery’ (Hey, Tansley, & Tolle (2009)), four paradigm shifts in the scientific research process are outlined: experimental, theoretical, computational and data-driven. The latter two have been direct results of the explosion in computational power and digital storage.

Growth in computing power has permitted more complex, sophisticated and accurate mathematical modelling and simulations: complex mechanical, physical and electronic designs can be built and tested long before going anywhere near a fabrication plant; chemical and medical experiments can be conducted ‘in silico’ rather than in ‘wet labs’; real-world events such as climate or traffic flows can be modelled and predicted with increasing accuracy.

Data can be generated by an increasing range of simulations, and can also be collected from an increasing range of sources (such as sensor networks and devices, crowd-sourcing and social networks). Moreover, data can be collected with increasing accuracy and increasing granularity. Advances in computational power, network speeds and new software algorithms enable data mining to discover new correlations within this data and to identify possible relationships between data sets which could not previously have been compared.
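At its simplest, this kind of data mining amounts to computing pairwise association measures between series drawn from previously unconnected data sets. The sketch below is illustrative only: it computes a Pearson correlation coefficient in plain Python over two hypothetical, made-up sensor series (the variable names and values are assumptions, not data from any project mentioned above).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical readings from two unrelated monitoring networks.
temperature = [14.2, 15.1, 16.0, 15.4, 17.3, 18.1]
traffic = [310, 335, 360, 348, 395, 420]
print(pearson(temperature, traffic))
```

In practice such comparisons run across millions of series at once, which is precisely where the computational and network e-infrastructures described in the following sections come in.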

It is ever more common that extreme specialism is needed in order to push the envelope of our current knowledge in specific disciplines. Research is typically undertaken by cross-disciplinary teams of specialists, both academics and technologists. New communication technologies allow these teams to work together even though they may be as dispersed geographically as they are in their specialist areas.

6.2 Research E-infrastructure

It may come as no surprise that this potential comes with significant challenges. Whilst science and innovation are recognised as ‘key drivers of economic growth and the right science and innovation policy is essential to strengthen and rebalance the economy for the future’ (HM Treasury (2013)) and essential to addressing major global challenges facing society in terms of health, energy, food, climate and ecology (Societal Challenges, n.d.), they are not immune to the efficiency demands all industries are facing. Pressures to be both financially efficient and carbon efficient require expensive resources, such as large-scale high-performance computing (HPC) facilities, to be shared across communities. It is not practical or economic to duplicate large-scale scientific facilities such as the Square Kilometre Array1 and the Large Hadron Collider,2 or to recreate and recollect large-scale scientific data sets such as those held by the European Bioinformatics Institute,3 so these need to be accessible and usable by geographically disparate teams. These disciplinarily and geographically diverse teams need access to the same resources in order to work effectively, regardless of where they are located and regardless of which organisations (both academic and commercial) they work for. A common IT infrastructure or ‘e-infrastructure’ is needed to underpin this, and these have emerged at both the national and the international level.

A vision of distributed computing which is at least as old as the Internet is one where the network itself is the computer – you can log onto the network with any device and have automatic access to computational resources and data storage devices anywhere on the network. This was the vision behind grid computing4 and more recently cloud computing.5 It is this vision which drives e-infrastructure. Broadly speaking, e-infrastructure can be broken into a number of key components: software; computation resources to run the software; storage for data which is both the input and the output of the software; the underlying network and communications to allow access to these resources; access and identity management to allow control over who has access to what; and the skills needed to use these flexible distributed ‘computers’.

6.3 Network E-infrastructure

Within Europe, at the network level, GÉANT6 connects over 50 million users at 10,000 institutions, at speeds of up to 500 Gbps, whilst in the UK, Janet (http://www.jisc.ac.uk/janet) ensures researchers are linked both nationally and internationally at speeds over 100 Gbps. Janet Reach7 allows academic/industrial partnerships access to these high-bandwidth research networks, whilst Aurora8 allows the network research community to test and develop the next generation of network technologies and services. Even today, it is possible to layer dedicated point-to-point networks across the network infrastructure, providing dedicated and secure high-speed links with high quality of service and low latencies. Looking forward, ‘Software Defined Networks’ (Software Defined Networking, n.d.) offer a future of ‘virtual networks’ where you can configure what appears and behaves as dedicated fibre cables between distributed storage and compute resources, even though this is delivered via the existing networks. In addition, GÉANT is building upon its existing access and identity management services, such as eduroam,9 which facilitates access to wireless networks on campuses around the world, and eduGAIN,10 which provides interoperation between national digital identity federations, to support the necessary e-infrastructure to ensure secure and trusted access to research facilities and resources (Hudson (2014)).

6.4 Computational E-infrastructure

Computational e-infrastructures can typically be divided into HPC and high-throughput computing (HTC). HTC lends itself well to situations where a problem can be divided into smaller independent tasks, for example, if you can split a data set into small subsets and run the same algorithm on each subset; HPC is required when, although you can divide a problem into tasks which can run in parallel, those tasks need to communicate between themselves. HTC can be easily provisioned using cloud computing, whilst HPC requires dedicated specialised hardware. Within the UK, there are a range of HPC resources, such as the Hartree Centre at the Science and Technology Facilities Council11; ARCHER, which provides access to a 1.56 Petaflop Cray XC30 supercomputer12; and five regional HPC centres (e-Infrastructure South Innovation Centre, N8 Research Partnership, ARCHIE West, HPC Midlands and Mid-Plus). As well as traditional HPC architectures, these include new architectures such as graphics processing unit (GPU) clusters which can be applied to specialised analysis. PRACE13 provides a European platform for sharing such national HPC provision internationally. As well as access to HPC resources, PRACE offers training programmes, code porting and optimisation support, and services to enable the adoption of HPC by industry, including SMEs.
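The HTC pattern described above can be sketched in a few lines. This is a minimal, local illustration only: Python's standard multiprocessing pool stands in for a distributed HTC scheduler such as those used by EGI, and the data set and analysis function are invented for the example. The key property is that the tasks never communicate with each other; results are combined only at the end, which is exactly what distinguishes HTC from HPC workloads.

```python
from multiprocessing import Pool

def analyse(chunk):
    """Run the same (hypothetical) analysis on one independent subset."""
    return sum(x * x for x in chunk)

def split(data, n):
    """Divide the data set into independent subsets of roughly equal size."""
    k = max(1, len(data) // n)
    return [data[i:i + k] for i in range(0, len(data), k)]

if __name__ == "__main__":
    data = list(range(100_000))
    chunks = split(data, 8)
    with Pool(processes=8) as pool:
        partials = pool.map(analyse, chunks)  # tasks run in isolation
    total = sum(partials)  # results combined only after all tasks finish
    print(total)
```

An HPC workload, by contrast, would require the workers to exchange intermediate values mid-computation (for example, halo exchanges in a climate simulation), which is why it needs the dedicated low-latency interconnects of specialised hardware.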

For HTC applications, EGI14 provides a global high-throughput data analysis infrastructure, linking hundreds of independent research institutes, universities and organisations delivering computing resources and high scalability, whilst Helix-Nebula15 allows innovative cloud service companies to work with major IT companies and public research organisations. Meanwhile, GÉANT engages with national cloud brokerages between National Research and Education Networks (NRENs) and commercial providers (such as Janet Cloud Brokerage16) to establish an efficient and coordinated pan-European approach to procuring cloud services by building on existing experience and supplier relationships.

6.5 Data E-infrastructure

In terms of data e-infrastructures, in addition to large- and small-scale data centres, and traditional and cloud-based data storage, there are initiatives to help discover, access and share data. EUDAT17 is a pan-European research data infrastructure addressing the full life cycle of research data: access and deposit, informal data sharing, long-term archiving, identification, discoverability and computability of both long-tail and big data. OpenAIRE18 is extending its support and guidance for researchers at the national, institutional and local level wishing to publish in OA repositories, to include support for publishing open data sets.

All these layers need to work together in order to become ‘a collaborative data and knowledge infrastructure, leveraging on international, national, regional and institutional initiatives… an entire digital environment or network, spanning countries and scientific disciplines… an evolving framework – a digital ecosystem – to find the most efficient ways to interconnect other systems: arrays of scientific instruments, specialised data centres, digital libraries, technical software, data services, high-speed networks’ (Open Infrastructures for Open Science (2012)). Even if this vision of open research e-infrastructures is achieved, it will require a radical opening up of every step of the research process – this evolution in the modus operandi of doing research is often termed ‘Open Science’ or ‘Science 2.0’ (Science 2.0: Science in Transition (2014)).

Fundamentally, the motivation is to make the entire research process more transparent, easier to share and easier to reuse. The OA initiative has made the results and conclusions of research more readily accessible, whilst open data is making significant inroads in making the data which underpins those conclusions and results equally accessible. However, the results and the underlying data are just two components of the research process: the methodology itself – what you actually did to get the results – should also be open and shared. For computational research, that methodology is captured in the software or algorithms which process the data, hence the claim made by Edward Seidel, former director of the US National Science Foundation’s Office of Cyberinfrastructure, that ‘software is the modern language of science these days’ (Zverina (2011)). As well as repositories for data sets such as FigShare,19 there are emerging repositories for both data and code such as Zenodo.20 Borrowing from the concept of data journals and data citation services such as DataCite,21 there are also emergent software journals such as the Journal of Open Research Software22 and SoftwareX,23 so that both data and software can be cited by primary research outputs.

For physical research (such as chemistry or biological experiments), turning the physical lab notebook into an electronic or digital notebook enables sharing and faster reproducibility of results. Radio frequency ID tags on apparatus, chemicals and samples, and advanced capture technologies based on video analysis (such as object recognition and motion recognition), can automate much of the process of capturing what happened in an experiment. 3D scanning and 3D printing offer the potential to share physical apparatus in digital form and to recreate the apparatus of experiments. Research Objects24 try to pull together all of these components (results, process, software, data, workflow, notebook, methodology) so that they all become first-class citizens of scholarly discourse and collaboration. Initiatives such as Force 1125 are pushing the technology to provide more innovative ways of sharing scientific information.

This ability to digitally capture and recreate experiments is not without risks to scientific verification. Flawed conclusions due to flaws in the apparatus, the software or the data may not be readily detected if those reproducing the experiment, or building upon its results, were also to duplicate those flaws. Technology may remove the often arduous task of translating the experiment into natural language, and of subsequent researchers having to translate that description back into apparatus and code, but this translation step may still have a place in checking the validity of the scientific method.

6.6 Virtual Research Environments

Above these technology layers, ‘virtual research environments’ (VREs) provide the ‘user interface’ by which researchers interact with this technology: providing them with the ability to access computational and data resources across the network; providing them with tools for running software, managing simulations, analysing data and visualising the results; providing them with collaborative environments for working with other researchers and sharing results, data, code and indeed ‘Research Objects’. VREs alongside e-infrastructure and Open Science realise some of the original vision of Wulf’s Collaboratory as a ‘center without walls, in which the nation’s researchers can perform their research without regard to physical location, interacting with colleagues, accessing instrumentation, sharing data and computational resources, [and] accessing information in digital libraries’ (Wulf (1989)).

On one level, a VRE is a collection of tools tailored for a research community so that its members can concentrate on their research rather than on understanding the complexity of the underlying technology – some may be very focused on a particular research area, such as the Virtual Physiological Human Portal,26 whilst others, such as the Extreme Science and Engineering Discovery Environment (XSEDE),27 offer a generic toolkit of shared code, applications, workflows and resources. Many VREs offer visual workflow tools, so that researchers can easily construct data analysis processes via drag-and-drop interfaces. However, understanding of the underlying technologies is still required to fully optimise the use of e-infrastructure. Determining whether a particular problem is more suited to HTC or HPC architectures can be a difficult task to do manually, let alone to automate. The ideal of a researcher being able to take their desktop environment as a virtual machine and run it unaltered on the cloud to achieve performance magnitudes higher is still some way off.

VREs also provide interfaces to other researchers and research collaborations. As well as sharing resources, data, workflows and software, VREs allow researchers to collaborate in the creation of resources, by borrowing concepts and tools from (or even integrating with) social media platforms. MyExperiment,28 for example, offers an environment similar to MySpace or Facebook for researchers to publish their workflows and in silico experiments; comment on, reuse and augment existing workflows; and track the provenance of how a workflow has been modified by others. Researchers are not only using social networks to collaborate during their research, they are also using social networks to perform the research – recognising that people are as important a processing engine as high-powered computers, with many obvious cognitive advantages. Large-scale coordination of large groups to tackle a specific problem via crowdsourcing may overcome some of the inherent weaknesses and provide much higher performance than a single individual. The Polymath Blog29 shows this at the simplest level, where mathematical problems are posted and solved by both professional and amateur mathematicians collaborating worldwide. At a more sophisticated level, Galaxy Zoo,30 a project to mobilise the strength of citizen science to classify galaxies, has developed a generic toolkit, Zooniverse,31 which is now used in climate science, digital humanities, biology and physics, so that crowdsourcing can become just another computational resource alongside HPC and HTC.

There are obvious risks in citizen science: it is vulnerable to malicious interference, and it is prone to feedback loops, where people assert ‘facts’ purely on the basis of existing consensus, which increases the apparent consensus and reduces the inclination to challenge it. The wisdom of crowds can very easily become the foolishness of crowds, as demonstrated by the wealth of urban myths which propagate across social networks. It is not surprising that research into the limitations, strengths and behaviour of such social machines (Smart, Simperl & Shadbolt (2014)) is a branch of investigation and research in its own right.

6.7 Virtual Communication

VREs such as the above primarily support asynchronous communication between teams and collaborations – a correspondence model where a response is not immediately expected or required. However, synchronous communications are also important, particularly as research teams grow larger and more multidisciplinary. Physical meetings can be time consuming and expensive when teams are geographically distributed. Technology provides tantalising glimpses of how such meetings could be conducted virtually. High-definition and super-high-definition video provides a more rewarding experience, almost equivalent to meeting in person, when compared to the blurred postage stamp picture from the early days of video conferencing. Improved bandwidth and low latency networks permit video and audio to be transmitted almost instantaneously.

In practice, videoconferencing is still beset by problems in terms of audio and video quality and failure to connect. Some of these problems are self-inflicted. Videoconferencing can work well when accessed from dedicated rooms with dedicated networks and well-positioned cameras and microphones. In a world where the Internet is ubiquitous and you can control both your home appliances and your large-scale computation simulations from a smartphone, many expect to be able to videoconference on the move and in busy locations from mobile devices and this can result in less than ideal video conference experiences.

The biggest challenge for large-scale collaborations is the handling of large numbers of participants in video conferences – unless the technology is bulletproof, the chances of at least one person having trouble increase with each additional participant. Academic videoconference support services such as V-Scene32 in the UK, and eduCONF33 at the international level, strive to bridge between aspiration and reality. Meanwhile, the technology is streaming ahead: motion tracking and augmented reality ‘holographic’ headsets such as Meta SpaceGlasses34 or the very recently announced Microsoft HoloLens35 bring the science fiction vision of a conference table attended by virtual holograms of the participants even closer to science fact.

6.8 Conclusion

Whilst we see increased collaboration enabled by technology, technology also raises barriers. One such barrier is the ability of researchers to use these technologies effectively, either because they lack the knowledge or skills to adapt in an ever-changing digital world, or because they lack access to the specialised technical support required. Another barrier is the lack of interoperability of the various emerging and evolving technologies and e-infrastructures. We are seeing moves both within Europe through Horizon 2020 and globally through initiatives such as Open Science Commons36 and the Research Data Alliance37 to address these barriers, but there is still much to be done in terms of both technology and the capability to exploit that technology if we are to bridge the gap between the visions that technological advances tempt us with and the reality.
