Roy H. Campbell
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
The adoption of cloud computing by the U.S. government, including the Department of Defense, is proceeding quickly [1–4] and is likely to become widespread [5]. As government becomes more comfortable with the technology, mission-oriented cloud computing seems inevitable. However, security remains a top concern in the use of clouds for dependable and trustworthy computing [6], even as FedRAMP [7] and other standards converge to a common set of requirements, as discussed in Chapter 8. The cloud computing environment is maturing, but we are observing the rise of new aspects of cloud computing – such as mobile devices interconnected with clouds, real-time concerns, edge computing, and machine learning – that are challenging the existing techniques for testing, validation, verification, robustness, and resistance to attack. As reflected in this book, academia and industry are attempting to respond quickly to rapidly changing cloud technologies, as driven by the value of these technologies in today's society.
The preceding chapters of this book have touched on many of the concerns arising from cloud technology: survivability, risks, benefits, detection, security, scalability, workloads, performance, resource management, validation and verification, theoretical problems, and certification. In this final chapter, we will consider what has been learned since 2007 and what issues and obstacles remain in any mission-critical system deployment on cloud computing.
Cloud computing systems as a cyberinfrastructure supporting mission-oriented tasks must survive long enough for the missions to accomplish their goals. Any number of challenges face cloud computing survivability, including the unpredictability of technological advances. In Chapter 2, we focused on design, formal modeling, and validation, giving cloud storage systems as an example. Without excellent requirements and design, survivability is problematic. (The security and dependability aspects of survivability were addressed in separate chapters.) Key to cloud computing survivability is the reality that both the infrastructure and applications are built on distributed computing. Survivability for the cloud requires correct requirements and design of systems in which the fundamental concerns include parallelism, communications, distributed algorithms, and undecidability.
Maude and its real-time extensions were used to formally specify and analyze the correctness and performance of Apache Cassandra, Megastore, and ZooKeeper (all industrial systems) as well as RAMP (an academic system). The approach was also used to design and formalize significant extensions of these systems (a variant of Cassandra, Megastore-CGC, a key management system on top of ZooKeeper, and variations of RAMP), building confidence that clouds can be formalized and designed with assurance that they satisfy desired correctness and performance properties. Furthermore, in the case of Cassandra, we compared the performance estimates provided by the formal model with the performance actually observed when running the real Cassandra code on representative workloads; they differed by only 10–15%.
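To give a concrete flavor of this kind of analysis, the following is a minimal sketch, in Python rather than Maude, of explicit-state exploration of a toy two-replica key-value store. The state space, the convergence property, and all names are illustrative assumptions and are not taken from the models described above.

```python
# Minimal sketch (assumption: an illustrative toy model, not the Maude models of Chapter 2).
# Explicit-state exploration of a two-replica key-value store in which a client's write
# is applied at one replica and propagated asynchronously; we check that every terminal
# state (no propagation pending) is convergent, i.e. both replicas agree.
from collections import deque

def successors(state):
    """state = (replica_a, replica_b, pending), where pending is a tuple of (target, value)."""
    a, b, pending = state
    succs = []
    for i, (target, value) in enumerate(pending):
        rest = pending[:i] + pending[i + 1:]
        if target == "a":
            succs.append((value, b, rest))      # apply pending update at replica A
        else:
            succs.append((a, value, rest))      # apply pending update at replica B
    return succs

def check_convergence(initial):
    """Breadth-first search of reachable states; report any terminal, divergent state."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        state = frontier.popleft()
        succs = successors(state)
        if not succs:                           # terminal: no pending propagation
            a, b, _ = state
            if a != b:
                return False, state             # counterexample to convergence
        for nxt in succs:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True, None

# A write of "v1" accepted at replica A and still pending at replica B.
ok, counterexample = check_convergence(("v1", "v0", (("b", "v1"),)))
print("converges in all terminal states:", ok)
```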
The abovementioned findings represent the first published work on the use of formal methods to model and analyze such a wide swathe of industrial cloud storage systems and demonstrate that these distributed, real-time cloud systems are amenable to formal description and analysis. Since many of the faults, failures, and security breaches of systems are caused by human error in requirements and design, the work paves the way for cloud computing assurance by showing that it is feasible to verify and validate cloud computing support for missions. Such formal studies should be made of cloud computing environments and their applications to help assure the missions using them.
Cloud computing infrastructure for mission-oriented tasks allows many new avenues of attack, and some of the opportunities for launching an attack remain obscure until an attack occurs. Cloud service providers may mitigate vulnerabilities and repel attempted attacks, but providing an assured cloud computing environment continues to be a struggle. Cloud users may build applications with the goal of ensuring secure services, while others may have different objectives requiring less security. Risk assessment for an application depends on many factors, such as the security of the provider's services, the user's applications, and the values and costs associated with security breaches. In a multitenancy environment, risk is also associated with issues of externality: Can an attacker find a way to compromise a cloud application by using some less secure component running on that cloud, from which an attack could be more easily launched and would be more likely to succeed against the target cloud application? Clearly, when left to choose a cloud, cloud users will move their applications to clouds that reduce their risk, even while they improve their applications' security provisions. However, a cloud that attracts many applications with high-risk assets, even with improved application security, would become a high-value target for attack. Chapter 3 presented one of the first theoretical approaches that uses game theory to evaluate such risk externality, weighing the benefits of using a cloud computing environment against the risks that arise from the cloud environment itself, other cloud users, and successful attacks.
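As a rough illustration of how such risk externality can be cast as a game, the following is a minimal sketch in which two tenants sharing a cloud choose security levels and the shared breach probability is driven by the weaker tenant. The payoffs, probabilities, and tenant names are illustrative assumptions, not the model of Chapter 3.

```python
# Minimal sketch (assumption: illustrative payoffs, not the actual game of Chapter 3).
# Two tenants share a cloud and independently choose a security level; the probability
# that an attack on the shared cloud succeeds depends on the *weakest* tenant, which is
# the externality: one tenant's under-investment raises the other tenant's risk.
import itertools

SECURITY_COST = {"low": 1.0, "high": 3.0}        # cost of the chosen security level
BREACH_PROB   = {"low": 0.6, "high": 0.1}        # per-tenant breach probability
ASSET_VALUE   = {"tenant1": 10.0, "tenant2": 10.0}

def payoff(choices):
    """Expected payoff of each tenant; shared-cloud breach probability is the max of the two."""
    shared_breach = max(BREACH_PROB[c] for c in choices.values())
    return {t: -SECURITY_COST[c] - shared_breach * ASSET_VALUE[t]
            for t, c in choices.items()}

def is_nash_equilibrium(choices):
    """No tenant can improve its own payoff by unilaterally changing its security level."""
    base = payoff(choices)
    for tenant in choices:
        for alt in SECURITY_COST:
            if alt != choices[tenant]:
                deviated = dict(choices, **{tenant: alt})
                if payoff(deviated)[tenant] > base[tenant]:
                    return False
    return True

for combo in itertools.product(SECURITY_COST, repeat=2):
    choices = {"tenant1": combo[0], "tenant2": combo[1]}
    print(choices, payoff(choices), "Nash" if is_nash_equilibrium(choices) else "")
```

In this toy game both tenants investing heavily and both under-investing are equilibria, which hints at why externalities complicate per-tenant risk assessment.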
As cloud providers add functionality to their services, cloud users can create systems with higher value and benefit, including applications that implement mission-oriented tasks. Understanding how to assess the risk posed by the externalities of the interactions within multitenancy clouds becomes increasingly more critical. Hence, we believe that applying game theory to the problems of risk, attack, and security in cloud systems is an increasingly important concern. Mission-oriented tasks also raise many new game-theoretic problems in clouds, including ones related to geo-distributed clouds, applications that integrate mobile devices with clouds, mobile cloud computing, large-scale sensor systems generating data for clouds, edge computing for clouds, transport and transmission of encrypted data, use of blue (military) or gray (commercial) networks with clouds, active security, and security response systems. Game theory may also be a useful tool in understanding various trade-offs among authorization, authentication, security policies, and insider attacks.
Many of the key topics in security provisions, detection, monitoring, and access control for assured cloud computing are addressed in Chapter 4. Although the problems here are open-ended – because of both the nature of the technologies and the increasing sophistication of attacks – clear themes are emerging in our own research and that of others. While it is not possible to prove that a system is secure, layered security provisions that force attacks to have multiple stages improve the defender's likelihood of success and help ensure detection, response, and recovery. We discussed examples that highlighted active and passive monitoring as a way to provide situational awareness about a system and users' state and behavior; automated reasoning about system/application state based on observations from monitoring tools; coordination of monitoring and system activities to provide a robust response to accidental failures and malicious attacks; and use of smart access control methods to reduce the attack surface and limit the likelihood of unauthorized access to the system.
Scalable cloud resources allow more flexible cloud computing. However, as attack and failure modes have increased in their complexity and impact, the effort to protect flexible cloud computing has also increased. In Chapter 4, we explained how virtual machine monitoring plays an essential role in achieving resiliency. Virtual monitoring can be integrated with Trusted Platform Module hardware [8] to build resilient and resistant monitoring solutions. However, existing virtual monitoring systems are not a panacea, as multiple operating system versions and requirements have added complexity.
Recognizing common operating system design patterns is suggested as a way to infer monitoring parameters from a guest operating system. The patterns and parameters might be extracted directly from a cloud user's guest operating system. However, monitoring can introduce performance overhead. Further, virtual machine monitoring systems may require setup and configuration as part of the boot process, or modification of guest operating system internals. In our approach, monitoring probes using low-level hardware traps (hprobes) may be inserted dynamically in response to the changing nature of possible attacks. The hprobe framework is characterized by its simplicity, dynamism, and ability to perform application-level monitoring. The prototype for this framework uses hardware-assisted virtualization and satisfies protection requirements presented in the literature. Compared to past work, the simplicity with which the detectors can be implemented and inserted/removed at runtime allows monitoring solutions to be developed quickly.
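The following is a conceptual sketch, in Python, of the dynamism this approach provides; the probe registry, trap names, and detector logic are illustrative assumptions standing in for the actual hardware-assisted mechanism, not the hprobe implementation itself.

```python
# Conceptual sketch only (assumption: Python stands in for the hypervisor-level hprobe
# framework; real hprobes use hardware-assisted virtualization traps, not this registry).
# The point illustrated is dynamism: detectors can be inserted and removed at runtime
# without rebooting or reconfiguring the monitored guest.

class ProbeRegistry:
    """Maps a trapped guest event to detector callbacks."""
    def __init__(self):
        self.probes = {}

    def insert(self, trap_point, detector):
        self.probes.setdefault(trap_point, []).append(detector)

    def remove(self, trap_point, detector):
        self.probes.get(trap_point, []).remove(detector)

    def on_trap(self, trap_point, guest_state):
        # Called when the (hypothetical) hypervisor traps guest execution at trap_point.
        for detector in self.probes.get(trap_point, []):
            detector(guest_state)

def return_to_user_detector(guest_state):
    # Hypothetical check: returning to user space while still holding kernel privileges.
    if guest_state["returning_to_user"] and guest_state["privilege_level"] == "kernel":
        print("ALERT: possible return-to-user attack")

registry = ProbeRegistry()
registry.insert("sysret", return_to_user_detector)                 # inserted at runtime
registry.on_trap("sysret", {"returning_to_user": True, "privilege_level": "kernel"})
registry.remove("sysret", return_to_user_detector)                 # removed at runtime
```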
As a proof of concept, some virtual machine monitors have been built through application of this technique, including a return-to-user attack detector and a process-based keylogger detector. Extending this approach to other behaviors that might be used in attacks will be an interesting direction for future research. Best practices suggest that robust and efficient monitoring and protection require formal and experimental validation. Further research is needed on validation frameworks that integrate the use of tools such as model checkers (for formal analysis and symbolic execution of software) and fault/attack injectors (for experimental assessment).
An additional security concern arises because of multitenancy and the use of hypervisor-supported virtual machines in cloud architectures. Side-channel attacks, in which cache behavior is used to determine information about the nature of the processing within a virtual machine, have been shown to reduce the time needed to deduce encryption keys or to identify the processing of algorithms of interest to an attacker, such as monitoring.
Chapter 4 discussed hypervisor introspection as a technique to determine the presence of and evade a passive virtual machine introspection monitoring system through a timing side-channel. Through hypervisor introspection, hypervisor activity was shown to be not perfectly isolated from the guest virtual machine. In addition, an example of an insider threat attack model was shown that utilizes hypervisor introspection to hide malicious activity from a realistic, passive virtual machine introspection system. Some inherent weaknesses of passive virtual machine introspection monitoring can be avoided by using active virtual machine introspection monitoring.
A Bayesian network modeling approach was described and used to detect compromised users in a shared computing infrastructure. The approach was validated using real incident data collected over 3 years at the National Center for Supercomputing Applications (NCSA). The results demonstrate that the Bayesian network approach is a valuable strategy for driving the investigative efforts of security personnel. Furthermore, it was able to significantly reduce the number of false positives (by 80%, with respect to the analyzed data). However, deficiencies in the underlying monitoring tools could limit the effectiveness of the approach.
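As a rough illustration of how evidence from several monitors can be combined, the following is a minimal naive-Bayes-style sketch; the alert types, probabilities, and ranking logic are illustrative assumptions, not the NCSA model.

```python
# Minimal sketch (assumption: made-up alert types and probabilities, not the NCSA model).
# A naive-Bayes-style combination of independent monitor alerts yields a posterior
# probability that a given user account is compromised, which can be used to rank
# accounts for investigation and to suppress likely false positives.
PRIOR_COMPROMISED = 0.01
# P(alert | compromised), P(alert | benign) for each (hypothetical) monitor signal.
LIKELIHOODS = {
    "anomalous_login_time": (0.7, 0.05),
    "known_bad_ip":         (0.6, 0.01),
    "unusual_command":      (0.5, 0.10),
}

def posterior_compromised(observed_alerts):
    """P(compromised | observed alerts), assuming conditional independence of alerts."""
    p_c, p_b = PRIOR_COMPROMISED, 1.0 - PRIOR_COMPROMISED
    for alert, present in observed_alerts.items():
        like_c, like_b = LIKELIHOODS[alert]
        p_c *= like_c if present else (1.0 - like_c)
        p_b *= like_b if present else (1.0 - like_b)
    return p_c / (p_c + p_b)

alerts = {"anomalous_login_time": True, "known_bad_ip": True, "unusual_command": False}
print(f"posterior probability of compromise: {posterior_compromised(alerts):.3f}")
```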
Access control plays an essential role in preventing potentially malicious actors from entering a system. Role-based access control (RBAC) is a popular access control scheme but has weaknesses: a pure RBAC system lacks the flexibility to adapt efficiently to changing users, objects, and security policies. In particular, it is time-consuming to make and maintain manual user-to-role assignments and role-to-permission assignments in a cloud that might have a large number of users and/or security objects. One solution to this problem is to combine attribute-based access control (ABAC) with RBAC, bringing together the advantages of both models. We developed our model in two levels: aboveground and underground. A simple and standard RBAC model is extended with environment constraints, which retains the simplicity of RBAC and supports straightforward security administration and review. Attribute-based policies are used to create the simple RBAC behavior automatically. The attribute-based policies bring the advantages of ABAC: They are easy to build and easy to change for a dynamic application. The approach can be applied to RBAC system design for large-scale Internet cloud system applications.
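The following is a minimal sketch of the two-level idea, in which "underground" attribute-based policies derive role assignments that a simple "aboveground" RBAC layer, extended with an environment constraint, then enforces. The attributes, roles, and rules are illustrative assumptions rather than the chapter's full model.

```python
# Minimal sketch (assumption: illustrative attributes and rules, not the chapter's full model).
# "Underground" attribute-based policies automatically derive the user-to-role assignments
# that the simple "aboveground" RBAC layer then enforces, so administrators review roles
# rather than maintaining thousands of manual assignments.

ROLE_PERMISSIONS = {
    "analyst":  {"read:telemetry"},
    "operator": {"read:telemetry", "write:config"},
}

# Attribute-based assignment policies: predicate over user attributes -> role.
ASSIGNMENT_POLICIES = [
    (lambda u: u["department"] == "ops" and u["clearance"] >= 2, "operator"),
    (lambda u: u["department"] in ("ops", "research"),           "analyst"),
]

def derive_roles(user_attrs):
    """Compute the user's roles from attributes instead of manual assignment."""
    return {role for predicate, role in ASSIGNMENT_POLICIES if predicate(user_attrs)}

def check_access(user_attrs, permission, environment):
    """Aboveground RBAC check, extended with a simple environment constraint."""
    if not environment.get("inside_maintenance_window", True) and permission.startswith("write:"):
        return False                               # environment constraint on writes
    roles = derive_roles(user_attrs)
    return any(permission in ROLE_PERMISSIONS[r] for r in roles)

alice = {"department": "ops", "clearance": 3}
print(check_access(alice, "write:config", {"inside_maintenance_window": True}))   # True
print(check_access(alice, "write:config", {"inside_maintenance_window": False}))  # False
```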
Clearly, much work remains to be done in detection and security in cloud computing. However, recent work shows progress toward making intrusions in cloud infrastructure more difficult for the attacker. Virtual machines provide some measure of isolation, and implementations of the hypervisors that enable virtual machines are becoming more secure. Container technologies [8] are now often used in cloud computing. Isolation and security for container technologies are a little more difficult than for virtual machines. However, use of cache partitioning in recent Intel products [16] coupled with compiler and other techniques may lead to isolation and assurance for container-based cloud applications.
The work discussed in Chapter 5 was aimed at the scalability, performance, algorithms, and application workloads of assured cloud computing and addressed issues of network performance, geographic distribution, and stream processing. In cloud computing, scalability is a key parameter underlying performance. As a consequence, this chapter focused on issues related to change in size, volume, and geographical distribution. It considered the evaluation of scaling solutions by using traces and synthetic workload generators.
A key practical solution to many of these issues is to provide appropriate availability to cloud resources, including redundant resources, services, networks, file systems, and nodes. However, this raises a corresponding problem of how to offer cloud applications appropriate consistency. The analysis of the issues and results of the research have been encapsulated in prototypes or systems that exemplify many assured cloud computing problems, solutions, and tools.
Specifically, in Chapter 5 we described DARE, a distributed data replication and placement algorithm that adapts to workload; synthetic workload generation based on clustered renewal processes; Ambry, a geographically distributed blob store; and, briefly, Samza, a stream processing system.
The growth of data analytics for big data encourages the design of next-generation storage systems to handle peta- and exascale storage requirements. DARE demonstrated that a better understanding of big data workloads is becoming critical for proper design and tuning. Specifically, popularity, temporal locality, and arrival patterns were studied across several dimensions, using 6 months of file access patterns from two multipetabyte Hadoop clusters at Yahoo!. Data popularity measures both the number and intensity of accesses. The workloads were dominated by high file churn, in which most files were accessed fewer than 10 times. A small percentage of files were highly popular. Young files accounted for a high percentage of accesses but a small percentage of bytes stored. The observed request interarrivals (opens, creates, and deletes) were bursty and exhibited self-similar behavior. The files were very short-lived. From the point of view of an individual data node, the DARE algorithm quickly identifies the most popular set of data and creates replicas for this set.
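As a rough illustration of popularity-driven replication, the following sketch has a data node count recent accesses and select a small hot set for extra replicas; the thresholds and block names are illustrative assumptions, not DARE's actual policy.

```python
# Minimal sketch (assumption: illustrative thresholds, not DARE's actual policy).
# Each data node counts recent accesses per block and asks for extra replicas of the
# small "hot" set that dominates the workload, reflecting the high-churn, skewed
# popularity observed in the traces described above.
from collections import Counter

class DataNode:
    def __init__(self, hot_fraction=0.05, min_accesses=10):
        self.access_counts = Counter()
        self.hot_fraction = hot_fraction      # replicate only the top few percent
        self.min_accesses = min_accesses      # ignore the long tail of cold files

    def record_access(self, block_id):
        self.access_counts[block_id] += 1

    def blocks_to_replicate(self):
        ranked = self.access_counts.most_common()
        cutoff = max(1, int(len(ranked) * self.hot_fraction))
        return [b for b, n in ranked[:cutoff] if n >= self.min_accesses]

node = DataNode()
trace = ["b1"] * 40 + ["b2"] * 12 + ["b3"] * 3 + ["b4"] * 1   # skewed synthetic accesses
for block in trace:
    node.record_access(block)
print(node.blocks_to_replicate())   # the popular block(s) selected for extra replicas
```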
Data analytics of the behavior of big data systems suggested new algorithms and data to improve the performance of applications using Hadoop and HDFS. In general, it is difficult to obtain real traces of systems. Often, when data are available, the traces must be de-identified to be used for research. However, workload generation can often be used in simulations and real experiments to help reveal how a system reacts to variations in the load. Such experiments can be used to validate new designs, find potential bottlenecks, evaluate performance, and do capacity planning based on observed or predicted workloads.
Two important characteristics of object request streams are popularity (access counts) and temporal reference locality. For the purpose of synthetic workload generation, it is desirable to simultaneously reproduce the access counts and the request interarrivals of each individual object, as both of these dimensions can affect system performance. In Chapter 5, single-distribution approaches – which summarize the behavior of different types of objects with a single distribution per dimension – were shown not to reproduce both behaviors accurately at the same time. In particular, the common practice of collapsing the per-object interarrival distributions into a single system-wide distribution (instead of individual per-object distributions) obscures the identity of the object being accessed, thus homogenizing the otherwise distinct per-object behavior. Further, as big data applications lead to emerging workloads and these workloads keep growing in scale, the need for workload generators that can scale up the workload and/or facilitate its modification based on predicted behavior is increasingly urgent. Chapter 5 described a lightweight model that used unsupervised statistical clustering to identify groups of objects with similar behavior, and this significantly reduced the model space by modeling “types of objects” instead of individual objects. As a result, the clustered model can be suitable for synthetic generation and scaled as needed. The synthetic trace generator used this approach, which was evaluated across several dimensions. Using a big data storage workload from Yahoo!, we validated the approach by demonstrating its ability to approximate the original request interarrivals and popularity distributions.
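The following is a minimal sketch of the clustered-model idea: objects are grouped by simple features (access count and mean interarrival), and a synthetic trace is then generated per cluster rather than per object. The features, cluster count, and exponential interarrivals are illustrative assumptions, not the chapter's model.

```python
# Minimal sketch (assumption: illustrative features and parameters, not the chapter's model).
# Cluster objects by (access count, mean interarrival) so that each *type* of object is
# modeled once, then generate a synthetic trace by sampling interarrivals per object from
# its cluster's parameters.
import random
import numpy as np
from sklearn.cluster import KMeans

# Per-object features from a (hypothetical) trace: [access_count, mean_interarrival_s].
objects = {f"obj{i}": [random.randint(1, 200), random.uniform(0.1, 60.0)] for i in range(300)}

names = list(objects)
features = np.array([objects[n] for n in names])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Summarize each cluster by the mean of its members' features ("types of objects").
cluster_params = {c: features[labels == c].mean(axis=0) for c in set(labels)}

def synthesize(name, cluster, horizon=300.0):
    """Generate request times for one object from its cluster's mean interarrival."""
    _, mean_gap = cluster_params[cluster]
    t, events = 0.0, []
    while t < horizon:
        t += random.expovariate(1.0 / mean_gap)   # exponential interarrival sketch
        events.append((t, name))
    return events

trace = sorted(ev for name, c in zip(names, labels) for ev in synthesize(name, c))
print(len(trace), "synthetic requests generated")
```

Because only the cluster parameters need to be stored, the model can be scaled up (more objects per cluster) or modified (shifted cluster parameters) without re-deriving per-object distributions.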
New applications for clouds, such as social networks and file sharing (e.g., LinkedIn, Facebook, and YouTube), demonstrate a need for low-latency worldwide services. High fault tolerance and reliability for these systems necessitate geo-distributed storage with replicas in multiple locations. Chapter 5 discusses the design trade-offs of Ambry, a scalable geo-distributed object store. For several years, Ambry has been the primary storage system for all of LinkedIn's media objects across all four of its data centers, serving more than 450 million users. The experimental results show that Ambry reaches high throughput and low latency, works efficiently across multiple geo-distributed data centers, and reduces load imbalance among disks while moving minimal data. The chapter also mentioned a collaboration with LinkedIn to develop Samza, a scalable large-state stream-processing system. The system is designed to handle very large state in stream-processing systems, which enables large joins and aggregations over streams. Samza's large-state handling can achieve two orders of magnitude better performance than the traditional way of handling state. Further, failure recovery can proceed in parallel and takes almost constant time irrespective of the number of failures. Overhead for failure recovery is reduced by avoiding state rebuilds as much as possible. Overall, the mechanism achieves low latency, high throughput, and almost constant failure recovery time. Samza is currently in production at LinkedIn, and its code is open source.
Future research is needed to address the data analytics and machine learning innovations now occurring. The reliability, robustness, accuracy, and performance of cloud-based large machine learning computations are likely to become a major issue in cloud computing. In the few months prior to the time of this writing, production implementations of TensorFlow and other deep-learning systems have come online at Google, Amazon, and Microsoft. The models from these systems are used for inferencing in both clouds and local devices such as cell phones. These systems, when coupled with edge learning systems or smartphones, form complex distributed learning systems that require performance analysis and evaluation. Ubiquitous sensors, autonomous vehicles that exchange state information about traffic conditions, and a host of close-to-real-time and health applications continue to expand the boundaries of cloud computing. The techniques discussed in this chapter, including measurement, modeling, and optimization based on performance, will govern the design of such systems and contribute to assured cloud computing.
Building assured cloud computing applications and services that perform predictably remains one of the biggest challenges today, both in mission-critical scenarios and in non-real-time scenarios. The work outlined in Chapter 6 has made deep inroads toward solving key issues in this area. Specifically, the work described constituted the starting steps toward realization of a truly autonomous and self-aware cloud system for which the mission team merely needs to specify SLAs/SLOs (service-level agreements and objectives), and the system then reconfigures itself automatically and continuously over the lifetime of the mission to ensure that these requirements are always met. This chapter described some key design techniques and algorithms, outlined designs and implementation details, and touched on key experimental results. The experimental results were obtained by deploying both the original system(s) and modified systems on real clusters and subjecting them to real workloads.
We overviewed five systems that are oriented toward offering performance assuredness in cloud computing frameworks, even while the system is under change. They include Morphus, which supports reconfigurations in sharded distributed NoSQL databases/storage systems; Parqua, which supports reconfigurations in distributed ring-based key-value stores; Stela, which supports scale-out/in in distributed stream processing systems; an unnamed system to support scale-out/in in distributed graph processing systems; and Natjam, which supports priorities and deadlines for jobs in batch processing systems.
For each system, the motivations, design, implementation, and experimental results were presented. Our systems have been implemented in popular open-source cloud computing frameworks, including MongoDB (Morphus), Cassandra (Parqua), Storm (Stela), LFGraph, and Hadoop (Natjam). We described multiple new approaches to attacking different pieces of the broad problem and incorporated performance assuredness into cloud storage/database systems, stream processing systems, and batch processing systems.
Specifically, the Morphus system supports reconfigurations in NoSQL distributed storage/database systems, such as a shard key change in sharded databases like MongoDB. New chunks are placed optimally by using bipartite graph matching, which minimizes the volume of data transferred over the network. Morphus supports concurrent migration of data, processing of foreground queries, and replay of logged writes received while reconfiguration was in progress. The Parqua system extended the Morphus design to ring-based key-value stores; the implementation is in Apache Cassandra, the most popular key-value store in industry today.
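As a rough illustration of the placement step, the following sketch uses minimum-cost bipartite matching (the Hungarian algorithm, via SciPy) to assign new chunks to servers so that the bytes moved are minimized; the cost matrix and sizes are illustrative assumptions, not Morphus's exact formulation.

```python
# Minimal sketch (assumption: illustrative cost matrix, not Morphus's exact formulation).
# After a shard-key change, each new chunk must be placed on some server; the cost of an
# assignment is the number of bytes that would have to move because they are not already
# resident on that server. Minimum-cost bipartite matching picks the placement that
# minimizes total network transfer.
import numpy as np
from scipy.optimize import linear_sum_assignment

# bytes_already_local[i][j] = bytes of new chunk i that server j already stores locally.
bytes_already_local = np.array([
    [80, 10,  5],
    [20, 70, 15],
    [ 5, 25, 90],
])
chunk_sizes = np.array([100, 100, 100])

# Cost of placing chunk i on server j = bytes that must be transferred to server j.
cost = chunk_sizes[:, None] - bytes_already_local

chunk_idx, server_idx = linear_sum_assignment(cost)
for c, s in zip(chunk_idx, server_idx):
    print(f"place chunk {c} on server {s}, moving {cost[c, s]} bytes")
print("total bytes moved:", cost[chunk_idx, server_idx].sum())
```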
The design of NoSQL storage, database systems, and key value stores will continue to evolve as more performance studies characterize new and different applications and specific storage device concerns. The design of such stores must balance competing performance goals, including low latency for searches, high write throughput, high read throughput, concurrency, low write amplification (in solid-state drives), and reconfiguration overhead. Devices such as phase-change memory and memristors offer new opportunities for research and study.
The Stela system supports automated scale-out/in of distributed stream processing systems. The implementation uses Apache Storm, the most popular stream-processing system in industry today. The congestion levels of operators in the stream-processing job (which is a DAG of operators) are identified. For scale-out, more resources are provided to the most congested operators, while for scale-in, resources are removed from the least congested operators. The changes occur in the background without affecting ongoing processing at other operators in the job.
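The following is a minimal sketch of congestion-driven scale-out/in; the congestion metric here is a simple input-rate to processing-rate ratio, which is an illustrative assumption rather than Stela's actual metric.

```python
# Minimal sketch (assumption: a simple input-rate/processing-rate ratio, not Stela's
# actual metric). On scale-out, new executors go to the most congested operators; on
# scale-in, executors are reclaimed from the least congested ones, leaving the rest of
# the running job untouched.

operators = {
    # name: (tuples/s arriving at the operator, tuples/s its current executors can process)
    "parse":     (10_000, 12_000),
    "join":      (9_000,   5_000),
    "aggregate": (4_500,   4_000),
    "sink":      (4_000,   8_000),
}

def congestion(op):
    arriving, capacity = operators[op]
    return arriving / capacity            # > 1 means the operator cannot keep up

def scale_out(new_executors):
    congested = [op for op in sorted(operators, key=congestion, reverse=True)
                 if congestion(op) > 1.0] or list(operators)
    # Round-robin the new executors over the most congested operators.
    return [congested[i % len(congested)] for i in range(new_executors)]

def scale_in(executors_to_remove):
    # Reclaim executors from the least congested operators first.
    return sorted(operators, key=congestion)[:executors_to_remove]

print("give new executors to:", scale_out(3))   # ['join', 'aggregate', 'join']
print("remove executors from:", scale_in(1))    # ['sink']
```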
Trends in data analytics show increasing adoption of low-latency, high-throughput stream processing [9–11]. Stream processing is based mainly either on record-at-a-time processing or on bulk synchronous processing of batched records, and many optimizations apply to both approaches [12]. Providing mechanisms to scale out/in the building blocks of such stream processing in systems that support SLAs/SLOs for multitenant clusters will increase the rapidity of their adoption and make them more readily available for assured cloud computing.
To demonstrate graph processing elasticity, scale-out/in facilities were designed into LFGraph. The approach repartitions the vertices of the graphs (e.g., a Web graph or Facebook-style social network) among the remaining servers, so as to minimize the amount of data moved and thus the time to reconfigure. Migration occurs in the background, while the iterative computation proceeds normally. The Natjam system incorporates support for job priorities and deadlines (i.e., SLAs/SLOs) in Apache YARN. The implementation is in Apache Hadoop, the most popular distributed batch processing system in industry today. Individual tasks are checkpointed so that if one is preempted by a task of a higher-priority (or earlier-deadline) job, it is resumed from where it left off (thus avoiding wasted work). Scheduling policies for both job-level and task-level eviction decide which running components are victimized when a more important job arrives.
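As a rough illustration of checkpoint-and-preempt behavior, the following sketch suspends a lower-priority task, preserving its progress, when a more important job arrives on a full cluster; the classes and the lowest-priority-first eviction policy are illustrative assumptions, not Natjam's actual schedulers.

```python
# Minimal sketch (assumption: illustrative classes and eviction policy, not Natjam's
# actual schedulers). A preempted task is checkpointed so that it later resumes from
# where it left off instead of restarting, which avoids wasted work.

class Task:
    def __init__(self, job, priority, total_work):
        self.job, self.priority = job, priority   # lower number = more important
        self.total_work, self.done = total_work, 0

    def run(self, budget):
        self.done = min(self.total_work, self.done + budget)

class Cluster:
    def __init__(self, slots):
        self.slots, self.running, self.suspended = slots, [], []

    def submit(self, task):
        if len(self.running) < self.slots:
            self.running.append(task)
            return
        victim = max(self.running, key=lambda t: t.priority)   # least important job
        if victim.priority > task.priority:
            self.running.remove(victim)
            self.suspended.append(victim)       # checkpoint: victim.done is preserved
            print(f"preempt {victim.job} at {victim.done}/{victim.total_work} units")
            self.running.append(task)
        else:
            self.suspended.append(task)         # no room; queue the newcomer instead

cluster = Cluster(slots=1)
batch = Task("research-batch", priority=5, total_work=100)
cluster.submit(batch)
batch.run(40)                                   # 40 units of work completed so far
cluster.submit(Task("production-deadline", priority=1, total_work=20))
# Later, the resumed batch task continues from 40/100 rather than from 0.
```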
Assured cloud systems would benefit from using declarative ways of specifying requirements from users and developers. SLAs/SLOs should be standardized. Further work is needed on (i) extending the richness of these SLAs/SLOs while still keeping them user-facing and away from the innards of the system, and (ii) extending the notion of such requirements to other emerging areas of distributed systems, such as distributed machine learning, for example, through tools such as TensorFlow, PyTorch, or Caffe [13].
Mobile devices and smart sensors are deployed in enormous numbers and are relatively inexpensive to scale up incrementally, but they have limited computational resources (memory, processing capability, and energy). At the same time, cloud computing provides elastic on-demand access to virtually unlimited resources at an affordable price. Integrating mobile devices, smart sensors, and elastic on-demand clouds provides new functionality and quality of service for mobile users. Such an integrated system is referred to here as a mobile cloud.
Chapter 7 considered these hybrid mobile clouds using an actor-based approach to programming. By using actors, we explored functionality and efficiency while enforcing security and privacy. Two key ideas were studied. First, the fine granularity of actor units of computation allows agility in a mobile cloud by facilitating migration of components. Migration occurs in response to a system context (including dynamic variables such as available bandwidth, processing power, and energy) while respecting constraints on information containment boundaries. Second, information flow between actors can be observed and suspicious activity flagged or prevented by specifying constraints on actor interaction patterns. Our approach facilitates a “holistic” form of assured cloud computing, which we have realized in a prototype mobile hybrid cloud platform; as we discussed in the chapter, suitable formalisms can be used to capture interaction patterns to improve the safety and security of computation in the mobile cloud.
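The following is a minimal sketch of context-driven actor placement subject to an information-containment constraint; the thresholds, context variables, and actor names are illustrative assumptions, not the prototype platform described in the chapter.

```python
# Minimal sketch (assumption: illustrative thresholds and labels, not the chapter's
# prototype platform). Placement of each actor is re-evaluated as the device context
# changes, but actors holding sensitive data are pinned to the device by an
# information-containment constraint regardless of performance considerations.

class Actor:
    def __init__(self, name, cpu_demand, sensitive):
        self.name, self.cpu_demand, self.sensitive = name, cpu_demand, sensitive
        self.location = "device"

def place(actor, context):
    """Decide where the actor should run given the current device context."""
    if actor.sensitive:
        return "device"                     # containment boundary: never offload
    low_battery = context["battery"] < 0.3
    good_link = context["bandwidth_mbps"] > 5.0
    if (low_battery or actor.cpu_demand > context["device_cpu_headroom"]) and good_link:
        return "cloud"                      # offload heavy work when it pays off
    return "device"

actors = [Actor("image_classifier", cpu_demand=0.8, sensitive=False),
          Actor("health_record_cache", cpu_demand=0.1, sensitive=True)]
context = {"battery": 0.2, "bandwidth_mbps": 20.0, "device_cpu_headroom": 0.5}

for a in actors:
    new_location = place(a, context)
    if new_location != a.location:
        print(f"migrating {a.name} to {new_location}")
    a.location = new_location
```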
The actor mobile cloud framework can enable effective balancing of resource use with performance while preserving security. There is a particular concern, however, with energy management in mobile devices. Further research is needed on techniques to infer the energy consumed by a specific application on a mobile device. Monitoring techniques that infer coordination constraints and session types from the patterns of interaction in a running system are another area in which more research is needed. System assurance will be enhanced by flagging suspicious deviations as well as preventing harmful actions. Moreover, the current notations for representing session types are formal in nature and not suitable for use by system developers. A friendly interface could help programmers visualize interaction behaviors as well as reduce errors in the system. Furthermore, session types may help detect information leaks, as certain sequences may reveal more information than would a single message interaction. In addition, statistical sampling can help detect violations of quantitative coordination constraints when not just information sequences, but also the sum total of information revealed from a very large number of sources, needs to be constrained.
The challenge of addressing the impact of failures and faults on interaction patterns needs to be studied further. Such failures are not explicitly incorporated either in the language of synchronizers or in session types. Adding support for dynamic process creation will be an important direction for future work in session types for actor systems. In its current form, System-A cannot express actor creation as a behavior, and global types assume that all participants already exist. Matching a created actor with its subsequent use in a type requires an extra step that is not obvious. Furthermore, System-A omits support for session delegation and does not deal with issues of progress. Finally, it does not consider overlapping nested indexed names included in multiple operators. That omission disallows, for example, all-to-all communication.
Security standards are tools that help improve security and privacy by establishing a security baseline and supporting the implementation of privacy- and security-enhancing measures. Such standards are revised and improved over time in an effort to keep pace with technological change and the emergence of new threats, and security certifications are a widely used mechanism for demonstrating compliance with standards. In Chapter 8, we looked at how three of the most popular and highly regarded certifications used for cloud security – ISO/IEC 27001, SOC 2, and FedRAMP – have evolved over time, how they stack up in comparison to each other, and what the implications are of having multiple certification options instead of a single universally used standard.
We concluded that the three standards have important similarities and similar goals; all three are based on a set list of criteria and controls that must be assessed by independent, accredited third parties. We found a number of key differences as well. For example, as discussed in the chapter, while FedRAMP focuses specifically on cloud services offered by private outsourcers to the federal government [14, 15], ISO/IEC 27001 and SOC 2 have a more general scope; their security assessment tools are available for use by any kind of service organization. At the same time, FedRAMP requires more than twice as many controls as the other standards; its controls are more specific and detailed, while those in ISO/IEC 27001 and SOC/TSPC are more general. Likewise, SOC 2 and ISO/IEC 27001 are more flexible than FedRAMP in specifying how to implement the criteria and controls that they require.
Naturally, the value of a security standard lies in how well the approaches that it enforces actually protect systems against threats. The existence of multiple standards with notable similarities therefore raises the question of why multiple standards are needed if everyone is trying to achieve the same goal of having the best possible security.
Upon close examination, we found that the three standards are not in fact redundant; rather, they show high complementarity and compensate for each other's weaknesses and omissions. There is good reason for a cloud service provider to invest in complying with multiple standards instead of only one. Even so, given that obtaining certifications is costly, it would still be desirable for the cloud computing community to develop a single standard that offers all the protections that are currently articulated piecemeal across multiple standards.
However, another challenge to standardization is the reality that new vulnerabilities and threats are continually appearing, and new defensive countermeasures are being continually developed in response. The impact of the "Treacherous Twelve" (the Cloud Security Alliance's list of top cloud computing threats) on the effectiveness of IT security standards in cloud environments points toward a possible path to the improvement of IT security standards. Observations of threats, issues, and vulnerabilities can help cloud providers and users understand the need for new or different control measures, and connecting those observations to security standards can lead to better effectiveness, completeness, and efficiency.
The goal of locking down standardized protection methods will for the foreseeable future be in tension with the constant evolution of the threat landscape that those methods must handle. Worse, to date, the academic and industrial stakeholders have tended to study new threats more or less in isolation, and coordination with standards developers has been limited. To optimize the responsiveness of standards, the study of vulnerabilities should be more actively and closely connected to the maintenance of security standards.