This chapter covers the following topics from Domain 14 of the CSA Guidance:
• Big Data
• Internet of Things
• Mobile
• Serverless Computing
Once a new technology rolls over you, if you’re not part of the steamroller, you’re part of the road.
—Stewart Brand
In my opinion, this quote from Mr. Brand perfectly summarizes your career in information technology. There are always new developments that you need to understand to some degree. This is not to say that you need to be an expert in every new technology, but you at the very least need to understand the benefits that new technology brings.
This chapter looks at a few key technologies that, while not cloud-specific by any means, are frequently found in cloud environments. These technologies often rely on the massive amounts of available resources that can quickly (and even automatically) scale up and down to meet demand.
In preparation for your CCSK exam, remember that the mission of the CSA and its Guidance document is to help organizations determine who is responsible for choosing the best practices that should be adopted and implemented (that is, provider side or customer side) and why these controls are important. This chapter focuses on the security concerns associated with these technologies, rather than on how controls are configured in a particular vendor’s implementation.
If you are interested in learning more about one or more of these technologies, check out the Cloud Security Alliance web site for whitepapers and research regarding each of these areas.
The term “big data” refers to extremely large data sets from which you can derive valuable information. Big data can handle volumes of data that traditional data-processing tools are simply unable to manage. You can’t go to a store and buy a big data solution, and big data isn’t a single technology. It refers to a set of distributed collection, storage, and data-processing frameworks. According to Gartner, “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.”
The CSA refers to the qualities from the Gartner quote as the “Three Vs.” Let’s define those now:
• High Volume A large amount of data in terms of the number of records or attributes
• High Velocity Fast generation and processing of data (such as real-time or data stream)
• High Variety Structured, semistructured, or unstructured data
The Three Vs of big data make it a natural fit for cloud deployments because of the elasticity and massive storage capabilities available in Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) deployment models. Additionally, big data technologies can be integrated into cloud-computing applications.
Big data systems are typically associated with three common components:
• Distributed data collection This component refers to the system’s ability to ingest large volumes of data, often as streamed data. Ingested data could range from simple web clickstream analytics to scientific and sensor data. Not all big data relies on distributed or streaming data collection, but it is a core big data technology. See the “Distributed Data Collection Backgrounder” for further information on the data types mentioned.
• Distributed storage This refers to the system’s ability to store the large data sets in distributed file systems (such as Google File System, Hadoop Distributed File System, and so on) or databases (such as NoSQL). NoSQL (Not only SQL) is a nonrelational distributed and scalable database system that works well in big data scenarios and is often required because of the limitations of nondistributed storage technologies.
• Distributed processing This refers to tools and techniques capable of distributing processing jobs (such as MapReduce, Spark, and so on) for the effective analysis of data sets so massive and rapidly changing that single-origin processing can’t effectively handle them. See the “Distributed Data Collection Backgrounder” for further information.
A few of the terms used in the preceding bulleted list deserve a bit more of an explanation. They are covered in the following backgrounders.
Unlike typical distributed data that is often sent in a bulk fashion (such as structured database records from a previous week), streaming data is continuously generated by many data sources, which typically send data records simultaneously and in small sizes (kilobytes). Streaming data can include log files generated by customers using mobile or web applications, information from social networks, and telemetry from connected devices or instrumentation in data centers. Streaming data processing is beneficial in most scenarios where new and dynamic data is generated on a continuous basis.
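The contrast between bulk loads and streamed records can be sketched in miniature. This is a toy illustration, not any vendor's streaming API; the record fields and source names are hypothetical:

```python
import json
from typing import Dict, Iterator


def stream_records() -> Iterator[str]:
    """Simulate many sources continuously emitting small (kilobyte-scale) records."""
    samples = [
        {"source": "mobile-app", "event": "page_view"},
        {"source": "sensor-17", "event": "telemetry"},
        {"source": "web-log", "event": "click"},
    ]
    for record in samples:
        yield json.dumps(record)  # each record arrives individually, not in bulk


def process_stream(records: Iterator[str]) -> Dict[str, int]:
    """Tally events per source as records arrive, rather than in a weekly bulk load."""
    counts: Dict[str, int] = {}
    for raw in records:
        record = json.loads(raw)
        counts[record["source"]] = counts.get(record["source"], 0) + 1
    return counts


print(process_stream(stream_records()))
# {'mobile-app': 1, 'sensor-17': 1, 'web-log': 1}
```

The key point is that the consumer never waits for a complete data set; it updates its running tallies record by record, which is what makes continuous, dynamic data sources manageable.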
Web clickstream analytics provide data that is generated when tracking how users interact with your web sites. There are generally two types of web clickstream analytics:
• Traffic analytics Operates at the server level and delivers performance data such as the number of pages accessed by a user, page load times, and other interaction data.
• E-commerce–based analytics Uses web clickstream data to determine the effectiveness of e-commerce functionality. It analyzes the web pages on which shoppers linger, what shoppers put in or take out of their shopping carts, which items they ultimately purchase, coupon codes used, and payment methods.
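Both kinds of clickstream analytics can be sketched with a handful of events. The event and field names here are purely illustrative, not drawn from any real analytics tool:

```python
# Hypothetical clickstream events captured from a web site.
events = [
    {"user": "u1", "action": "page_view", "page": "/home", "load_ms": 120},
    {"user": "u1", "action": "page_view", "page": "/product/42", "load_ms": 340},
    {"user": "u1", "action": "add_to_cart", "item": "sku-42"},
    {"user": "u2", "action": "page_view", "page": "/home", "load_ms": 95},
    {"user": "u1", "action": "purchase", "item": "sku-42", "coupon": "SAVE10"},
]

# Traffic analytics: pages accessed per user and average page load time.
views = [e for e in events if e["action"] == "page_view"]
pages_per_user = {}
for e in views:
    pages_per_user[e["user"]] = pages_per_user.get(e["user"], 0) + 1
avg_load_ms = sum(e["load_ms"] for e in views) / len(views)

# E-commerce analytics: cart adds that converted to purchases, and coupon usage.
carted = {e["item"] for e in events if e["action"] == "add_to_cart"}
purchased = {e["item"] for e in events if e["action"] == "purchase"}
coupons = [e["coupon"] for e in events if e.get("coupon")]

print(pages_per_user)        # {'u1': 2, 'u2': 1}
print(avg_load_ms)           # 185.0
print(carted & purchased)    # {'sku-42'}
print(coupons)               # ['SAVE10']
```

Even this tiny sample yields both server-level metrics (page counts, load times) and commerce metrics (conversion, coupons); at real-world event volumes, the same questions demand distributed collection and processing.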
These two use cases demonstrate the potential of vast amounts of data that can be generated in day-to-day operations of a company and the need for tools that can interpret this data into actionable information the company can use to improve revenues.
Hadoop is fairly synonymous with big data. In fact, it is estimated that roughly half of Fortune 500 companies use Hadoop for big data, so it merits its own backgrounder. Believe it or not, what we now know as big data started off with Google trying to create a system they could use to index the Internet (called Google File System). They released the inner workings of their invention to the world as a whitepaper in 2003. In 2005, Doug Cutting and Mike Cafarella leveraged this knowledge to create the open source big data framework called Hadoop. Hadoop is now maintained by the Apache Software Foundation.
The following quote from the Hadoop project itself best explains what Hadoop is:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Notice that Hadoop allows for distributed processing of data in large data sets. To achieve this, Hadoop runs both storage and processing on a number of separate x86 systems in a cluster. Why did they make it so that data and processing are both running on individual x86 systems? Cost and performance. These x86 systems (such as your laptop or work PC) are cheap in comparison to customized hardware. This decentralized approach means you don’t need super-costly, powerful, high-performance computers to analyze huge amounts of data.
This storage and processing capability is broken down into two major components:
• Hadoop Distributed File System (HDFS) This is the storage part of Hadoop. When data is stored in an HDFS system, the data is broken down into smaller blocks that are spread out across multiple systems in a cluster. HDFS itself sits on top of the native file system of the operating system you run (likely Linux, but Windows is supported as well). HDFS allows multiple data types (structured, unstructured, streaming data) to be used in a cluster. I’m not going to get into the details of the various components (Sqoop for databases and Flume for streaming data) that enable this ingestion to occur, because that’s getting way too deep for this brief explanation of Hadoop (and especially HDFS).
• MapReduce This is the processing part of Hadoop. MapReduce is a distributed computation algorithm; its name is actually a combination of mapping and reducing. The map part filters and sorts data, while the reduce part performs summary operations. Consider, for example, trying to determine the number of pennies in a jar. You could count them all by yourself, or you could work with a team of four by dividing the jar’s contents into four sets and having each person count their own share (map function), then adding the four subtotals together for the final answer (reduce function). The four people working together would be considered a cluster. This is the divide-and-conquer approach used by MapReduce. Now let’s say that you have a 4TB database you want to perform analytics on with a cluster of four Hadoop nodes. The 4TB file would be split into four 1TB files and would be processed on the four individual nodes, which would deliver the results you asked for.
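The penny-jar analogy can be sketched as a toy simulation of the two phases. To be clear, this is not the Hadoop API itself, just the divide-and-conquer idea in miniature:

```python
from collections import Counter
from itertools import chain


def map_phase(partition):
    """Map: each 'worker' filters its share, emitting (key, 1) pairs for pennies."""
    return [(coin, 1) for coin in partition if coin == "penny"]


def reduce_phase(mapped_pairs):
    """Reduce: summarize all intermediate pairs into final totals."""
    totals = Counter()
    for key, value in mapped_pairs:
        totals[key] += value
    return dict(totals)


# Divide the "jar" among four workers, just as a large file
# would be split into blocks across four cluster nodes.
jar = ["penny", "dime", "penny", "penny", "nickel", "penny"] * 2
partitions = [jar[i::4] for i in range(4)]

mapped = chain.from_iterable(map_phase(p) for p in partitions)
print(reduce_phase(mapped))  # {'penny': 8}
```

In a real cluster, each `map_phase` call would run on a separate node against its local block of data, and the framework would shuffle the intermediate pairs to the reducers; the logic, however, is exactly this.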
To consider a real-life big data scenario, imagine a very large retailer that needs to read sales data in real time from all cash registers across 11,000 stores, each conducting 500 sales an hour, so it can determine and forecast what needs to be ordered from partners and shipped out on a daily basis. That’s a pretty complex computation. This may very well be the streaming data we discussed earlier. It would be ingested into the Hadoop system and distributed among the nodes in the cluster, and then the required orders and deliveries could be processed and sent to the appropriate systems via REST APIs.
These two components were the basis of the original Hadoop framework. Additional components have been added over time to improve functionality, including these:
• Spark Spark is another data-processing engine that can replace MapReduce. It allows for more in-memory processing options than MapReduce does.
• YARN Yet Another Resource Negotiator, as you can guess, performs a resource management function (cluster resource management specifically) in the Hadoop system.
• Hadoop Common These are the common utilities and libraries that support the other Hadoop modules.
This completes your 101-level crash course of big data analytics using Hadoop as an example. It’s a field that is big today and is only going to become bigger. One last thing: There are commercial big data offerings out there that you can check out. Like most other new areas, mergers and acquisitions are common. For example, two of the larger big data solution providers, Cloudera and Hortonworks, completed their merger in 2019.
You know that big data is a framework that uses multiple modules across multiple nodes to process high volumes of data with a high velocity and high variety of sources. This makes security and privacy challenging when you’re using a patchwork of different tools and platforms.
This is a great opportunity to discuss how security basics can be applied to technologies with which you may be unfamiliar, such as big data. At its most basic level, you need to authenticate, authorize, and audit (AAA) least-privilege access to all components and modules in the Hadoop environment. This, of course, includes everything from the physical layer all the way up to the modules themselves. For application-level components, your vendor should have their best practices documented (for example, Cloudera’s security document is roughly 500 pages long) and should quickly address any vulnerabilities with patches. Only after these AAA basics are addressed should you consider encryption requirements, both in transit and at rest, as required.
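The AAA idea can be sketched generically. This is a minimal illustration of authorizing against a least-privilege policy while auditing every decision; the policy, roles, and actions are all hypothetical, not any Hadoop vendor's actual access-control mechanism:

```python
import logging

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

# Hypothetical least-privilege policy: each role maps to only
# the minimum set of actions it needs, nothing more.
POLICY = {
    "analyst": {"read"},
    "admin": {"read", "write"},
}


def authorize(user: str, role: str, action: str) -> bool:
    """Check the action against the role's policy and audit every decision."""
    allowed = action in POLICY.get(role, set())
    # Audit both allows and denies; denied attempts are often the interesting ones.
    audit_log.info("user=%s role=%s action=%s allowed=%s", user, role, action, allowed)
    return allowed


# An authenticated analyst may read cluster data but not modify it.
print(authorize("alice", "analyst", "read"))   # True
print(authorize("alice", "analyst", "write"))  # False
```

The same pattern, however it is implemented in a given product, is what you are looking for when reviewing a vendor's security documentation: every component should check privileges before acting and leave an audit trail of the decision.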
When data is collected, it will likely go through some form of intermediary storage before it is stored in the big data analytics system. Data in this intermediary (a virtual machine, instance, container, and so on) will also need to be secured, as discussed in the previous section. Intermediary storage could even be as transient as memory or swap space. Your provider should have documentation available for customers to address their own security requirements.