As mentioned in the preface, there are three major players in the cloud provider space at the time of writing: Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. According to a report from Gartner, AWS has the largest market share. This is mostly due to its first mover advantage and its ability to lock in customers once they start building infrastructure/services within the AWS platform. AWS is followed by Azure, and then GCP in overall market share.
Although there are many cloud computing companies out there, this book mostly focuses on the big three, with a slight emphasis on AWS since it holds far more market share than Azure and GCP combined. Among the large swath of cloud service offerings available, the big three mostly focus on what would fall under infrastructure as a service (IaaS) and platform as a service (PaaS) offerings. Although the big three do provide their own versions of software as a service or SaaS (technically you could easily argue that IaaS and PaaS are just forms of software as a service), there is a solid distinction for what the big three provide compared to companies like Salesforce.
As most savvy companies do, cloud providers tend to build upon what they already offer to their clientele (on occasion, these services are even broken up into their component parts to be sold individually). These offerings are normally bucketed under the category of managed services. A managed service essentially allows for one company to pay another to license an already established, but continuously maintained and upgraded piece of technology. On occasion these contracts will contain some form of continual support or a limited development engagement where the managed service company will help integrate the new tech.
Most commonly, managed services are a product that one company offers in one configuration to multiple customers. However, for an extra fee, the offering company will sometimes provide development support to expand upon those offerings based on the needs of the customer. Both Databricks and SAP are known for doing this based on the type of contract you have with them. At a base level, most “as a service” offerings fall into this managed service category. This includes both the PaaS and IaaS offerings provided by the cloud providers as well as software suites provided by companies such as Salesforce and Adobe.
Because the onus of developing and maintaining these services falls on the contracted company, managed services are desirable for teams lacking the personnel, funds, expertise, or just plain time to spin up their own versions of those services. Depending on the team/project, contracting for these services can lead to faster development, less internal engineering burden, and sometimes even a lower monetary resource burn rate. However, this is not always the case, for a few reasons.
When using/spinning up managed services, there is a possibility that teams/decision makers will fall into what I call the “it just works” syndrome. This syndrome pops up when teams are evaluating a new tool or managed service and, based on the marketing descriptions, expect the new piece of tech/managed service to seamlessly integrate with their tech stack (most of the time this assumption proves to be incorrect). If a new piece of tech or service does not miraculously just work after it has already been purchased/licensed or a contract has already been negotiated, teams may accrue quite a bit of overhead in both cost and engineering to adapt to the new service. Even if the work is front loaded before a service is purchased to make sure compatibility is present, there is always a decent chance that some piece of software will need to be upgraded or an integration changed in some manner to be compatible. Though these changes might not necessitate additional resources, the needed upgrades/changes can be disruptive for engineers/normal business functions in a way that potentially causes delays. These compatibility considerations become amplified in long-running systems with lots of legacy dependencies/integrations and by the size of the organization. There are just too many factors to assume that each integration will go smoothly without a decent chunk of engineering intervention in more complicated systems. Even when the service provider agrees to make changes to its platform to support the needs of the client, with solutions engineers or other customer-facing personnel interfacing directly with the client to gather requirements, the provider’s development and rollout of those features will still be subject to its own internal engineering pipelines.
Although some large government agencies or private companies might be able to sway the development timeline of these new features, the service provider still has its own internal processes for testing and then deploying new features, and those processes might not always fully align with the expectations of the client.
The cloud providers have come to realize the downsides, from the customer’s point of view, of the purchase first, proof out later model seen with some managed services. To counteract the aversion to that type of potentially risky business/technical decision, they have created an environment where you pay for what you use, with no subscriptions required, and they allow for some free resource time to try things out.
To recap: managed services can provide significant benefit over custom building an in-house solution. They are popular due to how effective they can be, but as with any significant decision/new technology integration, the proper thought and evaluation process needs to go into determining first whether a managed service is appropriate for your team’s/project’s situation and then which service would actually be the best fit. Everyone’s situation is a bit different, but general considerations and how to weigh tradeoffs are covered a bit later in this chapter. First though, let’s dive a bit into the different types of services the big three cloud providers have on tap.
At their core, the cloud providers give you a virtual machine (VM) or instance to run a piece of code or a process. Where this instance lives depends on the service (and occasionally on the price tier), but for the most part, the instance is spun up in a server farm owned by the provider. This instance can be wrapped in a few different packages and provided with a few different configurations of resources, but for all intents and purposes the instance acts as a normal virtual machine.
At a high level, this is the closest you can get to running code on your own local machine, except that the instance is ephemeral and loses all memory of what was run unless its state is persisted in some manner. Information can be persisted in a few different ways, discussed in depth in Chapter 5.
Although this paradigm is very flexible and provides you with the necessary compute resources when you need them, there is another offering that takes the idea of a borrowed instance to another level. These offerings normally fall under the category of serverless compute. AWS defines serverless compute as an environment where “infrastructure management tasks like capacity provisioning and patching are handled by AWS, so you can focus on only writing code that serves your customers.” AWS introduced its first popular version of serverless compute through its Lambda offering in November 2014. Soon after, Azure introduced Azure Functions, and GCP introduced Cloud Functions.
The main concept behind serverless compute is that the provider supplies a relatively constrained execution environment in exchange for rapid spin up/down of resources for fast execution. Normally, these serverless applications have very low execution and resource costs in order to promote heavy use at scale. Until recently, AWS’s Lambda only supported a handful of languages and a max package size (amount of code and relevant packages) of 250 MB unzipped. Now, however, AWS even provides full Docker container support.
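To make the execution model concrete, here is a minimal sketch of what a Lambda-style function looks like in Python. The handler signature follows AWS Lambda's Python convention, but the event fields and return shape are illustrative assumptions rather than a fixed schema:

```python
import json

def handler(event, context):
    # The service invokes this function with the triggering event (a dict)
    # and a context object; the return value is serialized back to the caller.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally, the handler is just a plain function, which makes unit testing easy.
result = handler({"name": "cloud"}, None)
```

The provider owns everything around this function: it provisions an execution environment, invokes the handler per request, and tears the environment down when idle.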
For the most part, these base services act as the building blocks for most other services provided. For example, EC2 provides each of the individual instances in a cluster created by AWS’s Elastic MapReduce (EMR). Another example can be seen in GCP’s default architecture for a serverless machine learning model, which utilizes both its Cloud Functions and Machine Learning Engine offerings.
As you may have picked up by now, so far I’ve only discussed different services related to compute resources. However, anyone somewhat versed in what is required for data processing (or any type of programmatic workload for that matter) will know that an even more important factor is the need for a place to store all that data, information, code to execute, configuration files, and so forth, whether at rest in between workloads or mid-flight while a workload is being processed. This type of storage can range anywhere from onboard random-access memory (RAM) to durable long-term blob storage. This section breaks the storage types into three categories, and Chapter 5 does a deeper dive into the intricacies.
Though RAM is not normally considered a standard storage mechanism, the amount of RAM utilized for a data processing workload can have a massive impact on runtime and can even impact whether or not a workload fails. The amount of RAM assigned to a workload is normally configured at the resource allocation phase of the spin up process and is occasionally directly tied to the number/type of CPU cores assigned to a specific resource. For example, AWS’s EC2 instances are normally packaged in different CPU and RAM combinations where you can sometimes get memory-optimized instances or compute-optimized instances. Although RAM is normally the fastest memory option available, scaling up instances with more RAM allocation can be very expensive and often a tad wasteful if you’re not using the extra compute resources.
Instead of directly scaling up the RAM associated with a resource, the next best bet is normally some type of block storage. This block storage is normally provided via the cloud provider, allowing you to attach a hard disk drive (HDD) or a solid state drive (SSD) to the spun-up compute resource. Not all applications will need attached storage, but those that do will need varying rates of performance that will eventually impact the overall cost of a process.
Attached block storage is very handy for holding files mid-process, but the files are more often than not ephemeral and spun down after their corresponding resource is terminated. Some storage solutions, like AWS’s Elastic File System (EFS), provide the ability to keep the drives hot and swappable between compute resources, but this is often accompanied by a cost premium over other solutions that isn’t always justifiable. The cost premium becomes even more apparent when taking into account how cheap and widely available blob storage is.
Blob (or object) storage is probably the most recognizable of all cloud services, whether we are talking about AWS’s Simple Storage Service (S3), Azure’s Blob Storage, or GCP’s Cloud Storage. When it comes to durable long-term storage, the blob storage services are normally the way to go. These solutions are normally considered highly durable because they save backups across data centers, unlike their block storage counterparts. On top of that, blob storage tends to be pretty cheap compared to other long-term storage due to variable pricing tiers. At a high level, block and blob storage have a few differences. Block storage is essentially how data is stored in a filesystem on an HDD or SSD, where data is stored in blocks that can be read and written rapidly in small portions. Blob storage is normally accessed over an API (which can be abstracted out to act like a filesystem), where files are interacted with in their entirety. Since the files cannot be written in portions, they cannot be edited in place like files stored via block storage and must be fully rewritten.
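The whole-object semantics can be illustrated with a toy in-memory object store. This is purely a conceptual sketch (real blob stores like S3 sit behind an HTTP API), but it captures the key behavior: a put replaces the entire object, so an "edit" is always a full rewrite:

```python
class ToyObjectStore:
    """Illustrative blob-store semantics: objects are written and replaced whole."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data: bytes):
        # No seek/partial write: a put always replaces the entire object.
        self._objects[key] = data

    def get(self, key) -> bytes:
        return self._objects[key]

store = ToyObjectStore()
store.put("logs/2021-01-01.json", b'{"event": "start"}')

# "Editing" a blob means reading it and rewriting it in full, unlike block
# storage, where a filesystem can update individual blocks in place.
old = store.get("logs/2021-01-01.json")
store.put("logs/2021-01-01.json", old + b'\n{"event": "stop"}')
```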
The pricing and durability difference between block and blob storage has been recognized and leveraged by companies like Databricks when it comes to building out data lake solutions. Databricks in particular has built out a solution for its platform to treat blob storage (as of this writing, it has integrations for both S3 and Azure’s Blob Storage) as a Hadoop Distributed File System (HDFS) to back the Spark clusters spun up by its platform. This integration has been so successful that AWS even rolled out its own flavor, called the EMR File System (EMRFS), for use with its own Spark and MapReduce solutions. Microsoft has also identified the power of blob storage when it comes to long-term and large-scale data storage and has rolled out its own data lake solution that sits on top of its blob store.
There are two major downsides when it comes to using blob stores: files written out to the store cannot be edited, and due to the nature of how high availability is achieved across data centers, file operations often suffer from a phenomenon called eventual consistency. Eventual consistency occurs when “the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.” Eventual consistency can cause a lot of pain, and for workloads requiring rapid file changes it can be a major reason to use a block storage solution in lieu of blob storage. However, as I write, AWS has stated that all S3 operations are now immediately consistent upon completion. This completely changes a lot of the initial paradigms around using S3 as a filestore and, due to its immensely transformative impact, is covered in much more depth in later chapters.
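Eventual consistency is easier to reason about with a toy model: a write lands on one replica first and only later propagates, so a read served by a stale replica can return an old value. This is a deliberately simplified sketch, not how any particular store is actually implemented:

```python
import random

random.seed(0)  # deterministic replica selection for this illustration

class ToyReplicatedStore:
    """Illustrative model of a store replicated across data centers."""

    def __init__(self, replicas=3):
        self.replicas = [{} for _ in range(replicas)]

    def write(self, key, value):
        # The write initially lands on a single replica.
        self.replicas[0][key] = value

    def replicate(self):
        # Background propagation: eventually all replicas converge.
        for replica in self.replicas[1:]:
            replica.update(self.replicas[0])

    def read(self, key, default=None):
        # A read may be served by any replica, including a stale one.
        return random.choice(self.replicas).get(key, default)

store = ToyReplicatedStore()
store.write("report.csv", "v2")
# Before propagation, some reads may still miss the new value...
stale_reads = {store.read("report.csv") for _ in range(50)}
store.replicate()
# ...after propagation, every replica returns the last updated value.
consistent_reads = {store.read("report.csv") for _ in range(50)}
```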
Now that the two main needs for running any type of computer-based workload have been covered (compute for doing the actual work and storage for holding onto the inputs/outputs/intermediaries of said process), let’s talk about how these provisioned resources can be built upon.
After compute and storage resources have been spun up, normally the basic next step is to spin up an environment where code or processes can actually be run. In order to attract the widest audience possible, the cloud providers facilitate a few different options here. At a very base level, you could spin up a compute unit with your favorite flavor of Linux just like you would a virtual machine. However, cloud providers have identified that one way they can add value for their basic compute services is to provide some standardized boot images.
For example, when you spin up an AWS EC2 instance, you’re prompted to select a certain Amazon Machine Image (AMI) to be preinstalled on the “machine.” Since these AMIs are basically boot/virtual machine images, they are easily shared and can provide a wide variety of functionality. On top of that, these AMIs can be custom-brewed images, obtained through a marketplace of verified company-created AMIs, or shared among the community. As a result, these AMIs can range anywhere from barebones operating systems like Windows Server and numerous flavors of Linux/Unix (from Ubuntu to Red Hat) to fully fledged software suites such as WordPress. AWS is also now providing macOS instances, although those are currently instantiated via a different process than the AWS marketplace.
Marketplace AMIs might have an extra charge to them depending on the software provided. Most open source software obviously doesn’t have an extra cost tagged on, but other images might. The pricing/licensing structures (as is the case with most software) might vary. In some cases, such as Splunk Enterprise, AWS does not charge the user for the license, but the user must purchase a license through Splunk itself. When the instance is spun up, the user must authenticate via the Splunk user interface to verify that they own the relevant license.
GCP’s Compute Engine (its version of EC2) has a similar concept for boot images. Its Public Images include open source, official third-party, and Google-maintained images, while its Custom Images can be used by individual customers to store their own custom boot images. Azure provides support for Linux/Windows virtual machines alongside custom images, but as of this writing, Azure’s third-party image support is quite lacking compared to GCP and AWS.
Virtual machines are not the only method by which the cloud providers allow a user to perform programmatic tasks within their platform. As mentioned before, for specific serverless offerings, the cloud provider takes away the need for managing a full operating system/image and provides just a runtime execution environment. Within that environment, all a developer needs to submit are any input config/data files, the code to execute in the language of their choice (with some limitations depending on the service), and any language-specific packages/libraries they might need in order to run their process. The developer then has to set the requested resources, normally in terms of CPU units, RAM, attached storage in the form of an SSD/HDD, and in certain instances, auto-scaling limits. After selecting the desired resources, they give the resource the requisite permissions and then upload their deploy package in order to start running their code in the cloud. Services that provide this kind of functionality, such as AWS’s Lambda, Azure’s Functions, and GCP’s Cloud Functions, have become extremely popular over the past few years as they have continually been refined by the cloud providers. Services of this ilk are normally said to have a serverless architecture and are often the centerpiece of microservice applications hosted in the cloud.
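The deploy-package workflow can be sketched in two halves: bundle the code, then hand the bundle plus the resource configuration to the service. The packaging half below is plain Python; the `create_function` call is shown inside a function that is never executed here, since it requires real AWS credentials, and the function name and role ARN are placeholder assumptions:

```python
import io
import zipfile

def build_deploy_package(source: str, handler_file: str = "handler.py") -> bytes:
    """Bundle source code into the zipped deploy package most serverless services accept."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(handler_file, source)
    return buffer.getvalue()

def deploy_to_lambda(package: bytes):
    # Sketch only: requires boto3 and valid credentials/permissions to run.
    import boto3
    client = boto3.client("lambda")
    client.create_function(
        FunctionName="my-function",                     # illustrative name
        Runtime="python3.9",
        Role="arn:aws:iam::123456789012:role/example",  # placeholder role ARN
        Handler="handler.handler",
        Code={"ZipFile": package},
        MemorySize=256,   # requested RAM, in MB
        Timeout=30,       # max execution time, in seconds
    )

package = build_deploy_package("def handler(event, context):\n    return 'ok'\n")
```

Note how the resource requests (memory, timeout) travel alongside the code itself; there is no server configuration anywhere in the picture.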
The sharp-eyed reader is probably already asking about the differences between using serverless applications, utilizing containers, running a Rails or Node.js application, or submitting tasks to a Kubernetes cluster. The differences are few, but they are distinct. Chapter 5 has a more in-depth comparison, but at a high level it comes down to packaging and the developer’s relation to the underlying infrastructure. With serverless architecture, the cloud provider owns and maintains the underlying server where a piece of code is executed. This means that there is no server to manage for the Node.js app to run on, no Kubernetes cluster to maintain, and no worries about two simultaneously running applications contending with each other for resources. Due to the popularity of serverless architecture, the cloud providers have expanded their lineups of serverless applications to accept code/dependencies in a variety of forms. Services like Azure Functions, Cloud Functions, and AWS Lambda accept packages of code bundled in zipped folders, code written directly via the console, or code pushed via an API. More recently, Docker has picked up popularity, and more services are beginning to allow for the uploading of Docker containers as well.
The ability to easily bring whatever software your team needs to a cloud environment is very powerful and can satisfy almost any use case. Despite this, the cloud providers have also realized that there is a large demand for products that allow devs to focus less on the maintenance of their own tooling. One of the most popular areas for services that instead let developers focus on the needs of their business lies in the realm of data processing and analytics.
Some of the most common types of managed services provided by the cloud providers are those related to databases and data access. For on-premises (on-prem) systems, when a database was needed, a server would be stood up in a data center, and some flavor of database tech would be installed to serve the current needs of the business. This database would be constrained to the physical resources available in the data center. Database upgrades or maintenance could often be lengthy, and if an unlucky event occurred, the business could be without that database for days at a time. Older tech like Microsoft SQL Server and MySQL was once common, but more recently, more advanced technology has been developed to service the sudden abundance of data available to businesses. Technologies in this category include suites like Apache’s Cassandra and MapReduce, and Cloudera’s Impala.
Compared to just having a singular always-on database to service a single team’s needs, a common paradigm nowadays is to have both a data lake and a data warehouse. The data lake commonly houses all the data that an enterprise might generate, often with the data sitting in a Hadoop-based system that only turns on compute resources to complete a specific task. The data warehouse is a more common database technology holding a constrained subset of the data from the data lake, often broken into data marts that facilitate analyst-related exploration/workloads. With an on-prem system, this data lake and data warehouse setup could be very costly considering the hardware needed to house/process the vast sums of data normally collected within data lakes. The cost may be compounded by the fact that most data centers need to buy physical storage and processing capacity in advance, based on often incorrect predictions, and that capacity can sit underutilized for long periods of time. As a result, storing and processing data at a large scale with a clear delineation between a data lake and data warehouse would seem out of reach for most companies.
Once cloud services came into the picture, this data lake/data warehouse pattern became much more feasible for a few different reasons. First, the pay-as-you-go model allowed companies to keep costs down while still essentially being able to scale indefinitely. Second, the managed database service offerings removed the need for an on-staff engineer to specifically manage each database’s upgrades, permissions, and general maintenance. Most tasks could easily be scheduled to occur automatically through the service provider’s user interface. The ease of scheduling automated maintenance, setting up replication, creating more resilient automated backups, and offloading hardware maintenance/fault resolution has increased the reliability of these often system-critical databases past that of their on-prem counterparts. Lastly, the plethora of data extraction, analytical/visualization services, and hosted environments has reduced the in-depth engineering, DevOps, and infrastructure knowledge that was previously needed to support these types of systems on-prem. The requirements have been reduced to the point where users of all skill levels can easily process data up to terabytes in scale (if they are willing to pay the potentially hefty fee, of course).
At a high level, there are a few different services that the providers have for working specifically with data. Popular services such as AWS’s S3, Azure’s Blob Storage, and GCP’s Cloud Storage have become the de facto backing for data lakes due to their cheap data storage. Often, they will provide other handy features such as automated replication, recovery options, tiered storage, and more recently, ways to query certain types of data directly through their management console. For actually getting data into blob storage, you could use one of the runtime environments mentioned in the previous section. However, there are a few other offerings that are specifically built for spinning up high-throughput data processing environments and purpose-built clusters for large-scale data processing/analytics. One example of this is AWS’s EMR, which spins up distributed clusters with a slew of different software packages that are prime for processing large swaths of data for deposit back into S3. If spinning up an open-ended environment is a bit too daunting, the providers also have services that offer a layer of handholding and/or abstraction to reduce the need for in-depth knowledge of how different data processing technologies work. Examples include GCP’s Dataflow, which provides a fully managed data processing service, and AWS’s Glue/Data Pipeline for easily integrating multiple data sources. There are also offerings that provide ways to easily stream data into/through the platform, such as AWS’s Kinesis Data Firehose and Azure’s Event Hubs and Stream Analytics.
After the data has been brought into the environment, it needs to be made queryable and accessible by both engineers and analysts. For live systems that need short turnaround times, there are services for rapid-access caching, like Elasticsearch. If a traditional relational database is preferred, there are options such as Azure’s Database for MySQL, Azure’s Database for MariaDB, or of course, Azure SQL Database for Microsoft’s SQL Server (MSSQL). Most providers also offer document stores, such as MongoDB, and systems specifically tuned for rapid access to key-value pair structured data, such as DynamoDB. Services like DynamoDB provide cost-effective optimizations for very narrow use cases, but can balloon in cost or just plain not work for more complicated scenarios, data structure types, or analytical workloads. GraphQL, which has started to gain traction over the past few years, has also become more prevalent among the providers.
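The key-value tradeoff can be sketched with an in-memory stand-in (illustrative only, not DynamoDB's actual API): lookups by the primary key are a single cheap read, but any question that isn't phrased in terms of that key degrades into scanning every item, which is exactly where costs balloon.

```python
# Toy key-value table keyed on user_id, mimicking the access pattern of
# stores like DynamoDB (an illustrative stand-in, not the real API).
table = {
    "user-1": {"user_id": "user-1", "plan": "free", "region": "us-east-1"},
    "user-2": {"user_id": "user-2", "plan": "pro", "region": "eu-west-1"},
    "user-3": {"user_id": "user-3", "plan": "pro", "region": "us-east-1"},
}

# The narrow use case: a direct lookup by key is a single cheap read.
item = table["user-2"]

# The analytical workload: "all pro users" has no key to look up, so it
# forces a full scan of the table, touching every item.
pro_users = [row["user_id"] for row in table.values() if row["plan"] == "pro"]
```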
Although the previously mentioned service types are great for rapid access and narrow workloads, most do not scale to an enterprise level. They might scale well for individual teams or services, but they tend to become functionality- and cost-prohibitive when supporting analytics across an enterprise, especially for a more mature software company. This can clearly be seen in the fact that the AWS Relational Database Service (RDS) offerings can scale up by EC2 instance size, but are limited to one instance (disregarding replication). This is where data warehouses come into play. At a mature organization, the data warehouse should eventually become the source of truth for the enterprise. The data warehouse will normally become the source of most reporting served up to the business, any data marts needed for analytical workloads, and the final say between two systems that may come into conflict over a piece of information. Options like AWS’s Redshift, GCP’s BigQuery, and Azure’s Synapse Analytics are the current behemoths when it comes to popular data warehouses. As time goes on and more services become available, the line between data warehouses and data lakes is starting to blur; more on that topic later in this book.
The number of different data processing options out there can be a bit overwhelming at first, and there are a lot of considerations for picking the right tech stack for dealing with a firm/project’s specific data needs. A large part of narrowing down the ever-widening swath of tooling begins with identifying the team’s specific skill sets and the project’s specific needs. However, there are a few paradigms and considerations that can be termed global. These combinations of considerations and how to evaluate the potential costs of different tech stacks are covered in greater detail in Part 2.
After settling on a tech stack, the next step is to actually stand up the necessary resources in a way that hired developers/engineers can seamlessly move features through development to production. There are a few different paradigms for how this can happen. For now, let’s look at it from a bird’s eye view.
Any form of cloud engineering will require a few basic tools that can be bucketed into a handful of categories. The most important fall under the following:
Infrastructure creation and maintenance in whichever cloud environment is chosen
Deploying code from local development to the cloud for testing/production (deploy tools)
Documentation stores (think more Confluence and less MongoDB) and ticketing systems for tracking work
Log/metric generation and analytics
Fleshed-out alerting covering a variety of different alert types
The security aspect will be dissected more in the next chapter, but at a very base level: do not make things public that you do not want to be public! The cloud providers make it relatively easy to keep everything private by making some services private by default or making it easy to revoke public access. If a port to the outside world/overall internet needs to be opened, that can be accomplished in a safe manner (covered in Chapter 2).
How a team manages their infrastructure can greatly impact their speed of rolling out features and how well that team manages their cloud spend. While the cloud providers offer infrastructure as a service (IaaS), a new paradigm has arisen called infrastructure as code. Due to the ease of spinning resources up and down as needed, a manageable, programmatic way of deploying these resources has become a must. This becomes even more important when managing multiple environments and dealing with the complexity of most enterprise environments. Infrastructure as code tools such as HashiCorp’s Terraform have allowed for the reuse and deployment of cloud infrastructure across testing and production environments in a programmatic fashion. With the use of modules and config files, you can ensure that the environments are perfect replicas of each other from both a security and resourcing standpoint. The Terraform tool also allows for a preview of the changes made before an actual deployment to verify the differences between the current state and expected future state. Terraform and tools like it also provide built-in secret management integration through their ability to pass in credentials at runtime by calling a secret management store like AWS’s Key Management Service (KMS), Azure’s Key Vault, or GCP’s Secret Manager.
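As a rough illustration of the module/config pattern, a Terraform layout might parameterize one shared module per environment so that test and production differ only in their variable values. The module name, its variables, and the bucket naming below are hypothetical, not a published module:

```hcl
# environments/prod/main.tf -- a hypothetical layout; the "data_pipeline"
# module and its inputs are illustrative assumptions.
module "data_pipeline" {
  source        = "../../modules/data_pipeline"
  environment   = "prod"
  instance_type = "m5.xlarge"   # a test environment might use a smaller type
  bucket_name   = "example-corp-data-lake-prod"
}
```

Running `terraform plan` then previews the diff between the currently deployed state and this configuration before `terraform apply` makes any changes.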
Despite using infrastructure as code to deploy new infrastructure, it is still important to keep an eye on what is currently running in your environments. Almost every team has most likely caused a bit of extra spend by accidentally leaving up resources while developing, testing, or putting out fires.
One of the hardest parts about developing in an enterprise environment can be setting up local development environments that accurately mirror working within the cloud. The intricacies and potential pitfalls around deploying code to the different cloud services shouldn’t be underestimated. When it comes to deploying application code, the trickiness comes from the fact that there is no one-size-fits-all approach to deploying code to each of the different services. In some cases, you can point toward zip files stored in S3, upload code via the UI, or even push code with the API when a resource is created. Some services provide their own code/image management, such as the Serverless Application Model (SAM) + CodeDeploy for Lambda and the Elastic Container Registry (ECR) for Elastic Container Service (ECS) + Fargate (and now for Lambda as well). Azure also has its own deployment functionality through its Pipelines, which act very similarly to Jenkins.
Don’t forget about library or package dependencies. If your system is closed off to the open internet, you won’t be able to download/install libraries from pip or Maven. Instead, those libraries will need to be hosted somewhere in your environment that can be accessed by the resources needing them.
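For Python, for example, pointing pip at an internal mirror is a one-line configuration change. The host name below is a hypothetical internal index, and the mirror itself still has to be stood up and populated with the packages your resources need:

```ini
# pip.conf (pip.ini on Windows) -- hypothetical internal mirror URL
[global]
index-url = https://pypi.internal.example.com/simple
```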
Before deploying their code to production or development environments with resources in the cloud, most software engineers begin by developing/testing their code locally. This is done to avoid racking up cloud compute costs for the first couple iterations of development, where the engineer is just starting to put together their code. In order to get an environment that is as close as possible to production, developers will use tools like Docker or Vagrant. However, while some services provide support for containers, many do not. This methodology gets the engineers close, but has a few pitfalls. Often, the cloud environments have more compute/memory resources available to them than the local machines, which limits the amount of data that a process being developed locally can interact with. This often leaves load testing for higher environments and can teach some potentially harsh and costly lessons about writing efficient code to devs who might be a bit earlier in their careers. Alongside the lack of compute/memory resources, there’s a decent chance that a lot of the functionality from certain solutions won’t be available in the local Docker containers. There are a few potential workarounds for this. You could establish an environment in the cloud that is accessible via the local machines. Alternatively, you could find similar software that can be included in lieu of a particular service (even though often there isn’t one that can easily be plugged in). Each option provides its own set of tradeoffs that need to be assessed based on the team, situation, and technology involved.
For example, if an engineer uses Redshift as their data warehouse and wants to work on some new functionality locally, they can spin up a Redshift cluster in the cloud, connect their local Docker container to it, and do local development that way. However, with this comes the extra cost of another Redshift cluster in the cloud, the potential that this cluster will remain up indefinitely (resources being left up in the cloud after development has finished for the day has happened enough times that it should be accounted for), and the potential security hole opened by having a local machine connect to resources within the cloud. Contrast this with the scenario where a local Postgres database (Redshift was initially based off of Postgres) is spun up in the local Docker environment instead. Then, after code that worked locally fails during the first deployment to a lower cloud environment, the developer has to spend a good two or three extra days of their sprint determining how Postgres and Redshift have diverged over the years. Even though the team is saving on cloud compute costs, they end up paying for it in development delays, which can be more expensive. Unfortunately, there is no general paradigm here, and discussions around what will work best will need to be had, alongside some experimentation to find what works for each situation/team.
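Whichever way a team lands, it helps to keep the local/cloud difference confined to configuration. A minimal sketch, assuming hypothetical host and database names, of switching between a local Postgres container and a Redshift cluster via an environment flag:

```python
# Sketch: choose a connection string based on where the code is running.
# Host names, users, and database names here are hypothetical; the point
# is that only configuration changes between local and cloud runs.

def database_dsn(env: dict) -> str:
    """Return a Postgres-protocol DSN for either local Docker or Redshift."""
    if env.get("APP_ENV") == "local":
        # Local Postgres container standing in for Redshift
        return "postgresql://dev:dev@localhost:5432/analytics"
    # Redshift speaks the Postgres wire protocol, conventionally on port 5439
    host = env.get("REDSHIFT_HOST", "example-cluster.abc123.us-east-1.redshift.amazonaws.com")
    return f"postgresql://etl_user@{host}:5439/analytics"

dsn = database_dsn({"APP_ENV": "local"})
```

The application code then connects with whatever driver it already uses, and nothing outside this function needs to know which environment it is in.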
Now that all of this code has been created, and deploys are being attempted/are successful, you need a way to track how the processes are actually performing or, in some cases, why certain processes/deploys failed. This is where proper collection of logs and metrics, followed by proper analytics of those logs/metrics, comes into play. Oftentimes, you'll also see alerts being generated off of those logs/metrics. Whether a team decides to use a third-party tool like Sumo Logic, Splunk, Prometheus, or Grafana, a custom Spark-based system, or a cloud-provided service such as AWS's CloudWatch, Azure Monitor, or GCP's Cloud Monitoring, emphasis should be placed on making sure the logs flowing through the system have consistent formats, are concise, and are easy to parse. It can also be very helpful to have the logs be easily tied back to the metrics of the resources being utilized. This doesn't just mean that each log has timestamps attached. Having a way to easily dig through, visualize, and look at these logs/metrics from an aggregate/historic view can prove key when diagnosing the non-obvious errors that tend to arise when running production systems in a cloud environment. This ease of performing diagnostics can easily save on both development and disaster recovery turnaround time.
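One way to get that consistency is to emit every log line as a single JSON object with a fixed set of fields. The sketch below, using Python's standard logging module, is one possible shape (the field names and service name are illustrative, not a standard):

```python
import json
import logging

# Sketch of a consistent, easily parsed log format: every record is one
# JSON object with a timestamp, level, service name, and message, so logs
# from different services can be aggregated and joined to metrics later.

class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": record.created,      # epoch seconds; joins to metric samples
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("ingest-worker"))
logger = logging.getLogger("ingest-worker")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("batch complete")
```

Because each line is valid JSON with the same keys, any of the tools mentioned above can parse, filter, and aggregate it without per-service regex gymnastics.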
Often, cloud providers provide some level of history tracking for each of their products through their user interfaces. However, if those logs, metrics, and transient system information aren’t written out to a separate system, the history might be lost forever. This can make persistent, yet sparse errors almost impossible to diagnose, especially if there is no way to easily compare a healthy run to the one that failed.
In addition to cloud infrastructure history not being around forever, the logs generated by the software itself might be lost after a transient instance is spun down and the allocated storage is wiped clean. This can make it almost impossible to tell why a certain resource/program/run/job failed. Some services do track the error message that might accompany a failure, but those are inconsistent in value based on language, error type, service, and so on. Even if a particular service does provide the full logs, an effort should be put into making those logs easily parsable, aggregatable, and joinable to the state of the instance/execution environment. The error alone might miss crucial contextual information that explains where/why a certain error occurred. For example, when a Spark cluster dies due to an out-of-memory error or garbage collection loop issues, the last error written does not necessarily call out that a memory threshold kept getting bumped into. Instead, tying the collection of logs together with the cluster's resource utilization by their timestamps gives you the full picture of what was going on in the cluster to cause a failure. Some folks over at Databricks even recommend using log trends to diagnose what is going on with a Spark cluster.
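The timestamp join itself can be very simple. A toy sketch, with invented metric samples and log entries, of lining a failure up against the nearest memory reading:

```python
# Sketch: tie a log line to cluster metrics by timestamp so an error can be
# viewed next to the resource state at the moment it occurred. The samples
# and log entry below are made-up illustrations.

def nearest_metric(ts: float, metrics: list[tuple[float, float]]) -> float:
    """Return the metric sample closest in time to a log timestamp."""
    return min(metrics, key=lambda m: abs(m[0] - ts))[1]

# (timestamp, memory_used_fraction) samples from the cluster
memory_samples = [(100.0, 0.71), (160.0, 0.93), (220.0, 0.99)]
log_entry = (215.0, "ExecutorLostFailure: worker stopped responding")

mem_at_failure = nearest_metric(log_entry[0], memory_samples)
# A near-1.0 memory fraction at the failure timestamp points at memory
# pressure even though the log line itself never mentions memory.
```

Real systems would do this join inside the log analytics tool, but the principle is the same: the error text plus the resource state together tell the story.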
The focus so far has been on error handling, but proper metric tracking and logging can also provide a huge boon when it comes to performance testing. Because you pay for what you use in the cloud, tracking historical runs for comparison purposes can be very beneficial. Performance tuning can prove to be much more efficient when historical runs are abundantly available for comparison instead of continuously running before/after tests. The cost savings are amplified when working with larger datasets or processes that require more resources. Proper tracking where inputs/outputs are identified and easily replicated can save a lot of time and money over the course of a project's lifetime. This paradigm can even allow for cohort-based comparisons versus only using one-off runs that might not give full insight into the impacts of the changes, especially for variable workloads. With larger datasets and more resource-intensive jobs, this becomes crucial when needing to adhere to specific SLAs.
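A cohort-based comparison can be as lightweight as summarizing a candidate run against the history of the same job. A sketch with invented durations (in minutes):

```python
from statistics import mean, median

# Sketch: compare a candidate run against a cohort of historical runs
# instead of a single before/after pair. Durations are invented for
# illustration.

def compare_to_cohort(candidate: float, history: list[float]) -> dict:
    """Summarize a new run's duration against historical runs of the same job."""
    return {
        "cohort_median": median(history),
        "cohort_mean": mean(history),
        "delta_vs_median": candidate - median(history),
    }

history = [42.0, 44.5, 39.8, 47.2, 41.1]
report = compare_to_cohort(36.4, history)
```

Comparing against a median rather than a single prior run smooths out the variability that cloud workloads naturally have, which is exactly the problem one-off before/after tests run into.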
When it comes to alerting, there is a large swath of options. At the time of writing, AWS itself has a slew of options, including a fully custom microservice approach with Lambda, a more managed approach with Simple Notification Service (SNS), and a fully managed service with Chatbot. It is even possible to see a system where all three are used in conjunction with CloudWatch. Integrations with messaging platforms like Slack are simple enough through webhooks, whereas setting up email or cell notifications can be a bit more tricky when it comes to the security side of things. There are some items to keep in mind, though. It may be tempting to use a tool like Terraform to roll out the exact same logging/alerting to every environment from sandbox to prod. However, though easy to do, this can create alerting fatigue among the team and cause important errors in prod to be overlooked. Please, no more @channel pings for non-prod environments. Due to the differences in how logs/error messages are generated across services, it can also be easy to fall into the trap of having obtuse, indecipherable, intrusive, and/or unhelpful alerts. Though you may want to print out the full stack trace in an alert, it is normally better to link to both the offending process and its accompanying logs/metrics instead.
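Both of those points can be baked into the alert builder itself. A sketch of a Slack-style webhook payload (the URLs and job names are hypothetical placeholders) that links out rather than pasting a stack trace, and only pings the channel for prod:

```python
import json

# Sketch: a concise alert that links to the failing process and its logs
# instead of dumping a stack trace. URLs and names are placeholders.

def build_alert(job: str, env: str, run_url: str, logs_url: str) -> str:
    """Build a webhook payload; only prod alerts get the channel ping."""
    prefix = "<!channel> " if env == "prod" else ""
    return json.dumps({
        "text": f"{prefix}{job} failed in {env}. <{run_url}|Run details> | <{logs_url}|Logs>"
    })

payload = build_alert(
    "nightly-etl", "prod",
    "https://console.example.com/runs/1234",
    "https://logs.example.com/nightly-etl/1234",
)
```

Encoding the environment check in code, rather than in per-environment Terraform copies, keeps one alerting path while still sparing the non-prod channels.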
Once the infrastructure and code have been established, you have to actually connect the cloud services together, along with any producers/consumers that the system might interact with. Depending on the technology involved, cloud providers make this either a bit of a challenge or relatively seamless. Since the cloud providers are the ones providing the networking infrastructure, they handle the burden of making the physical connections, dealing with routing and load balancing, and making sure all of their internal services can talk to each other. From a user's standpoint, when connecting individual native services within one provider and account, this can be as "simple" as updating each service's permissions to allow the different services to interact with each other.
There are exceptions, of course, especially in environments where you bring your own software and in the case of relatively new/changed services. There can be a case where a new service is launched, but an older service hasn't been updated to integrate seamlessly with it yet. It might even take a few versions for the old service to catch up, as was the case with EMR, EMRFS, and S3 Access Points. Chapter 2 covers this in more detail, but by default, each account and most services stood up in that environment are built into their own subnet. Each provider has its own flavor of this: both AWS and GCP use a construct called a Virtual Private Cloud (VPC), while Azure uses one called a Virtual Network. These individual networks essentially cut off the resources in one account from the rest of the world. Each VPC can have its own version of Security Groups that contain security permissions and can open connections to external resources.
One other aspect of networking that should not be undervalued is the technical specifications of the connections that happen between the individual components of a specific service. The easiest example is the connections that need to occur between an individual compute instance and any attached storage. With attached storage, there will be a need for data transfer across a network (even if just the provider's inter-data-center network). As a result, there is always the chance of networking errors causing broken connections or delays in communication between the two layers. This can become extremely prevalent in Spark workloads that are not supplied with enough RAM and involve a lot of shuffling (reads and writes of potentially large files) to disk.
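A rough back-of-the-envelope check, under deliberately simplified assumptions, can flag this situation before a job runs: if the data shuffled per executor exceeds the memory available for execution, expect heavy spill to disk and the network/storage traffic that comes with it. The 0.6 fraction below is a stand-in for the portion of executor memory Spark leaves for execution and storage, not an exact figure:

```python
# Sketch: estimate whether a shuffle will spill to disk. All numbers are
# simplified assumptions for illustration, not a real capacity model.

def will_spill(shuffle_gb: float, executors: int,
               executor_mem_gb: float, mem_fraction: float = 0.6) -> bool:
    """True if per-executor shuffle data exceeds usable execution memory."""
    per_executor = shuffle_gb / executors
    usable = executor_mem_gb * mem_fraction  # rough usable fraction of heap
    return per_executor > usable

# 2 TB shuffled across 20 executors with 32 GB each: ~100 GB per executor
# against ~19 GB usable, so heavy spill to disk is expected.
spills = will_spill(shuffle_gb=2000, executors=20, executor_mem_gb=32)
```

Real sizing depends on partitioning, data skew, and the specific Spark memory settings, but even a crude check like this can prompt the right conversation about cluster sizing before the network becomes the bottleneck.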
When feasible, it is wise to use the interconnectivity supplied by the cloud providers. Often this reduces the burden on the engineering teams for creating and maintaining these connections and, depending on the service, can greatly reduce any potential security concerns and/or performance impacts compared to a non-managed service connection. However, there are tradeoffs to this. Sometimes the cookie-cutter connectors are not ideal for the rest of the technology used in the system, or some of the abstracted-away configurations of the connections don't allow for necessary performance optimizations. The connectivity piece should obviously not be the deciding factor when it comes to selecting services or setup configurations, but it should be a consideration.
Here's why all this effort is being put into building enterprise data systems: every company/team will eventually need a way to analyze all the collected data in order to get a proper pulse on how their activities are operating. On top of reviewing the data through more traditional analytics, the introduction of cloud compute resources and distributed systems has made more advanced analytic techniques such as machine learning (ML) and artificial intelligence (AI) more accessible after decades of being largely theoretical. Over the past few years, not only has the number of open source models exploded, but so has the number of managed service offerings from the cloud providers. Due to the popularity and overall presence that AI and ML have had over the past few years, these providers offer a wide range of resources, from basic Graphics Processing Unit (GPU)-enabled environment hosting for popular frameworks (such as SageMaker for PyTorch and TensorFlow) to plug-and-play models where you can submit an image and automatically have personal protective equipment identified. The number of these types of offerings will only grow as time goes on, and for most teams, some combination of custom modeling and managed services will most likely be the best fit.
For non-data-science workloads, there are quite a few different options as well. At a base level, there are options like Redshift and BigQuery for larger, data-warehouse-style workloads. On the other hand, database systems like RDS might make more sense for smaller data stores. On top of that, Spark hosting and other environment hosting services are not just limited to the data ingestion side of the house. If the analysts have a stronger programmatic and software engineering background (or they have the right engineers to support them), using Spark with their libraries of choice can be a very powerful combination, especially for the larger workloads normally found in enterprise environments. If some quick insights are needed from files stored in blob storage, AWS's Athena or Azure's Data Lake Analytics are solid bets. The biggest challenge arises when trying to integrate visualization tooling.
At the time of writing, AWS's out-of-the-box visualization story is comparatively thin, with QuickSight as its main managed offering. GCP, on the other hand, leans into this space. GCP's BigQuery can be hooked up to Looker to provide an analytics platform frontend with features that are useful for enterprise organizations as well as analyst-friendly report and dashboard tooling. GCP also has its Google Data Studio, which can integrate directly with BigQuery. Not to be one-upped, Azure has integrated Microsoft's renowned Power BI through Power BI Embedded. Azure also offers Azure Synapse Analytics, which provides querying functionality and visualization, although Power BI is much more powerful. Despite the small handful of native visualization options, it is fairly easy to bring in a third-party visualization tool due to the availability of hosting services. For example, Tableau is often utilized with cloud data hosting to serve up reporting for anyone from analyst- to exec-level consumers. Due to the flexible nature of cloud services (when set up correctly), delivering data to a large swath of technologies can be relatively seamless. Bottlenecks can arise, however, as the amount of data transferred grows. The biggest catch is making sure that data delivery doesn't end up costing more than the insight is actually worth.
There are a lot of considerations when it comes to selecting the right technology stack for a team or project. For most projects, one of the major considerations will be cost, of which the biggest driving factor will almost always be usage. In the cloud, very much unlike on-prem, teams pay for what they use. Thus, the name of the game is using the least amount possible while still providing reliable, scalable, and performant services. The pay-for-what-you-use billing model, with charges on a per-second basis, has become one of the main reasons why microservice architectures are so popular. Usage covers everything from storage to compute time and, on occasion, licensing costs. Alongside that, the tendency for cloud providers to build new products on already existing functionality results in a billing model where you spin up a managed service database and pay for the underlying virtual machines (EC2 instances), attached hard drives, the licensing fee if using a system such as MSSQL, and the upcharge for using the provided service. If you are willing to pay a bit extra, or at least dig around in the options, some of the available parameters can be fine-tuned to eke out some more performance. For example, with EBS you can provision HDDs or SSDs. The SSDs will be faster, but have a slightly higher charge per GB and I/O used.
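The arithmetic behind that HDD-versus-SSD decision is worth making explicit. A sketch with hypothetical per-GB and per-IOPS rates (real EBS pricing varies by volume type and region, so treat the numbers as placeholders):

```python
# Hypothetical monthly cost comparison between HDD- and SSD-backed volumes.
# The rates below are made-up placeholders, not a real price sheet.

def monthly_volume_cost(size_gb: int, rate_per_gb: float,
                        provisioned_iops: int = 0,
                        rate_per_iops: float = 0.0) -> float:
    """Estimate a month of storage cost: capacity plus any provisioned IOPS."""
    return size_gb * rate_per_gb + provisioned_iops * rate_per_iops

hdd = monthly_volume_cost(1000, rate_per_gb=0.045)  # throughput-oriented HDD
ssd = monthly_volume_cost(1000, rate_per_gb=0.125,
                          provisioned_iops=3000, rate_per_iops=0.065)  # provisioned SSD
```

Even with placeholder rates, running this kind of estimate across realistic volume sizes makes the performance-versus-cost tradeoff concrete before anything is provisioned.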
There are a few other things to keep in mind when evaluating costs besides utilization. For example, with cloud services, you also pay for data transfer. The details of those costs vary greatly based on provider, but a good rule of thumb is that ingress (bringing data into the account) is free, while egress (taking data out of the account) will cost. Normally a fee can be paid to speed up larger transfers into the cloud as well, even going to the extent of having petabytes worth of data migrated using a semi-truck. On top of in/out-of-account data transfers, you will even pay for input/output on attached drive volumes. Going back to the EBS example, you pay for the number of I/O operations per second and the overall data throughput.
Another thing to keep in mind is that with long-term storage services like the different blob stores, there are also different tiers at which data can sit at rest. Most services at a minimum provide standard storage and cold (or infrequent access) storage. Cold storage is normally cheaper at rest, but might have extra fees and longer turnaround times attached when it comes to actually accessing the data. The benefit of cold storage comes from storing data that needs to be held onto, but rarely touched. Leveraging cold storage can definitely be powerful, especially for larger data lakes, but you want to make sure that extra charges for data access don't eclipse the savings from the cheaper storage fees. The particulars of how each service bills its users can vary, but the overall driver of cost is still the same: you pay for what you use in the cloud, and a heavy emphasis should be put on finding the most efficient way of doing things.
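That break-even between retrieval fees and at-rest savings is easy to sanity-check numerically. A sketch with entirely hypothetical rates:

```python
# Sketch: check that cold-storage retrieval fees don't eclipse the at-rest
# savings. All rates are hypothetical placeholders, not real price sheets.

def cold_storage_saves_money(size_gb: float,
                             standard_rate: float, cold_rate: float,
                             retrieval_rate: float, retrievals_gb: float) -> bool:
    """True if a month of cold storage (including retrievals) is cheaper."""
    standard_cost = size_gb * standard_rate
    cold_cost = size_gb * cold_rate + retrievals_gb * retrieval_rate
    return cold_cost < standard_cost

# 10 TB archive, touched rarely: cold storage wins
rarely = cold_storage_saves_money(10_000, 0.023, 0.004, 0.01, 100)
# Same archive, but a third of it re-read each month: retrieval fees
# can erase the savings entirely
often = cold_storage_saves_money(10_000, 0.023, 0.004, 0.01, 30_000)
```

Plugging actual access patterns into a model like this before tiering a data lake is far cheaper than discovering the retrieval bill after the fact.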
You should avoid making a common mistake with the usage/cloud cost-focused mindset: the most efficient way of doing something does not solely revolve around the compute process, but also involves (and arguably is mostly influenced by) the human interaction between development, testing, and maintenance of said systems. Not to mention all the other ancillary business-related influences and upstream/downstream impacts that come along with building out enterprise data systems.
When choosing a tech stack in the cloud (or making any business/tech-related decisions, really), there are always tradeoffs to consider. In some instances, it might make sense to go with a more managed solution if the team lacks expertise in what is needed to maintain something a bit more hands-on. However, by using a tightly managed service, you tend to lose a bit of the flexibility that would come along with a more hands-on approach. Another consideration is what happens when the team outgrows the hand-holding from the more managed service. Porting some of the logic from that system to something more hands-on may prove to be a big enough lift that skilling up the team to use the more hands-on approach would have been more efficient from the get-go. These kinds of considerations should always be in the minds of the technical leaders and architects on a project, alongside the most efficient ways to process, store, and expose certain datasets, tooling, and so forth.
In addition to the team's technical skill, other considerations must be kept in mind when working within cloud environments, which if ignored can end up being very costly in both time and money. One of the more prevalent considerations is the rate of change of cloud technologies. Services provided by the cloud providers are constantly changing and evolving, due to many of these services and technologies being relatively young in the grand scheme of things, the ease of pushing updates via the internet, and the now predominant software as a service (SaaS) business model. With older versions of services becoming unsupported at inconsistent rates, and in a manner where the changes are not reversible unless there is a huge backlash from one or more major customers, the rate of change can be overwhelming for some teams. Sometimes it is possible to pin versions of certain tech, as with EMR and Spark versions, but sometimes the EMR service itself will change. Not only can the runtime change (it can be pinned to an extent, but will eventually become unsupported), but the UI, cluster APIs, and management tools themselves can also change, and those changes mostly cannot be pinned/reverted. This is not even taking into account the chance of an integrating service changing and how that will impact the base service. These changes, which become even more pronounced and impactful the more managed a service is, can take a huge toll on development teams if they are not well prepared.
Communication from the cloud providers has not always been consistent when it comes to changes to their platforms, so even teams that constantly monitor all of the different sources of information about the platform (from blog posts, to news articles, to dev Twitter accounts, newsletters, and alert notifications) can be caught off guard by unexpected changes and updates that result in breakages/unexpected behavior. Furthermore, the cloud provider engineers are also human and make mistakes/break functionality in prod from time to time. These issues aren't relegated to just the services provided by cloud providers but, as anyone in the dev community can attest, also extend to any open source technology used by either the provider or the development teams themselves.
When it comes down to it, selecting the right tech stack or individual technologies is a balancing act. You have to weigh the project requirements, the team's expertise, the project's appetite for change, the need for mobility of logic, processes, and data, as well as the overall cost in both engineering hours and service fees to really get an idea of which services best fit specific use cases. Even once all those items have been taken into consideration, you should reevaluate when new technology is released, when older/current technology is updated, or even every year, to see if the project is truly being efficient in the way it utilizes the available cloud services.