Chapter 14. Myths of Cloud Computing

Steve Francia

Myths are an important and natural part of the emergence of any new technology, product, or idea, as identified by the hype cycle. Like any myth, technology myths originate in a variety of ways, each revealing intriguing aspects of the human psyche. Some come from early adopters, whose naive excitement and need to defend a higher-risk decision introduce hopeful, yet mistaken, myths. Others come from vendors who, in their eagerness, over-promise to their customers. By picking apart some of the more prominent myths surrounding the cloud, we gain a better understanding not only of this technology, but hopefully of the broader ability to discern truth from hype.

Introduction to the Cloud

In some ways, cloud computing myths are easily among the most pervasive of all technology myths. Myths about the cloud are quickly perpetuated through a blend of ambiguity of what “the cloud” actually means and the excitement surrounding the hype of a new technology promising to be the solution to all our problems. As the hype around “the cloud” grows, each new vendor adopts that term while simultaneously redefining it to fit their product offerings.

What Is “The Cloud”?

For the purposes of this text, we will be using the term “the cloud” to refer to virtualized nodes available on elastic demand as provided by vendors like Amazon’s EC2, Rackspace, Microsoft Azure, Joyent, and more. Even with this somewhat restricting definition, there are significant differences between the different vendors.

The Cloud and Big Data

You may be wondering what cloud computing has to do with big data. A significant percentage of companies today are using cloud computing, and that number is increasing daily. While some positions allow a data scientist to leave infrastructure entirely to a dedicated team, in many jobs she may be responsible for it herself; in a startup, that’s quite likely, at least to some degree. In any job, some knowledge and awareness of infrastructure strengths and best practices will benefit the diligent data scientist. It’s natural to think that the infrastructure isn’t her problem, but in a smaller firm a data scientist may have to make decisions about storage, and when the data disappears, it’s everyone’s problem. The hope is that by understanding these myths, and through them the strengths of cloud computing, the astute data practitioner will be able to leverage the cloud to be more productive while avoiding disasters along the way.

I’m going to take a slightly different approach from the sections you have already read. Rather than sharing a single experience, I’ll be sharing with you many experiences to which I’ve been privy courtesy of working for 10gen. 10gen develops and supports MongoDB and as a result, we benefit from sharing experiences with our customers, many of whom are on the cloud. In order to protect their privacy, I have taken some isolated experiences and woven them into a cohesive story about a fictional startup. To quote Dragnet, “The story you are about to read is true; only the names have been changed to protect the innocent….” I’ve also done my best to abstract the specific technologies and vendors used to their root principles, as the experiences included could have easily been had across any of the cloud vendors.

Introducing Fred

The central character in our story is Fred. Fred is the CTO of a six-month old data driven fictitious social startup called ProdigiousData. He and his team have finished their initial prototype and are about to launch the product. They recognize that while their immediate needs are small, with a small degree of success they will have big data needs very soon. Fred decides that they will launch their product on the cloud due to its easy ability to scale to handle their big data needs. Fred, and more importantly his CFO, are excited about the low cost that the cloud provides, especially without needing to purchase anything up front.

At First Everything Is Great

After months of preparation, sweat, and lots of coffee, the launch happens and it’s an immediate success. All the important things are happening just right. User growth is steadily increasing and more importantly people seem to really like the product. The system they have designed is quite capable of handling the load. Fred and his team couldn’t be happier.

They Put 100% of Their Infrastructure in the Cloud

Under the time and pressure constraints surrounding a startup that hadn’t yet launched, they decided to go with a fairly simple and straightforward infrastructure. They are using all cloud-based machines and services, from the load balancer and firewall to the database and data processing nodes. The current makeup is two smaller machines running software load balancers in an active-passive configuration. The load balancers distribute requests to three application nodes. At the back, they have a pair of database nodes configured in a master-slave setup. They feel they have eliminated any single points of failure and their virtual cluster is optimally utilized.

As Things Grow, They Scale Easily at First

As the load increases, they are able to stand up another app node with ease. They simply clone an existing node that is running and within minutes they have additional capacity. This is the horizontal scalability that they were expecting.

Then Things Start Having Trouble

A couple of weeks go by before they have their first blip. It’s manifesting in some users getting timeouts. It’s pretty irregular, but it definitely has Fred worried. The fact that they haven’t yet set up a sophisticated monitoring system keeps him up at night, but with only a handful of machines, it never seemed like much of a priority. Upon inspecting the application logs, they discover the problem is the application is hanging on database operations. The CPU on the database machine is working around the clock. Load is in the double digits. They reboot the database and boost the specs on the virtual node on which it’s running. They jump up to the largest size available, increasing the number of cores and RAM on the virtual machine. While the team thinks the problem is solved, Fred knows better. He knows that they haven’t solved anything, but simply delayed the inevitable. At some point, that larger database node will also reach a point of saturation and at the rate their load is increasing, it’s going to happen soon.

They Need to Improve Performance

In an effort to delay this even further, they begin to optimize their database. While their dataset is growing in size, its use is growing far faster: they are doing many more reads and writes than they expected at this stage. They need to find some way to increase database performance. Running iostat on the database nodes is very telling: their IO performance is poor and seek times are worse. They’ve gone with a popular cloud provider whose local storage is ephemeral; data persistence is achieved via networked storage. As a result, their durable block stores have slower performance and less predictable throughput than a local disk would.

Higher IO Becomes Critical

Fred’s no rookie. He knows that to increase IO performance you either need faster drives or more of them. Since his provider only offers one tier of drives, they attach four volumes configured in RAID 10. RAID 10 gives them the best of both worlds, doubling read and write performance while providing full redundancy. After the change is made to both database nodes, things stabilize for the most part. Now that the fire is out, they set up a more sophisticated monitoring system, one that not only provides better diagnostics by tracking stats and graphing them over time, but also alerts when certain conditions and thresholds are met. It’s a lot of work, but this initial scare was a wake-up call that they have been flying blind, and they won’t likely be this lucky next time.
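
A rough back-of-the-envelope model of what the team was counting on might look like the following sketch. The drive numbers are hypothetical, and real throughput on shared networked storage will fall well short of this ideal:

```python
def raid10_estimate(n_drives, drive_read_mbps, drive_write_mbps, drive_capacity_gb):
    """Idealized RAID 10 estimate. Reads can be served by any drive,
    so in the best case they scale with all drives; every write goes
    to both sides of a mirror, so writes scale with half the drives;
    usable capacity is half the raw total."""
    assert n_drives >= 4 and n_drives % 2 == 0, "RAID 10 needs an even drive count, at least 4"
    return {
        "read_mbps": n_drives * drive_read_mbps,
        "write_mbps": (n_drives // 2) * drive_write_mbps,
        "usable_gb": (n_drives // 2) * drive_capacity_gb,
    }

# Hypothetical numbers: four 100 MB/s, 500 GB networked volumes
est = raid10_estimate(4, 100, 100, 500)
```

On shared multitenant volumes, the practical gain over a single volume is usually closer to the doubling the team saw than to this best case.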

A Major Regional Outage Causes Massive Downtime

Seemingly out of nowhere, disaster strikes. A regional outage occurs with their cloud provider. They are completely offline, and it provides them little comfort to know that it’s not just them, but also some other fairly notable websites. Hours go by without any information other than the occasional status update from their vendor. Ten hours later the provider is back online. This is a very short-lived victory, for only when they try to bring their machines back online do they realize their disaster has just begun. They haven’t yet automated the build of each machine, and now isn’t the time to do it. Because each machine was ephemeral, with this full power outage they lost all their configuration and are more or less starting from scratch. They manually configure each machine: first the app servers, and then the database. Luckily the data is there, but the database won’t start. It’s complaining that it shut down uncleanly (duh) and the data files need to be repaired. After a lengthy repair, it looks like all their data is there, and 21 coffee-filled hours later they are back online. They have learned that managing nodes in the cloud requires a lot of work and that automation is essential. While an outage could have just as easily happened at a data center, there is no question that if they had an account at a data center they would have had more feedback from their account manager. A dedicated data center would be working with them to bring their machines back online, and of course every hosting provider offers persistent storage, so the restoration of their infrastructure would be trivial. They certainly wouldn’t have needed to rebuild all their nodes.

Higher IO Comes with a Cost

In the weeks that follow, no real issues occur. With all that has happened, they aren’t taking chances. They are keeping a close watch on their infrastructure, especially the database. With RAID 10 and monitoring in place, they know that for the most part they are in good shape. Over time they begin to notice some strange behaviors that they struggle to explain. Overall performance has increased dramatically, but one particular operation has actually degraded: the nightly bulk import is taking longer than it did before, even though the amount of data imported is roughly the same. After crunching a bunch of data in an external system, they load the data in a large batch. This behavior contradicts all their expectations, and they struggle to make sense of what’s happening. Google searches produce some forum answers, but no clear explanation emerges. They reason that given the somewhat random high latency present on these multitenant network drives, doubling the throughput and bandwidth should have hedged against these issues. After spending a lot of time trying to diagnose, including trying to launch their cluster in a different region, they eventually abandon their pursuit, assuming it’s either an issue with their monitoring or a problem without a solution. They accept this degradation because overall performance has increased; most operations have improved in measurable and expected ways.

Data Sizes Increase

As user growth increases, the data increases even faster. They proactively realize that they will need to partition their data across multiple machines, as they cannot sustain growth on one pair of servers. They knew this all along; in fact, they planned on it, but with the recent outage they have adjusted their approach.

Geo Redundancy Becomes a Priority

One of the nice features of the cloud is that cloud providers seamlessly provide hardware and services in many regions or availability zones. ProdigiousData realizes that to achieve their desired uptime, they need to be in at least two zones. They have now leveraged Chef to quickly create nodes.[71] They can easily create load balancers and app servers in the new region, as those are predominantly stateless, but what about the database servers? How do they replicate or partition the data effectively? They do a quick test and find there is about a 0.250 ms latency between the two different regions, and at times it’s considerably higher.
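
A quick test like theirs can be sketched as a TCP connect-time probe. The host name shown in the usage comment is hypothetical, and connect time is only a crude stand-in for a proper tool like ping:

```python
import socket
import time

def tcp_rtt_ms(host, port=22, samples=5):
    """Estimate round-trip latency by timing TCP connection setup
    (one handshake is roughly one round trip). Crude, but it needs
    no root privileges, unlike raw ICMP ping."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        rtts.append((time.perf_counter() - start) * 1000)
    return min(rtts)  # the minimum filters out scheduling noise

# Hypothetical usage against a node in the other region:
# rtt = tcp_rtt_ms("db1.region-b.example.internal")
```

Taking the minimum of several samples matters on virtualized nodes, where scheduling jitter can inflate any single measurement.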

Horizontal Scale Isn’t as Easy as They Hoped

They come to the realization that this is going to be a lot of work. While the stateless application nodes scale effortlessly, the stateful database nodes are far less portable. Even though their underlying database technology makes it quite easy to add more nodes to the database cluster, each new node begins empty; data needs to be migrated from existing nodes to the new one. Beyond that, they need to worry about where to place the nodes for maximum performance and minimum downtime. Fred concedes that for their business needs, they can survive with slightly stale data, as caused by replicating over the WAN from nodes in one region to nodes in another. They place some app nodes in each of the two locations and database nodes in each, but in a creative way. The data is partitioned into geographical regions and then evenly distributed across the different nodes in each region. Each primary node (the one accepting writes) then replicates to two different nodes, one local to that region and one in the other region. The application writes and reads to the locally writing database and reads from the stale local data that was replicated from the other region. Setting all of this up in an automated fashion was a big task that took weeks. Unfortunately, software doesn’t currently exist to both set up and coordinate efforts across many different nodes playing different roles in a cluster.

Costs Increase Dramatically

It wasn’t easy, but they got there. They now have a pretty scalable application running in the cloud. It has multiple-location redundancy and is even optimized to route users to two different availability zones depending on their location. Everything is going well…well, until Fred gets the bill for the month. Something must be wrong; he never paid this much in a month with his dedicated colocation hosting company. Fred began to think of all the things they had added: multiple locations, three copies of each node, six copies of every piece of data (2 per RAID 10 mirror × 3 replicated database nodes). They had also maxed out the configuration on each of those nodes. He did a cost analysis against an old statement and discovered that running on the cloud often required more nodes and resources to achieve similar performance. While the cost per node was often cheaper, a cloud node and a server running on hardware customized for a task were not the same thing.
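
Fred’s copy-counting arithmetic can be written down directly. The per-gigabyte price here is a made-up illustration, not any vendor’s actual rate:

```python
def data_copies(raid_mirrors, db_replicas):
    """Every database replica holds a full logical copy of the data,
    and RAID 10 mirrors each of those again at the block level."""
    return raid_mirrors * db_replicas

def monthly_storage_cost(logical_gb, price_per_gb_month,
                         raid_mirrors=2, db_replicas=3):
    """Total billed storage: logical data multiplied by every
    physical copy the topology implies."""
    return logical_gb * data_copies(raid_mirrors, db_replicas) * price_per_gb_month

# 500 GB of logical data at a hypothetical $0.10 per GB-month:
cost = monthly_storage_cost(500, 0.10)  # 6 physical copies -> $300/month
```

The multiplier is easy to overlook because each layer of redundancy was added for a good reason at a different time.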

Fred’s Follies

Does Fred’s story sound familiar? Perhaps it reminds you of your own. What myths did Fred fall prey to and how can you best learn from Fred’s follies to avoid these pitfalls yourself?

Myth 1: Cloud Is a Great Solution for All Infrastructure Components

For a few years now, the cloud has been billed as the future of infrastructure. Marketers use such language as “Enterprise data centers will be largely replaced by cloud computing within 20 years” and “Public cloud computing offers many incredible possibilities, like the prospect of doing supercomputer-level processing on demand and at an incredibly low cost.”[72] Consumers have largely been taken in. They view the cloud as an easy solution for all their infrastructure needs. In a world where software is able to emulate nearly any hardware, people are often using the cloud for everything because they can, often in blissful ignorance. The lure of being able to stand up a load balancer, firewall, or RAID controller without any expensive hardware entices many.

How This Myth Relates to Fred’s Story

One place where virtualized nodes can’t come close to the performance and functionality of dedicated hardware is load balancing. While expensive up front, an F5 or Stingray will produce amazing results and is much easier to configure than any purely software tool. Both have fully redundant options and work well when distributing load across multiple locations. It is true that some vendors provide load balancing offerings, but none offer the flexibility or performance that one can obtain through hardware. Additionally, they are all designed to load balance between the Internet and your cluster, not for uses within the cluster.

Myth 2: Cloud Will Save Us Money

Let me begin this section by stating clearly that using the cloud effectively can result in cost savings. Additionally, from a purely financial perspective, when using the cloud instead of your own equipment, the expense switches from a capital expenditure (CapEx) into a more flexible operational expenditure (OpEx). This appeals to many CFOs, which makes the CTO/CIO look like a financial genius. While this doesn’t make sense in every situation as there are still some times when years of CapEx tax depreciation are preferred, for a majority of companies OpEx is preferred over CapEx.

To better illustrate this point, I’d like to use a simple analogy. There are three common ways to obtain a car: you can rent, lease, or buy. Each has its own place, and to many people it’s pretty straightforward which one makes the most sense for their situation. If you are in a place for a few days, renting a car makes the most sense. If you want to reduce your upfront cost and intend to use a car longer term, then leasing makes a lot of sense. If you are using a car longer term and want to completely customize it for your situation, purchasing makes the most sense. The world of computing presents these same three options: you can rent (cloud computing), lease (managed hosting), or purchase (colocation/data center). Similar logic applies here. If you intend a node for long-term use, purchasing will save you significantly over renting.
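
The rent-versus-buy decision can be framed as a simple break-even calculation. All the prices below are hypothetical placeholders, and the model deliberately ignores admin time, depreciation, and hardware refresh:

```python
def breakeven_months(purchase_price, monthly_colo_fee, cloud_hourly_rate,
                     hours_per_month=730):
    """Months after which an owned, colocated server becomes cheaper
    than an equivalent always-on cloud node."""
    cloud_monthly = cloud_hourly_rate * hours_per_month
    monthly_saving = cloud_monthly - monthly_colo_fee
    if monthly_saving <= 0:
        return float("inf")  # renting never costs more; keep renting
    return purchase_price / monthly_saving

# Hypothetical: a $6,000 server with a $150/month colo fee,
# versus a $0.50/hour cloud node running around the clock.
months = breakeven_months(6000, 150, 0.50)
```

For a node that runs 24/7, the break-even tends to arrive well within the hardware’s useful life; for a node you only run a few hours a day, it may never arrive at all.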

One place where this analogy fails is that in cloud computing, you aren’t given a lot of choices. You can’t increase the number of CPUs and amount of RAM independently of one another. While in time it’s quite possible that more levers will be made available by vendors, historically, options have been quite limited. Choices have been boiled down to the most simple terms, such as small, medium, and large. In our analogy, it’s more like you can rent any car as long as it’s a sedan. If you need to carry more than five people, you can rent two. You can rent a fast sedan or an economical one, but they are all sedans.

Often, running application servers that can easily adjust to needs by scaling up or down the number of nodes can produce real savings. But savings require you to manage and adjust the number of nodes appropriately.
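
That “manage and adjust” loop can be as simple as the following rule of thumb. The thresholds and limits here are arbitrary illustrations, and a production autoscaler would also add cooldown periods to avoid flapping:

```python
def target_node_count(current_nodes, avg_cpu, scale_down_below=0.30,
                      scale_up_above=0.70, min_nodes=2, max_nodes=20):
    """Step the application fleet up or down one node at a time based
    on average CPU utilization, with a floor for redundancy and a
    ceiling for cost control."""
    if avg_cpu > scale_up_above:
        current_nodes += 1
    elif avg_cpu < scale_down_below:
        current_nodes -= 1
    return max(min_nodes, min(max_nodes, current_nodes))
```

The savings come from the scale-down branch: a fleet that only ever grows ends up costing more than a fixed set of purchased machines.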

For many uses, you could drastically reduce the cost by optimizing the hardware for your needs. One such case is with databases. On the cloud, the value (and performance) is often just not there. Databases benefit heavily from high IO with low random seek times. SSDs in a RAID 10 configuration powered by a reliable high performance hardware RAID controller will produce amazing results untouchable with any cloud configuration. What’s worse here is that to achieve acceptable performance on the cloud, you’ll end up spending a lot more on faster drives and more nodes, and consequently more replicated nodes for high availability.

This principle applies to all databases, relational and nonrelational alike. Some of the newer databases like MongoDB have been designed with the cloud in mind. In the case of MongoDB, the database utilizes memory-mapped files as a cache to alleviate many read and write operations from the slower IO present on the cloud. This is an extremely beneficial improvement, but eventually all data needs to be read from and written to disk. As a consequence of this memory management style, money spent on hardware for MongoDB is better spent on additional memory than on more powerful CPUs. Unfortunately, with the cloud you can’t adjust RAM independently from CPU, and most software and services don’t have a linear relationship between their processing and memory needs.
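
One way to act on that RAM-first advice is a quick working-set capacity check. The 80% headroom factor and the numbers are assumptions for illustration, not a MongoDB recommendation:

```python
import math

def nodes_for_working_set(working_set_gb, node_ram_gb, headroom=0.8):
    """For a memory-mapped storage engine, performance degrades sharply
    once the hot working set no longer fits in RAM, so partition the
    data across enough nodes that each shard's share fits with some
    headroom left for the OS and connections."""
    usable_gb = node_ram_gb * headroom
    return math.ceil(working_set_gb / usable_gb)

# Hypothetical: a 200 GB working set on nodes with 68 GB of RAM
n = nodes_for_working_set(200, 68)
```

A calculation like this is exactly where fixed CPU-to-RAM ratios hurt: you end up paying for cores you don’t need in order to get the memory you do.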

How This Myth Relates to Fred’s Story

While the cost savings up front were significant, over time they found themselves spending more and more. When you factor in the overhead of managing more nodes than needed it’s impossible to make the claim that they saved money by running on the cloud. Fred, like many CTOs, flew under the radar on this one. The reality is that very few people are watching this closely enough and doing a cost comparison. Like boiling a frog, the temperature slowly rises and before you know it, you’re cooked. Sometimes the switch from a CapEx to an OpEx itself is enough of a selling point to justify the cost.

Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID

Perhaps this isn’t a myth as much as a misunderstood behavior. As mentioned in the previous myths, people put a lot of trust in software being capable of replacing hardware. And to some degree it can, from the perspective of meeting the minimum feature requirements. In some areas of computing, we have switched completely to software-based solutions. It used to be that sound processing was done entirely by add-in sound cards. I remember bragging about my SoundBlaster AWE64, which meant I could play 64 simultaneous MIDI instruments and, more importantly to me, any game on the market. While a few sound cards are still on the market, they have nearly all been replaced by software-based solutions with negligible impact on performance. It would seem that RAID would be in the same boat: it’s a fairly low-level, but simple function, and Linux’s md feature provides virtually all the RAID functionality offered by the various hardware vendors.

How This Myth Relates to Fred’s Story

In our story, Fred and the ProdigiousData team fell prey to this myth. They put all of their infrastructure on virtualized nodes, using software-based solutions for many things that would have previously been done using hardware.

Unfortunately, in many circumstances hardware simply trumps software. Recall in the story that odd degradation in performance when they increased their IO by switching from one drive to four in a RAID 10 configuration. They simply gave up trying to solve it, assuming it was just a fluke. It wasn’t a fluke; it’s easily isolatable and repeatable. It occurs as a result of Linux md (multiple device) functionality falling far short of the performance achievable by a good RAID controller. One area where this is quite noticeable is random-write performance, which is quite poor. Similar things happen throughout computing. Whether discussing hardware or software, we often trade the performance of native for the convenience of virtualized or interpreted. For example, compare dynamic languages to compiled languages, or iPhone apps (native code) to Android apps (virtual machine). In cases where performance matters, it’s often an expensive mistake to make this compromise.

See http://engineering.foursquare.com/2012/06/20/stability-in-the-midst-of-chaos/

Myth 4: Cloud Computing Makes Horizontal Scaling Easy

Dr. Werner Vogels, Amazon’s CTO, explained that building horizontally scaling, reliable, geo-redundant systems on cloud platforms “becomes relatively easy.” Of all the myths presented here, this is the most pervasive one about the cloud.

It is a common misconception that you can simply deploy your application to the cloud and it “just works.” It is commonly believed that the cloud eliminates the need for careful planning of how to scale an application, and that geographic redundancy and 24/7 global access are easy because you can fire up nodes in multiple data centers. This may be the vision of the future, but it’s certainly not the present. Without careful planning and the appropriate infrastructure, there are common factors among cloud providers that make horizontal scaling even more difficult.

How This Myth Relates to Fred’s Story

The ProdigiousData group learned this through many hard experiences. When all was said and done, they had a pretty robust (and complicated) infrastructure, and they earned it. They invested heavily into each solution and experienced their share of downtimes and sleepless nights getting there. After all their experiences, they didn’t regret what they had done, but wondered if there was a better way.

The fact that there is a cottage industry around making the cloud manageable should itself dispel this very persistent myth, and yet to no avail. Perhaps the reason this myth is so prevalent is that some myths drive us. Great accomplishments are often driven by great visions, and these optimistic goals define the future. Perhaps this myth isn’t a false tale as much as a vision we are all hoping comes to pass. It’s conceivable that in the near future, advancements in virtualization, operating systems, data storage, data processing, and the glue tying it all together will make this myth true; of all the myths above, it is the one most likely to come to fruition.

Conclusion and Recommendations

The idea of treating computing as a utility, commoditized and available at the flip of a switch, has revolutionized the industry. The gamble Amazon made years ago has been the most significant advancement in computing infrastructure of the last decade. As wonderful as this advancement is, it’s not the end-all of computing. While cloud computing has its obvious benefits, it’s still relatively early in its development, and rapid advancements from various vendors all trying to one-up each other complicate the market. Not only is it a rapidly emerging technology, but every technology has its sweet spot, and there is great power in using the right tool for the job. Cloud computing is fantastic for stateless, process-heavy jobs, such as most application servers; it has historically been weaker at jobs where state matters. Data processing typically falls in the middle of these two. For me, the ideal infrastructure would include the best of both worlds: easy management of stateful machines running on optimized hardware, connected via LAN to commoditized cloud nodes for application processing. It’s important to recognize that these cloud offerings are still infants in their life cycles. In time, as offerings develop and improve, it’s likely that the cloud’s current weaknesses will be erased and it will become an increasingly viable solution.
