CHAPTER 6: Planning for Performance and Reliability

This topic has always been a bit of a groaner. Everyone knows the importance of planning (who hasn’t heard the adage “failing to plan is planning to fail”?) but no one really wants to do it. Let’s face it, planning is simply not as fun as writing code. It’s very easy to come up with reasons why the planning can wait.

The problem with this approach is that by the time you start thinking that planning might have actually been quite a good idea, you’re already miles out to sea and heading off into even deeper water. So deep, in fact, that a sign proclaiming, “Here be monsters!” would not only not go amiss but might even be considered mandatory. Now you’re thinking that starting off with a plan might have been a swell idea. But if you’re not entirely sure where it all went wrong, it probably won’t be all that obvious what you can do to fix it.

Fear not! Help is at hand! As the famous Chinese saying goes, “if you want to know what lies ahead, ask someone coming back.” In short, the best way to avoid ending up waist-deep in a swamp is to speak to more experienced developers and ask them for help and guidance.

This chapter aims to provide a map that you can use to help steer yourself past the worst bits. Some of the things covered here will seem obvious, others hopefully less so. There’s no magic sauce here; all of this stuff is fairly straightforward. (But everything is straightforward once you know about it!)

We’ll start off with a very basic look at project management and how to plan your way through any project. It’s a lightweight process with a focus on giving you some structure to follow. You are most welcome to take what works for you and ignore anything that doesn’t. Feel free to enhance it in any way that suits your needs. What we provide here is really just a starting point; you should change whatever you need to change to make it as useful as possible to you.

The last section is admittedly a bit of a stretch for a chapter on planning in a book on performance. However, the number of sites we’ve seen taken down or severely damaged due to poor backups is really quite disturbing. So we’ll cover some of the basics of making good backups, the different types, and how to go about it.

yoU MAke DInner In TiME

As promised, we’re going to start off with good old planning and management. You might be tempted to skip this section. We strongly suggest that you at least skim through it! Even if you only pick up one new idea that rings true, it will pay dividends from now on!

Because project management can be boring and because few people really want to read huge tomes on the subject (certainly not us, anyway), we’re going to introduce a very powerful yet refreshingly simple system. As luck would have it, the system has an easy to remember mnemonic of

yoU MAke DInner In TiME

which breaks down to

Understanding
Making Decisions
Design and Implementation
Installation
Testing
Maintenance
Evaluation

We cover each of these points in the following sections. Although it looks pretty full on, it’s really easy to use; quite honestly we’ve never found anything better! Although these ideas are mostly applied to development projects, they can also be easily applied to developing your system for performance. In fact, you can apply these techniques to pretty much any project successfully. The idea is that with this roadmap you’ll stray from the path less than if you just winged it (which is what most of us tend to do). So even a passing familiarity with this stuff will likely be quite helpful in the long run.

yoU

Geeks (or computer professionals, if we want to be politically correct—but we don’t) are often a little trigger-happy when it comes to problem solving. There is nothing like a good problem to stretch those mental muscles, especially as we spend most of the day explaining where the Start menu is and why when you left-click you really do need to click the left mouse button and not the one on the right. Perhaps this is why we tend to jump into a problem with both feet—to recover from the mental starvation we have been suffering. Whatever the reason, this approach has a nasty habit of either backfiring immediately or helping us to paint ourselves into a corner.

Now, before anyone else says it, we know that sometimes those really simple problems really are that simple. Sometimes when we say “It will just take a minute,” it actually does. More often than not, though, what was described as a really easy problem turns out to be a rat’s nest of tangled requirements, dependencies, and confusion. And let’s not forget deadlines, which are usually as realistic as founding a unicorn ranch staffed by fairies. The problem is that most of the time we find out about the hidden hell under the carpet once there’s no possibility of turning back.

Fortunately, there is a solution. Before jumping into the pool of project happiness, have a look around first. Speak to people and try to understand more about the situation. If you find out during your conversation about the piranha-infested swimming pool, you will be very grateful you didn’t decide to test the waters using yourself as bait.

Joking aside, it’s important to get a thorough understanding of the problem before you get started. This invariably starts with discovering whether the problem you’ve been given is actually the problem you want to solve.

For example, we were working on a project where the user wanted us to put together a solution using OCR (optical character recognition; see the note for more info) to scan in the student number on the 5,000 forms she received on a regular basis. She currently had to type those in by hand. This is the bit where we had to resist the urge to Google for the latest OCR libraries. Instead, we asked her questions to try to get to the root of the problem. Why did she want to scan the student numbers? Well, she wanted to save time inputting all the forms. What were the forms? Why was she inputting data? The forms were student confirmation forms that the college sent out to every student at the beginning of term. The data she was inputting was whether or not the form had been returned.

Note Optical character recognition is something the computing industry has teased us with for many years. It goes right along with voice recognition software in that it offers a lot but often falls short of your hopes and expectations. OCR software has greatly improved (as has voice recognition software, to be fair) but it is far from perfect. In effect, the computer has to look at a picture and work out whether it contains letters or words. This is exceptionally easy for a human but very challenging for a machine. When you consider all the different forms the letter “e” might take, and how each person’s handwriting is unique (and there are thousands of computer fonts), it’s easy to understand why there could be some mistakes. Plus recognizing a character at any angle is easy for a human but requires vastly more effort from a computer, so the potential for mistakes is quite large. When you need to reliably enter information (in this case a student ID number), OCR is not an ideal solution; in fact, it’s something you’d want to avoid.

This gave us some useful information right away. First of all, the forms were generated in house. Second, all she was doing was typing in a student number and pressing Enter. So we spoke to the people who created the form. They had a little application that queried the student database to produce a PDF form ready for printing. We asked if they’d be averse to us adding anything to the form, and they were quite happy to oblige as long as we didn’t alter the data on it.

The solution: a bar code on each of the printed forms. Our user then simply scanned the bar code to confirm that the student had returned the form.

Had we simply taken the user’s problem as stated and run with it, we’d have built a complex OCR system and still the user would have had to at least double-check what had been scanned. Even if the application worked perfectly (ever seen OCR work perfectly all the time?), it would still be far inferior to the bar code scanning solution. By taking the time to look deeper, we were able to uncover other options and, in this case, a better solution. You won’t always uncover a better solution but at least you can be more confident that the solution you are proposing is the best one.

The same issues arise when you have great ideas of your own. It is very easy to have a fantastic idea and sit down to build it without truly understanding what it is you’re building or why. What makes this idea so awesome? What problem are you trying to solve with this idea? Is there a better solution already available? Don’t waste time trying to reinvent the wheel.

This doesn’t mean you can’t be creative. In fact, just being technical is not enough; you need that spark and curiosity to be a great developer. But you also need to be smart. You need to stack the odds in your favor so that the chances of making your latest project a success are as high as possible. Sure, you can beat the odds just by being awesome, but why not take all the help you can get?

So, where do you start? You start by asking questions. You ask as many people as many questions as you can come up with. Dig deeper into the problem. People have a tendency to see the solution in terms of the problem rather than thinking about the best solution overall. In our earlier example, by digging deeper and finding the underlying issue (how to quickly record a student’s return), we were able to come up with a much better solution.

In other words, once you know the problem, take a step back and decide whether this is the underlying problem or whether it’s just a surface issue. If you’re not convinced it’s the underlying issue, keep on digging until you’re sure!

Tip Emanuel Lasker, world chess champion for an unmatched 27 years, once said “When you see a good move, look for a better one.” If you’ve come up with a fantastic solution, keep looking for an even better one. The more great solutions you come up with, the more likely you are to find the very best solution of all.

Once you’re done asking all your questions, you need to do some research. This doesn’t have to be heavy-handed or involve hours staring at a screen. The idea is to take a look around and see how others have solved the problem you’re now facing. If it’s landed on your desk, chances are it has landed on someone else’s desk at some point in time. If you’re really lucky, they will have written up a nice blog post telling you what to avoid and how to fix the problem. Even finding a post to an e-mail list with the exact problem but little in the way of answer tells you something about what you’re trying to do (most likely that it’s not going to be an easy ride). The Internet is a wealth of information; in a very short time you should be able to dig up some useful information on solutions, libraries, or tools that might help you.

By this stage, you should be confident you’ve identified the real problem and you should have some ideas on how it has been solved (or not) elsewhere and what tools might help you build a solution. Armed with this information, you’re ready to go on to the next stage!

Wait! Before you go running off to work on your design, take a minute to think about how you see this project ending.

Have you covered everything you need to cover?
Can you see in your mind’s eye how you want this to play out?
Based on what you’ve found so far, is your master plan a good fit?
How hard will it be to back up and maintain this solution in the long term?
Are you painting yourself into a corner without even realizing it?

Before you move on, take a moment to think about what you’ve decided and check that everything is still on track.

MAke

This stage is where you need to make some decisions. You know what needs to be done and you have a few ways that you might go about doing it.

Although this section is about making decisions, you first need to think about the constraints your project is facing. The most common constraints are time and cost but can also include things such as available hardware, operating systems, or a particular programming language. In the example from the last section, one of the criteria was that we use the .NET platform, which was the college’s standard. This meant that the library we wanted to use (which was only available for Python at the time) was not an option. There were also cost restrictions, which affected our ability to buy a commercial library. From these restrictions alone it became clear that we’d need to find a .NET-based open source library or, failing that, we’d have to write our own.

No matter what solution you come up with, it has to have an overall positive effect. If it costs a lot of money to run, that might be acceptable if it quadruples the number of users. If it takes six months to build, that might be okay if over a period of years it will generate vast amounts of income. These criteria are like shifting sands; they are different for every project and can actually change during the project’s lifetime.

This is why it’s important to understand the goals and constraints of a particular project so you can decide the best course of action. As the reigning professional, it is up to you to look at the pros and cons and then start making decisions. This can either be very easy or it can pose a serious challenge with trade-offs. You must also consider the possibility that there might be no feasible solution for a given problem and set of constraints. Many of the failed projects we’ve seen didn’t fail because of poor design or programming; they failed because they tried to do the impossible.

We don’t want to talk too much about decision making because you’ve probably already done most of this (at least in your head) during the understanding stage and thus this task is really just a formality. Still, it does help to get things down on paper and to make sure that you’ve covered all your bases. Once you’ve got this section down pat, it’s time to move on to design.

Or not! It’s that time again! Take a break and think about the issues we raised at the end of the last section. Once again, think through your decisions and think about how you see this project ending.

Are you still on course?
Has anything changed that you need to be worried about?

When you’re happy all is well, you can move on to design.

DInner

At last, it’s time to start designing and building the solution! This part is pretty much down to individual taste, and the methods involved are as unique as the problem itself. However, it can still be broken down into at least three separate parts.

First, you need to have some sort of design in mind. Although writing code is a form of design in its own right, if you start coding right off the bat, you might very well code yourself into a corner.

Once you have a solid design, it’s time to finally build it. There are various ways to approach this, and everyone has their personal favorite. We’re assuming your technique isn’t horrific, so we won’t focus too much on the prototyping stage.

Next, it’s time to configure the application and make it production-ready. When we say production-ready, we’re not suggesting that you should have something immediately deployable (after all, you haven’t fully tested it yet). However, when you finish this part of the process, your application should be in the same form that you see it being deployed in.

So how do you go about designing the solution? A lot of books at this stage would tell you the best way to go about the design process. We’re not going to go that deep into it because the best technique depends on your own experience and skill set. From your research, you should have a strategic view of what you want to build. Now you need to start filling in some of the blanks and fleshing out the design itself. Because you already have a good idea of what it will look like, and the tools and constraints that you have to work with, the design process can be pretty straightforward. In most cases, it’s just a matter of connecting the dots you created during the research stages.

But what if your overview doesn’t work quite the way you planned now that you’ve taken into account some of the finer details? If you need to change your strategy, it’s not a problem. At this stage you haven’t started building anything yet so you’re free to make whatever changes you see fit. This gives you a great deal of freedom to experiment and try a wide range of possibilities with little cost, risk, or waste of time. Once you write your first line of code, you are already creating a mental line in the sand that makes it much harder to go back and try new ideas. Even if you do decide quite quickly that you’re heading in the wrong direction, there is always that urge to keep on going. As soon as you realize you’re travelling on the wrong path, turn around and head back!

When you do push an issue back to the research stage, you should also fix it there. Whatever your new overview looks like, it should be supported by evidence from the research phase rather than a gut reaction. Keep your foundation solid so that you can always confidently justify your decisions.

Once you have a workable design, it’s time to build it. At last, you get to have some fun! As promised, we won’t talk about actually writing code in this section, but if you find you’re building something that is pushing the limits of your design, you should stop and review the design. If you find that your design needs changing or doesn’t quite fit your overall strategy, it’s time for a quick trip back to the research phase. Each trip doesn’t require a complete overhaul; you just need to confirm or refute any new issues.

So now you have your solution, one that solves the problem within the original constraints, is based on solid research, and is well planned and designed. Now you just need to configure it and set it up in the way you see it actually being used in production. Again, that doesn’t mean you take the solution live, but you want it to the point where it looks the same as if you were going to deploy it.

Wait! It’s time for yet another pause and a good think on what you just did.

Are you still on track?
Is all progressing as planned?
Have you made sure that your design takes into account your end goals and hasn’t been distracted by any cool technologies or great ideas for new features?

Stay focused on the goal. You can always tweak later, but solve the problem you set out to solve first.

In

Installation can be either exciting or terrifying, depending on how the other stages went. Either way, it’s time to get your application hooked up and ready to save the world. Of course, it’s not as easy as just switching your application on. You need to make sure that people know how to use your application and that they actually want to use it. Your project will be significantly more successful if the users are waving the flag for you. You also need to make sure that the migration and cutover from the old system happen as smoothly as possible.

Installation should be fairly straightforward because the production environment should be all but identical to the system you set up in the last section. Even when you’re building things for your own interest, you still need to make sure that when you install your solution, you do it right. This sounds obvious, but we’ve seen it happen more than once where someone has deployed their latest and greatest creation but forgotten about a key dependency or didn’t remember to change the database settings. Then the whole thing implodes when they go to fire it up—and we all know that first impressions are the most important!

So take your time with this stage. Make sure you know what your solution needs. Double-check all of the dependencies and configuration details. If your application needs to send e-mail, you want to find out well in advance, while you can still easily do something about it, that direct e-mail is blocked and that you must go through a relay. It is also helpful to have some form of test script that exercises all of the key points in your application. This might not be possible depending on the solution, but if you can use such a script, you should do so! Every little bit helps!
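As a rough illustration of the kind of check described above, here is a minimal pre-installation sketch in Python. It is not a method from this chapter; the function names and the three kinds of check (importable modules, required settings, reachable network endpoints) are our own assumptions about what a typical solution might depend on.

```python
import importlib.util
import os
import socket


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_settings(names, environ=os.environ):
    """Return the setting names that are absent or empty in `environ`."""
    return [n for n in names if not environ.get(n)]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(modules, settings, endpoints, environ=os.environ):
    """Collect readable problems; an empty list means every check passed."""
    problems = []
    problems += [f"missing module: {m}" for m in missing_modules(modules)]
    problems += [f"missing setting: {s}" for s in missing_settings(settings, environ)]
    problems += [
        f"cannot reach {host}:{port}"
        for host, port in endpoints
        if not can_connect(host, port)
    ]
    return problems
```

A call such as `preflight(["sqlite3"], ["APP_DB_URL"], [("mail-relay.example.internal", 25)])` (the setting name and relay host are hypothetical) returns a list of problems to sort out before go-live. It only proves that a relay accepts connections, not that mail is actually delivered, so it complements a fuller test script rather than replacing one.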

But even when your application is installed, you are far from finished. You should have a plan for switching out the old system with your new and improved solution.

Do you need to import any data?
Can you simply work with the existing database or do you need to set up something new?
Will people start using your application immediately or will you gradually phase it in over a period of time?

These issues aren’t unique to development; any solutions where you are making changes to critical systems (such as installing caches or load balancers) should be rolled out on a small part of the system rather than doing an en masse migration. Sometimes the softly, softly approach is your best bet. Sometimes, though, you really have to do it in one big jump, in which case you need to have a plan to follow if it goes wrong.
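One common way to phase a change in gradually is to send a fixed share of users to the new system and raise that share over time. The sketch below shows one way this might look, assuming each user has a stable string ID; the function names and the choice of SHA-256 for bucketing are illustrative assumptions, not something prescribed here.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_system(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < percent
```

Because the bucket depends only on the ID, a user who is moved to the new system at 10 percent stays on it when the share is raised to 50 percent, and setting the percentage back to 0 sends everyone to the old system again, which gives you a simple fallback if something goes wrong.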

These sorts of plans don’t have to be intricate. You don’t have to answer every single question that might come up. However, you should know what to do when a particular issue arises. In the world of financial trading, the markets are in a constant state of flux. When you take a position, you need to already know the conditions for getting out of that position. Then, when something goes wrong, you can react immediately before it gets any worse. If you haven’t considered the issue, then you freeze while you think about what to do next. Freezing can cost a trader an awful lot of money, just as not having a plan to deal with any issues in your system might damage your reputation. So, like the professional trader, have a good idea of what could blow up in your face and a rough idea of what to do about it.

The next thing to look at is training staff and getting their buy-in. Many people greatly underestimate the importance of getting buy-in from the people who are actually going to be using a project. It’s important to realize that an inconsequential change from a developer’s point of view can ruin a particular user’s workflow. Taking away their most relied-upon feature is not going to make them your friend. However, you can head this off before it even happens by talking to the users and showing them what you’re doing and why. Most importantly, when you get feedback, make sure you act on it. Even if you don’t agree with it, you can demonstrate why that idea won’t work or would cause problems. Show that you didn’t simply dismiss it out of hand. You might also get some excellent ideas that you never considered because you’re not working at the sharp end. If users are already lining up to use your solution before you’ve even released it, then even if there are teething problems you’ll find that your users are willing to give you an awful lot of goodwill and support.

This leads nicely into user training. The ideal piece of software requires little in the way of user training. Wherever possible, find someone to test your app—someone who has no idea what it is supposed to do or how to go about doing it. Ideally, even they should find the software easy to use, though whether or not they can actually make use of it is an entirely different matter. This sort of test can unearth glaring issues that the intended user of the software might simply work around or not consider worth mentioning.

Many solutions (such as deploying a web cluster) don’t involve training users. However, there will be administrators that need to know what you’ve done. Even if it’s just you running your own cluster, it doesn’t hurt to spend some time writing up documentation. It’s all clear as day to you now. But at 4 am six months from now when something catches fire, it won’t be nearly so fresh in your mind. You might need to really dig around to solve a problem that could have been solved with a single command, had you just remembered which one to use.

So training isn’t just for the users; it’s for anyone who might come into contact with your solution—and that includes you. (We won’t go on about the different ways of providing training as there are thousands of books on that already.)

Wait! It’s time for another sanity check. If you sanity check yourself each step of the way, you’re never going to get truly caught out; any issues that turn up are still relatively easy to fix. So take a look at the original plan, the original problem, and your solution.

Is it still focused?
Have you solved all the problems you intended to solve when you started out?
Have you drifted from the one true course?

If you’ve drifted a bit, no problem. Just bring it back on course. If you’re still on the right heading, it’s time to jump to the next section.

TiME

TiME breaks down into testing, maintenance, and evaluation. The type of testing mentioned here isn’t the same as the testing you did during the design and implementation phase. This testing is done throughout the life of the solution simply by using it. When looking at how it is used over a period of time, the stresses and problem points begin to make themselves evident. For example, maybe the solution isn’t fast enough or perhaps one of the components needs some tweaking. The issues that you pick up probably won’t make or break the solution, but they do highlight that it is not running optimally.

Once you’ve found a particular issue and you’ve determined that it needs to be fixed, you can do this as part of the maintenance cycle. The maintenance cycle is really used for two things: for fixing the issues that you’ve highlighted and for clearing out any cruft that has built up during daily use. For example, you might clean out any temporary files or tidy up the database—anything that works towards making the application run smoother.
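To make the "clearing out cruft" part concrete, here is a small Python sketch of a cleanup task that removes temporary files that have not been modified for a given number of days. The function names, the age threshold, and the dry-run default are our own assumptions; adapt them to whatever your application actually leaves behind.

```python
import time
from pathlib import Path


def stale_files(directory, max_age_days, now=None):
    """Return files directly inside `directory` not modified for max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds in a day
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def remove_stale_files(directory, max_age_days, dry_run=True, now=None):
    """Delete stale files unless dry_run is set; return the paths considered."""
    targets = stale_files(directory, max_age_days, now=now)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Defaulting to a dry run is a deliberate choice: you can review the returned list first and only pass `dry_run=False` once you are confident the right files are being selected. The sketch does not descend into subdirectories.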

Evaluation involves taking a step back and determining whether your initial solution solves the problem. Ideally you’d want to have solved it on the first try and then each maintenance cycle can improve or enhance the system in some way. Often that’s not possible and it requires a few tries to actually hit the mark.

The TiME cycle continues for as long as the solution is in use. For simple or mature systems, this cycle can be almost glacial; changes might happen perhaps every six months to a year. For newer systems, it might happen on an almost daily basis. When deploying a system into production, that’s rarely the last time you’ll ever touch it. Something almost always crops up that requires anything from a minor tweak to an almost complete rebuild (see the box on the Y2K bug if you’re not convinced). So while you shouldn’t sit around worrying about it, you should be comfortable with the idea that anything you build and deploy is quite likely going to need to change at some point.

Again, stop and think about what happened in this step and your ultimate goal. If your solution is designed to be easy to support and easy to maintain, and you’ve focused on this goal throughout the process (as you will have done if it was in your original plan and you’ve been sticking to it), then by the time you get to this stage, you will have a solution that’s easy to maintain and support. You reap what you sow, but a lot of effort goes into making sure you’re actually sowing the seeds you intended to sow (in the right place, at the right time, and hopefully in a field that you own). Ultimately, you will end up with a much better solution that’s much easier to work with if you take a break every now and again to check the compass to make sure you’re still heading in the right direction.

THE Y2K BUG

The Y2K bug (named after the year 2000) was probably one of the most hyped issues leading up to the new millennium. There were fears that nuclear missile silos would get confused and fire their rockets and that planes would fall out of the sky. Or that people receiving pensions would suddenly have a negative age and so stop receiving any payments.

All of this hysteria was caused by a very simple problem—a problem that at the time it was created was not a problem at all. In fact, it was a great way to cut costs, speed up processing, and make more efficient use of very expensive computers. It all came down to storing the year as a two-digit number instead of a four-digit number. For example, rather than storing 1983, the system would only store 83. Being able to halve the amount of storage required at a time when 10MB of memory was only found on million dollar computers was a good idea! It didn’t occur to anyone that this might pose a problem in the year 2000. After all, that was some 30 years away. Surely no one would still be using these systems by then!

Guess what? We were, and many companies didn’t really catch on to the problem until relatively late in the game. If you happened to know COBOL in 1997, you could (and many did) set yourself up for life updating the systems in a panicky haze.

For example, many systems would calculate the age of a person by taking the current year and then deducting the year of birth. To use the previous example, 2012 - 1983 = 29. However, if you just use the last two digits, the calculation becomes 12 - 83, and all of a sudden that person becomes -71 years old. How much do you charge a -71-year-old for a year of car insurance?

The solution was relatively easy, even though it was fiendishly tedious: simply lengthen all the data fields to support four-digit years and then go through all of the software and make sure that anything that touches dates has been updated to use the new system.
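The arithmetic is easy to reproduce. The following Python sketch is purely illustrative (the function names are ours, and real systems of the era were more likely written in COBOL); it contrasts the two-digit calculation with the four-digit one.

```python
def age_two_digit(current_yy: int, birth_yy: int) -> int:
    """Age as a legacy system storing two-digit years would compute it."""
    return current_yy - birth_yy


def age_four_digit(current_year: int, birth_year: int) -> int:
    """Age computed from full four-digit years (the lengthened fields)."""
    return current_year - birth_year


print(age_two_digit(99, 83))       # 1999 vs. 1983: prints 16, looks fine
print(age_two_digit(12, 83))       # 2012 vs. 1983: prints -71, the bug
print(age_four_digit(2012, 1983))  # prints 29
```

Note that the two-digit version gives correct answers right up until the century rolls over, which is part of why the problem went unaddressed for so long.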

So was there really cause for panic? As it turns out, 2000 came and went without much in the way of problems. However, the issue was known decades before and many years passed when this fix could have been made part of general maintenance, a fix that would have saved an awful lot of worry not only for companies but for their customers as well.

The Importance of Planning

And so ends our quick detour into project management. We tried to keep it as light as possible and avoid any of the current buzzwords. The most important thing we’d like you to take away is that having a plan or a structure to follow can really make a huge difference in any project.

For those still wondering why we’ve just taken half a chapter to talk about project management in a book on performance, it’s because many of the performance issues we see in the wild were caused by people not truly understanding the problem or not really thinking their designs through. This led them to create suboptimal solutions that couldn’t scale. By the time we were brought in, these systems were in production and were too expensive and critical to the business to replace, so we had to do the equivalent of a quadruple bypass to get the thing working at an acceptable level.

Had the developers for those projects sat down and truly understood what it was they were building and why, they would have been able to come up with a much better design, one that probably wouldn’t have caused the performance bottleneck in the first place. As the old saying goes, an ounce of prevention is worth a pound of cure.

You can make the argument that many applications are properly designed and it’s simply that they have far more users than the current infrastructure can support. But any well designed piece of software should be relatively easy to scale. Applying this project management technique to your new load balancing and caching cluster design will ensure that you deliver the best possible solution for the application, regardless of how well it was designed.

So this part of the chapter has been all about helping you to think of the big picture while keeping track of the details that tend to trip people up. Even IT professionals forget to do this far more often than we care to admit. At the very least, the ideas laid out here should help to give some structure to your ideas, even if you only take one or two of them to heart.

Although we’ve covered a good project management framework, we haven’t really looked at the issues with developing software itself. There are many interesting problems that really only apply to software engineering that other forms of engineering seem to have been able to avoid. For example, unlike building a skyscraper or a bridge, software is almost fluid in nature and has a tendency to sprawl in all sorts of unanticipated directions. When building a bridge, the distance between the two end points never changes. The distance between the ground and the deck of the bridge never changes. In software, core concepts can change all the time and frequently do. Imagine trying to build a bridge where the end points keep moving and the height is never the same two days in a row; it’s not hard to see why writing software can be so stressful.

Note It has been pointed out to us that the distance between two points can indeed change and that it is quite possible for the distance between the ground and the deck of the bridge to change as well. In fact, there are specific engineering solutions available to resolve both these issues. However, we feel our point still stands: at least when you build a bridge, you don’t suddenly find out you need to build an airport, which three weeks later becomes a skyscraper before finally (usually a day before the deadline) reverting to its previous bridge-like configuration.

When writing software, we make many design decisions. We decide where to put code and whether or not to use stored procedures in a database. Each of these decisions has an effect on the future. If we decide to use stored procedures now, what happens when our growth determines we need a new database, one that does not support our stored procedures? What happens when we need to add an API to our application and realize that the core business logic is spread throughout the whole system so adding an API will require almost a total rewrite? These things aren’t theoretical; they’ve happened to us, and we’ve seen many companies get caught in the same trap.

Hindsight is a wonderful thing. There are many things that we would have done differently had we known at the time how they were going to turn out. Until the geniuses in Cupertino release the iCrystalBall (we would also settle for CrystalBall+ from Mountain View or CrystalBallXP from Redmond) the future will remain mysterious and off limits. Because we can’t tell what will happen, many people unfortunately use this as an excuse not to worry about it. If something goes wrong, they reason, they will fix it then. The problem is that by the time it goes wrong, it is likely to be extremely difficult and expensive to fix (assuming it can be fixed at all).

So this section will give you a mini super power: you will know what problems might crop up! Now that you know about them, you can take them into account so that if they do occur, it’s a minor inconvenience rather than a full-on panic attack.

Backups

Unfortunately, making proper backups is one of those things that no one takes seriously until they lose six months’ worth of work. At that point, they curse themselves for not having proper backups and vow to do something about it (at some point).

If you’re one of the relatively few people who have awesome backups, feel free to skip the rest of this section. If you have tactical and strategic backups, you are already covered. If you’re wondering what a tactical backup is, read on!

Note Awesome backups are defined as backups that you are absolutely certain can be restored and are located in multiple places that you are absolutely certain you can get to in a form that you are absolutely certain will not be affected when you lose your production system. If this sounds like overkill, remember, they’re called awesome for a reason!

Why Backups Are so Important

Backups are important because some things are not replaceable. For example, if you have a web site with many large high quality images, chances are you will have many thumbnail images to allow for easy browsing. These thumbnails should be backed up as a matter of course, but if you were to lose them, it’s not a massive problem because you can recreate them from the original images. However, source code and other items are irreplaceable. If you lose the source code for an important class, there is no way to magically get it back. You have to reconstruct it from scratch, which is especially difficult because a single class is enhanced and improved over a period of time to add features and abilities that might not be obvious on the first attempt.

Imagine if you ran a very popular web site akin to Facebook and you lost all your source code. You want to make a simple change? Sorry, you can’t do that until you’ve replaced everything that has been lost. All those hours of research, testing, and planning? They need to be redone. If keeping the source code safe was your responsibility, you’re probably going to lose your job. You could be in a world of legal pain as well; losing a company asset potentially valued at millions of dollars is not a good thing.

You should make good backups for no other reason than you don’t want to retype and do all that hard work for a second time. Backups can save you typing and testing—and nobody wants to do any more of that than they have to!

There May Be Trouble Ahead

Computers have the very nasty habit of blowing up at the least opportune time—usually without bothering to tell anyone and leaving enough intact that a cursory glance won’t notice anything amiss. This usually happens just before a presentation or when you’re about to launch that awesome new feature you’ve been planning for months.

That much is a given, but even knowing this in advance (and every computer user has their fair share of horror stories filed under When Technology Goes Bad) few people take this seriously. Even people who have had their servers catch fire get caught out again by the same problem because for some reason they believed that lightning never strikes twice!

But it does, and it has a tendency to know which people will be affected the most. It doesn’t matter if you have the best hardware or your computer is brand new or you’re going to make a copy tomorrow night when you get home. You need to accept the idea that your system can fail at any time for any number of reasons. Once you accept this, you will begin to appreciate the enormity of the problem. Hopefully you’ll decide right here and now to sort things out so that you know you’ll be covered.

Backups are an insurance policy. Like an insurance policy, no one likes to fork out the premium, but everyone is very glad they did when it comes to making a claim. And everyone who didn’t pay the premium wishes they had!

IT REALLY CAN HAPPEN

We were working on a project to build a custom thin client server solution to be deployed across an entire city. It had a secure connection to a central server for filtering, user authentication, and sharing files, and each server had a fairly complex desktop setup with the latest and greatest software. Add to this a huge array of fine-tuning and weeks’ worth of experimentation and you have the prototype server ready for deployment.

After testing the new server thoroughly we were finally convinced that it was ready for prime time. We decided to have a cup of coffee before taking an image of the disk. It was 7 pm and we had to pick up another colleague from the airport 250 miles away at 6 am the following morning to help with the deployment. So far, so good; we could image the disk, get five hours of sleep, and all would be well.

It was while we were adding the sugar to our coffees that we heard a nasty crash. On returning to our office, the server was on the floor. This was somewhat odd as it had been firmly seated on the desk. As we went to pick it up, we could hear a somewhat pitiful “whirr-clunk” sound. For those not in the know, this indicated a shattered and totally useless hard disk. The fully tested and completed build was gone.

We didn’t get any sleep that night and amazingly we did have a working prototype to install the next day—but it was six hours of solid panic. Had we taken the image before we went to make coffee, perhaps we would have had a good night’s sleep.

We still don’t know how the server ended up on the floor, but it certainly demonstrates that bad things can happen without any warning. Had we made a backup, it would have been a minor annoyance. But we didn’t, so we nearly had a collective heart attack instead. Learn from our mistake! Make regular and complete backups!

Automation is a Must

If your backup system requires that you must actually do something, then we promise you that your system will fail you at the worst possible time. The reason is that humans are inherently lazy. We tend to put things off. Why stop to make a backup now when there’s just a few more lines of code to go? I’ll just watch the rest of this episode and then I’ll definitely make the backup! Then your machine crashes and you realize that the last workable backup you have is actually from three weeks ago, long before you made the critical changes you were polishing up today.

For a backup to be safe, it needs to be automatic. This can be run from a cron job on Linux or an application that runs all the time on your Mac. Whatever the solution, you really want it to take care of the whole process for you. That way, regardless of what you’re doing or thinking, your work will be saved.
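To make the idea concrete, here is a minimal sketch of the kind of script a cron job could run. It is only an illustration: the function name make_backup and the directory layout are our own assumptions, not part of any particular backup tool, and a real solution would also need to copy the archive off the machine.

```python
import tarfile
import time
from pathlib import Path


def make_backup(source_dir: str, backup_dir: str) -> Path:
    """Write a timestamped .tar.gz of source_dir into backup_dir and return its path.

    make_backup is a hypothetical helper for illustration only.
    """
    source = Path(source_dir).resolve()
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # A timestamp in the file name means each run creates a new archive
    # rather than overwriting the previous one.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the project folder.
        tar.add(source, arcname=source.name)
    return archive
```

Saved in a small script that calls make_backup with your own paths, this could be scheduled with a crontab entry along the lines of "*/20 * * * * /usr/bin/python3 /home/you/bin/backup.py" (both paths are placeholders). The point is that once the entry is in place, nobody has to remember to do anything.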

Tactical Backups

A tactical backup is a backup that runs constantly while you’re working on things. It allows you to jump back in time and recover recent work that you’ve either lost or edited to the point where it is hopelessly broken. On our Mac we use Arq (many people use Apple’s own Time Machine). It takes a snapshot every 20 minutes and then encrypts and sends the backup to Amazon S3 for safe storage. This means a checkpoint is set every 20 minutes that we can jump back to if necessary.

This is ideal for things like editing text (in fact we’re using it right now in case the laptop dies, as we really don’t fancy writing this chapter again from scratch!) but when you’re working on source code you might (read almost definitely) need finer control. Source control systems such as Git, Mercurial, or Subversion track and store the history of every file in your project. However, they only commit changes when you actually make a commit, and especially with distributed systems such as Git and Mercurial, the data is only transferred to a remote server when you specifically do a push.

This means that even if you’re religiously using source code management (and you should be), it is not the same as a timed backup. They do different things. Source control is there to track changes in your application and to allow you to easily move between current and previous revisions. It is not designed to protect you from disk failure. Backups, on the other hand, are not interested in tracking the minutiae of each file and instead simply find the changes and copy them off the machine for safekeeping. While you can argue there is some overlap here between the two, the goals are quite clearly different.

So a tactical backup is a solution that allows you to get your hands quickly on data that you might lose or might have replaced, thinking you didn’t need it any more. Whatever solution you use, it should allow you to quickly recover data and should be easy to use. You should feel at ease pulling data back from the past and it should be second nature to you.

Strategic Backups

Strategic backups are a bit different from tactical backups in that they do not necessarily have to be instantly accessible. They are usually created based on specific times or events.

For example, every time you deploy a new version of your site, you might make a complete backup. This would be a strategic backup because you’re not planning to actually use it any time soon; it’s just there in case you need it. Source code management systems allow you to do something similar using tags so that you can easily mark a particular revision as a specific version. However, these solutions are not a replacement for backups. After all, you should be making a full backup of your code repository after a big release just in case something goes wrong.

Alternatively, you might create time-based strategic backups. This is when backups are made based on the amount of time that has passed. For example, you might schedule a full backup of your code repository once per month. Maybe you then encrypt and upload it to Amazon or burn it to a CD or DVD.

Generally, you want to do a combination of the two (i.e. make backups at each key event and on a regular basis). Again, the backups should be automated wherever possible so that you don’t need to think about them.
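As a rough sketch of how the two triggers can be combined, the function below decides whether a strategic backup is due. The name strategic_backup_reason, the released flag, and the 30-day default are all assumptions made for this example; choose whatever events and intervals suit your project.

```python
from datetime import datetime, timedelta
from typing import Optional


def strategic_backup_reason(
    last_backup: Optional[datetime],
    now: datetime,
    released: bool,
    max_age: timedelta = timedelta(days=30),  # assumed interval, roughly monthly
) -> Optional[str]:
    """Return why a strategic backup is due, or None if it is not due yet."""
    # Event-based trigger: a new version has just been deployed.
    if released:
        return "release"
    # Time-based trigger: no backup yet, or the last one is too old.
    if last_backup is None or now - last_backup >= max_age:
        return "scheduled"
    return None
```

A deployment script or scheduled job could call this and, whenever the result is not None, kick off the full backup and record the reason alongside it.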

Incremental vs. Full

To quickly recap, a full backup is a complete and self-contained resource that can be used to reconstruct all the data. It is the only file you need to get all your data back. However, each time you take a full backup, you are going to duplicate data, which means you’ll eat up huge amounts of disk space.

The solution to this is to make incremental backups. This is when, having taken a full backup, you only store the changes from your current system. Because you’re not storing duplicate data, you save a huge amount of space. The downside is that you need all the files leading up to your current backup in order to do a restore. If you lose any of the files, you won’t be able to restore the backup.
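The mechanics can be sketched in a few lines. This is a simplified illustration rather than how any particular tool works: it assumes a JSON manifest of SHA-256 checksums kept outside the source folder, it writes each increment to its own folder, and it does not track deleted files. The names file_digest and incremental_backup are ours.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def incremental_backup(source: Path, dest: Path, manifest_path: Path) -> list[str]:
    """Copy only files that are new or changed since the last run.

    Returns the relative paths that were copied into dest. The manifest
    records the checksum of every file seen on the previous run.
    """
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    current = {}
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(source).as_posix()
        current[rel] = file_digest(path)
        # Unchanged files are skipped, which is where the space saving comes from.
        if previous.get(rel) != current[rel]:
            target = dest / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(rel)
    manifest_path.write_text(json.dumps(current, indent=2))
    return copied
```

The first run, with no manifest, copies everything and is effectively the full backup. Each later run copies only what changed, so a restore has to replay the full backup and then every increment in order, which is exactly why losing one of them is a problem.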

So which is better? As always, it’s a trade-off between space and safety. Personally, we avoid incremental backups for critical things because we don’t want to be hunting around to locate the various parts of the backup that we need. That said, Arq uses incremental backups, and as it stores them all in the same place on Amazon S3, the risk of losing a critical file or it being damaged is very small. So far, this has worked really well for us.

Ultimately, you need to weigh the pros and cons and make a decision that best fits your needs and requirements.

Please, Please Perform Test Restores!

If you are going through all the hassle of making regular backups and making sure all the important files are covered, you should be given a pat on the back because you’re already ahead of the game. However, it never fails to amaze us that so many people who put in all that effort making good backups never actually test that the backups are working. Usually they find out that all is not well in the land of backups when they actually try to restore a critical file and find out that they can’t.

This happens a lot, even in big companies. A huge amount of effort is spent on making sure backups are running, but relatively little (if any) time is ever spent verifying that those backups are recoverable. For example, one of the backup tools we were using quite recently was backing up data quite happily; what we didn’t know (and apparently neither did the application in question) was that the index itself was corrupt. Data was being stored nicely on the remote server; it just wasn’t possible to actually recover any of it. Fortunately, we found out through a routine test. Every week we attempt to do a full restore to another machine, and that restore failed. This freaked us out, but the problem was easy enough to resolve: we just dropped the backup and created a new one.

There’s no telling how long this problem might have gone undetected if we hadn’t attempted the test restore. We could have potentially lost months’ or even years’ worth of data.

So please, if you’re making backups, test them on a regular basis. Depending on the system you’re using, you can also automate a test restore. However, there’s no substitute for hands-on experience. Proving to yourself that your backups are solid is a great way to help you sleep at night!
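If you do automate part of the check, one simple approach is to restore into a scratch folder and compare checksums against the live data. The sketch below assumes you have already restored the backup with whatever tool you use; tree_digests and verify_restore are hypothetical helper names, and the comparison only makes sense if the original has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def verify_restore(original: Path, restored: Path) -> list[str]:
    """Return the files that are missing from, or differ in, the restored copy."""
    want = tree_digests(original)
    got = tree_digests(restored)
    return sorted(rel for rel, digest in want.items() if got.get(rel) != digest)
```

An empty list means every file came back intact; anything else is a list of files to investigate before you need them for real.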

Summary

This chapter has admittedly taken us off the beaten path as far as performance goes, but we hope you agree that it was worth it. We started off by looking at a nice lightweight framework for project management and how you can apply it to practically any project you work on. We highlighted that many of the performance-related issues we see in the real world could have been prevented with a little forethought and that applications tend to be much easier to scale if they have been well designed in the first place.

Then we took another slight detour to touch on backups and what sort of things you should be looking for from your backup solution. This section covered the differences between tactical and strategic backups, the pros and cons of incremental and full backups, and the importance of testing the restore process.

We hope this chapter has sparked a few ideas or at least given you pause to think about the way you manage your own projects and their related assets. This chapter is certainly not meant to be an all-or-nothing affair; help yourself to the bits that meet your needs and feel free to ignore the ones that don’t. If you’re completely happy with your existing methodologies and strategies, that’s great, too; at least you can be more confident that nothing has slipped through the cracks!

The next chapter introduces the core concepts of load balancing, building the foundation you need for the rest of the book. It discusses the various things that can be load balanced and the importance of really knowing your system so that you can get the most out of it.
