Appendix A. Afterword

Failure’s bad enough;
add the human element
and things really suck.

Back around 2000, my employer’s main business was designing web applications, but once those applications were built, our clients would turn around and ask, “Where should we host this?” That’s where I came in, building and running a small but professional-grade datacenter for custom applications.

As with any new business, our hosting operation needed to make the most of existing resources. Hardware was strictly limited to cast-off equipment from the web developers, and we used only software that was free. The only major expense was a big-name commercial firewall, purchased for marketing rather than technical reasons.

With a whole mess of open source software, we built a reliable network management system that provided our clients with more insight into their equipment than their in-house people could offer. The clients paid for their own hardware, and so had fancy high-end rackmount servers with their chosen applications, platforms, and operating systems. As the business grew, we upgraded the hardware (it’s nice to have disk drives that are less than five years old), but we saw no need to replace the software.

One Monday morning, a customer who had expected to use very little bandwidth found that he had sufficient requests to devour twice the bandwidth we had for the entire datacenter. This affected every customer. If your $9.95 per month web page is slow, you have little to complain about; if your $50,000 per month web application is slow, you pick up the phone and scream until the problem stops.

To make life worse, my grandmother had died only a couple days before. Visitation was on Tuesday, and the funeral was Wednesday morning. I handed the problem to a minion and said, “Here, do something about this.” I knew the network could manage bandwidth at many points. The web servers themselves, the load balancer in front of them, the commercial firewall, and even the router claimed to have traffic-management capacity.

Tuesday, after visitation, my cell phone voicemail was full. Our version of Internet Information Server (IIS) could manage bandwidth—in 8MB increments, and only if the content was static HTML and JPEG files. With several web servers behind the load balancer, that fell somewhere between useless and laughable. The load balancer would support traffic shaping, if we bought the new feature set. If we plopped down a credit card, we could have that feature set installed by next Sunday. Our big-name commercial firewall also had traffic-shaping features available, if we upgraded our service level and paid an additional (and quite hefty) fee for the feature set. That left the router, which I had previously investigated and found would support traffic shaping with only an IOS upgrade.

I was on the phone until midnight Tuesday night, making arrangements to do an emergency router IOS upgrade on Wednesday night. I had planned to go to the funeral Wednesday morning, give a eulogy, go home and take a nap, and arrive at work at midnight ready to rock.

Unfortunately, the funeral was more dramatic than I had expected, and I showed up at work at midnight sleepless, bleary-eyed, and upright only courtesy of the twin blessings of caffeine and adrenaline. In my email, I found a note that several big clients had threatened to leave unless the problems were resolved by Thursday morning. If I hadn’t already been stressed out, the prospect of choosing a minion to lay off would have done the trick. (I work hard training my minions, and prefer not to replace them once they are beaten into shape.)

Still, only a simple router flash upgrade and some basic configuration stood between me and relief. What could possibly go wrong?

The upgrade went smoothly, but the router behaved oddly when I enabled traffic shaping. Over the next few hours, I discovered that the router didn’t have enough memory to simultaneously support all of our BGP feeds and the traffic-shaping functionality. Worse, it wouldn’t accept more memory. At about 6:00 AM, I finally got an admission from the router vendor that it could not help me.

I hung up the phone. The first client who had threatened departure would be checking in at 7:30 AM. I had slept 4 hours of the last 48, and had spent most of that time under fiendish levels of emotional stress. I had already emptied my stash of quarters for the soda machine, and had pillaged my coworkers’ desks for more change. The caffeine and adrenaline that had gotten me to the office had long since worn off, and further doses of each merely slowed my collapse. We had support contracts on every piece of equipment, and they were all useless. All the hours of work that I, and my team before me, had put in left me with absolutely nothing.

I made myself sit still for two minutes simply focusing on breathing, making my head stop sliding around loose on my shoulders, and ignoring the loud ticking of the clock. What could be done in 90 minutes—no, now only 88?

I really had only one option. If it didn’t work, I would either lay off someone or file for unemployment.

At 6:05 AM, I started downloading the OpenBSD install floppy image, and then I grabbed a spare desktop machine, selecting it from among many similar machines by virtue of it being on top of the pile. For the next few minutes, I alternated between hitting the few required installation commands and dismantling every unused machine unlucky enough to be in reach to find two decent network cards.

By 6:33 AM, I had two Intel EtherExpress cards in my hands and a virgin OpenBSD system. I logged in long enough to shut down the system so I could wrench the case off, slam the cards into place, and boot again. Even early versions of PF included all sorts of nifty filtering abilities, all of which I ignored in favor of the newly released traffic-shaping functions. By 6:37 AM, I was wheeling a cart with a monitor, keyboard, and my new traffic shaper over to the rack.

Then things got hard. I didn’t have a spare switch that could handle our Internet bandwidth. The router rack was jammed to overflowing, leaving me no place to put the new shaper. I lost almost half an hour finding a crossover cable, and when I discovered one, it was only two feet long. The router, of course, was mounted in the top of the rack. About 7:10 AM, I discovered that if I put the desktop PC on end, set it on an empty shipping box, and put the box on the cart, the cable just reached the router. I stacked everything so it would reach, and began rewiring the network and reconfiguring subnets.

I vaguely recall my manager coming in about 7:15 AM, asking with taut calmness if he could help. If I remember correctly, as I typed madly at the router console, I said, “Yes. Go away.”

At 7:28 AM, we had an OpenBSD traffic shaper between the hosting area and our router. All the client applications were reachable from the Internet. I collapsed in my chair and stared blankly at the wall.

While everything seemed to work, the proof would be in what happened as our offending site started its daily business. I watched with growing tension as that client’s network traffic climbed toward the red line that indicated trouble. The traffic grew to just short of the danger line, and then flatlined. Other clients called, happy that their service was restored to its usual quality. One client complained that his site was still slow, but it turned out that the bandwidth problem had masked a problem with his application. The client said that his website now ran even slower than before, to which we offered to provide more bandwidth if he would agree to pay for it.

Shortly afterward, I had two new routers and new DS3s. The racks were again clean. The decrepit desktop machine was replaced by two OpenBSD boxes in a live-failover configuration, protecting our big-name commercial firewall as well as shaping traffic. And I now stock crossover cables in a variety of lengths.

If I had started with OpenBSD, I would have had a much better night.
