You can frequently assess the quality of a team by the quality of their metrics. Metrics are the lifeblood of a team lead because everything in your job is a negotiation, and metrics provide a rational foundation for discussion. If you don’t back up your statements with metrics, you’ll sound like Animal the Muppet. You also need metrics because you are constantly making judgment calls, and good data creates good (or at least defensible) judgment. Great leads live by their metrics because metrics point out problems, track progress, and celebrate success.
There’s a story, possibly apocryphal, that tells how Frito-Lay came up with one metric by which it could run its business. Frito-Lay stocks store shelves, taking up critical inventory space. Ideally, it will take up exactly the amount of space on a shelf that it needs—too much, and its products get returned. Too little, and it misses out on sales.
You can imagine multiple ways of figuring out how to create a metric for this business. You could sample the number of products on shelves every day and then forecast a trend, but that would be very time-consuming, especially if products are typically stocked every two weeks. You could measure store profitability and then look at stock levels, but that would deliver data that’s confounded by store effectiveness and size.
Frito-Lay solved this problem by measuring “stales,” the count of products that are returned at each restock event because the product is out of date. Frito-Lay wants the number of stales to equal precisely one. Taking a single bag of potato chips as a chargeback may seem a crime to you, but as a measurement cost it is very small. If the stales count is greater than one, the suppliers decrease the stock levels. If there are no stales, they increase the stock levels. It’s a fabulously simple metric that the field can measure and to which the company can react.
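To make the rule concrete, here is a minimal sketch in Python; the function name, the one-unit step size, and the numbers are illustrative assumptions, not Frito-Lay’s actual process:

    def adjust_stock_level(current_stock, stales, step=1):
        """Apply the stales rule described above: aim for exactly one stale per restock.

        More than one stale means the shelf was overstocked, so trim it back.
        Zero stales means we may have missed sales, so stock a little more.
        Exactly one stale means the level is right, so leave it alone.
        """
        if stales > 1:
            return max(0, current_stock - step)
        if stales == 0:
            return current_stock + step
        return current_stock

    # Example: the last restock came back with zero stales, so stock one more bag.
    print(adjust_stock_level(current_stock=24, stales=0))  # -> 25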
From Frito-Lay’s example, we can learn five key aspects of a great metric: it should be nearly free to collect, measurable reliably and repeatably, available in near real time, actionable by your team, and as close to the customer as possible.
Stales are a wonderful measure on the first count because the data is nearly free: it already exists as chargebacks.
Reliability and repeatability enable testing, which helps you ensure that your metric works. For example, if you were to swap out the chip stocker and you found that your stales count went up, you could investigate and determine that there was some different pattern of behavior. Perhaps the chips were stacked backward—certainly, displaying the nutrition information on Fritos is not going to help sales.
One of the most remarkable systems I’ve seen implemented is Amazon’s order tracking system. Amazon has enough orders and data that it creates a live, statistical process control model with orders. If your feature launches and damages order flow, you can rest assured that your pager will go off nearly instantly!
Like the Frito-Lay chip stocker who can react in real time to changing inventory conditions, your team needs to understand what to do when a metric changes. For example, even though Amazon’s order measurement system is clearly brilliant, it’s only a health metric. When an alarm goes off because orders are low or high relative to predictions, you know there is a problem, but you don’t know where the problem is. Furthermore, it’s great for a team to drive orders up, but the Amazon product is far too large for any team to measure their impact through the global ordering pipeline. But if your team is responsible for the shopping cart, and you measure the conversion ratio of users entering the checkout process to those who get a “Thanks for buying!” page, you will have a single number that reflects user experience and systems performance, and gives you a goal to target.
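Here is a rough sketch of that conversion metric in Python; the event names and the event-log format are hypothetical, not Amazon’s actual instrumentation:

    def checkout_conversion(events):
        """Fraction of users who entered checkout and reached the
        "Thanks for buying!" page. `events` is a list of
        (user_id, event_name) pairs pulled from a hypothetical log."""
        started = {user for user, name in events if name == "checkout_started"}
        completed = {user for user, name in events if name == "purchase_confirmed"}
        if not started:
            return 0.0
        return len(started & completed) / len(started)

    events = [
        ("u1", "checkout_started"), ("u1", "purchase_confirmed"),
        ("u2", "checkout_started"),
        ("u3", "checkout_started"), ("u3", "purchase_confirmed"),
    ]
    print(checkout_conversion(events))  # 2 of 3 users converted -> about 0.67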
Another reason I like the ordering conversion metric is that it’s a number that reflects the customer experience. If your systems become too slow, or you add a lot of steps into the process for users, the metric will decline. But if you measured the 99.9th-percentile latency of the Oracle database at the backend and reported that metric, it might well have had no impact on the customer experience at all until it hit two or three seconds. Your goal should be to collect your data as close to the customer as possible. Metrics that are close to customers are meaningful and understandable.
Another aspect of focusing on the customer is measuring as late in the customer’s experience as possible. For example, if you make downloadable iPhone software, which metric is more meaningful, downloads or application starts? I’d argue it’s application starts, because downloads tell you only about marketing, whereas application starts tell you about user engagement and growth.
Your goal, therefore, is to identify the “stales” of your product. Before you go off creating a sophisticated algorithm by which to run your business, remember the fourth point, which says that you and your team need to be able to take action on a metric. Some businesses have a very hard time being run on a single number, and an attempt to do so renders the output meaningless; this was the case with many of the single-number “fitness functions” that Amazon tried to implement for its teams. While the Ordering team was able to generate a brilliant fitness function that they could live by, other teams had a far more difficult time and had to apply complex mathematical transformations to their data. Worse yet, they spent tons of engineering time coming up with the metrics! Down this path lies danger.
W. Edwards Deming once wrote “That which cannot be measured cannot be improved,” and boy, was he right. What’s more, if you work for a year on a brilliant product to improve some customer’s life, but you can’t measure its impact, how are you going to get promoted? And after you realize that you can’t get promoted, how are you going to get a new job when you have no improvement to demonstrate?
If you’re going to demonstrate improvement, you need a baseline. Therefore, you must establish metrics early and keep them updated throughout the development of your product. It’s not hard to establish basic metrics. Consider, for example, your engineering team’s ability to execute.
One measure of execution might be whether you can make your launch date. Your launch date is frequently a function of the number of bugs left to fix. Many bug tracking systems can generate find/fix ratios and bug counts as charts. Therefore, if you combine the find/fix ratio with a bug count, you can create a forecasted “zero bugs” date. For more on how to generate this metric and why it’s important, see the section Track Your Bugs and Build a Bug Burndown in Chapter 4.
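As a sketch of the arithmetic, here is one way to compute that forecast in Python; the linear-burndown assumption and the example rates are mine, and a real bug tracker’s numbers will be noisier:

    from datetime import date, timedelta

    def forecast_zero_bugs(open_bugs, found_per_day, fixed_per_day, today=None):
        """Project a "zero bugs" date from the current bug count and find/fix rates.

        Assumes a simple linear burndown; if bugs are found as fast as (or faster
        than) they are fixed, there is no projected date.
        """
        today = today or date.today()
        net_burn = fixed_per_day - found_per_day   # bugs retired per day, net of new finds
        if net_burn <= 0:
            return None                            # find/fix ratio >= 1: the date never arrives
        days_remaining = open_bugs / net_burn
        return today + timedelta(days=round(days_remaining))

    # 40 open bugs, finding 2/day and fixing 6/day -> roughly 10 days out.
    print(forecast_zero_bugs(open_bugs=40, found_per_day=2, fixed_per_day=6))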
The zero bugs date is a great metric for your development process because it is nearly free with most bug tracking systems, it can be measured reliably and repeatedly, it can be reported in real time, and it provides direction to your team. In this latter case, if you are distracting your team with more of your “great ideas,” your find/fix ratio will go up and push your ship date out. And since one of your goals is to minimize your ship date, you should quit it with the great ideas already!
Your metrics will probably change after you launch because you recently introduced a major new source of input: customers and customer usage. You’ll use metrics based on customers and their actions to report to your investors or management, inform your product decisions, and guide your team. There are three critical classes of post-launch metrics: progress toward goals, business performance, and systems performance.
Goal metrics report your progress toward achieving an objective. One goal metric that is a staple at Google is the “seven-day active user count.” It represents the number of unique users who used the product during the trailing seven days. This metric is much better than the typical “daily unique user count” you get out of cheap web metrics packages, because it measures current behavior and you can compare week-to-week performance easily. It’s also much more reasonable than daily users, since you will rarely build a product that you expect people to use every day.
If you are building a product that you do expect customers to use every day, then the delta between one-day and seven-day active users is very meaningful. For example, when I worked on Google’s plug-in for Microsoft Outlook—Google Apps Sync for Microsoft Outlook™—we expected that people who were using Outlook would probably check their mail daily unless our software wasn’t working well. Therefore, we paid close attention to the ratio of seven-day active users to one-day active users. If you have an infrequently used service, such as photo printing, you might care more about 30-day active users.
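A minimal sketch of how you might compute those counts from a usage log; the log format and the dates are made up for illustration:

    from datetime import date, timedelta

    def active_users(events, as_of, window_days):
        """Unique users seen in the trailing `window_days` ending on `as_of`.
        `events` is a list of (user_id, date) pairs from a hypothetical usage log."""
        start = as_of - timedelta(days=window_days - 1)
        return {user for user, day in events if start <= day <= as_of}

    events = [
        ("alice", date(2024, 3, 4)), ("alice", date(2024, 3, 10)),
        ("bob",   date(2024, 3, 9)),
        ("carol", date(2024, 3, 10)),
    ]
    today = date(2024, 3, 10)
    seven_day = active_users(events, today, 7)
    one_day = active_users(events, today, 1)
    # 3 seven-day actives / 2 one-day actives = 1.5; for a product people should
    # use every day, you want this ratio to stay close to 1.
    print(len(seven_day) / len(one_day))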
Other goals you might want to track include revenue, signups, downloads, or installs.
At this point you may be thinking, “I know all about goals. I know to make them S.M.A.R.T.” What’s S.M.A.R.T.? Some rocket surgeon a while back came up with the notion that goals should be specific, measurable, attainable, reasonable, and time-based. This is a good, but not sufficiently specific, framework. I prefer the Great Delta Convention (described in Chapter 10). If you apply the Great Delta Convention to your goals, nobody will question them—they will almost be S.M.A.R.T. by definition (lacking only the “reasonable” part).
Business performance metrics tell you where your problems are and how you can improve your user’s experience. These metrics are frequently measured as ratios, such as conversion from when a user clicks the Buy button to when the checkout process is complete. Like goal metrics, it’s critical to measure the right aspects of your business. For example, if you want to build a great social product, you don’t need to measure friends—different segments of users have different numbers of friends. But you do want to measure user engagement so you can answer questions like “Are users spending time on the site?” and “Are they posting?” A relevant collection of metrics for these behaviors might be posts in seven days per seven-day-active-user and minutes spent on-site per seven-day active user.
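For instance, here is a rough sketch of those two ratios, assuming a hypothetical per-user activity summary for the trailing seven days:

    def engagement_per_active_user(activity, seven_day_actives):
        """Posts and minutes on site per seven-day active user.

        `activity` maps user_id -> {"posts": ..., "minutes": ...} for the trailing
        seven days; users with no entry simply contributed nothing. Both the schema
        and the numbers below are illustrative.
        """
        total_posts = sum(activity.get(u, {}).get("posts", 0) for u in seven_day_actives)
        total_minutes = sum(activity.get(u, {}).get("minutes", 0) for u in seven_day_actives)
        n = len(seven_day_actives) or 1
        return total_posts / n, total_minutes / n

    activity = {"alice": {"posts": 3, "minutes": 42.0}, "bob": {"posts": 0, "minutes": 5.5}}
    # -> (1.0, 15.83...): posts and minutes per seven-day active user
    print(engagement_per_active_user(activity, {"alice", "bob", "carol"}))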
Eric Ries isn’t a big fan of these growth metrics in his book The Lean Startup (Crown Business). He calls them vanity metrics because you can puff up your chest, point to a chart that goes up and to the right, and say, “Look, we’re awesome! We’re growing!” even as your product is failing 90% of the incoming new users. It’s a fair point. This is why you need to look at metrics like conversion and engagement, among others. Nearly all web analytics packages will provide conversion metrics out of the box, and they will also tell you which features are used, which buttons are clicked, and by which groups of users.
Another way to avoid “vanity” in your metrics is by measuring changes to your application. It’s always best to test in real time, rather than longitudinally, because longitudinal analysis is fraught with confounding problems and you can easily say, “See, we’re still going up!” Google Analytics provides A/B comparison tools that are incredibly powerful, but they’re just one of many tools you can use. Most major websites have testing frameworks that they use to roll out features incrementally and ensure that a new feature or experience has the intended effect. If it’s even remotely possible, try to build an experimentation framework in from the beginning (see Chapter 7’s discussion of launching for other benefits of experiments).
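If you do roll your own, the core of such a framework is a stable assignment function. Here is a generic sketch (not how Google Analytics or any particular product does it):

    import hashlib

    def assign_bucket(user_id, experiment, treatment_fraction=0.5):
        """Deterministically assign a user to "control" or "treatment".

        Hashing the user id together with the experiment name keeps assignments
        stable across sessions and independent across experiments.
        """
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        # Map the first 8 hex digits to a number in [0, 1).
        fraction = int(digest[:8], 16) / 0x100000000
        return "treatment" if fraction < treatment_fraction else "control"

    print(assign_bucket("user-123", "new_checkout_flow"))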
Systems performance metrics measure the health of your product in real time. Metrics like these include 99.9th-percentile latency, total requests per second, simultaneous users, orders per second, and other time-based metrics. When one of these metrics degrades substantially (latency spiking, or orders per second dropping), something has gone wrong. A pager should go off.
If you’re a very fancy person, you’ll want to look at your metrics through the lens of statistical process control (SPC). W. Edwards Deming, who learned the technique from Walter Shewhart in the 1920s, was one of the first to popularize SPC as a mathematical way of deciding how far a metric can decline before you should page your tech lead. SPC assumes there is noise in your system, and within this noise there’s an envelope of acceptable performance. This is considered common cause variation—noise, as it were. Paging someone over it would be a false alarm (a Type I error).

Then there are spikes of badness over a shorter period of time. Deming calls this special cause variation; failing to catch it is a Type II error. A bad push or a server falling out of a virtual IP (VIP) might cause such a spike.
You can ignore common cause error—your noise—if it falls within two standard errors of the mean. The standard error of the mean is the standard deviation divided by √N, where N is the number of samples in your data. If a single data point falls outside of two standard errors of the mean, ring the pager.
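In code, the whole rule fits in a few lines; the orders-per-minute numbers here are invented, and a real system would watch a sliding window of recent data:

    import statistics

    def out_of_control(history, new_point, n_sigmas=2):
        """Flag a point that falls outside +/- two standard errors of the mean,
        per the rule described above. `history` is a list of recent, in-control
        samples of the metric (e.g., orders per minute)."""
        mean = statistics.mean(history)
        standard_error = statistics.stdev(history) / (len(history) ** 0.5)
        lower = mean - n_sigmas * standard_error
        upper = mean + n_sigmas * standard_error
        return not (lower <= new_point <= upper)

    orders_per_minute = [98, 102, 100, 97, 103, 101, 99, 100]
    print(out_of_control(orders_per_minute, 100))  # within the band: False
    print(out_of_control(orders_per_minute, 90))   # ring the pager: True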
It is generally true that any metric can be gamed. To continue with our previous launch date example, we could categorize more bugs as not launch blockers, or we could simply stop testing (which seems like a win-win on its face!). In reality, you and your team are unlikely to game the system because the metric is only an indicator—not the boss—so don’t worry if your core metrics can be gamed. When the metric becomes the boss and you spend days and weeks trying to justify the number, it’s time to change the metric. Or go work somewhere else.