4 Data instead of information

This chapter covers

  • Scoping dashboards for a specific purpose
  • Effectively organizing dashboards
  • Prompting the user with context

Sometimes you have so much data in a system that you’re not sure what will be useful in a support situation. Instead of starting with the question “What do I want to know?”, you start from the perspective of “What answers do I have?”

This leads to a few problems. First, you don’t challenge yourself to come up with ways to get the answers to the questions you have. And second, you tend to think of the data you have as the final answer or response. But data and information are two very different things.

Data is just raw unorganized facts, but when you take that data and give it context and structure, it becomes information. When you’re looking at your dashboards, you can quickly tell which dashboards are giving you data and which are giving you information.

4.1 Start with the user, not the data

Everyone presumes that just having the metrics about a system’s status is enough. But the way the status of a system is presented to users is as critical as having the metrics themselves. Poorly visualized metrics are useless, becoming nothing more than a sea of numbers in which the signal gets lost in the noise.

Only a monster or a data scientist would present users with a spreadsheet of metric points in a Microsoft Excel document. Even if given such a document, the first thing the user would do is convert those numbers into some sort of visualization chart. The power of pictures cannot be overstated. It’s how humans assimilate knowledge best. The same goes for your metric data.

But just knowing that you need pictures isn’t the same as knowing the best way to organize them. The field of data visualization and user experience (UX) design is a topic that could fill volumes of text alone, and since you’re not a UX designer, would probably bore you to tears. (If you are a UX designer and you’re reading this, job well done!) But you don’t need to be an expert to start designing useful dashboards for your team and other stakeholders.

This chapter will give you practical tips for making your dashboards purpose driven and accessible to their audience. I’ll give you some guidelines for designing dashboards, organizing them, and calling out key bits of information. But it all starts with understanding who the dashboard is for.

It’s a knee-jerk reaction to start your dashboarding process by picking a system or server and looking to see what metrics are available to you so that you can build the ultimate dashboard. That’s the mythical single pane of glass that corporate giants have been selling us since the 1970s. That approach is a trap and will most likely lead to a dashboard that isn’t incredibly useful to you or anyone else.

Your first step should be identifying the intended audience for your dashboard. This will help you scope your dashboard in terms of which metrics to display, which metrics need accentuation, and the granularity of the data to display. Different audiences need different things, so two dashboards built for the same system but for different audiences can look very different. I’ll give a brief example.

Think of a database system. The database is sometimes considered the nerve center of the application. That’s where all the long-lived data for your application is stored. It also drives decision-making criteria. A reporting team queries the database for various business metrics.

In this fictional company, the database team has decided to create a read replica of the primary database. A read replica is another copy of the database that users can only read from (not write to). The database is updated through a replication mechanism from a primary database that it is tied to. As updates are made to the primary, the database system replicates those same changes to the read replica to keep it in sync. The reporting team queries the read replica to help prevent unwanted congestion on the primary database for normal end users of the application.

If you were building a dashboard for the read replica, you would first need to decide which audience you’re building it for: the database administrator or the reporting team. The reporting team probably cares about a few key items when they’re running their report:

  • How many reports are currently running?

  • When was the last time the read replica was updated from the primary?

  • How busy is the database overall?

These concerns are similar to those a database administrator might have, but the prominence and level of detail necessary might differ wildly. Take the question of “How busy is the database?” The database administrator would probably want that data broken up by CPU utilization, disk input/output (I/O), memory utilization, database buffer cache hit ratio, and more. The administrator wants this detail because their purpose is to understand not only the performance of the system, but also the contributing factors to that performance. A reporting team just wants to be able to set expectations around how long their report is going to take to run. In that context, they would be served just as well with a red/yellow/green status indicator on database performance. They just need to know whether the system is going to run slower than usual or on par with typical performance.

This is why starting with your user in mind is a dashboarding best practice. Understand the motivations for what the user will be doing. A troubleshooting dashboard and a status dashboard can look extremely different because the intended use cases are so different. When working on a new dashboard, ask yourself these questions:

  • Who is going to be looking at this dashboard?

  • What is the intended purpose of the dashboard?

  • With that intended purpose in mind, what are the top three to five pieces of information the dashboard needs to convey quickly?

With these bits of information, you can begin tackling your new dashboard.

4.2 Widgets, the dashboard building blocks

After you decide what your new dashboard should display, you have to decide how you’re going to display it. Each metric is displayed within a widget, a small graphical unit that visualizes a particular metric. A widget can use different display types, and a dashboard is made up of many widgets.

DEFINITION Widgets are graphical components used to display a metric. The dashboard is a collection of widgets. Widgets can use different display types to express the underlying data.

4.2.1 The line graph

In the technology metrics field, you can almost never go wrong with a basic line graph. A line graph gives you current values as well as historical trends, allowing you to see changes over time and the variability of the metric. When in doubt, use a line graph.

Figure 4.1 A basic line graph widget of the current users metric

Looking at figure 4.1, you can see how the measured metric, currently active users, goes extremely low overnight but then begins to climb with the start of the workday.

Sometimes you may need to graph multiple values but want them on the same graph. There are two common scenarios for this. In the first scenario, you have multiple processes or servers that you want to see on the same graph. In a web server example, you want to see a graph of each server’s CPU utilization or request count to get an idea of whether traffic is balancing evenly across the web cluster. In that case, placing multiple lines on the same graph makes a lot of sense.

In figure 4.2, you can see where a single widget has multiple lines being graphed, each one representing a different server. The tightness of the lines highlights that each node seems to be performing an equal amount of work.

Figure 4.2 Display of multiple servers and their CPU utilization throughout the day

In the second scenario, you may want the ability to see the sum of all of a metric’s values, but still be able to isolate a specific aspect. Using our messaging system example from the previous chapter, you may want to know the total number of consumers who exist on your messaging platform. But you also want to know where those consumers are coming from, or more precisely, which servers have the most connections.

In this scenario, you may want to consider using an area graph (sometimes referred to as a stacked line graph). These graphs function like individual line graphs, but instead of each starting at zero, they stack on top of each other, starting from the high point of the previous line. The area between the two stacked lines is shaded in a contrasting color to highlight the difference between the two workloads.

Because the items are stacked on top of each other, you also get a total count at a quick glance. Figure 4.3 shows a stacked line graph that is displaying the total number of consumers on a messaging bus.

Figure 4.3 Stacked line graph, using different colors to separate the line fields

With this type of graph, you can communicate not only the overall volume, but also which components might be contributing more to the total than the others.
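
To make the stacking concrete, here is a minimal Python sketch of the arithmetic behind a stacked graph. The server names and consumer counts are invented for illustration, and most dashboarding tools do this work for you, so treat it as an explanation rather than something you would normally write yourself.

```python
# Hypothetical consumer counts per server at three points in time.
consumers = {
    "server-a": [10, 12, 11],
    "server-b": [5, 7, 9],
    "server-c": [2, 2, 4],
}

def stack_series(series):
    """Return (name, lower, upper) bands, each stacked on the previous one."""
    bands = []
    baseline = [0] * len(next(iter(series.values())))
    for name, values in series.items():
        # Each band starts at the high point of the band below it.
        upper = [b + v for b, v in zip(baseline, values)]
        bands.append((name, baseline, upper))
        baseline = upper
    return bands

bands = stack_series(consumers)
total = bands[-1][2]  # the top edge of the last band is the overall total
```

The height of each band (upper minus lower) is that server’s contribution, and the top edge of the last band is the total you read at a glance.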

4.2.2 The bar graph

Bar graphs are a great option when you have very infrequent data or spots of missing data collection. With a line graph, many tools will draw awkward lines between missing data points, which at first glance look like a series of rising or sinking values. But in truth, the data points are just less frequent. The lack of frequency results in an inaccurate-looking line graph. Bar graphs solve this problem by not needing to draw connecting lines between the two points. Empty values just don’t get displayed.

A lot of graphing tools allow you to interpret empty values as zero and plot them accordingly. That might be alright in some cases, but depending on your metrics, a missing value could be distinctly different from a zero value. One example occurs when you’re measuring metrics from a server, and the server is responsible for sending the metrics to the metrics collection engine (pushing), instead of the engine connecting to the server and requesting information (pulling).

DEFINITION Pushing data happens when an individual agent or server is responsible for shipping data to a centralized collection service. Pulling data occurs when the collection service connects to nodes individually to request data. Knowing how data is collected and the direction of data collection can be helpful when troubleshooting collection problems.

If your metric data is being pushed from servers to a metrics collection service, a zero value (the server has sent data and reported a value of zero) is very different from missing data (the server never sent a value, perhaps because it’s down). Depending on your environment, you may want to know the difference between the two and graph them appropriately. The bar graph allows you to highlight these differences.

Figure 4.4 shows a bar graph that has missing data. The graph is showing job execution times for a process.

Figure 4.4 A bar graph that has missing data points

The lack of data here tells a very different story. The job being graphed didn’t execute during the entire period being displayed. Graphing a zero value here would lead someone to think that the job is executing more frequently than it is. During troubleshooting, this could lead to poorly generated theories and a lot of wasted effort.

I tend to avoid treating no data as zeros. You’ll need to make your own decisions on this based on your tools. Sometimes a system just doesn’t emit a metric if it doesn’t have a value. In cases like that, when no data is regularly expected, converting to zeros might make sense for you.
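
As a small illustration of the difference, the Python sketch below keeps missing samples distinct from zero values and only converts them when asked. The sample values and the `treat_missing_as_zero` flag are assumptions made for this example; your tool will have its own name for the equivalent setting.

```python
# Hypothetical metric values, one slot per collection interval.
# None means no sample arrived; 0 means a sample arrived and reported zero.
samples = [42, None, None, 0, 38, None]

def bars_to_draw(samples, treat_missing_as_zero=False):
    """Return (slot_index, value) pairs for the bars a graph should draw."""
    bars = []
    for index, value in enumerate(samples):
        if value is None:
            if treat_missing_as_zero:
                bars.append((index, 0))
            continue  # otherwise leave the slot empty
        bars.append((index, value))
    return bars

strict = bars_to_draw(samples)
filled = bars_to_draw(samples, treat_missing_as_zero=True)
```

With the default, only three bars are drawn and the gaps stay visible. With the flag set, all six slots get a bar, and the real zero in slot 3 can no longer be told apart from the gaps around it.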

4.2.3 The gauge

The gauge is a great widget when you need to display a single value at a given point in time. It’s like a speedometer: when you look at it, it reflects the current value of the metric. Because it displays only a single value, the widget is pretty straightforward.

Some tools allow you to represent the number by using an actual gauge, but in reality, a basic numerical display works just as well in most cases. Unless, of course, the dashboard is going to be displayed to senior management. Then the gauge makes it look more “high-tech.” You’ll die a little bit inside using it for this reason, but know that you wouldn’t be the first person to do it.

4.3 Giving context to your widgets

A widget might display many types of data, from sales numbers to CPU utilization. But those numbers are useless without context. If I gave you a gauge widget titled “Cash on hand” and that number read $2,000,000, is that good or bad? Well, it depends on a lot of factors. Are we normally keeping $18,000,000 on hand? What is cash on hand used for? If it’s lunch money, it’s probably way too much and could be put to better use through investments.

The point is that sometimes simply displaying data isn’t enough. You need to give that data context of whether it’s good or bad. I recommend giving context to data in three main ways: through color, threshold lines, and time comparisons.

4.3.1 Giving context through color

Giving context through color can be extremely easy, because most of us are already wired to associate certain colors with a particular meaning. With absolutely no context, if I were to show you a green light, you’d most likely think, “go” or “OK.” If I showed you a red light, you’d most likely think, “stop” or “danger.” You can use this wiring to your advantage to use color as a means of providing context, specifically with gauge numbers.

Most tools provide an easy way to color-code a value or to annotate the widget with a symbol when values are within a range or threshold. With this functionality, you can determine the thresholds for gauge values and dynamically color-code them appropriately. This enables users of the dashboard to quickly look at a widget and determine whether things are good or bad, just based on the color of the value or the annotation added to it. But remember to use common colors that are universally recognized: green for OK, yellow for warning, and red for danger. If you want to add more than these colors, that’s fine, but make sure your colors are transitioning through this spectrum.

It can be extremely confusing when these rules are violated. I remember going to a data center once where the hard drive lights had accidentally been wired incorrectly, so all the lights on the drives were red! I went into quite a shock when I thought the entire disk array had failed, only to be told about the mix-up a few minutes later. Everything was fine, and I was able to delete my LinkedIn post about finding a new job before anyone saw it, but that’s probably an experience that everyone has when they first visit. Remember to make sure you stick with the boring green/yellow/red pattern. If you need additional colors for whatever reason, make them along the spectrum between those colors (green to yellow, and yellow to red).
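
Most tools let you configure this through their UI, but the logic underneath is simple. Here is a minimal Python sketch of threshold-based color-coding for a gauge. The function name and the threshold values are assumptions for illustration, and the sketch assumes that higher values are worse.

```python
def status_color(value, warning_at, danger_at):
    """Map a gauge value to green, yellow, or red.

    Assumes higher values are worse, as with CPU utilization.
    """
    if value >= danger_at:
        return "red"
    if value >= warning_at:
        return "yellow"
    return "green"

# Hypothetical thresholds for a CPU utilization gauge, in percent.
colors = [status_color(v, warning_at=70, danger_at=90) for v in (35, 75, 96)]
```

For a metric where lower values are worse, such as free disk space, you would flip the comparisons rather than swap the meaning of the colors.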

4.3.2 Giving context through threshold lines

If I’m looking at a graph, the context of that graph can change greatly based on the time horizon I’m looking at. For example, if the orders per second metric has been bad for four hours, but I’m looking at only the last hour of data, I might not realize that there’s been a significant change. The number looks perfectly steady. But if I zoom out, I can quickly see that something significant has happened!

When you design your widget, you’re never certain that users will be viewing it within the appropriate time horizon to get the context of previous values. You can solve this by giving the user context via threshold lines. Threshold lines are additional lines in the graph that are static, indicating the maximum values for that widget. This way, no matter what your time horizon is when you look at the widget, you can see that you’re below whatever limit there should be. Take a look at figure 4.5 and try to determine whether this graph is in a good or bad state.

Figure 4.5 A graph that uses threshold lines

Without knowing anything about the graph or what a RabbitMQ file descriptor is, you can probably tell that this graph is in a good state. The dashed line represents the threshold for this particular graph. The color-coding of the threshold line is a deliberate choice as well. The threshold is red to indicate that this is a bad threshold or a bad line to cross. Now if this were a green line, it might carry a different connotation, making you surmise that the metric needs to be above the threshold to be considered healthy. The power of colors!

Another thing you’ll notice is that the threshold line is dashed. This is to draw attention to it so that it doesn’t blend in with other metric data. If your other metric data is displayed as dashed lines, consider changing the threshold lines to solid. Just be sure that you can easily differentiate them from other metric lines.
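
The choices described above (a static value, a color that signals which side is bad, and a line style that stands apart from the metric lines) can be sketched in a few lines of Python. The dictionary keys, the function names, and the limit of 1024 are all hypothetical; they don’t correspond to any particular tool’s configuration format.

```python
def threshold_line(value, series_styles, kind="upper"):
    """Describe a static threshold line for a graph widget.

    kind="upper" is a limit the metric should stay below (drawn in red);
    kind="lower" is a floor the metric should stay above (drawn in green).
    """
    if kind not in ("upper", "lower"):
        raise ValueError("kind must be 'upper' or 'lower'")
    return {
        "value": value,
        "kind": kind,
        "color": "red" if kind == "upper" else "green",
        # Pick a style that the metric lines are not already using.
        "style": "solid" if "dashed" in series_styles else "dashed",
    }

def within_threshold(points, line):
    """True if every visible point is on the healthy side of the line."""
    if line["kind"] == "upper":
        return all(p < line["value"] for p in points)
    return all(p > line["value"] for p in points)

# Hypothetical file descriptor counts against a made-up limit of 1024.
limit = threshold_line(1024, series_styles=["solid"])
healthy = within_threshold([310, 325, 318], limit)
```

Because the threshold is a fixed value rather than something derived from the visible data, the answer is the same whether the user is looking at the last hour or the last week.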

4.3.3 Giving context through time comparisons

In a lot of workloads, your metric pattern might be variable based on the time of day. This could be due to factors like when your users are online, when certain background processing occurs, or just a general increase in load. Sometimes it’s helpful to understand whether the metric data you’re seeing is an outlier or is matching up with previous norms. The best way to do this is a time-comparison overlay.

With a time-comparison overlay, you’re taking the metrics from the previous 24 hours and applying those datapoints in the same graph, but with slightly different display criteria. In figure 4.6, you can see that database read latency is being compared over a 24-hour period.

Figure 4.6 Displaying historical metrics on top of the current metrics to give additional context

Now when looking at a particular spike, you can easily see that it is pretty consistent with historical behavior. But be careful how much value you assign historical performance. Just because something performed this way yesterday doesn’t mean it’s not a problem. It’s just not a new problem, and that’s a subtle difference.

If what you’ve been investigating is something new that’s been happening within the last, say, hour, then the historical performance shows that nothing has changed with the pattern of this metric. Your application is doing just as badly as it was yesterday. But if you were looking for general performance improvements, tuning for this historically high volume could be beneficial. The devil is in the details.
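
If your tool doesn’t offer a time-comparison overlay out of the box, the idea is easy to reproduce: take the previous 24 hours of points and shift their timestamps forward so they line up with the current window. The Python sketch below shows that shift; the function name, dates, and latency values are invented for illustration.

```python
from datetime import datetime, timedelta

def overlay_previous_day(points, window_start, window_end):
    """Return (current, previous) series for a time-comparison overlay.

    points is a list of (timestamp, value) tuples. The previous series
    holds the prior 24 hours, shifted forward to align with the window.
    """
    day = timedelta(hours=24)
    current = [(t, v) for t, v in points if window_start <= t < window_end]
    previous = [
        (t + day, v)  # shift forward so it lines up with the current window
        for t, v in points
        if window_start <= t + day < window_end
    ]
    return current, previous

# Hypothetical database read latency samples, in milliseconds.
points = [
    (datetime(2024, 1, 1, 9, 0), 12.0),
    (datetime(2024, 1, 1, 10, 0), 30.0),
    (datetime(2024, 1, 2, 9, 0), 11.5),
    (datetime(2024, 1, 2, 10, 0), 31.0),
]
current, previous = overlay_previous_day(
    points, datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 3, 0, 0)
)
```

Plotting `previous` on the same axes as `current`, with a lighter color or thinner line, gives you the overlay. In this made-up data, the 10:00 spike shows up on both days, which tells you it isn’t new.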

4.4 Organizing your dashboard

Now that you have your widgets, you’re ready to start logically organizing your dashboard. You should think about the users and the key bits of information they will need when viewing this dashboard. Think of the two to four items that are the most likely reasons your intended users are coming to the dashboard. For checking on things like website performance health, you want response-time latency, requests per second, and request error rate as your top three widgets. When doing investigative research, these three datapoints are probably going to be the things you look at first. If they’re performing well and you can find that out quickly, you can then move to the next possible source of the problem in quick fashion.

The organization of your dashboards doesn’t need to remain static either. Over time you might find that you’ve underemphasized some graphs and overemphasized others. Or you might be experiencing a period of instability due to a new problem that’s been introduced into the system. In those cases, if you have a leading metric that forecasts problems, it’s probably worthwhile to place that metric prominently in the front during the unstable period. The goal of dashboard organization is to provide access to the data you need most often, as quickly as possible.

You might find that different people feel that different metrics should be highlighted. Again, that’s likely based on the way different types of users care about different types of data. Don’t be afraid to create more dashboards if you need to. Later in this chapter, I’ll talk about how to name dashboards to draw the attention of these various groups.

4.4.1 Working with dashboard rows

Widgets in your dashboard are aligned in rows. Your widgets should be organized from left to right, top to bottom, in order of importance. In most cases, your widgets will be further grouped into related metrics. All of the disk performance metrics should be located together, the CPU metrics grouped together, and the business key performance indicators grouped together. The one place you should break this rule is the very first row of the dashboard.

The first row of the dashboard should be reserved for your most important metrics. Where they fall in the grouping of metrics is immaterial compared to how quickly you can get access to that data. Since you designed the dashboard with a particular type of user in mind, think of what that user will most likely want access to on a regular basis. What would they check first when they log into the dashboard? These are the items that should go in your first row.

Try to limit that row to no more than five widgets if possible. It’s easy to get caught up in the idea that all the metrics are important, but if they’re all important, then none of them are. They’ll get lost in a sea of widgets, and you’ll spend more time researching CPU-level context switches instead of figuring out why revenue is dropping.

Once you’ve established the first row, think of the various groupings that you might have in your widgets. Which widgets are related and tell different angles of the same story? For example, I would group disk performance metrics together. I’d gather these metrics that surround disk performance so that they were all viewable without much scrolling necessary:

  • Disk reads

  • Disk writes

  • Disk write latency

  • Disk queue depth

All of these metrics give me an idea about the overall health of my disk subsystem. Grouping these together allows me to view the disk performance from various aspects to drill down to the source of the problem. If I have a lot more writes than reads and my queue depth is high, I can quickly tell that I’m likely overloading the disk with writes and should start looking for write-heavy processes in my system.

If your dashboarding tool allows it, I also suggest using any grouping functions that the tool has. This will allow you to move the entire group of widgets at the same time if you decide to rearrange your rows at a later date.
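
To picture how rows and groups fit together, here is a minimal Python sketch of a dashboard layout as plain data. The structure and the helper functions are assumptions for illustration, not any tool’s real format; the limit of five comes from the guideline above.

```python
MAX_FIRST_ROW_WIDGETS = 5  # the soft limit suggested above

# Hypothetical layout: the first row holds the key metrics, and every
# later row is a named group of related widgets.
layout = [
    {"group": "Key metrics",
     "widgets": ["Response latency", "Requests per second", "Error rate"]},
    {"group": "Disk",
     "widgets": ["Disk reads", "Disk writes", "Disk write latency", "Disk queue depth"]},
    {"group": "CPU",
     "widgets": ["CPU utilization", "Context switches"]},
]

def first_row_ok(layout):
    """Check that the first row stays within the suggested widget limit."""
    return len(layout[0]["widgets"]) <= MAX_FIRST_ROW_WIDGETS

def move_group(layout, group, new_index):
    """Return a new layout with the whole group moved to new_index."""
    row = next(r for r in layout if r["group"] == group)
    remaining = [r for r in layout if r["group"] != group]
    return remaining[:new_index] + [row] + remaining[new_index:]

reordered = move_group(layout, "CPU", 1)
```

Treating a group as a single unit is what makes rearranging cheap: moving the CPU group above the disk group is one operation instead of one per widget.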

4.4.2 Leading the reader

Once you have defined your dashboard rows and laid them out in order of importance, you can add the finishing touches to the dashboard. I like to call this leading the reader.

You won’t know who happens upon your dashboards or their level of familiarity with the system in question. It might be a junior operations engineer troubleshooting a problem, or it might be a senior developer verifying that the last release hasn’t had any negative impact on the system. Whoever it is, you should try not to assume too much knowledge on their behalf. The more you can lead them through your dashboard, the better.

You can do this by creating note widgets. These are just simple, free-form text areas that allow you to further describe the dashboard, widgets, and widget groupings. You might even want to describe how some metrics relate to each other and how they should be read or interpreted.

Almost all metrics have a sort of known anomaly to them. That’s when a metric has a drastic swing under a well-known set of conditions. For example, in our environment, the database memory utilization suddenly drops after a deployment occurs. To someone unfamiliar with this process, this might look a little suspicious. A simple note next to the graph can help guide the reader so that they have a better understanding of that behavior. Figure 4.7 shows an example of such a note.

Figure 4.7 A note adds context to a widget.

Notes can also be a great place to leave breadcrumbs for the user. You might tell the reader about another location to look for relevant information with a link to it. Or you might have a deep link to the logging entries for the system being reviewed. Notes are a fantastic way to help people learn about the dashboards and become more self-sufficient in the future.

4.5 Naming your dashboards

They say the two hardest things in computer science are cache invalidation and naming things. Dashboards are no exception to this rule. I like to break my dashboards into three sections:

  • The intended audience (Marketing, TechOps, Data Science)

  • The system under examination (Database, Platform, Message Bus)

  • The view of the system being taken (Web Traffic Reports, System Health Overview, Current Monthly Spending)

By naming your dashboards in this way, you can help people filter out the dashboards that they don’t really care about and find the ones that are relevant to their needs. A marketing person doesn’t have to spend time clicking through all the TechOps dashboards when their dashboard is named “Marketing - Platform - Web Traffic Reports.” They can quickly drill down into what they need and bookmark it.

If you’re lucky enough to have a dashboarding tool that supports a folder hierarchy, you can consider making the intended audience section a folder, with the system under examination section being a subfolder of that, with your dashboards listed in the folder without the long prefix name.
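
Whichever form you choose, it helps to generate names rather than type them by hand so the pattern stays consistent. The Python sketch below builds both the flat name and a folder-style path from the same three sections. The function names and the separators are my own assumptions; adjust them to whatever your tool accepts.

```python
def dashboard_name(audience, system, view):
    """Build a flat name such as 'Marketing - Platform - Web Traffic Reports'."""
    return " - ".join([audience, system, view])

def dashboard_path(audience, system, view):
    """Build the folder-style equivalent: audience folder, system subfolder, view."""
    return "/".join([audience, system, view])

# Hypothetical dashboards, using the example sections from the list above.
names = [
    dashboard_name("Marketing", "Platform", "Web Traffic Reports"),
    dashboard_name("TechOps", "Database", "System Health Overview"),
    dashboard_name("TechOps", "Message Bus", "System Health Overview"),
]

# A consistent prefix makes it easy to filter down to one audience.
marketing_only = [n for n in names if n.startswith("Marketing - ")]
path = dashboard_path("TechOps", "Database", "System Health Overview")
```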

These are not hard-and-fast rules for organizing dashboards. They’re merely suggestions on how to think about breaking up your dashboards. You may not even need any deep categorization, based on the number of dashboards you have. If your audience is relatively small, the intended audience breakdown might not make sense for you. The goal is to have a systematic naming convention that allows people to find the dashboards they care about. If it makes sense in your organization to rename them after farm animals or famous street names, have at it. Just create some sort of pattern that you can follow and that people can understand. Like your office conference room naming scheme <sarcasm/>.

Summary

  • Design your dashboards with the end user in mind.

  • Give context to your widgets so users know whether a value is good or bad.

  • Organize your dashboards so that the most important items are first to be seen.

  • Group relevant widgets together for easy access and comparison.
