Chapter 11. Conducting a Monitoring Assessment

We’ve reached the final chapter of the book, folks. You have, I hope, learned many new things. This last chapter will take you through a fictionalized example of applying all the lessons of this book at once using an exercise I do with my own consulting clients: a monitoring assessment.

Performing a monitoring assessment on your environment is a great way to systematically determine what you should be monitoring and why. The end result is a clearer understanding of the behavior of your app and underlying infrastructure. It’s by no means exhaustive or perfect, but rather, it’s intended to be a starting point to get you thinking about what matters and what doesn’t.

Business KPIs

To start off, we need to figure out exactly what our fictional company, Tater.ly, does. After a chat with the CEO, we’ve learned the following:

Tater.ly’s mission is to help french-fry aficionados find the best french fries in all the land. Users come to Tater.ly to look up restaurants and read reviews about their french fries, as well as post their own reviews. The french fries are also rated on a scale of one to five, with five being the best. Restaurants can create their own pages or users can create them. Restaurants can “claim” their pages if the page already exists. Tater.ly makes money through advertising by placing a Featured Fry at the top of search results, with restaurants paying an advertising fee for the slot. The ad fees are based on number of impressions—that is, the number of people that see the ad (as opposed to “clicks,” that is, the number of people who click on the ad). Because the ad price is based on impressions, restaurant owners can choose how much to spend and whether to show their ad at peak times or non-peak times. It also allows us to run multiple ads. Currently, Tater.ly has gross revenue of $250,000 annually, and that’s steadily increasing.

Now that we have enough information to begin our assessment, let’s start with the business metrics: what are the business KPIs?

To start off, there are some basic metrics that tell us the state of the venture:

  • The number of restaurants reviewed

  • The number of active restaurants (that is, restaurant page owners logging in)

  • The number of users

  • The number of active users

  • Searches performed

  • Reviews placed

  • Ads purchased

  • The direction and rate of change for all of the above

That all looks pretty sound, though I would consider adding one more: net promoter score (NPS). Let’s add two perspectives to our list:

  • NPS from users

  • NPS from restaurants
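To make the NPS idea concrete, here is a minimal sketch of the standard calculation: the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 through 6). The function name and the sample ratings are made up for illustration; how Tater.ly would actually collect survey responses is not specified here.

```python
def net_promoter_score(ratings):
    """Compute NPS from 0-10 'how likely are you to recommend us?' ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 through 6
    # Passives (7 and 8) count toward the total but toward neither group.
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical survey responses, tracked separately for each audience.
user_ratings = [10, 9, 8, 7, 6, 10, 3, 9]
restaurant_ratings = [9, 7, 7, 5]

user_nps = net_promoter_score(user_ratings)
restaurant_nps = net_promoter_score(restaurant_ratings)
```

Keeping the two scores separate matters because users and restaurants experience very different parts of the product.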

And that wraps up the business section. Let’s move on to what we learned in Chapter 6: frontend monitoring.

Frontend Monitoring

As you might recall from Chapter 6, there is really only one big thing we need to make sure we’ve got: RUM metrics (use your favorite frontend monitoring tool). This will allow us to keep tabs on page load times from the user perspective.

Application and Server Monitoring

The first thing we’ll need for this section is an architecture diagram of Tater.ly’s infrastructure (Figure 11-1):

Figure 11-1. Tater.ly architecture diagram

From this architecture diagram, we can see that we have a standard three-tier architecture, plus a few other bits. Traffic comes in through a CDN whose origin is set to a pair of load balancers in an active-active configuration. Behind them sit four web servers (on which the Django app also lives), a PostgreSQL database in a primary-replica configuration, and a single Redis server for session storage. Tater.ly uses a web hosting provider rather than its own datacenters, so managing hardware and network is of low concern to them.

Drawing upon the lessons we learned in Chapters 7 and 9, what metrics and logs can you spot? Here’s what I’ve got:

Metrics:

  • Page load time

  • User logins: successes, failures, length of time taken, daily active users, weekly active users

  • Searches: number performed, latency

  • Reviews: reviews submitted, latency

  • PostgreSQL (inside the app): query latency

  • PostgreSQL (at the database server): transactions per second

  • Redis (inside the app): query latency

  • Redis (at the Redis server): transactions per second, hit/miss ratio, cache eviction rate

  • CDN: hit/miss ratio, latency to origin

  • haproxy: requests per second, healthy/unhealthy backends, HTTP response codes at frontend and backend

  • Apache: requests per second, HTTP response codes

  • Standard OS metrics: CPU utilization, memory utilization, network throughput, disk IOPS and space
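Several of the metrics above are measured "inside the app," which simply means timing the call from the application's point of view. Here is a rough sketch of one way to do that with a context manager. The `timings_ms` dictionary is an in-memory stand-in for whatever metrics client you actually use, and `run_search` and the metric name are hypothetical placeholders rather than Tater.ly's real code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-in for a real metrics client (an assumption of this sketch).
timings_ms = defaultdict(list)


@contextmanager
def timed(metric_name):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the timing even if the wrapped block raises.
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings_ms[metric_name].append(elapsed_ms)


def run_search(query):
    """Hypothetical search handler; the sleep stands in for a database query."""
    with timed("search.latency"):
        time.sleep(0.01)
        return [query.upper()]


results = run_search("curly fries")
```

The same wrapper could go around PostgreSQL and Redis calls, which gives you latency as the app experiences it rather than as the server reports it.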

Logs:

  • User logins: user ID, context (success? failure? reason for failure?)

  • Django: exceptions/tracebacks

  • The service logs for all the server-side daemons we’re using: Apache, PostgreSQL, Redis, and haproxy
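For the user login log entries, a structured format makes the user ID and context easy to search later. The following is a small sketch that assumes JSON-formatted log lines; the field names and the `login_event` helper are my own invention, not something Django provides.

```python
import json
import logging

logger = logging.getLogger("taterly.auth")  # hypothetical logger name


def login_event(user_id, success, reason=None):
    """Build a JSON log line describing a single login attempt."""
    event = {"event": "user_login", "user_id": user_id, "success": success}
    if not success:
        # Always say why a login failed, even if the reason is unknown.
        event["reason"] = reason or "unknown"
    return json.dumps(event, sort_keys=True)


# Successes are informational; failures are worth a warning.
logger.info(login_event(42, True))
logger.warning(login_event(42, False, reason="bad_password"))
```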

Any synthetic website monitoring tool can provide us with another crucial piece of information: SSL certificate expiration.
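If you are curious what such a check does under the hood, here is a minimal sketch using only Python's standard library. The helper names are hypothetical, and a real synthetic monitoring tool will do considerably more (retries, chain validation, alert routing).

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's 'notAfter' timestamp.

    not_after uses the format returned by getpeercert(),
    for example 'Jan  5 09:34:43 2030 GMT'.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Connect to the host over TLS and return its certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A check might call `days_until_expiry(fetch_not_after("example.com"))` on a schedule and raise an alert when the result drops below a few weeks.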

Security Monitoring

As Tater.ly isn’t subject to any compliance or regulatory requirements, security monitoring is straightforward:

  • SSH: login attempts and failures

  • syslog logs

  • auditd logs
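As an illustration of what you can get out of the SSH logs, the sketch below counts failed password attempts per source IP. It assumes the common OpenSSH "Failed password for ..." message format; the sample lines are fabricated, and your distribution's log format may differ.

```python
import re
from collections import Counter

# Matches lines such as:
#   Failed password for invalid user admin from 203.0.113.7 port 52144 ssh2
FAILED_PASSWORD = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def failed_logins_by_ip(lines):
    """Count failed SSH password attempts per source IP address."""
    counts = Counter()
    for line in lines:
        match = FAILED_PASSWORD.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Fabricated sample log lines for illustration only.
sample_lines = [
    "sshd[1001]: Failed password for invalid user admin from 203.0.113.7 port 52144 ssh2",
    "sshd[1002]: Failed password for root from 203.0.113.7 port 52150 ssh2",
    "sshd[1003]: Accepted publickey for deploy from 198.51.100.4 port 40022 ssh2",
    "sshd[1004]: Failed password for root from 192.0.2.9 port 33110 ssh2",
]
failures = failed_logins_by_ip(sample_lines)
```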

Alerting

And lastly, we’re going to need some alerts. Chapter 3 taught us that we don’t need a whole lot to be effective. Looking at the metrics and logs we’ve identified, I would expect these alerts to be in place:

  • Page load time increasing

  • Increasing error rates and latency on Redis, Apache, and haproxy

  • Increasing error rates and/or latency for certain application actions: searches, review submissions, user logins

  • Increasing latency on PostgreSQL queries
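Notice that these alerts are all about something increasing rather than crossing a fixed line. One simple way to express that, sketched below, is to compare the average of the most recent samples against the average of the window just before it. The window size and the 1.5x threshold are arbitrary assumptions that you would tune for your own traffic, and most monitoring tools offer a built-in equivalent.

```python
from statistics import mean


def latency_is_increasing(samples_ms, window=5, threshold=1.5):
    """Return True if the recent window is notably slower than the one before it."""
    if len(samples_ms) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = mean(samples_ms[-2 * window:-window])
    recent = mean(samples_ms[-window:])
    return baseline > 0 and recent > threshold * baseline


# Hypothetical PostgreSQL query latencies in milliseconds.
steady = [100] * 10
degrading = [100] * 5 + [200] * 5

steady_alert = latency_is_increasing(steady)
degrading_alert = latency_is_increasing(degrading)
```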

Make sure to write up a runbook or two for the application with all of this newfound information so your colleagues can benefit from all this knowledge and visibility.

Wrap-Up

And that’s it! Congratulations, you just finished your first monitoring assessment—that wasn’t so hard, was it? Of course, this is only the beginning of your monitoring journey. Monitoring is never done, since the business, application, and infrastructure will continue to evolve over time. This assessment is a great starting point, but don’t forget to keep improving.
