Chapter 5. Alerting

Alerting is the act of sending a notification, whether an SMS, an email, or something else, to raise awareness of an outage or a potential issue. Alerting has become increasingly important within companies, yet the alerting tools already in place are often expensive, costly to scale, and tied to specific systems. They frequently offer little intelligence within the alert itself, and they may not be as customizable to a company's needs as a tool like Splunk. Every organization wants an alerting system that sends only actionable alerts and eliminates noisy ones. Splunk can be a good way to consolidate multiple alerting tools, as well as to reduce alert noise.

Alerting in Splunk is much more than just saving a search query. While the mechanics may be that easy, what makes an alert successful or not is understanding the fine details of what the alert is doing, so that when a notification arrives it is actually actionable.

In this chapter, we will learn about the following topics:

  • Setting expectations
  • Time is literal, not relative
  • Be specific
  • Predictions

Setting expectations

I expect that every organization is different with respect to its criteria for what it alerts on. However, I can tell you from working across many sectors, and at some major institutions, that leadership generally wants the same things no matter what company you work for. Allow me to give you a breakdown of leadership's expectations of Splunk and its alerting capabilities, across industries.

Splunk should be able to:

  • Alert on the future
  • Predict the future
  • Automatically know when all users are experiencing problems simultaneously
  • Tame dragons
  • Slay gods
  • Perform prophetic-like miracles in real-time
  • Save them billions of dollars
  • Automate their departments' workflow

After all, it is a Big Data platform with machine learning capabilities, right?

I am exaggerating here, but Splunk is both a big data platform and one with machine learning capabilities. Contrary to popular belief, it is neither SkyNet nor the Matrix. I've checked under the hood; I didn't see Arnold or Neo.

There are a few often-overlooked components of setting expectations that I would like to cover, to ease your life before you, the administrator (a.k.a. the great and powerful Oz), become accountable for both magic and the future.

Time is literal, not relative

The concept of time has been a topic of discussion since early civilization. However, when it comes to enterprise alerting, it becomes a very real question of accountability. By this I mean that when SLAs are present, real time costs money, and no one wants to be caught accountable for breaching an SLA and costing their company money. The alerting system directly impacts a team's ability to be responsive, and, since Splunk is often used for monitoring, it also gets pulled into the finger-pointing game after money is lost.

Setting a real-time alert with a tool like Splunk could be exactly what the doctor ordered. It could also crash your outbound notification server (either SMTP or SMS) and shut down alerting for all of your systems.

Allow me to give you an example to explain.

I once had a member of my leadership ask me to set up a real-time alert on a system that bypassed QA and landed directly in production. The alert was going to be based on a long string that the developers had stated represented a catastrophic failure within the software log.

The logs we were monitoring at the time generated 12 GB an hour, because what's better than testing in production? Why, having the logger hard-coded to DEBUG mode, of course!

I asked three questions before I set the alert because I'm a firm believer that if someone answers the same question the same way when asked three different ways, they are sure they want that outcome.

Here are the questions:

  • What if we were to schedule an alert to check every 5 minutes?
    • Answer: No, we want it in real time.

  • Are you sure you want real-time alerts? There could be a lot of them.
    • Answer: Of course we are sure.

  • Will anyone actually react to alerts sent from an application logging them in nanoseconds?
    • Answer: If they don't react, then it's a problem for leadership. Our 24-hour staff should be reacting in real time as well.

You can probably see why I started asking questions like this, and where this is leading, and I must say the result was funny. Not for leadership, but certainly to observe.

I enabled the alert at 4 p.m., and I received a phone call from my executive at 10 p.m. screaming at me to turn off the alert and asking who authorized it.

The next day we had a discussion about real-time alerting again, and he asked me why I hadn't mentioned that this was a possible outcome. I responded that I had actually asked that question three times, and I showed him the email I had sent to both him and his colleagues.

Apparently, a couple of members of leadership on the distribution group that Splunk was sending to were out to dinner with external colleagues, and all of the internal leadership's phones started buzzing at the same time from the notifications.

The notifications, however, didn't stop. Splunk was sending 5,000 emails a minute to that email distribution group, and the members who were at dinner couldn't access their phones because the phones were too busy receiving emails. The phones also would not respond to attempts to shut them off manually.

My executive couldn't call me until he got back to his laptop at his house, logged in, and looked up my number in Outlook, because Splunk and the email notification system had effectively disabled control of his phone.

I assume it is embarrassing to have your cell phone buzzing non-stop during what's supposed to be a professional dinner to discuss operational efficiency (that was actually the topic).

The other fallout was that two other tier-3 systems (network and Active Directory) couldn't send any notifications, because the email server was too busy and Splunk had filled up its email queue.

Now the moral of the story is simply this: Be very sure you need real-time alerting.

When Splunk says real time, it does mean real time, so helping your leadership understand the concept of time is very necessary before you break things that they didn't even know were in place.

Here are some things you might try saying to your leadership to help them refine what they are asking for:

  • Running a real-time alert is very resource intensive for Splunk, so what if we set up an alert that runs every 5 minutes and checks the last 5 minutes of history?
  • What if we schedule an alert for every 15 minutes and only send one alert in order to reduce the noise of the alerts we receive? (Insert monitoring system) is creating a lot of noisy alerts, which is something we are trying to get away from.
  • What if we set up an alert to run every 1 minute and check 10 minutes of history? Then, if we receive an alert, we can throttle it so it doesn't send another alert for 30 minutes, and we can see how our team responds to know whether we need to adjust our processes. (See the configuration sketch after this list.)
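
As a rough illustration, the third suggestion maps to a savedsearches.conf stanza along the following lines. This is a minimal sketch only: the stanza name, index, search string, and email address are placeholders, and you should confirm each setting against the savedsearches.conf documentation for your Splunk version before relying on it.

    [Application catastrophic failure]
    # Placeholder search: swap in your own index, sourcetype, and failure string
    search = index=app_logs sourcetype=app_log "CATASTROPHIC FAILURE"
    # Run every minute over the last 10 minutes of history
    enableSched = 1
    cron_schedule = */1 * * * *
    dispatch.earliest_time = -10m
    dispatch.latest_time = now
    # Trigger when the search returns more than zero events
    counttype = number of events
    relation = greater than
    quantity = 0
    # Once triggered, stay quiet for 30 minutes
    alert.suppress = 1
    alert.suppress.period = 30m
    alert.track = 1
    # Email action; the address is a placeholder
    action.email = 1
    action.email.to = ops-team@example.com
    action.email.subject = Catastrophic failure detected in application logs

The alert.suppress.period setting is what keeps the second, third, and five-thousandth match within that half hour from generating yet another email, which is exactly the behavior missing from the real-time alert in the story above.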

There are millions of ways to avoid setting up real-time alerting and notifying on every result.

To avoid getting yelled at, I suggest avoiding real-time alerting and notifications unless absolutely necessary.

Here's the reality. It's very unlikely that any, and I do mean any, service team is going to react appropriately to sub-second alerting, and Splunk will send sub-second emails.

It's much easier to reach the goal of reducing alert noise when we schedule alerts around human reaction time.

Depending on your service team, crafting an alert with the appropriate intelligence to trigger once in a 15-minute interval and then throttle notifications for 30 minutes is reasonable. That will give the people who are actually going to fix the problem a chance to resolve it before another alert triggers.

If your team has a better-than-15-minute MTTR, then I congratulate you; in IT, that is about as common as finding a unicorn. You may want to set your thresholds differently.

To quickly summarize

If your boss asks for real-time alerting, don't assume they know what they are asking for. They may not realize that Splunk has the potential to cause bigger issues if you turn it on. Instead, try to determine what real time means to them. Does it mean an alert every minute, every ten minutes, or every sixty minutes?
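
If it helps that conversation, each of those interpretations of "real time" is just a different cron_schedule value on a scheduled alert in savedsearches.conf. These are standard cron expressions, shown here purely for reference; only one would appear in any given stanza.

    # Every minute
    cron_schedule = */1 * * * *
    # Every ten minutes
    cron_schedule = */10 * * * *
    # Every sixty minutes, on the hour
    cron_schedule = 0 * * * *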

In case you do end up in a situation where you need to turn on a real-time alert/notification, be sure to give ample warning to your superiors about what could happen if a machine spins out of control.

Be specific

Leadership has a way of delegating a task to the Splunk admin and walking away. They don't realize that the task they just delegated may be like asking the Splunk admin to climb a beanstalk and find the golden goose after fighting the giant.

When I say "be specific", I mean be very specific about the system and the data that is being displayed.

For instance, do not assume that the Splunk for Linux app will display all of the relevant charts for YOUR organization.

By this I mean the following.

If you have a very good, well-respected engineer in your organization, you can expect them to have their own tools for monitoring.

You can also expect them to know the difference between what vmstat represents on Linux and what it represents on Solaris.

On Solaris, vmstat is probably one of the best commands for getting memory data from the machine, but any good Linux administrator knows that if you want memory information from Red Hat or CentOS, you use free -g or top, or something of the sort, because on Linux vmstat doesn't accurately represent the memory actually in use due to the filesystem cache.

It also just so happens that the free command doesn't exist on Solaris, and there really isn't an equivalent in the system tools packages available.

Splunk_TA_nix uses only vmstat to gather memory information, so if you deploy it on your Linux machines and see memory flat-lining at the same value, that will be why: on a Linux machine, vmstat and free -m report memory utilization differently, in a way that doesn't apply on Solaris.
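
A quick way to sanity-check what the TA is actually reporting is to chart its memory field over a window when you know the box was busy. This is a sketch only: it assumes an index named os and the vmstat sourcetype with a memUsedPct field as shipped in Splunk_TA_nix, so verify the index, sourcetype, and field names against your own deployment before alerting on them.

    index=os sourcetype=vmstat host=your_linux_host
    | timechart span=5m avg(memUsedPct) AS avg_mem_used_pct

If that line sits nearly flat while free -m on the host shows memory rising and falling, you are looking at the cache-inclusive measurement described above, and it is not a number worth alerting on.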

Small details like this make all the difference between an alert being accurate and actionable, and it being completely ignored by your service or ops team.

The vmstat case is not the only instance of something like this within the *nix app, so before you launch ANY app, go through it and be sure you're collecting the right data to alert on.

Presenting this to your leadership simply requires a question like:

"Since you want to monitor Linux memory utilization, should we be using vmstat or free -m?"

Allow them to decide and get back to you. If you decide for them, you will be the one held accountable when something goes wrong and it isn't what everyone expected.

To quickly summarize

Be specific about the data you are alerting on, and make sure it is the data that the system owners agree is the most actionable to alert on. If there are varying opinions, follow up with them about the decision they need to make, and if time drags on, let leadership know what's going on. Decisions like which "counter" to use are important for everyone to buy into if Splunk alerting is to be accurate.

Predictions

With new technology comes new ideas and expectations. While some live in reality, others live in fantasy. Helping your leadership, as well as your team, understand the limitations of predictive analytics in Splunk will help you manage your workload.

It won't be long after you start using predictive analytics and machine learning technology that a member of leadership will come to you and ask how you can alert on it.

Make no mistake, this is a very slippery slope.

If alerting on the predictions made with Splunk is part of your organization's roadmap, then I suggest you do the following two things, though they are not guaranteed to save you if the predictions come in wrong:

  • Based on your prediction horizon, have at least 3x that many data points within your Splunk system.
    • This means that if you want to predict out 3 months, use 9 months' worth of data.

  • Have Splunk's predictions checked by someone with a doctorate in mathematics who understands the LLP algorithm. For more info, look on http://www.splunk.com/ for the predict command. (See the sketch after this list.)
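
For context, here is roughly what a capacity-style prediction looks like in SPL. This is a hedged sketch: the index, sourcetype, and used_pct field are placeholders, and the predict options shown (algorithm, future_timespan, holdback) should be double-checked against the predict command documentation for your Splunk version before anyone treats the output as a forecast.

    index=capacity sourcetype=disk_usage
    | timechart span=1d avg(used_pct) AS used_pct
    | predict used_pct AS predicted_used_pct algorithm=LLP future_timespan=90 holdback=30

Two things are worth noticing. LLP is the seasonal variant of the algorithm, so it assumes your usage has a repeating pattern, which is exactly the kind of assumption the mathematician should be confirming rather than the Splunk admin. And the 3x rule from the first bullet applies: predicting 90 days forward here presumes roughly 270 days of history sitting in that index.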

It is often the case that people will ask you to use Splunk to predict things such as capacity, bandwidth utilization, or market swings, with no math expert to back you up.

To my fellow Splunkers: if you are holding accountability for one of the aforementioned things right now, treat it like a bomb that will explode if you tip it. Slowly set it down on the floor, stand up slowly while watching it, and walk away. When at a safe distance, run like the wind!

I exaggerate, but I would personally avoid predictive alerting altogether unless there is a Ph.D. involved, simply because of my experiences with it.

There is no biological or non-biological method known on Earth that is capable of predicting the future. If there were one, I have no doubt everyone would know, and it would not be an enterprise software program.

If you have a mathematician with you to fact-check, then knock yourself out. Splunk isn't a bad way to automate prediction functions, and it leverages its big data clustering architecture well, so it may run your calculations very quickly, depending on your architecture.

To avoid alerting on predictive analytics without a Ph.D. involved, you can put questions and statements like these to your leadership:

  • What if the prediction isn't right? I'm not a doctor of mathematics, so I won't know, and once you take it to your boss, things get really bad if we're wrong.
  • We need 3 years' worth of data to predict what it will look like next year and we don't have it.
  • We need 10PB of SSD storage to hold 3 years' worth of capacity metrics. That will likely have a financial impact. Can we afford that kind of refresh?
  • Splunk cannot tell the future, no matter what the sales guy said.

Things like this can help you to avoid predictive alerting.

To quickly summarize

It is best to avoid predictive alerting. If you have to use it, have someone well educated in math fact-check Splunk's results, and/or let your leadership know that you may not be alerting on what you'd hoped. Run a test of the use case they want, notifying only you and the leadership member sponsoring it, and see whether Splunk is giving you the desired results before you publish them to other members of leadership.
