Anomalies

Anomaly detection, also known as outlier detection, is a branch of data mining that deals with the identification of events, items, observations, or patterns that do not conform to an expected set of events or patterns. Anomalous behavior is often a sign of an issue arising in the given dataset. Splunk provides various commands to detect anomalies in real time, which can be useful in detecting fraudulent credit card transactions, network and IT security breaches, hacking activity, and so on. There is also a Splunk app named Prelert Anomaly Detective App for Splunk on the app store, which can be used to mine the data for anomalies. The following commands can be used either to group similar events or to create a cluster of anomalous or outlier events.

The anomalies command

The anomalies Splunk command is used to measure how unexpected each event in the given data is. This command assigns a score to each event, and depending on the threshold value, the events are then classified as anomalous or not. An event is reported as anomalous if the unexpectedness score that the anomalies command writes to the unexpectedness field is greater than the threshold value. For this reason, it is very important to choose an appropriate threshold value to detect anomalies in the given dataset.

According to the Splunk documentation, the unexpectedness score of an event is calculated from the similarity of that event (X) to a set of previous events (P), using the following formula:

unexpectedness = [s(P and X) - s(P)] / [s(P) + s(X)]
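To make the arithmetic concrete, here is a minimal Python sketch that simply evaluates the formula. It is not Splunk's implementation: the function name is hypothetical, and because the similarity metric s is not defined in this section, the three scores passed in are made-up numbers chosen only for illustration.

```python
def unexpectedness(s_p_and_x: float, s_p: float, s_x: float) -> float:
    """Evaluate [s(P and X) - s(P)] / [s(P) + s(X)] for the given scores."""
    return (s_p_and_x - s_p) / (s_p + s_x)


# Hypothetical similarity scores, for illustration only.
score = unexpectedness(s_p_and_x=12.0, s_p=10.0, s_x=6.0)
print(score)  # (12 - 10) / (10 + 6) = 0.125
```

The more the score of the combined set exceeds the score of the previous events alone, the larger the resulting value.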

The syntax for the anomalies command is as follows:

… | anomalies
    threshold=threshold_value
    normalize= True / False
    field=Field_Name
    blacklist=Blacklist_Filename

All the parameters for this command are optional. Running the anomalies command creates the unexpectedness field with the unexpectedness score. The parameter description of the anomalies command is as follows:

  • Threshold: The threshold_value parameter is the upper limit of normal events. All the events having the unexpectedness field value greater than this threshold value will be reported as anomalous.
  • Normalize: The default value of this parameter is true, which means the numeric text in the events will be normalized. In the process of normalizing, all the numeric characters from 0 to 9 are considered identical to calculate the unexpectedness value.
  • Field: This parameter specifies the field on which the unexpectedness value is calculated. The default value for this parameter is _raw.
  • Blacklist: The name of a file located in $SPLUNK_HOME/var/run/splunk/ that contains a list of events to be ignored while calculating the unexpectedness score.
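The effect of the normalize and threshold parameters can be pictured with a short Python sketch. This is only an illustration of the two ideas described above, not how Splunk implements them: the helper names are hypothetical, replacing every digit with "0" is an assumed way of making digits identical, and the threshold of 0.1 is an arbitrary example value.

```python
import re


def normalize_digits(text: str) -> str:
    """Replace every digit with the same placeholder so numbers look alike."""
    return re.sub(r"[0-9]", "0", text)


def is_anomalous(unexpectedness: float, threshold: float) -> bool:
    """An event is reported only when its score exceeds the threshold."""
    return unexpectedness > threshold


a = normalize_digits("user=17 latency=230ms")
b = normalize_digits("user=42 latency=981ms")
print(a)       # user=00 latency=000ms
print(a == b)  # True: after normalization the two events are identical

# Hypothetical score and threshold, for illustration only.
print(is_anomalous(0.125, threshold=0.1))  # True
```

With normalization, two events that differ only in their numeric values no longer look unexpected relative to each other.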

Take a look at the following sample query:

source="outlierData" | anomalies labelonly=false by Strength | table Strength Latitude unexpectedness

The output of the preceding query would look similar to the following screenshot:

[Screenshot: The anomalies command]

The example dataset records mobile signal strength with respect to location. Each event holds the signal strength (field name: Strength) reported by the mobile device at a given location (field name: Latitude). The anomalies command found 12 anomalies in the dataset and reported the unexpectedness value of each. The threshold parameter can be raised to report only the events with higher unexpectedness values, or lowered to include events with lower ones.

The anomalousvalue command

The Splunk anomalousvalue command, as the name suggests, is used to find anomalous values in the given dataset. This command calculates an anomaly score for the fields in the specified field list, based either on frequency of occurrence or on standard deviation. It can be used to find values that occur infrequently or values that lie far from the other values of the same field.

The syntax of the anomalousvalue command is as follows:

… | anomalousvalue
    action = filter / annotate / summary
    pthresh = Threshold_value
    field-list

The parameter description of the anomalousvalue command is as follows:

  • Action: This parameter defines what action is to be taken on the result. If the value is filter, which is also the default, only the anomalous values are shown in the result and the non-anomalous values are ignored. If the value is summary, the result is a statistical table containing fields such as count, distinct count, mean, standard deviation, support, and various statistical frequencies. If the value is annotate, the result shows a new field containing the anomalous value.
  • pthresh: This parameter is used to specify the threshold value to mark a value as an anomalous value. The default value of this parameter is 0.01.
  • field-list: The list of fields for which the anomalous value is to be outputted. If the field list is not specified, then all the fields of the events will be considered to calculate the anomalous value.
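The frequency-based side of this idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not Splunk's scoring: the function name is hypothetical, the readings are invented, and treating pthresh as a cutoff on relative frequency is a simplification made for the example.

```python
from collections import Counter


def rare_value_flags(values, pthresh=0.01):
    """Return {value: relative_frequency} for values rarer than pthresh."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total < pthresh}


# Hypothetical signal-strength readings: 199 ordinary values and one odd one.
readings = [-70] * 120 + [-72] * 79 + [-15]
print(rare_value_flags(readings))  # {-15: 0.005}
```

Only the value that appears once in 200 readings falls below the 0.01 cutoff, so it is the only one flagged.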

Refer to the following example for better clarity:

source="outlierData" | table Strength Latitude | anomalousvalue Strength

[Screenshot: The anomalousvalue command]

The dataset used for this example is the same as in the preceding example of the anomalies command. Running the anomalousvalue command on the Strength field returned five events out of a total of 831. For the Latitude values in these events, the corresponding Strength value is anomalous. The result also includes an Anomaly_score field for Strength, which gives the anomaly score of each anomalous value.

The cluster command

Clustering is the process of grouping events on the basis of their similarity. The cluster Splunk command creates groups based on the content of events. According to the Splunk documentation, Splunk has its own algorithm for grouping events into clusters: the events are broken into terms (match=termlist), and then the vectors between events are computed. This command creates two custom fields, one holding the size of the cluster (countfield) and the other holding the label of the cluster to which the event belongs (labelfield).

The syntax of the cluster command is as follows:

… | cluster
      t = Threshold_value
      field = Fieldname
      match = termlist / termset / ngramset
      countfield = Count_FieldName
      labelfield = Label_FieldName

The description of the parameters of the preceding query is as follows.

There are no compulsory parameters for this command. All the parameters are optional:

  • T: This parameter specifies the threshold_value used to create the clusters. The default value is 0.8, and the value can range from 0.0 to 1.0. The closer the threshold is to 1.0, the more similar events must be to be placed in the same cluster; for example, a value of 0.9 groups events more strictly than the default of 0.8.
  • Field: This parameter specifies the field of each event on which the clusters are created. The default value for this parameter is the _raw field.
  • Match: The grouping to create clusters in Splunk is done in one of the following three ways, which can be specified in this parameter:
    • Termlist: This is the default value of the match parameter; it requires the exact same ordering of terms for events to be placed in one cluster.
    • Termset: An unordered set of terms is considered to create the cluster.
    • Ngramset: Compares sets of three-character substrings (trigrams).
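The difference between the three match methods is easier to see on a pair of events. The following Python sketch only illustrates how each method represents an event. Splunk's own clustering algorithm is not reproduced here, the function names are hypothetical, and Jaccard overlap is used as a stand-in similarity measure to compare against the threshold t.

```python
def termlist(text):
    """Ordered terms: word order matters."""
    return text.lower().split()


def termset(text):
    """Unordered terms: word order is ignored."""
    return set(text.lower().split())


def ngramset(text, n=3):
    """Set of n-character substrings (trigrams by default)."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 1.0


e1 = "login failed for admin"
e2 = "failed login for admin"

print(termlist(e1) == termlist(e2))  # False: same words, different order
print(termset(e1) == termset(e2))    # True: order no longer matters
print(round(jaccard(ngramset(e1), ngramset(e2)), 2))  # 0.82
```

In this sketch the two events differ under termlist, are identical under termset, and overlap by about 0.82 under ngramset, which would just clear a threshold of 0.8.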

Take a look at the following example:

source="DataSet.csv" | cluster

The preceding query would generate output like the following:

[Screenshot: The cluster command]

The dataset used for this example contains the sepal length, sepal width, petal length, and petal width of three different species of plants. Given these four values, the species can be determined. The cluster Splunk command creates three clusters, one for each species in the given data. This is a very simple example for explanatory purposes, but the command can be very useful in creating clusters of similar events. The match parameter selects the grouping method: termlist, termset, or ngramset. If the clusters are not accurate, the threshold can be adjusted by providing a value for the t parameter.

The kmeans command

K-means is a cluster analysis algorithm used in data mining. The kmeans Splunk command is used to create clusters of events, where each cluster is defined by the mean values of its members. K-means clustering can be explained with the help of an example. Let's say I have a dataset that has information about Jaguar cars, jaguar animals, and Jaguar OS. Using k-means, three clusters can be created, with each cluster having events of the respective type only. Basically, k-means assigns each event to the cluster whose mean (centroid) is closest to the values of that event, so events with similar values end up together. Events that lie far from every centroid, or that form a small cluster of their own, stand out from normal behavior. Hence, k-means can be used to detect issues, frauds, network outages, and so on in real time.

Take a look at the following query syntax:

… | kmeans
    k = k_value
    field_list

The list that follows describes the parameters of the preceding query.

There are no mandatory parameters for this command. All the parameters are optional:

  • K: Specifies the k_value, which is the integer value defining the number of clusters to use in the algorithm. The default value of k is 2.
  • Field_list: List of fields that are to be considered to compute the algorithm. By default, all the numeric fields are considered and non-numeric fields are ignored.
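For readers who want to see the algorithm itself, here is a minimal pure-Python sketch of k-means on numeric fields. It is a textbook version written for illustration, not the code behind the kmeans command: the function name, the sample events, and the way the initial centroids are picked (evenly spaced through the input, to keep the result deterministic) are all assumptions.

```python
def kmeans(points, k=2, iterations=20):
    """Plain k-means on tuples of numbers; returns (labels, centroids)."""
    # Assumption: seed the centroids with k points spread evenly through
    # the input, purely to keep this sketch deterministic.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical events with two numeric fields each.
events = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
          (9.0, 1.0), (9.2, 0.8), (8.8, 1.2)]
labels, centroids = kmeans(events, k=3)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The labels here are numbered from 0 for convenience. Only numeric values take part in the distance calculation, which mirrors the note above that non-numeric fields are ignored.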

An example of the kmeans query looks like the one that follows:

sourcetype=kmeans | table Group Alcohol diluted_wines | kmeans k=3

The output would look similar to this:

[Screenshot: The kmeans command]

The dataset used in the preceding example contains various ingredients of three different alcohols. The kmeans command creates three clusters (k=3) and writes the cluster number to the CLUSTERNUM field. The Group field is shown in the output so that the clusters made by kmeans can be checked against the actual groups: cluster 1 matches group 1, and cluster 2 matches group 2. The kmeans command can create as many clusters as required. Suppose we know that the dataset covers three different alcohol types but want to cluster it into two groups only; in this case, k=2 can be used in the command. The kmeans command also calculates the centroid of each field and displays it in the result. K-means is one of the more efficient clustering algorithms.

The outlier command

In statistics, an outlier is an event that lies at a distance from the other events in the typical distribution of data points. An outlier can be caused by issues or errors in the system from which the dataset is generated. The outlier Splunk command does not report the outliers; it removes the outlying events from the data or truncates their outlying values. The command processes the outlying numeric values of the specified fields, and if no fields are specified, it is applied to all the fields.

The Splunk documentation states that the filtering method used in the outlier command is the interquartile range (IQR). That is, if the value of a field in an event is less than (25th percentile) - param*IQR or greater than (75th percentile) + param*IQR, that field is transformed or that event is removed, based on the action parameter.

The syntax for the outlier command is as follows:

… | outlier
      action = remove / transform
      mark = true / false
      param = param_value
      uselower = true / false 

The parameter description of the outlier command is as follows.

There are no mandatory parameters for this command. All the parameters are optional:

  • Action: This parameter specifies the action to be performed on the outliers. If set to remove, the events containing outliers are removed, whereas if set to transform, the outlying values are truncated to the threshold value. The default option for this parameter is transform.
  • Mark: If action is set to transform and this parameter is set to true, the command prefixes the outlying value with 000. If action is set to remove, this parameter is ignored. The default value for this parameter is false.
  • Param: This parameter defines the threshold value for the outlier command. The default value is 2.5.
  • Uselower: If set to true, the values below the median are also considered for the outlier calculation. The default is false, which considers only the values above the median.
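The IQR rule and the action, param, and uselower parameters can be sketched in Python as follows. This is an illustration under stated assumptions rather than Splunk's implementation: the function names and sample values are hypothetical, the percentile uses simple linear interpolation (Splunk's exact percentile method may differ), and uselower=false is modeled by not applying the lower bound at all.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def outlier(values, action="transform", param=2.5, uselower=False):
    """Remove or clamp values outside the IQR-based bounds."""
    s = sorted(values)
    q1, q3 = percentile(s, 25), percentile(s, 75)
    iqr = q3 - q1
    upper = q3 + param * iqr
    # Assumption: with uselower=False the lower bound is simply not applied.
    lower = q1 - param * iqr if uselower else float("-inf")
    if action == "remove":
        return [v for v in values if lower <= v <= upper]
    # transform: truncate outlying values to the nearest bound.
    return [min(max(v, lower), upper) for v in values]


# Hypothetical Strength readings with one extreme value.
strength = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(outlier(strength, action="remove"))     # [10, 12, 11, 13, 12, 11, 10, 12]
print(outlier(strength, action="transform"))  # the 95 is truncated to 14.5
```

For these readings the 25th and 75th percentiles are 11 and 12, so the IQR is 1 and the upper bound is 12 + 2.5 * 1 = 14.5.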

Take a look at the following example:

source = "outlier2.csv" | outlier action=remove Strength

The output should look like that shown in the following screenshot:

[Screenshot: The outlier command]

As explained earlier, the outlier Splunk command can be used to either remove or transform the outlying values. In the preceding example, action is set to remove for the outlier command on the Strength field, which removes the events with outlying values from the result. In the preceding screenshot, the last three entries of Strength are not available, as those values of the Strength field were outliers. Setting action to transform instead truncates the outlying values to the threshold limit. Thus, this command can be useful for handling outlying values in the specified fields or, by default, in all the numeric fields.

The rare command

As the name suggests, the rare Splunk command finds the least frequent or rare values of the specified field or field list. This command works exactly opposite to the top command, which returns the most frequent values.

The syntax for the rare command is as follows:

… | rare
    countfield=Count_FieldName
    limit= Limit_Value
    percentfield= Percentage_FieldName
    showcount= true / false
    showperc= true / false
    Field_List… by-clause

The description of the parameters of the preceding query is as follows.

Of all the preceding parameters, Field_List is the compulsory field. The rest are optional and can be used as per requirement:

  • Field List: This is the only compulsory parameter of this command. It specifies the list of fields on which the rare command is run; the rare values of these fields are calculated and shown in the results. The field list can be followed by a by clause to group the results by one or more fields.
  • CountField: This parameter defines the field name (Count_FieldName) where the count of rare values is written. The default value for this parameter is count.
  • Limit: This parameter defines the number of results returned by this command. The default value is 10.
  • PercentField: The fieldname (Percentage_FieldName) in which the percentage values are to be stored can be specified in this parameter.
  • Showcount: If this parameter is set to false, the count field is not shown in the results. The default value of this parameter is true.
  • Showperc: If this parameter is set to false, the percentage field is not shown in the results. The default value of this parameter is true.
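What the rare command computes can be approximated with a short Python sketch. The function name and the sample IP addresses are hypothetical, and the tie-breaking order (first seen wins) is an assumption made for the example rather than documented Splunk behavior.

```python
from collections import Counter


def rare(values, limit=10):
    """Return the `limit` least frequent values as (value, count, percent)."""
    counts = Counter(values)
    total = len(values)
    # Sort ascending by count; ties keep first-seen order (sorted is stable).
    ordered = sorted(counts.items(), key=lambda item: item[1])
    return [(v, c, round(100 * c / total, 2)) for v, c in ordered[:limit]]


# Hypothetical visitor IPs (documentation-range addresses).
ips = ["192.0.2.1"] * 6 + ["192.0.2.2"] * 3 + ["192.0.2.3"]
for row in rare(ips, limit=2):
    print(row)
# ('192.0.2.3', 1, 10.0)
# ('192.0.2.2', 3, 30.0)
```

Each row carries the value, its count, and its share of all events, which corresponds to the count and percentage fields described above.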

The sample query should look like this:

index="web_server" | rare limit=6 countfield=RareIPCount percentfield=PercentageRareValues device_ip

The preceding query will generate output like the following screenshot:

[Screenshot: The rare command]

In the preceding screenshot, for the rare Splunk command, we have used data that contains visitor information from an Apache-based web server. Running this command on the device_ip field with a limit of 6 returned the six least frequent IP addresses, along with their count (RareIPCount) and percentage (PercentageRareValues). Thus, this command can be used to find rare values in the given dataset, along with the count and percentage of their occurrence.
