Putting it all together

The two-step process of first categorizing and then counting the message-based log lines is implemented as a single configuration step in the ML job. However, two key pieces of the ML job configuration need to exist:

  • The definition of categorization_field_name as the field within the Elasticsearch document that contains the text to be categorized by ML
  • The use of the mlcategory field as part of the detector configuration

Note that the mlcategory field is not a field in the raw documents being analyzed; it is similar to a scripted field in that it only comes into existence if categorization_field_name is defined as part of the job configuration.
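To make this concrete, the following is a minimal sketch of how these two pieces fit together in a job's analysis_config (the message field matches the example data used in the steps below; the bucket_span value here is purely illustrative):

    "analysis_config": {
      "bucket_span": "15m",
      "categorization_field_name": "message",
      "detectors": [
        {
          "function": "count",
          "by_field_name": "mlcategory"
        }
      ]
    }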

Let's have a look at the following steps:

  1.  Given a set of example log lines ingested into Elasticsearch that look like the following (in JSON format, only showing the relevant fields):
        {
          "@timestamp": "2016-02-08T15:21:06.000Z",
          "message": "REC Not INSERTED [DB TRAN] Table"
        }
        {
          "@timestamp": "2016-02-08T15:21:06.000Z",
          "message": "Fail To Connect Database   ReActivate Application / Check Connection String"
        }
        {
          "@timestamp": "2016-02-08T15:21:06.000Z",
          "message": "Opening Database = DRIVER={SQL Server};SERVER=127.0.0.1;network=dbmssocn;address=127.0.0.1 1433;DATABASE=svc_prod;;Trusted_Connection=Yes;AnsiNPW=No"
        }
        {
          "@timestamp": "2016-02-08T15:21:23.000Z",
          "message": "REC Not INSERTED [DB TRAN] Table"
        }
        {
          "@timestamp": "2016-02-02T07:36:00.000Z",
          "message": "012 Head Office Link Active 127.0.0.1"
        }
        {
          "@timestamp": "2016-02-02T10:52:00.000Z",
          "message": "Transaction Match In DB / Duplicate Transaction"
        }
  2.  We will leverage the message field as the categorization_field_name in an Advanced job:

Then, in the detector configuration, we can split the count detector using by_field_name of mlcategory:

  3.  The end result is that the ML job will look for unusual counts of documents split using this dynamic categorization. The output may look like the following screenshot:

Here, we see that the ML job has identified a few categories of messages whose volume of occurrence increased during this time frame. An uptick in messages relating to database problems (Fail to Connect Database, DBMS ERROR, and so on) is evident.
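The same anomalies shown in the screenshot can also be retrieved programmatically. As a sketch, assuming a recent version of the Elastic Stack and a hypothetical job ID of it_ops_categorization, the most anomalous records (each carrying the mlcategory value it was split on in its by_field_value field) can be requested with the get records API:

    GET _ml/anomaly_detectors/it_ops_categorization/results/records
    {
      "sort": "record_score",
      "desc": true
    }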

Additionally, notice the category examples column in the table. In it, ML will show you (by default) up to four example log messages that matched and were grouped into that category. In some cases, there is only one example (because subsequent messages were exactly the same); where there are several, they differ only in small details (such as a hostname or IP address). Storing these examples is the only case in which ML keeps a copy of a log message analyzed during the execution of the job. In all other cases, only summarized information about the data is stored.

More information on parameters that control how categorization works and how to view the results of categorization can be found in the Documentation section of the Elastic website at https://www.elastic.co/guide/en/elastic-stack-overview/current/ml-configuring-categories.html, and the Get categories API documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-get-category.html.
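For example, the category definitions and their stored examples can be retrieved with the get categories API referenced above. A sketch, again using the hypothetical it_ops_categorization job ID:

    GET _ml/anomaly_detectors/it_ops_categorization/results/categories
    {
      "page": {
        "size": 10
      }
    }

Each category in the response includes the regular expression and terms that ML derived for it, along with the examples array that backs the category examples column described earlier.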