Ingesting the data

We used this dataset to keep things simple for ingestion; Apache logs don't require much configuration in Logstash to be parsed. The basic structure of our Logstash configuration contains the following:

  • A file input to consume the two files provided in the dataset: one for the month of July and one for the month of August
  • Two filters: a grok filter to parse the log line and a date filter to transform the extracted timestamp into a properly formatted @timestamp
  • One output to tell Logstash to send the data to our Elasticsearch cluster

Here is the complete Logstash pipeline configuration:

input { 
  file { 
    id => "nasa_file" 
    path => "/Users/baha/Downloads/data/*.log" 
    start_position => "beginning" 
    sincedb_path => "/dev/null" 
  } 
} 
 
filter { 
  grok { 
    id => "nasa_grok_filter" 
    match => { "message" => "%{COMMONAPACHELOG}" } 
  } 
  date { 
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"] 
    target => "@timestamp" 
    remove_field => [ "timestamp" ] 
  } 
} 
 
output { 
  elasticsearch { 
    id => "nasa_elasticsearch_output" 
    hosts => "localhost:9200" 
    user => "elastic" 
    password => "********" 
    index => "nasa-%{+YYYY.MM}" 
    template => "/Users/baha/Downloads/data/template.conf" 
    template_name => "nasa" 
    template_overwrite => "false" 
  } 
} 

The file input consumes all of the log files contained in the specified directory (in my case, NASA_access_log_Aug95.log and NASA_access_log_Jul95.log); note that I've added a .log extension to the files, which the dataset doesn't include by default. For testing purposes, I've set the start_position parameter to beginning and combined it with sincedb_path, where the last-read position is normally saved, pointing to /dev/null. As a result, the position is never persisted and the files are consumed from the start whenever I update the configuration or restart Logstash.

Notice that in the filter section, we use the COMMONAPACHELOG grok pattern to parse the log line contained in the event's message field. The date filter then converts the timestamp extracted by grok into a proper date, stores it in the @timestamp field, and removes the original timestamp field.
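For example, for the sample document shown at the end of this section, the date filter performs the following conversion (the -0400 offset in the log is normalized to UTC):

  "timestamp": "31/Aug/1995:23:59:53 -0400"      <- extracted by grok, then removed
  "@timestamp": "1995-09-01T03:59:53.000Z"       <- stored by the date filter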

Finally, we are using an output to index the data in Elasticsearch using the template.conf template file to set the index settings and mapping.

See the template.conf file in the GitHub repository for this book at https://github.com/PacktPublishing/Machine-Learning-with-the-Elastic-Stack/blob/master/Chapter07/template.conf for more information.

The template is essential for mapping field names to their desired types (bytes as integer and response as keyword, for example). Without it, those fields may not be usable in Elasticsearch aggregations or in ML job configurations.
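To give an idea of its structure, the following is a minimal sketch of what such a template could contain, assuming the Elasticsearch 7.x legacy index template format (the field names come from the grok output shown later; the actual template.conf in the repository may define additional settings and fields):

{
  "index_patterns": ["nasa-*"],
  "mappings": {
    "properties": {
      "clientip": { "type": "keyword" },
      "verb": { "type": "keyword" },
      "request": { "type": "keyword" },
      "response": { "type": "keyword" },
      "bytes": { "type": "integer" }
    }
  }
}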

In this example, Logstash has been configured to use centralized monitoring and management. The required settings have been added at the end of the logstash.yml file, which is located in Logstash's config directory, as shown here:

# X-Pack Monitoring 
xpack.monitoring.enabled: true 
xpack.monitoring.elasticsearch.username: elastic 
xpack.monitoring.elasticsearch.password: ********* 
xpack.monitoring.elasticsearch.url: ["http://localhost:9200"] 
# X-Pack Management 
xpack.management.enabled: true 
xpack.management.pipeline.id: ["nasa-apache-log"] 
xpack.management.elasticsearch.username: elastic 
xpack.management.elasticsearch.password: ********* 
xpack.management.elasticsearch.url: ["http://localhost:9200"] 

Only supply username/password credentials if X-Pack Security is enabled.

In this manner, you will be able to add the Logstash pipeline in Kibana, and Logstash will connect to Elasticsearch and load it at bootstrap. To proceed, follow these steps:

  1. Connect to Kibana and go to the Management | Logstash | Pipelines section.
  2. Click on the Create Pipeline button.

This will display the following UI:

  3. From there, launch Logstash and you should see data coming through the pipeline and being indexed in Elasticsearch, as in the following monitoring screenshot:

The corresponding index monitoring stats are as follows:

To summarize, we have ingested about 3.5 million documents, representing about 1 GB of data on disk. If you check the document structure by querying the Elasticsearch index, you should see something like the following:

{ 
    "host": "mbp-de-baha.lan", 
    "ident": "-", 
    "auth": "-", 
    "@timestamp": "1995-09-01T03:59:53.000Z", 
    "path": "/Users/baha/Downloads/data/NASA_access_log_Aug95.log", 
    "bytes": "39017", 
    "request": "/images/kscmap-small.gif", 
    "@version": "1", 
    "response": "200", 
    "httpversion": "1.0", 
    "message": "cindy.yamato.ibm.co.jp - - [31/Aug/1995:23:59:53 -0400] "GET /images/kscmap-small.gif HTTP/1.0" 200 39017", 
    "verb": "GET", 
    "clientip": "cindy.yamato.ibm.co.jp" 
}
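If you want to verify the volume and structure yourself, the following requests (run from the Kibana Dev Tools console, or with curl against http://localhost:9200) return the total document count across the monthly indices and one sample document; these are generic example requests, not part of the book's repository:

GET nasa-*/_count

GET nasa-*/_search
{
  "size": 1
}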

At this stage, we are now ready to create some ML jobs.
