Layer-based ingestion

A threat hunting architecture relies on rich, reliable data ingestion that allows you to detect and investigate anomalous behaviors. In our scenario, we need both data coming from end user workstations and data coming from the network. Luckily, we have Packetbeat and Winlogbeat, which capture network activity and ship logs generated on Windows machines, respectively. These can be downloaded from https://www.elastic.co/downloads/beats, where all of the Beats are listed:

As you can see, they are not the only Beats available for ingesting data; there are different Beats for different purposes. Each Beat has a set of modules or metricsets that can be enabled, depending on the data you want to ingest. For example, here are the ones available for Packetbeat:

In our example of detecting DNS tunneling, we will need to enable the collection of DNS data so that we can see and detect unusual outbound DNS queries.
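To make this concrete, here is a minimal sketch of what enabling DNS capture in packetbeat.yml might look like. The exact keys differ between Beats versions (this follows the 5.x layout, matching the beat version in the example document later in this section), and the port list is an assumption for a typical environment:

packetbeat.protocols.dns:
  # Capture DNS traffic on the standard port; adjust for your environment.
  ports: [53]
  # Include the authority and additional sections so that query details
  # are indexed alongside each request.
  include_authorities: true
  include_additionals: true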

In general, for data that does not originate from the Beats framework, it is advisable to enrich it as much as possible before ingestion. The more context a document carries, the better it can be understood and the more comprehensively it can be analyzed. Fortunately, data originating from Beats already comes rich with context.
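The example document shown later in this section carries fields such as geoip and highest_registered_domain, which suggests this kind of enrichment was done in a pipeline such as Logstash. The following filter section is only a hypothetical sketch; the field names and the community tld plugin are assumptions, not a record of the actual setup:

filter {
  geoip {
    # Attach geographical context to the client address (assumed field name).
    source => "client_ip"
  }
  tld {
    # Community plugin (logstash-filter-tld) that can split a DNS name
    # into parts such as the registered domain (assumed field name).
    source => "domain"
  }
}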

Another aspect to consider is the index naming convention, particularly if you want to correlate data across indices. In our example, our environment is made up of three index patterns, as shown in the following screenshot:

This hierarchy-based index structure is laid out so that every time you add a layer of infrastructure, the index name follows the same naming pattern. As a consequence, if you need to see a specific layer of data (such as the network data), you just select the Packetbeat index pattern; if you want to see all of the data, you simply choose the top-level index pattern, as sketched below. This approach works in every view in Kibana: Discover, Visualize, and Machine Learning.
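For instance, assuming the default Beats naming conventions (the actual patterns depend on your own convention and are not visible in the screenshot), the hierarchy might look like this:

winlogbeat-*      # Windows host layer only
packetbeat-*      # network layer only
*beat-*           # top-level pattern spanning every layer at once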

Elastic ML needs a minimum amount of data to build an effective model for anomaly detection. Essentially, this comes down to how quickly ML can get its first estimates of the various model parameters. For sampled metrics such as mean, min, max, and median, the minimum is either eight non-empty bucket spans or two hours, whichever is greater. For all other non-zero/null metrics and count-based quantities, it is four non-empty bucket spans or two hours, whichever is greater. For the count and sum functions, empty buckets matter, so the requirement is the same as for sampled metrics (eight buckets or two hours). For the rare function, it is typically around 20 bucket spans. For example, with a 10-minute bucket span, a mean detector needs max(8 × 10 minutes, 2 hours) = 2 hours of data, whereas a 30-minute bucket span pushes that to 8 × 30 minutes = 4 hours. Population models can be faster, but this depends on how many population members interact per bucket.

So, in our scenario, as shown in the following screenshot, we have about a month's worth of data, which represents about 7 million documents for Packetbeat:
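If you want to verify the volume yourself, one way is the count API; the request below assumes the default packetbeat-* index naming:

GET packetbeat-*/_count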

This index will help us to understand what's happening on the network, such as which DNS requests have been made, as the following example shows:

{
  "server": "",
  "dns": { ... },
  "method": "QUERY",
  "proc": "",
  "source": { ... },
  "dns_ips": "...",
  "client_port": 63446,
  "host": "localhost",
  "tags": [],
  "subdomain": "elastic.",
  "bytes_out": 51,
  "query": "class IN, type A, elastic.slack.com.",
  "resource": "elastic.slack.com.",
  "client_proc": "",
  "ip": "...",
  "port": 53,
  "type": "dns",
  "data_source": "elastic",
  "client_server": "",
  "geoip": { ... },
  "@version": "1",
  "geo_ip_dns": { ... },
  "client_ip": "...",
  "bytes_in": 35,
  "highest_registered_domain": "slack.com",
  "beat": {
    "version": "5.1.1",
    "hostname": "HR08",
    "name": "HR08"
  },
  "status": "OK",
  "dest": { ... },
  "transport": "udp",
  "domain": "slack.com.",
  "@timestamp": "2017-12-12T23:59:58.153Z",
  "responsetime": 22
}

The preceding document describes a normal DNS request. Remember that in the case of DNS tunneling, the important bit is the subdomain field, which is where the exfiltrated data will be hidden as an encoded (often encrypted) payload. We will see how this subdomain field is leveraged in the ML job that we will use in our investigation.
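As a preview, here is a minimal, hypothetical sketch of what such a job configuration could look like, built on Elastic ML's high_info_content function, which measures how much information a string carries. The job name, bucket span, influencers, and endpoint form are assumptions that vary by stack version; the field names come from the example document above:

PUT _ml/anomaly_detectors/dns_tunneling_sketch
{
  "description": "Sketch: unusually information-rich DNS subdomains",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "high_info_content",
        "field_name": "subdomain",
        "over_field_name": "highest_registered_domain"
      }
    ],
    "influencers": [ "highest_registered_domain", "client_ip" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}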
