Data processing

Data processing plays a very important role in parsing and enriching data so that insights can be created faster and visualized with the required analytics. Data processing basically includes event, timestamp, and host configuration.

Event configuration

Any data uploaded on Splunk is termed an event. An event can be anything from a log entry, error log, or usage log to machine-generated data from devices, servers, or any other source. Events are used to create visualizations and gain insight about the source in the Splunk environment, so it is important that events are processed properly, depending on the data and its source. The settings and configuration used to process events can then be stored in a source type.
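For illustration, such stored settings might look like the following source type stanza in props.conf; the satellite_data name and the attribute values here are hypothetical and would depend on the actual data:

# Hypothetical source type stanza; all values are illustrative only
[satellite_data]
CHARSET = UTF-8
SHOULD_LINEMERGE = false
TIME_FORMAT = %Y-%m-%d %H:%M:%S

The attributes used in this stanza are covered in detail in the following sections.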

Character encoding

Splunk supports many languages for the internationalization of Splunk Enterprise. The default character set encoding on Splunk Enterprise is UTF-8, but it has inbuilt support for various other encodings used internationally. If the data is not UTF-8 or contains non-ASCII data, Splunk tries to convert it to UTF-8, unless the user specifies in the Splunk configuration file that it should not be converted.

Splunk supports various character sets, but it always uses UTF-8 by default. If the data uses another encoding, it needs to be configured in the props.conf file. The following lines in props.conf force the data uploaded from the SatelliteData host to be parsed using a Russian character set encoding:

[host::SatelliteData]
CHARSET=ISO-8859-5

Splunk also supports automatic detection of character set encoding. When the data uploaded to a specific source type or a specific host contains a mixture of various character sets, Splunk's algorithm can automatically detect the character set encoding and apply it to the data accordingly. props.conf needs to be modified with the following lines to make the SatelliteData host use automatic encoding detection rather than the default UTF-8 character set:

[host::SatelliteData]
CHARSET=AUTO

Event line breaking

A single event can consist of a few words, a single line, or multiple lines. The Splunk Enterprise engine has the capability to detect events automatically. However, since data comes in many types and formats, events may not always be detected and broken apart properly. Manual line-breaking configuration can therefore be required when automatic line breaking does not detect multiline events correctly.

Event line breaking can be configured based on a regular expression, a specific word that occurs at the start of every new event, a specific word that ends an event, the occurrence of a new date or time, and so on.

The following is a list of event line-breaking attributes that can be configured from Splunk Web while uploading data or set in the props.conf file:

  • TRUNCATE=<NUMBER>: This attribute accepts a number of bytes after which lines are truncated. If the data arrives as long lines of which only the first few bytes are useful, this attribute can be used to truncate the part that is not required.

    For example, TRUNCATE=500 truncates each line after 500 bytes of data, so any line longer than 500 bytes will be cut off. Generally, truncation is used to avoid memory leaks and search slowdowns, and to avoid indexing useless data.

  • LINE_BREAKER=<REGULAR_EXPRESSION>: This attribute is used to break the event at every occurrence of a specific regular expression. Whenever the regular expression is detected, the data that follows it is treated as a new event.
  • SHOULD_LINEMERGE = [true or false]: This attribute combines several lines into a single event until the condition of any of the following attributes is satisfied:
    • BREAK_ONLY_BEFORE_DATE = [true or false]: Here, a new event is created whenever a new line with a date is detected
    • BREAK_ONLY_BEFORE = <REGULAR_EXPRESSION>: A new event is created whenever the specified regular expression is encountered at the start of a new line
    • MUST_BREAK_AFTER = <REGULAR_EXPRESSION>: Splunk creates a new event for the next input when the specified regular expression occurs on the given line
    • MUST_NOT_BREAK_AFTER = <REGULAR_EXPRESSION>: Splunk does not break the event after a line matching the given regular expression until an event satisfying the MUST_BREAK_AFTER condition is found
    • MAX_EVENTS = <NUMBER>: This number specifies the maximum number of lines a single event can contain.

The following is an example of configuring the event breaking in the props.conf file:

[SatelliteData]
SHOULD_LINEMERGE = true
MUST_BREAK_AFTER = </data>

The preceding change in props.conf instructs Splunk to break events after the occurrence of </data> for the SatelliteData source type.
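When every event reliably starts with a known pattern, line merging can also be disabled entirely and LINE_BREAKER used on its own, which is generally faster. The following is a minimal sketch that reuses the SatelliteData source type and assumes, purely for illustration, that every event begins with an ISO-style date:

# The first capture group marks the event boundary; a new event
# starts at each line beginning with a date such as 2016-01-15
[SatelliteData]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}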

Timestamp configuration

A timestamp is one of the most important parameters of data. It is very useful for creating visualizations and insights by time, that is, the number of errors and crashes that occurred in one day, in the last 10 minutes, in the last month, and so on. A timestamp is required to correlate data over time, create time-based visualizations, run searches, and so on. The data we upload from different sources may or may not have a timestamp in it. Data that has a timestamp should be parsed with the correct timestamp format, and for data that does not have one, Splunk automatically adds a timestamp during the upload for better time-based visualizations.

Splunk's procedure for assigning a timestamp to data is based on various parameters. It first looks at the timestamp settings in props.conf. If no settings are found in props.conf, it checks for the source type's timestamp format in the events. If the event doesn't have a timestamp, it tries to fetch a date from the source or filename. If it is still not able to find a date, it assigns the current time to the event. In most cases, no specific configuration is required, since Splunk checks almost all the possible options to assign a timestamp to the data.
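If timestamp extraction is to be skipped altogether, for example, when events carry no usable timestamp and the index time is acceptable, props.conf also offers the DATETIME_CONFIG attribute. The following is a minimal sketch for a hypothetical source type:

# CURRENT stamps each event with the time at which it is indexed,
# skipping timestamp extraction entirely
[satellite_data]
DATETIME_CONFIG = CURRENT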

In some cases, if the timestamp is not parsed properly, it can be configured during the upload of data on the Source type settings page, or it can be configured manually in the props.conf file.

The following are the attributes that can be configured in props.conf for timestamp configuration:

  • TIME_PREFIX = <REGULAR_EXPRESSION>: This helps us search for a specific regular expression that prefixes the timestamp. For example, if in your data the timestamp appears after the <FORECAST> tag, then define that tag as TIME_PREFIX.
  • MAX_TIMESTAMP_LOOKAHEAD = <NUMBER>: This attribute specifies how far into the event, in characters, Splunk should look for the timestamp. For example, if the timestamp always appears within the first 15 characters of the event, then MAX_TIMESTAMP_LOOKAHEAD=15 needs to be configured.
  • TIME_FORMAT = <STRPTIME_FORMAT>: The format of the timestamp, expressed in strptime() syntax.
  • TZ = <TIMEZONE>: The time zone of the data, specified as a zoneinfo name. For example, TZ=Asia/Kolkata specifies the time zone of India (UTC+5:30).
  • MAX_DAYS_AGO = <NUMBER>: This setting can be very useful if you do not want to upload older data to Splunk. Suppose that the user is interested in uploading only the last month's data; this attribute can then be configured, and any event whose timestamp is older than the specified number of days will not be uploaded to Splunk. The default value of this parameter is 2000 days.
  • MAX_DAYS_HENCE = <NUMBER>: This setting rejects data whose timestamp lies more than the specified number of days in the future. The default value for this attribute is 2, so an event dated more than two days after the present day is ignored and not uploaded.

Let's look at an example of a timestamp extraction.

The following settings in props.conf will search for a timestamp after the word FORECAST, parse it in the format specified below with the Asia/Kolkata (UTC+5:30) time zone, and reject data that is more than 30 days old:

[SatelliteData]
TIME_PREFIX = FORECAST: 
TIME_FORMAT = %b %d %H:%M:%S %Z%z %Y
MAX_DAYS_AGO = 30
TZ = Asia/Kolkata
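For reference, a raw event that the preceding settings would parse might look like the following; the line is hypothetical and only illustrates the %b %d %H:%M:%S %Z%z %Y pattern placed after the word FORECAST:

FORECAST: Jan 15 10:30:45 IST+0530 2016 Satellite link nominal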

Host configuration

A hostname or host is the name used to identify the source from which the data is uploaded to Splunk. It is a default field, and Splunk assigns a host value to all data that gets uploaded. The default host value is generally the hostname, IP address, file path, or TCP/UDP port number from which the data was generated.

Let's take an example where data is uploaded from four different web servers located at Mumbai, Jaipur, Delhi, and Bangalore. All the data comes from web servers, so it will be available under the same source type. In such situations, it becomes difficult to get insights about only one specific location. So, a hostname identifying where the data is uploaded from can be assigned to uniquely identify each source and to create visualizations and insights specific to that source. If a user is interested in the number of failures, server downtime, and other insights for only one web server, the hostname assigned to that location's web server can be used as a filter to fetch information for that source.
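As a minimal sketch, such per-location host values could be assigned in inputs.conf at input time; the monitored paths here are hypothetical:

# Hypothetical monitored inputs, one per location
[monitor://F:\WebLogs\Mumbai]
host = Mumbai

[monitor://F:\WebLogs\Jaipur]
host = Jaipur

Similar stanzas would follow for the Delhi and Bangalore web servers; the details of this configuration are covered in the following sections.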

As mentioned earlier, Splunk automatically tries to assign a hostname if one is not already specified or configured by the user, either in transforms.conf or while defining the source type during data input configuration. In many situations, there can be a need for manual configuration of hostnames for better insights and visualizations.

The default host value can be configured from the inputs.conf file as follows:

[default]
host = <string>

Here, the <string> value set as host, generally the IP address or domain name of the source, is used as the default host value for the data.

Tip

Never include quotes (") in the host value. For example, host=Mumbai is valid, but host="Mumbai" is the wrong way to assign a host value in the inputs.conf file.

In a large distributed environment, data may be uploaded via forwarders or from a directory path, and the hostname may need to be assigned depending on the directory in which the data is classified or on the basis of events. Splunk can be configured to handle such complex scenarios, where the hostname is assigned either statically or dynamically, based on the directory structure or on the events in the data.

Configuring a static host value – files and directories

This method is useful when the data received from one specific file or directory is to be assigned a single host value. The following procedure defines a single host value for data sourced from a specific file or directory.

Let's look at the Web Console method:

  1. Navigate to Settings | Data Input | Files and Directories from Splunk Web Console.
  2. If the settings are to be applied on the existing input, choose the respective input to update or create a new input in order to configure the host value.
  3. Under the Set host drop-down menu, choose Constant value, and in the Host field value textbox, enter the hostname that you wish to set for the respective input source.
  4. Click on Save/Submit to apply the settings.

Now, Splunk will ensure that any data uploaded from the configured data input will be assigned the specified host value.

Let's see the Config File method. Here, static host values can also be configured manually by modifying the inputs.conf file as follows:

[monitor://<path>]
host = <Specify_host_name>

For an existing data input, simply replacing the host value ensures that any data uploaded in the future is assigned the specified host value. If the input does not yet exist, an entry similar to the preceding one can be created with the path of the file/directory from which the data will be uploaded and the required host value.

Here is an example. The following settings in inputs.conf ensure that any data getting uploaded from the Data folder of the F drive will have the TestHost host value:

[monitor://F:\Data]
host = TestHost

Configuring a dynamic host value – files and directories

This configuration is useful when the host can only be differentiated by the filename or by a regular expression applied to the source path. Generally, this is useful when archived data is uploaded to Splunk and the filename carries information about the host, or when a single forwarder fetches data from different sources and then uploads it to Splunk.

Let me explain this with an example. Suppose that data from the following folders is uploaded to Splunk:

  • F:\Data\Ver\4.4
  • F:\Data\Ver\4.2
  • F:\Data\Ver\5.1

If, for the preceding scenario, the data uploaded from the 4.4 folder should have the Kitkat host value, the 4.2 folder Jellybean, and the 5.1 folder Lollipop, then a dynamic host value configuration is required.

The steps for the Web Console method are as follows:

  1. Navigate to Settings | Data Input | Files and Directories from Splunk Web Console.
  2. If the settings are to be applied on the existing input, choose the respective input to update or create a new input in order to configure the host value.
  3. Under the Set host drop-down menu, you will find the following options:
    • Regex on Path: This option can be chosen if the hostname is to be extracted using a regular expression on the path.

      The preceding example can be implemented using this method by setting the Regular expression textbox to F:\\Data\\Ver\\([^\\]+).

    • Segment in Path: This option is useful in scenarios where the path segment can be used as a host value.

      The preceding example can also be implemented by choosing this option and setting the Segment Number textbox to 4; in F:\Data\Ver\4.4, 4.4 is the fourth segment of the path.

  4. Click on Save/Submit to apply the settings.

With the Config File method, dynamic host values can be configured manually by modifying the inputs.conf file as follows. For the preceding example, inputs.conf will look like this:

  • Regex on Path:
    [monitor://F:\Data\Ver]
    host_regex = F:\\Data\\Ver\\([^\\]+)

Or, the inputs.conf file will look as follows:

  • Segment in Path:
    [monitor://F:\Data\Ver]
    host_segment = 4

Configuring a host value – events

Splunk Enterprise supports assigning different hostnames based on the events in the data. Event-based host configuration plays a very important role when the data is forwarded by a forwarder, or when the data comes from the same file/directory and hostname classification cannot be done on the basis of files/directories. Event-based host configuration is done by modifying the config files, as follows.

The transforms.conf file should have the following settings, which include a unique name that the props.conf file will then reference:

[<name>]
REGEX = <regex>
FORMAT = host::$1
DEST_KEY = MetaData:Host

Now, the props.conf file needs to be configured in reference to transforms.conf. Its stanza header will be a source or source type, and the TRANSFORMS parameter will reference the same name that we used in transforms.conf.

For props.conf, the code block should look like this:

[<source or sourcetype>]
TRANSFORMS-<class> = <name>

Let's suppose that data uploaded to Splunk Enterprise from the test.txt file contains events from various sources, and while uploading it, the user needs to assign different hosts based on the content of each event.

This is how the transforms.conf and props.conf files need to be configured to implement this host configuration (the host value for matching events):

# For the transforms.conf file
  [EventHostTest]
  REGEX = Eventsoriginator:\s(\w+-?\w+)
  FORMAT = host::$1
  DEST_KEY = MetaData:Host
# For the props.conf file
  [TestTXTUpload]
  TRANSFORMS-test = EventHostTest

The preceding configuration ensures that a host is assigned based on the regular expression detected in each event.
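For illustration, a hypothetical event from test.txt that the EventHostTest regular expression would match might look like this; it would result in the Mumbai-Web1 host value being assigned to the event:

Eventsoriginator: Mumbai-Web1 User login failed for admin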

All the host configuration methods explained earlier need to be implemented before the data is uploaded to Splunk. If the data is already uploaded, the host configuration can be corrected either by deleting and reindexing the data or by creating tags for the incorrect host values. One more approach that can be useful when the data is already indexed is the use of lookup tables, as sketched below.
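The following is a minimal sketch of the lookup approach, assuming a hypothetical CSV file named host_corrections.csv with host and corrected_host columns; an automatic lookup adds a corrected host field at search time without touching the indexed data:

# For the transforms.conf file; the CSV resides in the app's lookups folder
[host_corrections]
filename = host_corrections.csv

# For the props.conf file; the lookup is applied automatically at search time
[TestTXTUpload]
LOOKUP-fixhost = host_corrections host OUTPUT corrected_host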
