An encounter with Splunk : Boot up


People think of Splunk as a SIEM, but it's not just a SIEM, and it's not just for logs. I think it can be used for anything that is machine generated; you can use it instead of R, Spark, Python, or anything else. Yes, it is costly, but if you already have it, you can use it for much, much more. The opportunities are endless!

I had a recent encounter with Splunk. Guys, after going north to south through the Splunk documentation, and struggling with everything from data ingestion to formatting a small table, I have started loving Splunk. My experience with it has been awesome. I really like dissecting software to see how its underlying gears work, but I was not able to dissect this mammoth. License issues .. :P. What I was able to do, though, is learn enough to make someone an expert on it.

Anyway, coming back to my point: this blog will help you get up and running with Splunk. Basically, it is meant to be a single document that can make you a Splunk expert .. 🙂



What is Splunk ?

How to install Splunk for playing around?

How data enters into Splunk via Event processing pipeline ?

What are the processing components in Splunk ?

What is an Index ?

How indexing works ?

What are Custom Indexes?

What are Fields ?

What is Splunk ?

Splunk Enterprise performs three key functions as it processes data:

  1. It ingests data from files, the network, or other sources.
  2. It parses and indexes the data.
  3. It runs searches on the indexed data.

How to install Splunk for playing around?

You will need Docker. I assume you have it installed; otherwise let me know and I will write a post for it. Here is the command to get a Splunk instance running within minutes.

$ docker run -d -e "SPLUNK_START_ARGS=--accept-license" -e "SPLUNK_USER=root" -p "8000:8000" splunk/splunk

You might need a Linux machine for this.
Let me know if you come across any hurdles. We can cross them together.

How data enters into Splunk via Event processing pipeline ?

Event processing has 2 stages

  • Parsing pipeline
  • Indexing pipeline

The parsing pipeline does the following:

  1. Extracts a set of default fields for each event, including host, source, and sourcetype.
  2. Configures character set encoding.
  3. Identifies line termination using line-breaking rules.
  4. Identifies timestamps, or creates them if they don’t exist.
  5. Masks sensitive event data, if Splunk is set up to do so.
  6. Applies custom metadata to incoming events, if configured to do so.
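
To make these steps a bit more concrete, here is a toy Python sketch of what a parsing stage does. It is only my illustration and not how Splunk is implemented: the function name parse_stream, the timestamp layout, and the masking rule (hide the first 12 digits of a 16-digit number) are all assumptions made up for this example.

```python
import re
from datetime import datetime

# Assumed timestamp layout for this toy example: "2024-01-15 09:32:22"
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
# Assumed masking rule: hide the first 12 digits of a 16-digit number
CARD_PATTERN = re.compile(r"\b\d{12}(\d{4})\b")


def parse_stream(raw_bytes, host, source, sourcetype, charset="utf-8"):
    """Turn a raw byte stream into a list of event dicts (toy model)."""
    text = raw_bytes.decode(charset)           # step 2: character set encoding
    events = []
    for line in text.splitlines():             # step 3: line breaking
        if not line.strip():
            continue
        match = TS_PATTERN.match(line)         # step 4: find a timestamp...
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        else:
            ts = datetime.now()                # ...or create one if missing
        masked = CARD_PATTERN.sub("X" * 12 + r"\1", line)  # step 5: masking
        events.append({
            "_time": ts,
            "_raw": masked,
            "host": host,                      # step 1: default fields
            "source": source,
            "sourcetype": sourcetype,
        })
    return events


raw_data = (
    "2024-01-15 09:32:22 payment ok card=1234567890123456\n"
    "no timestamp on this line\n"
).encode("utf-8")
events = parse_stream(raw_data, "web01", "/var/log/app.log", "app_log")
for event in events:
    print(event["_time"], event["host"], event["_raw"])
```

The second line has no timestamp, so the sketch stamps it with the current time, which mirrors step 4 in spirit.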

The indexing pipeline does the following:

  1. Breaks all events into segments that can then be searched.
  2. Builds the index data structures.
  3. Writes the raw data and index files to disk, where post-indexing compression occurs.
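
The three indexing steps can be sketched in a few lines of Python. Again, this is a toy under my own assumptions and not Splunk's on-disk format: the segmentation rule (split on anything that is not a letter, digit, or underscore), the in-memory inverted index, and the use of zlib for compression are all stand-ins chosen for illustration.

```python
import re
import zlib
from collections import defaultdict

# Assumed segmentation rule for this toy: split on anything that is
# not a letter, digit, or underscore.
SEGMENT_SPLIT = re.compile(r"[^A-Za-z0-9_]+")


def build_index(raw_events):
    """Return (inverted_index, compressed_rawdata) for a list of strings."""
    inverted = defaultdict(set)
    for event_id, raw in enumerate(raw_events):
        for segment in SEGMENT_SPLIT.split(raw.lower()):   # 1. segments
            if segment:
                inverted[segment].add(event_id)            # 2. index structure
    # 3. keep the raw data, compressed
    rawdata = zlib.compress("\n".join(raw_events).encode("utf-8"))
    return inverted, rawdata


def search(inverted, rawdata, term):
    """Look up a term in the index, then pull matching events from rawdata."""
    events = zlib.decompress(rawdata).decode("utf-8").split("\n")
    return [events[i] for i in sorted(inverted.get(term.lower(), ()))]


raw_events = [
    "ERROR disk full on /dev/sda1",
    "INFO user fred logged in",
    "ERROR login failed for user fred",
]
inverted, rawdata = build_index(raw_events)
print(search(inverted, rawdata, "error"))
print(search(inverted, rawdata, "fred"))
```

The point of the sketch is the split of responsibilities: the index answers "which events contain this term", and the raw data is only read to return the matching events.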

In particular, the parsing pipeline actually consists of three pipelines: parsing, merging, and typing, which together handle the parsing function.

What are the processing components in Splunk ?

There are three main types of processing components:

  1. Forwarders
  2. Indexers
  3. Search heads

Forwarders ingest data. The breakdown between the parsing and indexing pipelines is of relevance mainly when deploying forwarders.

Forwarders are of two types:

  1. Heavy forwarders retain most of the functionality of a full Splunk Enterprise instance. They can run raw data through the parsing pipeline and then forward the parsed data on to indexers for final indexing.
  2. Universal forwarders do not parse data in this way. They perform minimal processing on the incoming data streams and forward the raw data to an indexer, also known as the receiver, which then processes it through both pipelines.

Both types of forwarders tag data with metadata such as host, source, and source type, before forwarding it on to the indexer.

Indexers and search heads are built from Splunk Enterprise instances that you configure to perform the specialized function of indexing or search management, respectively.

What is an Index ?

Splunk Enterprise stores all of the data it processes in indexes. An index is a collection of databases, which are subdirectories located in $SPLUNK_HOME/var/lib/splunk. Indexes consist of two types of files: rawdata files and index files. See How Splunk Enterprise stores indexes.

Splunk Enterprise comes with a number of preconfigured indexes, including:

main: This is the default Splunk Enterprise index. All processed data is stored here unless otherwise specified.

_internal: Stores Splunk Enterprise internal logs and processing metrics.

_audit: Contains events related to the file system change monitor, auditing, and all user search history.

When you add data, the indexer processes it and stores it in an index. By default, data you feed to an indexer is stored in the main index, but you can create and specify other indexes for different data inputs. Within an index, the directories that hold the data are called buckets, and they are organized by age.

How indexing works ?

Splunk Enterprise can index any type of time-series data (data with timestamps). When Splunk Enterprise indexes data, it breaks it into events, based on the timestamps.

Event processing occurs in two main stages, parsing and indexing.

While parsing, Splunk Enterprise performs a number of actions, as listed in the section on the event processing pipeline above.

What are Custom Indexes?

The main index, by default, holds all your events. The indexer also has a number of other indexes for use by its internal systems, as well as for additional features such as summary indexing and event auditing. You can add indexes using Splunk Web, the CLI, or indexes.conf.

Why you might need multiple indexes:

  1. To control user access.
    1. The main reason you’d set up multiple indexes is to control user access to the data that’s in them. When you assign users to roles, you can limit user searches to specific indexes based on the role they’re in.
  2. To accommodate varying retention policies.
    1. In addition, if you have different policies for retention for different sets of data, you might want to send the data to different indexes and then set a different archive or retention policy for each index.
  3. To speed searches in certain situations.
    1. Another reason to set up multiple indexes has to do with the way search works. If you have both a high-volume/high-noise data source and a low-volume data source feeding into the same index, and you search mostly for events from the low-volume data source, the search speed will be slower than necessary, because the indexer also has to search through all the data from the high-volume source. To mitigate this, you can create dedicated indexes for each data source and send data from each source to its dedicated index. Then, you can specify which index to search on. You’ll probably notice an increase in search speed.
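
Here is a small Python toy for the first and third reasons above. It is a sketch under loud assumptions: the index names, the role names, and the linear scan are all hypothetical, and a real indexer uses its index files rather than reading every event. It only shows why limiting a search to a dedicated index means touching far less data, and how a role can be restricted to certain indexes.

```python
# Toy model: each index is just a named list of events.
indexes = {
    "firewall": [f"fw event {n}" for n in range(10000)],   # high volume, noisy
    "payroll": ["payroll run ok", "payroll run failed"],    # low volume
}

# Hypothetical role-to-index mapping, standing in for role-based access.
role_indexes = {
    "netops": {"firewall"},
    "finance": {"payroll"},
    "admin": {"firewall", "payroll"},
}


def search(role, term, index_names=None):
    """Return (matches, events_scanned) for the indexes a role may search."""
    allowed = role_indexes[role]
    targets = allowed if index_names is None else allowed & set(index_names)
    matches, scanned = [], 0
    for name in sorted(targets):
        for event in indexes[name]:
            scanned += 1
            if term in event:
                matches.append(event)
    return matches, scanned


# Searching everything also scans the noisy index.
print(search("admin", "failed"))
# Naming the index keeps the scan small.
print(search("admin", "failed", ["payroll"]))
# A role only ever sees the indexes it was given.
print(search("netops", "failed", ["payroll"]))
```

In this toy, the same search finds the same event either way, but the targeted version looks at 2 events instead of 10002.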

Sending events to a specific index:

By default, all external events go to the index called main. However, you might want to send some events to other indexes.

Important: To send events to a specific index, the index must already exist on the indexer. If you route any events to an index that doesn’t exist, the indexer will drop those events.

The following example inputs.conf stanza sends all data from /var/log to an index named fflanda:

[monitor:///var/log]
disabled = false
index = fflanda

Note: If you want to use custom source types and hosts, you should define them before you start indexing, so that the indexing process can tag events with them. After indexing, you cannot change the host or source type assignments.

If you neglect to create the custom source types and hosts until after you have begun to index data, you have two choices: re-index the data, so that the custom source types and hosts apply to the existing data as well as to new data, or manage the issue at search time by tagging the events with alternate values.

What are Fields ?

Fields appear in event data as searchable name/value pairings, such as user_name=fred or an ip_address= pairing. Fields are the building blocks of Splunk searches, reports, and data models. When you run a search on your event data, Splunk software looks for fields in that data.

What is Field Extraction ?

As Splunk software processes events, it extracts fields from them. This process is called field extraction.

  • Automatically extracted fields
    • Splunk software automatically extracts host, source, and sourcetype values, timestamps, and several other default fields when it indexes incoming events
    • It also extracts fields that appear in your event data as key=value pairs. This process of recognizing and extracting k/v pairs is called field discovery. You can disable field discovery to improve search performance.
    • When fields appear in events without their keys, Splunk software uses pattern-matching rules called regular expressions to extract those fields as complete k/v pairs. With a properly configured regular expression, Splunk Enterprise can extract user_id=johnz from events like Nov 15 09:32:22 00224 johnz
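
Both kinds of extraction are easy to try out in plain Python. This is a minimal sketch and not Splunk's extraction engine: the key=value pattern is a simplification of field discovery, and the second regular expression is just my guess at one pattern that fits the single example line above (timestamp, a number, then the user). The function names are made up for this post.

```python
import re

# Field discovery (toy version): pick up anything that looks like key=value.
KV_PATTERN = re.compile(r"(\w+)=(\S+)")

# Assumed layout of the example line: "<Mon DD HH:MM:SS> <number> <user>".
# The field name user_id comes from the text above; the regex is only one
# pattern that happens to work for this line format.
USER_PATTERN = re.compile(
    r"^\w{3} +\d+ \d{2}:\d{2}:\d{2} \d+ (?P<user_id>\w+)$"
)


def discover_fields(raw):
    """Return every key=value pair found in the event as a dict."""
    return dict(KV_PATTERN.findall(raw))


def extract_user(raw):
    """Return {'user_id': ...} if the event matches the assumed layout."""
    match = USER_PATTERN.match(raw)
    return match.groupdict() if match else {}


print(discover_fields("action=login user_name=fred status=ok"))
print(extract_user("Nov 15 09:32:22 00224 johnz"))
```

The named group in the second pattern is what turns a bare value like johnz into a complete user_id=johnz pair.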

Index time does the following :

Search time does the following :

If you have made it this far, it means you have already become an expert. Do let me know if you have any suggestions for this blog; I would be happy to incorporate them.

Enjoy learning !


Abhay Dandekar ! 🙂

