Chapter 8. Getting Data in One Place

Once you have collected your data, you need an environment where you can process it and produce results. In this chapter, I provide some notes on an architecture to facilitate the rapid development and operational deployment of security analysis software (analytics1).

There are a number of ways to implement this; the version you’ll see in Figure 8-1 is a high-level diagram for a basic environment. In general, these environments should have the following attributes:

  • Robust, universal access to all sensor data. The term “universal” here is used in lieu of “centralized”—it’s not critical that the data be in one place, but it is critical that anyone implementing analytical code have uniform access to all the data.

  • Access to a Turing-complete language. This differentiates an analysis environment from the classic security console. Complex analytics require access to a general-purpose programming language and the ability to build constructs that rely on in-place memory manipulation—so, Python good, R good, SQL bad. (A minimal example of such an analytic follows this list.)

  • Performance. Any analytic system will have to deal with resource contention; it is better to overprovision for multiple simultaneous queries early on rather than have your analysts fighting to get results in a crisis.
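
To make the language point concrete, consider an analytic that flags beaconing by looking for suspiciously regular connection intervals per source. The sketch below is a minimal illustration, not a prescribed interface; the record format, function name, and thresholds are my own assumptions. The per-source grouping and variance test take a few lines of Python but are awkward to express in pure SQL.

from collections import defaultdict
from statistics import mean, pstdev

def beacon_candidates(flows, max_jitter=2.0, min_gaps=10):
    """Yield (source, average interval) for sources whose connection
    inter-arrival times are suspiciously regular. `flows` is an iterable
    of (timestamp, source_ip) pairs; the format and thresholds are
    illustrative assumptions."""
    times = defaultdict(list)
    for ts, src in flows:
        times[src].append(ts)
    for src, stamps in times.items():
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        # Enough connections, and the gaps barely vary: a beacon candidate.
        if len(gaps) >= min_gaps and pstdev(gaps) < max_jitter:
            yield src, mean(gaps)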

High-Level Architecture

Figure 8-1 shows a high-level view of a security analysis environment. This environment is envisioned as assisting in the rapid prototyping and deployment of new analytics; security is a constantly moving target, and analysts will need to experiment with new analytics on a regular basis. I will briefly walk through the architectural goals and then discuss each component in more depth.

Figure 8-1. Reference architecture for a security analysis environment

The Sensor Network

The sensor network consists of all the data-gathering devices inside your observed network. This includes network sensing (e.g., IDS, firewall, flow sensing), host-based sensing (e.g., AV, HIDS, DLP), and service sensing (syslog, HTTP logs, and the like). Active sensing is done on an ad hoc basis and is not embedded within the sensor network.

When designing a sensor network, consider the following questions and issues:

  • Will the data be transported in band or out of band? If you are generating a large amount of sensor data, consider setting up dedicated VLANs for transporting this information so that the transport traffic is not itself picked up and reported by the sensor network.

  • Will you store data on the sensors and fetch on demand, or will you forward all data? Summary data (such as NetFlow or constructed events) may be best stored centrally, while raw data is recorded at the sensors proper and pulled as needed.

  • How will you measure the sensors themselves? The integrity of the sensor network will be an ongoing concern. Make sure that you have analytics available to verify that a sensor is correctly installed, to onboard new sensors, and to identify when a sensor has dropped off the network (a minimal heartbeat check is sketched after this list).
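
As one example of measuring the sensors themselves, the following minimal sketch flags expected sensors that have gone quiet. The data structures and function name are illustrative assumptions; in practice, last_seen would be fed by whatever collection pipeline you run.

import time

def silent_sensors(last_seen, expected_sensors, max_quiet_seconds=600):
    """Return the expected sensors that have gone quiet.

    `last_seen` maps sensor IDs to the timestamp of the most recent
    record received from that sensor; `expected_sensors` is the inventory
    of sensors that should be reporting. Both are hypothetical inputs."""
    now = time.time()
    silent = []
    for sensor in expected_sensors:
        seen = last_seen.get(sensor)
        if seen is None or now - seen > max_quiet_seconds:
            silent.append(sensor)
    return silent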

The Repository

The repository is a location for any nonstreaming data. As Figure 8-1 shows, this is subdivided into three components: an archive, annotation data, and the knowledge base.

The repository is the component of the system that is accessed most often and most randomly. As such, performance issues hit the repository more than any other part of the system. When building the repository, consider the following issues:

  • How much information do you expect to maintain? This question may be determined by regulation within your sector. In the absence of that, I like being able to go back at least 90 days (a quarter), and longer if at all possible.

  • How will you access recent (say, the last two weeks) versus longer-term information? Queries over longer time periods will be rarer than queries over the last week or so, so if you intend to use high-performance, expensive storage, focus it on the most recent week or two.

  • How often will you update your storage estimates? Network traffic volume will increase steadily over time, so you should revisit your estimates of how much information you can store at least quarterly (a back-of-the-envelope sketch follows this list).
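
A back-of-the-envelope calculation is usually enough to keep these estimates current. The rates, record sizes, and growth factor below are placeholders to be replaced with your own measurements.

def retention_days(disk_bytes, records_per_second, bytes_per_record,
                   growth_factor=1.0):
    """Estimate how many days of data fit in the available storage.

    `growth_factor` models anticipated traffic growth (e.g., 1.5 for a
    50% increase). All inputs are assumptions to be measured locally."""
    bytes_per_day = records_per_second * 86400 * bytes_per_record * growth_factor
    return disk_bytes / bytes_per_day

# Example: 50 TB of storage, 20,000 flow records/sec at ~100 bytes each.
print(retention_days(50e12, 20_000, 100))                      # ~289 days
print(retention_days(50e12, 20_000, 100, growth_factor=1.5))   # ~193 days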

Archive

The archive is a location for event data. Examples of this data include:

  • Network traffic data (e.g., NetFlow logs, packet captures; see Chapter 3 for more information)

  • System information (e.g., process statistics, filesystem stats, AV reports)

  • System log data (e.g., server logs, syslog information)

Archive data is event data, meaning that it happens at a specific time and it doesn’t update. This is the place in your analysis environment to build a humongous HDFS system and then pump queries through it constantly.

In addition to the general repository issues discussed earlier, issues to consider with the archive include:

  • How will you manage queries involving aged-out data? If storage space is finite, you can bet that there will eventually be an investigation requiring going back to tape archives.

  • Consider postprocessing summarization. Events such as scans take up a large number of records for little value. You may want to store raw data for a day, then split out and summarize high-level phenomena (scanning, legitimate interactions, dark space interaction) to improve access speed; a minimal summarization example follows this list.

  • How much data in the archive is redundant? Can you place the redundant or similar data in one location? For example, you might put all HTTP flows and weblogs in one repository.
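
As a sketch of the summarization idea, the following collapses probable scan traffic into one summary record per source. The record format and threshold are illustrative assumptions rather than a fixed schema.

from collections import defaultdict

def summarize_scans(flows, min_targets=100):
    """Collapse probable scan traffic into one summary record per source.

    `flows` is an iterable of dicts with 'src', 'dst', and 'bytes' keys;
    the record format and threshold are illustrative assumptions. A source
    touching at least `min_targets` distinct destinations is treated as a
    scanner."""
    by_src = defaultdict(lambda: {'targets': set(), 'bytes': 0})
    for flow in flows:
        rec = by_src[flow['src']]
        rec['targets'].add(flow['dst'])
        rec['bytes'] += flow['bytes']
    return [{'src': src, 'kind': 'scan',
             'targets': len(rec['targets']), 'bytes': rec['bytes']}
            for src, rec in by_src.items()
            if len(rec['targets']) >= min_targets]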

Annotation

Annotation data refers to information that you use to supplement your archive data. This includes threat intelligence, geolocation data, network reputation, and other forms of mapping such as DNS repositories. Annotation data differs from event data in that it has a valid time—for example, the owner of an IP address may be one organization from March 5th to the 10th, and another from the 11th onward. Coordinating this timing information is critical when considering annotation; a minimal valid-time lookup is sketched at the end of this subsection.

In addition to the general repository issues, specific questions to consider with annotation include:

  • How do you combine annotation with archive data? Will you pay a performance cost and integrate them as needed, or pay a storage cost and update the archive records with annotations?

  • If you have multiple redundant annotation sources (e.g., two different geolocation databases), how do you represent this?
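
To illustrate the valid-time problem, here is a minimal sketch of an annotation lookup keyed by both value and time. The (ip, valid_from, valid_to, owner) schema is an assumption for illustration.

def build_owner_index(assignments):
    """Index annotation records by value for valid-time lookups.

    `assignments` is a list of (ip, valid_from, valid_to, owner) tuples,
    an illustrative schema for time-bounded annotation data."""
    index = {}
    for ip, start, end, owner in assignments:
        index.setdefault(ip, []).append((start, end, owner))
    return index

def owner_at(index, ip, when):
    """Return the owner of `ip` at time `when`, or None if unknown."""
    for start, end, owner in index.get(ip, []):
        if start <= when <= end:
            return owner
    return None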

Knowledge base

The knowledge base (KB) consists of information and judgments that the organization has built up over time and that are specifically relevant to the target enterprise. Examples of critical information in the KB include asset inventories (a CMDB if you have one), personnel data, and internal calendars. Unlike the other information here, KB data is unstructured and usually relatively small, as it's largely person-moderated.

Query Processing

The query processing system is a development environment that supports rapid prototyping, data synthesis, and the provision of contextual data. It can pull data from multiple locations and synthesize it.

Questions to ask when determining query processing requirements include:

  • How do developers or analysts touch the query processing system? Limit the number of languages as much as possible, down to two if you can get away with it.

  • How do they touch data? The great achievement of SIEM is to provide a developer with access to all the data through one database cursor. However you store or hide the data, ensure that the analysts have one access point and consistently named tables (a source IP is a source IP everywhere).

  • How will you manage queue contention? Figure out a query that takes about 5 minutes to run, then run it once, 4 times simultaneously, 10 times simultaneously, and 20 times simultaneously.3 (A minimal test harness follows this list.) Many query systems degrade gracefully, allowing x queries to run simultaneously and holding off on query x+1 until one finishes. During times of stress, you can expect that analysts will be beating on the system constantly, and queue contention is going to kill you.
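
A minimal harness for the contention test might look like the following sketch; run_query is a placeholder you would replace with a call into your own query system, and the sleep is a stand-in for real work.

import time
from concurrent.futures import ThreadPoolExecutor

def run_query():
    """Placeholder for the ~5-minute benchmark query; replace this with
    a call into your own query system."""
    time.sleep(1)  # stand-in for real work

def contention_test(concurrency_levels=(1, 4, 10, 20)):
    for n in concurrency_levels:
        start = time.time()
        with ThreadPoolExecutor(max_workers=n) as pool:
            for _ in range(n):
                pool.submit(run_query)
        # Leaving the with-block waits for every submitted query to finish.
        print(f"{n} simultaneous queries: {time.time() - start:.1f}s wall clock")

if __name__ == '__main__':
    contention_test()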

Real-Time Processing

Real-time processing consists of any analyses that are done à la minute, which may include straight signature matching, specifically crafted high-performance analytics, and logfile generation. As a rule, real-time processing should be distinct from query processing—ideally, the real-time tier summarizes or preprocesses data to reduce the amount of query processing needed.
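
As a sketch of the kind of summarization a real-time tier might do, the following rolls raw events up into per-minute, per-source counts so that downstream queries run against a much smaller summary stream. The record format and the emit callback are assumptions.

from collections import Counter

class MinuteRollup:
    """Roll raw events up into per-minute, per-source counts.

    A sketch of a real-time summarizer: `emit` is a callback that writes
    a summary record wherever your archive expects it. The record format
    is an assumption."""
    def __init__(self, emit):
        self.emit = emit
        self.current_minute = None
        self.counts = Counter()

    def add(self, ts, src):
        minute = int(ts) // 60
        if self.current_minute is not None and minute != self.current_minute:
            self.flush()
        self.current_minute = minute
        self.counts[src] += 1

    def flush(self):
        for src, count in self.counts.items():
            self.emit({'minute': self.current_minute, 'src': src, 'count': count})
        self.counts.clear()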

Source Control

I can’t emphasize the importance of a source code repository enough—just make sure everyone knows how to use Git, even if they’re not developers. In particular, the following information should be maintained in the repository:

  • Analytics

  • Signature sets

  • Firewall and IDS configurations4

Analytics are obvious, since they are code executed on data. The others are less obvious, but as a rule, anything that is mechanically processed by a detection system or other middlebox in your network should be recorded, with its changes tracked in the repository. This is a necessity when reconstructing past events—if a change in traffic occurs on a network, the first question the analyst should ask is whether the change is due to the internet or due to the collection system.
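
If your sensor and firewall configurations live on disk, even a small script can keep them under version control. The following sketch commits a snapshot only when the file has changed; the paths are placeholders, and it assumes the configuration file already sits inside a Git checkout.

import subprocess

def snapshot_config(repo_dir, config_path, message):
    """Record the current state of a sensor or firewall configuration.

    Assumes `config_path` already sits inside the Git checkout at
    `repo_dir`; both paths are placeholders."""
    subprocess.run(['git', '-C', repo_dir, 'add', config_path], check=True)
    # `git diff --cached --quiet` exits nonzero only if something is staged.
    staged = subprocess.run(['git', '-C', repo_dir, 'diff', '--cached', '--quiet'])
    if staged.returncode != 0:
        subprocess.run(['git', '-C', repo_dir, 'commit', '-m', message], check=True)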

Log Data and the CRUD Paradigm

The CRUD (create, read, update, and delete) paradigm describes the basic operations expected of a persistent storage system. Relational database management systems (RDBMSs), the most prevalent form of persistent storage, expect that users will regularly and asynchronously update existing contents. Relational databases are primarily designed for data integrity, not performance.

Ensuring data integrity requires a significant amount of the system’s resources. Databases use a number of different mechanisms to enforce integrity, including additional processing and metadata on each row. These features are necessary for the type of data that RDBMSs were designed for. That data is not log data.

This difference is shown in Figure 8-2. In RDBMSs, users add and query data from a system constantly, and the system spends resources on tracking these interactions. Log data does not change, however; once an event has occurred, it is never updated. This changes the data flow, as shown on the right side of Figure 8-2. In log collection systems, the only things that write to disk are the sensors; users only read from disk.

This separation of duties between users and sensors means that, when working with log data, the integrity mechanisms used by databases are wasted. For log data, a properly designed flat file collection system will often be just as fast as a relational database.

Figure 8-2. Comparing RDBMSs and log collection systems
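
To make the write-once, read-many point concrete, here is a minimal sketch of a flat-file log store partitioned into hourly files; the directory layout and record format are assumptions, not a production design.

import os
import time

class HourlyLogStore:
    """A minimal append-only log store, partitioned into hourly files.

    A sketch of the flat-file approach, not a production design; the
    directory layout and record format are assumptions."""
    def __init__(self, root):
        self.root = root

    def _path(self, ts):
        return os.path.join(self.root, time.strftime('%Y%m%d-%H', time.gmtime(ts)))

    def append(self, ts, record):
        """Sensors only ever append."""
        os.makedirs(self.root, exist_ok=True)
        with open(self._path(ts), 'a') as f:
            f.write(record + '\n')

    def read_hour(self, ts):
        """Analysts only ever read."""
        try:
            with open(self._path(ts)) as f:
                return f.read().splitlines()
        except FileNotFoundError:
            return []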

A Brief Introduction to NoSQL Systems

The major advance in big data in the past decade has been the popularization of NoSQL big data systems, particularly the MapReduce paradigm introduced by Google. MapReduce is based around two concepts from functional programming: mapping, which is the independent application of a function to every element in a list, and reducing, which combines the elements of a list into a single value by repeatedly applying a two-argument function. Example 8-1 shows how these elements work.

Example 8-1. Map and reduce functions in Python
>>> # map works by applying a function to every element in a sequence.
... # For example, we create a sample list of 1 to 10.
>>> sample = list(range(1, 11))
>>> # We now define a doubling function.
...
>>> def double(x):
...     return x * 2
...
>>> # We now apply the doubling function to the sample data. In
... # Python 3, map returns an iterator, so we wrap it in list() to
... # see the results: a list whose elements are double the
... # originals.
...
>>> list(map(double, sample))
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
>>> # Now we create a 2-parameter function that adds two elements.
...
>>> def add(a, b):
...     return a + b
...
>>> # In Python 3, reduce lives in the functools module.
>>> from functools import reduce
>>> # We now run reduce with add and the sample; add is applied to a
... # running total and each element in turn: add(1, 2) produces 3,
... # add(3, 3) produces 6, add(6, 4) produces 10, and so on until
... # everything is added.
...
>>> reduce(add, sample)
55

MapReduce is a convenient paradigm for parallelization. Map operations are implicitly parallel because the mapped function is applied to each list element individually, and reduction provides a clear description of how the results are combined. This easy parallelization enables the implementation of any of a number of big data approaches.
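
To make the parallelization concrete, the following sketch applies the same map-and-reduce pattern to partitioned log files, using local worker processes to stand in for a cluster of hosts. The file layout and record format are assumptions.

import glob
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_sources(path):
    """Map step: count records per source IP in one partition file.
    Assumes one whitespace-separated record per line with the source IP
    in the first field (an illustrative format)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts

def merge(a, b):
    """Reduce step: combine two partial counts."""
    a.update(b)
    return a

if __name__ == '__main__':
    partitions = glob.glob('/data/flows/hourly-*')   # hypothetical layout
    with Pool() as pool:
        partials = pool.map(count_sources, partitions)
    totals = reduce(merge, partials, Counter())
    print(totals.most_common(10))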

For our purposes, a big data system is a distributed data storage architecture that relies on massive parallelization. Recall the previous discussion about how flat file systems can enhance performance by intelligently indexing data. But now, instead of simply storing the hourly file on disk, we split it across multiple hosts and run the same query on those hosts in parallel. The finer details depend on the type of storage, for which we can define three major categories:

Key stores

Including MongoDB, Accumulo, Cassandra, Hypertable, and LevelDB. These systems effectively operate as a giant hashtable in that a complete document or data structure is associated with a key for future retrieval. Unlike the other two options, key store systems don’t use schemas; structure and interpretation are dependent on the implementer.

Columnar databases

Including MonetDB, Sensage, and ParAccel. Columnar databases split each record across multiple column files with the same index.

Relational databases

Including MySQL, Postgres, Oracle, and Microsoft’s SQL Server. RDBMSs store complete records as individually distinguishable rows.

Figure 8-3 explains these relations graphically. In a key store, the record is stored by its key while the relationship between the recorded data and any schema is left to the user. In a columnar database, rows are decomposed into their individual fields and then stored, one field per file, in individual column files. In an RDBMS, each row is a unique and distinguishable entity. The schema defines the contents of each row, and rows are stored sequentially in a file.

Figure 8-3. Comparing data storage systems

Key stores are a good choice when you have no idea what the structure of the data is, when you have to implement your own low-level queries (e.g., image processing and anything not easily expressed in SQL), or when, even if the data has structure, you don't need to query on that structure. This reflects their original purpose of supporting unstructured text searches across web pages. Key stores will work well with web pages, tcpdump records containing a payload, images, and other datasets where the individual records are relatively large (on the order of 60 KB or more, around the size of the HTML on a modern web page). However, if the data possesses some structure, such as the ability to be divided into columns, or extensive and repeated references to the same data, then a columnar or relational model may be preferable.

Columnar databases are preferable when the data is easily divided into individual log records that don’t need to cross-reference each other, and when the contents are relatively small, such as the CLF and ELF record formats discussed in “HTTP: CLF and ELF”. Columnar databases can optimize queries by picking out and processing data from a subset of the columns in each record; their performance improves when they query on fewer columns or return fewer columns. If your schema has a limited number of columns (for example, an image database containing a small date field, a small ID field, and a large image field), then the columnar approach will not provide a performance boost.

RDBMSs were originally designed for information that’s frequently replicated across multiple records, such as a billing database where a single person may have multiple bills. RDBMSs work best with data that can be subdivided across multiple tables. In security environments, they’re usually best suited to maintaining personnel records, event reports, and other knowledge—things that are produced after processing data or that reflect an organization’s structure. RDBMSs are good at maintaining integrity and concurrency; if you need to update a row, they’re the default choice. The RDBMS approach is possibly unwarranted if your data doesn’t change after creation, individual records don’t have cross-references, or your schemas store large blobs.
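
As a closing illustration, here is how the same flow record might be laid out under each of the three models; the field names and values are purely illustrative.

# The same flow record under each storage model (values are illustrative).
record = {'ts': 1700000000, 'src': '10.0.0.1', 'dst': '192.0.2.7', 'bytes': 4120}

# Key store: the whole record is an opaque value behind a key; any schema
# or interpretation is left to the application.
key_store = {'flow:0001': record}

# Columnar: each field lives in its own column file; the same index
# position across the columns reconstructs one record.
columnar = {
    'ts':    [1700000000],
    'src':   ['10.0.0.1'],
    'dst':   ['192.0.2.7'],
    'bytes': [4120],
}

# Relational: the schema fixes the fields, and each record is one row.
rows = [(1700000000, '10.0.0.1', '192.0.2.7', 4120)]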

Further Reading

  1. M. Kleppmann, Designing Data-Intensive Applications (Sebastopol, CA: O’Reilly Media, 2017).

  2. M. Hausenblas and N. Bijnens, The Lambda Architecture, available at http://www.lambda-architecture.net.

1 Pedantry compels me to point out that “analytic” is an adjective, but I’ve lost that particular battle and will treat it as a noun. Also, it’s “hieroglyph.”

2 Where grep lives.

3 Make sure you’re doing this with similar-sized but different datasets so you can compensate for caching effects.

4 Note that the converse is true as well—if you are managing a system with multiple firewalls, IDSs, and the like, make sure that you are also consistently pushing out your configurations.
