Chapter 8. Getting Data in One Place

Once you have collected your data, you need an environment where you can process it and produce results. In this chapter, I provide some notes on an architecture to facilitate the rapid development and operational deployment of security analysis software (analytics1).

There are a number of ways to implement this; the version you’ll see in Figure 8-1 is a high-level diagram for a basic environment. In general, these environments should have the following attributes:

  • Robust, universal access to all sensor data. The term “universal” here is used in lieu of “centralized”—it’s not critical that the data be in one place, but it is critical that anyone implementing analytical code have uniform access to all the data.

  • Access to a Turing-complete language. This differentiates an analysis environment from the classic security console. Complex analytics require access to a general-purpose programming language and the ability to build constructs that rely on in-place memory manipulation—so, Python good, R good, SQL bad. (A minimal example of such an analytic follows this list.)

  • Performance. Any analytic system will have to deal with resource contention; it is better to overprovision for multiple simultaneous queries early on rather than have your analysts fighting to get results in a crisis.
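
To make the language point concrete, consider an analytic that flags beaconing by looking for suspiciously regular connection intervals per source. The sketch below is a minimal illustration, not a prescribed interface; the record format, function name, and thresholds are my own assumptions. The per-source grouping and variance test take a few lines of Python but are awkward to express in pure SQL.

from collections import defaultdict
from statistics import mean, pstdev

def beacon_candidates(flows, max_jitter=2.0, min_gaps=10):
    """Yield (source, average interval) for sources whose connection
    inter-arrival times are suspiciously regular. `flows` is an iterable
    of (timestamp, source_ip) pairs; the format and thresholds are
    illustrative assumptions."""
    times = defaultdict(list)
    for ts, src in flows:
        times[src].append(ts)
    for src, stamps in times.items():
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        # Enough connections, and the gaps barely vary: a beacon candidate.
        if len(gaps) >= min_gaps and pstdev(gaps) < max_jitter:
            yield src, mean(gaps)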

High-Level Architecture

Figure 8-1 shows a high-level view of a security analysis environment. This environment is envisioned as assisting in the rapid prototyping and deployment of new analytics; security is a constantly moving target, and analysts will need to experiment with new analytics on a regular basis. I will briefly walk through the architectural goals and then discuss each component in more depth.

Figure 8-1. Reference architecture for a security analysis environment

The Sensor Network

The sensor network consists of all the data-gathering devices inside your observed network. This includes network sensing (e.g., IDS, firewall, flow sensing), host-based sensing (e.g., AV, HIDS, DLP), and service sensing (syslog, HTTP logs, and the like). Active sensing is done on an ad hoc basis and is not embedded within the sensor network.

When designing a sensor network, consider the following questions and issues:

  • Will the data be transported in band or out of band? If you are generating a large amount of sensor data, consider setting up dedicated VLANs for transporting this information so that the transport traffic is not itself picked up and reported by the sensor network.

  • Will you store data on the sensors and fetch on demand, or will you forward all data? Summary data (such as NetFlow or constructed events) may be best stored centrally, while raw data is recorded at the sensors proper and pulled as needed.

  • How will you measure the sensors themselves? The integrity of the sensor network will be an ongoing concern. Make sure that you have analytics available to verify that a sensor is correctly installed, to onboard new sensors, and to identify when a sensor has dropped off the network (a minimal heartbeat check is sketched after this list).
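
As one example of measuring the sensors themselves, the following minimal sketch flags expected sensors that have gone quiet. The data structures and function name are illustrative assumptions; in practice, last_seen would be fed by whatever collection pipeline you run.

import time

def silent_sensors(last_seen, expected_sensors, max_quiet_seconds=600):
    """Return the expected sensors that have gone quiet.

    `last_seen` maps sensor IDs to the timestamp of the most recent
    record received from that sensor; `expected_sensors` is the inventory
    of sensors that should be reporting. Both are hypothetical inputs."""
    now = time.time()
    silent = []
    for sensor in expected_sensors:
        seen = last_seen.get(sensor)
        if seen is None or now - seen > max_quiet_seconds:
            silent.append(sensor)
    return silent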

The Repository

The repository is a location for any nonstreaming data. As Figure 8-1 shows, this is subdivided into three components: an archive, annotation data, and the knowledge base.

The repository is the component of the system that is accessed most often and most randomly. As such, performance issues hit the repository more than any other part of the system. When building the repository, consider the following issues:

  • How much information do you expect to maintain? This question may be determined by regulation within your sector. In the absence of that, I like being able to go back at least 90 days (a quarter), and longer if at all possible.

  • How will you access recent (say, the last two weeks) versus longer-term information? Queries over longer time periods will be rarer than queries over the last week or so, so if you intend to use high-performance, expensive storage, focus it on the most recent week or two.

  • How often will you update your storage estimates? Network traffic volume will increase steadily over time, so you should revisit your estimates of how much information you can store at least quarterly (a back-of-the-envelope sketch follows this list).
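
A back-of-the-envelope calculation is usually enough to keep these estimates current. The rates, record sizes, and growth factor below are placeholders to be replaced with your own measurements.

def retention_days(disk_bytes, records_per_second, bytes_per_record,
                   growth_factor=1.0):
    """Estimate how many days of data fit in the available storage.

    `growth_factor` models anticipated traffic growth (e.g., 1.5 for a
    50% increase). All inputs are assumptions to be measured locally."""
    bytes_per_day = records_per_second * 86400 * bytes_per_record * growth_factor
    return disk_bytes / bytes_per_day

# Example: 50 TB of storage, 20,000 flow records/sec at ~100 bytes each.
print(retention_days(50e12, 20_000, 100))                      # ~289 days
print(retention_days(50e12, 20_000, 100, growth_factor=1.5))   # ~193 days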

Archive

The archive is a location for event data. Examples of this data include:

  • Network traffic data (e.g., NetFlow logs, packet captures; see Chapter 3 for more information)

  • System information (e.g., process statistics, filesystem stats, AV reports)

  • System log data (e.g., server logs, syslog information)

Archive data is event data, meaning that it happens at a specific time and it doesn’t update. This is the place in your analysis environment to build a humongous HDFS system and then pump queries through it constantly.

In addition to the general repository issues discussed earlier, issues to consider with the archive include:

  • How will you manage queries involving aged-out data? If storage space is finite, you can bet that there will eventually be an investigation requiring going back to tape archives.

  • Consider postprocessing summarization. Events such as scans take up a large number of records for little value. You may want to store raw data for a day, then split out and summarize high-level phenomena (scanning, legitimate interactions, dark space interaction) to improve access speed; a minimal summarization example follows this list.

  • How much data in the archive is redundant? Can you place the redundant or similar data in one location? For example, you might put all HTTP flows and weblogs in one repository.
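
As a sketch of the summarization idea, the following collapses probable scan traffic into one summary record per source. The record format and threshold are illustrative assumptions rather than a fixed schema.

from collections import defaultdict

def summarize_scans(flows, min_targets=100):
    """Collapse probable scan traffic into one summary record per source.

    `flows` is an iterable of dicts with 'src', 'dst', and 'bytes' keys;
    the record format and threshold are illustrative assumptions. A source
    touching at least `min_targets` distinct destinations is treated as a
    scanner."""
    by_src = defaultdict(lambda: {'targets': set(), 'bytes': 0})
    for flow in flows:
        rec = by_src[flow['src']]
        rec['targets'].add(flow['dst'])
        rec['bytes'] += flow['bytes']
    return [{'src': src, 'kind': 'scan',
             'targets': len(rec['targets']), 'bytes': rec['bytes']}
            for src, rec in by_src.items()
            if len(rec['targets']) >= min_targets]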

Annotation

Annotation data refers to information that you use to supplement your archive data. This includes threat intelligence, geolocation data, network reputation, and other forms of mapping such as DNS repositories. Annotation data differs from event data in that it has a valid time—for example, the owner of an IP address may be one organization from March 5th to the 10th, and another from the 11th onward. Coordinating this timing information is critical when considering annotation; a minimal valid-time lookup is sketched at the end of this subsection.

In addition to the general repository issues, specific questions to consider with annotation include:

  • How do you combine annotation with archive data? Will you pay a performance cost and integrate them as needed, or pay a storage cost and update the archive records with annotations?

  • If you have multiple redundant annotation sources (e.g., two different geolocation databases), how do you represent this?
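
To illustrate the valid-time problem, here is a minimal sketch of an annotation lookup keyed by both value and time. The (ip, valid_from, valid_to, owner) schema is an assumption for illustration.

def build_owner_index(assignments):
    """Index annotation records by value for valid-time lookups.

    `assignments` is a list of (ip, valid_from, valid_to, owner) tuples,
    an illustrative schema for time-bounded annotation data."""
    index = {}
    for ip, start, end, owner in assignments:
        index.setdefault(ip, []).append((start, end, owner))
    return index

def owner_at(index, ip, when):
    """Return the owner of `ip` at time `when`, or None if unknown."""
    for start, end, owner in index.get(ip, []):
        if start <= when <= end:
            return owner
    return None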

Knowledge base

The knowledge base (KB) consists of information and judgments that the organization has built up over time and that are specifically relevant to the target enterprise. Examples of critical information in the KB include asset inventories (a CMDB if you have one), personnel data, and internal calendars. Unlike the other information here, KB data is unstructured and usually relatively small, as it's largely person-moderated.

Query Processing

The query processing system is a development environment that supports rapid prototyping, data synthesis, and the provision of contextual data. It can pull data from multiple locations and synthesize it.

Questions to ask when determining query processing requirements include:

  • How do developers or analysts touch the query processing system? Limit the number of languages as much as possible, down to two if you can get away with it.

  • How do they touch data? The great achievement of SIEM is to provide a developer with access to all the data through one database cursor. However you store or hide the data, ensure that the analysts have one access point and consistently named tables (a source IP is a source IP everywhere).

  • How will you manage queue contention? Figure out a query that takes about 5 minutes to run, then run it once, 4 times simultaneously, 10 times simultaneously, and 20 times simultaneously.3 (A minimal test harness follows this list.) Many query systems degrade gracefully, allowing x queries to run simultaneously and holding off on query x+1 until one finishes. During times of stress, you can expect that analysts will be beating on the system constantly, and queue contention is going to kill you.
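
A minimal harness for the contention test might look like the following sketch; run_query is a placeholder you would replace with a call into your own query system, and the sleep is a stand-in for real work.

import time
from concurrent.futures import ThreadPoolExecutor

def run_query():
    """Placeholder for the ~5-minute benchmark query; replace this with
    a call into your own query system."""
    time.sleep(1)  # stand-in for real work

def contention_test(concurrency_levels=(1, 4, 10, 20)):
    for n in concurrency_levels:
        start = time.time()
        with ThreadPoolExecutor(max_workers=n) as pool:
            for _ in range(n):
                pool.submit(run_query)
        # Leaving the with-block waits for every submitted query to finish.
        print(f"{n} simultaneous queries: {time.time() - start:.1f}s wall clock")

if __name__ == '__main__':
    contention_test()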

Real-Time Processing

Real-time processing consists of any analyses that are done à la minute, which may include straight signature matching, specifically crafted high-performance analytics, and logfile generation. As a rule, real-time processing should be distinct from query processing—ideally, the real-time tier summarizes or preprocesses data to reduce the amount of query processing needed.
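
As a sketch of the kind of summarization a real-time tier might do, the following rolls raw events up into per-minute, per-source counts so that downstream queries run against a much smaller summary stream. The record format and the emit callback are assumptions.

from collections import Counter

class MinuteRollup:
    """Roll raw events up into per-minute, per-source counts.

    A sketch of a real-time summarizer: `emit` is a callback that writes
    a summary record wherever your archive expects it. The record format
    is an assumption."""
    def __init__(self, emit):
        self.emit = emit
        self.current_minute = None
        self.counts = Counter()

    def add(self, ts, src):
        minute = int(ts) // 60
        if self.current_minute is not None and minute != self.current_minute:
            self.flush()
        self.current_minute = minute
        self.counts[src] += 1

    def flush(self):
        for src, count in self.counts.items():
            self.emit({'minute': self.current_minute, 'src': src, 'count': count})
        self.counts.clear()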

Source Control

I can’t emphasize the importance of a source code repository enough—just make sure everyone knows how to use Git, even if they’re not developers. In particular, the following information should be maintained in the repository:

  • Analytics

  • Signature sets

  • Firewall and IDS configurations4

Analytics are obvious, since they are code executed on data. The others are less obvious, but as a rule, anything that is mechanically processed by a detection system or other middlebox in your network should be recorded, with its changes tracked in the repository. This is a necessity when reconstructing past events—if a change in traffic occurs on a network, the first question the analyst should ask is whether the change is due to the internet or due to the collection system.
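
If your sensor and firewall configurations live on disk, even a small script can keep them under version control. The following sketch commits a snapshot only when the file has changed; the paths are placeholders, and it assumes the configuration file already sits inside a Git checkout.

import subprocess

def snapshot_config(repo_dir, config_path, message):
    """Record the current state of a sensor or firewall configuration.

    Assumes `config_path` already sits inside the Git checkout at
    `repo_dir`; both paths are placeholders."""
    subprocess.run(['git', '-C', repo_dir, 'add', config_path], check=True)
    # `git diff --cached --quiet` exits nonzero only if something is staged.
    staged = subprocess.run(['git', '-C', repo_dir, 'diff', '--cached', '--quiet'])
    if staged.returncode != 0:
        subprocess.run(['git', '-C', repo_dir, 'commit', '-m', message], check=True)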

Log Data and the CRUD Paradigm

The CRUD (create, read, update, and delete) paradigm describes the basic operations expected of a persistent storage system. Relational database management systems (RDBMSs), the most prevalent form of persistent storage, expect that users will regularly and asynchronously update existing contents. Relational databases are primarily designed for data integrity, not performance.

Ensuring data integrity requires a significant amount of the system’s resources. Databases use a number of different mechanisms to enforce integrity, including additional processing and metadata on each row. These features are necessary for the type of data that RDBMSs were designed for. That data is not log data.

This difference is shown in Figure 8-2. In RDBMSs, users add and query data from a system constantly, and the system spends resources on tracking these interactions. Log data does not change, however; once an event has occurred, it is never updated. This changes the data flow, as shown on the right side of Figure 8-2. In log collection systems, the only things that write to disk are the sensors; users only read from disk.

This separation of duties between users and sensors means that, when working with log data, the integrity mechanisms used by databases are wasted. For log data, a properly designed flat file collection system will often be just as fast as a relational database.

Figure 8-2. Comparing RDBMSs and log collection systems
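
To make the write-once, read-many point concrete, here is a minimal sketch of a flat-file log store partitioned into hourly files; the directory layout and record format are assumptions, not a production design.

import os
import time

class HourlyLogStore:
    """A minimal append-only log store, partitioned into hourly files.

    A sketch of the flat-file approach, not a production design; the
    directory layout and record format are assumptions."""
    def __init__(self, root):
        self.root = root

    def _path(self, ts):
        return os.path.join(self.root, time.strftime('%Y%m%d-%H', time.gmtime(ts)))

    def append(self, ts, record):
        """Sensors only ever append."""
        os.makedirs(self.root, exist_ok=True)
        with open(self._path(ts), 'a') as f:
            f.write(record + '\n')

    def read_hour(self, ts):
        """Analysts only ever read."""
        try:
            with open(self._path(ts)) as f:
                return f.read().splitlines()
        except FileNotFoundError:
            return []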

A Brief Introduction to NoSQL Systems

The major advance in big data in the past decade has been the popularization of NoSQL big data systems, particularly the MapReduce paradigm introduced by Google. MapReduce is based around two concepts from functional programming: mapping, which is the independent application of a function to every element in a list, and reducing, which combines the elements of a list into a single value by repeatedly applying a two-argument function. Example 8-1 shows how these elements work.

Example 8-1. Map and reduce functions in Python
>>> # map works by applying a function to every element in a sequence.
... # For example, we create a sample list of 1 to 10.
>>> sample = list(range(1, 11))
>>> # We now define a doubling function.
...
>>> def double(x):
...     return x * 2
...
>>> # We now apply the doubling function to the sample data. In
... # Python 3, map returns an iterator, so we wrap it in list() to
... # see the results: a list whose elements are double the
... # originals.
...
>>> list(map(double, sample))
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
>>> # Now we create a 2-parameter function that adds two elements.
...
>>> def add(a, b):
...     return a + b
...
>>> # In Python 3, reduce lives in the functools module.
>>> from functools import reduce
>>> # We now run reduce with add and the sample; add is applied to a
... # running total and each element in turn: add(1, 2) produces 3,
... # add(3, 3) produces 6, add(6, 4) produces 10, and so on until
... # everything is added.
...
>>> reduce(add, sample)
55

MapReduce is a convenient paradigm for parallelization. Map operations are implicitly parallel because the mapped function is applied to each list element individually, and reduction provides a clear description of how the results are combined. This easy parallelization enables the implementation of any of a number of big data approaches.
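
To make the parallelization concrete, the following sketch applies the same map-and-reduce pattern to partitioned log files, using local worker processes to stand in for a cluster of hosts. The file layout and record format are assumptions.

import glob
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_sources(path):
    """Map step: count records per source IP in one partition file.
    Assumes one whitespace-separated record per line with the source IP
    in the first field (an illustrative format)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts

def merge(a, b):
    """Reduce step: combine two partial counts."""
    a.update(b)
    return a

if __name__ == '__main__':
    partitions = glob.glob('/data/flows/hourly-*')   # hypothetical layout
    with Pool() as pool:
        partials = pool.map(count_sources, partitions)
    totals = reduce(merge, partials, Counter())
    print(totals.most_common(10))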

For our purposes, a big data system is a distributed data storage architecture that relies on massive parallelization. Recall the previous discussion about how flat file systems can enhance performance by intelligently indexing data. But now, instead of simply storing the hourly file on disk, we split it across multiple hosts and run the same query on those hosts in parallel. The finer details depend on the type of storage, for which we can define three major categories:

Key stores

Including MongoDB, Accumulo, Cassandra, Hypertable, and LevelDB. These systems effectively operate as a giant hashtable in that a complete document or data structure is associated with a key for future retrieval. Unlike the other two options, key store systems don’t use schemas; structure and interpretation are dependent on the implementer.

Columnar databases

Including MonetDB, Sensage, and ParAccel. Columnar databases split each record across multiple column files with the same index.

Relational databases

Including MySQL, Postgres, Oracle, and Microsoft’s SQL Server. RDBMSs store complete records as individually distinguishable rows.

Figure 8-3 explains these relations graphically. In a key store, the record is stored by its key while the relationship between the recorded data and any schema is left to the user. In a columnar database, rows are decomposed into their individual fields and then stored, one field per file, in individual column files. In an RDBMS, each row is a unique and distinguishable entity. The schema defines the contents of each row, and rows are stored sequentially in a file.

Figure 8-3. Comparing data storage systems

Key stores are a good choice when you have no idea what the structure of the data is, when you have to implement your own low-level queries (e.g., image processing and anything not easily expressed in SQL), or when, even if the data has structure, you don't need to query on that structure. This reflects their original purpose of supporting unstructured text searches across web pages. Key stores will work well with web pages, tcpdump records containing a payload, images, and other datasets where the individual records are relatively large (on the order of 60 KB or more, around the size of the HTML on a modern web page). However, if the data possesses some structure, such as the ability to be divided into columns, or extensive and repeated references to the same data, then a columnar or relational model may be preferable.

Columnar databases are preferable when the data is easily divided into individual log records that don’t need to cross-reference each other, and when the contents are relatively small, such as the CLF and ELF record formats discussed in “HTTP: CLF and ELF”. Columnar databases can optimize queries by picking out and processing data from a subset of the columns in each record; their performance improves when they query on fewer columns or return fewer columns. If your schema has a limited number of columns (for example, an image database containing a small date field, a small ID field, and a large image field), then the columnar approach will not provide a performance boost.

RDBMSs were originally designed for information that’s frequently replicated across multiple records, such as a billing database where a single person may have multiple bills. RDBMSs work best with data that can be subdivided across multiple tables. In security environments, they’re usually best suited to maintaining personnel records, event reports, and other knowledge—things that are produced after processing data or that reflect an organization’s structure. RDBMSs are good at maintaining integrity and concurrency; if you need to update a row, they’re the default choice. The RDBMS approach is possibly unwarranted if your data doesn’t change after creation, individual records don’t have cross-references, or your schemas store large blobs.
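
As a closing illustration, here is how the same flow record might be laid out under each of the three models; the field names and values are purely illustrative.

# The same flow record under each storage model (values are illustrative).
record = {'ts': 1700000000, 'src': '10.0.0.1', 'dst': '192.0.2.7', 'bytes': 4120}

# Key store: the whole record is an opaque value behind a key; any schema
# or interpretation is left to the application.
key_store = {'flow:0001': record}

# Columnar: each field lives in its own column file; the same index
# position across the columns reconstructs one record.
columnar = {
    'ts':    [1700000000],
    'src':   ['10.0.0.1'],
    'dst':   ['192.0.2.7'],
    'bytes': [4120],
}

# Relational: the schema fixes the fields, and each record is one row.
rows = [(1700000000, '10.0.0.1', '192.0.2.7', 4120)]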

Further Reading

  1. M. Kleppmann, Designing Data-Intensive Applications (Sebastopol, CA: O’Reilly Media, 2017).

  2. M. Hausenblas and N. Bijnens, The Lambda Architecture, available at http://www.lambda-architecture.net.

1 Pedantry compels me to point out that “analytic” is an adjective, but I’ve lost that particular battle and will treat it as a noun. Also, it’s “hieroglyph.”

2 Where grep lives.

3 Make sure you’re doing this with similar-sized but different datasets so you can compensate for caching effects.

4 Note that the converse is true as well—if you are managing a system with multiple firewalls, IDSs, and the like, make sure that you are also consistently pushing out your configurations.
