Chapter 12

Making Sense of Unstructured Natural Language Information

Kellyn Rein

Abstract

Making sense of the vast mountains of data that grow by tremendous quantities each day is a significant challenge. Although much of these data are produced by devices (sensors, cameras, global positioning systems, etc.), the volume of information produced by humans in the form of natural language is also growing dramatically. Making sense of text-based information is of particular importance in many communities, including security, intelligence, and crisis management, where the focus is on the actors who may be involved in illicit or threatening activities or who may be caught up in disaster situations in which human communications have an important role in protecting life and property. However, dealing with unstructured natural language information poses specific challenges not present in device-generated data. Current technologies for text analytics offer some limited partial solutions for intelligence purposes, but many problems remain unsolved or are only in the early stages of being addressed. In this chapter we examine a number of issues involved in dealing with unstructured natural language data, briefly discuss the strengths and weaknesses of some widely used technologies for text analytics, and look at a flexible alternative solution to fill some of the gaps.

Keywords

Intelligence; Natural language; Sense making; Unstructured data

Introduction

Information is of great value when a deduction of some sort can be drawn from it. This may occur as a result of its association with some other information already received.

AJP 2.0 Allied Joint Intelligence, Counter Intelligence and Security Doctrine (NATO, 2003)

The Holy Grail of Big Data is making sense of the overwhelming mountains of information available. This information is generated by a variety of devices such as video cameras, motion sensors, acoustic sensors, satellites, and global positioning systems (GPS), as well as by humans in written and spoken form. Automatic processing of data derived from devices uses powerful mathematical algorithms that manipulate the data and are often capable of running in parallel on multitudes of servers to produce more timely results.
Unstructured information—that is, information that is not stored in a structured format such as a database or an ontology, but is formulated in natural language such as social media, blogs, government documents, research papers, intelligence reports, and so on—poses some hurdles that do not exist for device-derived data: A thermometer delivers a value that we know represents a temperature, an acoustic sensor delivers data that we know represent sound waves, and the output of a GPS system is the location where we currently find ourselves. Although some interpretation (usually in the form of an algorithm) is needed to make sense of the data received from devices, we do not try to interpret a GPS reading as a temperature.
In contrast, a human being delivers data in the form of natural language formulations, which can represent a wide range of information (including temperatures, sounds, and locations). Furthermore, each human chooses any of a number of human languages in which to describe those temperatures, sounds, or locations. Thus the first objective may be to determine which natural language has been used to describe the information (Spanish? Chinese? English?); the next objective is to determine the focus of the information (e.g., temperature, sounds, locations, persons, events). In other words, we know fairly precisely what information to expect from a given device, but a human may deliver a wide and diverse set of topics, including something new and unexpected.
Unfortunately for understanding unstructured natural language, not only are there many natural languages in which this information can be formulated, there are variations within each language for the representation of that information (dialect and synonymic formulations). For example, when describing the detonation of a bomb, various words and phrases may be used: “blew up,” “exploded,” “went off,” and so on. Although a speaker of that language would understand that these can all be used to describe the same event, automated processing requires the system to be able to recognize this as well.
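One simple way to make such synonymic variation machine-recognizable is to map known surface phrases to a canonical event label. The sketch below assumes a small hand-built lexicon; the phrase list and label names are illustrative only, whereas real systems would draw on much larger lexical resources.

```python
# Minimal sketch: normalize synonymous surface forms ("blew up",
# "exploded", "went off") to one canonical event type.
# The lexicon entries and the DETONATION label are illustrative assumptions.
EVENT_LEXICON = {
    "blew up": "DETONATION",
    "exploded": "DETONATION",
    "went off": "DETONATION",
    "detonated": "DETONATION",
}

def normalize_events(sentence: str) -> list[str]:
    """Return canonical event labels for known phrases found in a sentence."""
    text = sentence.lower()
    return [label for phrase, label in EVENT_LEXICON.items() if phrase in text]

print(normalize_events("Witnesses say the bomb went off at dawn."))
# → ['DETONATION']
```

With such a mapping in place, downstream processing can treat all three formulations of the bombing as the same event type.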
Furthermore, any data received from a device are always historical in the sense that the data represent something that the device has observed and recorded. Using these historical data, algorithms may project future conditions—for example, the projected flight path of an airplane being tracked by radar—but this projection is based on historical data and algorithmic programming. Much of what is formulated in natural language reflects observations made in the past, but there is also a significant amount of inference and speculation about future events. Similarly, reporting on events that occurred in the past may include interpretation or speculation on the part of the reporter rather than statements of fact. Complicating things even further, the speaker may pass on information received via a third party such as another person, a news report, or blog, rather than simply reporting something personally witnessed. Finally, the speaker may lie, tell partial truths, or distort the facts. Although a device may sometimes fail or be negatively influenced by environmental factors such as heat or humidity, and although the object of observation by a device may employ diversionary tactics to evade identification, the device never makes a decision to intentionally deceive.
Thus, for the security and intelligence communities, sense making includes sifting through signals expressed in natural language by human sensors who pass on hearsay, conceal, intentionally distort, lie, and conjecture in words that are ambiguous, vague, and imprecise, looking for clues that allow hostile actions to be anticipated. Credibility of information is not based on calibrations achieved by testing under various conditions, but on examining the sources of that information and the clues to veracity that these sources embed in their communications.
Current technologies in text analytics have made some inroads into dealing with the complexities of natural language data. However, many challenges remain for efficient and effective processing of this information. This chapter examines a number of these technologies and discusses an approach to processing natural language communications, which provides a high level of flexibility.

Big Data and Unstructured Data

“Big Data” is an umbrella term that covers datasets so immense that they require special methodologies, such as massively parallel systems and algorithms, to be processed. Digital data are being generated and collected at unprecedented rates. Big science accounts for a huge volume of data: for example, in 2010, The Economist reported that “When the Sloan Digital Sky Survey started work in 2000, its telescope in New Mexico collected more data in its first few weeks than had been amassed in the entire history of astronomy” (Economist, 2010). Big business accounts for another huge chunk of Big Data: Retailers, banks, and credit card companies collect and analyze vast amounts of customer data daily. Security cameras capture our daily movements, mobile telephone companies identify our locations and contacts, and we even self-report via social media.
Unstructured data are data that are not stored in a structured format such as a database or ontology. They are generally understood to include such diverse forms as e-mails, word processing documents, multimedia, video, PDF files, spreadsheets, text messaging content, digital pictures and graphics, mobile phone GPS records, and social media content (Roe, 2012).
Exploitation of these unstructured data depends on the type of unstructured data. Some require specialized algorithms such as image processing or analysis of acoustic data. Others require preprocessing; for example, PDF files are often converted to normal text files to run more standard text analytic algorithms. In many cases, unstructured data may be structured to make them more processable (see Chapters 2 and 10).

Aspects of Uncertainty in Sense Making

Regardless of the source of the data we receive, these data cannot always be taken at face value. Sometimes we know a lot about the source of the data we received; i.e., we know exactly what make and model of sensor we have placed where, we have calibration and reliability information about that particular type of sensor, and we know how it reacts under various types of environmental conditions (rain, heat, night, etc.). Therefore, knowing the time, weather conditions, and so on, we will have some idea of the reliability of the information that the sensor delivers, as well as precisely where the data have been gathered.
Humans, on the other hand, are mobile and able to insert new information into the system from a variety of locations; i.e., a blogger may blog from anywhere in the world in which the Internet is accessible and the blog post may contain information about events happening far distant from the blogger’s physical location. Even if the blogger tells us (truthfully) in the text where he is currently located, we may actually access that text at an entirely different time, rendering the information useless except for historical purposes.
Ultimately, five aspects of uncertainty need to be considered in analyzing and aggregating information (Kruger, 2008; Kruger et al., 2008; Rein and Schade, 2009; Rein et al., 2010):
1. Source uncertainty: How reliable is the source of the information? How much do we trust this source? Is this eyewitness information or are there indications that the source is relating information derived from another source (hearsay)?
2. Content uncertainty: How credible do we believe the content, which the source has delivered, to be? Does it have to be confirmed by other sources? Does it fit with other data or is it anomalous? If the source is an algorithm (the result of other preprocessing), does the algorithm give us an estimate of the certainty of its results?
3. Correlation uncertainty: How certain are we that various pieces of information are related? When dealing with natural language information, we are often confronted with vague or imprecise formulations. How confident are we that reports concerning “several large vehicles” and “five tanks” are referring to the same thing?
4. Evidential uncertainty: How strongly does our information indicate a specific threat or behavior in which we are interested? Although the purchase of 50 kg of a chemical fertilizer may indicate that a homemade bomb is being built, it is much shakier evidence than the same individual acquiring a significant amount of, say, plastique.
5. Model uncertainty: Even with all factors present, how certain are we that the model mirrors reality—for instance, when there is constant behavior modification on the parts of foes who seek to evade discovery?
In making sense of text-based data, we need to take all of these various types of uncertainty into consideration; otherwise both the data upon which we base information and the assumptions we make about the connections among the various pieces of information are compromised (Dragos and Rein, 2014).

Situation Awareness and Intelligence

Sense making can mean different things in different contexts. In numerous domains such as the military, air traffic control, harbor security, emergency services (crisis management), and public safety, sense making is generally synonymous with situation awareness, with emphasis on threat recognition and decision support. Situation awareness, particularly in the military domain, is an ongoing overview of important environmental elements within the area of interest, such as the locations of military units, both friendly and hostile, the movements of personnel and equipment, and the locations and conditions of facilities. On an intelligence level this may also include information on nonmilitary or paramilitary activities such as refugee movement, political climate, and tribal coalitions. Often this information is captured and displayed visually on maps in the command and control systems being used, to give decision makers an overview of the current state of affairs. In the case of trackable changes such as the movements of individuals, vehicles, or military units, there will generally be some projection as to a future state (e.g., “where that column of tanks may be 1 h from now”). Such situational awareness is generally restricted as to the timeline (current state plus projections that may forecast seconds, minutes, or perhaps hours). Decision support under these circumstances will affect the assignment of resources, aid in detecting developing problems, and support the protection of life and property.

Situation Awareness: Short Timelines, Small Footprint

Situation awareness depends on knowing or predicting the state of the elements of interest in the (complex) environment under consideration. According to Endsley (1988), situation awareness is “the perception of elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future” (p. 792).
The timeline is generally relatively limited, the geographical area likewise usually restricted, and the possible threats relatively well understood or defined by experience. Often a significant percentage of the information underlying the situation awareness picture comes from devices such as video and still cameras, motion detection sensors, acoustic sensors, and radar. Algorithms to make sense of the data produced by the devices are continuously improving.
Natural language information for situation awareness often concerns movements or changes within the area of interest, and text analytic processing used to update the situation awareness may be relatively lightweight.

Intelligence: Long(er) Timelines, Larger Footprint

Sense making for intelligence purposes, whether military, national security, or business, often involves timelines that are much longer, covering weeks, months, or years instead of microseconds, minutes, or hours. Intelligence sense making over longer periods will often rely on information that is text based. Much intelligence work is carried out over longer periods during which assets may be acquired and set in motion. The data collected may include focused reports from intelligence assets, but also many types of open sources, including news sources, government documents, and research results. Thus, environmental scanning may be subtle and complex, involving political and cultural changes, economic shifts, and other trends that may indicate activities posing threats. In such cases, open sources such as newspapers, television, government reports, blogs, and social media, as well as reports from intelligence assets and analysts, are useful. However, the information in these sources must first be understood and then collated and examined for (repetitive) patterns of behavior that indicate developing threats.

Processing Natural Language Data

Natural language processing uses a number of techniques to analyze the individual parts of sentences in an attempt to make sense of them. Parsing analysis will use grammatical rules to identify the parts of speech contained within the sentence (subject, verb, and direct and indirect objects) as well as identify adjectives, adverbs, prepositional phrases, and other constructs of which the sentence is composed.
Text mining, popularly referred to as “text analytics,” encompasses a variety of different techniques for analyzing natural language text to cull information from documents at hand. Using analysis techniques based on lexical and grammatical patterns in the language employed, sentences can be parsed so that information about documents as well as individual structures within documents and sentences (and, to a small extent, between sentences) may be discovered. These techniques include (but are not limited to):
Document classification: Using a variety of techniques based on linguistic and statistical analysis, documents may be classified (type of content, human language used, etc.), summarized (what the document is about), or clustered (based on a classification).
Named entity recognition/pattern recognition: Useful patterns such as proper names of individuals or organizations, telephone numbers, or e-mail addresses may be recognized and extracted.
Co-reference identification: Alternate names for the same object may be identified through correlation analysis: “Barack Obama,” “President Obama,” “the US President,” “the 44th president,” and “44” may all refer to the same individual, Barack Obama.
Sentiment analysis: Using emotive words and phrases buried within the text, hints as to sentiment, emotion, or opinion may be culled. This has been recently most extensively used in social media analysis.
Relationship and event extraction: Relationships among objects found in the text (“Susan works at ABC Company,” “Jane is the sister of Bob,” and “Mozart died in 1791”) may be discovered.
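For highly regular entity types, pattern recognition of the kind listed above can be as simple as surface patterns over the raw text. The sketch below uses hand-written regular expressions for phone numbers and e-mail addresses; the patterns are illustrative assumptions (real named entity recognition of persons and organizations requires trained models rather than regexes).

```python
import re

# Illustrative surface patterns; the phone format assumed here is the
# North American 1-NNN-NNN-NNNN style used in this chapter's examples.
PATTERNS = {
    "PHONE": re.compile(r"\b1-\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract_entities(text):
    """Return (label, match) pairs for every pattern hit in the text."""
    return [(label, m) for label, pat in PATTERNS.items()
            for m in pat.findall(text)]

text = "Call 1-800-555-1234 or write to tips@example.org."
print(extract_entities(text))
# → [('PHONE', '1-800-555-1234'), ('EMAIL', 'tips@example.org')]
```

The extracted pairs can then be fed to the downstream fusion and linking algorithms discussed next.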
The results of the extraction processes are then available for use in logic models and algorithms, which will look for yet more complex and subtle relationships among the entities that have been discovered. Some of this will serve as background information for context, i.e., to aid in disambiguation (e.g., helping to determine when “44” refers to Mr. Obama and when it refers to, say, someone’s age). Other algorithms combine the extracted information: “Susan Smith works for ABC Company” and “Sam Brown works for ABC Company” establishes a link between Susan Smith and Sam Brown.
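The Susan Smith–Sam Brown link described above can be derived mechanically once employment relations have been extracted. The sketch below assumes extraction has already produced subject–predicate–object tuples; the predicate names (`works_for`, `colleague_of`) are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Triples assumed to come from an upstream extraction step.
triples = [
    ("Susan Smith", "works_for", "ABC Company"),
    ("Sam Brown", "works_for", "ABC Company"),
    ("Jane", "sister_of", "Bob"),
]

def infer_colleagues(triples):
    """Derive colleague links between people sharing an employer."""
    by_employer = defaultdict(list)
    for s, p, o in triples:
        if p == "works_for":
            by_employer[o].append(s)
    return [(a, "colleague_of", b)
            for staff in by_employer.values()
            for a, b in combinations(staff, 2)]

print(infer_colleagues(triples))
# → [('Susan Smith', 'colleague_of', 'Sam Brown')]
```

Links inferred this way are themselves uncertain, of course, and should carry the correlation uncertainty discussed earlier.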
Much of the information thus extracted is stored in databases and, increasingly for large volumes of data, a specialized type of storage called a triple store, which has been designed to efficiently store and retrieve triples consisting of subject–predicate–object. These will be discussed in more depth in the following section.

Structuring Natural Language Data

Extracted text-based information may be stored in structured formats for processing and access. Currently, structures for storage of text-based information for automatic processing generally fall into two categories: ontologies and databases/triple stores, the latter of which are a special kind of database. Each of these has its strengths and weaknesses for sense making, which we will discuss in this section (see also Chapter 12).
Ontologies contain information about the characteristics of and relationships among different classes of objects within a specific domain: that is, a definition of a shared concept of the objects in the domain. For example, within a domain containing human beings, a “parent” is a (human) object that has at least one instance of an object called “child.” A “mother” is a special subclass of parent with the extra characteristic that she also has the gender “female,” and so on. Thus, when an object is described as a specific class within the domain of interest, there is knowledge about some aspects of the object (“Mary must be female because she is a mother”) and relationships between objects (“If Mary is Susan’s mother, then Susan is Mary’s child”). Ontologies have the advantage that we have defined in advance exactly what each class of objects is and how it relates to all other objects within our domain of interest. However, although we use an ontology to store information about the characteristics of the concept “mother” in the domain, information on individual instances of each class is usually stored using other methods—for example, databases.
Databases are useful for storing large amounts of often complex information about specific instances of objects within the domain of interest. Generally, the information contained within databases is contained within files of similar objects, often presented as tables, which may be interrelated to reduce data redundancy, speed up processing, and structure results. Files contain records in which the data for numerous instances of similar items are stored as named and typed fields describing the important characteristics of the objects in the file. Within a single file, the record structures are identical. Retrieving information relies on knowledge about the structures within the various files as well as the relationships among them. Structures for the files are determined before filling in information on individual instances, thus ensuring conformity to ease retrieval. However, determining the structure ahead of time means that the analysts have made a priori decisions as to what information is needed and what information belongs together. Later changes to the structures within the database are possible but not always easy to effect.
A special variant of databases known as a triple store is a potential solution. A triple store contains atomic information contained in triples rather than as records inside of more complexly structured files. A triple is a three-part data entity in the form subject–predicate–object: “1-800-555-1234 is a telephone number,” “Susan Smith works at ABC Company,” “ABC Company produces widgets,” etc. In a triple store, each triple is an autonomous piece of information that does not rely on structures such as a database record format to provide some context for the information. There are advantages to this, one of which is that record formats and schema do not need to be modified if there are changes and updates to the type of information being stored. Another advantage is that queries are simplified because one does not need to know the names of files and fields to make a query. Yet another advantage is that the presentation of a query is easily shown in a graph format, facilitating visualization of the query results.
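The schema-free querying that makes triple stores attractive can be sketched in a few lines. The toy store below is not a real triple store implementation (production systems such as SPARQL endpoints use indexes for scale), but it shows the subject–predicate–object model and wildcard-style queries the text describes.

```python
class TripleStore:
    """Toy subject–predicate–object store with wildcard queries."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        # None acts as a wildcard, analogous to a variable in a query
        # language: no file or field names are needed to ask a question.
        return [(ts, tp, to) for ts, tp, to in self.triples
                if (s is None or ts == s)
                and (p is None or tp == p)
                and (o is None or to == o)]

store = TripleStore()
store.add("Susan Smith", "works_at", "ABC Company")
store.add("ABC Company", "produces", "widgets")

# "Who works where?" without knowing any schema in advance:
print(store.query(p="works_at"))
# → [('Susan Smith', 'works_at', 'ABC Company')]
```

Note also that the query result is itself a set of edges (subject and object as nodes, predicate as edge label), which is why such results lend themselves naturally to graph visualization.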

Two Significant Weaknesses

Two weaknesses in current text analytics processing should be taken into account for appropriate intelligence exploitation and decision making. The first is that embedded non-content information, which provides clues as to the true source of the content (first person, hearsay, speculation, etc.) and to the credibility of the information, is being ignored. The second is that extracting information and storing it outside its original context may result in information being lost or subtly altered. These two areas of concern are discussed in detail below.

Ignoring Lexical Clues on Credibility and Reliability

Humans do not simply communicate factual observations; they relate information received from other sources, they speculate and infer, they tell partial truths, and they discuss events that might take place in the future. North Atlantic Treaty Organization (NATO) intelligence organizations often use the A1-F6 designations described in the Joint Consultation, Command and Control Information Exchange Data Model (JC3IEDM) for source reliability and information credibility. Not only are these designators relatively broad (“report plausible” or “confirmed by at least three sources”), but they are often assigned to a complete report rather than to individual facts (or speculations) within it; moreover, they are usually assigned manually by an analyst (i.e., not automatically generated) and are therefore subject to that analyst’s knowledge and interpretation. Automated text analytic processing generally looks for certain types of patterns and simply ignores other elements of the texts.
Specific content within a given statement is often packed with lexical elements that indicate in some manner the uncertainty of the content itself or that indicate the original source of information. Take, for example, the following sentences:
1. John is a terrorist.
2. The Central Intelligence Agency (CIA) has concluded that John is a terrorist.
3. I believe that John is a terrorist.
4. My neighbor thinks John is a terrorist.
5. It has been definitely disproved that John is a terrorist.
In each of these sentences, the relationship (“fact”) pattern of the sentence would produce the relation John IS-A terrorist. However, the lexical clues surrounding this “fact” weaken the belief in its veracity. In (1) there are no lexical clues as to what the writer believes, but in (2) and (4) there are indicators of third-party information (which may or may not have been repeated accurately); (2) indicates an inference; (3) and (4) indicate belief rather than knowledge; and (5) could be an unidentified third-party source, but the conclusion is a contradiction of the extracted “fact” of John being a terrorist.
Humans also chain multiple indicators of uncertainty into a single statement. For example, by adding the adverb “probably” to (2), the resulting “fact” of John being a terrorist is even weaker.
2. The CIA has concluded that John is a terrorist.
6. The CIA has concluded that John is probably a terrorist.
Appropriate decision making depends on knowledge of the quality of the intelligence upon which the decision rests. Natural language processing algorithms that identify parts of speech can identify the adjectives, adverbs, and other constructs in which expressions of uncertainty are embedded; the results of text analytic processes that extract information from natural language text should therefore be expanded to include such embedded information and pass it on to the fusion algorithms and models that predict threats.
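One simple way to operationalize the chaining of uncertainty markers in sentences (2) and (6) is to discount a confidence score multiplicatively for each marker found. The marker words and discount factors below are invented for illustration; calibrated values would have to come from empirical work such as that cited above (Kruger, 2008), not from a hand-set table.

```python
# Illustrative discount factors for lexical uncertainty markers.
# Both the word list and the numeric values are assumptions for the sketch.
MARKERS = {
    "concluded": 0.9,   # third-party inference
    "believe": 0.6,     # first-person belief
    "thinks": 0.5,      # second-hand belief
    "probably": 0.7,    # probability adverb
}

def statement_confidence(sentence: str) -> float:
    """Chain discounts multiplicatively for every marker present."""
    conf = 1.0
    for word, factor in MARKERS.items():
        if word in sentence.lower():
            conf *= factor
    return round(conf, 3)

print(statement_confidence("The CIA has concluded that John is a terrorist."))
# → 0.9
print(statement_confidence(
    "The CIA has concluded that John is probably a terrorist."))
# → 0.63
```

Adding “probably” to sentence (2) weakens the extracted “fact” further, exactly as the discussion above describes: each additional hedge compounds the previous ones.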

Out of Context, Out of Mind

Over time, those who pose a threat to the security and well-being of citizens learn and modify their behavior to escape detection. This means that tools and behavioral expectations that are created today may well be outdated tomorrow. This also means that information we find unimportant today may be highly significant tomorrow. In addition, patterns of activity may become more nuanced and complex over time; we may not always know in advance what we are looking for.
Extracting isolated pieces of information out of the context in which they were stated may result in incorrect information being stored. Consider the following sentences:
7. Elaine flew from London to Stockholm via Amsterdam on 17 November.
8. Wolfgang gave Johanna Petra’s book.
From (7) we can, of course, extract triples such as “Elaine flew to Stockholm,” “Elaine flew via Amsterdam,” and “Elaine flew on 17 November.” However, if we are looking for patterns of behavior, it may turn out that the most interesting information is that Elaine flew via Amsterdam on that particular date (perhaps because another person of interest also was at Amsterdam airport on that day)—something that would be hard to reconstruct unless this information remains connected.
The second sentence, (8), contains both a direct object (Petra’s book) and an indirect object (Johanna), which means that there are (at least) four major components to this statement, rendering it impossible to represent as a triple as it stands. Either we make inferences about some of the information in this statement (Wolfgang is somehow connected to Johanna, and Johanna has Petra’s book) to force a triple or we store this information in another format. A database could be suitable, but then we must anticipate in advance which information we will store and in what format, and the database must be flexible enough to accommodate all possible formulations: for example, the ability to store information about multiple “via” stops, should Elaine’s next suspicious trip include more than one stopover.
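One flexible alternative is to keep sentence (7) together as a single n-ary event record with named roles, rather than shredding it into triples. The field names and record layout below are illustrative assumptions, not a standard schema; the point is that the date–stopover connection survives and can still be queried.

```python
# Sentence (7) as one event record: all roles stay connected,
# and "via" is a list so multiple stopovers can be accommodated.
flight_event = {
    "type": "FLIGHT",
    "agent": "Elaine",
    "origin": "London",
    "destination": "Stockholm",
    "via": ["Amsterdam"],
    "date": "17 November",
}

def stopovers_on(events, place, date):
    """Who passed through a given place on a given date?"""
    return [e["agent"] for e in events
            if e.get("date") == date and place in e.get("via", [])]

print(stopovers_on([flight_event], "Amsterdam", "17 November"))
# → ['Elaine']
```

The query above, matching a person of interest to a stopover on a specific date, is precisely the pattern that becomes hard to reconstruct once the triples “Elaine flew via Amsterdam” and “Elaine flew on 17 November” have been stored separately.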

An Alternative Representation for Flexibility

Originally designed for commanding simulated units, Battle Management Language (BML) is a standardized language for military communication (orders, requests, and reports) developed under the aegis of the NATO MSG-048 “Coalition BML” group. BML is based on JC3IEDM (NATO Multilateral Interoperability Program (MIP), 2005), which is used by all participating NATO partners. As a NATO standard (STANAG 5525), JC3IEDM defines terms for elements of military operations, whether wartime or non-war, and thus provides a vocabulary sufficiently expressive to formulate both military and nonmilitary communications for a variety of deployment types. It also provides a basis for standardized reporting among NATO coalition partners. Although BML has been developed predominantly for use by the military, the principles underlying its grammar and standardized representation of natural language text can be extended to any domain. Extensions of BML for other domains such as crisis management (CML), police investigations (IML), and e-government (C2LG) already exist or are in development.
BML has been designed as a controlled language (Huijsen, 1998) based on a formal grammar (Schade and Hieb, 2006; Schade et al., 2010). This grammar was modeled after one of the most prominent grammars in the field of computational linguistics, Lexical Functional Grammar (Bresnan, 2001). As a result, BML is an unambiguous language that can easily be processed automatically.
As described in Schade and Hieb (2006) and in Schade and Hieb (2007), a basic report in BML delivers a statement about an individual task (action), event, or status. A task report is about a military action either observed or undertaken. An event report contains information on nonmilitary non-perpetrator occurrences such as flooding, earthquakes, political demonstrations, or traffic accidents. Event reports may provide important background information for a particular threat: For example, a traffic accident may be the precursor of an improvised explosive device detonation. Status reports provide information on personnel, materiel, facilities, etc., whether own, enemy, or civilian, such as the number of injured, amount of ammunition available, and condition of an airfield or bridge.
Using the various natural language processing techniques and text analytics described previously, natural language statements can be processed and converted to BML (Jenge et al., 2009). BML has the advantage that the production rules of the underlying grammar keep all of the content information in context. Clues as to source type (e.g., eyewitness or third party) as well as linguistic clues as to the uncertainty of the information (e.g., “possibly,” “probably,” “might be”) are reduced to information concerning source type and reliability, credibility of the information, and a label that, among other things, establishes provenance because it is generated from time/date information.
The statement “Coalition forces report the detonation of a bomb at the Old Market in XY City at shortly past 4 PM today” would be represented as a BML string (Figure 12.1, bottom) and can be implemented as a feature-value (structured) matrix (Figure 12.1, top) or other structured form for use.
image
Figure 12.1 Representation of the report “Coalition forces report the detonation of a bomb at the Old Market in XY City at shortly past 4 PM today” as a BML string (bottom) and implemented as a feature-value (structured) matrix. Note that indicators of source type (“eyeball” meaning “eyewitness”) and reliability (“completely reliable”) and content credibility (“RPTFCT” indicating “reported considered fact”) are attached to the statement, as well as a provenance marker (the final position in the string at bottom).
Note that the information remains in context, but also that the simplified representation of the statement as a BML string means that this representation is implementation independent and therefore can be easily mapped into other formats such as XML as needed for further processing (Jenge et al., 2009; Rein, 2013).
To date, this data representation is used for multilevel fusion, including within a NATO research group (IST-106), as a means of bridging the gap between information generated by devices and information generated by humans: algorithms present their results in BML so that both hard and soft data, as well as low-level and high-level fusion results, can be fused (Biermann et al., 2014; Rein et al., 2012; Rein and Schade, 2012). Furthermore, the underlying concept may be used to enable information fusion across multiple natural languages by converting and mapping to the (English-like) BML, thus lowering the barrier of multilingual information (Rein and Schade, 2012; Kawaletz and Rein, 2010).

Conclusions

Natural language data are exploding in volume and are increasingly important; yet, making sense of them remains problematic. Many inroads have been made into techniques for limited extraction of information from text, but two major weaknesses remain: insufficient analysis of embedded linguistic clues concerning the certainty (truth) of the statements, and loss of information owing to removed context. We have discussed these problems as well as an alternative to current popular representation techniques that would resolve or minimize them.

References

Biermann J, Garcia J, Krenc K, Nimier V, Rein K, Snidaro L. Multi-level fusion of hard and soft information. In: Proceedings of Fusion 2014, Salamanca. July 2014.

Bresnan J. Lexical-Functional Syntax. Malden, MA: Blackwell; 2001.

Dragos V, Rein K. Integration of soft data for information fusion: pitfalls, challenges and trends. In: Proceedings of Fusion 2014, Salamanca. 2014.

Endsley MR. Situation awareness global assessment technique (SAGAT). In: Proceedings of the National Aerospace and Electronics Conference (NAECON). New York: IEEE; 1988:789–795. doi: 10.1109/NAECON.1988.195097.

Huijsen W-O. Controlled language—an introduction. In: Proceedings of the Second International Workshop on Controlled Language Applications (CLAW98). Pittsburgh, PA: Language Technologies Institute, Carnegie Mellon University; May 1998:1–15.

Jenge C, Kawaletz S, Schade U. Combining different NLP methods for HUMINT report analysis. In: NATO RTO IST Panel Symposium, Stockholm, Sweden. October 2009.

Kawaletz S, Rein K. Methodology for standardizing content of military reports generated in different natural languages. In: Proceedings of MCC 2010, Wroclaw, Poland. September 2010.

Kruger K. Two "Maybes," one "Probably" and one "Confirmed" equals what? Evaluating uncertainty in information fusion for threat recognition. In: Proceedings of MCC 08, Cracow. 2008.

Kruger K, Schade U, Ziegler J. Uncertainty in the fusion of information from multiple diverse sources for situation awareness. In: Proceedings of the 11th International Conference on Information Fusion, Cologne. July 2008.

NATO. AJP 2.0 Allied Joint Intelligence, Counter Intelligence and Security Doctrine, NATO/PfP Unclassified, Ratification Draft 2. 2003.

NATO Multilateral Interoperability Programme (MIP). JC3IEDM Metamodel IPT3 V3.1.4: The Joint C3 Information Exchange Data Model Metamodel (JC3IEDM Metamodel). 2005. Available from: https://mipsite.lsec.dnd.ca/Public%20Document%20Library/04-Baseline_3.1/Interface-Specification/JC3IEDM/JC3IEDM-Metamodel-Specification-3.1.4.pdf.

Rein K. Re-thinking standardization for interagency information sharing. In: Akhgar B, ed. Strategic Intelligence Management: National Security Imperatives and Information and Communications Technologies. Boston: Elsevier; 2013.

Rein K, Schade U. How certain is certain? Evaluation of uncertainty in the fusion of information derived from diverse sources. In: Proceedings of ISIF Fusion 2009, Seattle. 2009.

Rein K, Schade U. Battle management language as a "Lingua Franca" for situation awareness. In: Proceedings of IEEE CogSIMA 2012, New Orleans. March 2012.

Rein K, Schade U, Kawaletz S. Uncertainty estimation in the fusion of text-based information for situation awareness. In: Proceedings of the 13th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Dortmund, Germany, June 28–July 2, 2010. Springer; 2010.

Rein K, Schade U, Remmersmann T. Using battle management language to support all source integration. In: NATO RTO IST-112 Joint Symposium, Quebec City, Canada. Spring 2012.

Roe C. The Growth of Unstructured Data: What to Do with All Those Zettabytes? [Online]. Dataversity; 2012. Available from: http://www.dataversity.net/the-growth-of-unstructured-data-what-are-we-going-to-do-with-all-those-zettabytes/.

Schade U, Hieb MR. Development of formal grammars to support coalition command and control: a battle management language for orders, requests, and reports. In: 11th International Command and Control Research and Technology Symposium (ICCRTS), Cambridge, UK. 2006.

Schade U, Hieb MR. Battle management language: a grammar for specifying reports. In: 2007 Spring Simulation Interoperability Workshop (Paper 07S-SIW-036), Norfolk, VA. March 2007.

Schade U, Hieb M, Frey M, Rein K. Command and Control Lexical Grammar (C2LG) Specification, Version 1.3. FKIE Technical Report ITF/2010/02. June 2010. Available from: http://c4i.gmu.edu/eventsInfo/conferences/2011/BMLsymposium2011/papers/BML-Symposium-Schade.pdf.

The Economist. Data, Data Everywhere. Special report. February 25, 2010. Available from: http://www.economist.com/node/15557443.
