Chapter 4

Complexities of Data

IN THIS CHAPTER

Understanding the challenges of big data

Highlighting real-world use cases of searching unstructured data

Making your data semantically retrievable

Adding context to your search results

Distinguishing between business intelligence and data analytics

Using bird-flocking visualizations to guide preliminary data exploration

Your Google search history, Uber trip logs, tweets, Facebook status updates, Airbnb stays, and bank statements tell a story about your life. The locations logged by your cellphone carrier, the places you visit most often, and your online purchases can define your habits, your preferences, and your personality.

This avalanche of data, generated at every moment, is referred to as big data, and it’s the main driver of many predictive analytics models. Capturing all these different types of data in one place and applying analytics to them is a highly complex task. You might be surprised, however, that in most cases only a small percentage of that data is used for analysis that yields real, valuable results. This small portion of big data is often referred to as smart data or small data — the nucleus that makes sense of big data. Only this small, representative portion of big data will make it into the elevator pitch that justifies your analytical results.

remember The recipe for a truly successful predictive analytics project starts with filtering out the dirt, noise, invalid data, irrelevant data, and misleading data. Once that task is done, you can extract smart data from big data. Analyzing smart data is what unlocks the value of predictive analytics.

So how do we go from big data to smart data to tangible value? At what rate must data be captured, considering how fast it’s being generated? Can visualization of raw data give you a clue as to what to correct, exclude, or include? Relax. This chapter addresses all those questions.

Finding Value in Your Data

Any successful journey takes serious preparation. Predictive analytics models are essentially a deep dive into large amounts of data. When the data isn't well prepared, the predictive analytics model will emerge from the dive with no fish. The key to finding value in predictive analytics is to prepare the data — thoroughly and meticulously — that your model will use to make predictions.

Processing data beforehand can be a stumbling block in the predictive analytics process. Gaining experience in building predictive models — and, in particular, preparing data — teaches the importance of patience. Selecting, processing, cleaning, and preparing the data is laborious. It’s the most time-consuming task in the predictive analytics lifecycle. However, proper, systematic preparation of the data will significantly increase the chance that your data analytics will bear fruit.

Although it takes both time and effort to build that first predictive model — the one that finds value in your data — future models will be less resource-intensive and time-consuming, even with completely new datasets. Even when you don’t reuse the same data for the next model, your data analysts will have gained valuable experience from the first one.

Delving into your data

To use a fruit analogy: you not only have to remove the peel or outer covering, but also dig past the flesh to reach the endocarp; the closer you get to the endocarp, the closer you get to the best part of the fruit. The same rule applies to big data, as shown in Figure 4-1.

image

FIGURE 4-1: Big data and smart data.

Data validity

Data isn't always valid when you first encounter it. Most data is either incomplete (missing some attributes or values) or noisy (containing outliers or errors). In the biomedical and bioinformatics fields, for example, outliers can lead analytics to generate incorrect or misleading results. Outliers in cancer data, for instance, can be a major factor that skews the accuracy of medical treatments: gene-expression samples may appear as false positives for cancer because they were analyzed against a reference sample that contained errors.

Inconsistent data is data that contains discrepancies among its attributes. For example, a data record may have two attributes that don’t match: say, a zip code (such as 20037) and a corresponding state (Delaware). Invalid data can lead to flawed predictive modeling, which leads to misleading analytical results and, ultimately, bad executive decisions. Sending coupons for diapers to people who have no children, for example, is a fairly obvious mistake, but it can happen easily if the marketing department of a diaper company ends up with invalid results from its predictive analytics model. Likewise, Gmail might not always suggest the right recipients when you’re filling in the contacts you forgot to include in a group e-mail, and Facebook may suggest friends who aren’t the kind of friends you’re looking for.

In such cases, the models or algorithms may carry a large margin of error. Sometimes the flaws and anomalies lie in the data initially selected to power the predictive model; other times the algorithms end up being applied to large chunks of invalid data downstream. Either way, invalid data is just one of many factors that can lead to poor predictions.
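To make these checks concrete, here is a minimal sketch in Python (using the pandas library) that flags missing values, implausible outliers, and zip-code/state mismatches in a small, made-up customer table. The column names, values, and lookup table are illustrative only, not part of any real dataset.

```python
import pandas as pd

# Made-up customer records; column names are illustrative only.
records = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "zip_code":    ["20037", "19901", None, "20037"],
    "state":       ["DC", "Delaware", "DC", "Delaware"],
    "age":         [34, 29, 210, 41],   # 210 is an obvious outlier
})

# Incomplete data: rows with missing zip codes.
print(records[records["zip_code"].isna()])

# Noisy data: ages outside a plausible range.
print(records[(records["age"] < 0) | (records["age"] > 120)])

# Inconsistent data: zip code and state that don't match a small lookup table.
zip_to_state = {"20037": "DC", "19901": "Delaware"}   # illustrative lookup
mismatch = records["state"] != records["zip_code"].map(zip_to_state)
print(records[mismatch & records["zip_code"].notna()])
```

Simple rule-based checks like these won’t catch every problem, but they reveal the obvious gaps and spikes before any modeling starts.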

Data variety

The absence of uniformity in data is another big challenge, known as data variety. From the endless stream of unstructured text (generated through e-mails, presentations, project reports, texts, and tweets) to structured bank statements, geolocation data, Uber trip logs, and customer demographics, companies are hungry for this variety of data.

Aggregating this data and preparing it for analytics is a complex task. How can you integrate data generated from different systems such as Twitter, Opentable.com, Google search, and a third party that tracks customer data? Well, the answer is that there is no common solution. Every situation is different, and the data scientist usually has to do a lot of maneuvering to integrate the data and prepare it for analytics. Even so, a simple approach to standardization can support data integration from different sources: You agree with your data providers on a standard data format that your system can handle — a framework that makes all your data sources generate data that’s readable by both humans and machines. Think of it as a new language that all big-data sources speak every time they enter the big-data world.
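As a rough illustration of that "common language" idea, the following Python sketch maps records from two hypothetical feeds (a tweet and a restaurant reservation) onto one agreed field layout. The field names and sources are invented for the example, not any real provider format.

```python
import json

# Hypothetical raw records from two different systems.
tweet = {"user": "maria", "msg": "Great coffee!", "ts": "2016-05-02T10:15:00Z"}
booking = {"customer": "maria@example.com", "note": "window table",
           "booked_at": "2016-05-02T09:00:00Z"}

def standardize(source, raw):
    """Map a source-specific record onto one agreed field layout
    (source, user, text, timestamp) that every feed must produce."""
    if source == "twitter":
        return {"source": source, "user": raw["user"],
                "text": raw["msg"], "timestamp": raw["ts"]}
    if source == "reservations":
        return {"source": source, "user": raw["customer"],
                "text": raw["note"], "timestamp": raw["booked_at"]}
    raise ValueError("unknown source: " + source)

print(json.dumps(standardize("twitter", tweet), indent=2))
print(json.dumps(standardize("reservations", booking), indent=2))
```

Once every source emits the same layout, downstream preparation and analytics code only has to understand one format.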

Constantly Changing Data

This section presents two main challenges of big data: velocity and volume. These are (respectively) the rate at which data is being generated, received, and analyzed, and the growing mass of data.

Data velocity

Velocity is the speed of an object moving in a specific direction. Data velocity refers to another challenge of big data: the rate at which data is being generated, captured, or delivered. The challenge is figuring out how to keep up.

Think of the data being generated by a cellphone provider. It includes all customers’ cellphone numbers, call durations, and GPS locations (for openers). This data is growing all the time, making the task of capturing smart data from big data even more challenging.

So how do you overcome this challenge? There isn’t one simple solution available yet. However, your team can (in fact, must) decide

  • How often you can capture data
  • What you can afford in resources and finances
  • Which type of data you're going to model (for example, streaming or one-time data)
  • Whether you're modeling streaming data or only deriving prediction scores of one or multiple records

If (for example) you own a supercomputer and you have the funds, then you should capture as much data as you can — but you might also need to take into consideration how often that data is changing. In Chapters 16 and 17, we outline some of the most widely used tools for handling streams of data.
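One widely used technique for deciding how much you can afford to capture is to keep a fixed-size random sample of the stream rather than the whole stream. The sketch below shows reservoir sampling in Python; it's a generic illustration of that trade-off, not a recommendation of any specific tool from Chapters 16 and 17.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown
    length, so storage stays fixed no matter how fast data arrives."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Example: keep 5 records out of a simulated stream of 1,000,000 events.
print(reservoir_sample(range(1_000_000), 5))
```

The sample stays representative of everything seen so far, which is often enough for preliminary modeling when capturing every record is too expensive.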

High volume of data

A common mistake people make when they talk about big data is to define it as merely a large amount of data. Big data isn't just about large volumes of data; it’s just as much about a wide variety of data (yes, in huge amounts) generated at high speed and frequency. In Figure 4-2, big data spans three dimensions in an exponential spiral; it looks like a tornado.

image

FIGURE 4-2: Big data as tornado.

Big data is “big” not only because of its large volume (the number of rows or columns, or its overall comprehensiveness); it’s “big” because of all three dimensions taken together: volume, velocity, and variety.

Complexities in Searching Your Data

This section presents the three main concepts of searching unstructured text data in preparation for using it in predictive analytics:

  • Getting ready to go beyond the basic keyword search
  • Making your data semantically searchable
  • Adding context to the users’ search experience

Keyword-based search

Imagine you’re tasked with searching large amounts of data. One way to approach the problem is to issue a search query that consists (obviously) of words. The search tool looks for matching words in the database or data warehouse, or goes rummaging through whatever text your data resides in. Assume you issue the following search query: the President of the United States visits Africa. The search results will consist of text that contains one or more of the words President, United States, visits, and Africa. You might get exactly the information you’re looking for — but not always.

But what about documents that contain none of those words, yet read something like this: Obama’s trip to Kenya?

None of the words you initially searched for are in there — but the search results are semantically (meaningfully) useful. How can you prepare your data to be semantically retrievable? How can you go beyond the traditional keyword search? Your answers are in the next section.
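Before moving on, here is a tiny Python sketch of plain keyword matching over a made-up document collection. Notice that the document about Obama’s trip to Kenya never appears in the results, because it shares no words with the query; the documents and the scoring rule are invented for illustration.

```python
documents = {
    "doc1": "The President of the United States visits Africa this week",
    "doc2": "Obama's trip to Kenya draws large crowds",
    "doc3": "New trade agreement between the United States and Canada",
}

def keyword_search(query, docs):
    """Return document names ranked by how many query words they share."""
    query_words = set(query.lower().split())
    hits = {}
    for name, text in docs.items():
        overlap = query_words & set(text.lower().split())
        if overlap:
            hits[name] = len(overlap)
    return sorted(hits, key=hits.get, reverse=True)

print(keyword_search("the President of the United States visits Africa", documents))
# doc2 never appears, even though it is about the same event.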

Semantic-based search

One illustration of how semantic-based search works is a proof-of-value project that Dr. Bari led during his tenure at the World Bank Group in Washington, DC, an international organization whose primary mission is to fight poverty around the world. The project’s aim was to investigate existing large-scale enterprise search and analytics capabilities on the market and to build a prototype of a cutting-edge framework for organizing the World Bank’s data — most of which was an unstructured collection of documents, publications, project reports, briefs, and case studies. This massive body of knowledge is a resource used toward the Bank’s main mission of reducing world poverty, but the fact that it’s unstructured makes it challenging to access, capture, share, understand, search, data-mine, and visualize.

The World Bank is an immense organization, with many offices around the globe. During the feasibility study, we found that, as in any other large organization, divisions used terms and concepts that had the same overall meaning but different nuances. For example, terms such as climatology, climate change, ozone depletion, and greenhouse emissions were all semantically related but not identical in meaning. The goal was to engineer a search capability smart enough to retrieve documents containing related concepts whenever someone searched for any of these terms.

The initial architecture that Bari’s team adopted for that capability was the Unstructured Information Management Architecture (UIMA), a software-based solution. Originally designed by IBM Research, UIMA is available in IBM software such as IBM Content Analytics, one of the tools that powers IBM Watson, the famous computer that won at Jeopardy! Bari’s team joined forces with engineers and data scientists from IBM Content Management and Enterprise Search, and later with an IBM Watson team, to collaborate on this project.

technicalstuff An Unstructured Information Management (UIM) solution is a software system that analyzes large volumes of unstructured information (text, audio, video, images, and so on) to discover, organize and deliver relevant knowledge to the client or the application end-user.

The ontology of a domain is a collection of concepts and related terms particular to that domain. A UIMA-based solution uses ontologies to provide semantic tagging, which enables enriched searching independent of data format (text, speech, PowerPoint presentations, e-mail, video, and so on). UIMA appends another layer to the captured data, adding metadata that identifies data which can then be structured and semantically searched.

Semantic search is based on the contextual meaning of search terms as they appear in the searchable data space that UIMA builds. Semantic search is more accurate than the usual keyword-based search because a user query returns not only documents that contain the search terms, but also documents that are semantically relevant to the query.

For example, if you’re searching for biodiversity in Africa, a typical (keyword-based) search will return documents that have the exact words biodiversity and Africa. A UIMA-based semantic search will return not only the documents that have those two words, but also anything that is semantically relevant to “biodiversity in Africa” — for example, documents that contain such combinations of words as “plant resources in Africa,” “animal resources in Morocco,” or “genetic resources in Zimbabwe.” Semantic capabilities can go beyond correlation between words and leverage the grammatical structure of the search query to give better search results. Semantic capabilities are mainly based on complex natural-language processing algorithms.

Through semantic tagging and use of ontologies, information becomes semantically retrievable, independent of the language or the medium in which the information was created (Word, PowerPoint, e-mail, video, and so on). This solution provides a single hub where data can be captured, organized, exchanged, and rendered semantically retrievable.

Dictionaries of synonyms and related terms are open-source (freely available) — or you can develop your own dictionaries specific to your domain or your data. You can build a spreadsheet with the root word and its corresponding related words, synonyms, and broader terms. The spreadsheet can be uploaded into a search tool such as IBM Content Analytics (ICA) to power the enterprise search and content analytics.
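As a rough sketch of how such a dictionary can drive query expansion, the following Python snippet loads a tiny hand-built table of root words and related terms (standing in for the spreadsheet described above, not an actual ICA format) and expands a query before it is sent to the search index.

```python
# A tiny, hand-built dictionary of related terms, standing in for the
# spreadsheet of root words and synonyms described above.
related_terms = {
    "biodiversity": ["plant resources", "animal resources", "genetic resources"],
    "climate change": ["climatology", "greenhouse emissions", "ozone depletion"],
}

def expand_query(query):
    """Expand a query with related terms so semantically relevant
    documents can match even without the original keywords."""
    terms = [query]
    for root, synonyms in related_terms.items():
        if root in query.lower():
            terms.extend(synonyms)
    return terms

print(expand_query("biodiversity in Africa"))
# ['biodiversity in Africa', 'plant resources', 'animal resources', 'genetic resources']
```

The expanded term list is what lets a search for "biodiversity in Africa" also surface documents about plant, animal, or genetic resources in African countries.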

Contextual search

In the world of search engines, the goal is to display the right results to the right user. Contextual search is about discovering the right time to display the right results to the right user. A search engine that uses contextual search attempts to understand its users algorithmically, by learning their preferences from their behavior on the web and from other users’ online experiences. When implemented well, contextual search delivers the most relevant results at the moment the user needs to see them.

Major search engines like Google and Bing have been building the next generation of web search, largely by actualizing the concept of contextual search. Contextual search engines try to frame the user’s search experience within a meaningful context. As an example, consider a user searching for “good coffee shops in New York City.” A contextual search will most likely return results according to the user’s location. If the user’s location at the time of the search is SoHo in New York City, the results will be ranked on both public coffee-shop ratings or popularity (because the user specified the word “good” in the query) and proximity to the user’s location. The search engine automatically detects the user’s location, and an algorithm optimizes the results to display coffee-shop pages in order of proximity to where the user currently is.
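Here is a minimal, hypothetical sketch of that kind of ranking in Python: it blends each shop's public rating with its distance from the searcher, using arbitrary weights and made-up coordinates. A real engine would combine far richer signals.

```python
import math

# Hypothetical coffee shops: (name, rating out of 5, latitude, longitude)
shops = [
    ("Cafe A", 4.8, 40.723, -74.003),
    ("Cafe B", 4.2, 40.726, -74.001),
    ("Cafe C", 4.9, 40.760, -73.980),   # highly rated but farther away
]

user_location = (40.724, -74.002)  # searcher is in SoHo

def contextual_score(shop, user, w_rating=0.6, w_distance=0.4):
    """Blend public rating with proximity; the weights are arbitrary here."""
    name, rating, lat, lon = shop
    distance = math.hypot(lat - user[0], lon - user[1])   # rough, for illustration
    return w_rating * (rating / 5.0) - w_distance * distance * 100

for shop in sorted(shops, key=lambda s: contextual_score(s, user_location),
                   reverse=True):
    print(shop[0])
# Cafe C has the best rating but ranks last because it is far from the user.
```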

Further, if the same user searching for a good coffee place in New York City is also known to be a guru in predictive analytics, big data, and data science, then the ads shown on the search engine’s results pages or on the coffee shops’ pages should be related ads (such as an ad for Predictive Analytics For Dummies).

A search engine that uses contextual search learns and builds a user profile that encapsulates the user’s preferences by logging the web pages the user visits most, the keywords used in past search queries, and, in some cases, personal information such as occupation, address, and gender. For the user who searched for “good coffee shops in New York City,” the results might be ranked not only on location and preferences but also on predicted social status, so that the coffee shops presented (a diner or a five-star hotel) fall within the user’s algorithmically predicted budget.

The user’s location, profile, and other factors add context to the search experience. For the same query, a search engine returns different results depending on who is searching, where, and when. The results might even be shaped by the user’s mood at the time of the search (a sentiment-analysis component) and many other factors.

Next-generation contextual search will span a multidimensional space of user attributes that can be learned from the digital footprints we leave in cyberspace:

  • Places we go (for example, through Uber and online reservations)
  • Web pages we visit
  • E-commerce sites where we shop
  • Products we buy (or return)
  • Sequences of our activities
  • Workout activities we do
  • Food we eat
  • Entertainment we like

All of this information sketches a digital profile of you in cyberspace. Such user profiles will be the backbone of the recommendation systems that power contextual search.

Recommender systems and predictive models are the heart of a contextual search engine. In essence, a recommender system suggests relevant search results to the user by using term-based recommendation, similar-user recommendation, or both. Recommender systems are covered in detail in Chapter 2.

The challenge in building such a recommender system is creating a knowledge base, which is simply the data about your users and their online behavior. If you're building a search engine with contextual search for your organization, you need to architect a data hub where you store your users’ information, the clickstream (click logs on your site), and the search queries issued every time a user is on your site.

As an example, consider a case where you build a search engine to serve your company’s employees for documents, such as project reports and publications.

One of the first things you’d need to do is locate the employees’ data, such as the departments they work for, their interests, and their past projects, and then build their biographies. This data is usually available from the human resources department. You may then want to join it with logs from your organization’s search-engine web application: the queries employees performed on the site, when the queries were issued, and the time spent on each page of search results. This information allows you to build profiles of your employees based on their behavior and personal information. These data sources are one possible way to feed the knowledge base of your system; you can expand it later with other data sources that help build richer profiles of your users.

Additionally, you may want to build a recommender system that matches users to documents. To merge the recommender system’s results into your search-engine platform, you may need to design an algorithm that serves that purpose: apply keyword-based and semantic search to get a larger set of results, and then expand or narrow that set by using the results from contextual search.

The search results generated by keyword-based search are simply the documents in which the searched keywords appear most often.

The natural-language processing algorithms and ontologies that power semantic-based search expand the results through query expansion to semantically related keywords (for example, a search for “car” is expanded to include “vehicle”). The recommender system brings in another set of results matched to the user performing the query. One way to create an optimal result set is to start from the results given by the semantic and keyword-based searches, then apply a high ranking score to the documents that the recommendation system recommended to the user performing the search.
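The following Python sketch illustrates that re-ranking idea with invented scores: keyword and semantic scores are summed per document, and anything the recommender picked for the current user gets an extra boost. The documents, scores, and boost value are all hypothetical.

```python
# Hypothetical relevance scores from each component, per document.
keyword_scores   = {"doc1": 0.9, "doc2": 0.7, "doc3": 0.6}
semantic_scores  = {"doc1": 0.2, "doc3": 0.4, "doc5": 0.8}
recommended_docs = {"doc3"}   # what the recommender suggests for this user

def final_ranking(boost=0.5):
    """Combine keyword and semantic scores, then boost documents the
    recommender system picked for the user performing the search."""
    docs = set(keyword_scores) | set(semantic_scores)
    scores = {}
    for d in docs:
        score = keyword_scores.get(d, 0) + semantic_scores.get(d, 0)
        if d in recommended_docs:
            score += boost
        scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)

print(final_ranking())   # doc3 rises to the top thanks to the recommender boost
```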

Consider a scenario where the user is searching for “solar power plants projects.” Assume that Document#1, Document#2, and Document#3 are the top three documents returned on the search results page. These documents appear at the top according to keyword-based search mainly because the words solar, power, plants, and projects appear relatively frequently in them, and in important sections (such as the title, introduction, and conclusion). When the search engine supports semantic search, the results may change; for example, Document#5 may outrank Document#2 or Document#3 and appear on the results page because it contains several terms related to the query (such as energy, climate change, carbon emissions, and fossil fuels), in addition to a high frequency of the words solar, plant, and project. Furthermore, because the user searching for “solar power plants projects” is known to the search engine to be interested in projects in Africa, the recommender system built into the search engine might give a high relevancy score to Document#3, a document describing the world’s largest concentrated solar plant, recently developed in Morocco in North Africa. Notice that the context (the user’s interest in projects in African countries) played a role in shaping the search results.

Now consider another scenario in which a different user searches for the same query, “solar power plants projects.” Unlike the preceding scenario, the results for this particular user are different: the document that appears first on the results page describes a recent giant solar plant in the town of Beloit.

The reason this document appears first could be that this particular user clicked on documents about U.S.-based solar plant projects in the past, or that the user is an energy specialist interested specifically in solar projects in the United States.

Information about the user’s interests can be captured at initial registration with the search engine, gathered from the human resources department, or collected as the user navigated the search-engine website in the past. In summary, the search results are adjusted according to the user’s predicted interests at the time he or she is searching.

Differentiating Business Intelligence from Big-Data Analytics

Be sure to make a clear distinction between business intelligence and data mining. Here are the basics of the distinction:

  • Business intelligence (BI) is about building a model that answers specific business questions. You start with a question or a set of questions, gather data and data sources, and then build a computer program that uses those resources to answer the business questions you’ve targeted. Business intelligence is about providing the infrastructure for searching your data and building reports. Business intelligence uses controlled data: predefined, structured data stored mainly in data-warehousing environments. BI uses online analytical processing (OLAP) techniques to provide some analytical capabilities — enough to construct dashboards you can use to query data, as well as to create and view reports. BI differs from data mining in that it doesn't discover hidden insights.
  • Data mining (DM) is a more generalized data-exploration task: You may not know exactly what you’re looking for, but you’re open to all discoveries. Data mining can be the first step in the analytics process, allowing you to dive into the collected and prepared data to unveil insights that could otherwise have gone undiscovered. Big-data analytics relies on newer technologies that allow deeper knowledge discovery in large bodies of data of any type (structured, unstructured, or semi-structured); technologies designed for such work include Hadoop, Spark, and MapReduce (introduced in Chapter 3).

Predictive analytics emerged to complement, rather than compete with, business intelligence and data mining. You can coordinate your pursuit of analytics with business intelligence by using BI to preprocess the data you’re preparing for your predictive analytics model. You can point BI systems at different data sources — and use the visualizations that BI produces from raw data — to get an overview of your data even before you start designing your analytics model. This approach leads to the visualization of large amounts of raw data, as discussed in the next section.

Exploration of Raw Data

A picture is worth a thousand words — especially when you’re trying to get a good handle on your data. At the pre-processing step, while you’re preparing your data, it’s a best practice to visualize what you have in hand before getting into predictive analytics.

You start by using a spreadsheet such as Microsoft Excel to create a data matrix — which consists of candidate data features (also referred to as attributes). Several business intelligence software packages (such as Tableau) can give you a preliminary overview of the data to which you’re about to apply analytics.

Identifying data attributes

In programming terms, data is a collection of objects and their corresponding attributes. A data object is also called a record, item, observation, instance, or entity. A data object is described by a collection of attributes. A data attribute is also known as a feature, field, or characteristic. Attributes can be nominal, ordinal, or interval:

  • Nominal attributes are numbers (for example, zip codes) or nouns (for example, gender).
  • Ordinal attributes, in most cases, represent ratings or rankings (for example, degree of interest in buying Product A, ranked from 1 to 10 to represent least interest to most interest).
  • Interval attributes represent data ranges, such as calendar dates.

One motivation behind visualizing raw data is to select that subset of attributes that will potentially be included in the analytics. Again, this is one of the tasks that will lead you to the nucleus — the smart data to which you’ll apply analytics.

Your data scientist might be applying predictive models to a database or data warehouse that stores terabytes (or more) of data. Don’t be surprised when it takes a very long time to run models over the whole database. You’ll need to select and extract a good representative sample of your large body of data — one that produces nearly the same analytical results as the whole body. Another helpful step, called dimension reduction, starts when you select the data attributes that are most important — and here visualization can be of help. Visualization can give you an idea of the dispersion of the attributes in your data. For example, given a spreadsheet of numerical data, a few clicks in Excel can identify the maximum, minimum, variance, and median. You can spot spikes in your data that quickly point you to the outliers that are easy to capture. The next sections illustrate examples of visualizations that you can apply to raw data to detect outliers, missing values, and (in some cases) early insights.
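If you work in Python rather than Excel, the same quick checks take a few lines with pandas. The numbers below are invented, and the commented-out line shows one common way to pull a small representative sample from a much larger table.

```python
import pandas as pd

# Illustrative numerical attribute pulled from a larger data matrix.
purchases = pd.Series([12.5, 14.0, 13.2, 15.1, 980.0, 12.9, 14.7])

print(purchases.min(), purchases.max())   # min/max reveal the 980.0 spike
print(purchases.median(), purchases.var())

# Pulling a 1% random sample of a large table keeps exploration fast:
# sample = full_table.sample(frac=0.01, random_state=42)
```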

Exploring common data visualizations

When you hire a cleaning service that has never been to your house, you’d naturally expect the company to ask you about (you guessed it) your house. They might ask about structural features such as the number of rooms and baths, the current overall state of the house, and when they can visit to see it before they start cleaning — or even give you an estimate of the cost. Well, getting your data in order is similar to getting your house in order.

Suppose you’ve captured a large amount of data. You’d like to see it before you even start building your predictive analytics model. Visualizing your data will help you in the very first steps of data preparation by

  • Guiding you to where to start cleaning your data
  • Providing clues to which values you need to fill
  • Pointing out obvious outliers so you can eliminate them
  • Correcting inconsistent data
  • Eliminating redundant data records

Tabular visualizations

Tables are the simplest, most basic pictorial representation of data. Tables (also known as spreadsheets) consist of rows and columns — which correspond, respectively, to the objects and their attributes mentioned earlier as making up your data. For example, consider online social network data. A data object could represent a user. Attributes of a user (data object) can be headings of columns: Gender, Zip Code, or Date of Birth.

The cells in a table represent values as shown in Figure 4-3. Visualization in tables can help you easily spot missing attribute values of data objects.

image

FIGURE 4-3: Example of online social network data in tabular format.

Tables can also provide the flexibility of adding new attributes that are combinations of other attributes. For example, in social network data, you can add another column called Age, which can be easily calculated — as a derived attribute — from the existing Date of Birth attribute. In Figure 4-4, the tabular social network data shows a new column, Age, created from another existing column (Date of Birth).
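A quick sketch of that derivation in Python with pandas might look like the following; the user rows and the fixed "today" date are made up so the example stays reproducible.

```python
import pandas as pd

# Illustrative slice of the social network table from Figure 4-3.
users = pd.DataFrame({
    "Gender": ["F", "M"],
    "Zip Code": ["20037", "10012"],
    "Date of Birth": ["1985-06-12", "1992-11-03"],
})

today = pd.Timestamp("2016-06-01")   # fixed date so the example is reproducible
born = pd.to_datetime(users["Date of Birth"])
users["Age"] = (today - born).dt.days // 365   # derived attribute
print(users)
```

Derived attributes like Age often end up being more useful to a model than the raw column they came from.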

image

FIGURE 4-4: Example of derived attributes.

Word clouds

Consider a list of words or concepts arranged as a word cloud — a graphic representation of all words on the list, showing the size of each word as proportional to a metric that you specify. For example, when you have a spreadsheet of words and occurrences and you’d like to identify the most important words, try a word cloud.

Word clouds work well because much of most organizations’ data is text; a common example is Twitter’s trending terms. Every term in Figure 4-5 (for example) has a weight that affects its size as an indicator of its relative importance. One way to define that weight is by the number of times a word appears in your data collection. The more frequently a word appears, the “heavier” its weight — and the larger it appears in the cloud.
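Counting word occurrences to get those weights is straightforward. The short Python sketch below uses a made-up snippet of text, and the closing comment points to one third-party library (wordcloud) that can turn such counts into an actual cloud.

```python
from collections import Counter

text = """big data smart data analytics data value prediction
          analytics data model analytics data"""

# Weight of each word = how often it appears in the collection.
weights = Counter(text.split())
print(weights.most_common(3))   # [('data', 5), ('analytics', 3), ('big', 1)]

# These counts can feed a word-cloud library, for example the third-party
# `wordcloud` package: WordCloud().generate_from_frequencies(weights)
```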

image

FIGURE 4-5: Word importance represented as weight and size in a word cloud.

Flocking birds as a novel data representation

“Birds of a feather flock together” is a traditional saying that provides instant insight as a way of visualizing data: Individuals of similar tastes, interests, dreams, goals, habits, and behaviors tend to congregate in groups.

Natural flocking behavior in general is a self-organizing system in which objects (in particular, living things) tend to behave according to (a) the environment they belong to and (b) their responses to other existing objects. The flocking behavior of natural societies such as those of bees, flies, birds, fish, and ants — or, for that matter, people — is also known as swarm intelligence.

Birds follow natural rules when they behave as a flock. Flock-mates are birds located within a certain distance of each other; those birds are considered similar. Each bird moves according to the three main rules that organize flocking behavior; a minimal code sketch of one update step follows the list.

  • Separation: Flock-mates mustn't collide with each other.
  • Alignment: Flock-mates move in the same average direction as their neighbors.
  • Cohesion: Flock-mates move according to the average position or location of their flock-mates.
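Here is a minimal sketch, in Python with NumPy, of one update step that applies these three rules to a set of points in a 2-D virtual space. The radius and rule weights are arbitrary choices for illustration, not values from any published flocking algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
positions  = rng.random((30, 2)) * 100     # 30 "birds" in a 100x100 space
velocities = rng.normal(size=(30, 2))

def flock_step(pos, vel, radius=15.0, sep_w=0.05, ali_w=0.05, coh_w=0.01):
    """One update applying separation, alignment, and cohesion to each bird."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        d = np.linalg.norm(pos - pos[i], axis=1)
        mates = (d < radius) & (d > 0)             # neighbours within the radius
        if not mates.any():
            continue
        separation = (pos[i] - pos[mates]).mean(axis=0)   # move apart
        alignment  = vel[mates].mean(axis=0) - vel[i]     # match heading
        cohesion   = pos[mates].mean(axis=0) - pos[i]     # move toward the centre
        new_vel[i] += sep_w * separation + ali_w * alignment + coh_w * cohesion
    return pos + new_vel, new_vel

positions, velocities = flock_step(positions, velocities)
```

Running this step repeatedly, with positions driven by data similarity rather than random numbers, is what produces the clusters you see in a flocking visualization.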

Modeling these three rules enables an analytical system to simulate flocking behavior. Biologically inspired algorithms, in particular those derived from bird flocking, offer a simple way to model social network data. Using the self-organized natural behavior of flocking birds, you can convert a straightforward spreadsheet into a visualization such as Figure 4-6. The key is to define the notion of similarity in your data — and construct a mathematical function that captures that similarity. Start with a couple of questions:

  • What makes two data objects in your data similar?
  • Which attributes can best drive the similarity between two data records?
image

FIGURE 4-6: A simple, novel way to visualize big data: natural bird-flocking behavior.

For example, in social network data, the data records represent individual users; the attributes that describe them can include Age, Zip Code, Relationship Status, List of Friends, Number of Friends, Habits, Events Attended, Books Read, and other groups of particular interests (Sports, Movies, and so on).

You can define the similarity for your group according to nearly any attribute. For example, you can call two members of a social network similar when they read the same books, have a large number of common friends, and have attended the same events. If flocking birds in a virtual space represent those two similar members, the birds will flock together.
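A similarity function along those lines can be as simple as the following sketch; the attributes and weights are illustrative, and you'd tune both for your own data before using the scores to drive the flocking visualization.

```python
def similarity(user_a, user_b):
    """Score two members by shared books, common friends, and shared events.
    The weights are illustrative; tune them for your own data."""
    shared_books   = len(user_a["books"] & user_b["books"])
    shared_friends = len(user_a["friends"] & user_b["friends"])
    shared_events  = len(user_a["events"] & user_b["events"])
    return 1.0 * shared_books + 0.5 * shared_friends + 0.8 * shared_events

alice = {"books": {"Dune", "Emma"}, "friends": {"bob", "carl"}, "events": {"gala"}}
dana  = {"books": {"Dune"},         "friends": {"bob"},         "events": {"gala"}}
print(similarity(alice, dana))   # higher scores mean the two "birds" flock together
```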

In the healthcare and biomedical fields, the data object could be patient data. And you could base the attributes on the patient’s personal information, treatments, and information related to diagnosis. Then, by plotting the data as a visualization based on the flocking of birds, you might be able to visualize and discern insights before you even apply data analytics. Some interesting patterns in your data would become apparent, including some characteristic groupings — and even data anomalies.

Flocking behavior is iterative because behavior, while consistent, isn't static. At each iteration (or round), birds move. Using flocking birds as a visualization technique is an especially relevant way to represent streamed data. With streaming data, at each point the visualization of data objects can change according to the new incoming data. In Figure 4-6, shown earlier, a flocking bird in a virtual space represents a data object (such as an individual user) in the dataset in question (such as the social network). Similar birds in the virtual space that represent similar data objects in real life will flock together and appear next to each other in the visualization.

We invite you to read more about bird-flocking algorithms; here are some references for exploring the field of nature-inspired predictive analytics:

Bellaachia, A., & Bari, A. (2012, June). Flock by leader: a novel machine learning biologically inspired clustering algorithm. Springer Berlin Heidelberg.

Cui, X., Gao, J., & Potok, T. E. (2006). A flocking based algorithm for document clustering analysis. Journal of Systems Architecture, 52(8), 505-515.

Graph charts

Graph theory provides a set of powerful algorithms that can analyze data structured and represented as a graph. In computer science, a graph is a data structure: a way of organizing data that represents relations between pairs of data objects. A graph consists of two main parts:

  • Vertices, also known as nodes
  • Edges, which connect pairs of nodes

Edges can be directed (drawn as arrows) and can have weights, as shown in Figure 4-7. You can decide to place an edge (arrow) between two nodes (circles) — in this case, the members of the social network who are connected to other members as friends:

image

FIGURE 4-7: Online social graph.

The arrow’s direction indicates who “friends” whom first, or who initiates interactions most of the time.

The weight assigned to a particular edge (the numerical value of the edge, as shown in Figure 4-7) can represent the level of social interaction that two social network members have. This example uses a scale of 10: The closer the value is to 10, the more the network members interact with one another.

Here a social interaction is a process in which at least two members of a social network act on and/or respond to each other. Interactions can be offline, going beyond what happens on the network itself: meetings, conference calls, “live” social gatherings, group travel, social events, mobile communications, and text messaging. All such interactions can be represented numerically, each pair receiving a score (a weighted sum of all types of interactions).

The weight on the graph can also represent the influence rate. How influential is one member on another? Several algorithms can calculate influence rate between two social network members, based on their online posts, common friends, and other criteria.

In Figure 4-7, social network members are represented by black circles; each relationship is represented by an edge, with the level of interaction shown as a number on that edge. At first glance, you can spot two disconnected groups — a quick insight that crops up even before you apply analytics.
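To see how quickly such a structure exposes disconnected groups, here is a small sketch using the networkx library with an invented friendship graph; the weights play no role in finding the components, but they are included to match the figure's idea of interaction levels.

```python
import networkx as nx

# Directed, weighted friendship graph; weights are interaction levels (0-10).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("Ana", "Ben", 8), ("Ben", "Cam", 3), ("Cam", "Ana", 6),   # group 1
    ("Dee", "Eli", 9), ("Eli", "Fay", 2),                      # group 2
])

# Disconnected groups show up as weakly connected components.
for group in nx.weakly_connected_components(G):
    print(sorted(group))
# e.g. ['Ana', 'Ben', 'Cam'] and ['Dee', 'Eli', 'Fay']
```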

You might want to focus on either or both of these two groups; you may want to include one of them in the data you’re preparing as a resource for when you build your predictive model. Applying your analytics to one of the groups may enable you to dive into the data and extract patterns that aren't obvious.

Common visualizations

Bar charts are a well-known type of visualization that can be used to spot spikes or anomalies in your data. You can use one for each attribute to quickly picture its minimum and maximum values. Bar charts can also be used to start a discussion about how to normalize your data. Normalization is the adjustment of some — or all — attribute values onto a scale that makes the data more usable.
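Min-max scaling is one simple way to do that adjustment; the tiny Python sketch below maps a handful of made-up values onto a 0-to-1 scale.

```python
values = [3, 10, 2, 8, 100]          # raw attribute values with one big spike

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print([round(v, 2) for v in normalized])
# [0.01, 0.08, 0.0, 0.06, 1.0] -- everything now sits on a 0-to-1 scale
```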

The pie chart is another popular visualization. Pie charts are used mainly to show percentages; they can easily illustrate the distribution of several items and highlight the most dominant one.

Line graphs are a traditional way of representing data that enable you to visualize multiple series or multiple attributes (columns) in the same graph. One advantage of line graphs is flexibility: You can combine different series in one line graph for several purposes. One such graph could depict a correlation between different series.

In Chapters 18 and 19, we introduce the most widely used tools for visualizing data.
