Chapter 5

Applying Models

IN THIS CHAPTER

Understanding models

Categorizing models

Introducing the benefits of models

Highlighting relevant case studies and related domains to predictive analytics

Big data is like an engine that drives our lives. It includes everything about us. Predictive analytics can use big data to foresee our future moves and make predictions about our likely actions — especially when we're someone's prospective customers. Hypothetically, a predictive analytics model can know when you're asleep and can predict the time you'll wake up.

Companies are capturing and storing information at every opportunity. They store every purchase you make, every online search you do, every website you visit, and your preferences. Everything is closely monitored and analyzed. This has become the new norm in our lives. Your doctor, your employer, and your next-door grocer will all be analyzing data about you soon, if they aren't already.

tip The rule for using all this data is clear: Whichever company can accurately find patterns in your data, analyze them, and use them effectively will profit from it.

So what are some of the business implications of using predictive analytics on big data? And how can businesses or organizations profit or create their own success stories from your data? For that matter, how can you do the same? To clarify that picture, this chapter introduces different types of models and highlights some recent case studies from different domains, including healthcare, social media, marketing, and politics. The chapter also highlights domains that correlate with predictive analytics.

Modeling Data

Most predictive analytics tools come equipped with common algorithms, the underlying mathematical formulas, to help you build your model. A completed model can be applied in a few minutes. In fact, a business analyst with no specific background in statistics, data mining, or machine learning can run powerful algorithms on the data relatively quickly, using available predictive analytics tools.

Suppose a business analyst at a retail company would like to know which customer segments to upsell to. She can load each customer's data (purchase history, preferences, demographics, and any other relevant information), run a few models to determine the likely segments of interest, and put the results to use right away in a sales campaign.

Those of us who do this type of work for a living (such as data scientists) tend to seek the ultimate results — to build the all-powerful model to wow the business stakeholders and to showcase the wealth of our knowledge.

In pursuit of this goal, the model-building may take a little longer than in the retailer example. Instead of aiming for a quick victory, we want to optimize performance by building a model with the highest accuracy possible.

A lot of tweaking and experimentation is needed. We may start with a large number of variables or features that are available in our dataset, and funnel our way through until we get to the very few variables that have the most predictive power. This process requires running as many simulations as possible while changing the values of the parameters and plotting the results.
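
The funneling and parameter tweaking described here can be sketched in a few lines of Python with scikit-learn. This is a minimal sketch under invented assumptions (a synthetic dataset, arbitrary feature counts and regularization values), not a recommended recipe.

# A minimal sketch: funnel many candidate features down to the most
# predictive ones, then sweep model parameters and compare the results.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a real dataset with many candidate variables
X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           random_state=42)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),  # keep the k best features
    ("model", LogisticRegression(max_iter=1000)),
])

# The tweaking part: try several feature counts and regularization strengths
param_grid = {"select__k": [5, 10, 20], "model__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))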

Another common technique is to build an ensemble model (see Chapter 7), evaluate the results of the models that make it up, and present the user with the highest-scoring model among them. Or (at the very minimum) we can run multiple, separate techniques on the data, compare results, and eventually pick the one model that consistently scores highest across most of our simulations and what-if scenarios.
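
As a rough illustration of that comparison step, the following Python sketch trains several separate techniques plus a simple voting ensemble on synthetic data and reports their cross-validated scores. The particular algorithms and settings are illustrative assumptions, not the only reasonable choices.

# A minimal sketch: run multiple separate techniques, compare their scores,
# and also try a simple ensemble that combines them.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# An ensemble built from the three base models above (soft voting averages
# their predicted probabilities)
candidates["voting ensemble"] = VotingClassifier(
    estimators=list(candidates.items()), voting="soft")

for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean cross-validated accuracy {score:.3f}")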

remember Building a model is part science and part art. The science refers to well-established statistical techniques, machine learning, and data-mining algorithms. Tweaking is the art.

Models and simulation

A model is nothing but a mathematical representation of a segment of the world we are interested in. A model can mimic behavioral aspects of our customers. It can represent the different customer segments. A well-made, well-tuned model can forecast — predict with high accuracy — the next outcome of a given event.

From this definition, you can already deduce that it's possible to build a model to mimic and represent virtually anything you want to analyze. As you might imagine, this process can quickly and easily become very complex and difficult.

To start with the potential complexity, imagine you're working with a dataset that has many associated variables with a wide range of values. Going through all the possible values and permutations across the entire dataset can be time-consuming.

The standard approach is to run what-if scenarios — hypothetical situations that your simulations investigate. The general outline of running a what-if scenario looks like this:

  1. Build the model.
  2. Start changing the values of parameters and examining the new results, or use a tool — a specialized software program that automates that whole process for you.
  3. Look at the report that the model generates.
  4. Evaluate the result and pick the most important predictors; if you've run multiple models, pick the right model.

The process looks simpler on the page than it is in practice. Certain tasks lend themselves easily to canned solutions. Others can be very hard to solve — two notoriously hard examples are predicting extended weather conditions or stock-market performance. The farther into the future you try to predict, the faster the accuracy of the predictions diminishes. It may be possible to say (for example) that it will snow within the next 12 hours, but it's hard to come up with an accurate prediction that it will snow three weeks from now.

Similar uncertainty crops up when predicting a hurricane's path; the farther into the future you look, the harder it is to know the exact path with any certainty. That's why new data is fed into these models as soon as it becomes available, and the forecast is continuously updated.

Simulating the stock market is also extremely difficult, for a simple reason: The market can be affected by virtually everything.

Complex problems require clever solutions, constant tweaking, and continuous refreshment of the deployed models.

  • Clever solutions: One way to smarten up the model is to include variables not usually associated with the field you're investigating. To analyze the stock market, for example, you might include data about parking activities at malls, or analyze data about daily local newspapers from around the country.

    Another clever solution involves divorce lawyers acquiring data from credit card companies. Apparently those companies can tell that a couple is headed for divorce about two years before the actual divorce date. And they know that with an accuracy of 98 percent — based mainly on the couple's spending habits.

  • Tweaking is a process of many parts: Going through what-if scenarios, running multiple algorithms, and including or excluding certain variables as you step through the analysis. You get the most from this process if you always re-evaluate and try to understand your data in light of the business problem at hand, ask the hard questions, explore the data, and experiment with the different approaches.
  • Continuous refreshing of the model: Updating your model periodically, in light of new information, is recommended to counteract the model's tendency to decay over time.

    When new data becomes available, preserve your competitive edge by updating the model; to keep this process relevant, closely track and monitor your model's performance in real time.

When all companies are doing the same thing, exploring the same gaps, and competing in the same space, you have to stay ahead of the curve. One way to do so is to vary your tactics, spice up your campaigns, and hone your ability to detect changing trends, positioning your business to take full advantage of them.

In short, building a predictive model is an ongoing process, not a set-it-and-forget-it solution. Getting a business to think that way represents a cultural shift for most organizations, but that's what current market conditions demand.

One other form of complexity is the multiple directions your model-building can go. You can build a model about anything and everything — where do you start? The next section helps clear up the potential confusion by categorizing the different types of models.

Categorizing models

You have various ways to categorize the models used for predictive analytics. In general, you can sort them out by

  • The business problems they solve and the primary business functions they serve (such as sales, advertising, human resources, or risk management).
  • The mathematical implementation used in the model (such as statistics, data mining, and machine learning).

Every model will have some combination of these aspects; more often than not, one or the other will dominate. The intended function of the model can take one of various directions — predictive, classification, clustering, decision-oriented, or associative — as outlined in the following sections.

Predictive models

Predictive models analyze data and predict the next outcome. This is the big contribution of predictive analytics, as distinct from business intelligence. Business intelligence monitors what's going on in an organization now. Predictive models analyze historical data to make an informed decision about the likelihood of future outcomes.

Given certain conditions (the recent number and frequency of customer complaints, an approaching service-renewal date, and the availability of cheaper options from the competition), how likely is this customer to churn?

The output of the predictive model can also be a binary, yes/no or 0/1 answer: whether a transaction is fraudulent, for example. A predictive model can generate multiple results, sometimes combining yes/no results with a probability that a certain event will happen. A customer's creditworthiness, for example, could be rated as yes or no, and a probability assigned that describes how likely that customer is to pay off a loan on time.
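
To make that concrete, here is a hedged Python sketch of a churn-style model that returns both a yes/no answer and a probability for each customer. The feature names and the tiny dataset are invented for illustration only.

# A toy churn model: the data is fabricated and far too small for real use,
# but it shows the yes/no-plus-probability output described above.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "complaints_last_90_days":  [0, 5, 1, 7, 0, 3, 2, 6, 0, 4],
    "days_until_renewal":       [300, 20, 180, 10, 250, 40, 90, 15, 365, 30],
    "cheaper_option_available": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "churned":                  [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = data.drop(columns="churned")
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Both a binary answer and the probability behind it, for each test customer
print("Predicted churn (0/1):", model.predict(X_test))
print("Churn probability:    ", model.predict_proba(X_test)[:, 1].round(2))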

Clustering and classification models

When a model uses clustering and classification, it identifies different groupings within existing data. You can then build a predictive model on top of the output of your clustering model, using the clusters to classify new data points. If, for example, you run a clustering algorithm on your customers' data and thereby separate them into well-defined groups, you can then use classification to learn about a new customer and clearly identify which group he or she belongs to. Then you can tailor your response (for example, a targeted marketing campaign) and your handling of the new customer.

Classification uses a combination of characteristics and features to indicate whether an item of data belongs to a particular class.

Many applications or business problems can be formulated as classification problems. At the most basic level, you can classify outcomes as desired or undesired; for example, you can classify an insurance claim as legitimate or fraudulent.
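
The two-step idea (cluster first, then classify new arrivals) can be sketched in Python like this; the customer features are invented purely for illustration.

# Cluster existing customers, then assign a brand-new customer to the
# nearest cluster so the response can be tailored to that group.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by (annual spend, visits per month)
customers = np.array([
    [200, 1], [250, 2], [300, 1],        # low-spend, infrequent visitors
    [1200, 8], [1500, 10], [1300, 9],    # high-spend, frequent visitors
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("Cluster labels for existing customers:", kmeans.labels_)

# A new customer arrives; classify her by the nearest cluster centroid
new_customer = np.array([[1100, 7]])
print("New customer assigned to cluster:", kmeans.predict(new_customer)[0])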

Decision models

Given a complex scenario, what is the best decision to make — and if you were to take that action, what would the outcome be? Decision-oriented models (simply called decision models) address such questions by building strategic plans so as to identify the best course of action, given certain events. Decision models can be risk-mitigation strategies, helping to identify your best response to unlikely events.

Decision models probe various scenarios and select the best of all courses. To make an informed decision, you need deep understanding of the complex relationships in the data and the context you're operating in. A decision model serves as a tool to help you develop that understanding.

Association models

Association models (also called associative models) are built on the underlying associations and relationships present in the data. If (for example) a customer is subscribed to a particular service, she is likely to order another specific service. If a customer is looking to buy Product A (a sports car), and that product is associated with Product B (say, sunglasses branded by the carmaker), he is more likely to buy Product B.

Some of these associations can easily be identified; others may not be so obvious. Stumbling over an interesting association, previously unknown, can lead to dramatic benefits.

Another way of finding an association is to determine whether a given event increases the probability that another event will take place. If, for example, a company that leads a certain industrial sector just reported stellar earnings, what is the probability that a basket of stocks in that same sector will go up or down in value?
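
Here is a small Python sketch that quantifies such an association from hypothetical transaction data, using the standard support, confidence, and lift measures; the products and transactions are made up for illustration.

# Does buying product A (a sports car) raise the likelihood of buying
# product B (branded sunglasses)? Measure it with support, confidence, lift.
transactions = [
    {"sports car", "sunglasses"},
    {"sports car", "sunglasses", "gloves"},
    {"sports car"},
    {"sunglasses"},
    {"gloves"},
    {"sports car", "sunglasses"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

a, b = {"sports car"}, {"sunglasses"}
confidence = support(a | b) / support(a)   # estimate of P(B given A)
lift = confidence / support(b)             # > 1 signals a positive association

print(f"confidence(A -> B) = {confidence:.2f}, lift = {lift:.2f}")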

Describing and summarizing data

At the start of a predictive analytics project, the data scientist and the business analyst aren't fully familiar with the data yet, and don't know what analysis will work best. Substantial time must be spent exploring that data with the sole goal of gaining familiarity. Visualization tools can help with this.

Describing your data can provide a precise summary of the characteristics and underlying structure that make the data relevant and useful. For example, identifying different groupings within the data is an essential step toward building a model that accurately represents your data — which makes a useful analytical result more likely.

Making better business decisions

Business leaders use predictive analytics, first and foremost, to empower business decision-making. The value of data to an organization is, essentially, how well it drives decision-making toward the organization's success.

Data-driven decisions give your business and managerial processes a solid footing, enhance your customers' satisfaction, and increase your return on investment. In a world marketplace full of ever-changing variables — governed by complex rules and immensely interdependent global systems — organizations can navigate more successfully by using predictive analytics to replace guesswork with actions based on real data.

Predictive analytics can transform your business by generating useful new insights that can serve as the basis of sound strategies and effective decision-making based on facts.

Healthcare Analytics Case Studies

The healthcare domain offers examples of predictive analytics in action. This section offers two case studies. In one, researchers used Google search queries to build a model for predicting flu outbreaks; the model made good predictions for a time, then failed to predict outbreaks in subsequent years. In the other, cancer data predicts the survival rate of breast cancer patients.

Google Flu Trends

People use Google to search for nearly everything, nearly all the time — their next destination, the name of the person they just met, topics they want to learn about, and even the symptoms of some disease they might think they have. That's where online searches become medically relevant.

Google researchers found that certain search terms may be good indicators that an outbreak of a disease — in particular, influenza — is in progress.

This insight first appeared in a research paper by Jeremy Ginsberg and several others, published in Nature. The title sums up the unlikely-sounding premise: “Detecting Influenza Epidemics Using Search Engine Query Data”. The results were real for a period of time, but the discovered insights didn't hold for long. Google used those search terms in an attempt to predict current flu activity, almost in real time, around the world.

Ginsberg and his colleagues discovered a strong correlation between the number of individuals who search for flu-related terms and the number of individuals who actually have the flu. Although an Internet search for information about a disease may seem an obvious action to take when you're feeling ill, consider why it seems obvious: An underlying pattern of behavior exists. The pattern shows up in the data.

Google insights derived from search terms reveal patterns that emerge from similar search queries and real-life phenomena. As stated on Google's site, “[there are] more allergy-related searches during allergy season, and more sunburn-related searches during the summer”.

Google attempted to utilize those search queries to extract trends. The aim was to build a model that can predict real-life events such as outbreaks in certain regions of the world, in real time.

The clinical data traditionally used for disease surveillance includes the number of visits to hospitals, patients' symptoms, and patients' treatments. Such data elements can be used to detect or predict the spread of an epidemic. The Centers for Disease Control and Prevention (CDC) uses traditional surveillance systems with up to a two-week “reporting lag,” according to Google. In fact, because flu victims used Google when they first got sick, before they went to the doctor, the hope was that the Google flu approach would shorten the lag time.

The government data that Google used — readily available from the CDC website (www.cdc.gov/flu/weekly) — consists of how many patient visits were flu-related across the nine regions of the United States. Based on a training dataset that encompassed five years of search queries and publicly available government data on influenza, Google attempted to build a model for the international surveillance of influenza; the model worked for a period of time, then failed in subsequent years.

The project is known as Google Flu Trends. After its initial success in predicting flu outbreaks with around 97 percent accuracy against the CDC data in 2008, the project went on to produce incorrect predictions in subsequent years. In fact, a few months after Google announced its success in building a model that could predict a flu epidemic, Google failed to predict the 2009 swine flu outbreak caused by H1N1 influenza.

The predictions Google produced were inaccurate, and the CDC remains the only credible source for predicting flu outbreaks. In fact, at some point between 2012 and 2013, Google Flu Trends predicted a number of flu cases roughly twice the number of flu-related doctor visits that the CDC actually reported. Google Flu Trends has since been suspended and no longer publishes predictions. So what can we learn from this failure?

Some flu symptoms are

  • Headaches
  • Fever
  • Coughing
  • Vomiting
  • Runny nose
  • Sore throat
  • Tiredness
  • Aching joints

Google never clearly documented exactly which keywords it used to collect related search queries, but most likely the search queries were filtered by the preceding symptoms.

The major cause of inaccuracies in Google's predictions is that many of the people who used Google to search for such flu symptoms didn't know what influenza is, and most of them didn't have it.

If you search Google for “frequent headaches in the morning,” you don't necessarily have the flu. The same applies to other symptoms, such as tiredness and sore throat.

In fact, CDC data shows that the majority of people who visited doctors thinking they had influenza didn't actually have it. The CDC reported that only about 10 percent of those doctor visits were positive flu cases; the rest were influenza-like illnesses.

Of the other 90 percent of doctor visitors who thought they had the flu but did not, many were also searching Google, and their queries weren't excluded from the Google Flu Trends predictive model even though they didn't have the flu. Hence, the Google Flu Trends results were inaccurate.

Lesson learned: Big data inspires and can lead to phenomenal results, but only when it's good quality data.

Cancer survivability predictors

The medical uses of predictive analytics include the use of algorithms to predict the survivability rate among breast cancer patients. According to American Cancer Society Surveillance Research, breast cancer was the leading type of cancer among American women in 2013. An estimated 232,340 new cases of breast cancer were expected to be diagnosed among American women in 2013, with 39,620 deaths expected from breast cancer in the same year.

Professor Abdelghani Bellaachia and his team at George Washington University made a powerful connection between predictive analytics and human benefit in a research paper published by the Society of Industrial and Applied Mathematics (SIAM): “Predicting Breast Cancer Survivability Using Data Mining Techniques.”

In the course of developing the results described in this paper, the team used publicly available historical data about breast cancer survivability to build the model. That data can be downloaded from the National Cancer Institute's Surveillance, Epidemiology, and End Results website, SEER for short (http://seer.cancer.gov).

SEER's database contains historical data about different types of cancer. The collected data on breast cancer includes several attributes, such as survival time, vital status, and cause of death.

The predictive analytics model adopted for this case study was classification-based. Three algorithms were used — Naïve Bayes, Neural Network, and Decision Trees. (See Chapter 7 for more on these algorithms.) Researchers compared their accuracy and ultimately selected the best for use in their model.

Because this was a classification-based prediction, the model had to be trained by using clusters or classes of data. The pre-classification step requires organizing the data into categories: survived and not survived. Thus the dataset represents two groups of patients. Each row or record corresponds to a breast cancer patient; each can be labeled as survived or not survived.

The training set contains the records of patient data used to build the model; the rest of the records (test data) were used to test the model.

In the case study, a model was selected for breast cancer survivability after comparing all three algorithms' outputs. For this specific application in this specific study, the Decision Trees model performed best, accurately classifying the test data and labeling the target group.
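
That comparison can be sketched in a few lines of Python with scikit-learn. The SEER data isn't bundled with Python, so scikit-learn's built-in breast cancer dataset stands in here purely for illustration; the study's actual features, preprocessing, and tuning are not reproduced.

# Train three classifiers on a training split, then compare their accuracy
# on held-out test records, as described in the case study.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=7)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Neural Network": MLPClassifier(max_iter=2000, random_state=7),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}")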

For more details about this use case, see the original publication: Bellaachia, Abdelghani, and Erhan Guven. "Predicting Breast Cancer Survivability Using Data Mining Techniques." Workshop on Scientific Data Mining, SIAM International Conference on Data Mining (2006). http://www.siam.org/meetings/sdm06/workproceed/Scientific%20Datasets/bellaachia.pdf.

Social and Marketing Analytics Case Studies

On a whole different note, social and marketing analytics provide further evidence of startling-but-useful connections between online activity and useful predictions. The next sections examine several examples: the use of shopping data over time to predict a customer's pregnancy status; the use of Twitter to predict the stock market; the use of news articles to anticipate variations in stock prices; the use of Twitter to detect earthquakes and predict election outcomes; and the potential of New York City's bicycle-trip data. Relax. It only looks like magic.

Target store predicts pregnant women

In an unintentionally invasive instance, the Target store chain used predictive analytics on big data to predict which of its customers were likely to be pregnant. (Charles Duhigg, a reporter at The New York Times, initially covered this story.) Target collected data on some specific items that couples were buying, such as vitamins, unscented lotions, books on pregnancy, and maternity clothing.

Using that data, Target developed predictive models for pregnancy among its customers. The models scored the likelihood that a given customer was pregnant.

Keep in mind that predictive analytics models don't rely on only one factor (such as purchasing patterns) to predict the likelihood of an event. Target probably didn't rely on only one factor to make its predictions. Rather, the model looked at factors that included purchase patterns of pregnancy-related products, age, relationship status, and websites visited. Most important, the resulting predictions were based on events that happened over a period of time, not on isolated events. For example, a couple buys vitamins at some point in time, a pregnancy-guide magazine at another point in time, hand towels at yet another time, and maternity clothes at a still different time. Further, the same couple could have visited websites related to pre-pregnancy, or could have visited websites to look for baby names or lessons for couples on how to cope with the first days of pregnancy. (This information could have been saved from search queries done by the couple.) After Target identified potential customers as probably pregnant, it could then send specialized coupons for products such as lotion and diapers to those customers.
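
Because the signal comes from events spread over time, a practical first step is to roll each customer's event history up into a single feature row. Here's a hedged Python sketch of that idea using pandas; the purchase log, column names, and categories are invented and are not Target's actual data or method.

# Aggregate a hypothetical purchase log into one feature row per customer,
# ready to feed a classification model.
import pandas as pd

events = pd.DataFrame({
    "customer_id": [101, 101, 101, 102, 102],
    "date": pd.to_datetime(["2016-01-05", "2016-02-10", "2016-03-02",
                            "2016-01-20", "2016-02-25"]),
    "category": ["vitamins", "unscented lotion", "maternity wear",
                 "electronics", "vitamins"],
})

pregnancy_related = {"vitamins", "unscented lotion", "maternity wear"}
events["pregnancy_related"] = events["category"].isin(pregnancy_related)

features = events.groupby("customer_id").agg(
    total_purchases=("category", "size"),
    pregnancy_related_purchases=("pregnancy_related", "sum"),
    months_active=("date", lambda d: d.dt.to_period("M").nunique()),
)
print(features)   # one row per customer, suitable input for a classifier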

Details of the exact model that Target used to predict customer pregnancy aren't available. One way to build such a model, however, is to use classification-based prediction. (Note that this isn't the only possible way, and may not be the approach used at Target.) The general procedure would look like this:

  1. Collect data about past, current, or potential Target customers, including their purchases and online activity over time (such as search data from Target's website).
  2. Collect transactional data from Target customers who actually purchase the products you're interested in, some of which are pregnancy-related.
  3. Select training data that will be used to build your classification-based model, and set aside some of the past data to use in testing your model.
  4. Test the model until it's validated and you're happy with the accuracy of its performance on historical data.
  5. Deploy your model. As new incoming data for a given customer arrives, your model will classify that customer as either potentially pregnant or not.

Twitter-based predictors of earthquakes

Another astonishing use of predictive analytics is to detect earthquakes. Yes, earthquakes. Researchers Sakaki, Okazaki, and Matsuo from the University of Tokyo — situated in a region known for seismic activity — used postings on the Twitter microblog social network to detect an earthquake in real time. Their research (“Earthquake Shakes Twitter Users: Real-Time Event Detection by Social Sensors”) was published in the proceedings of the 2010 International Conference on World Wide Web (WWW 2010).

The researchers' approach was to use Twitter users as sensors that can signal an event through their tweets. Because Twitter users tend to tweet several times daily, the researchers could capture, analyze, and categorize tweets in real time. They sought to predict the occurrence of earthquakes of intensity three or more by monitoring those tweets. One result of the research was an earthquake-based Twitter monitoring system that sends e-mails to registered users to notify them of an earthquake in progress. Apparently the registered users of this system received notification much faster than from the announcements broadcast by the Japan Meteorological Agency. The Twitter-based system was based on a simple idea:

  1. The earthquake-detection application starts collecting tweets about an event that's happening in real time.
  2. The collected tweets would be used to trace the exact location of the earthquake.

One problem: Tweets containing the word earthquake may or may not be about an actual earthquake. The data collected was originally focused on tweets consisting of words directly related to an earthquake event — for example, such phrases as “Earthquake!” or “Now it's shaking!” The problem was that the meanings of such words might depend on context. Shaking crops up in phrases such as “someone is shaking hands with my boss” — and even earthquake might mean a topic rather than an event (as in, “I am attending an earthquake conference”). For that matter, the verb tense of the tweet might refer to a past event (as in a phrase such as, “the earthquake yesterday was scary”).

To cut through these ambiguities, the researchers developed a classification-based predictive model based on a Support Vector Machine (see Chapter 7); a minimal code sketch of this kind of text classifier appears after the following list.

  • A tweet would be classified as positive or negative on the basis of a simple principle: A positive tweet is about an actual earthquake; a negative tweet is not.
  • Each tweet was represented by using three groups of features:
    • The number of words in the tweet and the position of the query word within the tweet.
    • The keywords in the tweet.
    • The words that precede and follow a query word such as earthquake in the tweet.
  • The model makes these assumptions:
    • That a tweet classified as positive contains the tweeter's geographical location.
    • That a tweeter who sends a positive tweet can be interpreted as a virtual sensor.
    • That such a tweeter is actively tweeting about actual events that are taking place.
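
Here is that minimal sketch in Python, loosely inspired by the feature groups above; the tweets, labels, and feature choices are invented for illustration and are not the researchers' actual code.

# A toy tweet classifier: simple statistical features (group A) combined
# with word counts standing in for the keyword/context features (B and C),
# feeding a Support Vector Machine.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

tweets = ["Earthquake! The building is shaking",
          "Now it's shaking here, earthquake",
          "I am attending an earthquake conference",
          "The earthquake yesterday was scary"]
labels = [1, 1, 0, 0]   # 1 = about an earthquake happening right now

def statistical_features(texts):
    # Tweet length and the position of the query word "earthquake"
    rows = []
    for t in texts:
        words = t.lower().replace("!", "").replace(",", "").split()
        position = words.index("earthquake") if "earthquake" in words else -1
        rows.append([len(words), position])
    return np.array(rows)

features = FeatureUnion([
    ("stats", FunctionTransformer(statistical_features)),
    ("words", CountVectorizer()),
])

model = Pipeline([("features", features), ("svm", LinearSVC())])
model.fit(tweets, labels)
print(model.predict(["wow, it is shaking here, earthquake right now"]))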

Twitter-based predictors of political campaign outcomes

In a relatively short time, political activity has saturated online social media — and vice versa — even at the higher levels of government.

At the United States House of Representatives, for example, it's common to see congressional staffers in the House gallery, busily typing tweets into their mobile phones while attending a session. Every senator and congressman or congresswoman has a Twitter page — and they (or their staffers) have to keep it active, so they tweet about everything happening inside the House.

Even so, some things never change: A successful political campaign still focuses on making its candidate popular enough to get elected. A winning candidate is the one who can make a lot of people aware of him or her — and (most importantly) get people talking positively about him or her. That's where politics and social media grab hold of each other.

An Indiana University study has shown a statistically significant relationship between Twitter data and U.S. election results. DiGrazia et al. published a paper titled “More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior”. (An electronic copy is available at: http://ssrn.com/abstract=2235423.) The study found a correlation between the number of times a candidate for the House of Representatives was mentioned on Twitter in the months before an election and that candidate's performance in that election. The conclusion: The more a candidate is mentioned on Twitter, the better.

According to a Washington Post article, the sentiments expressed in tweets as reactions to the political events of the 2012 elections matched the balance of public opinion (as indicated by a random-sample survey) about 25 percent of the time.

As Nick Kolakowski relates in an article published online (http://insights.dice.com/2012/11/06/twitter-experiment-predicts-obama-election-win), a team at the Oxford Internet Institute, led by Mark Graham, investigated the relationship between Twitter and election results. They collected thirty million Tweets in October 2012, and counted how many tweets mentioned the two presidential candidates. They found that Obama was mentioned in 132,771 tweets; Romney was mentioned in 120,637 tweets. The Institute translated the count into projected percentages of the popular vote — 52.4 percent for Obama versus 47.6 percent for Romney. At least in terms of popular votes, those figures predicted Obama's victory.

However, a certain ambiguity tended to cloud the picture: The user who tweeted about a candidate might not vote for that candidate. One way to unveil the intention of such a Twitter user would be to apply sentiment analysis to the text of the tweet. In fact, Graham admitted that they should have analyzed the sentiments of the tweets. Clearly, sentiment analysis plays a major role in building a predictive analytics model for such situations.

So, if you're building a model that seeks to predict victory or defeat for the next prominent political candidate, here's a general approach:

  1. Start by collecting a comprehensive training dataset that consists of data about past political campaigns, and present data about all current candidates.

    tip Data should be gathered from microblogs such as Twitter or Tumblr, and also from news articles, YouTube videos (include the number of views and viewer comments), and other sources.

  2. Count the mentions of the candidates from your sources.
  3. Use sentiment analysis to count the number of positive, negative, and neutral mentions for each candidate.
  4. As you iterate through the development of your model, make sure your analysis includes other criteria that affect elections and voting.

    tip Such factors include scandals, candidates' interviews, debates, people's views as determined by opinion mining, candidates' visits to other countries, sentiment analysis on the candidates' spouses, and so on.

  5. Geocode — record the geographical coordinates of — your criteria so you can predict by locations.

    One or more of the features you identify will have predicting power; those are the features that indicate whether a candidate won a past election. Such results are relevant to your model's predictions.

When your training data has been gathered and cleaned, a suitable model can be based on classification. At this point, you can use Support Vector Machines, decision trees, or an ensemble model (see Chapter 7 for more on these algorithms) that bases its predictions on a set of current criteria and past data for each of the political candidates. The idea is to score the results; the highest score should tell you whether the candidate in question is likely to win.
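
As one small piece of that pipeline, the sentiment-counting step (Step 3) could be sketched with an off-the-shelf analyzer such as NLTK's VADER; the tweets, the candidate name, and the score thresholds below are illustrative assumptions, not a validated setup.

# Count positive, negative, and neutral mentions of a (hypothetical)
# candidate using NLTK's VADER sentiment analyzer.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

tweets = ["Candidate A gave a fantastic speech tonight",
          "I can't stand Candidate A's new tax plan",
          "Candidate A will visit our town on Friday"]

counts = {"positive": 0, "negative": 0, "neutral": 0}
for tweet in tweets:
    score = analyzer.polarity_scores(tweet)["compound"]
    if score > 0.05:
        counts["positive"] += 1
    elif score < -0.05:
        counts["negative"] += 1
    else:
        counts["neutral"] += 1

print(counts)   # per-candidate counts like these can become model features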

Tweets as predictors for the stock market

Twitter can be a surprisingly valuable source of data when you're building a predictive model for the stock market. A paper by Johan Bollen and his colleagues, “Twitter Mood Predicts the Stock Market,” summarizes an analysis of about ten million tweets (by about three million users), which were collected and used to predict the performance of the stock market, up to six days in advance.

The study aggregated tweets by date, and limited its scope to only those tweets that explicitly contained sentiment-related expressions such as “I feel, I am feeling, I'm feeling, I don't feel.” The researchers used two tools in this classic example of opinion mining:

  • Opinion Finder is a sentiment-analysis tool developed by researchers at the University of Pittsburgh, Cornell University, and the University of Utah. The tool mines text and provides a value that reflects whether the mood discovered in the text is negative or positive.
  • GPOMS (Google Profile of Mood States) is a sentiment-analysis tool that the researchers built on the Profile of Mood States psychometric instrument, with a vocabulary expanded using Google's n-gram corpus. The tool can analyze a text and generate six mood values that could be associated with that text: calm, happy, alert, sure, vital, and kind.

The research followed this general sequence of steps:

  1. By aggregating the collected tweets by date and tracking the seven values of the discovered moods, the study generated a time series — a sequence of data points taken in order over time. The purpose was to discover and represent public mood over time.
  2. For comparison, the researchers downloaded the time series for the Dow Jones Industrial Average (DJIA) closing values (posted on Yahoo! Finance) for the period of time during which they collected the tweets.
  3. The study correlated the two time series, using Granger causality analysis — a statistical analysis that evaluates whether a time series can be used to forecast another time series — to investigate the hypothesis that public mood values can be used as indicators to predict future DJIA value over the same time period.
  4. The study used a Fuzzy Neural Network model (see Chapter 7) to test the hypothesis that including public mood values enhances the prediction accuracy of DJIA.

Although the research didn't provide a complete predictive model, this preliminary correlation of public mood to stock market performance identifies a quest worth pursuing.
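
Step 3's test can be sketched with the statsmodels library in Python. The two series below are synthetic stand-ins, deliberately constructed so that the "mood" series leads the "market" series; the actual study used the aggregated mood time series and the DJIA closing values instead.

# Granger causality sketch: does a daily "mood" series help forecast a daily
# "market" series? (Synthetic data; a small p-value suggests it does.)
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 200
mood = rng.normal(size=n)
# Build a market series that depends on the mood series two days earlier
market = 0.5 * np.roll(mood, 2) + rng.normal(scale=0.5, size=n)

# Column order matters: the test asks whether column 2 helps predict column 1
data = pd.DataFrame({"market": market, "mood": mood}).iloc[2:]
results = grangercausalitytests(data, maxlag=3)

for lag, res in results.items():
    p_value = res[0]["ssr_ftest"][1]
    print(f"lag {lag}: p-value {p_value:.4f}")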

Predicting variation of stock prices from news articles

Can we predict the variation of today’s stock prices from online news reports? Can we use algorithms to predict the influence of daily news on stock prices? These are just some of the questions asked by those who wish to understand stock valuations and profit from them.

In Dr. Bari's predictive analytics class, graduate students Hongzhi Ren, Lin-Yu Tai, and Yen-Cheng Liu started a research project to apply deep-learning algorithms, specifically convolutional neural networks (CNNs) and multilayer perceptrons, to predict the influence of news articles on stock prices. (Chapter 7 presents an overview of deep learning.)

The question is whether news articles in text format can be applied to develop a predictive analytics framework to classify how much news articles influence a stock price.

A large amount of text data from news articles published from 2011 to 2015 was collected from Bloomberg News. Data on stock prices during the same period was collected from Google Finance and Yahoo Finance.

News articles were tagged with the names of the companies mentioned in them. During the data preparation phase, each article was assigned a label. The label represented the percent change that occurred in the stock price of the company mentioned in the article. For example, a price change between -1 percent and +1 percent was labeled “1”; a change between 1 percent and 5 percent in either direction was labeled “2”; a change between 5 percent and 10 percent in either direction was labeled “3”; and a change of more than 10 percent was labeled “4”.
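
A tiny Python helper makes the labeling scheme concrete; the function name and thresholds simply follow the description above and are not the project's actual code.

# Map a stock's percent price change to the influence label 1-4 described
# in the text (the bands apply in either direction).
def influence_label(percent_change):
    magnitude = abs(percent_change)
    if magnitude <= 1:
        return 1
    elif magnitude <= 5:
        return 2
    elif magnitude <= 10:
        return 3
    return 4

for change in (0.4, -3.2, 7.5, -12.0):
    print(f"{change:+.1f}% -> label {influence_label(change)}")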

After labeling the articles according to their influence on the stock prices of the companies mentioned in them, the next phase in building the model was applying feature-extraction algorithms to the news articles. (Chapter 9 introduces feature extraction.) Feature extraction from text documents aims at extracting the vocabulary and words that best represent a set of documents. A deep-learning feature-extraction algorithm known as word2vec was applied to take a corpus of news articles as input and produce a collection of learned word vectors as output. Word2vec is a two-layer neural network that is well suited to processing text: it takes a text as input and outputs a set of vectors, which serve as feature vectors for the words in the input text dataset.
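
The project's own implementation isn't reproduced here, but the word2vec step can be sketched with the gensim library (version 4 or later); the article snippets below are invented.

# Train a tiny word2vec model: every word in the (toy) corpus becomes a
# 50-dimensional feature vector.
from gensim.models import Word2Vec

articles = [
    "company profits beat analyst expectations this quarter".split(),
    "shares fell after the company missed revenue forecasts".split(),
    "the central bank kept interest rates unchanged".split(),
]

model = Word2Vec(sentences=articles, vector_size=50, window=3,
                 min_count=1, epochs=50, seed=1)

print(model.wv["company"][:5])              # first values of one word vector
print(model.wv.most_similar("profits", topn=2))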

For more details and implementation of word2vec, visit http://deeplearning4j.org/word2vec. Chapter 7 introduces neural networks and deep learning.

In the modeling phase, two deep-learning algorithms known as the multilayer perceptron (MLP) and the convolutional neural network (CNN) (see Chapter 7) were applied to the word vectors generated in the preceding phase to learn from the four labeled categories of news articles (1, 2, 3, and 4) mentioned earlier.

Running multiple experiments and using cross-validation as the model-validation technique, the team reported an accuracy of 65.3 percent for the MLP model and 71.2 percent for the CNN model in predicting the influence label (1 to 4) of a news article. One of the learning outcomes from these early experimental results was that online news articles have the potential to predict future stock movements. This is an ongoing research project that aims to expand the data sources to a larger dataset of news and extend the predictive analytics framework.

Analyzing New York City’s bicycle usage

If you happen to spend a day in New York City, you will encounter a large number of bicyclists interspersed with the pedestrian and vehicle traffic. According to the New York Daily News, New York City's bicyclists made a record-breaking ten million trips in 2015.

Early in 2016, Dr. Bari and his graduate students Liyang Yan, Yifu Zhao, and Xingye Zhang began work on a research project to analyze a dataset created by CitiBike. The purpose of the project is to discover valuable insights that could provide solutions to a number of different problems, including congestion on bike lanes, and delivering essential city services where they are needed most.

The genesis of this research project came from the following questions:

  • Can we predict where the city’s hotspots will be by mapping people to places within the city?
  • If you work in the financial district in downtown NYC, where would you most likely be living?
  • Where will you most likely spend your time after work?

The potential for such insights to be used as a knowledge base to influence decision making is what drives this research.

The team has been developing a data science framework that has the potential to discover insights from the city’s hot spots at any given time of the day. This framework could also be used to reveal potential correlations between workplaces and residential neighborhoods to power recommendation systems for real estate searching.

We considered the following attributes from the dataset:

  • Trip duration
  • Start time and date
  • Stop time and date
  • Start station name
  • End station name
  • Station ID
  • Station latitude and longitude

technicalstuff The dataset website has detailed information about the attributes:

www.citibikenyc.com/system-data

After preprocessing the data, data clustering algorithms such as K-means, DBscan, and hierarchical clustering were applied to the dataset of nearly 300 bicycle stations (Chapter 6 has a detailed explanation of data clustering and algorithms). Most clustering algorithms came to the same insight: two groups of bicycle stations. Upon closer inspection, the two groups of stations shared many intriguing characteristics:

  • One group of stations was located near work sites, such as banks and insurance companies. A similar number of bikes arrived between 8 A.M. and 11 A.M., and a similar number departed between 5 P.M. and 8 P.M. on the same day.
  • The second group of stations was located near residential areas. Mirroring the first group, these stations had roughly the same number of bikes on average depart between 8 A.M. and 11 A.M. and then arrive back between 5 P.M. and 8 P.M.

Next, we applied data clustering algorithms to about eight million bike trip records during 2014. Similar trips were grouped based on source station, the destination station’s longitude and latitude, time, and duration. A traffic pattern emerged: similar trips between 8 A.M. and 11 A.M. These trips probably were bicyclists traveling from home to work. The same analysis was done for evening trips to discover evening hotspots.
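
A hedged Python sketch in the spirit of that analysis appears below: it clusters stations by their morning-versus-evening departure profile. The toy trip log and column names are illustrative and don't follow the exact CitiBike schema.

# Build one row per station (morning vs. evening trip starts), then cluster
# the stations into workplace-like and residential-like groups.
import pandas as pd
from sklearn.cluster import KMeans

trips = pd.DataFrame({
    "start_station": ["A", "A", "A", "B", "B", "C", "C"],
    "end_station":   ["B", "B", "C", "A", "A", "A", "B"],
    "start_hour":    [8, 9, 10, 18, 19, 17, 18],
})

trips["morning"] = trips["start_hour"].between(8, 11).astype(int)   # 8-11 A.M.
trips["evening"] = trips["start_hour"].between(17, 20).astype(int)  # 5-8 P.M.
profiles = trips.groupby("start_station")[["morning", "evening"]].sum()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
profiles["cluster"] = kmeans.labels_
print(profiles)   # one group departs in the morning, the other in the evening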

The discovered insights can be summarized like this:

  • Many cyclists who work in the financial district in New York City probably live in the following areas: Battery Park, Five Points, Chelsea, Hell’s Kitchen, and East Village. These cyclists take bikes after work to spend time in one of these areas: Upper Westside, Times Square, or Greenwich Village.
  • Similarly, many cyclists who travel to SoHo commute from Little Italy, Greenwich Village, Gramercy, Williamsburg, or Fort Greene. They ride bicycles in the evening to spend time in Little Italy, East Village, or Tribeca.

Similar insights were discovered for other parts of the city. These results were only relevant when the weather was suitable for biking. Results also showed that some cyclists travel to stations near restaurants, bars, and subway stations. In these cases, it was harder to tell whether the individual took the trip to go to a restaurant or to commute by an alternative method.

There is a lot of potential in the insights from the dataset that can be used to recommend neighborhoods for living, ease bicycle traffic, deliver services to the city, and more. The future of this research will focus on expanding the time frame to look at subsequent years. Additionally, it will aim to take into account alternative data, such as weather and taxi trip data, to develop a smarter predictive analytics framework.

Predictions and responses

In his book Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (Wiley), Dr. Eric Siegel illustrates that each application of predictive analytics can be outlined by two questions:

  • What is predicted?

    The kind of behavior (action, event, or happening) to predict for each individual, stock, or other kind of element.

  • What is done about it?

    These are the decisions driven by the prediction: the action taken by the organization in response to, or informed by, each prediction.

For example, in the case where Target Stores predicted pregnant women, Dr. Siegel models the predictive analytics application using these two questions:

  • What is predicted?

    Which female customers will have a baby in coming months

  • What is done about it?

    Market relevant offers for soon-to-be parents of newborns

In another instance, Dr. Siegel sums up the targeted direct-marketing predictive analytics application as follows:

  • What is predicted?

    Which customers will respond to marketing contact

  • What is done about it?

    Contact customers more likely to respond

Being able to answer these two questions will not only help you view and identify applications of predictive analytics, but will also direct you in defining the purpose or the problem of your next predictive analytics application.

Dr. Siegel's book is ideal if you want to read more beyond this chapter on applications of predictive analytics.

The next three sections illustrate important fields that correlate with predictive analytics: data compression, prognostics, and open data analytics.

Data compression

Prof. Abdou Youssef is a data compression expert and renowned computer scientist who was awarded a gold medal from the United States National Institute of Standards and Technology (NIST) for his efforts in the development of the NIST Digital Library of Mathematical Functions. He has been an advocate and an early proponent of data compression and its role in data analytics since before the big data era. Data compression is becoming an important step in the data analytics process. It is the process of reducing the size of data; it reduces data storage requirements and can play an important role in improving the performance of a predictive analytics model.

There are many types of data compression. One is known as lossless compression. As the name implies, lossless compression algorithms can compress and decompress files with no loss of data during the process. These algorithms are based primarily on statistical modeling techniques that aim to reduce redundant information in a file.

Lossy compression is another type of data compression. Lossy compression algorithms compress files by identifying and removing unnecessary or redundant information, so decompression produces an approximation of the original data rather than an exact copy. This method is often used on multimedia files.
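
As a quick, concrete illustration of the lossless case, Python's built-in zlib module can compress a redundant byte string and restore it exactly:

# Lossless round trip with zlib: the decompressed bytes match the original.
import zlib

original = b"predictive analytics " * 1000   # highly redundant data
compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

print("original size:  ", len(original), "bytes")
print("compressed size:", len(compressed), "bytes")
print("lossless round trip:", restored == original)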

In most cases, data compression algorithms are mathematically complex and require a lot of time to process. Major vendors, such as IBM, offer tools and services for large-scale operations.

It is important for data scientists to understand both the need for data compression algorithms and how to use them. This section provided an overview of data compression, with the goal of inviting you to learn more about a subject that will have a growing impact on our lives.

Many useful resources are available for learning more about data compression.

Prognostics and its Relation to Predictive Analytics

Prognostics is an engineering field that aims at predicting the future state of a system. Prognostics improves the process of scheduling maintenance, ordering parts, and using resources. Prof. David Nagel, a renowned expert in nuclear energy, educator, and researcher, drew an interesting correlation between the field of predictive analytics and the older field of prognostics.

Prognostics and Health Management of Machinery (PHM)

The design, production, and operation of machinery are multi-trillion-dollar global industries. Anticipating the need for machinery maintenance and replacement is of great value to manufacturers of all types. Hence, the field of Prognostics and Health Management (PHM) developed over the past few decades. A good example is the book Intelligent Fault Diagnosis and Prognosis for Engineering Systems by G. Vachtsevanos (Wiley).

The core idea is to service machines in production lines when they need it, not according to an overly conservative time-based schedule. With the development and commercial availability of cheap microelectromechanical systems (MEMS) and other small sensors, and the realization of sophisticated real-time computations, the field of PHM developed rapidly. It takes historical and current data, and uses them to predict when machines will require expensive and disruptive attention. That permits cost-effective planning for the care of critical machinery in diverse production lines. The PHM Society serves such work: www.phmsociety.org.

Similarly, predictive analytics (PA), a relatively new field that emerged in recent years, takes big data from many sources, often across complete enterprises or fields. PA processes data with a wide variety of algorithms on diverse computers to predict the times and conditions when actions should be taken. The developed field of PHM and the newer field of PA share common methodologies and goals. Currently, PA and PHM remain distinct from each other. It's possible that PHM will remain separate from PA. It might be more likely that the two fields will interact to their mutual benefit in the coming years. The economic benefits of such interactions could be substantial for both arenas.

In fact, in 2015 General Motors announced the creation of a prognostic technology that aims to determine whether certain vehicle components need attention and maintenance because of possible future failure.

It is important to note that a prognostics system consists of sensors, a data-acquisition system, prediction algorithms to perform sensor fusion, and a predictive model to interpret the results. Data sources involved in building a Prognostics and Health Management system mainly include

  • Manufacturer’s data:
    • Manufacturer engineering data based on computer models and field experience from other or similar systems.
    • Represents the nominally expected performance baseline.
    • Data updates accomplished by downloading the latest prognostic routines from a central analysis site database through automated means (telemetry).
  • Additional sensor data:
    • Oil temperature, air pressures, vibrations.
    • Sensor data that include noise, because noise can be indicative of an underlying problem.
    • Platform CPU data
  • Many algorithms have been used in prognostic systems (a small linear-regression sketch follows this list):
    • Linear regression.
    • Time series analysis.
    • Bayesian dynamic linear models.
    • Non-linear regression.
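
Here is a generic, hedged Python sketch of the linear-regression idea applied to prognostics: fit a trend to a hypothetical degradation signal and estimate when it will cross an assumed failure threshold. It is not any vendor's system, and the numbers are invented.

# Fit a linear trend to a (synthetic) vibration signal and estimate when it
# will reach an assumed failure threshold, so maintenance can be scheduled.
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.arange(0, 500, 50).reshape(-1, 1)       # operating hours so far
vibration = (1.0 + 0.004 * hours.ravel()
             + np.random.default_rng(3).normal(scale=0.05, size=len(hours)))

model = LinearRegression().fit(hours, vibration)

threshold = 4.0                                     # assumed failure level
hours_to_threshold = (threshold - model.intercept_) / model.coef_[0]
print(f"estimated threshold crossing near {hours_to_threshold:.0f} operating hours")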

The Rise of Open Data

Bob Lytle, CEO of rel8ed.to (www.rel8ed.to) and former CIO of TransUnion Canada, is leading efforts on the use of public information as an alternative, strategic data source for predictive modeling in the financial services and insurance sectors.

Open Data originated from the idea that access to government data should be free and available for everyone to use; government open-data portals (data.gov in the United States, for example) are typical examples of open data.

The Open Data movement grew out of a simple theory, as Bob puts it: “We as citizens are the government, and therefore have a right to understand and even reuse the information generated by municipal, regional, and federal entities.” However, some data will remain private. Another reality is that open data is often dirty data.

Public data is often incomplete and missing many values. The rel8ed.to team is building a platform for cleaning and reducing public data so that it is ready to use for modelers across multiple business segments. Bob's team is also using public data to generate open-data-driven predictions to find the bottom 10 percent of businesses that are likely to fail in 2017, and the top 10 percent that are almost sure to grow and expand. Predictive models using Open Data information can be used by financial institutions to check portfolio trends and take action much earlier in the cycle, before adverse risk or churn events occur.

Chapter 6 and Chapter 9 present several techniques for cleaning and reducing data.
