Chapter 3


Why is big data useful?

‘Big data is why Amazon’s recommendations work so well. Big data is what tunes search and helps us find what we need. Big data is what makes web and mobile intelligent’ —Greg Linden, pioneering data scientist at Amazon.19

The big data ecosystem fundamentally changes what you can do with data, and it fundamentally changes how you should think about data.

Completely new ways to use data

We are doing things today that we could not have done without big data technologies. Some of these applications are recreational, while some are foundational to our understanding of science and healthcare.

Big data was what enabled scientists to collect and analyse the massive amounts of data that led to the discovery of the Higgs boson at Europe’s enormous CERN research facility in 2012. It is allowing astronomers to operate telescopes of unprecedented size. It has brought cancer research forward by decades.20

The quantity of training data and the technologies developed to process big data have together breathed new life into the field of artificial intelligence, enabling computers to win at Jeopardy (IBM’s Watson computer), master very complicated games (DeepMind’s AlphaGo) and recognize human speech better than professional transcriptionists (Microsoft Research).21

The ability of search engines to return relevant results from millions of sources relies on big data tools. Even the ability of mid-sized e-commerce sites to return relevant results from their own inventories relies on big data tools such as Solr or Elasticsearch.

Data and analytics were extremely useful before the recent explosion of data, and ‘small data’ will continue to be valuable. But some problems can only be solved using big data tools, and many can be solved better using big data tools.

A new way of thinking about data

Big data changes your data paradigm. Instead of rationing storage and discarding potentially valuable data, you retain all data and promote its use. By storing raw data in data lakes, you keep all options for future questions and applications.

Consider a simple illustration. Suppose I develop an interest in Tesla cars and decide to count the Teslas I see for one month. After the month, I have a number. But if someone asks me for details about colour, time of day, or perhaps another type of vehicle, I’ll need another month before I can give an answer. If I had instead kept a video camera on my car during the first month and had saved all my recordings, I could answer any new questions with data I already had.
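The same idea can be sketched in a few lines of Python. This is only an illustration: the records and field names below are invented for the example. It shows how raw, detailed records can answer questions that a single stored total cannot.

```python
from collections import Counter

# Hypothetical raw sightings; each record keeps every detail that was observed.
sightings = [
    {"make": "Tesla", "colour": "red", "hour": 8},
    {"make": "Tesla", "colour": "white", "hour": 18},
    {"make": "Volvo", "colour": "blue", "hour": 8},
    {"make": "Tesla", "colour": "red", "hour": 18},
]

# The original question: how many Teslas did I see?
tesla_count = sum(1 for s in sightings if s["make"] == "Tesla")

# New questions, answered later from the same stored records.
tesla_colours = Counter(s["colour"] for s in sightings if s["make"] == "Tesla")
volvo_count = sum(1 for s in sightings if s["make"] == "Volvo")

print(tesla_count, dict(tesla_colours), volvo_count)
```

Had only `tesla_count` been stored, the colour and Volvo questions would each have required collecting new data.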

Following a data-driven approach

W. Edwards Deming, the American engineer who worked to re-invigorate Japanese industry in the 1950s, is often credited with the quote, ‘In God we trust; all others bring data.’ Whereas some organizations are led by the intuition of their leaders or diligently adhere to established practices, data-driven organizations prioritize data in making decisions and measuring success. Such a data-driven approach was instrumental in Bill Bratton’s leadership of the NYPD during the 1990s, when he introduced the CompStat system to help reduce crime in New York City.22

In practice, we all operate using a blend of intuition, habit and data, but if you follow a data-driven approach, you will back up your intuition with data and actively develop the tools and talent required to analyse your data.

Data insights

Challenge your assumptions and ask for supporting data. For example, find data showing whether your regular promotions are boosting revenue or are simply loss-makers. Track how customer segments respond to different product placements. Find out why customers do or don’t come back.

Case study – Tesco’s Clubcard

Some organizations are natively data-driven. Others undergo a data transformation. British supermarket giant Tesco is an example of the latter. With the help of external analysts, Tesco experienced tremendous success adopting a data-driven approach to customer relations and marketing, fuelled by the data from their Tesco Clubcard. The chairman, Ian MacLaurin, amazed at the analysts’ insights, said, ‘You know more about my customers in three months than I know in 30 years’.

This period of data-driven growth brought Tesco’s market share from 18 per cent in 1994 to 25 per cent in 2000, as shown in Figure 3.1.23, 24 Its management would later say that data had guided nearly all key business decisions during that time, reducing the risk of launching bold initiatives, and providing an extremely clear sense of direction in decision making.

Figure 3.1 Tesco share price (Clubcard launched Q1 1995).25

Analysis

Some insights jump out from data. Others you’ll have to dig for, perhaps using statistical methods for forecasts or correlations. Our next case study illustrates such a process.

Case study – Target’s marketing to expecting mothers

In 2002, when big data technology was still incubating in Silicon Valley, Target Corporation was initiating a data-driven effort that would bring it significant revenue, along with a certain amount of unwelcome publicity.

Target, the second-largest discount retailer in the United States, was struggling to gain market share from Walmart. Target had a brilliant idea, one that would require creative use of data.

Professor Alan Andreasen had published a paper in the 1980s demonstrating that buying habits are more likely to change at major life events. For Target, the customer event with perhaps the greatest spending impact would be the birth of a child. Target launched a project to flag pregnant shoppers based on recent purchases, with the goal of marketing baby products to these shoppers at well-timed points in the pregnancy.

Target’s analysts carefully studied all available data, including sales records, birth registries and third-party information. Within a few months, they had developed statistical models that could identify pregnant shoppers with high accuracy based solely on what products they were purchasing, even pinpointing their due dates to within a small window.

One year later, an angry father stormed into a Target store outside of Minneapolis, demanding to talk to a manager: ‘My daughter got this in the mail!’ he said. ‘She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?’ The father soon learned the girl was actually pregnant. The story made headlines, and the world marvelled at how Target had hit a financial gold mine and a PR landmine.

Target saw 20 per cent year-over-year growth during these years (2002–2005), growth which they attributed to ‘heightened focus on items and categories that appeal to specific guest segments, such as mom and baby.’26 The world took notice. Target had adopted an analytics-driven approach to marketing, and it had resulted in substantial revenue growth.

Neither the Target nor the Tesco case studies involved what we today call big data, but both brought double-digit growth rates. Target and Tesco took all the information in their systems and added data acquired from third parties. They placed this data in the hands of trained analysts and used the results to steer their operations.

Such data-driven approaches are still bringing success to many companies today. What’s changed is that you now have access to new types of data and better tooling.

Better data tooling

Data brings insights. Your ability to archive and analyse so much potentially relevant data lets you find answers quickly and become extremely agile in your business planning. It’s disruptive technology.

More data generally enables better analysis. It improves some analyses and completely transforms others. It’s like adding power tools to a set of hand tools. Some jobs you can do better, and some that were previously not feasible suddenly become feasible.

In this next section, I’ll show some ways big data makes traditional analytics better.

Data: the more the better

You’ll want to collect as much data as possible to do your analysis: more types of data and greater quantities of each type.

There is a fundamental principle of analytics that ‘more data beats better models’. The strength of your analysis depends on:

  1. Discovering what data is most meaningful.
  2. Selecting an analytic tool appropriate to the task.
  3. Having enough data to make the analysis work.

The reason you’ll want to develop big data capabilities is that big data gives you additional types of data (such as customer journey data) for the first dependency and additional quantities of data for the third dependency.

Keep in mind

Update your existing statistical and analytic models to incorporate new data sources, particularly big data such as web traffic, social media, customer support logs, audio and video recordings and various sensor data.

Additional types of data

To illustrate, imagine an insurer calculating the premium for your car insurance. If the insurer knows only your home address, age and car model, they can make a rough estimate of your risk level. Telling the insurer how far you drive each year would give more insight, as more driving means more risk. Telling where and when you drive would give even more insight into your risk. The insurance company will benefit more from getting the additional data than it would from improving its risk model with the original, limited data.

In a similar way, big data provides additional types of data. It gives detailed sensor information to track product performance for machines. It allows us to record and analyse deceleration rates for cars equipped with monitoring devices. It allows us to manage massive volumes of audio and video data, social media activity and online customer journey data.

The value of customer journey data

Customer journey data is an extremely valuable type of big data. Tesco’s customer analysis in the late 1990s used demographic information (age, gender, family profile, address) and purchase data. This was a lot of data at the time, considering their limited storage media, and it was sufficient for insights into purchase patterns of customer segments. The resulting insights were valuable for marketing, product selection and pricing, but they gave a two-dimensional view of a three-dimensional world.

Tesco only saw what happened when the customer reached the checkout queue. The data we have today is much richer.

Although traditional web analytics gives you a two-dimensional view, with summary statistics such as traffic volume and conversion events (e.g. purchases), the complete web logs (the big data) will tell you:

  • What marketing effort sent each customer to your site: Facebook, Google, an email campaign, a paid advertisement?
  • What was top-of-mind when each customer entered? You might see the link that brought them or the first search term used onsite.
  • What is most important to each customer? You’ll see which filters the customer (de-) selects and the sort order chosen (increasing or decreasing price, rating, etc.). Knowing this can make a significant difference in how you approach each customer during the rest of their online visit.
  • What alternative products did each customer consider before making a purchase? With the online customer journey, you can analyse micro-conversions that signal when you’ve captured the customer’s interest. Particularly for expensive items with infrequent sales, you’ll want to understand how items are capturing the interest of your visitors, and you’ll use these insights in deciding how to sell to future visitors.
  • How to create successful shopping experiences, based on what you learn about customer intention and preference. For example, you might learn that, for customers who entered your site looking for an Android tablet, filtered for memory above 64GB, sorted based on decreasing price and then sorted by highest product review, the most commonly bought tablets were XXX and that certain other tablets were never purchased. You’ll see what additional items this type of customer often purchased. Using this knowledge, you can guide look-alike customers to quickly find the item or items that best suit them.
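As a rough sketch of the last point, the Python below groups anonymous visits by how the customer entered and sorted, then tallies what each group bought. The event format, visit identifiers and product names are all hypothetical; real web logs would need parsing and cleaning first, and would be processed with big data tooling rather than in memory.

```python
from collections import Counter, defaultdict

# Hypothetical journey log: one event per row, as it might be extracted from web logs.
events = [
    {"visit": "v1", "type": "search", "value": "android tablet"},
    {"visit": "v1", "type": "sort", "value": "price_desc"},
    {"visit": "v1", "type": "purchase", "value": "tablet-A"},
    {"visit": "v2", "type": "search", "value": "android tablet"},
    {"visit": "v2", "type": "sort", "value": "price_desc"},
    {"visit": "v2", "type": "purchase", "value": "tablet-A"},
    {"visit": "v3", "type": "search", "value": "android tablet"},
    {"visit": "v3", "type": "sort", "value": "price_asc"},
    {"visit": "v3", "type": "purchase", "value": "tablet-B"},
    {"visit": "v4", "type": "search", "value": "laptop"},
]

# Group events into visits, preserving their order within each visit.
visits = defaultdict(list)
for e in events:
    visits[e["visit"]].append(e)

def journey_key(visit_events):
    """Summarize a visit by its first search term and its last sort order."""
    searches = [e["value"] for e in visit_events if e["type"] == "search"]
    sorts = [e["value"] for e in visit_events if e["type"] == "sort"]
    return (searches[0] if searches else None, sorts[-1] if sorts else None)

# Tally purchases separately for each kind of journey.
purchases_by_journey = defaultdict(Counter)
for visit_events in visits.values():
    key = journey_key(visit_events)
    for e in visit_events:
        if e["type"] == "purchase":
            purchases_by_journey[key][e["value"]] += 1

top = purchases_by_journey[("android tablet", "price_desc")].most_common(1)
print(top)
```

Note that the visits are identified only by an anonymous visit key. No personal details are needed to see which products suit which kind of journey.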

If you ran a small shop and were on a first-name basis with each customer, you would already have such insights and would rely on them to improve your business. In e-commerce, with millions of unseen customers, recapturing this level of insight is extraordinary. We are not talking about invasive spying techniques. You can get valuable insights from studying even anonymous online customer journeys.

Your stores of big data allow you to ask new questions from old data. When you notice a sales spike over the past quarter and wonder how this relates to a certain popular item, you can search through detailed historic data to see which customer visits included searches or views of that item. This flexibility in after-the-fact analysis is only possible with big data solutions.

In statistical analysis, as in the Target example, the customer journey data will provide new features for your analytic models. In the past, your models used customer age, income and location, but you can now add search terms and filters, search result orderings and item views. Knowing that a customer bought an unscented hand cream was a signal of possible pregnancy for Target. Knowing that the customer specifically searched for hand cream that is unscented would have been an even stronger signal.

Keep in mind

If your website sees significant customer engagement, you should start using a big data system to store and analyse the detailed online activity. You’ll benefit from this analysis even if the visitors remain anonymous.

Your detailed customer journey logs will accumulate at a rate of several gigabytes or even terabytes of unstructured data per day. You won’t use your traditional databases for this. We’ll talk more about selecting appropriate databases in Chapter 8.

Additional quantities of data

Some analytic models require very little data to work properly. (You need just two points to fit a line.) But many models, especially machine learning models, work much better as they are fed more data. Michele Banko and Eric Brill, researchers at Microsoft in 2001, demonstrated how certain machine learning methods never stopped benefitting from more data, even as they were gorged with extreme amounts.27 Such machine learning algorithms truly benefit from big data.
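A small Python sketch can illustrate both halves of this point. It is not the Banko and Brill experiment, only a toy least-squares line fit on simulated data; the true slope, the noise level and the sample sizes are assumptions chosen for the example.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Two exact points determine a line completely.
exact_slope, exact_intercept = fit_line([(0.0, 1.0), (2.0, 5.0)])

# With noisy measurements, the estimate tends to improve as points are added.
rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 1.0  # assumed values for the simulation

def noisy_sample(n):
    """Simulate n measurements of the true line with Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0, 1)) for x in xs]

for n in (10, 1_000, 100_000):
    slope, _ = fit_line(noisy_sample(n))
    print(n, abs(slope - TRUE_SLOPE))  # error in the estimated slope
```

For a simple model like this one, the gains from extra data level off quickly. The machine learning methods described above are notable precisely because their gains did not.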

The examples above focused heavily on retail applications. I’ll round out the chapter with a case study from medical research.

Case study – Cancer research

Big data is playing an increasingly important role in cancer research, both for storing and for analysing important genomic data. There are numerous applications, but I’ll briefly mention two: genomic storage and pathway analysis.

Every cancer is different, even for patients with the same type of cancer. A single tumour mass may have 100 billion cells, each mutating in a different way, so that studying only a sample of tumour cells will not give the complete picture of what is happening in that individual.

Technology is making it possible for cancer researchers to record the data from more and more of those cancer cells. Since 2003, with the completion of the Human Genome Project, the cost of sequencing genomes has dropped dramatically, as shown in Figure 3.2.

The result is that we are building up a huge catalogue of genomic data, particularly related to cancer. Estimates are that scientists will soon be sequencing and storing more than an exabyte of genomic data every year.

Big data technologies are also providing the tools for studying that data. Cancers are characterized by how they disrupt cell protein pathways, and these disruptions differ from patient to patient. To gain deeper insight into these patterns, researchers have developed a method where gene interaction networks are modelled as graphs of 25 thousand vertices and 625 million edges. Protein pathways then correspond to subnetworks in this graph. Researchers can identify connected subnetworks mutated in a significant number of patients using graph algorithms running on big data technologies (such as Flink). Such methods have already brought insights into ovarian cancer, acute myeloid leukaemia and breast cancer.
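To give a feel for the graph approach, here is a toy, single-machine sketch in Python. It is not the researchers’ method, which runs at far larger scale on distributed tooling; the gene names, patient records and the threshold for ‘a significant number of patients’ are all invented for illustration. It keeps only genes mutated in several patients and then finds the connected subnetworks among them.

```python
from collections import defaultdict

# Hypothetical gene interaction network (undirected edges) and patient mutations.
interactions = [("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G5", "G6"), ("G6", "G7")]
patient_mutations = {
    "p1": {"G1", "G2", "G5"},
    "p2": {"G2", "G3", "G6"},
    "p3": {"G1", "G3", "G5", "G6"},
    "p4": {"G7"},
}
MIN_PATIENTS = 2  # assumed threshold for "frequently mutated"

# Count how many patients carry a mutation in each gene.
counts = defaultdict(int)
for genes in patient_mutations.values():
    for g in genes:
        counts[g] += 1
frequent = {g for g, c in counts.items() if c >= MIN_PATIENTS}

# Keep only interactions between frequently mutated genes.
neighbours = defaultdict(set)
for a, b in interactions:
    if a in frequent and b in frequent:
        neighbours[a].add(b)
        neighbours[b].add(a)

def connected_subnetworks(nodes, neighbours):
    """Return connected components using an iterative depth-first search."""
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        components.append(component)
    return components

subnetworks = connected_subnetworks(frequent, neighbours)
print(subnetworks)
```

On a graph with hundreds of millions of edges and many patients, the same logic would need to be distributed across machines, which is where big data technologies come in.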

Figure 3.2 Historic cost of sequencing a single human genome.28, 29

But not all applications of big data methods to cancer research have been successful, as we’ll see in a case study in Chapter 12.

Takeaways

  • Big data technologies enable you to bring business value from otherwise unmanageable data.
  • Big data technologies allow you to operate in a much more data-driven manner.
  • Big data opens the door to new analytic methods and makes traditional methods more accurate and insightful.
  • Online customer journey data is an example of big data that has proven valuable in many applications.
  • Big data has many applications to medical research.

Ask yourself

  • When was the last time you uncovered an unexpected insight within your data? Do you have people and processes in place to promote data-driven insights?
  • Which analytic techniques currently used within your organization could be improved by incorporating new data sources not available when those techniques were first built?
  • What problems have you previously written off as ‘too difficult to solve’ because you didn’t have the necessary data or computing power? Which of these might you now be able to solve with big data technologies?