Chapter 2

Digging into the Essence of Big Data

In This Chapter

arrow Exploring the four Vs of big data

arrow Delving into the (justifiable) hype

arrow Looking at today and into the future

Having picked up this book, you’ve likely already come across articles or blogs championing big data. You’ve heard some of the buzz and want to know why big data is such a big deal right now.

Data in itself isn’t a new business phenomenon. But big data is. And not all data is big data, even if there’s plenty of it. Don’t be alarmed – it gets easier!

The truth is, companies have had a lot of data for a long time – consider big mainframe computers and early data centres. Data in itself isn’t a new invention. Until recently, this data was limited to what’s called structured data (see Chapter 4), meaning it was typically in spreadsheets or databases. However, even though there was lots of it, this data wouldn’t count as big data because big data is defined by more than just how, well, big it is. As I show in this chapter, there are other factors that define big data.

tip For the purposes of this chapter, I talk strictly about big data as defined by the four Vs – volume, velocity, variety and veracity – I cover in the following section. In practice, some of the data you use in your business may not exactly qualify as big data, and that’s fine. The key to successfully using data is finding the best data for you – the data that gives you the insights needed to grow your business – and then making sure you act on those insights. (The chapters in Part IV set out this process in detail.) If the best data for you isn’t strictly big data, so what? Nobody is going to call the big data police!

Breaking Big Data into Four Vs

Big data is not a passing fad; it’s here to stay. And it’s going to change the world completely. But to really understand big data, and what separates it from normal data, you need to understand four main factors: volume, velocity, variety and veracity, commonly known as the four Vs of big data. By exploring each of the Vs, you can get a feel for how big data can revolutionise the way you do business.

remember The four Vs define what is really special about big data, why it’s different to regular data, and why it’s so transformative. For data to be classed as ‘big data’ it must satisfy at least one of the Vs: volume, velocity, variety and veracity.

I look at each of the Vs in turn in the next sections.

Growing volumes of data

Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips and sensor data you produce and share every second (and those are just for starters). On Facebook alone, ten billion messages are sent, the Like button is clicked 4.5 billion times and 350 million new pictures are uploaded each and every day. You’re no longer talking about humble gigabytes of data, but petabytes and even zettabytes or brontobytes of data. To put this in perspective: all the data generated in the world from the beginning of time up to the year 2000 is now being generated every minute!

remember Data on this scale is simply too large to store in traditional ways, such as on a disc or a mainframe computer. Neither can it be analysed using traditional database technology. This is where distributed computing and cloud computing (see Chapter 6) come into their own. Here the storage burden can be shared across lots of computers and servers. Effectively, the data is broken up into parcels which are stored in different locations. The data is managed and brought together by an overarching software program such as Hadoop. I talk a little more about Hadoop in Chapter 9 but, in a nutshell, the system uses a little of the power from each computer to analyse large volumes of data. This is much faster and more efficient than relying on one (very powerful and expensive!) machine to do it.
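
To make the split-and-share idea concrete, here’s a minimal sketch in Python (a toy stand-in for what Hadoop does at scale, not Hadoop itself – the parcels and words are invented). Each parcel of text is counted independently, as if on a separate machine, and the partial results are then merged into one answer:

  # A toy illustration of the principle behind Hadoop: split the data
  # into parcels, process each parcel independently, then combine
  # the partial results.
  from collections import Counter
  from multiprocessing import Pool

  def count_words(parcel):
      """Map step: count the words in one parcel of the data."""
      return Counter(parcel.split())

  if __name__ == "__main__":
      # Imagine each string is a parcel stored on a different machine.
      parcels = [
          "big data is big",
          "data about data",
          "big insights from big data",
      ]
      with Pool() as pool:
          partial = pool.map(count_words, parcels)  # parcels counted in parallel
      total = sum(partial, Counter())  # reduce step: merge the partial counts
      print(total.most_common(3))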

Theoretically, distributed systems let you cope with any amount of data – which is a good job when you consider that the rate at which data is created is growing at mind-boggling speed.

This means you talk about big data when the volumes are so big that the data no longer fits into traditional storage systems, such as a single database or data warehouse. For example, if you have an enormous customer database with two million rows of information, it may be a large amount of data, but it’s not strictly big data.

Increasing velocity of data

Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds, the speed at which credit card transactions are checked for fraudulent activities or the milliseconds it takes trading systems to analyse social media networks to pick up signals that trigger decisions to buy or sell shares. Some data is generated at such speed that it’s not even worth storing – it’s out of date seconds later.

For instance, in the example of checking credit card transactions for fraudulent activities (such as a transaction in Rome one minute and a transaction in India ten minutes later), the analysis is done in real time, as the transaction is taking place. There’s absolutely no point storing this data and accessing it later. The credit card company needs to analyse it right then and there, on the fly, and has no use for it later.

remember This is typical of the way many companies use data today. Big data technology, specifically in-memory technology, allows you to analyse the data as it’s being generated, without ever putting it into databases.
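
As a flavour of what analysing data ‘on the fly’ means, here’s a minimal Python sketch of the credit card example (the ten-minute rule of thumb and the transactions are invented for illustration – real fraud detection is far more sophisticated). Each transaction is checked the moment it arrives; nothing is written to a database:

  # Flag a card used in two different countries within ten minutes.
  last_seen = {}  # card number -> (country, time in minutes)

  def check(card, country, minute):
      if card in last_seen:
          prev_country, prev_minute = last_seen[card]
          if prev_country != country and minute - prev_minute <= 10:
              print(f"ALERT: card {card} used in {prev_country}, then {country}")
      last_seen[card] = (country, minute)  # keep only the latest event

  stream = [("1234", "Italy", 0), ("5678", "UK", 3), ("1234", "India", 8)]
  for card, country, minute in stream:
      check(card, country, minute)  # analysed as it arrives, never stored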

Exploding variety of data

Variety refers to the different types of data you can now use. In the past, the focus was on structured data that neatly fits into tables or databases, such as financial data. Today, however, the vast majority of the world’s data is unstructured and therefore can’t easily be put into tables (think of photos, voice recordings and social media updates). I talk more about the different types of data in Chapters 4 and 5.

remember With big data technology, you can now harness different types of data, including messages, social media conversations, photos, sensor data, video data and voice recordings, and bring them together with more traditional, structured data. This ability to analyse and use a wide variety of data is really powerful, and you can now extract more business-critical insights than ever before.

For me, variety is the most interesting and exciting aspect of big data. While businesses have been capturing and analysing data for years, the ability to harness such a wide range of data is completely transforming what you can understand about the world and everything within it – including, most importantly for business, your customers.

Variety, velocity and volume are very closely linked. Because we’re now able to extract information from messy (unstructured) data like Facebook posts and video images, the volume and velocity of data have increased accordingly. The amount of data out there and the rate at which we’re generating new data is frightening – and I’m a data expert! The digital universe is doubling in size every two years. At that rate, by 2020 there will be nearly as many bits of information in the digital universe as there are stars in the physical universe.

Coping with the veracity of data

The first three Vs – volume, velocity and variety – formed the original definition of big data, put forward by IT (information technology) analytics firm Gartner. Experts have attempted to add a number of other Vs over time (some useful, some less so), but those three have remained the core. However, veracity is a useful and valid fourth V, and it’s now widely accepted as a key feature of big data.

Veracity refers to the messiness or trustworthiness of the data. You used to only be able to analyse neat and orderly structured data, data that you trusted as accurate. But now you can cope with unruly and unreliable data.

remember With many forms of big data, quality and accuracy aren’t controllable – just think of Twitter posts with incorrect hashtags, abbreviations, typos and colloquial speech, as well as the variable reliability and accuracy of the content itself. But big data and analytics technology now allow you to work with messy data like this. Because there’s so much of it, you can still make sense of it; the sheer volume often makes up for the lack of quality or accuracy.
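
To show what ‘volume making up for quality’ can look like, here’s a minimal Python sketch (the tweets and the single typo fix are invented; real text analytics uses far cleverer normalisation). Even with typos and odd capitalisation, crude clean-up across enough messages still surfaces the trend:

  import re
  from collections import Counter

  tweets = [
      "Loving the new phone!! #awsome",
      "new phone is AWESOME #Awesome",
      "gr8 phone, awsome battery",
  ]

  words = Counter()
  for tweet in tweets:
      for w in re.findall(r"[a-z]+", tweet.lower()):
          words[w.replace("awsome", "awesome")] += 1  # one crude typo fix

  print(words["awesome"], "mentions of 'awesome', despite the mess")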

Introducing a fifth V – Value

I’d argue there’s another important V to take into account when looking at big data – value. Having access to vast amounts of data and many different varieties of data is wonderful, but unless you can turn it into value, it’s useless. Data has to earn its keep and provide positive outcomes for your business, such as understanding your customers better or making your production line more efficient. So, while people get giddy with excitement over the volume, velocity and variety of big data, when it comes to business, it’s actually value that’s the most important V.

tip It’s also important that businesses make a strong business case for any attempt to collect and leverage big data. It’s all too easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of the costs and benefits to your business. Therefore, I always advise clients to start by building a business case and developing a data strategy – there’s more on this in Chapter 10.

Understanding Why Big Data is Such Big News

Big data seems to have really captured people’s imaginations, and I see more and more articles or blogs on the topic each week. Far from being a flash in the pan, it’s becoming a mainstream part of how businesses operate and make decisions.

Why is big data such a big deal? Well, partly it comes down to the wide applications of big data, which go far beyond business. Big data has a huge role to play in activities as diverse as healthcare, disaster relief efforts and making cities safer, better places to live. Sure, it helps companies sell a lot more stuff, but it’s also capable of much, much more.

remember I think there are three main aspects of big data that make it so newsworthy: its powerful predictive capabilities, how it helps you make much smarter decisions and the way it challenges traditional notions of causality. I look at each of those in turn in the next sections.

Promising predictability

It’s no exaggeration to say that with big data you can predict the future. It’s kind of like a crystal ball – but made of wires and networks and software.

technicalstuff Thanks to the ever-increasing amount of data and exciting new analytic technologies, experts have the ability to build predictive models for just about anything. This is called predictive analytics. Predictive analytics can be used to predict who will buy what, and when, and how many units of a product you’ll sell next year; it can be used to predict crop yields for a specific type of seed planted in a certain site; it can even predict civil unrest or outbreaks of viruses.
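
As a tiny taste of what predictive analytics looks like in code, here’s a minimal Python sketch using the scikit-learn library (the sales figures are invented, and a real model would use many more variables than a simple yearly trend):

  import numpy as np
  from sklearn.linear_model import LinearRegression

  years = np.array([[2011], [2012], [2013], [2014]])
  units_sold = np.array([1200, 1350, 1510, 1640])

  model = LinearRegression().fit(years, units_sold)  # learn the trend
  forecast = model.predict([[2015]])                 # project it forward
  print(f"Predicted units for 2015: {forecast[0]:.0f}")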

Making more fact-based decisions

For businesses, the real power of data is that it helps you make smarter decisions that change your business for the better. Want to understand more about your customers so you can run more targeted promotions? Data can help. Want to understand how customers navigate your website so you can make improvements to the site? Data can tell you that. Want to understand what makes your employees tick so you can increase employee satisfaction and reduce staff turnover? You guessed it, data can help.

remember So much of business is based on experience or good old gut reactions. And there’s still a place for that to some degree. But in this day and age you have the ability to base your decisions on so much more than instinct and what’s worked in the past. Today, almost everything in business can be measured, quantified and analysed, and, in doing so, you can dramatically improve your decisions. That’s not to say there’s no room for gut instinct or that all decision making must be a sterile, concrete process. On the contrary – I’d argue the smartest decisions are those based on the hard facts of data viewed with the wisdom of experience.

In Chapter 11, I set out a step-by-step process for using data to improve decision making in your business. It’s an approach I use all the time with clients. But there’s also a bigger, cultural shift required to transform decision making across your company, and I explore that in Chapter 13.

Challenging causality

One of the big discussions in big data is that it allows you to challenge traditional models of causality, or cause and effect.

remember With big data, it’s all about correlations, rather than causality. So, when I buy Product A from Amazon, the site tells me ‘People who bought Product A also bought Product B’. Amazon uses the correlation between the products to sell more, but it doesn’t necessarily need to understand why people who bought Product A also bought Product B. Nor does the customer, for that matter. In this way, causality becomes less important. You no longer need to understand why one thing leads to another; you just listen to what the data tells you.
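
Under the hood, an ‘also bought’ suggestion can be as simple as counting which products appear in the same baskets – no model of why required. Here’s a minimal Python sketch (the baskets are invented; Amazon’s real recommendation engine is, of course, vastly more sophisticated):

  from collections import Counter
  from itertools import combinations

  baskets = [
      {"kettle", "toaster"},
      {"kettle", "toaster", "mugs"},
      {"kettle", "mugs"},
  ]

  pairs = Counter()
  for basket in baskets:
      for a, b in combinations(sorted(basket), 2):
          pairs[(a, b)] += 1  # count products bought together

  print(pairs.most_common(1))  # the pairing that drives the suggestion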

Turning science on its head

This focus on correlations potentially challenges the whole scientific method. In the past if you wanted to know something, you developed a hypothesis and ran experiments to establish whether the hypothesis was correct or not. The experiments that took place would vary depending on what you were trying to find out but, regardless of whether you were seeking to understand consumer behaviour or the efficacy of a new drug, the experiment would always take a sample of data, people, ingredients or components in order to test the hypothesis. The sample was always therefore limited in size and the results were then extrapolated out to make assumptions or best- and worst-case scenario predictions for everything from the spread of disease to the accumulation of credit card debt.

This approach has worked well and is credited with many breakthroughs in just about every area of human endeavour. But big data could change all that. If you test a hypothesis on consumer behaviour, for example, you rely on the assumption that your sample is representative of all consumers. It isn’t – but it was the best option available, given the lack of data. The advent of big data – and specifically the technology to store, collate and analyse that data – means that lack of data is no longer the problem. Theoretically, at least, you can now use a sample where N really does equal all: the ‘sample’ is the entire population.

warning The danger, of course, is that N doesn’t truly equal all any more than the sample did. For example, if an insurance company is able to use all claim information and finds a correlation between fraud and the amount of time taken to complete an online claim form, it doesn’t need to know why time is an indication of fraud; it just needs to know that it is, so it can initiate an investigation of all claims over a certain time period. But what about the person who is not very computer literate, or the person who was interrupted by her toddler waking up crying? The correlation isn’t representative of all people all of the time. Similarly, with the Target pregnancy indicator example from the sidebar ‘Big Brother gains a sibling’, what if the customer was buying products for a friend or for a non-pregnancy-related use? (A lot of people, for instance, use nappies, or diapers, in plant pots to help provide the plants with extra nutrition.)

There’s a real danger that you could be pigeonholed by all sorts of organisations and businesses based on probability, not reality. What happens when someone is refused a mortgage because some algorithm identifies that person as a high risk, even though she’s never actually defaulted on a mortgage before? What happens if your insurance premium is increased based on your probability of claiming in the future, even though that future hasn’t arrived yet?

When correlations go crazy

If you ever took a science class in school, you might have heard the phrase, ‘Correlation does not equal causation’. For example, if data tells you that men over six feet tall spend more money online, that doesn’t mean their height caused them to spend more money.

And that can be the problem when data analysts are looking at the strange and interesting new truths that emerge from the mass quantities of data to which they now have access. If, say, your data tells you that orange used cars are more reliable, the question then becomes why. Are owners of orange cars more careful? Does the colour prevent people from getting in accidents? Or does the colour orange have some other magical property that keeps a car running well? The data has no answers; it just tells you what is.

Tyler Vigen posts funny charts to his website, Spurious Correlations (www.tylervigen.com), that show the danger of simply matching two data sets without any deeper understanding of how the things are related. For example, if correlation is all you need to go by, then you can assume that the more films Nicolas Cage appears in during any given year, the more people drown in swimming pools. Or you can assume that an increase in US spending on science will result in an increase of suicides by hanging. Spurious indeed, you hope, or US researchers and Nic Cage’s film career are in trouble.

tip Don’t take absolutely everything that data tells you at face value and never ever change your business strategy based on potentially bogus correlations. If your data throws up some strange and interesting correlations, try validating those findings using other data sets to see if there really is a genuine link.
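
If you want to see how little a high correlation proves on its own, here’s a minimal Python sketch (the figures are invented, in the spirit of Vigen’s charts). Two series that merely drift upwards together over five years correlate almost perfectly, despite having nothing to do with each other:

  from statistics import correlation  # requires Python 3.10 or later

  films_per_year = [2, 3, 4, 5, 6]           # invented figures
  pool_drownings = [98, 102, 109, 115, 120]  # invented figures

  print(f"r = {correlation(films_per_year, pool_drownings):.2f}")
  # r comes out close to 1.0, yet nobody believes one causes the other.
  # Apply the same scepticism to surprising correlations in your own data.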

Why Now? A Cloud Full of Data

You understand what defines big data and what makes it so exciting. But why is it such a big deal at this particular point in time? Like planets aligning, several factors have come together at the right time to make harnessing big data a reality.

I talk more about the key technological advances that make big data possible in Chapter 6, and I go into building a big data infrastructure in Chapter 9. Here, I give a quick overview of the main technological advances: better storage, faster networks and more powerful analytics. All of these advances come together to create big data.

Harnessing more storage than ever before

In the past, if you wanted to house a lot of data, you needed to invest in some pretty heavy-duty kit, like mainframe computers. Now, companies big and small can buy inexpensive, off-the-shelf storage systems that are connected to the Internet.

remember These connected computers and servers give you more storage capacity than ever before. Data can be parcelled up and stored across a range of machines in different locations. Massive increases in storage and computing power, some of it available via cloud computing, make number crunching possible on a very large scale and at declining cost.

In addition, these advances in computer storage and processing power have meant that for the first time you are able to analyse large, complex and messy data sets (which previously would have been far too big to store).

example For example, we’ve been able to record video data for a long time, but a lack of storage capacity, and of any way to really analyse those recordings, limited their utility. In days gone by, the only video data routinely collected was closed-circuit television (CCTV) security footage, used to monitor retail or business premises for shoplifting, malicious damage or employee wrongdoing. Most security systems looped their recordings: they recorded continuously onto video tapes or digital hard drives and, after a set number of days, looped back and re-recorded over the old data. If there was no incident in the area being recorded, the data was useless, so it was erased over and over again.

With the advances in video and image analytics, all that is changing. The same footage is now viewed as useful in ways that weren’t even considered before, such as understanding how customers move around a store and how the placement of products and staff affects their buying decisions. Like all big data and analytic changes, this has come about primarily because of the quantum leap in storage capability. Ten years ago it would have been unheard of to record and store all that CCTV footage – you’d have needed a warehouse just to keep the old tapes, which would degrade if not kept in a temperature-controlled environment! No business in its right mind would swallow the massive costs involved in storing all that video, especially as there was no way to really analyse it at the time.

When you consider that the amount of stored information grows four times faster than the world economy, and the processing power of computers grows nine times faster, it’s easy to see how all that stored data could now become useful.

Fuelling big data with faster networks

Distributed computing gives you greater storage capability than ever before, but it also allows you to connect data faster than ever before. With data being spread across many different locations, the ability to bring that data together in a split second is key.

remember Without faster networks that connect data sets together for analysis, big data just wouldn’t be a practical option. With these faster networks and overarching data software like MapReduce, BigTable and Hadoop, you can break up the analysis of data into manageable chunks, meaning that no one machine has to bear all the load. This makes analysis faster and far more efficient.

There’s more on using systems like this and building your own big data infrastructure in Chapter 9.

Taking advantage of new and better analytical capability

Our improved analytical capability is closely related to better networks and increased storage. Without the ability to store and access all this data, you wouldn’t be able to analyse it and extract useful insights.

Analytical advances can be summarised by three key factors:

  • Thanks to distributed computing you can analyse data much faster than ever before, often in real time as something is happening (think of the credit card transactions).
  • You can also analyse a much wider variety of data: faces, videos, speech and so on. It’s no longer all about rows and columns in a spreadsheet.
  • You have innovative new ways of analysing the data itself. For example, you can analyse tone of voice in conversations between call centre staff and customers.

remember In the past, if you wanted access to data or wanted to be able to gain insights from that data, it needed to be contained in a structured relational database and you needed to use SQL query tools (see Chapter 4) to extract any value. That’s no longer the case – the data can be in just about any form: structured or unstructured, text, audio, video, sensor data, images, messy or neat – and you can still extract value from it.

example In the past, the only way to analyse CCTV video was to physically sit and watch it frame by frame. The latest video analytics tools are changing all that because they now use algorithms that go through video, scene by scene, shot by shot, and actually capture what is in the video. They then index that information and use it to identify patterns or cross-reference it with other analytic tools.

What Next for Big Data?

Big data is an exciting field with the power to completely change the way you do business. And it’s easy to get so caught up in the positives that you overlook the potential negative aspects of big data. Although I’m a strong advocate for using big data, I do believe it’s important to be aware of the ethical issues. This is especially important in business, as reputation is so critical to success. Thanks to social media, scandals can travel around the world in the blink of an eye and reputations that took years to build up can be damaged in just a few seconds.

When you hear about these predictive models and what companies are doing with big data and analytics, there’s a real danger that privacy will give way to probability. And there’s also clearly an issue of transparency – an issue that I believe businesses need to manage very carefully.

There are still significant moral and ethical dilemmas to be ironed out in this area. Big data is a little like the gold rush – a lawless frontier of extraordinary opportunity for those willing to take the early risk. But the law will catch up as more and more people become increasingly uncomfortable about what’s being collected, how it’s used and what’s now possible.

Bracing for the big data backlash

I’ve predicted for a while that there’ll be a backlash to big data. At every seminar or keynote speech I give, people are always shocked by the level of data collection going on that they weren’t aware of. You sign your life and privacy away so easily these days with very little thought. Few people are fully aware of just how much analysis Facebook is doing and how much it can understand about you based on what you like and what you upload.

Like most brilliant innovations, big data can be used for good and bad. The possibilities of face recognition software alone are more than a little frightening, and whilst that software can help to prevent crime and thwart terrorist activities, it can also be used to spy on ordinary people for commercial purposes. And therein lies one of the biggest challenges – most people have absolutely no idea what’s going on in darkened rooms in places that don’t officially exist or in the basements of giant corporations that have access to masses of data and futuristic technology.

You don’t know what data is being compiled about you. Even if companies or applications tell you in their terms and conditions, you, like most people, don’t read them. Or even if you do read them, you may not understand them or understand the implications of what you agree to.

example One of my favourite examples of the lack of understanding in this area comes from a short experiment in 2014 designed to highlight the dangers of public Wi-Fi. Customers in a London café were asked to agree to terms and conditions as they logged on to use the free Wi-Fi. In the terms was a clause which stipulated the user would ‘assign their first-born child’ to the company in return for free Wi-Fi access. Several people willingly agreed!

Lots of small businesses use Google’s email service, Gmail – it’s free, it’s reliable and you get tons of storage. But did you know that, in return for that free service, Google believes you can’t legitimately expect privacy when using it? Basically, Google considers it okay to read and analyse the content of any and all emails sent or received by a Gmail user. This position was put forward in a brief filed in a US federal court as part of a lawsuit against Google. Google is accused of breaking US federal and state laws by scanning the emails of Gmail users, and in its defence put forward this statement (which was recently exposed by Consumer Watchdog):

Just as a sender of a letter to a business colleague cannot be surprised that the recipient’s assistant opens the letter, people who use web-based email today cannot be surprised if their communications are processed by the recipient’s ECS provider in the course of delivery.

So, essentially, if you sign up to use Gmail, you waive all rights to privacy, and Google can use what it discovers using text analytics to better target its advertising. My guess is that probably 95% of the 400+ million users of the Gmail service don’t currently realise this. But it’s not just Google. Facebook is famous – or rather infamous – for constantly tinkering with its privacy policy and privacy settings.

remember Everyone understands that companies need to make money and providing a free email service such as Gmail may be reward enough for some people. Many people may not care about privacy. But if we’re to navigate these murky waters safely, then I believe there needs to be much, much more transparency about what’s being collected and how the data is being used or could potentially be used.

When it comes to data and the law, many legal systems are playing catch-up. In Scandinavian countries, for example, data protection laws are much more stringent than in the UK and US. I predict that new legislation will be implemented in the UK and US to tighten up data protection and individual privacy. I believe (and I hope) companies will need to be much more upfront about what they collect and why, and consumers will have a more obvious opt-in/opt-out choice regarding their data.

Encouraging transparency and ethics

I feel that a lot of data collection practices aren’t very ethical. Facebook, for example, buries a lot of what it does with data in a 50-page user agreement that nobody reads. I think it’s vital that businesses explain to their customers what data they’re collecting and how they intend to use it.

If you weren’t a terribly ethical company and your goal was simply to collect as much data as possible without caring much about what anyone thought, then that approach would probably work fine for you in the short term. You probably could collect a load of data. But that’s not a very sustainable long-term approach to business.

remember The more ethical you are, the more valuable your data is in the longer term. If you aren’t upfront about what information you’re collecting and storing about your customers, then there’s a danger that data could be taken away from you or your reputation could be damaged. If people understand what data they’re giving over to your company and how you’ll be using it, they’re generally happy for it to be used. Nobody likes finding out they’ve been duped!

Say I buy myself a shiny new Apple Watch. I’m happy for Apple to collect certain data on me (such as sleep patterns or how many steps I take a day) because the company is clear about what it’s tracking, and the data it collects helps me lead a healthier lifestyle, so I’m getting something in return. But, hypothetically speaking, imagine if Apple then started selling my data to my health insurance company, which used that information to alter my premiums. I don’t know about you, but I’d be livid. I’d feel that was a huge invasion of my privacy and not at all what I signed up for – and I’m not getting anything in return except for potentially higher premiums.

tip Follow these tips for data transparency and your customers will thank you for it:

  • If you collect personal information on your customers or employees, be upfront about it.
  • Explain why you’re collecting that information (for example, so that you can provide a better service).
  • Don’t hide this information in lengthy user agreements or terms and conditions that no one will read. Keep it short and easy to understand, and put the information in an obvious place. A few sentences shown when customers register their details to shop online is a good example.
  • Offer customers something in return for parting with their precious data (such as a discount for customers who take part in a survey, or making the online ordering process much easier in future once you have their data on record).
  • Always give customers the option to opt out. Even if this means they can no longer use your service, or parts of your service, it’s far better to give them the choice.
  • Use aggregated data – data that is not tied to any specific individual – wherever possible. Facebook, for example, provides information to interested third parties on trends and hot topics that isn’t tied to individual users. Depending on what you’re using data for, you may not need data on individuals, just a bigger picture of what a group of people are up to. (There’s a short sketch of what aggregation looks like after this list.)
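
Here’s a minimal Python sketch of what aggregation means in practice (the names and purchases are invented): individual records go in, but only group-level totals come out, so nothing in the result points back to a specific person:

  from collections import Counter

  purchases = [
      ("Alice", "London"), ("Bob", "Leeds"),
      ("Carol", "London"), ("Dave", "London"),
  ]

  # Count purchases by city and throw the names away.
  by_city = Counter(city for _name, city in purchases)
  print(by_city)  # Counter({'London': 3, 'Leeds': 1})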

Making sure you add value

When you’re collecting data on people, it’s not just important to be honest about it; it’s also a good idea to add value for them – something that makes it worth their while.

For example, I have one of the latest smart televisions from Samsung, which allows me to program the television so that, using the inbuilt camera, it detects the faces of my children and limits what they can watch. I don’t mind Samsung knowing what I watch, when I watch and how long I watch my smart television, because Samsung is helping me and my wife to protect our children from stuff they shouldn’t see. Samsung did, however, get into trouble when it came out that it was actually counting the number of people watching television. I think this sort of problem could be avoided with greater transparency and by delivering increased value to the user.

remember Make it beneficial for people to share their information with your business, either through better or cheaper products or services. Always seek to add value so that the people providing the data, be they customers, employees or other stakeholders, feel it’s a fair and worthwhile exchange. Aim for a win-win for all parties.

If you provide value, most people will be happy for you to use their data – especially if you’re able to remove personal markers that link them as an individual to the information. If you can demonstrate that you’re using the data ethically, people will respond positively. Ultimately, this makes the data more valuable to you in the long term – it’s no good using data to understand more about your customers if they leave in droves because they feel you’ve invaded their privacy.
