Chapter 2
In This Chapter
Exploring the four Vs of big data
Delving into the (justifiable) hype
Looking at today and into the future
Having picked up this book, you’ve likely already come across articles or blogs championing big data. You’ve heard some of the buzz and want to know why big data is such a big deal right now.
Data in itself isn’t a new business phenomenon. But big data is. And not all data is big data, even if there’s plenty of it. Don’t be alarmed – it gets easier!
The truth is, companies have had a lot of data for a long time – consider big mainframe computers and early data centres. Data in itself isn’t a new invention. Until recently, this data was limited to what’s called structured data (see Chapter 4), meaning it was typically in spreadsheets or databases. However, even though there was lots of it, this data wouldn’t count as big data because big data is defined by more than just how, well, big it is. As I show in this chapter, there are other factors that define big data.
Big data is not a passing fad; it’s here to stay. And it’s going to change the world completely. But to really understand big data, and what separates it from normal data, you need to understand four main factors: volume, velocity, variety and veracity, commonly known as the four Vs of big data. By exploring each of the Vs, you can get a feel for how big data can revolutionise the way you do business.
I look at each of the Vs in turn in the next sections.
Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips and sensor data you produce and share every second (and those are just for starters). On Facebook alone, ten billion messages are sent, the Like button is clicked 4.5 billion times and 350 million new pictures are uploaded each and every day. You’re no longer talking about humble gigabytes of data, but petabytes and even zettabytes or brontobytes of data. To put this in perspective: all the data generated in the world from the beginning of time up to the year 2000 is now being generated every minute!
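To get a feel for these scales, here’s a quick back-of-the-envelope sketch. The 350 million daily photo uploads is the Facebook figure above; the average photo size is purely an illustrative assumption.

```python
# Back-of-the-envelope big-data arithmetic.
# 350 million daily photo uploads is the Facebook figure from the text;
# the ~2 MB average photo size is purely an illustrative assumption.

PETABYTE = 10**15  # 1,000 terabytes

photos_per_day = 350_000_000        # from the Facebook example
assumed_photo_bytes = 2 * 10**6     # assumption: roughly 2 MB per photo

daily_bytes = photos_per_day * assumed_photo_bytes
print(f"Photos alone: ~{daily_bytes / PETABYTE:.2f} petabytes per day")
```

Even under that conservative assumption, a single data source on a single platform produces close to a petabyte a day – and photos are only one of the many streams mentioned above.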
Theoretically, using distributed systems you can cope with any amount of data – which is just as well, because data volumes are growing at a mind-boggling rate.
This means you talk about big data when the volumes are so big that the data no longer fits into traditional storage systems, such as a single database or data warehouse. For example, if you have an enormous customer database with two million rows of information, it may be a large amount of data, but it’s not strictly big data.
Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds, the speed at which credit card transactions are checked for fraudulent activities or the milliseconds it takes trading systems to analyse social media networks to pick up signals that trigger decisions to buy or sell shares. Some data is generated at such a speed that it’s not even worth storing – it’s out of date seconds later.
For instance, in the example of checking credit card transactions for fraudulent activities (such as a transaction in Rome one minute and a transaction in India ten minutes later), the analysis is done in real time, as the transaction is taking place. There’s absolutely no point storing this data and accessing it later. The credit card company needs to analyse it right then and there, on the fly, and has no use for it later.
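The kind of on-the-fly check described here can be sketched as a simple “impossible travel” rule. This is a toy illustration – the speed threshold and distances are made-up assumptions, and real card networks use far more sophisticated models.

```python
# Toy "impossible travel" fraud rule: flag a transaction if the card
# would have had to move faster than a passenger jet since its last use.
# The 900 km/h threshold and the distances are illustrative assumptions.

MAX_PLAUSIBLE_SPEED_KMH = 900  # assumption: roughly a jet's cruising speed

def looks_fraudulent(minutes_since_last: float, distance_km: float) -> bool:
    """Decide on the fly, while the transaction is happening."""
    if minutes_since_last <= 0:
        return distance_km > 0  # the card can't be in two places at once
    implied_speed_kmh = distance_km / (minutes_since_last / 60)
    return implied_speed_kmh > MAX_PLAUSIBLE_SPEED_KMH

# Rome one minute, India ten minutes later (~6,000 km, a rough figure):
print(looks_fraudulent(10, 6000))   # True: flag it now, no need to store it
print(looks_fraudulent(600, 50))    # False: easily driveable in ten hours
```

The decision is made from the current transaction and one remembered fact (time and place of the last one) – nothing needs to be archived for later analysis, which is exactly the velocity point.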
Variety refers to the different types of data you can now use. In the past, the focus was on structured data that neatly fits into tables or databases, such as financial data. Today, however, the vast majority of the world’s data is unstructured, and therefore can’t easily be put into tables (think of photos, voice recordings and social media updates). I talk more about the different types of data in Chapters 4 and 5.
For me, variety is the most interesting and exciting aspect of big data. While businesses have been capturing and analysing data for years, the ability to harness a wide range of data is completely transforming your ability to understand the world and everything within it – including, most importantly for business, your customers.
Variety, velocity and volume are very closely linked. Because we’re now able to extract information from messy (unstructured) data like Facebook posts and video images, the volume and velocity of data have increased accordingly. The amount of data out there and the rate at which we’re generating new data is frightening – and I’m a data expert! The digital universe is doubling in size every two years. At that rate, by 2020 there will be nearly as many bits of information in the digital universe as there are stars in the physical universe.
The first three Vs – volume, velocity and variety – formed the original definition of big data by IT (information technology) analytics firm Gartner. Experts have attempted to add on a number of other Vs over time (some useful, some less so), but those three remain the core. However, veracity is a useful and valid fourth V, and it’s now widely accepted as a key feature of big data.
Veracity refers to the messiness or trustworthiness of the data. You used to only be able to analyse neat and orderly structured data, data that you trusted as accurate. But now you can cope with unruly and unreliable data.
I’d argue there’s another important V to take into account when looking at big data – value. Having access to vast amounts of data and many different varieties of data is wonderful, but unless you can turn it into value, it’s useless. Data has to earn its keep and provide positive outcomes for your business, such as understanding your customers better or making your production line more efficient. So, while people get giddy with excitement over the volume, velocity and variety of big data, when it comes to business, it’s actually value that’s the most important V.
Big data seems to have really captured people’s imaginations, and I see more and more articles or blogs on the topic each week. Far from a flash-in-the-pan thing, it’s becoming a mainstream part of how businesses operate and make decisions.
Why is big data such a big deal? Well, partly it comes down to the wide applications of big data, which go far beyond business. Big data has a huge role to play in activities as diverse as healthcare, disaster relief efforts and making cities safer, better places to live. Sure, it helps companies sell a lot more stuff, but it’s also capable of much, much more.
It’s no exaggeration to say that with big data you can predict the future. It’s kind of like a crystal ball – but made of wires and networks and software.
For businesses, the real power of data is that it helps you make smarter decisions that change your business for the better. Want to understand more about your customers so you can run more targeted promotions? Data can help. Want to understand how customers navigate your website so you can make improvements to the site? Data can tell you that. Want to understand what makes your employees tick so you can increase employee satisfaction and reduce staff turnover? You guessed it, data can help.
In Chapter 11, I set out a step-by-step process for using data to improve decision making in your business. It’s an approach I use all the time with clients. But there’s also a bigger, cultural shift required to transform decision making across your company, and I explore that in Chapter 13.
One of the big discussions in big data is that it allows you to challenge traditional models of causality, or cause and effect. Rather than asking why two things are related, big data analysis often settles for spotting that they are related – working from correlations rather than causes.
This focus on correlations potentially challenges the whole scientific method. In the past if you wanted to know something, you developed a hypothesis and ran experiments to establish whether the hypothesis was correct or not. The experiments that took place would vary depending on what you were trying to find out but, regardless of whether you were seeking to understand consumer behaviour or the efficacy of a new drug, the experiment would always take a sample of data, people, ingredients or components in order to test the hypothesis. The sample was always therefore limited in size and the results were then extrapolated out to make assumptions or best- and worst-case scenario predictions for everything from the spread of disease to the accumulation of credit card debt.
This approach has worked well and is credited with many breakthroughs in just about every area of human endeavour. But big data could change all that. If you test a hypothesis on consumer behaviour, for example, you assume that the sample of people you test it on is representative of all consumers. It’s not. But it was the best option considering the lack of data. The advent of big data – and specifically the technology to store, collate and analyse that data – means lack of data is no longer the problem. Theoretically, at least, you can use a sample where N really does equal all: the whole population, not a slice of it.
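The difference between sampling and N = all can be seen in a small simulation. The “population” below is entirely fabricated; the point is simply that a full data set lets you measure the answer directly, while a sample only lets you estimate it, with error.

```python
# Toy sampling-versus-N=all demo on a fabricated 'population' of
# customer spend figures. With the full data you measure; with a
# sample you estimate and hope the error is small.
import random

random.seed(42)
population = [random.gauss(500, 150) for _ in range(100_000)]  # made up

true_mean = sum(population) / len(population)   # N = all: just measure it
sample = random.sample(population, 200)         # the traditional approach
sample_mean = sum(sample) / len(sample)         # an estimate, with error

print(f"full-data mean:  {true_mean:.2f}")
print(f"sample estimate: {sample_mean:.2f}")
```

Run this a few times with different seeds and the sample estimate wobbles around the true figure – which is exactly the extrapolation risk the sampling approach has always carried.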
There’s a real danger that you could be pigeonholed by all sorts of organisations and businesses based on probability, not reality. What happens when someone is refused a mortgage because some algorithm identifies that person as a high risk even though she’s never actually defaulted on a mortgage before? What happens if your insurance premium is increased based on the probability that you’ll claim in the future, even though that future hasn’t arrived yet?
If you ever took a science class in school, you might have heard the phrase, ‘Correlation does not equal causation’. For example, if data tells you that men over six feet tall spend more money online, that doesn’t mean their height caused them to spend more money.
And that can be the problem when data analysts are looking at these strange and interesting new truths that emerge from the mass quantities of data to which they now have access. If you take it as true that orange used cars are more reliable, the question then becomes why. Are owners of orange cars more careful? Does the colour prevent people from getting in accidents? Or does the colour orange have some other magical property that keeps a car running well? The data has no answers, it just tells you what is.
Tyler Vigen posts funny charts to his website, Spurious Correlations (www.tylervigen.com), that show the danger of simply matching two data sets without any deeper understanding of how the things are related. For example, if correlation is all you need to go by, then you can assume that the more films Nicolas Cage appears in during any given year, the more people drown in swimming pools. Or you can assume that an increase in US spending on science will result in an increase of suicides by hanging. Spurious indeed, you hope, or US researchers and Nic Cage’s film career are in trouble.
You understand what defines big data and what makes it so exciting. But why is it such a big deal at this particular point in time? Like planets aligning, several factors have come together at the right time to make harnessing big data a reality.
I talk more about the key technological advances that make big data possible in Chapter 6, and I go into building a big data infrastructure in Chapter 9. Here, I give a quick overview of the main technological advances: better storage, faster networks and more powerful analytics. All of these advances come together to create big data.
In the past, if you wanted to house a lot of data, you needed to invest in some pretty heavy-duty kit, like mainframe computers. Now, companies big and small can buy inexpensive, off-the-shelf storage systems that are connected to the Internet.
In addition, these advances in computer storage and processing power have meant that for the first time you are able to analyse large, complex and messy data sets (which previously would have been far too big to store).
When you consider that the amount of stored information grows four times faster than the world economy, and the processing power of computers grows nine times faster, it’s easy to see how all that stored data could now become useful.
Distributed computing gives you greater storage capability than ever before, but it also allows you to access and combine data faster than ever before. With data being spread across many different locations, the ability to bring that data together in a split second is key.
There’s more on using systems like this and building your own big data infrastructure in Chapter 9.
Our improved analytical capability is closely related to better networks and increased storage. Without the ability to store and access all this data, you wouldn’t be able to analyse it and extract useful insights.
Analytical advances can be summarised by three key factors.
Big data is an exciting field with the power to completely change the way you do business. And it’s easy to get so caught up in the positives that you overlook the potential negative aspects of big data. Although I’m a strong advocate for using big data, I do believe it’s important to be aware of the ethical issues. This is especially important in business, as reputation is so critical to success. Thanks to social media, scandals can travel around the world in the blink of an eye and reputations that took years to build up can be damaged in just a few seconds.
When you hear about these predictive models and what companies are doing with big data and analytics, there’s a real danger that privacy will give way to probability. And there’s also clearly an issue of transparency – an issue that I believe businesses need to manage very carefully.
There are still significant moral and ethical dilemmas to be ironed out in this area. Big data is a little like the gold rush – a lawless frontier of extraordinary opportunity for those willing to take the early risk. But the law will catch up as more and more people become increasingly uncomfortable about what’s being collected, how it’s used and what’s now possible.
I’ve predicted for a while that there’ll be a backlash to big data. At every seminar or keynote speech I give, people are always shocked by the level of data collection going on that they weren’t aware of. You sign your life and privacy away so easily these days with very little thought. Few people are fully aware of just how much analysis Facebook is doing and how much it can understand about you based on what you like and what you upload.
Like most brilliant innovations, big data can be used for good and bad. The possibilities of face recognition software alone are more than a little frightening, and whilst that software can help to prevent crime and thwart terrorist activities, it can also be used to spy on ordinary people for commercial purposes. And therein lies one of the biggest challenges – most people have absolutely no idea what’s going on in darkened rooms in places that don’t officially exist or in the basements of giant corporations that have access to masses of data and futuristic technology.
You don’t know what data is being compiled about you. Even if companies or applications tell you in their terms and conditions, you, like most people, don’t read them. Or even if you do read them, you may not understand them or understand the implications of what you agree to.
Lots of small businesses use Google’s email service, Gmail – it’s free, it’s reliable and you get tons of storage. But did you know that, as it’s offering you that free service, Google feels you can’t legitimately expect privacy when using it? Basically, Google believes it is okay to read and analyse the content of any and all emails sent or received by a Gmail user. This revelation was put forward in a brief filed in a US federal court as part of a lawsuit against Google. Google is accused of breaking US federal and state laws by scanning the emails of Gmail users, and in its defence put forward this statement (which was recently exposed by Consumer Watchdog):
Just as a sender of a letter to a business colleague cannot be surprised that the recipient’s assistant opens the letter, people who use web-based email today cannot be surprised if their communications are processed by the recipient’s ECS provider in the course of delivery.
So, essentially, if you sign up to use Gmail, you waive all rights to privacy, and Google can use what it discovers using text analytics to better target its advertising. My guess is that probably 95% of the 400+ million users of the Gmail service don’t currently realise this. But it’s not just Google. Facebook is famous – or rather infamous – for constantly tinkering with privacy policy and privacy settings.
When it comes to data and the law, many legal systems are playing catch-up. In Scandinavian countries, for example, data protection laws are much more stringent than in the UK and US. I predict that new legislation will be implemented in the UK and US to tighten up data protection and individual privacy. I believe (and I hope) companies will need to be much more upfront about what they collect and why, and consumers will have a more obvious opt-in/opt-out choice regarding their data.
I feel that a lot of data collection practices aren’t very ethical. Facebook, for example, buries a lot of what it does with data in a 50-page user agreement that nobody reads. I think it’s vital that businesses explain to their customers what data they’re collecting and how they intend to use it.
If you weren’t a terribly ethical company and your goal was simply to collect as much data as possible without caring much about what anyone thought, then that approach would probably work fine for you in the short term. You probably could collect a load of data. But that’s not a very sustainable long-term approach to business.
Say I buy myself a shiny new Apple watch. I’m happy for Apple to collect certain data on me (such as sleep patterns or how many steps I take a day) because the company is clear about what it’s tracking and the data it collects helps me lead a healthier lifestyle, so I’m getting something in return. But, hypothetically speaking, imagine if Apple then started selling my data to my health insurance company, who used that information to alter my premiums. I don’t know about you, but I’d be livid. I’d feel that was a huge invasion of my privacy and not at all what I signed up for – and I’m not getting anything in return except for potentially higher premiums.
When you’re collecting data on people, it’s not just important to be honest about it, it’s a good idea to add value for them – something that makes it worth their while.
For example, I have one of the latest smart televisions from Samsung that allows me to program the television and, using the inbuilt camera, it detects the faces of my children and limits what they can watch. I don’t mind Samsung knowing what I watch, when I watch and how long I watch my smart television because Samsung is helping me and my wife to protect our children from stuff they shouldn’t see. Samsung did, however, get into trouble when it came out that it was actually counting the number of people watching television. I think this sort of problem could be avoided with greater transparency and by delivering increased value to the user.
If you provide value, most people will be happy for you to use their data – especially if you’re able to remove personal markers that link them as an individual to the information. If you can demonstrate that you’re using the data ethically, people will respond positively. Ultimately, this makes the data more valuable to you in the long term – it’s no good using data to understand more about your customers if they leave in droves because they feel you’ve invaded their privacy.