18 - Quality and Privacy

Why the biggest hurdles to Data Driven Strategy are self-evident but not straightforward

Selling data to third parties sounds easier than it is. Companies attempting to merchandise their data run into several problems. Each of these problems may seem trivial and easy to solve, but combined they are the reason why a structured and continuous process for data management is required before any data can be monetized. Although the specific practical issues will differ per organization, most of the problems filter down to two basic problem areas: data quality and privacy. Experts in Enterprise Data Management will likely argue that there are far more issues surrounding data management such as data modeling, normalization, governance and stewardship. In turn, privacy advocates will claim that privacy is not a hurdle to Data Driven Strategy but rather a ground rule. I agree with both; however, it is not my aim to go into detail on all aspects of data management and privacy. This book looks at building effective business models for data monetization. Data management and privacy are issues too extensive to be covered as a side story to monetization within the context of this book. The following chapter is intended to touch on these pressing issues regarding data when a Data Driven Strategy starts to gain traction in an organization. I recommend that at such a time you dive deeper into these subjects or seek external expertise.

Data Quality

As I have shown before on page 79, one of the most obvious problems for data providers is data quality. Data from operational systems simply isn’t generated with the aim of being sold or used by third parties. Data Driven Strategy creates a pressing need for high quality data. Data quality management is a profession in itself, and many business cases and solutions exist for both the technical and the cultural issues related to the quality of data. IBM has defined a set of criteria that has become the standard by which data quality is measured.

Timeliness: The data represents information that is not out-of-date. For example, no customer contracts have expiry dates that have passed.

Completeness: All data is present. For example, the zip or postal code should always be populated in an address table.

Validity: The data is available in an acceptable format. For example, employee numbers have six alphanumeric characters.

Accuracy: The data is accurate. For example, employee job codes are accurate to ensure that an employee does not receive the wrong type of training.

Consistency: The data attribute is consistent with a business rule. For instance, all birthdates should be after 1/1/1900.

Uniqueness: There are no duplicate values in a data field.
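
How criteria like these translate into practice can be sketched with a few automated checks. The snippet below is a minimal illustration in Python, not part of any specific data quality tool; the record layout and field names are invented and simply mix the examples from the criteria above (contract expiry dates, zip codes, six-character identifiers, birthdates) into one small table.

```python
from datetime import date

# Invented records for illustration; real operational data would be far larger.
records = [
    {"id": "EMP001", "zip": "1017AB", "birthdate": date(1985, 3, 2),
     "contract_expiry": date(2026, 1, 1)},
    {"id": "EMP002", "zip": None, "birthdate": date(1890, 5, 14),
     "contract_expiry": date(2019, 6, 30)},
]

def check_quality(rows):
    """Apply simple data quality rules and report violations per criterion."""
    issues = []
    seen_ids = set()
    for row in rows:
        # Timeliness: contract expiry dates should not lie in the past.
        if row["contract_expiry"] < date.today():
            issues.append((row["id"], "timeliness", "expired contract"))
        # Completeness: the zip or postal code must always be populated.
        if not row["zip"]:
            issues.append((row["id"], "completeness", "missing zip code"))
        # Validity: identifiers must be six alphanumeric characters.
        if not (len(row["id"]) == 6 and row["id"].isalnum()):
            issues.append((row["id"], "validity", "malformed id"))
        # Consistency: birthdates should fall after 1/1/1900.
        if row["birthdate"] < date(1900, 1, 1):
            issues.append((row["id"], "consistency", "implausible birthdate"))
        # Uniqueness: no duplicate identifiers.
        if row["id"] in seen_ids:
            issues.append((row["id"], "uniqueness", "duplicate id"))
        seen_ids.add(row["id"])
    return issues

for record_id, criterion, detail in check_quality(records):
    print(f"{record_id}: {criterion} violation ({detail})")
```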

Data Reliability

The recent popularity of Big Data and the availability of large quantities of data from external sources have made it necessary to include an additional criterion: reliability of data.

Increasingly, organizations use data from unstructured sources such as online forums and social media. Data obtained from these sources is ungoverned and can contain unsubstantiated ‘facts’ that may turn out to be untrue. One example of how unstructured data analysis can lead to unreliable data comes from an analysis of tweets and Facebook postings. The recent Big Data trend frequently homes in on the potential of analyzing consumer sentiment through social media: by analyzing what people say about a given topic, the general sentiment on that topic can be determined. Although it is not my intention to dismiss this potential altogether, social media analytics is anything but an advertisement for data reliability. Apart from the fact that tweets and Facebook posts do not in any way represent a solid statistical cross section of any customer base, some of the sentiments found in them are interpreted as quite the opposite of what they express.

Sarcasm proves especially difficult for automated systems to analyze. In my consulting practice, I came across one tweet in particular that was a case in point. The tweet was about mobile operator T-Mobile, and it was interpreted by the analytical software as ‘positive towards the brand’. The tweet read ‘Another day without connectivity...Well done T-Mobile!’ The example shows how basic interpretations, even when made by sophisticated analysis tools, may lead to false conclusions. Data reliability should not be taken lightly; it must be governed at all times when data becomes your core product. Keep in mind that data reliability is not just a requirement for data you use and procure in your own products. Your customers may require you to prove that your data is reliable for them to use as well.
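
To make the sarcasm pitfall concrete, here is a deliberately naive sketch of keyword-based sentiment scoring. The word lists and scoring rule are invented for illustration and are far simpler than commercial analytics software, but they show how a single positive phrase can flip the classification of a sarcastic complaint.

```python
# Deliberately naive keyword-based sentiment scoring; word lists are illustrative.
POSITIVE = {"well done", "great", "love", "excellent"}
NEGATIVE = {"terrible", "awful", "hate", "broken"}

def naive_sentiment(text: str) -> str:
    """Score a text by counting positive and negative phrases it contains."""
    lowered = text.lower()
    score = sum(phrase in lowered for phrase in POSITIVE) \
          - sum(phrase in lowered for phrase in NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweet = "Another day without connectivity... Well done T-Mobile!"
# The scorer only sees the phrase "well done" and misses the sarcasm,
# so the complaint is classified as praise for the brand.
print(naive_sentiment(tweet))  # -> "positive"
```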

Privacy

The bulk of the data generated by the core processes of organizations is customer centric. Especially when the customers are private individuals, the data will reveal a substantial amount of personal detail about them. Telcos, for instance, register who called whom. Utilities register who used how much power and when. Navigation devices register who drives where, and online stores register who bought what. Customers are rather sensitive to commercial organizations collecting data about them. Especially when they are unable to determine what data is collected about them, or how, privacy quickly becomes a big issue. Even when they do know what personal data is collected by a company, consumers are wary about the level of their privacy.

When companies start selling customer-related data to each other and it becomes obscure who knows what, the customer can interpret this situation as rather Orwellian. After all, what is to stop your insurance company from increasing your premiums if it discovers that, according to your supermarket’s loyalty card data, you regularly purchase unhealthy food? And why shouldn’t the police automatically fine you when your satellite navigation data registers that you have been speeding? Not to mention the risk of systems being hacked and data stolen or hijacked by criminals. Privacy is an issue not to be taken lightly, first and foremost because it causes severe problems for those whose privacy is compromised; these people may take legal action, but usually no lawsuit can undo the damage. From an organization’s perspective, privacy concerns are not just related to legal issues when harm has been done. There are some very pressing commercial reasons to assure privacy to customers and the public.

Customers Want to Feel ‘In Control’

When European payment provider Equens announced in the early summer of 2013 that it had started selling analyzed payment data to retailers and other third parties, it met with steep opposition from the public. Equens is Europe’s largest payment provider, processing well over 15 billion financial transactions from retail and ATMs annually. There were some obvious customers for the data that Equens generates by processing these payments. One example of an analysis provided by Equens was a ‘share-of-wallet’ analysis in which it compared consumers’ spending at one retail brand with that at other brands. But even though Equens never sold any personal data, the public did not feel in control and put up a fierce protest.

Since they had no direct relationship with Equens as an organization, people simply did not understand what the company was up to and therefore did not trust it with what they considered to be their data. Although Equens went out of its way to explain that it never compromised privacy, it was forced to abstain from selling the data products: a costly exercise that shows privacy is as much about the facts as it is about the feeling that personal data is safe.

Less than a year later, ING Bank announced plans to display targeted advertisements to customers based on analysis of their financial transactions. ING Bank is a subsidiary of ING Group, headquartered in Amsterdam, and one of the larger financial institutions in the world. Although this may sound tricky at first, ING stayed well within Dutch privacy laws. ING planned to analyze financial transaction data so that advertisers could target specific clients with advertisements on the ING websites. For instance, a DIY retailer could be offered the option to advertise to ING customers who had spent at least €200 in DIY shops in the last six months. No personal or transaction data would ever be sold to advertisers. Advertisers would not even know when or to whom an ad was shown, and the ads would only be shown to ING customers who had explicitly agreed to take part in the program. ING Bank had taken all of the lessons from the Equens affair into account when it developed the program, yet it met with even stiffer opposition than Equens had. Within days of its announcement, ING was facing opposition from consumer organizations, the Dutch National Bank and even the national parliament. Even though ING never came close to breaking privacy laws and had in addition decided to use an ‘opt-in’ system, it still touched upon a very sensitive area of privacy and left customers feeling they were not in control.

Adara, Inc. (see page 118) realized the importance of privacy for consumers from the outset and took the opposite approach. It faces privacy concerns head-on and starts by explaining why its service is safe for all. Adara handles booking data from a large number of airlines and other partners, and it knows it could be out of business if there were so much as a shadow of a doubt about consumer privacy. Adara publishes its privacy policy as one of the most prominent features on its website. At the company, privacy is not about secrecy; it is about being very clear and open about what happens to data and who gets to use what. No personal customer data is used or sold. Guaranteed. Just like at Adara, in any Data Driven Strategy privacy is not a requirement: it is a hygiene factor.

Start with Privacy and Work Down from There

Most managers I’ve worked with found it difficult to understand that for most data products or services, it isn’t even necessary to use or sell data on identifiable individuals. Adara doesn’t need a person’s name and address to place its hyper-targeted advertisements on various websites. It uses tracking cookies that are placed on customers’ computers. The personal data itself is never registered in any of Adara’s systems. The company simply doesn’t need it, and to make sure no unintentional harm can be done, it has decided not to use any sensitive data.

In a complex landscape of systems and intercompany networks, it is easy to make mistakes or overlook the privacy consequences of a single decision. In 2006, Netflix launched the Netflix Prize, a million dollar contest in which teams of researchers were challenged to improve its movie recommendation algorithm by 10%. Each team was given a dataset containing 100 million recommendations from close to half a million Netflix subscribers. The dataset was completely anonymized: no personal details were given, but each subscriber was identifiable by a unique number, so the researchers could see whether one subscriber had rated multiple movies. Within two weeks of the start of the challenge, a team of researchers announced that it had successfully matched the Netflix recommendations to movie recommendations given on the popular movie website IMDB.com. Statistically, they could prove that certain recommendations given by an anonymous Netflix user were given by the same person who rated the movies on IMDB.com.

Since people on IMDB are not anonymous, the researchers could now identify individual Netflix users and their recommendations, breaking the anonymization of the Netflix dataset. The incident caused quite a stir because the recommendations provided insights into both the political and sexual preferences of Netflix customers. One customer actually sued Netflix after the prize for violation of her privacy, reasoning that, as a lesbian mother, ‘were her sexual orientation public knowledge, it would negatively affect her ability to pursue her livelihood […]’.42 Following the lawsuit, Netflix withdrew its plans for a follow-up contest in which it had already announced it would include customers’ ZIP codes, age and gender.
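
The linkage idea can be illustrated with a toy sketch: compare an ‘anonymous’ rating history against public profiles and pick the one with the largest overlap. The data, the overlap threshold and the scoring below are invented and far cruder than the statistical method the researchers actually used, but the principle is the same.

```python
# Toy illustration of record linkage by rating overlap; all data is invented.
anonymous_user = {("W.", 4), ("Brokeback Mountain", 5), ("Heat", 3)}

public_profiles = {
    "imdb_user_A": {("W.", 4), ("Brokeback Mountain", 5), ("Alien", 2)},
    "imdb_user_B": {("Heat", 1), ("Alien", 5)},
}

def best_match(anon, profiles, min_overlap=2):
    """Return the public profile sharing the most (title, rating) pairs."""
    scored = {name: len(anon & ratings) for name, ratings in profiles.items()}
    name, overlap = max(scored.items(), key=lambda item: item[1])
    return name if overlap >= min_overlap else None

# Two shared ratings are enough here to tie the "anonymous" history to a
# named public profile.
print(best_match(anonymous_user, public_profiles))  # -> "imdb_user_A"
```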

Paul Ohm is an associate professor at the University of Colorado Law School. Around the time the Netflix Prize concluded in 2009, he presented a paper at the Privacy Law Scholars Conference in which he stated that the anonymization of data can often easily be undone or reversed.43 So easily, in fact, that he warns that any anonymized dataset that is useful for research, whether by science or business, contains so much explicit data that it can be de-anonymized. To state a simple statistic: 87% of all Americans are uniquely identifiable by the combination of their ZIP code, date of birth and gender. Any dataset that has been stripped of specific names and addresses but still includes these three facts cannot be treated as anonymous.
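
The mechanics behind such a statistic are easy to demonstrate: count how many records share each combination of quasi-identifiers, and see how many combinations point to exactly one person. The records below are invented for illustration.

```python
from collections import Counter

# Invented records; in a real dataset the "name" column would be stripped.
records = [
    {"name": "A", "zip": "80302", "dob": "1975-04-12", "gender": "F"},
    {"name": "B", "zip": "80302", "dob": "1975-04-12", "gender": "F"},
    {"name": "C", "zip": "10021", "dob": "1982-11-03", "gender": "M"},
]

# Count how many people share each (ZIP code, date of birth, gender) combination.
combos = Counter((r["zip"], r["dob"], r["gender"]) for r in records)

# Anyone whose combination is unique is re-identifiable even without a name.
unique = [r["name"] for r in records
          if combos[(r["zip"], r["dob"], r["gender"])] == 1]
print(unique)  # -> ['C']
```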

Each organization should assess its own risks for breach of privacy and its adherence to laws in each of the countries that apply to it. But apart from that, there are guaranteed ways to protect the anonymity of users. For instance, by not selling individual records but selling aggregated data. No person could be individually traced when the dataset only states that 25% of people living on Manhattan’s Park Avenue have rated the movies W. and Brokeback Mountain with three or more stars. Or by performing or initiating the desired action based on the analysis of the data without ever putting the data itself into the hands of the data customer. ING Bank may sell data analysis, but it does not actually provide its data to its customers. It sells the analysis of the data and then places a targeted advertisement based on that analysis on a website owned by ING itself. The advertisers will never know who sees this advertising.
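
One simple way to implement such aggregation is to report only group-level percentages and to suppress any group too small to hide an individual. The sketch below is illustrative: the records and the minimum group size of 10 are assumptions, not a prescribed standard.

```python
from collections import defaultdict

# Invented individual records; these are never handed to the data customer.
ratings = [
    {"area": "Park Avenue", "movie": "Brokeback Mountain", "stars": 4},
    {"area": "Park Avenue", "movie": "Brokeback Mountain", "stars": 2},
]

MIN_GROUP_SIZE = 10  # suppress any group smaller than this

def aggregate(rows, movie):
    """Report the share of three-star-or-higher ratings per area."""
    groups = defaultdict(list)
    for row in rows:
        if row["movie"] == movie:
            groups[row["area"]].append(row["stars"] >= 3)
    report = {}
    for area, flags in groups.items():
        if len(flags) >= MIN_GROUP_SIZE:
            report[area] = round(100 * sum(flags) / len(flags))
        # groups below the threshold are left out of the report entirely
    return report

print(aggregate(ratings, "Brokeback Mountain"))  # -> {} (group too small here)
```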

Companies that successfully monetize data do not treat data quality and privacy as ‘issues’ but as a hygiene factor. It is not a requirement for success but rather a requirement for an operational business. As I’ve shown before, these requirements may get in the way of the traditional organization of activities and even conflict with the interests of managers within the existing business, simply because they impose costs and limitations on activities that do not require quality and privacy in the way that data monetization does. That is why, as the next chapter will show, many data monetization initiatives are developed not inside but outside the organization.
