Chapter 11


Governance and legal compliance

You have three primary concerns for securing and governing your data:

  1. Proper collection and safeguarding of personal data.
  2. Internal governance of your own data.
  3. Complying with local laws and law enforcement in each jurisdiction in which you operate.

This last one can be a huge headache for multinationals, particularly in Europe, where the General Data Protection Regulation, effective May 2018, carries fines for violations of up to 4 per cent of global annual turnover or 20 million euros (whichever is greater). The EU will hold accountable even companies headquartered outside of Europe if they collect or process the personal data of EU residents.

Even where the legal risk is low, you risk reputational damage if society perceives you as handling personal data inappropriately.

Personal data

When we talk about personal data, we often use the term personally identifiable information (PII), which, in broad terms, is data that is unique to an individual. A passport or driver’s licence number is PII, but a person’s age, ethnicity or medical condition is not. There is no universally agreed definition of PII: the IP address of the browser used to visit a website, for example, is considered PII in some legal jurisdictions but not in others.

There is increased awareness that identities can be determined from non-PII data using data science techniques, and hence we speak of ‘quasi-identifiers’, which are not PII but can be made to function like PII. You’ll need to safeguard these as well, as we’ll see in the Netflix example below.

Identify all PII and quasi-identifiers that you process and store. Establish internal policies for monitoring and controlling access to them. Your control over this data will facilitate compliance with current and future government regulations, as well as with the requirements of third-party services that refuse to process PII.
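As a minimal sketch of what such an inventory might look like (the tables, columns and owners here are hypothetical, and in practice this would live in a data catalogue tool rather than in code), you could tag each stored column with its sensitivity and the person who approves access to it:

```python
from dataclasses import dataclass
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = "public"
    QUASI_IDENTIFIER = "quasi-identifier"  # not PII alone, but linkable to identity
    PII = "pii"                            # directly identifies a person

@dataclass
class ColumnRecord:
    table: str
    column: str
    sensitivity: Sensitivity
    owner: str  # who approves access requests

# Hypothetical inventory entries
INVENTORY = [
    ColumnRecord("customers", "passport_number", Sensitivity.PII, "privacy_office"),
    ColumnRecord("customers", "postcode", Sensitivity.QUASI_IDENTIFIER, "crm_team"),
    ColumnRecord("customers", "date_of_birth", Sensitivity.QUASI_IDENTIFIER, "crm_team"),
    ColumnRecord("orders", "order_total", Sensitivity.PUBLIC, "finance"),
]

def columns_needing_controls(inventory):
    """Everything that is PII or could be combined into an identifier."""
    return [c for c in inventory if c.sensitivity is not Sensitivity.PUBLIC]

for col in columns_needing_controls(INVENTORY):
    print(f"{col.table}.{col.column}: {col.sensitivity.value} (owner: {col.owner})")
```

Even a simple register like this gives your privacy officer and IT department a shared starting point for granting access and responding to regulators.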

PII becomes sensitive when it is linked to private information. For example, a database with the names and addresses of town residents is full of PII but is usually public data. A database of medical conditions (not PII) must be protected when the database can be linked to PII. Jurisdictions differ in their laws governing what personal data must be protected (health records, ethnicity, religion, etc.). These laws are often rooted in historic events within each region.

There are two focus areas for proper use of sensitive personal data: data privacy and data protection.

  • Data privacy relates to what data you may collect, store and use, such as whether it is appropriate to place hidden video cameras in public areas or to use web cookies to track online browsing without user consent.
  • Data protection relates to the safeguarding and redistribution of data you have legally collected and stored. It addresses questions such as whether you can store private data of European residents in data centres outside of Europe.

Privacy laws

If you’re in a large organization, you will have an internal privacy officer, who should be on a first-name basis with your data and analytics leader. If you don’t have a privacy officer, find resources that can advise you on the privacy and data protection laws of the jurisdictions in which you have customer bases or data centres.

Each country determines its own privacy and data protection laws, with Europe having some of the most stringent. The EU’s Data Protection Directive of 1995 laid out recommendations for privacy and data protection within the EU, but, before the activation of the EU-wide General Data Protection Regulation (GDPR) in May 2018, each country was left to determine and enforce its own laws. If you have EU customers, you’ll need to become familiar with the requirements of the GDPR. Figure 11.1, which shows the rapid rise in number of Google searches for the term ‘GDPR’ since January 2017, demonstrates that you won’t be alone in this.

Figure 11.1 Increase in worldwide online searches for ‘GDPR’.

Source: Google Trends: July 2016–June 2017.

The extent to which privacy laws differ by country has proven challenging for multinational organizations, particularly for data-driven organizations that rely on vast stores of personal data to better understand and interact with customers. In recent years within Europe, certain data that could be collected in one country could not be collected in a neighbouring country, and personal data that could be collected within Europe could not be sent outside of Europe unless the recipient country provided data protection meeting European standards.

Keep in mind

Privacy and data protection laws vary by legal jurisdiction, and you may be subject to local laws even if you don’t have a physical presence there.

The European Union’s Safe Harbour Decision in 2000 allowed US companies complying with certain data governance standards to transfer data from the EU to the US. The ability of US companies to safeguard personal data came into question following the Edward Snowden affair, so that, on 6 October 2015, the European Court of Justice invalidated the EC’s Safe Harbour Decision, noting that ‘legislation permitting the public authorities to have access on a generalized basis to the content of electronic communications must be regarded as compromising the essence of the fundamental right to respect for private life.’85 A replacement for Safe Harbour, the EU–US Privacy Shield, was approved by the European Commission nine months later (July 2016).

The United States tends to have laxer privacy laws than Europe, with some exceptions. There is an interesting example dating back to the late 1980s, when federal circuit judge Robert Bork was nominated for the US Supreme Court. Bork, a strict constitutionalist, had previously argued that Americans only have such privacy rights as are afforded them by direct legislation. This strong statement prompted a reporter to walk into a Washington D.C. video rental store and ask the manager on duty for a look at Bork’s video rental history. The reporter walked out of the store with a list of the 146 tapes the judge had checked out over the preceding two years and subsequently published it.86 Amazingly, all of this was legal at the time. As it happened, the list contained nothing scandalous, but the US Congress, which had watched in awe as this saga unfolded, quickly penned and passed the Video Privacy Protection Act of 1988, making video rental history an explicitly protected class of data in the USA.

Organizations can run afoul of the law by improperly handling personal data even when it is not PII and not linkable to PII. In early 2017, US television maker Vizio paid a settlement of $2.2 million for secretly recording and selling the (anonymized) viewing history from its brand of televisions. Not only did this privacy violation cost them financially, it also made international headlines.87

Data science and privacy revelations

To protect yourself from legal and reputational risk, you’ll need more than just an understanding of laws. You’ll need to understand how customers perceive your use of data, and you’ll need to be conscious of how data science techniques can lead to unintended legal violations.

When Target used statistical models to identify and target pregnant shoppers, they were not collecting private data, but they were making private revelations with a high degree of accuracy. They weren’t breaking laws, but they were taking a public relations risk.

Case study – Netflix gets burned despite best intentions

Another American company got itself into trouble by not realizing how data science techniques could de-anonymize legally protected data. In 2006, Netflix (then primarily a DVD rental-by-mail company) was nine years old and had grown to roughly 6 million subscribers. It had developed a recommendation engine to increase engagement and was looking for ways to improve those recommendations. In a stroke of apparent genius, Netflix came up with the Netflix Prize: $1 million to the team that could develop a recommendation algorithm capable of beating Netflix’s own by a margin of at least 10 per cent. To support the effort, Netflix released anonymized rental histories and corresponding ratings for 480,000 viewers. Remember that the Video Privacy Protection Act of 1988 forbade it from releasing rental histories linked to individuals, but these histories had been anonymized.

Things moved quickly following Netflix’s release of data on 2 October 2006. Within six days, a team had already beaten the performance of Netflix’s own recommendation algorithm by a small margin. Within a few weeks, however, a team of researchers from the University of Texas announced a breakthrough of a different kind: they had de-anonymized some of the supposedly anonymous rental histories. The researchers had carried out what is called a linkage attack, linking nameless customer viewing histories to named individuals on online forums by way of reviews common to both Netflix and the forums.
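To make the idea concrete, here is a toy illustration of a linkage attack on made-up data. The real de-anonymization work used far more sophisticated statistical matching and tolerated noisy, partial overlaps, but the principle is the same: find records in the ‘anonymous’ data and the public data that agree on enough details to pin down an identity.

```python
from datetime import date

# Hypothetical 'anonymized' viewing histories: (title, rating, date rated)
anonymous_histories = {
    "user_48291": [("Obscure Documentary", 5, date(2005, 3, 12)),
                   ("Cult Classic",        4, date(2005, 3, 15))],
}

# Hypothetical public forum reviews: (name, title, rating, review date)
public_reviews = [
    ("Jane Doe", "Obscure Documentary", 5, date(2005, 3, 12)),
    ("Jane Doe", "Cult Classic",        4, date(2005, 3, 16)),
]

def link(anon, public, date_tolerance_days=3):
    """Return (anon_id, real_name) pairs where ratings and dates line up."""
    matches = []
    for anon_id, history in anon.items():
        for name, title, rating, review_date in public:
            for h_title, h_rating, h_date in history:
                close_enough = abs((review_date - h_date).days) <= date_tolerance_days
                if title == h_title and rating == h_rating and close_enough:
                    matches.append((anon_id, name))
    return set(matches)

print(link(anonymous_histories, public_reviews))
# {('user_48291', 'Jane Doe')} -- the 'anonymous' history now has a name attached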

The saga played out for another three years, at which point a team reached the 10 per cent improvement mark and won the Netflix Prize. Shortly thereafter, a class action lawsuit was filed against Netflix, accusing it of violating privacy laws. Netflix settled out of court and understandably cancelled their scheduled follow-up competition.

It’s interesting to compare these examples, as no laws were broken by Target, but the company took reputational risk through non-transparent use of personal information. Netflix, on the other hand, aligned its efforts in a very open and transparent way with the interests of customers, in this case to arrive at better video recommendations. There was little reputational risk, but there were legal consequences.

Other companies and even governments have fallen victim to such ‘linkage attacks’, in which linking data sources allows attackers to compromise privacy measures. If your projects require you to distribute anonymized personal information, you can apply techniques from differential privacy, a field that studies how to add carefully calibrated statistical noise so that aggregate results remain useful for legitimate applications while individual records cannot be re-identified. You may need this even for internal use of data, as laws are increasingly limiting companies’ rights to use personal data without explicit consent.
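As a minimal illustration of the idea, the sketch below adds Laplace noise to a single aggregate count, which is the simplest building block of differential privacy. The epsilon value and the query are hypothetical, and production systems also track a ‘privacy budget’ across all queries rather than releasing results one at a time.

```python
import random

def dp_count(true_count, epsilon=0.5, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    The difference of two exponential draws with rate epsilon/sensitivity
    is a Laplace random variable with that scale, so the presence or
    absence of any single person barely changes the published figure.
    """
    rate = epsilon / sensitivity
    noise = random.expovariate(rate) - random.expovariate(rate)
    return true_count + noise

# e.g. publish how many customers in a postcode have a given attribute
print(round(dp_count(true_count=132)))
```

Smaller epsilon values give stronger privacy at the cost of noisier published figures; choosing that trade-off is a policy decision, not just a technical one.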

Be aware that the behavioural data you are storing on your customers may hide more sensitive information than you realize. To illustrate, the Proceedings of the National Academy of Sciences documented a study conducted on the Facebook Likes of 58,000 volunteers. The researchers created a model that could, based only on a person’s ‘Likes’, identify with high accuracy a range of sensitive personal attributes, including:

  • sexual orientation;
  • ethnicity;
  • religious and political views;
  • personality traits;
  • intelligence;
  • happiness;
  • use of addictive substances;
  • parental separation;
  • age; and
  • gender.

By analysing the Facebook Likes of the users, the model could distinguish between Caucasians and African Americans with 95 per cent accuracy.88

So we see that two of the most fundamental tools of data science, the creative linking of data sources and the creation of insight-generating algorithms, both increase the risk of revealing sensitive personal details hidden within apparently innocuous data. Be aware of such dangers as you work to comply with privacy laws in a world of analytic tools that are increasingly able to draw insights from and identify hidden patterns within big data.

Data governance

Establish and enforce policies within your organization for how employees access and use the data in your systems. Designated individuals in your IT department, in collaboration with your privacy officers and the owners of each data source, will grant and revoke access to restricted data tables using named or role-based authorization policies and will enforce these policies with security protocols, often keeping usage logs to verify legitimate data usage. If you are in a regulated industry, you will be subject to more stringent requirements, and data scientists working with production systems may need to navigate half a dozen layers of security to reach the source data. In this case, you’ll want to choose an enterprise big data product with features developed for high standards of security and compliance.
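As a rough illustration of what such enforcement looks like, the sketch below combines a role-based access check with an audit log entry for every read attempt. The roles and table names are hypothetical; in practice an enterprise data platform or cloud IAM service would provide these controls rather than hand-written code.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data_access_audit")

# Hypothetical role-based grants: which roles may read which tables
ROLE_GRANTS = {
    "marketing_analyst": {"orders", "web_sessions"},
    "clinical_researcher": {"orders", "web_sessions", "medical_history"},
}

def read_table(user, role, table):
    """Check the role's grants, record the attempt, then allow or refuse."""
    allowed = table in ROLE_GRANTS.get(role, set())
    audit_log.info("%s user=%s role=%s table=%s allowed=%s",
                   datetime.now(timezone.utc).isoformat(), user, role, table, allowed)
    if not allowed:
        raise PermissionError(f"{role} is not authorized to read {table}")
    return f"...rows from {table}..."  # placeholder for the real query

read_table("alice", "marketing_analyst", "web_sessions")      # allowed, logged
# read_table("alice", "marketing_analyst", "medical_history") # refused, still logged
```

The essential point is that every access attempt, successful or not, leaves a trail that your privacy officer and auditors can review.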

Adding a big data repository to your IT stack may make it more difficult to control access to, usage of and eventual removal of personal information. In traditional data stores, data is kept in a structured format and each data point can be assessed for sensitivity and assigned appropriate access rights. Within big data repositories, data is often kept in unstructured format (‘schema on read’ rather than ‘schema on write’), so it is not immediately evident what sensitive data is present.

You may need to comply with ‘right to be forgotten’ or ‘right to erasure’ laws, particularly within Europe, in which case you must delete certain personal data on request. With big data stores, particularly the prevalent ‘data lakes’ of yet-to-be-processed data, it’s harder to know where personal data resides in your systems.
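To illustrate why this matters operationally, the sketch below performs a naive scan of raw files for a customer’s known identifiers in response to an erasure request. The paths, field labels and identifiers are hypothetical, and a real data lake would need distributed scanning and lineage metadata rather than a single-machine file walk.

```python
import re
from pathlib import Path

# Hypothetical identifiers supplied with an erasure request
ERASURE_REQUEST = {
    "email": re.compile(r"jane\.doe@example\.com", re.IGNORECASE),
    "customer_id": re.compile(r"\bC-481516\b"),
}

def find_personal_data(lake_root):
    """Yield (file, matched identifier) for every raw record mentioning the customer."""
    for path in Path(lake_root).rglob("*.json"):
        text = path.read_text(errors="ignore")
        for label, pattern in ERASURE_REQUEST.items():
            if pattern.search(text):
                yield path, label

for location, identifier in find_personal_data("/data/lake/raw"):
    print(f"Erasure candidate: {location} (matched {identifier})")
```

Knowing in advance where personal data lands in your lake, rather than searching for it after a request arrives, is what makes erasure deadlines achievable.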

GDPR will limit your use of data from European customers, requiring express consent for many business applications. This will constrain the work of your data scientists, and you’ll also be accountable under ‘right to explanation’ provisions for algorithms that impact customers, such as calculations of insurance risk or credit scores. You will likely need to introduce new access controls and audit trails for data scientists to ensure compliance with GDPR.

A full discussion of GDPR is beyond the scope of this book, and we’ve barely touched on the myriad other regulations in Europe and around the world. Also (quick disclaimer) I’m not a lawyer. Connect with privacy experts knowledgeable in the laws of the jurisdictions in which you operate.

Keep in mind

Laws restrict how you can use personal data, even if you have a right to collect and store that data.

Governance for reporting

Moving on from the topics of legal compliance and data protection, I’ll briefly touch on an optional governance framework that should reduce internal chaos in your organization and make life easier for you and your colleagues. You should develop and maintain a tiered governance model for how internal reports and dashboards are assembled and distributed within your organization. Most organizations suffer tremendously from not having such a model. Executives sit at quarter-end staring in dismay at a collection of departmental reports, each of which defines a key metric in a slightly different way. At other times, a quick analysis from an intern works its way up an email chain and ends up as input for a key decision in another department.

In my experience, you’ll spare yourself tremendous agony if you develop a framework for:

  1. Unifying definitions used in reports and dashboards.
  2. Clarifying the reliability and freshness of all reports and dashboards.

One way to do this is to introduce a multi-tiered certification standard for your reports and dashboards. The first tier would be self-service analyses and reports run against a development environment. Reports at this level should never leave the unit in which they are created. A tier-one report that demonstrates business value can be certified and promoted to tier two. Such a certification process would require a degree of documentation and consistency, and possibly additional development, signed off by designated staff. Tier-two reports that take on more mission-critical or expansive roles may be promoted to a third tier, and so on. By the time a report lands on the desk of an executive, the executive can be confident of its terminology, consistency and accuracy.
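As a sketch of how such a scheme might be recorded, the example below attaches a certification tier, an owner and a sign-off history to each report. The tier names, report titles and promotion workflow are illustrative only; the point is that every report carries its certification level, visible to whoever receives it.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import IntEnum
from typing import Optional

class Tier(IntEnum):
    SELF_SERVICE = 1      # stays within the originating unit
    CERTIFIED = 2         # documented, consistent definitions, signed off
    MISSION_CRITICAL = 3  # executive-ready, audited refresh schedule

@dataclass
class Report:
    title: str
    tier: Tier
    owner: str
    last_certified: Optional[date] = None
    approvers: list[str] = field(default_factory=list)

    def promote(self, approver: str) -> None:
        """Move the report up one tier once a designated approver signs off."""
        if self.tier is Tier.MISSION_CRITICAL:
            raise ValueError("already at the highest tier")
        self.approvers.append(approver)
        self.tier = Tier(self.tier + 1)
        self.last_certified = date.today()

weekly_churn = Report("Weekly churn by segment", Tier.SELF_SERVICE, owner="crm_team")
weekly_churn.promote(approver="bi_governance_board")
print(weekly_churn.tier.name, weekly_churn.last_certified)
```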

Takeaways

  • It is important that you identify and govern your use of personally identifiable information (PII) and quasi-identifiers.
  • Establish and enforce governance and auditing of internal data usage.
  • Laws related to privacy and data governance differ greatly by jurisdiction and may impact your organization even if it does not have a physical presence within that jurisdiction.
  • Europe’s GDPR will have a strong impact on any company with customers in the EU.
  • Linkage attacks and advanced analytic techniques can reveal private information despite your efforts to protect it.
  • Creating a tiered system for your internal reports and dashboards can provide consistency and reliability.

Ask yourself

  • What measures are you taking to protect personally identifiable information (PII) within your systems, including protection against linkage attacks? Make sure you are compliant with regional laws in this area and are not putting your reputation at risk from privacy infringement, even if legal.
  • If you have customers in Europe, what additional steps will you need to take to become compliant with GDPR? Remember that GDPR fines reach 4 per cent of global revenue.
  • If your organization does not have a privacy officer, whom can you consult for questions related to privacy and data protection laws? There are global firms that can provide advice spanning multiple jurisdictions.
  • When was the last time you reviewed an important internal report and realized the terminology used was unclear or the data was inaccurate? What steps did you take to address the problem? Perhaps you want to initiate an internal reporting governance programme, such as the one outlined in this chapter.