Chapter 13. Secure Data

Throughout this book, we have visited many areas of data science, often straying into those that are not traditionally associated with a data scientist's core working knowledge. In particular, we dedicated an entire chapter, Chapter 2, Data Acquisition, to data ingestion, which explains how to solve an issue that is always present, but rarely acknowledged or addressed adequately. In this chapter, we will visit another of those often overlooked fields, secure data; more specifically, how to protect your data and analytic results at all stages of the data life cycle, from ingestion right through to presentation, at all times considering the architectural and scalability requirements that are natural to the Spark paradigm.

In this chapter, we will cover the following topics:

  • How to implement coarse-grained data access controls using HDFS ACLs
  • A guide to fine-grained security, with explanations using the Hadoop ecosystem
  • How to ensure data is always encrypted, with an example using Java KeyStore
  • Techniques for obfuscating, masking, and tokenizing data
  • How Spark works with Kerberos
  • Data security - the ethical and technical issues

Data security

The final piece of our data architecture is security, and in this chapter we will discover why data security is always important. Given the huge increase in the volume and variety of data in recent times, caused by many factors but in no small part by the popularity of the Internet and related technologies, there is a growing need to provide fully scalable and secure solutions. We are going to explore those solutions along with the confidentiality, privacy, and legal concerns associated with storing, processing, and handling data, and we will relate these to the tools and techniques introduced in previous chapters.

We will continue by explaining the technical issues involved in securing data at scale and introduce ideas and techniques that tackle these concerns using a variety of access, classification, and obfuscation strategies. As in previous chapters, the ideas are demonstrated with examples using the Hadoop ecosystem, and public cloud infrastructure strategies are also presented.

The problem

We have explored many and varied topics in previous chapters, usually concentrating on the specifics of a particular issue and the approaches that can be taken to solve it. In all of these cases, there has been the implicit assumption that the data being used, and the insights gathered from it, do not need protecting in any way; or at least that the protection provided at the operating system level, such as login credentials, is sufficient.

In any environment, whether a home or a commercial one, data security is a huge issue that must always be considered. Perhaps, in a few instances, it is enough to write the data to a local hard drive and take no further steps; but this is rarely an acceptable course of action, and it should certainly be a conscious decision rather than default behavior. In a commercial environment, computing resources are often provided with built-in security. Even so, it is still important for the user to understand the implications of that security and decide whether further steps should be taken; data security is not just about protection from malicious entities or accidental deletion, but about everything in between.

As an example, suppose you work in a secure, regulated, commercial, air-gapped environment (one with no access to the Internet) within a team of like-minded data scientists. Individual security responsibilities are still just as important as in an environment where no security exists at all: you may have access to data that must not be viewed by any of your peers, and you may need to produce analytical results for different, diversified user groups, none of whom may see each other's data. The responsibility for ensuring that the data is not compromised may rest, explicitly or implicitly, with you; therefore, a strong understanding of the security layers in your software stack is imperative.

The basics

Security considerations are everywhere, even in places you probably hadn't thought of. For example, when Spark is running a parallel job on a cluster, do you know the points at which data may touch physical disk during that life cycle? If you assumed that everything is done in RAM, then you have a potential security issue right there, as data can be spilled to disk; we will return to the implications of this later in the chapter. The point here is that you cannot always delegate security responsibility to the frameworks you are using. Indeed, the more varied the software you use, the more the security concerns multiply, both user-related and data-related.
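To make the spill risk concrete: Spark writes shuffle and cache data to the directories named by `spark.local.dir`, and versions from Spark 2.1 onwards can encrypt those temporary files before they reach disk. A minimal `spark-defaults.conf` fragment might look like the following sketch; the property names are standard Spark settings, while the key size and directory shown are illustrative choices, not recommendations from this chapter:

```
# Encrypt shuffle spill and other temporary block data written to local disk
spark.io.encryption.enabled       true
spark.io.encryption.keySizeBits   128

# Control where spill files land, e.g. a volume with restricted permissions
spark.local.dir                   /secure/scratch/spark
```

Note that this protects only Spark's own temporary files; it is not a substitute for encrypting the data at rest in HDFS.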

Security can be broadly split into three areas:

  • Authentication: determining the legitimacy of the identity of a user
  • Authorization: the privileges that a user holds to perform specific actions
  • Access: the security mechanisms used to protect data, both in transit and at rest

There are important differences between these points. A user may hold full permissions to access and edit a file, but if the file has been encrypted outside of that user's security realm, then the file may still not be readable; the access layer intervenes regardless of the user's authorization. Equally, a user may send data across a secure link to be processed on a remote server before a result is returned, but this does not guarantee that the data has left no footprint on that remote server; the security mechanisms at the other end are unknown.
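The first scenario, a file encrypted outside the user's security realm, hints at the keystore-backed encryption we meet later in this chapter. As a minimal sketch of the idea (the alias, file name, and password below are illustrative, not taken from the chapter's example), an AES key can be held in a password-protected JCEKS keystore and used to encrypt a record; file permissions alone then no longer make the data readable:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.KeyStore;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class KeyStoreSketch {

    // Encrypts and then decrypts a record using a key held in a JCEKS keystore,
    // returning the recovered plaintext to demonstrate the round trip.
    static String roundTrip(String record) throws Exception {
        char[] password = "changeit".toCharArray(); // illustrative password

        // Generate a 128-bit AES key and persist it in a keystore file
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        KeyStore ks = KeyStore.getInstance("JCEKS");
        ks.load(null, password); // initialise an empty keystore
        ks.setEntry("data-key", new KeyStore.SecretKeyEntry(key),
                new KeyStore.PasswordProtection(password));
        try (FileOutputStream out = new FileOutputStream("demo.jceks")) {
            ks.store(out, password);
        }

        // Reload the keystore, as a separate consumer process would
        KeyStore reloaded = KeyStore.getInstance("JCEKS");
        try (FileInputStream in = new FileInputStream("demo.jceks")) {
            reloaded.load(in, password);
        }
        SecretKey recovered = (SecretKey) reloaded.getKey("data-key", password);

        // Encrypt the record with AES-GCM; the IV must be stored alongside it
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, recovered, new GCMParameterSpec(128, iv));
        byte[] ciphertext = enc.doFinal(record.getBytes(StandardCharsets.UTF_8));

        // Decrypt it again with the recovered key
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, recovered, new GCMParameterSpec(128, iv));
        return new String(dec.doFinal(ciphertext), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("sensitive record"));
    }
}
```

Without the keystore password, neither the key nor the plaintext can be recovered, whatever file permissions the user holds.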
