Chapter 7. Securing Your Data Lake

Security architectures in the cloud are very different from those on-premises. Today, the cloud is reasonably secure. But it has taken some time to get here.

When the public cloud began, it lacked basic security functionality. For example, AWS EC2-Classic instances received public IP addresses by default. A few years later, Amazon introduced its virtual private cloud (Amazon VPC), which added private subnets and network boundaries. Since then, the cloud has matured from a generic compute environment in which security was a minor consideration to one with the extensibility and functionality that allow a security professional to protect infrastructure far more effectively.

We’ve seen three generations of cloud security thus far. In the first generation there was little security, and what existed was typically ad hoc. The second generation introduced virtual private clouds and third-party services, such as application firewalls, to enhance security. The third generation has added logging-driven approaches, such as Lambda functions that trigger on specific events, and tools like AWS Gatekeeper. There is room for a fourth generation to improve cloud security even more.
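To make the third generation concrete, the following is a minimal sketch of an event-triggered logging function in Python, assuming AWS Lambda invoked by CloudTrail activity routed through EventBridge. The event field names, the watched event list, and the SECURITY_LOG_BUCKET environment variable are illustrative assumptions, not a prescription:

    import json
    import os
    import boto3

    # Hypothetical bucket name supplied via environment; illustrative only.
    LOG_BUCKET = os.environ.get("SECURITY_LOG_BUCKET", "example-security-logs")

    # Event names we treat as worth recording (illustrative, not exhaustive).
    WATCHED_EVENTS = {"AuthorizeSecurityGroupIngress", "ConsoleLogin", "DeleteTrail"}

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        """Triggered by a CloudTrail event routed through EventBridge (assumed setup)."""
        detail = event.get("detail", {})
        event_name = detail.get("eventName", "")

        if event_name not in WATCHED_EVENTS:
            return {"recorded": False}

        # Persist the raw event so the security team can analyze it later.
        key = f"alerts/{detail.get('eventTime', 'unknown')}-{event_name}.json"
        s3.put_object(Bucket=LOG_BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
        return {"recorded": True, "event": event_name}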

It’s important to look at cloud security primitives differently from those of on-premises setups, and to adapt those primitives to create a secure data infrastructure.

When securing a data lake in the cloud for the first time, security engineers need to:

  • Understand the different parties involved in cloud security.

  • Expect a lot of noise from your security tools as well as the big data frameworks and solutions that your team needs to use.

  • Carefully establish privileges and policies on your data based on what you’re trying to protect and from whom.

  • Use big data analytics internally to inform and improve your organization’s security capabilities and decision making.

We will explore these considerations in the rest of the chapter.

Consideration 1: Understand the Three “Distinct Parties” Involved in Cloud Security

Ensuring security in a cloud-based data lake is a relatively new science; we’re still figuring it all out. But when your data lake resides in the cloud, the first thing you need to realize is that you still must think about security. Yes, your cloud vendor has broad security responsibilities, which we examine later in the chapter, but you do, too. And if you build your data lake on a big data cloud platform such as Qubole or another vendor’s offering, there are three organizations that must work together to ensure that your data and systems are protected.

There’s you, the company building the data lake. There’s the cloud platform vendor. And then there’s the cloud provider, which not only supports the platform, but also provides the tools and resources that allow you to store the relevant data and do the processing required to analyze it.

You can’t just depend on the cloud provider to protect you. You must rigorously practice sound access control and security hygiene throughout your organization. And you need to question whether the cloud platform vendor has sufficient security in place as well. You need to ask the tough questions: How do I know my data is secure? How can I ensure that the compute resources are safe?

Here are your responsibilities for security in the cloud:

  • Customer data

  • Data encryption

  • User management

  • Infrastructure identity and access management

  • Definitions of users’ roles and responsibilities (perhaps using personas)

Here are the cloud platform vendor’s responsibilities for security in the cloud:

  • Secure access to the data platform

  • Secure transport of commands

  • Third-party attestations or validations: service organization controls, HIPAA, PCI

  • Operating system firewall configuration

  • Metadata security

And, finally, here are the cloud provider’s responsibilities for security in the cloud:

  • Storage

  • Availability and redundancy

  • Compute

  • Networking

Consideration 2: Expect a Lot of Noise from Your Security Tools

Security tools are notoriously “noisy.” They generate a lot of false positives: alerts about events that turn out to be nothing.

A 2018 survey by Advanced Threat Analytics found that nearly half (44%) of respondents reported a false-positive rate of 50% or higher: 22% of respondents reported a 50 to 75% false-positive rate, and another 22% reported that 75 to 99% of their alerts were false positives.

Security mechanisms such as intrusion detection systems, file integrity tools, and firewalls tend to generate a lot of logs. Take the firewall, for example: it has rules programmed in to either deny or allow traffic, and it creates a log entry for every decision. Intrusion detection systems are also rules based, looking for traffic anomalies, explains Drew Daniels, VP and CISO of Qubole. They’re saying, “These are the things that I care about, and if you see any of these logins, take action.” But, he says, they still generate “volumes and volumes of data.” And other security mechanisms likewise generate logs and have the potential to alert you when something unwanted or unknown occurs.

Today, 80% of security tools are rules driven. But we are beginning to see tools that incorporate AI and machine learning. It’s now possible for companies to take these rules-based solutions and learn from what they find. More important, they can correlate events and alerts to ease incident response.

Additionally, many security professionals dream of a “single pane of glass” that gives them insight into their entire infrastructure, instead of having to log in to multiple security tools to figure out what is going on. A tool that links or combines data from multiple sources would satisfy this desire.

Rules-based engines are not ideal because an incident responder must collect all the relevant log data and then put it in the correct order. Only then can security professionals determine how severe an incident was and how best to address it.

It can take up to an hour to analyze whether something is really an incident. And if you’re getting 20 to 30 alerts per day, you’re going to have two or three people doing nothing but chasing incidents. Because most of those reports end up being nothing, that’s quite a lot of noise, and a real problem.
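To illustrate the correlation work an incident responder faces, here is a minimal sketch in Python that merges records from several tools and orders them by timestamp to reconstruct a timeline. The field names (“timestamp,” “source”) and the toy log records are hypothetical:

    from datetime import datetime
    from typing import Iterable

    def build_incident_timeline(*log_sources: Iterable[dict]) -> list:
        """Merge records from several tools and sort them chronologically.

        Each record is assumed to carry an ISO-8601 'timestamp' field and a
        'source' field (e.g., 'firewall', 'ids'); these names are illustrative.
        """
        merged = [record for source in log_sources for record in source]
        return sorted(merged, key=lambda r: datetime.fromisoformat(r["timestamp"]))

    # Example usage with two toy log sources.
    firewall_logs = [
        {"timestamp": "2019-03-01T10:02:11", "source": "firewall", "action": "deny", "ip": "203.0.113.7"},
    ]
    ids_logs = [
        {"timestamp": "2019-03-01T10:01:58", "source": "ids", "rule": "port-scan", "ip": "203.0.113.7"},
    ]

    for event in build_incident_timeline(firewall_logs, ids_logs):
        print(event["timestamp"], event["source"], event)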

Consideration 3: Protect Critical Data

Security professionals need to first consider what they are trying to protect, and from what. They don’t think in terms of preventing users from accessing data; instead, they view security in terms of what’s sensitive, and how they can best keep the company’s data safe and secure. After all, data is increasingly a business’s most valuable asset.

A best practice in the cloud (as when on-premises) is to apply the principle of least privilege: you don’t want to deny access to data if someone needs it to do their job, but you do want to restrict that person’s access to only the data they need to carry out that job.
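As a minimal sketch of least privilege in practice, assuming an AWS-style data lake on S3, the following IAM-style policy (built as a Python dictionary) grants read-only access to a single analytics prefix and nothing else. The bucket and prefix names are illustrative:

    import json

    # Illustrative bucket and prefix; substitute your own resources.
    BUCKET = "example-datalake"
    PREFIX = "analytics/sales/"

    # Least-privilege policy: read-only, and only under one prefix of one bucket.
    least_privilege_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadOnlySalesAnalytics",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{BUCKET}/{PREFIX}*"],
            },
            {
                "Sid": "ListOnlyThatPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{BUCKET}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}*"]}},
            },
        ],
    }

    print(json.dumps(least_privilege_policy, indent=2))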

How best to allow access? That depends on the organization. For example, Qubole has a policy that if you’re in the office, you must use multifactor authentication (MFA) and you must change your password every 90 days. When you’ve used MFA to log in, it’s valid for 24 hours. Then you need to sign in again.

However, when you’re out of the office (remote), your MFA login will expire after 8 hours. Some companies are even more restrictive and don’t allow personnel to access resources remotely without the use of a virtual private network (VPN) connection to ensure that the data transmitted and received is properly encrypted.
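Rules like these can often be expressed directly in policy. Here is a minimal sketch, again as an AWS IAM-style policy assembled in Python, that denies access unless the caller authenticated with MFA within the last 24 hours; the durations mirror the example above, and the broad Action/Resource scope is purely illustrative:

    import json

    # 24 hours for in-office sessions, 8 hours for remote ones (from the example above).
    IN_OFFICE_MAX_MFA_AGE_SECONDS = 24 * 60 * 60   # 86,400
    REMOTE_MAX_MFA_AGE_SECONDS = 8 * 60 * 60       # 28,800

    # Illustrative deny-unless-recent-MFA policy for the in-office case.
    mfa_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyWithoutMFA",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    # Deny if the session was not established with MFA at all...
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"},
                },
            },
            {
                "Sid": "DenyWhenMFAIsStale",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    # ...or if the MFA authentication is older than 24 hours.
                    "NumericGreaterThan": {
                        "aws:MultiFactorAuthAge": str(IN_OFFICE_MAX_MFA_AGE_SECONDS)
                    },
                },
            },
        ],
    }

    print(json.dumps(mfa_policy, indent=2))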

“Some of our most successful customers set up roles and groups. Some also use SAML [Security Assertion Markup Language] for extra protection,” says Daniels.

A good first step is to create roles, decide what access rights belong to each role, and then assign users to the correct roles. Sometimes, you can have an entire group of users with the same access rights; at other times, you need to allow access to certain data or systems or tools on a case-by-case basis. But using the principle of least privilege is critical throughout.

Because certain users need to work with multiple systems and use cases, there can be difficulties. You want to make sure that you understand what access each person needs to be able to do their job. Then you want to provide that access, but not overprovide it. Many companies dictate access rights by role—for example, data scientists have certain rights, and data analysts have others. Or you can assign rights by functional team. You can tailor the specifics to your context, as long as you’re operating on the basis of the principle of least privilege.

“For example, you don’t necessarily want employees from marketing or sales accessing corporate financial data,” says Daniels. “So, you can limit their access based on their function.” Following from this, you might want to go quite granular on what you allow users to do within a system or with data. You could allow your data analysts to use existing clusters, but not to configure or create new ones. In such a case, says Daniels, “data scientists may want to start certain clusters and tune them to do something different.”
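Here is a minimal sketch of how role-to-permission mappings like those just described might be modeled in Python. The role names, permission strings, and check function are hypothetical, not any particular product’s API:

    # Hypothetical role-to-permission map reflecting the examples above:
    # analysts can use existing clusters, scientists can also create and tune them,
    # and neither role can read corporate financial data.
    ROLE_PERMISSIONS = {
        "data_analyst": {"cluster:use", "data:read:analytics"},
        "data_scientist": {"cluster:use", "cluster:create", "cluster:configure", "data:read:analytics"},
        "finance": {"data:read:financials"},
    }

    def is_allowed(role: str, permission: str) -> bool:
        """Return True only if the role explicitly holds the permission (least privilege)."""
        return permission in ROLE_PERMISSIONS.get(role, set())

    # Example checks.
    assert is_allowed("data_scientist", "cluster:create")
    assert not is_allowed("data_analyst", "cluster:create")
    assert not is_allowed("data_analyst", "data:read:financials")

The specifics will vary from organization to organization, but the discipline is the same: a role holds only the permissions it explicitly needs.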

Many of today’s cloud-native tools allow for remarkably granular controls. Fundamentally, it is about protecting the business’s data. After you get past that, you should be making sure that the right configurations exist for the right users. Tools like Apache Ranger (discussed in Chapter 6) make this possible.

In a typical business environment, the data team will sit down with whoever needs data—perhaps the finance team—and they’ll capture requirements from that team. What do they need to see? What do they need to do? And then the data team creates a role that grants certain rights to financial employees. They might be able to look at the last two quarters of financial results, but they can’t go in and do a custom query or ask questions that haven’t already been structured in a report or a notebook. Then, if users need more access than that, they can come to the data team, open a support ticket, and make their requests, which can be decided on a case-by-case basis.

Consideration 4: Use Big Data to Enhance Security

The fourth and final consideration is this: instead of simply serving the needs of the other functions in the organization, why not take advantage of all of these tools, including big data, advanced analytics, AI, and machine learning, to improve security operations?

“Much like other areas in technology, security generates a lot of data, and one of my passions is trying to figure out how I can take a look at that data, combine it with other datasets, and be able to find things that I might not otherwise be able to discover,” says Daniels. “A lot of these tools generate terabytes of log data every day, maybe even more depending on how large the infrastructure is and how widely the tool is deployed.”

Companies need to find ways to leverage all this security data, and use data science and machine learning to learn how to make their organizations more secure. This is the dream of incident response leaders, who, as mentioned earlier, currently spend far too much time per incident, which is both inefficient and costly.

Very few people, if any, are doing this today. Managed services security operations centers (MSSOCs) are trying, however. With an MSSOC, you have a third party ingesting all of your logs and doing the analysis for you. This often raises concerns: Does the MSSOC support your tools and data? How do you send it your data? What is it doing with the data? Who owns and controls the data?

From a security perspective, this is something that many CIOs or CISOs worry about. “I don’t like when somebody is saying, ‘Give me all your data. Oh, here’s your incidents,’” says Daniels. What are they potentially missing? Because it’s completely outsourced, you don’t know what that is, and you don’t know how bad it might be.

“I think the second thing is security professionals are often paranoid,” says Daniels. “They want to know how the lasagna’s being made.”

Right now, the most common way that companies manage security is by sifting through all the data that their security tools generate. But the more tools they use, the more difficult that becomes.

“We at Qubole have started taking the approach of mandating that our security vendors dump all that data in a common format and location so that we can take it and figure out what issues might need our attention,” says Daniels.
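As a minimal sketch of that approach, assuming the vendors’ logs land in S3 as newline-delimited JSON with vendor, severity, and source_ip fields (all illustrative names, not Qubole’s actual layout), a PySpark job could consolidate them and surface the noisiest sources:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("security-log-triage").getOrCreate()

    # Hypothetical common landing zone; path and schema are assumptions.
    alerts = spark.read.json("s3://example-security-logs/vendors/*/2019/*/*.json")

    # Count high-severity alerts per vendor and source IP to see where attention is needed.
    summary = (
        alerts
        .filter(F.col("severity") >= 7)
        .groupBy("vendor", "source_ip")
        .agg(F.count("*").alias("alert_count"))
        .orderBy(F.desc("alert_count"))
    )

    summary.show(20, truncate=False)

Once the alerts share a schema and a location, the same engine that serves the business’s analytics can serve the security team.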

When you get to this advanced stage of security, you’re not just focusing on data governance, but are also proactively seeking out use cases around fraud detection, or hacking attacks, or easier user administration and management.

“Typically, security teams are understaffed and underfunded,” says Daniels. “So I get a lot of people coming to me asking, ‘How did you do it?’ They’re extremely thrilled about the possibility of doing this themselves.”

Qubole is close, he says. It has been working with its vendors to get security data exported to Amazon S3 so that it can ingest it, “and we now have three or four of those data sources ready to be ingested,” he says. “I’m excited at the prospect.”
