Security ecosystem

We will conclude with a brief rundown of some of the popular security tools we may encounter while developing with Apache Spark - and some advice about when to use them.

Apache Sentry

As the Hadoop ecosystem grows ever larger, products such as Hive, HBase, HDFS, Sqoop, and Spark all have different security implementations. This means that duplicate policies are often required across the product stack in order to provide the user with a seamless experience, as well as enforce the overarching security manifest. This can quickly become complicated and time consuming to manage, which often leads to mistakes and even security breaches (whether intentional or otherwise). Apache Sentry pulls many of the mainstream Hadoop products together, particularly with Hive/HS2, to provide fine-grained (up to column level) controls.

Using ACLs is simple, but high maintenance. Setting permissions for a large number of new files and amending umasks is cumbersome and time consuming. Authorization also becomes more complicated as abstractions are created; for example, collections of files and directories become tables, columns, and partitions. Therefore, we need a trusted entity to enforce access control. Hive has such a trusted service - HiveServer2 (HS2) - which parses queries and ensures that users have access to the data they are requesting. HS2 runs as a trusted user with access to the whole data warehouse. Users don't run code directly in HS2, so there is no risk of code bypassing access checks.
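To illustrate why managing raw HDFS ACLs quickly becomes cumbersome, the following sketch grants a group access to a single warehouse directory. The path, group name, and permission set are hypothetical; every new dataset would need the same treatment:

```shell
# Grant the 'analysts' group read/execute access to one directory (hypothetical path/group)
hdfs dfs -setfacl -m group:analysts:r-x /data/warehouse/sales

# Add a default ACL so files created later inherit the same access
hdfs dfs -setfacl -m default:group:analysts:r-x /data/warehouse/sales

# Inspect the resulting ACL entries
hdfs dfs -getfacl /data/warehouse/sales
```

Multiply this by hundreds of datasets and dozens of groups, and the appeal of a centralized policy layer such as Sentry becomes clear.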

To bridge Hive and HDFS data, we can use the Sentry HDFS plugin, which synchronizes HDFS file permissions with the higher-level abstractions. For example, permission to read a table implies permission to read the table's files and, similarly, permission to create a table implies permission to write to the database's directory. We still use HDFS ACLs for fine-grained user permissions; however, we are restricted to the filesystem view of the world and therefore cannot provide column-level or row-level access - it's "all or nothing". As mentioned previously, Accumulo provides a good alternative when this scenario is important. There is another product that also addresses this issue - see the RecordService section.
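With Sentry enforcing policy through HS2, permissions are expressed as SQL grants against roles rather than file ACLs. The sketch below issues role-based (and, where the Sentry version supports it, column-level) grants via beeline; the hostnames, Kerberos principal, role, group, table, and column names are all hypothetical:

```shell
# Connect to HiveServer2 and issue Sentry role-based grants
# (connection string, principal, and all object names are hypothetical)
beeline -u "jdbc:hive2://hs2-host:10000/default;principal=hive/hs2-host@EXAMPLE.COM" <<'SQL'
CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;
-- Column-level grant: analysts may read only these two columns of the table
GRANT SELECT(customer_id, region) ON TABLE sales TO ROLE analyst_role;
SQL
```

Because the grant is attached to a role rather than to files, new partitions and files under the table are covered automatically, which is precisely the maintenance burden that raw ACLs cannot avoid.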

The quickest and easiest way to implement Apache Sentry is to use Apache Hue. Hue has been developed over the last few years, starting life as a simple GUI pulling together a few of the basic Hadoop services, such as HDFS, and has grown into a hub for many of the key building blocks in the Hadoop stack; HDFS, Hive, Pig, HBase, Sqoop, Zookeeper, and Oozie all feature, together with integrated Sentry to handle security. A demonstration of Hue can be found at http://demo.gethue.com/, providing a great introduction to the feature set. We can also see many of the ideas discussed in this chapter in practice, including HDFS ACLs, RBACs, and Hive HS2 access.

RecordService

One of the key aspects of the Hadoop ecosystem is decoupling storage managers (for example, HDFS and Apache HBase) and compute frameworks (for example, MapReduce, Impala, and Apache Spark). Although this decoupling allows for far greater flexibility, thus allowing the user to choose their framework components, it leads to excessive complexity due to the compromises required to ensure that everything works together seamlessly. As Hadoop becomes an increasingly critical infrastructure component for users, the expectations for compatibility, performance, and security also increase.

RecordService is a new core security layer for Hadoop that sits between the storage managers and compute frameworks to provide a unified data access path, fine-grained data permissions, and enforcement across the stack.

RecordService is only compatible with Cloudera 5.4 or later and thus cannot be used in a standalone capacity, nor with Hortonworks HDP, which instead uses Apache Ranger to achieve the same goals. More information can be found at www.recordservice.io.

Apache Ranger

The aims of Apache Ranger are broadly the same as those of RecordService, the primary goals being:

  • Centralized security administration, managing all security-related tasks from a central UI or via REST APIs
  • Fine-grained authorization of specific actions and operations within each Hadoop component or tool, managed through a central administration tool
  • Standardized authorization methods across all Hadoop components
  • Enhanced support for different authorization methods, including role-based access control and attribute-based access control
  • Centralized auditing of user access and (security-related) administrative actions across all Hadoop components
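Centralized administration via REST means policies can be created programmatically as well as through the UI. The sketch below posts a simple Hive read policy to the Ranger admin server's public v2 API; the host, credentials, service name, and all policy fields are hypothetical, and the exact JSON schema should be checked against the Ranger version in use:

```shell
# Create a Hive 'select' policy via the Ranger admin REST API
# (host, credentials, service name, and policy contents are hypothetical)
curl -u admin:password -H "Content-Type: application/json" \
  -X POST "http://ranger-host:6080/service/public/v2/api/policy" \
  -d '{
        "service": "cluster1_hive",
        "name": "sales_read",
        "resources": {
          "database": {"values": ["default"]},
          "table":    {"values": ["sales"]},
          "column":   {"values": ["*"]}
        },
        "policyItems": [{
          "users":    ["analyst"],
          "accesses": [{"type": "select", "isAllowed": true}]
        }]
      }'
```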

At the time of writing, Ranger is an Apache Incubator project and has therefore not yet reached a major point release. It is, however, fully integrated with Hortonworks HDP, supporting HDFS, Hive, HBase, Storm, Knox, Solr, Kafka, NiFi, YARN, and, crucially, a scalable cryptographic key management service for HDFS encryption. Full details can be found at http://ranger.incubator.apache.org/ and http://hortonworks.com/apache/ranger/.

Apache Knox

We have discussed many of the security areas of the Spark/Hadoop stack, but they are all related to securing individual systems or data. An area that has not been mentioned in any detail is that of securing a cluster itself from unauthorized external access. Apache Knox fulfills this role by "ring fencing" a cluster and providing a REST API Gateway through which all external transactions must pass.

Coupled with a Kerberos secured Hadoop cluster, Knox provides authentication and authorization, protecting the specifics of the cluster deployment. Many of the common services are catered for, including HDFS (via WEBHDFS), YARN Resource Manager, and Hive.
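Because all external traffic passes through the gateway, clients talk to a single Knox endpoint rather than to individual cluster services. The sketch below lists an HDFS directory through Knox's WebHDFS proxy; the gateway host, port, topology name ("default"), and credentials are hypothetical:

```shell
# List the HDFS root through the Knox gateway instead of contacting a NameNode directly
# (gateway host, port, topology, and credentials are hypothetical)
curl -iku alice:alice-password \
  "https://knox-host:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS"
```

Note that the client never learns the NameNode's address: the topology deployed on the gateway maps the /gateway/default/webhdfs path to the internal service, which is how Knox hides the specifics of the cluster deployment.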

Knox is another project heavily contributed to by Hortonworks and is therefore fully integrated into the Hortonworks HDP platform. Whilst Knox can be deployed into virtually any Hadoop cluster, HDP offers the most fully integrated approach. More information can be found at knox.apache.org.
