Kerberos authentication

Many installations of Apache Spark use Kerberos to provide security and authentication to services such as HDFS and Kafka. It's also especially common when integrating with third-party databases and legacy systems. As a commercial data scientist, at some point, you'll probably find yourself in a situation where you'll have to work with data in a Kerberized environment, so, in this part of the chapter, we'll cover the basics of Kerberos - what it is, how it works, and how to use it.

Kerberos is a trusted third-party authentication protocol that's particularly useful where the primary form of communication is over a network, which makes it ideal for Apache Spark. It's used in preference to alternative methods of authentication, such as username and password, because it provides the following benefits:

  • No passwords are stored in plain text in application configuration files
  • Facilitates centralized management of services, identities, and permissions
  • Establishes a mutual trust, so both entities are identified
  • Prevents spoofing - trust is only established temporarily, for a timed session, meaning replay attacks are not possible, while sessions remain renewable for convenience

Let's look at how it works with Apache Spark.

Use case 1: Apache Spark accessing data in secure HDFS

In the most basic use case, once you're logged on to an edge node (or similar) of your secure Hadoop cluster and before running your Spark program, Kerberos must be initialized. This is done using the kinit command provided by the Kerberos client tools and entering your user's password when prompted:

> kinit 
Password for user: 
> spark-shell 
Spark session available as 'spark'. 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1 
      /_/ 
          
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101) 
Type in expressions to have them evaluated. 
Type :help for more information. 
 
scala> val file = sc.textFile("hdfs://...") 
scala> file.count 

At this point, you will be fully authenticated and able to access any data within HDFS, subject to the standard permissions model.

So, the process seems simple enough; let's take a deeper look at what happened here:

  1. When the kinit command runs, it immediately sends a request to the Kerberos Key Distribution Center (KDC) to acquire a ticket granting ticket (TGT). The request is sent in plain text and essentially contains what is known as the principal, which in this case is basically "username@kerberosdomain" (you can find this string using the klist command). The Authentication Server (AS) responds with a TGT encrypted using the client's secret key, a key that was shared ahead of time and is already known to the AS. This ensures the secure transfer of the TGT.
  2. The TGT is cached locally on the client, along with any keytab file (a container for Kerberos keys), and is accessible to any Spark process running as the same user.
  3. Next, when spark-shell is started, Spark uses the cached TGT to ask the Ticket Granting Server (TGS) for a session ticket granting access to the HDFS service. This ticket is encrypted using the HDFS NameNode's secret key, which guarantees its secure transfer and ensures that only the NameNode can read it.
  4. Armed with a ticket, Spark requests a delegation token from the NameNode. The purpose of this token is to prevent a flood of requests into the KDC when the executors start reading data (Kerberos was not designed with big data in mind!), and it also helps overcome the problems Spark has with delayed execution and ticket expiry (a sketch of this exchange follows the list).
  5. Spark ensures that all executors have access to the delegation token by placing it on the distributed cache so that it's available as a YARN local file.
  6. When each executor makes a request to the NameNode for access to a block stored in HDFS, it passes across the delegation token it was given previously. The NameNode replies with the location of the block, along with a block token signed by the NameNode using a secret key that is shared between the NameNode and the DataNodes in the cluster and known only to them. The purpose of this additional block token is to ensure that access is fully secured: it is only issued to authenticated users and can only be verified by the DataNodes.
  7. The last step is for the executors to supply the block token to the relevant DataNode and receive the requested block of data.
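
To make steps 3 to 5 a little more concrete, the following sketch shows how a delegation token can be requested programmatically using Hadoop's UserGroupInformation and FileSystem APIs. You would not normally write this yourself - Spark performs the exchange on your behalf when a job is submitted - and the principal, keytab path, and renewer name shown here are placeholder values:

import java.security.PrivilegedExceptionAction 
 
import org.apache.hadoop.conf.Configuration 
import org.apache.hadoop.fs.{FileSystem, Path} 
import org.apache.hadoop.security.{Credentials, UserGroupInformation} 
 
// Load the client-side cluster configuration (assumed location) 
val conf = new Configuration() 
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml")) 
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml")) 
UserGroupInformation.setConfiguration(conf) 
 
// Authenticate against the KDC using a keytab (placeholder principal and path) 
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI( 
  "username@EXAMPLE.COM", "/path/to/user.keytab") 
 
// Ask the NameNode for HDFS delegation tokens as that user (steps 3 and 4) 
val credentials = new Credentials() 
ugi.doAs(new PrivilegedExceptionAction[Unit] { 
  override def run(): Unit = { 
    val fs = FileSystem.get(conf) 
    // "yarn" is a placeholder renewer principal 
    fs.addDelegationTokens("yarn", credentials) 
  } 
}) 
 
// Spark ships these tokens to the executors via the distributed cache (step 5) 
import scala.collection.JavaConverters._ 
credentials.getAllTokens.asScala.foreach(token => println(token.getKind)) 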


Use case 2: extending to automated authentication

By default, Kerberos tickets last for 10 hours and then expire, making them useless after this time, although they can be renewed. Therefore, when executing long-running Spark jobs or Spark Streaming jobs (or any job where a user is not directly involved and kinit cannot be run manually), it is possible to pass enough information when starting a Spark process to automate the renewal of the tickets issued during the handshake discussed previously.

This is done by passing in the location of the keytab file and associated principal using the command line options provided, like so:

spark-submit \
   --master yarn-client \
   --class SparkDriver \
   --files keytab.file \
   --keytab keytab.file \
   --principal username@domain \
   ApplicationName

When attempting to execute a long-running job as your local user, the principal name can be found using klist; otherwise, dedicated service principals can be configured within Kerberos using ktutil and kadmin.
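
Behind the scenes, these options allow Spark on YARN to log in again from the keytab as tickets approach expiry, rather than relying on a manual kinit. If you need similar behaviour in your own long-running JVM process, a minimal sketch using Hadoop's UserGroupInformation might look like the following - the principal and keytab path are placeholders:

import org.apache.hadoop.conf.Configuration 
import org.apache.hadoop.security.UserGroupInformation 
 
// Tell Hadoop's security layer that Kerberos is in use 
val conf = new Configuration() 
conf.set("hadoop.security.authentication", "kerberos") 
UserGroupInformation.setConfiguration(conf) 
 
// Initial login from the keytab (placeholder principal and path) 
UserGroupInformation.loginUserFromKeytab("username@EXAMPLE.COM", "/path/to/user.keytab") 
 
// Periodically re-login before the (typically 10-hour) ticket lifetime runs out; 
// the call below is a no-op unless the TGT is close to expiry 
while (true) { 
  UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab() 
  Thread.sleep(60 * 60 * 1000) 
} 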

Use case 3: connecting to secure databases from Spark

When working in a corporate setting, it may be necessary to connect to a third-party database that has been secured with Kerberos, such as PostgreSQL or Microsoft SQLServer.

In this situation, it is possible to use the JdbcRDD to connect directly to the database and have Spark issue a SQL query to ingest data in parallel. Care should be taken when using this approach, as traditional databases are not built for high levels of parallelism, but used sensibly it is sometimes a very useful technique, particularly well suited to rapid data exploration.

Firstly, you will need the native JDBC drivers for your particular database - here we've used Microsoft SQLServer as an example, but drivers should be available for all modern databases that support Kerberos (see RFC 1964).

You'll need to configure spark-shell to use the JDBC drivers on startup, like so:

> JDBC_DRIVER_JAR=sqljdbc.jar 
> spark-shell \
    --master yarn-client \
    --driver-class-path $JDBC_DRIVER_JAR \
    --files keytab.file \
    --conf spark.driver.extraClassPath=$JDBC_DRIVER_JAR \
    --conf spark.executor.extraClassPath=$JDBC_DRIVER_JAR \
    --jars $JDBC_DRIVER_JAR

Then, in the shell, type or paste the following, replacing the environment-specific values (the principal, the keytab filename, and the connection string details) with your own:

import org.apache.spark.rdd.JdbcRDD 
 
new JdbcRDD(sc, () => { 
        import org.apache.hadoop.conf.Configuration 
        import org.apache.hadoop.fs.Path 
        import org.apache.hadoop.security.UserGroupInformation 
        import UserGroupInformation.AuthenticationMethod 
        import org.apache.spark.SparkFiles 
        import java.security.PrivilegedAction 
        import java.sql.{Connection, DriverManager} 
 
        // Environment-specific values - replace with your own 
        val principal = "username@domain"                // your Kerberos principal (see klist) 
        val keytabFile = SparkFiles.get("keytab.file")   // the keytab distributed via --files 
 
        val driverClassName = "com.microsoft.sqlserver.jdbc.SQLServerDriver" 
        val url = "jdbc:sqlserver://" + 
                  "host:port;instanceName=DB;" + 
                  "databaseName=mydb;" + 
                  "integratedSecurity=true;" + 
                  "authenticationScheme=JavaKerberos" 
 
        Class.forName(driverClassName) 
 
        // Load the Hadoop client configuration from disk 
        val conf = new Configuration 
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml")) 
        conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml")) 
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml")) 
        UserGroupInformation.setConfiguration(conf) 
 
        UserGroupInformation 
           .getCurrentUser 
           .setAuthenticationMethod(AuthenticationMethod.KERBEROS) 
 
        // Log in from the keytab and open the JDBC connection as that user 
        UserGroupInformation 
           .loginUserFromKeytabAndReturnUGI(principal, keytabFile) 
           .doAs(new PrivilegedAction[Connection] { 
             override def run(): Connection = 
                  DriverManager.getConnection(url) 
           }) 
 
}, 
"SELECT * FROM books WHERE id >= ? AND id <= ?", 
1,           // lowerBound    - the value bound to the first placeholder (minimum id) 
20,          // upperBound    - the value bound to the second placeholder (maximum id) 
4)           // numPartitions - the number of parallel queries 

Spark runs the SQL passed into the constructor of JdbcRDD, but instead of running it as a single query, it is able to chunk it using the last three parameters as a guide.

So, in this example, four queries would in fact be run in parallel:

SELECT * FROM books WHERE id >= 1 AND id <= 5 
SELECT * FROM books WHERE id >= 6 AND id <= 10 
SELECT * FROM books WHERE id >= 11 AND id <= 15 
SELECT * FROM books WHERE id >= 16 AND id <= 20 
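
The partition boundaries follow from simple integer arithmetic over the inclusive range between lowerBound and upperBound. The sketch below reproduces the bounds for this example, mirroring the kind of calculation JdbcRDD performs internally:

// Inclusive per-partition bounds for lowerBound = 1, upperBound = 20, numPartitions = 4 
val lowerBound = 1L 
val upperBound = 20L 
val numPartitions = 4 
 
val length = BigInt(1) + upperBound - lowerBound        // 20 ids in total 
val bounds = (0 until numPartitions).map { i => 
  val start = lowerBound + (i * length / numPartitions).toLong 
  val end   = lowerBound + ((i + 1) * length / numPartitions).toLong - 1 
  (start, end) 
} 
// bounds == Vector((1,5), (6,10), (11,15), (16,20)) 
bounds.foreach { case (lo, hi) => 
  println(s"SELECT * FROM books WHERE id >= $lo AND id <= $hi") 
} 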

As you can see, Kerberos is a huge and complicated subject. The level of knowledge required for a data scientist can vary depending upon the role. Some organizations will have a DevOps team to ensure that everything is implemented correctly. However, in the current climate, where there is a big skills shortage in the market, it could well be the case that data scientists will have to solve these issues themselves.
