Many installations of Apache Spark use Kerberos to provide security and authentication to services such as HDFS and Kafka, and it's especially common when integrating with third-party databases and legacy systems. As a commercial data scientist, you'll probably find yourself having to work with data in a Kerberized environment at some point, so, in this part of the chapter, we'll cover the basics of Kerberos - what it is, how it works, and how to use it.
Kerberos is a trusted third-party authentication protocol that's particularly useful where the primary form of communication is over a network, which makes it ideal for Apache Spark. It's used in preference to alternative methods of authentication, for example, username and password, because it provides the following benefits:

- Passwords are never transmitted over the network; authentication is based on time-limited, encrypted tickets
- It offers single sign-on: authenticate once and access many services, such as HDFS and Kafka, without re-entering credentials
- It provides mutual authentication, so the client and the service each verify the other's identity
- Tickets expire after a set period, limiting the window of opportunity if they are ever compromised
Let's look at how it works with Apache Spark.
In the most basic use case, once you're logged on to an edge node (or similar) of your secure Hadoop cluster, Kerberos must be initialized before running your Spark program. This is done by running the kinit command provided by the Kerberos client tools and entering your user's password when prompted:
> kinit
Password for user:
> spark-shell
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val file = sc.textFile("hdfs://...")
scala> file.count
At this point, you will be fully authenticated and able to access any data within HDFS, subject to the standard permissions model.
The process seems simple enough, so let's take a deeper look at what happened here:
When the kinit command runs, it immediately sends a request to the Kerberos Key Distribution Centre (KDC) to acquire a ticket granting ticket (TGT). The request is sent in plain text and essentially contains the principal, which is basically "username@kerberosdomain" in this case (you can find out this string using the klist command). The Authentication Server (AS) responds with a TGT encrypted using the client's secret key, a key derived from the user's password that was shared ahead of time and is already known to the AS. This ensures the TGT is transferred securely.
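You can inspect your principal and the ticket you've just been granted by running klist straight after kinit. The exact output varies by platform, but a typical MIT Kerberos session looks something like this (the realm, cache location, and timestamps here are illustrative):

> klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: username@EXAMPLE.COM

Valid starting       Expires              Service principal
09/01/2023 09:00:00  09/01/2023 19:00:00  krbtgt/EXAMPLE.COM@EXAMPLE.COM

Note the 10-hour lifetime of the krbtgt entry; this is the ticket expiry discussed next.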
By default, Kerberos tickets last for 10 hours and then expire, making them useless after this time, but they can be renewed. Therefore, when executing long-running Spark jobs or Spark Streaming jobs (or any job where a user is not directly involved and kinit cannot be run manually), it is possible to pass enough information when starting a Spark process to automate the renewal of the tickets issued during the handshake discussed previously.
This is done by passing in the location of the keytab file and the associated principal using the command-line options provided, like so:
spark-submit --master yarn-client \
  --class SparkDriver \
  --files keytab.file \
  --keytab keytab.file \
  --principal username@domain \
  application.jar
When attempting to execute a long-running job as your local user, the principal name can be found using klist; otherwise, dedicated service principals can be configured within Kerberos using ktutil and kadmin.
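If you do need to generate a keytab yourself, a minimal sketch using the MIT Kerberos ktutil tool follows; the principal, encryption type, and key version number here are assumptions that should really come from your Kerberos administrator:

> ktutil
ktutil:  addent -password -p username@EXAMPLE.COM -k 1 -e aes256-cts-hmac-sha1-96
Password for username@EXAMPLE.COM:
ktutil:  wkt keytab.file
ktutil:  quit

The resulting keytab.file holds the key derived from your password, so it should be protected as carefully as the password itself.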
When working in a corporate setting, it may be necessary to connect to a third-party database that has been secured with Kerberos, such as PostgreSQL or Microsoft SQL Server.
In this situation, it is possible to use Spark's JdbcRDD to connect directly to the database and have Spark issue a SQL query to ingest data in parallel. Care should be taken when using this approach, as traditional databases are not built for high levels of parallelism, but used sensibly it can be a very useful technique, particularly well suited to rapid data exploration.
Firstly, you will need the native JDBC drivers for your particular database - here we've used Microsoft SQL Server as an example, but drivers should be available for all modern databases that support Kerberos (see RFC 1964).
You'll need to configure spark-shell to use the JDBC drivers on startup, like so:
> JDBC_DRIVER_JAR=sqljdbc.jar
> spark-shell --master yarn-client \
    --driver-class-path $JDBC_DRIVER_JAR \
    --files keytab.file \
    --conf spark.driver.extraClassPath=$JDBC_DRIVER_JAR \
    --conf spark.executor.extraClassPath=$JDBC_DRIVER_JAR \
    --jars $JDBC_DRIVER_JAR
Then, in the shell, type or paste the following, replacing the environment-specific values (the principal and the JDBC connection URL) with your own:
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.SparkFiles
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.security.UserGroupInformation.AuthenticationMethod
import java.security.PrivilegedAction
import java.sql.{Connection, DriverManager}

// Replace with your own principal
val principal = "username@domain"

new JdbcRDD(sc, () => {

  val driverClassName = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  val url = "jdbc:sqlserver://" +
    "host:port;instanceName=DB;" +
    "databaseName=mydb;" +
    "integratedSecurity=true;" +
    "authenticationScheme=JavaKerberos"
  Class.forName(driverClassName)

  // Load the cluster configuration so that Hadoop security is enabled
  val conf = new Configuration
  conf.addResource("/etc/hadoop/conf/core-site.xml")
  conf.addResource("/etc/hadoop/conf/mapred-site.xml")
  conf.addResource("/etc/hadoop/conf/hdfs-site.xml")
  UserGroupInformation.setConfiguration(conf)
  UserGroupInformation
    .getCurrentUser
    .setAuthenticationMethod(AuthenticationMethod.KERBEROS)

  // Log in using the keytab shipped via --files, and open the
  // JDBC connection as the authenticated user
  UserGroupInformation
    .loginUserFromKeytabAndReturnUGI(principal, SparkFiles.get("keytab.file"))
    .doAs(new PrivilegedAction[Connection] {
      override def run(): Connection = DriverManager.getConnection(url)
    })
},
"SELECT * FROM books WHERE id >= ? AND id <= ?",
1,   // lowerBound - the minimum value of the first placeholder
20,  // upperBound - the maximum value of the second placeholder
4)   // numPartitions - the number of partitions
Spark runs the SQL statement passed into the JdbcRDD constructor, but instead of running it as a single query, it chunks the work into ranges, using the last three parameters as a guide.
So, in this example, four queries would in fact be run in parallel:
SELECT * FROM books WHERE id >= 1 AND id <= 5
SELECT * FROM books WHERE id >= 6 AND id <= 10
SELECT * FROM books WHERE id >= 11 AND id <= 15
SELECT * FROM books WHERE id >= 16 AND id <= 20
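The JdbcRDD constructor also takes an optional final argument, mapRow, a function applied to each java.sql.ResultSet row; by default, every row is returned as an Array[Object]. As a minimal sketch, assuming the connection-opening function from the previous snippet has been bound to a val named connectionFactory, and that the books table has a title column (both names are our assumptions):

import java.sql.ResultSet
import org.apache.spark.rdd.JdbcRDD

// Map each row to an (id, title) tuple rather than the default Array[Object]
val books = new JdbcRDD(
  sc,
  connectionFactory, // the () => Connection function from the previous snippet
  "SELECT id, title FROM books WHERE id >= ? AND id <= ?",
  1, 20, 4,
  (rs: ResultSet) => (rs.getInt("id"), rs.getString("title")))

books.take(5).foreach(println)

This keeps the type handling in one place and gives you an RDD[(Int, String)] to work with directly.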
As you can see, Kerberos is a huge and complicated subject. The level of knowledge required for a data scientist can vary depending upon the role. Some organizations will have a DevOps team to ensure that everything is implemented correctly. However, in the current climate, where there is a big skills shortage in the market, it could well be the case that data scientists will have to solve these issues themselves.