The DbUtils package

The Scala script in the previous section, which used the DbUtils package to create the mount, exercised only a small portion of the package's functionality. In this section, I would like to introduce some more features of the DbUtils package and the Databricks File System (DBFS). To learn more about the package's structure and functionality, you can call its help option from a Notebook connected to a Databricks cluster. As the following screenshot shows, executing dbutils.fs.help() in a Scala Notebook provides help on the fsutils, cache, and mount-based functionality:

[Screenshot: The DbUtils package]
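
For readers without access to the screenshot, the help call can be sketched as a single notebook cell. This assumes a Scala Notebook attached to a running Databricks cluster, where the dbutils object is predefined; it will not compile in a plain Scala REPL.

```scala
// Assumption: this runs in a Databricks Scala notebook, where `dbutils`
// is supplied by the environment rather than imported.
// Prints an overview of the file system utilities (fsutils, cache, mount).
dbutils.fs.help()
```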

It is also possible to obtain help on individual functions, as the text in the previous screenshot shows. The example in the following screenshot explains the cacheTable function, providing descriptive text and a sample function call with the parameter and return types:

[Screenshot: The DbUtils package]
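
As a minimal sketch, help on a single function can be requested by passing the function name as a string. As before, this assumes a Databricks Scala Notebook where dbutils is predefined.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined).
// Prints the description and signature of one function, here cacheTable.
dbutils.fs.help("cacheTable")
```

The same pattern should work for the other function names mentioned in this section, such as cp or head.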

The next section will briefly examine the DBFS before moving on to examining more of the dbutils functionality.

Databricks file system

The DBFS can be accessed using URLs of the form dbfs:/*, and through the functions available within dbutils.fs.

[Screenshot: Databricks file system]

The previous screenshot shows the /mnt file system being examined with the ls function, which lists the mount directories s3data and s3data1. These directories were created during the previous Scala S3 mount example.
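
The listing described above can be sketched as follows. This is only an illustration, assuming a Databricks Scala Notebook where dbutils is predefined; the output depends on which mounts exist in your own workspace.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined).
// ls returns a sequence of FileInfo records, one per directory entry.
val entries = dbutils.fs.ls("dbfs:/mnt")

// Print the full path and size of each entry under /mnt.
entries.foreach(info => println(s"${info.path}  (size: ${info.size})"))
```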

Dbutils fsutils

The fsutils group of functions within the dbutils package includes cp, head, mkdirs, mv, put, and rm. The help calls shown previously provide more information about each of them. You can create a directory on DBFS using the mkdirs call, as shown next. Note that, in this session, I have created a number of directories under dbfs:/ with names of the form data*. The following example creates the directory called data2:

[Screenshot: Dbutils fsutils]

As the previous screenshot shows, executing ls reveals that a number of default directories already exist on DBFS. For instance:

  • /tmp is a temporary area
  • /mnt is a mount point for remote directories, for example, S3
  • /user is a user storage area that currently contains Hive
  • /mount is an empty directory
  • /FileStore is a storage area for tables, JARs, and job JARs
  • /databricks-datasets contains datasets provided by Databricks
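
The mkdirs call and the root listing described above can be sketched as follows, assuming a Databricks Scala Notebook where dbutils is predefined. The directories printed will differ from workspace to workspace.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined).
// Create the data2 directory under the DBFS root; mkdirs returns a Boolean.
val created: Boolean = dbutils.fs.mkdirs("dbfs:/data2")
println(s"mkdirs returned: $created")

// List the DBFS root to see data2 alongside the default directories.
dbutils.fs.ls("dbfs:/").foreach(info => println(info.path))
```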

The dbutils copy command (cp), shown next, copies a file to a DBFS location. In this instance, the external1.txt file has been copied to the /data2 directory, as shown in the following screenshot:

[Screenshot: Dbutils fsutils]
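
A copy of this kind might look like the following sketch. The text does not state where external1.txt was copied from, so the source path used here is purely hypothetical; substitute the real location of your file. The cell assumes a Databricks Scala Notebook where dbutils is predefined.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined).
// Hypothetical source path: the original location of the file is not given.
val source = "dbfs:/mnt/s3data/external1.txt"
val target = "dbfs:/data2/external1.txt"

// cp takes a source and a destination and returns a Boolean.
val copied: Boolean = dbutils.fs.cp(source, target)
println(s"cp returned: $copied")
```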

The head function returns up to the first maxBytes bytes from the start of a file on DBFS. The following example shows the format of the external1.txt file. This is useful because it tells me that this is a CSV file, and therefore how to process it.

[Screenshot: Dbutils fsutils]
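
As a sketch, a preview of the file could be obtained as follows. The value of 256 bytes is an arbitrary choice for illustration, and the cell assumes a Databricks Scala Notebook where dbutils is predefined.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined).
// Read the beginning of the file as a String; 256 is an arbitrary maxBytes value.
val preview: String = dbutils.fs.head("dbfs:/data2/external1.txt", 256)

// Inspect the first lines to judge the file format (for example, CSV).
println(preview)
```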

It is also possible to move files within DBFS. The following screenshot shows the mv command being used to move the external1.txt file from the directory data2 to the directory called data1. The ls command is then used to confirm the move.

[Screenshot: Dbutils fsutils]
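
The move and the confirming listing can be sketched as follows. This assumes that the data1 directory already exists, and that the cell runs in a Databricks Scala Notebook where dbutils is predefined.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined),
// and dbfs:/data1 already exists.
// Move the file from data2 to data1; mv returns a Boolean.
val moved: Boolean = dbutils.fs.mv("dbfs:/data2/external1.txt", "dbfs:/data1/external1.txt")
println(s"mv returned: $moved")

// Confirm the move by listing the target directory.
dbutils.fs.ls("dbfs:/data1").foreach(info => println(info.name))
```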

Finally, the remove function (rm) is used to remove the file called external1.txt, which was just moved. The following ls function call shows that the file no longer exists within the data1 directory, because there is no FileInfo record in the function output:

[Screenshot: Dbutils fsutils]
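
A minimal sketch of the removal, followed by a check that no FileInfo record for the file remains, might look like this. As with the other examples, it assumes a Databricks Scala Notebook where dbutils is predefined.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined).
// Remove the file that was just moved; rm returns a Boolean.
val removed: Boolean = dbutils.fs.rm("dbfs:/data1/external1.txt")

// The file is gone if no FileInfo record in the listing carries its name.
val stillPresent: Boolean =
  dbutils.fs.ls("dbfs:/data1").exists(_.name == "external1.txt")

println(s"rm returned: $removed, file still listed: $stillPresent")
```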

The DbUtils cache

The cache functionality within DbUtils provides the means to cache (and uncache) both tables and files to DBFS. In fact, the tables are also saved as files, in the DBFS directory called /FileStore. The following screenshot shows the cache functions that are available:

[Screenshot: The DbUtils cache]
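
As a tentative sketch only, caching and uncaching a table might look like the following. The table name is hypothetical, and the single String parameter is an assumption about the signature; check the output of dbutils.fs.help("cacheTable") on your own cluster, because these cache calls may be deprecated or behave differently in more recent Databricks runtimes.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined),
// on a runtime version that still provides the cache functions.
// Hypothetical table name, used purely for illustration.
val tableName = "my_table"

// Cache the table, then remove it from the cache again.
dbutils.fs.cacheTable(tableName)
dbutils.fs.uncacheTable(tableName)
```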

The DbUtils mount

The mount functionality allows you to mount remote file systems, refresh mounts, display mount details, and unmount specific mounted directories. An example of an S3 mount was given in the previous sections, so I won't repeat it here. The following screenshot shows the output from the mounts function. I created the s3data and s3data1 mounts; the other two mounts, for root and datasets, already existed. The mounts are returned as a sequence of MountInfo objects. I have rearranged the text to make it more readable and to present it better on the page.

[Screenshot: The DbUtils mount]
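
The mounts call can be sketched as follows, printing one line per MountInfo object instead of the raw sequence. The cell assumes a Databricks Scala Notebook where dbutils is predefined. The refresh and unmount calls are left commented out, because unmounting changes the state of the workspace.

```scala
// Assumption: executed in a Databricks Scala notebook (`dbutils` is predefined).
// mounts() returns a sequence of MountInfo objects; print each mount point
// together with the remote source it maps to.
dbutils.fs.mounts().foreach { m =>
  println(s"${m.mountPoint} -> ${m.source}")
}

// Uncomment to refresh the mount table, or to remove one of the mounts.
// dbutils.fs.refreshMounts()
// dbutils.fs.unmount("/mnt/s3data1")
```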