Moving data

Some of the methods of moving data in and out of Databricks have already been explained in Chapter 8, Spark Databricks, and Chapter 9, Databricks Visualization. What I would like to do in this section is provide an overview of all of the methods available for moving data. I will examine the options for tables, workspaces, jobs, and Spark code.

The table data

The table import functionality for Databricks cloud allows data to be imported from an AWS S3 bucket, from the Databricks file system (DBFS), via JDBC, and finally from a local file. This section gives an overview of each type of import, starting with S3. Importing table data from AWS S3 requires the AWS access key, the AWS secret key, and the S3 bucket name. The following screenshot shows an example. I have already provided an example of S3 bucket creation, including adding an access policy, so I will not cover it again.
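
The import form just captures the credentials and the bucket name; the same parameters can also be supplied from a notebook cell. The following is a minimal Scala sketch, assuming a newer Databricks runtime where spark is the predefined session and sc the Spark context, that the s3a connector is available on the cluster, and that the key values, bucket name, and file path are placeholders:

// Pass placeholder AWS credentials to the underlying Hadoop S3 connector
val awsAccessKey = "YOUR_AWS_ACCESS_KEY"   // placeholder value
val awsSecretKey = "YOUR_AWS_SECRET_KEY"   // placeholder value
val bucketName   = "your-s3-bucket"        // placeholder bucket name

sc.hadoopConfiguration.set("fs.s3a.access.key", awsAccessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", awsSecretKey)

// Read a CSV file from the bucket into a DataFrame and preview a few rows
val s3Df = spark.read
  .option("header", "true")
  .csv(s"s3a://$bucketName/data/example.csv")

s3Df.show(5)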

Once the form details are added, you will be able to browse your S3 bucket for a data source. Selecting DBFS as a table data source enables your DBFS folders and files to be browsed. Once a data source is selected, a preview can be displayed, as the following screenshot shows.
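
Screenshot aside, the same DBFS browsing and preview can be performed directly from a Scala notebook cell. This is a minimal sketch, assuming the standard dbutils and display helpers of a Databricks notebook and using placeholder DBFS paths:

// List the files that have been uploaded to DBFS
display(dbutils.fs.ls("/FileStore/tables"))

// Preview the first few lines of a candidate data source
sc.textFile("dbfs:/FileStore/tables/example.csv")
  .take(5)
  .foreach(println)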

Selecting JDBC as a table data source allows you to specify a remote SQL database as a data source. Just add an access URL, Username, and Password. Also, add some SQL to define the table and the columns to source. There is also the option of adding extra properties to the call via the Add Property button, as the following screenshot shows.
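
In a notebook, the equivalent JDBC read can be expressed in a few lines of Scala. This is a minimal sketch, assuming a newer runtime with the spark session available, a suitable JDBC driver attached to the cluster as a library, and placeholder values for the URL, credentials, query, and extra property:

import java.util.Properties

val jdbcUrl = "jdbc:mysql://dbhost:3306/mydb"    // placeholder database URL

val connectionProps = new Properties()
connectionProps.setProperty("user", "dbuser")           // placeholder username
connectionProps.setProperty("password", "dbpassword")   // placeholder password
connectionProps.setProperty("fetchsize", "1000")        // extra property, as with Add Property

// The table argument can be a table name or a bracketed SQL query with an alias
val jdbcDf = spark.read.jdbc(
  jdbcUrl,
  "(SELECT id, name FROM customers) AS src",   // placeholder SQL defining the table and columns
  connectionProps
)

jdbcDf.show(5)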

Selecting the File option to populate a Databricks cloud instance table from a file creates a file drop or browse window. This upload method was used previously to upload CSV-based data into a table. Once the data source is specified, it is possible to specify a data separator string and a header row, define column names and column types, and preview the data before creating the table.
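
Once a file has been uploaded to DBFS, the same separator, header, and column type settings can also be applied in code. The following is a minimal Scala sketch with a placeholder path, placeholder schema, and placeholder table name:

import org.apache.spark.sql.types._

// Placeholder schema defining column names and types explicitly
val uploadSchema = StructType(Seq(
  StructField("id",    IntegerType, nullable = false),
  StructField("name",  StringType,  nullable = true),
  StructField("total", DoubleType,  nullable = true)
))

val csvDf = spark.read
  .option("sep", ",")         // data separator string
  .option("header", "true")   // the first row holds column names
  .schema(uploadSchema)       // explicit column names and types
  .csv("dbfs:/FileStore/tables/uploaded_example.csv")   // placeholder upload path

csvDf.show(5)                                // preview the data
csvDf.write.saveAsTable("uploaded_example")  // create the table from the previewed data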

Folder import

From either a workspace or a folder drop-down menu, it is possible to import an item. The following screenshot shows a composite image taken from the Import Item menu option.

This creates a file drop or browse window which, when clicked, allows you to browse the local server for the items to import. Selecting the All Supported Types option shows that the items to import can be JAR files, dbc archives, or Scala, Python, or SQL files.

Library import

The following screenshot shows the New Library functionality, available from the Workspace and folder menu options. This allows an externally created and tested library to be loaded into your Databricks cloud instance. The library can be in the form of a Java or Scala JAR file, a Python Egg, or a Maven coordinate for repository access. In the following screenshot, a JAR file is being selected from the local server via a browse window. This functionality has been used in this chapter to test stream-based Scala programming.
