Moving data

Some of the methods of moving data in and out of Databricks have already been explained in Chapter 8, Spark Databricks, and Chapter 9, Databricks Visualization. What I would like to do in this section is provide an overview of all of the methods available for moving data. I will examine the options for tables, workspaces, jobs, and Spark code.

The table data

The table import functionality for Databricks cloud allows data to be imported from an AWS S3 bucket, from the Databricks file system (DBFS), via JDBC, and finally from a local file. This section gives an overview of each type of import, starting with S3. Importing table data from AWS S3 requires the AWS access key, the AWS secret key, and the S3 bucket name. The following screenshot shows an example. I have already provided an example of S3 bucket creation, including adding an access policy, so I will not cover it again.
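
The import form just captures the credentials and the bucket name; the same parameters can also be supplied from a notebook cell. The following is a minimal Scala sketch, assuming a newer Databricks runtime where spark is the predefined session and sc the Spark context, that the s3a connector is available on the cluster, and that the key values, bucket name, and file path are placeholders:

// Pass placeholder AWS credentials to the underlying Hadoop S3 connector
val awsAccessKey = "YOUR_AWS_ACCESS_KEY"   // placeholder value
val awsSecretKey = "YOUR_AWS_SECRET_KEY"   // placeholder value
val bucketName   = "your-s3-bucket"        // placeholder bucket name

sc.hadoopConfiguration.set("fs.s3a.access.key", awsAccessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", awsSecretKey)

// Read a CSV file from the bucket into a DataFrame and preview a few rows
val s3Df = spark.read
  .option("header", "true")
  .csv(s"s3a://$bucketName/data/example.csv")

s3Df.show(5)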

Once the form details are added, you will be able to browse your S3 bucket for a data source. Selecting DBFS as a table data source enables your DBFS folders and files to be browsed. Once a data source is selected, a preview can be displayed, as the following screenshot shows.
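
Screenshot aside, the same DBFS browsing and preview can be performed directly from a Scala notebook cell. This is a minimal sketch, assuming the standard dbutils and display helpers of a Databricks notebook and using placeholder DBFS paths:

// List the files that have been uploaded to DBFS
display(dbutils.fs.ls("/FileStore/tables"))

// Preview the first few lines of a candidate data source
sc.textFile("dbfs:/FileStore/tables/example.csv")
  .take(5)
  .foreach(println)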

Selecting JDBC as a table data source allows you to specify a remote SQL database as a data source. Just add an access URL, Username, and Password. Also, add some SQL to define the table and the columns to source. There is also the option of adding extra properties to the call via the Add Property button, as the following screenshot shows.
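
In a notebook, the equivalent JDBC read can be expressed in a few lines of Scala. This is a minimal sketch, assuming a newer runtime with the spark session available, a suitable JDBC driver attached to the cluster as a library, and placeholder values for the URL, credentials, query, and extra property:

import java.util.Properties

val jdbcUrl = "jdbc:mysql://dbhost:3306/mydb"    // placeholder database URL

val connectionProps = new Properties()
connectionProps.setProperty("user", "dbuser")           // placeholder username
connectionProps.setProperty("password", "dbpassword")   // placeholder password
connectionProps.setProperty("fetchsize", "1000")        // extra property, as with Add Property

// The table argument can be a table name or a bracketed SQL query with an alias
val jdbcDf = spark.read.jdbc(
  jdbcUrl,
  "(SELECT id, name FROM customers) AS src",   // placeholder SQL defining the table and columns
  connectionProps
)

jdbcDf.show(5)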

Selecting the File option to populate a Databricks cloud instance table from a file creates a file drop or browse window. This upload method was used previously to upload CSV-based data into a table. Once the data source is specified, it is possible to specify a data separator string and a header row, define column names and column types, and preview the data before creating the table.
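
Once a file has been uploaded to DBFS, the same separator, header, and column type settings can also be applied in code. The following is a minimal Scala sketch with a placeholder path, placeholder schema, and placeholder table name:

import org.apache.spark.sql.types._

// Placeholder schema defining column names and types explicitly
val uploadSchema = StructType(Seq(
  StructField("id",    IntegerType, nullable = false),
  StructField("name",  StringType,  nullable = true),
  StructField("total", DoubleType,  nullable = true)
))

val csvDf = spark.read
  .option("sep", ",")         // data separator string
  .option("header", "true")   // the first row holds column names
  .schema(uploadSchema)       // explicit column names and types
  .csv("dbfs:/FileStore/tables/uploaded_example.csv")   // placeholder upload path

csvDf.show(5)                                // preview the data
csvDf.write.saveAsTable("uploaded_example")  // create the table from the previewed data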

Folder import

From either a workspace or a folder drop-down menu, it is possible to import an item. The following screenshot shows a composite image taken from the Import Item menu option.

This creates a file drop or browse window which, when clicked, allows you to browse the local server for the items to import. Selecting the All Supported Types option shows that the items to import can be JAR files, dbc archives, or Scala, Python, or SQL files.

Library import

The following screenshot shows the New Library functionality, available from the Workspace and folder menu options. This allows an externally created and tested library to be loaded into your Databricks cloud instance. The library can be in the form of a Java or Scala JAR file, a Python Egg, or a Maven coordinate for repository access. In the following screenshot, a JAR file is being selected from the local server via a browse window. This functionality has been used in this chapter to test stream-based Scala programming.
