
4. Loading Bulk Data into Snowflake

Dmitry Anoshin (1), Dmitry Shirokov (2), and Donna Strok (3)
(1) British Columbia, Canada
(2) Burnaby, BC, Canada
(3) Seattle, WA, USA
 
There are two ways to load data into Snowflake: bulk data loading with the COPY statement and continuous data loading with Snowpipe. This chapter is focused on bulk data loading. We will cover the following topics:
  • Overview of bulk data loading: We will cover what bulk data loading is, file load locations, supported file formats and encoding, compression handling, and encryption options.

  • Bulk data loading recommendations: We will cover file preparation including file sizing and splitting, the CSV and semistructured formats, staging, loading, and querying.

Note

Continuous data loading with Snowpipe is covered in Chapter 6.

Overview of Bulk Data Loading

Bulk loading data with the COPY statement predates Snowflake; many other database management systems support a COPY statement, so it is no surprise that Snowflake offers the same capability. To better understand bulk data loading, we will review and answer these key questions:
  • What is bulk data loading?

  • Where can we bulk data load from?

  • What are the compression and encryption options?

  • What file formats are supported?

What Is Bulk Data Loading?

To get data into a database table, you need to insert it. Insert statements can take a while since they need to be executed one row at a time. Bulk copying can take a large amount of data and insert it into a database all in one batch. The bulk data loading option in Snowflake allows batch loading of data from files that are in cloud storage, like AWS S3.

If your data files are not currently in cloud storage, then there is an option to copy the data files from a local machine to a cloud storage staging area before loading them into Snowflake. This is known as Snowflake’s internal staging area. The data files are transmitted from a local machine to an internal, Snowflake-designated, cloud storage staging location and then loaded into tables using the COPY command.
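For example, here is a minimal sketch (the file, stage, and table names are hypothetical) of staging a local file in a table's internal stage with the PUT command and then loading it with COPY:
put file:///tmp/customers.csv @%customers;
copy into customers from @%customers file_format = (type = 'CSV' skip_header = 1);
The PUT command is issued from a client such as SnowSQL; by default, the file is gzip-compressed during upload, and COPY detects that compression automatically.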

Tip

External tables can be created instead of loading data into Snowflake. This would be useful when only a portion of data is needed.
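As a hedged sketch (the stage and table names are hypothetical), an external table over Parquet files already sitting in an external stage can be queried in place without loading:
create external table ext_sales
  with location = @my_ext_stage/sales/
  file_format = (type = 'PARQUET');
select value from ext_sales limit 10;
Each row of an external table is exposed through a VARIANT column named VALUE.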

Bulk Load Locations

Snowflake supports loading data from files staged in any of the following cloud storage locations, regardless of the cloud platform for your Snowflake account:
  • Snowflake-designated internal storage staging location

  • AWS S3, where files can be loaded directly from any user-supplied S3 bucket

  • GCP Cloud Storage, where files can be loaded directly from any user-supplied GCP Cloud Storage container

  • Azure Blob storage, where files can be loaded directly from any user-supplied Azure container

Note

Data transfer billing charges may apply when loading data from files staged across different platforms. Refer to Chapter 2 for more information.

Supported File Formats and Encoding

Snowflake supports most of the common file formats used for loading data. These file formats include the following:
  • Delimited files (any valid delimiter is supported; the default is a comma)

  • JSON and XML

  • Avro, including the automatic detection and processing of staged Avro files that were compressed using Snappy

  • ORC, including the automatic detection and processing of staged ORC files that were compressed using Snappy or zlib

  • Parquet, including the automatic detection and processing of staged Parquet files that were compressed using Snappy

For delimited files, the default character set is UTF-8. To use any other character set, you must explicitly specify the encoding to use for loading. For all other supported file formats (JSON, Avro, etc.), the only supported character set is UTF-8.

Note

Many character encoding sets are supported for the loading of delimited files. Refer to Snowflake’s online documentation for more details on which character encodings are supported.

Snowflake also allows you to configure a file format object for reuse. This can be done through the user interface or with SQL code. This is useful for formats that are frequently used by many load jobs. Here is an example of a file format named "DEMO_DB"."PUBLIC".sample_file_format. The file format object name can then be referenced during the bulk load command.
CREATE FILE FORMAT "DEMO_DB"."PUBLIC".sample_file_format
    TYPE = 'CSV'
    COMPRESSION = 'AUTO'
    FIELD_DELIMITER = ','
    RECORD_DELIMITER = '\n'
    SKIP_HEADER = 0
    FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
    TRIM_SPACE = FALSE
    ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
    ESCAPE = 'NONE'
    ESCAPE_UNENCLOSED_FIELD = '\134'
    DATE_FORMAT = 'AUTO'
    TIMESTAMP_FORMAT = 'AUTO'
    NULL_IF = ('\\N');
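For instance, here is a hedged example (the stage and table names are hypothetical) of referencing the named file format during a load:
copy into sample_table
  from @my_stage/data/
  file_format = (format_name = 'DEMO_DB.PUBLIC.sample_file_format');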

Compression Handling

When staging uncompressed files in a Snowflake stage, the files are automatically compressed using gzip, unless compression is explicitly disabled. Snowflake can automatically detect the gzip, bzip2, deflate, and raw_deflate compression methods. Autodetection is not yet supported for brotli and zstandard; therefore, when staging or loading files compressed with either of these methods, you must explicitly specify the compression method that was used.
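For example, a hedged sketch (the stage and table names are hypothetical) of loading brotli-compressed files by declaring the compression explicitly:
copy into my_table
  from @my_stage/brotli_data/
  file_format = (type = 'CSV' compression = 'BROTLI');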

Encryption Options

When staging unencrypted files in an internal Snowflake location, the files are automatically encrypted using 128-bit keys. 256-bit keys can be enabled (for stronger encryption); however, additional configuration is required. Files that are already encrypted can be loaded into Snowflake from external cloud storage; the key used to encrypt the files must be provided to Snowflake.
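Here is a hedged sketch of the external case; the bucket, credentials, and key values are placeholders:
create stage encrypted_stage
  url = 's3://bucketname/encrypted_data/'
  credentials = (aws_key_id = '1234lkj' aws_secret_key = 'asdlj1234')
  encryption = (type = 'AWS_CSE' master_key = 'base64-encoded-key-goes-here');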

Bulk Data Loading Recommendations

Loading large data sets can affect query performance. Snowflake recommends dedicating separate warehouses to loading and querying operations to optimize the performance for each. In this section, we will cover the recommended ways to prepare the files.
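As a sketch of that recommendation (the warehouse name and size are arbitrary), a warehouse reserved for load jobs might be created like this:
create warehouse load_wh
  warehouse_size = 'SMALL'
  auto_suspend = 60
  auto_resume = true
  comment = 'dedicated to bulk loading';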

File Preparation

The number of data files that can be processed in parallel is determined by the number and capacity of servers in a warehouse. If you follow the file sizing guidelines described in the following section, the data loading will require minimal resources. Note that these recommendations apply to both bulk data loads and continuous loading using Snowpipe.

File Sizing

Know the following about file sizing:
  • The number of load operations that can run in parallel cannot exceed the number of data files to be loaded.

  • To optimize the number of parallel operations for a load, we recommend aiming to produce data files roughly 10 MB to 100 MB in size, compressed.

  • Aggregate smaller files to minimize the processing overhead for each file.

  • Split larger files into a greater number of smaller files to distribute the load among the servers in an active warehouse. The number of data files processed in parallel is determined by the number and capacity of servers in a warehouse.

  • Snowflake recommends splitting large files by line to avoid records that span chunks.

  • Data loads of large Parquet files (e.g., greater than 3 GB) could time out. Split large files into files 1 GB in size (or smaller) for loading.

File Splitting

If your source database does not allow you to export data files in smaller chunks, use a third-party utility to split large CSV files. Windows does not include a native file split utility; however, Windows supports many third-party tools and scripts that can split large data files. Linux has the split utility, which enables you to split a CSV file into multiple smaller files.

Note

Splitting larger data files allows the load to scale linearly. Using a larger warehouse (X-Large, 2X-Large, etc.) will consume more credits and may not result in any performance increase.

CSV File Preparation

Consider the following guidelines when preparing your delimited text (CSV) files for loading:
  • UTF-8 is the default character set; however, additional encodings are supported. Use the ENCODING file format option to specify the character set for the data files.

  • Snowflake supports ASCII characters (including high-order characters) as delimiters. Common field delimiters include the pipe (|), comma (,), caret (^), and tilde (~).

  • A field can be optionally enclosed by double quotes, and, within the field, all special characters are automatically escaped, except the double quote itself, which must be escaped by doubling it (""). For unenclosed fields, a backslash (\) is the default escape character.

  • Common escape sequences can be used (e.g., \t for tab, \n for newline, \r for carriage return, and \\ for backslash).

  • Fields containing carriage returns should also be enclosed in quotes (single or double).

  • The number of columns in each row should be consistent.
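Putting several of these guidelines together, here is a minimal sketch (the format name, delimiter, and encoding are only examples) of a reusable file format for pipe-delimited, Latin-1-encoded files with quoted fields:
create file format my_csv_format
  type = 'CSV'
  field_delimiter = '|'
  field_optionally_enclosed_by = '"'
  encoding = 'ISO-8859-1'
  skip_header = 1;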

Semistructured Data File Preparation

Semistructured data is data that does not conform to the standards of traditional structured data, but it contains tags or other types of markup that identify individual, distinct entities within the data.

Two of the key attributes that distinguish semistructured data from structured data are nested data structures and the lack of a fixed schema.
  • Structured data requires a fixed schema that is defined before the data can be loaded and queried in a relational database system. Semistructured data does not require a prior definition of a schema and can constantly evolve; i.e., new attributes can be added at any time.

  • In addition, entities within the same class may have different attributes even though they are grouped together, and the order of the attributes is not important.

  • Unlike structured data, which represents data as a flat table, semistructured data can contain any number of levels of nested hierarchies.

The steps for loading semistructured data into tables are identical to those for loading structured data into relational tables. Snowflake loads semistructured data into a single VARIANT column. You can also apply a transformation within the COPY INTO <table> statement to extract selected columns from a staged data file into separate table columns.

When semistructured data is inserted into a VARIANT column, what Snowflake is really doing is extracting information about the key locations and values and saving it into a semistructured document. The document is referenced by the metadata engine for fast SQL retrieval.
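For example, a hedged sketch (the table, stage, and attribute names are hypothetical) of loading JSON files into a VARIANT column and then querying a nested attribute:
create table raw_events (v variant);
copy into raw_events
  from @my_stage/events/
  file_format = (type = 'JSON');
select v:event_type::string, v:payload.id::number
from raw_events;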

Note

VARIANT “null” values (not to be confused with SQL NULL values) are not loaded to the table. To avoid this, extract semistructured data elements containing “null” values into relational columns before loading them. Alternatively, if the “null” values in your files indicate missing values and have no other special meaning, Snowflake recommends setting the file format option STRIP_NULL_VALUES to TRUE when loading the semistructured data files.
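A hedged example of the second approach, reusing the hypothetical names from the previous example:
copy into raw_events
  from @my_stage/events/
  file_format = (type = 'JSON' strip_null_values = true);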

File Staging

Both internal and external stage locations in Snowflake can include a path (referred to as a prefix in AWS). When staging regular data sets, Snowflake recommends partitioning the data into logical paths that identify details such as geographical location, along with the date when the data is written.

Organizing your data files by path allows you to copy the data into Snowflake with a single command. This allows you to execute concurrent COPY statements that match a subset of files, taking advantage of parallel operations.

For example, if you were storing data for a company that did business all over the world, you might include identifiers such as continent, country, and city in paths along with data write dates. Here are two examples:
  • NA/Mexico/Quintana_Roo/Cancun/2020/01/01/01/

  • EU/France/Paris/2020/01/01/05/

When you create a named stage, you can specify any part of a path. For example, create an external stage using one of the previous example paths:
create stage test_stage url='s3://bucketname/united_states/washington/seattle/' credentials=(aws_key_id='1234lkj' aws_secret_key='asdlj1234');
You can also add a path when you stage files in an internal user or table stage. For example, you can stage myfile.csv in a specific path in the t1 table stage with this:
put file:///local/myfile.csv @%t1/united_states/washington/seattle/2020/01/01/01/
When loading your staged data, narrow the path to the most granular level that includes your data for improved data load performance.
  • If the file names match except for a suffix or extension, include the matching part of the file names in the path. Here’s an example:

copy into t1 from @%t1/united_states/washington/seattle/2020/01/01/01/myfile;
  • Add the FILES or PATTERN option. Here’s an example:
    • copy into t1 from @%t1/united_states/california/los_angeles/2016/06/01/11/ files=('mydata1.csv', 'mydata2.csv');

    • copy into t1 from @%t1/united_states/california/los_angeles/2016/06/01/11/ pattern='.*mydata[^[0-9]{1,3}$$].csv';

When planning regular data loads, such as with extract-transform-load (ETL) processing, it is important to partition the data in your internal (i.e., Snowflake) stage or external locations (S3 buckets or Azure containers) using logical, granular paths. Create a partitioning structure that includes identifying details such as the application or location, along with the date when the data was written. You can then copy any fraction of the partitioned data into Snowflake with a single command. You can copy data into Snowflake by the hour, day, month, or even year when you initially populate tables.

Here are some examples of partitioned S3 buckets using paths:
  • s3://bucket_name/brand/2016/07/01/11/

  • s3://bucket_name/region/country/2016/07/01/14/

Note

S3 transmits a directory list with each COPY statement used by Snowflake, so reducing the number of files in each directory improves the performance of your COPY statements.

Loading

The COPY command supports several options for loading data files from a stage:
  • By path of internal location or prefix of external location

  • By listing specific files to load (up to 1,000 per COPY command)

  • By using pattern matching to identify specific files by pattern

These options enable you to copy a fraction of the staged data into a Snowflake table with a single command. This allows you to execute concurrent COPY statements that match a subset of files, taking advantage of parallel operations. Do take special note that the file being copied must have the same data structure (i.e., number of columns, data type) as the table.

Tip

Listing specific files to load from a stage is generally the fastest.

Here’s an example of a list of files:
copy into sample_table from @%sample_data/data1/ files=('sample_file1.csv', 'sample_file2.csv', 'sample_file3.csv')
Here’s a pattern matching example:
copy into sample_table from @%sample_data/data1/ pattern='.*sample_file[^0-9{1,3}$$].csv';

In general, pattern matching using a regular expression is the slowest of the three options for identifying/specifying data files to load from a stage; however, this option works well if you exported your files in named order from your external application and want to batch load the files in the same order. Pattern matching can be combined with paths for further control over data loading.

When data from staged files is loaded successfully, consider removing the staged files to ensure the data isn’t inadvertently loaded again (duplicated). Staged files can be deleted from a Snowflake stage (user stage, table stage, or named stage) using the following methods:
  • Files that were loaded successfully can be deleted from the stage during a load by specifying the PURGE copy option in the COPY INTO <table> command.

  • After the load completes, use the REMOVE command to remove the files in the stage.
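Here is a short sketch of both approaches, reusing the sample stage from the earlier examples:
-- Delete files from the stage as soon as they load successfully.
copy into sample_table from @%sample_data/data1/ purge = true;
-- Or clean up explicitly after the load completes.
remove @%sample_data/data1/;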

Querying Staged Files

Snowflake automatically generates metadata for files in Snowflake’s internal file staging or external (i.e., AWS S3, Google Cloud Storage, or Microsoft Azure) file staging. This metadata can be queried with the following:
  • A standard SELECT statement.

  • During a COPY into a table. Transformations may be applied to the columns in the same SELECT statement.

Note

Querying staged files is intended primarily for simple checks during data loading and is not intended to replace querying tables that have already been loaded.

Here’s the query syntax for a standard SELECT statement:
SELECT [<alias>.]$<file_col_num>[.<element>] [ , [<alias>.]$<file_col_num>[.<element>] , ...  ]
  FROM { <internal_location> | <external_location> }
  [ ( FILE_FORMAT => <namespace>.<named_file_format> ) ]
  [ <alias> ]
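For instance, here is a hedged example (the stage name is hypothetical) of previewing the first two columns of staged files using the file format defined earlier:
select t.$1, t.$2
from @my_stage (file_format => 'DEMO_DB.PUBLIC.sample_file_format') t;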
Here’s the query syntax during a load:
/* Data load with transformation */
COPY INTO [<namespace>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
     FROM ( SELECT [<alias>.]$<file_col_num>[.<element>] [ , [<alias>.]$<file_col_num>[.<element>] ... ]
            FROM { internalStage | externalStage } )
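A hedged example of the same pattern (the table, column, and stage names are hypothetical), reordering two staged columns into a table during the load:
copy into target_table (city, zip_code)
  from (select t.$2, t.$1 from @my_stage/data/ t);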

Bulk Loading with the Snowflake Web Interface

For smaller files (less than 50 MB), loading from the Snowflake web interface is fine.

Note

If a file is large, then it should be loaded using SnowSQL or Snowpipe. See the following chapters for more information on how to bulk load with SnowSQL and Snowpipe.

Bulk File Loading Through the Snowflake User Interface

Prerequisites
  1. You need a Snowflake account; please review Chapter 2 to set one up.

  2. Download the file named zips2000.csv.

Instructions

  1. Log into the Snowflake web user interface.

  2. Click Databases + Tables.

  3. Click Create Table and enter the values shown in Figure 4-1.

     Figure 4-1 Create Table dialog that allows you to enter a table name along with column names and their data types

  4. Click Load Data, and the Load Data wizard (Figure 4-2) will appear. Select the warehouse you want to use and click Next.

     Figure 4-2 The Load Data wizard will appear once you click Load Data

  5. Click Select Files to browse for zips2000.csv in the location you saved it in (Figure 4-3). Click Next.

     Figure 4-3 Source files getting selected

  6. Click the Add (looks like a plus sign) button to add a file format. Enter what is shown in Figure 4-4. Click Finish.

     Figure 4-4 Create File Format dialog

  7. Click Load, and your table will load. Once it has completed loading, you can query it as usual.
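For example, assuming the table was named ZIPS2000 in the Create Table dialog (Figure 4-1), a quick check might be:
select * from zips2000 limit 10;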

     

Summary

In this chapter, we reviewed bulk data loading with COPY, and we covered what bulk data loading is, file load locations, supported file formats and encoding, compression handling, and encryption options. We also covered bulk data loading recommendations including file preparation and staging, loading, and querying. In addition, we went through some sample exercises on bulk loading data using COPY in our virtual warehouse.
