Chapter 4: Managing Your Data in CAS

Overview

Getting Started with Caslibs and CAS Tables

Loading Data into a CAS Table

Displaying Data in a CAS Table

Computing Simple Statistics

Dropping a CAS Table

CAS Data Types

Caslib and CAS Table Visibility

The Active Caslib

Uploading Data Files to CAS Tables

Uploading Data from URLs to CAS Tables

Uploading Data from a Pandas DataFrame to a CAS Table

Using Data Message Handlers

The HTML Data Message Handler

The Excel Data Message Handler

The PandasDataFrame Data Message Handler

Using Data Message Handlers with Databases

Writing Your Own Data Message Handlers

Variable Definition Details

Adding Data Transformers

Managing Caslibs

Creating a Caslib

Setting an Active Caslib

Dropping a Caslib

Conclusion

Before you can do any sort of analysis in CAS, you need some data to work with. There are two components to data access in CAS: caslibs and CAS tables. Caslibs are definitions that give access to a resource that contains data. These resources can be files that are located in a file system, a database, streaming data from an ESP (Event Stream Processing) server, or other data sources that SAS can access. Caslibs contain all of the connection information as well as authentication and authorization settings. In addition, caslibs are containers for tables that are loaded in the server, effectively enabling you to create namespaces for tables.

When you want to analyze data from one of your caslib resources, you load the data into a CAS table. A CAS table contains columns of data and information about the data in the columns (including column labels, data types, data formats, and so on).

CAS tables and caslibs are the topic of this chapter.

Overview

The subject of caslibs and CAS tables is fairly extensive, especially when you consider the multiple ways of loading and parsing data from various sources. We have attempted to boil it down to one simple diagram to give you an overall view of the process and components involved.

[Image: overview diagram of the SWAT data loading methods, CAS actions, caslibs, and tables]

From the preceding diagram, you see that SWAT has multiple methods for loading data, and the CAS actions used to load the data vary as well. Each session can have one or more caslibs, with each connected to a different data source. The caslibs and tables within the caslibs might have session scope or global scope depending on the options that they were created with. These details are covered in the following sections.

Getting Started with Caslibs and CAS Tables

Since all data that is accessed by CAS must be in a caslib, it makes sense to start there when discussing how to get your data into CAS. Depending on your CAS server setup, you might already have one or more caslibs configured and ready to use. To find out what caslibs are available, use the caslibinfo action.

In [1]: import swat

 

In [2]: conn = swat.CAS('server-name.mycompany.com', 5570,

   ...:                 'username', 'password')

 

In [3]: conn.caslibinfo()

Out[3]:

[CASLibInfo]

 

                 Name  Type                  Description  

 0  CASUSER(username)  PATH  Personal File System Caslib

 1            Formats  PATH                Format Caslib

 

                     Path Definition  Subdirs  Local  Active  

 0           /u/username/                 1.0    0.0     1.0

 1  /bigdisk/lax/formats/                 1.0    0.0     0.0

 

    Personal  Hidden

 0       1.0     0.0

 1       0.0     0.0

 

+ Elapsed: 0.000434s, mem: 0.0971mb

We can see from this that we have two caslibs defined already. The first caslib points to our home directory. The second caslib is for holding user-defined data formats. Both of these initial caslibs are set up by a system administrator. Depending on your permissions, you might be able to add others as needed. You see that the table describing the caslibs also contains additional information about each caslib.

When you create a caslib, you can supply a human-readable description as well as information about its breadth and scope. The Subdirs column indicates whether subdirectories of the caslib's path can be accessed: a value of 0 means no, and 1 means yes. The Local column indicates whether the caslib is visible only in the current session (indicated by 1) or in all sessions (indicated by 0). The Active column indicates which caslib is active; the active caslib is used whenever a caslib is not explicitly specified in an operation that requires one. The Personal column indicates whether the caslib is specific to the user; a personal caslib can be seen only by the current user regardless of whether it is local or global in scope. Finally, the Hidden column indicates whether the caslib is suppressed from caslib listings; a hidden caslib is displayed only when the showhidden parameter of the caslibinfo action is set to True.
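Because the result tables that SWAT returns are pandas DataFrames, these flag columns can be filtered like any other DataFrame. Here is a minimal sketch using made-up rows shaped like the CASLibInfo listing above; on a live connection you would index the caslibinfo result itself rather than construct a frame by hand.

```python
import pandas as pd

# Made-up rows mirroring the CASLibInfo listing (for illustration only)
caslibs = pd.DataFrame({
    'Name': ['CASUSER(username)', 'Formats'],
    'Subdirs': [1.0, 1.0],
    'Local': [0.0, 0.0],
    'Active': [1.0, 0.0],
    'Personal': [1.0, 0.0],
    'Hidden': [0.0, 0.0],
})

# The active caslib is the row where Active == 1
active = caslibs.loc[caslibs['Active'] == 1, 'Name'].iloc[0]
print(active)  # CASUSER(username)
```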

Now that we know what caslibs are available to us, let’s see what they contain. We have some data files that are stored in the data subdirectory of our home directory, which is pointed to by the caslib Casuser. Let’s use the fileinfo action to list the contents of that directory.

In [4]: conn.fileinfo('data', caslib='casuser')

Out[4]:

[FileInfo]

 

     Permission     Owner  Group                               Name  

 0   -rw-r--r--  username  users                           iris.csv

 1   -rw-r--r--  username  users                           cars.csv

 2   -rw-r--r--  username  users              sashelp_class.sashdat

 3   -rw-r--r--  username  users                          class.csv

 

        Size Encryption                Time

 0      3716             18Feb2016:17:25:43

 1     42177             18Feb2016:17:25:49

 2     82136       NONE  19Feb2016:17:02:37

 3       519             14Apr2016:10:39:02

 

+ Elapsed: 0.0165s, user: 0.001s, sys: 0.003s, mem: 0.121mb

You’ll notice that when we specified the caslib to look in, we used Casuser, but we didn’t include the username enclosed in parentheses. The username part can be specified, but it is primarily there to differentiate multiple Casuser caslibs in the listing. You’ll also notice that caslib names as well as CAS table names are case-insensitive. In addition, since Casuser is our active caslib, we don’t have to specify a caslib option at all.

In [5]: conn.fileinfo('data')

Out[5]:

[FileInfo]

 

     Permission     Owner  Group                               Name  

 0   -rw-r--r--  username  users                           iris.csv

 1   -rw-r--r--  username  users                           cars.csv

 2   -rw-r--r--  username  users              sashelp_class.sashdat

 3   -rw-r--r--  username  users                          class.csv

 

        Size Encryption                Time

 0      3716             18Feb2016:17:25:43

 1     42177             18Feb2016:17:25:49

 2     82136       NONE  19Feb2016:17:02:37

 3       519             14Apr2016:10:39:02

 

+ Elapsed: 0.0103s, user: 0.003s, sys: 0.003s, mem: 0.118mb

Although you can omit the caslib option, it’s generally a good practice to include it. When you create larger programs and start moving code around, your active caslib might not be what was in your original program context.

In addition to the files that are accessible through this caslib, there can also be tables that have already been loaded into the caslib. These can be listed using the tableinfo action.

In [6]: out = conn.tableinfo()

NOTE: No tables are available in caslib CASUSER(username) of Cloud Analytic Services.

Out[6]: + Elapsed: 0.000399s, mem: 0.0749mb

 

At this point, we don’t have any tables already loaded. Since we have data files that are located in an area that is accessible to our Casuser caslib, let’s load one into a table.

Loading Data into a CAS Table

To load data from a file in a caslib into a CAS table, you use the loadtable action by specifying the file path and the caslib name.

In [6]: out = conn.loadtable('data/iris.csv', caslib='casuser')

NOTE: Cloud Analytic Services made the file data/iris.csv available as table DATA.IRIS in caslib CASUSER(username).

 

In [7]: out

Out[7]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'DATA.IRIS'

 

[casTable]

 

 CASTable('DATA.IRIS', caslib='CASUSER(username)')

 

+ Elapsed: 0.109s, user: 0.075s, sys: 0.036s, mem: 64.9mb

You might remember from Chapter 2 that we used the upload method on the CAS connection object to load data. That method loads data from the client side, whereas the loadtable action loads files from the server side. We’ll get into more details about the upload method later in this chapter.

The result of the loadtable action includes the table name (tableName), the caslib (caslib), and a CAS table object (casTable). We discuss the CAS table object in much more detail in a later chapter. For now, let’s focus on the table name and caslib outputs.

The table name that you get from the loadtable action is generated and is based on the file name that was loaded. You can change the name of the loaded table using the casout option. In the following example, we specify both the name and the caslib that the loaded table is written to:

In [8]: out = conn.loadtable('data/iris.csv',

   ...:                      caslib='casuser',

   ...:                      casout=dict(name='mydata',

   ...:                                  caslib='casuser'))

 

NOTE: Cloud Analytic Services made the file data/iris.csv available as table MYDATA in caslib CASUSER(username).

 

In [9]: out

Out[9]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'MYDATA'

 

[casTable]

 

 CASTable('MYDATA', caslib='CASUSER(username)')

 

+ Elapsed: 0.0907s, user: 0.054s, sys: 0.039s, mem: 64.7mb

Now that our data is loaded into a table, let’s request information about the table from the server. The first action you’ll want to use is the tableinfo action.

In [10]: conn.tableinfo('data.iris', caslib='casuser')

Out[10]:

[TableInfo]

 

         Name  Rows  Columns Encoding CreateTimeFormatted  

 0  DATA.IRIS   150        5    utf-8  02Nov2016:11:54:28

 

      ModTimeFormatted JavaCharSet    CreateTime       ModTime  

 0  02Nov2016:11:54:28        UTF8  1.793707e+09  1.793707e+09

 

    Global  Repeated  View     SourceName       SourceCaslib  

 0       0         0     0  data/iris.csv  CASUSER(username)

 

    Compressed   Creator Modifier

 0           0  username

 

+ Elapsed: 0.00101s, user: 0.000999s, mem: 0.104mb

From these results, we see that the table contains 150 rows and 5 columns. Another field of particular note is the Global value. This indicates whether the table can be seen in other CAS sessions. In our case, the value is 0; therefore, the table is local to this session only.

Now that we have some information about the table as a whole, let’s see what information we can get about the columns. To get information about CAS table columns, you use the columninfo action. Since the tableinfo action gives a broad view of the table information, it needs only a table name and a caslib name. The columninfo action is somewhat more complex and uses a dictionary as a table argument.

In [11]: out = conn.columninfo(table=dict(name='data.iris',

   ....:                                  caslib='casuser'))

In [12]: out

Out[12]:

[ColumnInfo]

 

          Column  ID     Type  RawLength  FormattedLength  NFL  NFD

 0  sepal_length   1   double          8               12    0    0

 1   sepal_width   2   double          8               12    0    0

 2  petal_length   3   double          8               12    0    0

 3   petal_width   4   double          8               12    0    0

 4       species   5  varchar         10               10    0    0

 

+ Elapsed: 0.000751s, mem: 0.168mb

From this output, you can see the column names, the data types, the data sizes, and various formatting information.

Now that we know about our data, let’s look at the data itself.

Displaying Data in a CAS Table

Of course, the most important thing about the table is the data itself. We can get a sample of the data by using the fetch action. In this case, we use the to=5 parameter to fetch only five rows.

In [13]: conn.fetch(table=dict(name='data.iris',

   ....:                       caslib='casuser'),

   ....:            to=5)

Out[13]:

[Fetch]

 

 Selected Rows from Table DATA.IRIS

 

    sepal_length  sepal_width  petal_length  petal_width species

 0           5.1          3.5           1.4          0.2  setosa

 1           4.9          3.0           1.4          0.2  setosa

 2           4.7          3.2           1.3          0.2  setosa

 3           4.6          3.1           1.5          0.2  setosa

 4           5.0          3.6           1.4          0.2  setosa

 

+ Elapsed: 0.00381s, user: 0.004s, mem: 1.64mb

Note that unless you specify sorting options, the rows are not guaranteed to come back in the order in which the data was loaded. This is because the data can be spread across the CAS grid, and the result that you see is assembled from the partial results of all of the nodes on the grid. In the following code, we fetch the data using the sepal_length and sepal_width sort keys to guarantee the ordering.

In [14]: conn.fetch(table=dict(name='data.iris',

   ....:                       caslib='casuser'),

   ....:            to=5,

   ....:            sortby=['sepal_length', 'sepal_width'])

Out[14]:

[Fetch]

 

 Selected Rows from Table DATA.IRIS

 

    sepal_length  sepal_width  petal_length  petal_width species

 0           4.3          3.0           1.1          0.1  setosa

 1           4.4          2.9           1.4          0.2  setosa

 2           4.4          3.0           1.3          0.2  setosa

 3           4.4          3.2           1.3          0.2  setosa

 4           4.5          2.3           1.3          0.3  setosa

 

+ Elapsed: 0.0171s, user: 0.014s, sys: 0.003s, mem: 8.58mb
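The effect of the sortby option can be mimicked on the client with pandas alone. In this sketch, two made-up chunks stand in for the partial results returned by different worker nodes; nothing here requires a CAS connection.

```python
import pandas as pd

# Rows come back from worker nodes in whatever order the nodes finish,
# so the concatenated result is unordered until a sort key is applied.
chunk_from_node1 = pd.DataFrame({'sepal_length': [5.1, 4.4],
                                 'sepal_width':  [3.5, 3.0]})
chunk_from_node2 = pd.DataFrame({'sepal_length': [4.4, 4.3],
                                 'sepal_width':  [2.9, 3.0]})

combined = pd.concat([chunk_from_node1, chunk_from_node2],
                     ignore_index=True)

# Applying the same keys as the sortby option makes the order stable
ordered = combined.sort_values(['sepal_length', 'sepal_width'],
                               ignore_index=True)
print(ordered['sepal_length'].tolist())  # [4.3, 4.4, 4.4, 5.1]
```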

Now that we have seen a sample of the data, let’s compute some simple statistics.

Computing Simple Statistics

Just to whet your appetite a bit, let’s get some summary statistics on the table that we just loaded. We’ll delve into running actions on data in later chapters, but it seems a shame to load our first set of data and not do any sort of analysis on it. Let’s run the summary action on our table.

In [15]: conn.summary(table=dict(name='data.iris',

   ....:                         caslib='casuser'))

Out[15]:

[Summary]

 

 Descriptive Statistics for DATA.IRIS

 

          Column  Min  Max      N  NMiss      Mean    Sum       Std  

 0  sepal_length  4.3  7.9  150.0    0.0  5.843333  876.5  0.828066

 1   sepal_width  2.0  4.4  150.0    0.0  3.054000  458.1  0.433594

 2  petal_length  1.0  6.9  150.0    0.0  3.758667  563.8  1.764420

 3   petal_width  0.1  2.5  150.0    0.0  1.198667  179.8  0.763161

 

      StdErr       Var      USS         CSS         CV     TValue  

 0  0.067611  0.685694  5223.85  102.168333  14.171126  86.425375

 1  0.035403  0.188004  1427.05   28.012600  14.197587  86.264297

 2  0.144064  3.113179  2583.00  463.863733  46.942721  26.090198

 3  0.062312  0.582414   302.30   86.779733  63.667470  19.236588

 

            ProbT

 0  3.331256e-129

 1  4.374977e-129

 2   1.994305e-57

 3   3.209704e-42

 

+ Elapsed: 0.0266s, user: 0.028s, sys: 0.003s, mem: 1.74mb

When you are finished using a table, it can be removed from memory. Let’s see how to do that in the next section.

Dropping a CAS Table

If a table is no longer needed, you can remove it by using the droptable action.

In [16]: conn.droptable('data.iris', caslib='casuser')

NOTE: Cloud Analytic Services dropped table data.iris from caslib CASUSER(username).

Out[16]: + Elapsed: 0.000332s, mem: 0.0535mb

In some of the previous listings, you might have seen data types that haven't previously been available in a SAS product. CAS has a much richer set of data types than traditional SAS does; we'll look at those next.

CAS Data Types

When parsing the Iris CSV file in the previous examples, the character data was stored as a varchar (variable-width character) column. This means that the values are not padded with extra spaces the way values in a fixed-width character column are. This is just one of the new data types in CAS that aren't available in traditional SAS. Here is a full list of the data types that are available in CAS tables.

Name       Description
Double     8-byte IEEE floating-point number
Int32      32-bit integer
Int64      64-bit integer
DecQuad    128-bit fixed-decimal value
DecSext    192-bit fixed-decimal value
Char       Fixed-width, blank-padded UTF-8 encoded character data
Varchar    Variable-width UTF-8 encoded character data
Binary     Fixed-width binary data
Varbinary  Variable-width binary data
Date       CAS date value (32-bit integer containing the number of days since January 1, 1960)
Time       CAS time value (64-bit integer containing the number of microseconds since midnight)
Datetime   CAS datetime value (64-bit integer containing the number of microseconds since January 1, 1960)

With all of these data types, it is possible to do much more targeted data processing.
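Based on the epoch definitions in the preceding table, the temporal types can be converted to Python objects on the client. This is a minimal sketch with made-up sample values; it assumes only the day and microsecond offsets described above.

```python
from datetime import date, datetime, time, timedelta

# The CAS epoch for Date and Datetime values, per the table above
CAS_EPOCH = datetime(1960, 1, 1)

def cas_date_to_date(days):
    """Date: 32-bit count of days since January 1, 1960."""
    return (CAS_EPOCH + timedelta(days=days)).date()

def cas_time_to_time(micros):
    """Time: 64-bit count of microseconds since midnight."""
    return (datetime.min + timedelta(microseconds=micros)).time()

def cas_datetime_to_datetime(micros):
    """Datetime: 64-bit count of microseconds since January 1, 1960."""
    return CAS_EPOCH + timedelta(microseconds=micros)

print(cas_date_to_date(0))    # 1960-01-01
print(cas_date_to_date(366))  # 1961-01-01 (1960 was a leap year)
print(cas_time_to_time(3_600_000_000))  # 01:00:00
```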

Caslib and CAS Table Visibility

There are two scopes in CAS for caslibs and CAS tables: session (or local) and global. By default, both caslibs and CAS tables are created with session scope. That means that they are visible only in the session that created them. If you need to use them across sessions, you need to specify a parameter indicating the global scope. The addcaslib action uses the session parameter to indicate session or global visibility. The loadtable and addtable actions use the promote parameter to indicate whether the table should be promoted to the global scope. CAS tables can also be promoted after creation by using the promote action. However, note that the promotion of a table is possible only in a global caslib.

The Active Caslib

As mentioned previously, one caslib is considered active at any point in time. Any table reference in an action that is not accompanied by a caslib name is assumed to refer to a table in the active caslib. You can use the caslibinfo action to see which caslib is active. In the following output, the Casuser caslib is marked as the active caslib.

In [17]: conn.caslibinfo()

Out[17]:

[CASLibInfo]

 

                 Name  Type                  Description  

 0  CASUSER(username)  PATH  Personal File System Caslib

 1            Formats  PATH                Format Caslib

 

                     Path Definition  Subdirs  Local  Active  

 0           /u/username/                 1.0    0.0     1.0

 1  /bigdisk/lax/formats/                 1.0    0.0     0.0

 

    Personal  Hidden

 0       1.0     0.0

 1       0.0     0.0

 

+ Elapsed: 0.000427s, mem: 0.095mb

Alternatively, you can use the getsessopt action to query the caslib option.

In [18]: conn.getsessopt('caslib')

Out[18]:

[caslib]

 

 'CASUSER(username)'

 

+ Elapsed: 0.000268s, mem: 0.0477mb

The active caslib can be set at creation using the activeonadd parameter of addcaslib, or it can be set at any time using the setsessopt action.

In [19]: conn.setsessopt(caslib='formats')

NOTE: 'Formats' is now the active caslib.

Out[19]: + Elapsed: 0.000289s, mem: 0.0948mb

Regardless of which caslib is active, it’s generally a good idea to always specify the caslib that you load a table from. This practice helps avoid errors where the active caslib gets changed at another point in the program.

Let’s set the active caslib back to casuser before moving on.

In[20]: conn.setsessopt(caslib='casuser')

With the information from this chapter so far, you should be able to load basic data files into CAS. If you have some data that you want to analyze, you could jump ahead to the chapters that cover the topics that you are interested in. However, if you want to learn about other ways of getting data into CAS, continue to the next section.

Uploading Data Files to CAS Tables

In the previous section, we showed how to load a data file from a file-based caslib. But what if you don’t have direct access to the file system where the CAS server is running, or you just want to upload data from a file that exists on the client side? The easiest solution is to use the upload method on the CAS connection.

In its simplest form, the upload method takes a filename, a URL, or a Pandas DataFrame as an argument. The type of data in the file is inferred from the file extension, and the CAS table name is generated automatically. At the time of this writing, the file types that are supported by the upload method are sashdat, sas7bdat, XLS, CSV, DTA, JMP, and SPSS. Note that when you load data using the upload method, the table exists only in memory; the file is not stored anywhere on the server. Let's load the Iris data set from a local file rather than from a file on the server.

In [21]: conn.upload('/u/username/data/iris.csv')

NOTE: Cloud Analytic Services made the uploaded file available as table IRIS in caslib CASUSER(username).

NOTE: The table IRIS has been created in caslib CASUSER(username) from binary data uploaded to Cloud Analytic Services.

Out[21]:

[caslib]

 

 'CASUSER'

 

[tableName]

 

 'IRIS'

 

[casTable]

 

 CASTable('IRIS', caslib='CASUSER(username)')

 

+ Elapsed: 0.0683s, user: 0.038s, sys: 0.024s, mem: 48.4mb

As you can see, the output from the upload method is very similar to that of the loadtable action. We get back a results object that contains a caslib name, a table name, and a CAS table object. We can now query for the table, the column, and data information as we did in the previous section.

In [22]: conn.columninfo(table=dict(name='iris',

   ....:                            caslib='casuser'))

Out[22]:

[ColumnInfo]

 

          Column  ID     Type  RawLength  FormattedLength  NFL  NFD

 0  sepal_length   1   double          8               12    0    0

 1   sepal_width   2   double          8               12    0    0

 2  petal_length   3   double          8               12    0    0

 3   petal_width   4   double          8               12    0    0

 4       species   5  varchar         10               10    0    0

 

+ Elapsed: 0.000707s, mem: 0.17mb

Also, just as with the loadtable action, you can specify a casout option in order to assign a different table name or caslib.

In [23]: conn.upload('/u/username/data/iris.csv',

   ....:             casout=dict(name='iris2', caslib='casuser'))

NOTE: Cloud Analytic Services made the uploaded file available as table IRIS2 in caslib CASUSER(username).

NOTE: The table IRIS2 has been created in caslib CASUSER(username) from binary data uploaded to Cloud Analytic Services.

Out[23]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'IRIS2'

 

[casTable]

 

 CASTable('IRIS2', caslib='CASUSER(username)')

 

+ Elapsed: 0.0614s, user: 0.037s, sys: 0.021s, mem: 48.4mb

So far, we have been using the default parsing parameters for our CSV file. The upload method supports various options for each of the supported file types. Parsing options are specified with the importoptions parameter. This parameter is the same as in the loadtable action, so we can use the help for the loadtable action to see what options are available. Here is a partial listing.

In [24]: conn.loadtable?

 

...

 

Loads a table from a caslib's data source

 

Parameters

----------

path : string

    specifies the file, directory, or table name.

 

readahead : boolean, optional

    when set to True, loads the table into memory immediately. By

    default, a table is loaded into memory when it is first used.

    Default: False

 

importoptions : dict, optional

    specifies the settings for reading a table from a data source.

 

    importoptions.filetype : string

        Default: auto

        Values: auto, hdat, csv, delimited, excel, jmp, spss, dta,

                esp, lasr, basesas, mva, xls, fmt

 

    if importoptions.filetype == hdat:

 

        importoptions.encryptionpassword : string or blob, optional

            specifies a password for encrypting or decrypting stored

            data.

 

    if importoptions.filetype == csv:

 

        importoptions.guessrows : int64, optional

            specifies the number of rows to scan in order to determine

            data types for variables. Specify 0 to scan all rows.

            Default: 20

            Note: Value range is 1 <= n < 9223372036854775807

 

        importoptions.delimiter : string, optional

            specifies the character to use as the field delimiter.

            Default: ,

 

        importoptions.vars : list of dicts, optional

            specifies the names, types, formats, and other metadata

            for variables.

 

            importoptions.vars[*].name : string, optional

                specifies the name for the variable.

 

            importoptions.vars[*].label : string, optional

                specifies the descriptive label for the variable.

 

... truncated ...

Using the importoptions parameter that was previously described, you can parse a tab-delimited file rather than a comma-delimited file by changing the delimiter.

In [25]: out = conn.upload('/u/username/data/iris.tsv',

   ....:                   importoptions=dict(filetype='csv',

   ....:                                      delimiter='\t'),

   ....:                   casout=dict(name='iris_tsv',

   ....:                               caslib='casuser'))

NOTE: Cloud Analytic Services made the uploaded file available as table IRIS_TSV in caslib CASUSER(username).

NOTE: The table IRIS_TSV has been created in caslib CASUSER(username) from binary data uploaded to Cloud Analytic Services.

 

In [26]: out

Out[26]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'IRIS_TSV'

 

[casTable]

 

 CASTable('IRIS_TSV', caslib='CASUSER(username)')

 

+ Elapsed: 0.0918s, user: 0.051s, sys: 0.04s, mem: 64.7mb

Now that we have seen how to upload local files, let’s examine uploading data from URLs.

Uploading Data from URLs to CAS Tables

Uploading data from a URL works exactly the same way as uploading from a local file. The only difference is that you are specifying a URL as the first parameter of the upload method.

In [27]: out = conn.upload('https://raw.githubusercontent.com/'

   ....:                   'sassoftware/sas-viya-programming/'

   ....:                   'master/data/class.csv')

NOTE: Cloud Analytic Services made the uploaded file available as table CLASS in caslib CASUSER(username).

NOTE: The table CLASS has been created in caslib CASUSER(username) from binary data uploaded to Cloud Analytic Services.

 

In [28]: out

Out[28]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'CLASS'

 

[casTable]

 

 CASTable('CLASS', caslib='CASUSER(username)')

 

+ Elapsed: 0.0948s, user: 0.06s, sys: 0.035s, mem: 64.7mb

A fact to note about uploading from a URL is that the file is not downloaded by the CAS server. The URL is downloaded to a temporary file on the client side, and then it is uploaded to the server. Although this form of uploading is quick and convenient, you’ll want to keep this fact in mind when using it.
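The client-side round trip can be sketched with the standard library alone. A file:// URL stands in for the remote address so the example runs offline; with a live session, the downloaded path would then be handed to conn.upload.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Create a small CSV and address it by URL (stand-in for a remote file)
src = Path(tempfile.gettempdir()) / 'class_demo.csv'
src.write_text('name,age\nAlfred,14\nAlice,13\n')

# This mirrors what upload does for URLs: fetch the URL to a local
# file first; that local file is what actually gets uploaded to CAS.
local_path, headers = urlretrieve(src.as_uri())
print(open(local_path).read().splitlines()[0])  # name,age
```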

Uploading Data from a Pandas DataFrame to a CAS Table

If you already have your data in a Pandas DataFrame and you want to upload it to a CAS table, again, the upload method is the quickest and easiest way. In the case of DataFrames, you give the DataFrame object as the first argument to the upload method rather than a filename. In the following example, we first read a CSV file into a DataFrame, and then we’ll upload that DataFrame to CAS.

# Read the CSV file with Pandas

In [29]: import pandas as pd

 

In [30]: df = pd.read_csv('/u/username/data/iris.csv')

 

# Quick check of data values

In [31]: df.head()

Out[31]:

   sepal_length  sepal_width  petal_length  petal_width species

0           5.1          3.5           1.4          0.2  setosa

1           4.9          3.0           1.4          0.2  setosa

2           4.7          3.2           1.3          0.2  setosa

3           4.6          3.1           1.5          0.2  setosa

4           5.0          3.6           1.4          0.2  setosa

 

# Upload the DataFrame

In [32]: conn.upload(df)

NOTE: Cloud Analytic Services made the uploaded file available as table TMPWZJKVO_1 in caslib CASUSER(username).

NOTE: The table TMPWZJKVO_1 has been created in caslib CASUSER(username) from binary data uploaded to Cloud Analytic Services.

Out[32]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'TMPWZJKVO_1'

 

[casTable]

 

 CASTable('TMPWZJKVO_1', caslib='CASUSER(username)')

 

+ Elapsed: 0.0897s, user: 0.059s, sys: 0.031s, mem: 64.7mb

 

In [33]: conn.fetch(table=dict(name='TMPWZJKVO_1',

   ....:                       caslib='casuser'), to=5)

Out[33]:

[Fetch]

 

 Selected Rows from Table TMPWZJKVO_1

 

    sepal_length  sepal_width  petal_length  petal_width species

 0           5.1          3.5           1.4          0.2  setosa

 1           4.9          3.0           1.4          0.2  setosa

 2           4.7          3.2           1.3          0.2  setosa

 3           4.6          3.1           1.5          0.2  setosa

 4           5.0          3.6           1.4          0.2  setosa

 

+ Elapsed: 0.00321s, user: 0.003s, mem: 1.65mb

You’ll see that in the case of DataFrames, the table name is always generated since it can’t be inferred from a filename. Of course, you can always specify a name manually using the casout argument to the upload method.

Since it is possible to upload data directly from a Pandas DataFrame, you can use any of the Pandas data readers to construct your data and then push the result to CAS. Although the DataFrame uploader is a simple way to upload data, it has limitations as well. When a DataFrame is uploaded, it is exported to CSV first, and then the CSV file is uploaded. This works fine for character and numeric data, but if your data includes dates or times, those are always treated as character data by the loadtable action that gets invoked behind the scenes. It is possible to create more customized data loaders; that is the topic of the next section.
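The date-handling limitation is easy to demonstrate without a CAS connection at all: round-tripping a DataFrame through CSV, as the uploader does, turns a datetime column into plain character data unless it is explicitly reparsed.

```python
import io
import pandas as pd

# A DataFrame with a real datetime column
df = pd.DataFrame({'name': ['a', 'b'],
                   'when': pd.to_datetime(['2016-02-18', '2016-04-14'])})

# Export to CSV and read it back, as the DataFrame uploader does
csv_text = df.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(csv_text))

print(df['when'].dtype)          # datetime64[ns]
print(round_trip['when'].dtype)  # object (character data)
```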

Using Data Message Handlers

If you can’t get your data properly uploaded using the simpler methods such as the loadtable action or the upload method, you might require the heavier duty technique called data message handlers.  Data message handlers are Python classes that can handle requests from the CAS server for batches of data. The data message handler classes are used when invoking the addtable action. They set up the list of variable attributes, and they respond to requests from the server for chunks of data.

Although the data message handlers are more powerful than the upload method, they work only with the binary protocol to CAS. You cannot use them with the REST interface. If you use the REST interface, you must stick with the upload method, or you must load server-side tables using the loadtable action.

The SWAT package includes three primary data message handler classes: CASDataMsgHandler, PandasDataFrame, and DBAPI. The CASDataMsgHandler class is abstract, which means that it isn't meant to be used directly; it exists only to be subclassed to create concrete data message handler classes such as PandasDataFrame and DBAPI.

As you might have suspected from the names, the PandasDataFrame data message handler class supports Pandas DataFrame objects, and the DBAPI data message handler class supports DB-API 2.0 compliant database connection objects. Several subclasses of the PandasDataFrame class correspond to the parsers and database connections that the Pandas package supplies, including CSV, Text, HTML, Excel, JSON, SQLQuery, SQLTable, and Clipboard. These are convenience classes that bundle the parsing of the data and the data upload in one class. You could also use the Pandas parsers to create your own DataFrame and then use the generic PandasDataFrame class to upload the result.
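For the DBAPI class, any DB-API 2.0 module can supply the data. As a sketch, sqlite3 (which ships with Python) provides the module and cursor; the commented-out data message handler call assumes a live CAS connection and the dmh alias that is introduced below, so only the database portion runs here.

```python
import sqlite3

# Build a small DB-API 2.0 data source (sqlite3 ships with Python)
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE class (name TEXT, age INTEGER)')
db.executemany('INSERT INTO class VALUES (?, ?)',
               [('Alfred', 14), ('Alice', 13)])
cursor = db.execute('SELECT * FROM class')
print(cursor.fetchall())  # [('Alfred', 14), ('Alice', 13)]

# On a live CAS connection, the module and cursor would be handed to
# the data message handler, for example:
#     dbdmh = dmh.DBAPI(sqlite3, cursor, nrecs=100)
#     conn.addtable(table='class', **dbdmh.args.addtable)
```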

All of the data message handler classes are included in the swat.cas.datamsghandlers package. That’s quite a lot to enter, so we’ll import it as follows so that we can use the dmh alias.

In [34]: from swat.cas import datamsghandlers as dmh

To see the available data message handler classes, you can use Python’s __subclasses__ method. Note that this method returns only the immediate subclasses. Therefore, you must traverse those subclasses as well in order to see all of the classes.

In [35]: dmh.CASDataMsgHandler.__subclasses__()

Out[35]: [swat.cas.datamsghandlers.PandasDataFrame,   

          swat.cas.datamsghandlers.DBAPI]

 

In [36]: dmh.PandasDataFrame.__subclasses__()

Out[36]:

[swat.cas.datamsghandlers.SQLTable,

 swat.cas.datamsghandlers.CSV,

 swat.cas.datamsghandlers.Text,

 swat.cas.datamsghandlers.JSON,

 swat.cas.datamsghandlers.HTML,

 swat.cas.datamsghandlers.SAS7BDAT,

 swat.cas.datamsghandlers.Clipboard,

 swat.cas.datamsghandlers.Excel,

 swat.cas.datamsghandlers.SQLQuery]
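The two __subclasses__ calls above only walk one level at a time. A small recursive helper collects the whole tree in one call. This is a sketch in plain Python, demonstrated on a dummy class hierarchy rather than a live SWAT session:

```python
def all_subclasses(cls):
    """Recursively collect every subclass of ``cls``, not just the
    immediate ones that ``__subclasses__`` returns."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found

# Dummy hierarchy standing in for CASDataMsgHandler and its subclasses
class Base: pass
class Mid(Base): pass
class Leaf(Mid): pass

print([c.__name__ for c in all_subclasses(Base)])  # ['Mid', 'Leaf']
```

With SWAT imported, calling all_subclasses(dmh.CASDataMsgHandler) returns the complete list of data message handler classes in one step.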

We’ve already practiced loading CSV data. Now let’s look at another format in our next example: HTML.

The HTML Data Message Handler

For this example, we’ll look at loading data from an HTML file. The URL that we use is https://www.fdic.gov/bank/individual/failed/banklist.html. This document has a nice table of information about banks that have failed since October 1, 2000. This is an especially good example because the data in this table includes character, numeric, and date-based columns.


The HTML data message handler has an option to specify which table in the file should be read. In this case, the data table of bank information happens to be the first table in the file. To use the HTML data message handler, you create an instance of it using the URL of the HTML file as the first argument. We also specify index=0 as the index of the table to load, which is also the default.

In [37]: htmldmh = dmh.HTML('https://www.fdic.gov/bank/' +         

   ....:                    'individual/failed/banklist.html',

   ....:                    index=0)

Now that we have a data message handler instance that is bound to some data, we can use it with the addtable action in order to upload the data to CAS. The addtable action takes various parameters that describe the data. These parameters are generated by the data message handler classes. Since the addtable action might not be the only action that uses data messages, the data message handler classes are set up to support various parameter sets. The parameters for the addtable action are accessed in the htmldmh.args.addtable property.

In [38]: htmldmh.args.addtable

Out[38]:

{'datamsghandler': <swat.cas.datamsghandlers.HTML at 0x7fd4ba20da20>,

 'reclen': 104,

 'vars': [{'length': 16,

   'name': 'Bank Name',

   'offset': 0,

   'rtype': 'CHAR',

   'type': 'VARCHAR'},

  {'length': 16,

   'name': 'City',

   'offset': 16,

   'rtype': 'CHAR',

   'type': 'VARCHAR'},

  {'length': 16,

   'name': 'ST',

   'offset': 32,

   'rtype': 'CHAR',

   'type': 'VARCHAR'},

  {'length': 8,

   'name': 'CERT',

   'offset': 48,

   'rtype': 'NUMERIC',

   'type': 'INT64'},

  {'length': 16,

   'name': 'Acquiring Institution',

   'offset': 56,

   'rtype': 'CHAR',

   'type': 'VARCHAR'},

  {'length': 16,

   'name': 'Closing Date',

   'offset': 72,

   'rtype': 'CHAR',

   'type': 'VARCHAR'},

  {'length': 16,

   'name': 'Updated Date',

   'offset': 88,

   'rtype': 'CHAR',

   'type': 'VARCHAR'}]}

As you can see from the preceding lengthy listing, a lot of information is generated to describe the columns that are uploaded to CAS. You might also notice more columns in this output than are displayed in the HTML table. That’s because some of the columns in the HTML table are hidden by CSS (Cascading Style Sheet) rules. Even though those columns are hidden from view, they still exist in the HTML table and are parsed by Pandas.

To use these generated parameters, we can take advantage of the Python ** operator, which expands a dictionary to keyword parameters in a method call. Since we are using the addtable action, the invocation appears as follows. Note that we must also specify a name for our table, and we specify a caslib name as well in order to be explicit about where the table should be created.
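The ** expansion itself is plain Python: any dictionary can be unpacked into keyword arguments of a call. A minimal illustration using a hypothetical stand-in function (add_table here is not the CAS action, just a plain function echoing its arguments):

```python
def add_table(table=None, caslib=None, reclen=None, vars=None):
    """Hypothetical stand-in for the addtable call, echoing its arguments."""
    return table, caslib, reclen

# The kind of dictionary a data message handler generates
args = {'reclen': 104, 'vars': []}

# ** expands the dictionary into reclen=104, vars=[]
print(add_table(table='banklist', caslib='casuser', **args))
# ('banklist', 'casuser', 104)
```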

In [39]: out = conn.addtable(table='banklist', caslib='casuser',

   ....:                     **htmldmh.args.addtable)

 

In [40]: out

Out[40]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'BANKLIST'

 

[casTable]

 

 CASTable('BANKLIST', caslib='CASUSER(username)')

 

+ Elapsed: 0.109s, user: 0.003s, mem: 0.56mb

The output from the addtable action looks just like the output from the loadtable action. We get a table name, a caslib name, and a CASTable object in the results. Now that we have the table loaded into CAS, we can query it in order to verify that it loaded as expected.

In [41]: conn.columninfo(table=dict(name='banklist',

   ....:                            caslib='casuser'))

Out[41]:

[ColumnInfo]

 

                   Column  ID     Type  RawLength  FormattedLength  

 0              Bank Name   1  varchar         90               90

 1                   City   2  varchar         17               17

 2                     ST   3  varchar          2                2

 3                   CERT   4    int64          8               12

 4  Acquiring Institution   5  varchar         65               65

 5           Closing Date   6  varchar         18               18

 6           Updated Date   7  varchar         18               18

 

    NFL  NFD

 0    0    0

 1    0    0

 2    0    0

 3    0    0

 4    0    0

 5    0    0

 6    0    0

 

+ Elapsed: 0.000628s, mem: 0.172mb

You might notice that even Pandas didn’t automatically detect the date columns. You can use the parse_dates option for the Pandas HTML reader to specify where to look for dates in your data. In this case, the date columns are columns 5 and 6. In our data message handler parameters, we add parse_dates=[5, 6] so that the date columns are parsed correctly.

Note that once the data from a data message handler has been exhausted, it can't be used again. Therefore, in order to reload the data, you must create a new data message handler object.

In [42]: htmldmh = dmh.HTML('https://www.fdic.gov/bank/' +

   ....:                    'individual/failed/banklist.html',

   ....:                    index=0, parse_dates=[5, 6])

 

In [43]: out = conn.addtable(table='banklist', caslib='casuser',

   ....:                     replace=True,

   ....:                     **htmldmh.args.addtable)

 

In [44]: conn.columninfo(table=dict(name='banklist',

   ....:                            caslib='casuser'))

Out[44]:

[ColumnInfo]

 

                   Column  ID      Type  RawLength  FormattedLength  

 0              Bank Name   1   varchar         90               90

 1                   City   2   varchar         17               17

 2                     ST   3   varchar          2                2

 3                   CERT   4     int64          8               12

 4  Acquiring Institution   5   varchar         65               65

 5           Closing Date   6  datetime          8               20

 6           Updated Date   7  datetime          8               20

 

      Format  NFL  NFD

 0              0    0

 1              0    0

 2              0    0

 3              0    0

 4              0    0

 5  DATETIME    0    0

 6  DATETIME    0    0

 

+ Elapsed: 0.000638s, user: 0.001s, mem: 0.175mb

There are a couple of things to note about the preceding output. First, columns 5 and 6 now have the datetime type and a DATETIME format. We know that our date parsing option worked. Second, notice the replace=True option in our second call to the addtable action. If a table already exists in CAS with the given name, you receive an error and the existing table is not overwritten. In order for the addtable action to overwrite the existing table, you must supply the replace=True option.

We can fetch a sample of the data to verify that it looks correct. We specify sastypes=False on the fetch call so that it returns the data in the richest form possible.

In [45]: conn.fetch(table=dict(name='banklist',

   ....:                       caslib='casuser'),

   ....:            sastypes=False, to=3)

Out[45]:

[[Fetch]

 

 Selected Rows from Table BANKLIST

 

                       Bank Name             City  ST   CERT  

 0                   Allied Bank         Mulberry  AR     91

 1  The Woodbury Banking Company         Woodbury  GA  11297

 2        First CornerStone Bank  King of Prussia  PA  35312

 

                  Acquiring Institution Closing Date Updated Date

 0                         Today's Bank   2016-09-23   2016-10-17

 1                          United Bank   2016-08-19   2016-10-17

 2  First-Citizens Bank & Trust Company   2016-05-06   2016-09-06

 

+ Elapsed: 0.00369s, user: 0.001s, sys: 0.001s, mem: 1.69mb

In addition to the parse_dates option, you can specify any of the Pandas read_html options that deal with parsing, such as header, index_col, thousands, encoding, and so on.

Now that we have seen one example of how the data message handlers work, let’s look at one that is likely to be popular: Excel.

The Excel Data Message Handler

The Excel data message handler works like the HTML data message handler. The only difference is that it uses the Pandas read_excel function “under the covers” rather than the read_html function. For this example, we use the Crop Year 2014 Disaster Declarations from the USDA website:

http://www.fsa.usda.gov/Internet/FSA_File/disaster_cty_list_ytd_14.xls


Like the HTML file in the previous example, this file has a mixture of data types.

We start by creating a data message handler object, just like in the section about HTML.

In [46]: exceldmh = dmh.Excel('http://www.fsa.usda.gov/Internet/' +  

   ....:                      'FSA_File/disaster_cty_list_ytd_14.xls')

Now we can add the table to the server using the addtable action.

In [47]: out = conn.addtable(table='crops', caslib='casuser',

   ....:                     **exceldmh.args.addtable)

 

In [48]: out

Out[48]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'CROPS'

 

[casTable]

 

 CASTable('CROPS', caslib='CASUSER(username)')

 

+ Elapsed: 0.992s, user: 0.007s, sys: 0.013s, mem: 2.83mb

Here is a partial list of the output from the columninfo action on the resulting table:

In [49]: conn.columninfo(table=dict(name='crops', caslib='casuser'))

Out[49]:

[ColumnInfo]

 

                      Column  ID      Type  RawLength  

 0                      FIPS   1     int64          8

 1                    County   2   varchar         17

 2                     State   3   varchar         14

 3          Designation Code   4     int64          8

 4        Designation Number   5   varchar          5

 5                   DROUGHT   6     int64          8

 6     FLOOD, Flash flooding   7     int64          8

 ..                      ...  ..       ...        ...

 28            Approval date  29   varchar          0

 29               Begin Date  30  datetime          8

 30                 End Date  31   varchar         10

 31  Description of disaster  32   varchar        251

 32       CROP DISASTER YEAR  33     int64          8

 33              Unnamed: 33  34   varchar        101

 34              Unnamed: 34  35    double          8

 

     FormattedLength    Format  NFL  NFD

 0                12              0    0

 1                17              0    0

 2                14              0    0

 3                12              0    0

 4                 5              0    0

 5                12              0    0

 6                12              0    0

 ..              ...       ...  ...  ...

 28                0              0    0

 29               20  DATETIME    0    0

 30               10              0    0

 31              251              0    0

 32               12              0    0

 33              101              0    0

 34               12              0    0

 

 [35 rows x 8 columns]

 

+ Elapsed: 0.00075s, user: 0.001s, mem: 0.166mb

You can see that, unlike with HTML, the datetime column was automatically detected and given the correct column type. Again, you can pass any of the Pandas read_excel arguments to the Excel data message handler to modify the behavior of the parsing.

If you already have a Pandas DataFrame, or you just want to use the Pandas file reader functions directly, you can use the PandasDataFrame data message handler itself.

The PandasDataFrame Data Message Handler

Most of the data message handlers that are supplied by SWAT are subclasses of the PandasDataFrame data message handler class. This data message handler class handles all of the work in computing the CAS data types, constructing data buffers, and passing them to the server. The subclasses merely have to parse data into a Pandas DataFrame and then pass it to the PandasDataFrame class to handle the transfer of data to the server. Rather than using one of the convenience classes, let’s just use the PandasDataFrame class directly.

Just so that you can see what is going on behind the scenes in the Excel data message handler, let’s repeat the example from the previous section but parse the Excel file on our own and then pass the result to PandasDataFrame. This exercise provides insight into writing your own data message handlers.

In the following example, we first parse the Excel file into a DataFrame, then we load it into CAS.

# Import Pandas

In [50]: import pandas as pd

 

# Read the Excel file into a DataFrame

In [51]: exceldf = pd.read_excel('http://www.fsa.usda.gov/Internet/' +

   ....:                      'FSA_File/disaster_cty_list_ytd_14.xls')

 

# Create a PandasDataFrame data message handler

In [52]: exceldmh = dmh.PandasDataFrame(exceldf)

 

# Add the table to the server

In [53]: out = conn.addtable(table='crops_df', caslib='casuser',

   ....:                     **exceldmh.args.addtable)

 

In [54]: out

Out[54]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'CROPS_DF'

 

[casTable]

 

 CASTable('CROPS_DF', caslib='CASUSER(username)')

 

+ Elapsed: 0.993s, user: 0.032s, sys: 0.006s, mem: 3.07mb

As you can see, there are a few extra steps involved when you do the parsing of the Excel file yourself. However, the PandasDataFrame data message handler gives you a lot of power since you can pass any DataFrame to it for loading on the server, regardless of its origin.

In addition to files and URLs, Pandas DataFrames have additional methods to import data from databases. We look at these techniques in the following section.

Using Data Message Handlers with Databases

Pandas includes a few methods for importing data from a database, including read_sql_table and read_sql_query. These are used in the SQLTable and SQLQuery data message handler classes, respectively. The easiest way to demonstrate database connectivity is by using SQLite. SQLite is a self-contained SQL database engine that requires no configuration or setup. It can be used “out of the box.”

We use the Iris data set again for this example. However, in this case, we load the data into an SQLite database, and then use the SQLTable and SQLQuery data message handlers to load the data from the database. Of course, Python supports many other commercial and non-commercial databases, so this same procedure will work with those as well.

# Import csv parser and sqlite3

In [55]: import csv

 

In [56]: import sqlite3

 

# Create a file-based database

In [57]: sqlc = sqlite3.connect('iris.db')

 

In [58]: cur = sqlc.cursor()

 

# Define the table

In [59]: cur.execute('''CREATE TABLE iris (sepal_length REAL,

   ....:                                   sepal_width REAL,

   ....:                                   petal_length REAL,

   ....:                                   petal_width REAL,

   ....:                                   species CHAR(10));''')

Out[59]: <sqlite3.Cursor at 0x7fbf3eedd810>

 

# Parse the iris CSV file and format it as tuples

In [60]: with open('/u/username/data/iris.csv', 'r') as iris:

   ....:     data = csv.DictReader(iris)

   ....:     rows = [(x['sepal_length'],

   ....:              x['sepal_width'],

   ....:              x['petal_length'],

   ....:              x['petal_width'],

   ....:              x['species']) for x in data]

 

# Load the data into the database

In [61]: cur.executemany('''INSERT INTO iris (sepal_length,

   ....:                                      sepal_width,

   ....:                                      petal_length,

   ....:                                      petal_width,

   ....:                                      species)

   ....:                    VALUES (?, ?, ?, ?, ?);''', rows)

Out[61]: <sqlite3.Cursor at 0x7fbf3eedd810>

 

In [62]: sqlc.commit()

 

# Verify that the data looks correct

In [63]: cur.execute('SELECT * from iris')

Out[63]: <sqlite3.Cursor at 0x7fbf3b9afb90>

 

In [64]: cur.fetchmany(5)

Out[64]:

[(5.1, 3.5, 1.4, 0.2, 'setosa'),

 (4.9, 3.0, 1.4, 0.2, 'setosa'),

 (4.7, 3.2, 1.3, 0.2, 'setosa'),

 (4.6, 3.1, 1.5, 0.2, 'setosa'),

 (5.0, 3.6, 1.4, 0.2, 'setosa')]

Now that we have a database with a table in it, we can connect to the table using the SQLTable data message handler and load the iris table into CAS. To do this, we need to create an SQLAlchemy engine. Then, we pass that engine and the name of the table to SQLTable.

# Create the SQLAlchemy database engine

In [65]: eng = dmh.SQLTable.create_engine('sqlite:///iris.db')

 

# Create the data message handler

In [66]: sqldmh = dmh.SQLTable('iris', eng)

 

# Load the database table into CAS

In [67]: out = conn.addtable(table='iris_sql', caslib='casuser',

                              **sqldmh.args.addtable)

 

In [68]: out

Out[68]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'IRIS_SQL'

 

[casTable]

 

 CASTable('IRIS_SQL', caslib='CASUSER(username)')

 

+ Elapsed: 0.0585s, user: 0.002s, mem: 0.239mb

 

# Reality check the data in the server

In [69]: conn.columninfo(table=dict(name='iris_sql',

   ....:                            caslib='casuser'))

Out[69]:

[ColumnInfo]

 

          Column  ID     Type  RawLength  FormattedLength  NFL  NFD

 0  sepal_length   1   double          8               12    0    0

 1   sepal_width   2   double          8               12    0    0

 2  petal_length   3   double          8               12    0    0

 3   petal_width   4   double          8               12    0    0

 4       species   5  varchar         10               10    0    0

 

+ Elapsed: 0.000665s, mem: 0.155mb

 

In [70]: conn.fetch(table=dict(name='iris_sql',

   ....:                       caslib='casuser'), to=5)

Out[70]:

[Fetch]

 

 Selected Rows from Table IRIS_SQL

 

    sepal_length  sepal_width  petal_length  petal_width species

 0           5.1          3.5           1.4          0.2  setosa

 1           4.9          3.0           1.4          0.2  setosa

 2           4.7          3.2           1.3          0.2  setosa

 3           4.6          3.1           1.5          0.2  setosa

 4           5.0          3.6           1.4          0.2  setosa

 

+ Elapsed: 0.00278s, user: 0.002s, mem: 1.62mb

Using the SQLQuery data message handler, we can load the result of a SELECT statement in the database. The result may be a subset of a table or the result of a complex merge operation. Since we have only one table in our example database, we’ll just use a WHERE clause to subset the data and then upload the data to CAS.

# Set up the SQLQuery data message handler with a query

In [71]: sqldmh = dmh.SQLQuery('''SELECT * FROM iris

   ....:                       WHERE species = "versicolor"

   ....:                       AND sepal_length > 6.6''', eng)

 

# Load the query result into CAS

In [72]: out = conn.addtable(table='iris_sql2', caslib='casuser',

   ....:                     **sqldmh.args.addtable)

 

In [73]: out

Out[73]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'IRIS_SQL2'

 

[casTable]

 

 CASTable('IRIS_SQL2', caslib='CASUSER(username)')

 

+ Elapsed: 0.0121s, user: 0.001s, mem: 0.212mb

 

# Reality check the result

In [74]: conn.fetch(table=dict(name='iris_sql2',

   ....:                       caslib='casuser'))

Out[74]:

[Fetch]

 

 Selected Rows from Table IRIS_SQL2

 

    sepal_length  sepal_width  petal_length  petal_width     species

 0           7.0          3.2           4.7          1.4  versicolor

 1           6.9          3.1           4.9          1.5  versicolor

 2           6.7          3.1           4.4          1.4  versicolor

 3           6.8          2.8           4.8          1.4  versicolor

 4           6.7          3.0           5.0          1.7  versicolor

 5           6.7          3.1           4.7          1.5  versicolor

 

+ Elapsed: 0.00379s, user: 0.003s, sys: 0.001s, mem: 1.58mb

So far, all of the data message handlers that we have looked at use a Pandas DataFrame in the background. That means that all of the data gets loaded into memory before it gets shipped to CAS. For small tables, it’s not an issue. But if you want to load large amounts of data, you could run into problems.

There is one data message handler that loads data from a Python DB-API 2.0 compliant database into CAS while just loading one buffer’s worth of data from the database in each pass. That is the topic for the next section.

Streaming Data from a Database into a CAS Table

The DBAPI data message handler loads data a little differently than all of the previously discussed data message handlers. It connects to a database and requests just enough data to fill a buffer to pass on to CAS. Then, it sends the data to CAS and refills the buffer with the next set of data from the database. It fills and refills the buffer until all of the data from the query has been exhausted. This method of data loading prevents the data message handler from storing all of the data in memory before loading it into CAS.
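This fill-and-refill pattern is the same one you get from a DB-API cursor's fetchmany method. Here is a self-contained sketch of the loop, using a throwaway in-memory SQLite table; the real handler sends each buffer to CAS rather than collecting it locally:

```python
import sqlite3

def stream_buffers(cursor, nrecs=10):
    """Yield rows in buffers of at most ``nrecs`` until the query is exhausted."""
    while True:
        buf = cursor.fetchmany(nrecs)
        if not buf:
            break
        yield buf

# Build a small demo table with 25 rows
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE t (n INTEGER)')
db.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(25)])

cur = db.execute('SELECT * FROM t')
buffers = list(stream_buffers(cur, nrecs=10))
print([len(b) for b in buffers])  # [10, 10, 5]
```

At no point does more than one buffer's worth of rows sit in memory, which is what lets the DBAPI handler scale past the DataFrame-based handlers.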

In the code below, we load the SQLite database that we created earlier and execute a query on a cursor. The SQLite module and the cursor are used in creating the DBAPI data message handler instance. Then, just as with all of the other data message handlers, the data message handler is used to generate the parameters to the addtable action.

# Import SQLite package and load database

In [75]: import sqlite3

 

In [76]: sqlc = sqlite3.connect('iris.db')

 

# Perform query

In [77]: c = sqlc.cursor()

 

In [78]: c.execute('SELECT * FROM iris')

Out[78]: <sqlite3.Cursor at 0x7f44b98f3030>

 

# Create data message handler with the current query

In [79]: dbdmh = dmh.DBAPI(sqlite3, c, nrecs=10)

 

# Upload the data to the server

In [80]: conn.addtable(table='iris_db', caslib='casuser',

   ....:               **dbdmh.args.addtable)

Out[80]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'IRIS_DB'

 

[casTable]

 

 CASTable('IRIS_DB', caslib='CASUSER(username)')

 

+ Elapsed: 0.0462s, user: 0.003s, mem: 0.239mb

 

# Reality check the data

In [81]: conn.columninfo(table=dict(name='iris_db',

   ....:                            caslib='casuser'))

Out[81]:

[ColumnInfo]

 

          Column  ID     Type  RawLength  FormattedLength  NFL  NFD

 0  sepal_length   1   double          8               12    0    0

 1   sepal_width   2   double          8               12    0    0

 2  petal_length   3   double          8               12    0    0

 3   petal_width   4   double          8               12    0    0

 4       species   5  varchar         10               10    0    0

 

+ Elapsed: 0.000766s, user: 0.001s, mem: 0.173mb

 

In [82]: conn.fetch(table=dict(name='iris_db',

                               caslib='casuser'), to=5)

Out[82]:

[Fetch]

 

 Selected Rows from Table IRIS_DB

 

    sepal_length  sepal_width  petal_length  petal_width species

 0           5.1          3.5           1.4          0.2  setosa

 1           4.9          3.0           1.4          0.2  setosa

 2           4.7          3.2           1.3          0.2  setosa

 3           4.6          3.1           1.5          0.2  setosa

 4           5.0          3.6           1.4          0.2  setosa

 

+ Elapsed: 0.00347s, user: 0.003s, mem: 1.58mb

You might notice in the preceding code that when we created the DBAPI data message handler, we specified the nrecs keyword argument. Valid on all data message handler classes, this argument specifies the number of records to include in the buffer that is being sent to CAS. In this case, it also specifies the number of rows from the database that can be stored in-memory at one time before being sent to CAS. You can tune this value as needed, but most of the time the default is fine.

Although this method lets you stream as much data into CAS as you need, it might not be the best way to load very large amounts of data. There is a lot of overhead in getting values from a database, converting them to Python objects, converting them into a format that CAS can read, and then uploading them. Only you can decide whether any of the methods described in the previous sections are right for your use.

So far, in this chapter, we have covered ways of uploading data to CAS from commonly known data formats. If you have data that you want to access from less commonly known sources, you might want to consider writing a custom data message handler. That is covered in the next section.

Writing Your Own Data Message Handlers

Although most people are well-served by the supplied data message handler classes, you might have a custom data format or an unsupported database that you want to use for loading data into CAS. The first thing you need to decide is whether a Python-based data message handler is the appropriate choice. As mentioned in the previous section, there is a lot of overhead in transforming data into and out of Python objects, as well as performance limitations in Python’s looping constructs. So, if you have a large amount of data to move into CAS, you might want to consider exporting it to CSV or another common format that can be copied to the server and loaded through the loadtable action. However, if you have a reasonable amount of data for this technique and simply want a way to load custom data from Python, then writing your own data message handler could be the solution that you are looking for.

Writing a custom data message handler isn’t difficult. Most of the work is already done for you in the CASDataMsgHandler class. This class does all of the low-level communication with the server and creates all of the needed data structures behind the scenes. All you need to do is create the attributes for each of the columns to be uploaded and implement a getrow method that can return a specified row of data from the underlying data store.

Here is a simple example that loads data from a hardcoded structure in the class definition:

class MyDMH(dmh.CASDataMsgHandler):

 

    def __init__(self):

        self.data = [

            ('Alfred',  'M', 14, 69,   112.5),

            ('Alice',   'F', 13, 56.5, 84),

            ('Barbara', 'F', 13, 65.3, 98),

            ('Carol',   'F', 14, 62.8, 102.5),

            ('Henry',   'M', 14, 63.5, 102.5),           

        ]

 

        vars = [

            dict(name='name', label='Name', type='varchar'),

            dict(name='sex', label='Sex', type='varchar'),

            dict(name='age', label='Age', type='int32'),

            dict(name='height', label='Height', type='double'),

            dict(name='weight', label='Weight', type='double'),

        ]

 

        super(MyDMH, self).__init__(vars)

 

    def getrow(self, row):

        try:

            return self.data[row]

        except IndexError:

            return

That’s all there is to it. The __init__ constructor sets up the data source and the descriptions of the columns of data, and the getrow method returns the requested row of data. When there is no more data, the getrow method returns None. Note that the getrow method takes a row number rather than simply returning the next row, because if there is an error in sending a buffer of data, the same row could be requested again.

In this example, the data and the variable definitions are hardcoded, but you can use any available facilities to generate the data and the variable definitions, as necessary.
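If your underlying source is a one-shot iterator, such as a file handle or a database cursor, you can still satisfy repeated requests for the same row by caching rows as they are read. A sketch of that pattern follows; the getrow name mirrors the data message handler method, but this is plain Python, not a SWAT class:

```python
class ReplayableRows:
    """Give random access by row number over a forward-only iterator."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = []

    def getrow(self, row):
        # Pull from the iterator until the requested row is cached
        while len(self._cache) <= row:
            try:
                self._cache.append(next(self._it))
            except StopIteration:
                return None   # no more data
        return self._cache[row]

rows = ReplayableRows(('row%d' % i,) for i in range(3))
print(rows.getrow(0), rows.getrow(2), rows.getrow(0), rows.getrow(5))
# ('row0',) ('row2',) ('row0',) None
```

A custom data message handler could delegate its getrow method to an object like this, giving it the required re-request behavior over any streaming source.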

Now that we have a new data message handler to work with, let’s try it out.

# Create an instance of the data message handler

In [83]: mydmh = MyDMH()

 

# Call addtable using the new data message handler

In [84]: conn.addtable(table='myclass', caslib='casuser',

   ....:               **mydmh.args.addtable)

Out[84]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'MYCLASS'

 

[casTable]

 

 CASTable('MYCLASS', caslib='CASUSER(username)')

 

+ Elapsed: 0.00909s, user: 0.002s, mem: 0.221mb

 

# Reality check the resulting table

In [85]: conn.columninfo(table=dict(name='myclass',

   ....:                            caslib='casuser'))

Out[85]:

[ColumnInfo]

 

    Column   Label  ID     Type  RawLength  FormattedLength  NFL  NFD

 0    name    Name   1  varchar          7                7    0    0

 1     sex     Sex   2  varchar          1                1    0    0

 2     age     Age   3    int32          4               12    0    0

 3  height  Height   4   double          8               12    0    0

 4  weight  Weight   5   double          8               12    0    0

 

+ Elapsed: 0.000739s, user: 0.001s, mem: 0.173mb

 

In [86]: conn.fetch(table=dict(name='myclass',

   ....:                       caslib='casuser'), to=5)

Out[86]:

[Fetch]

 

 Selected Rows from Table MYCLASS

 

       name sex   age  height  weight

 0   Alfred   M  14.0    69.0   112.5

 1    Alice   F  13.0    56.5    84.0

 2  Barbara   F  13.0    65.3    98.0

 3    Carol   F  14.0    62.8   102.5

 4    Henry   M  14.0    63.5   102.5

 

+ Elapsed: 0.00355s, user: 0.004s, mem: 1.58mb

As you can see from the preceding output, the data has been uploaded and is of the specified types. Let’s take a closer look at how to specify the variable definitions.

Variable Definition Details

In the previous section, we created a custom data message handler that used hardcoded data and variable definitions. If you look at the documentation for the addtable action, you’ll see that the variable definition section is slightly different. SWAT simplifies variable definitions somewhat since the buffer that is sent to CAS is also created by the SWAT client. The only required fields for the variable definitions in the vars parameter are name and type. The required field name specifies the name of the column, and type is the name of the CAS data type. Labels and SAS data formats for the columns can also be specified using the parameters label, format, and formattedlength. The remaining parameters in the addtable action should rarely be used.
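In code, then, a minimal vars list needs only name and type for each column, with label, format, and formattedlength as optional extras. The column names below are just for illustration:

```python
# Required fields only: name and type
required_vars = [
    dict(name='name', type='varchar'),
    dict(name='age', type='int32'),
]

# With the optional label and SAS format fields added
full_vars = [
    dict(name='name', type='varchar', label='Name'),
    dict(name='birthdate', type='date',
         label='Birth Date', format='DATE', formattedlength=12),
]
```

Either form can be passed as the vars argument when constructing a data message handler, as shown in the MyDMH examples.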

Adding Data Transformers

Sometimes the data that you are accessing needs to be converted to another form before it can be sent to CAS. This situation is common when dealing with dates and times that might be read in as character data, but they should be converted to the proper numeric value for CAS. You could do this step yourself while reading in the data, but there is also a feature in data message handlers that can do it for you “on the fly.” These are called data transformers.

Let’s use our custom data message handler from the previous section to demonstrate transformers. We’re going to add a column to the table called birthdate. The changes in the following code have been emphasized.

class MyDMH(dmh.CASDataMsgHandler):

 

    def __init__(self):

        self.data = [

            ('Alfred',  'M', 14, 69,   112.5, '1987-03-01'),

            ('Alice',   'F', 13, 56.5, 84,    '1988-06-12'),

            ('Barbara', 'F', 13, 65.3, 98,    '1988-12-13'),

            ('Carol',   'F', 14, 62.8, 102.5, '1987-04-17'),

            ('Henry',   'M', 14, 63.5, 102.5, '1987-01-30'),           

        ]

 

        vars = [

            dict(name='name', label='Name', type='varchar'),

            dict(name='sex', label='Sex', type='varchar'),

            dict(name='age', label='Age', type='int32'),

            dict(name='height', label='Height', type='double'),

            dict(name='weight', label='Weight', type='double'),

            dict(name='birthdate', label='Birth Date',

                 type='date', format='DATE', formattedlength=12),

        ]

 

        super(MyDMH, self).__init__(vars)

 

    def getrow(self, row):

        try:

            return self.data[row]

        except IndexError:

            return

If we tried to use this class without modifications, it would fail because the last column is defined as a numeric type, but the column’s data is a character type.  We need to be able to transform the character data to the correct numeric value before it gets sent to CAS.  To do that, we can use a utility function from the datamsghandlers module: str2cas_date. The utility functions str2cas_datetime and str2cas_time are also available.

class MyDMH(dmh.CASDataMsgHandler):

 

    def __init__(self):

        self.data = [

            ('Alfred',  'M', 14, 69,   112.5, '1987-03-01'),

            ('Alice',   'F', 13, 56.5, 84,    '1988-06-12'),

            ('Barbara', 'F', 13, 65.3, 98,    '1988-12-13'),

            ('Carol',   'F', 14, 62.8, 102.5, '1987-04-17'),

            ('Henry',   'M', 14, 63.5, 102.5, '1987-01-30'),

        ]

 

        vars = [

            dict(name='name', label='Name', type='varchar'),

            dict(name='sex', label='Sex', type='varchar'),

            dict(name='age', label='Age', type='int32'),

            dict(name='height', label='Height', type='double'),

            dict(name='weight', label='Weight', type='double'),

            dict(name='birthdate', label='Birth Date',

                 type='date', format='DATE', formattedlength=12),

        ]

 

        transformers = {

            'birthdate': dmh.str2cas_date,

        }

 

        super(MyDMH, self).__init__(vars, transformers=transformers)

 

    def getrow(self, row):

        try:

            return self.data[row]

        except IndexError:

            return

We can now use our custom data message handler to upload dates in the correct form.

In [87]: mydmh = MyDMH()

 

In [88]: conn.addtable(table='myclass', caslib='casuser',

   ....:               replace=True,

   ....:               **mydmh.args.addtable)

Out[88]:

[caslib]

 

 'CASUSER(username)'

 

[tableName]

 

 'MYCLASS'

 

[casTable]

 

 CASTable('MYCLASS', caslib='CASUSER(username)')

 

+ Elapsed: 0.0232s, sys: 0.001s, mem: 0.254mb

 

In [89]: conn.columninfo(table=dict(name='myclass',

   ....:                            caslib='casuser'))

Out[89]:

[ColumnInfo]

 

       Column       Label  ID     Type  RawLength  FormattedLength  

 0       name        Name   1  varchar          7                7

 1        sex         Sex   2  varchar          1                1

 2        age         Age   3    int32          4               12

 3     height      Height   4   double          8               12

 4     weight      Weight   5   double          8               12

 5  birthdate  Birth Date   6     date          4               12

 

   Format  NFL  NFD

 0           0    0

 1           0    0

 2           0    0

 3           0    0

 4           0    0

 5   DATE    0    0

 

+ Elapsed: 0.00105s, mem: 0.172mb

 

In [90]: conn.fetch(table=dict(name='myclass', caslib='casuser'),

   ....:            sastypes=False)

Out[90]:

[Fetch]

 

 Selected Rows from Table MYCLASS

 

       name sex  age  height  weight   birthdate

 0   Alfred   M   14    69.0   112.5  1987-03-01

 1    Alice   F   13    56.5    84.0  1988-06-12

 2  Barbara   F   13    65.3    98.0  1988-12-13

 3    Carol   F   14    62.8   102.5  1987-04-17

 4    Henry   M   14    63.5   102.5  1987-01-30

 

+ Elapsed: 0.00363s, user: 0.001s, sys: 0.002s, mem: 1.58mb

Although we used our custom data message handler to demonstrate data transformers, you can pass a dictionary of functions to any of the data message handler classes that are described in this chapter using the transformers parameter on the constructor.
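A transformer is simply a function that receives the raw value for a column and returns the value that should be sent to CAS. As a rough, hand-rolled illustration of what a date transformer does (assuming that CAS date values count days from the SAS epoch of 1960-01-01; in practice you would use dmh.str2cas_date rather than writing your own):

```python
from datetime import date

def to_cas_date(value):
    """Convert an ISO 'YYYY-MM-DD' string to days since 1960-01-01."""
    d = date.fromisoformat(value)
    return (d - date(1960, 1, 1)).days

# Map column names to transformer functions, just as in the
# transformers parameter of the data message handler constructors.
transformers = {'birthdate': to_cas_date}
```

For example, to_cas_date('1960-01-02') returns 1, one day past the epoch.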

That’s all there is to writing your own data message handlers. You just need to create a subclass of CASDataMsgHandler and override the __init__ constructor and the getrow method. The constructor creates the list of variable attributes and sets up the data source, and the getrow method returns a specified row of data. Transformers can be defined in the constructor to automatically convert data values to the appropriate form for CAS. With a custom data message handler, you have the capability to load data from any data source into CAS.
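The getrow contract itself can be exercised without a CAS connection: it takes a zero-based row index and returns a sequence of values, or None once the rows are exhausted. The following CAS-free sketch demonstrates just that contract over an in-memory list (a real handler would subclass dmh.CASDataMsgHandler and call its constructor with the variable definitions, as in the examples above):

```python
class ListSource:
    """Illustrates the getrow contract used by data message handlers."""

    def __init__(self, rows):
        self.rows = rows

    def getrow(self, row):
        try:
            return self.rows[row]
        except IndexError:
            return None  # a None return signals the end of the data

src = ListSource([('Alfred', 14), ('Alice', 13)])
```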

Managing Caslibs

We have used file system path-based caslibs throughout this chapter. These are the most common form of caslib, but there are other types such as HDFS, DNFS, ESP, and Hadoop. Coverage of all of them is beyond the scope of this book, but we cover how to set up and manage a new caslib in this section.

Creating a Caslib

Suppose that you have a collection of data files in a directory called /research/data/ and you want to create a caslib that enables access to only the files (but not the subdirectories) in that directory. The caslib should be accessible by all sessions and should not be the active caslib.

Consulting the documentation for the addcaslib action, you’ll see that we need to use the parameters subdirs, session, and activeonadd. We’ll also add a description just to be complete. Here is the code:

In [91]: conn.addcaslib(path='/research/data',

   ....:                caslib='research',

   ....:                description='Research Data',  

   ....:                subdirs=False,

   ....:                session=False,

   ....:                activeonadd=False)

NOTE: Cloud Analytic Services added the caslib 'research'.

Out[91]:

[CASLibInfo]

 

        Name  Type    Description      Path Definition  

 0  research  PATH  Research Data      /research/data/

 

    Subdirs  Local  Active  Personal  Hidden

 0      0.0    0.0     0.0       0.0     0.0

 

+ Elapsed: 0.00125s, user: 0.001s, mem: 0.12mb

As you can see from the output, we now have a new caslib that matches the parameters that we specified. Subdirs is set to 0, which indicates that subdirectories cannot be accessed. Local is set to 0, which means that it can be seen in any session. Note that this setting might require additional permissions for your user account. Active is set to 0, which means that it needs to be specified explicitly in action calls in conjunction with paths and table names.

Setting an Active Caslib

If you later decide that you want to make this caslib the active caslib, you can use the setsessopt action with the caslib parameter.

In [92]: conn.setsessopt(caslib='research')

NOTE: 'research' is now the active caslib.

Out[92]: + Elapsed: 0.00039s, mem: 0.085mb

 

In [93]: conn.caslibinfo(caslib='research')

Out[93]:

[CASLibInfo]

 

        Name  Type    Description      Path Definition  

 0  research  PATH  Research Data      /research/data/

 

    Subdirs  Local  Active  Personal  Hidden

 0      0.0    0.0     1.0       0.0     0.0

 

+ Elapsed: 0.000553s, mem: 0.0945mb

Dropping a Caslib

If you want to remove a caslib, it’s as simple as using the dropcaslib action.

In [94]: conn.dropcaslib('research')

NOTE: 'CASUSER(username)' is now the active caslib.

NOTE: Cloud Analytic Services removed the caslib 'research'.

Out[94]: + Elapsed: 0.000631s, sys: 0.001s, mem: 0.0682mb

Notice that if you drop a caslib that was the active caslib, the active flag gets set on the previously active caslib. Also, if you drop a global caslib, multiple users might be affected.

Conclusion

We have covered a lot of ground in this chapter about getting your data into CAS. We started with learning what caslibs and CAS tables are. We then used the loadtable action to load data files that are stored on the server. From the client side, we used the upload method on the CAS connection object to upload local files, URLs, and Pandas DataFrames. The next step was to load data from various sources using data message handlers. Finally, we wrote our own custom data message handlers.

We also went through the process of creating and dropping a caslib so that you can create your own.

With all of these techniques, you now have the ability to get data into CAS in various ways and to organize it. In the next chapter, we describe the CASAction and CASTable objects.

1 Support for data types might vary depending on your version of CAS.

2 US Department of Agriculture. “Crop Year 2014 Disaster Declarations.” US Department of Agriculture Data Catalog. Available at http://www.fsa.usda.gov/Internet/FSA_File/disaster_cty_list_ytd_14.xls. Accessed December 6, 2016.
