Chapter 6: Working with CAS Tables

Using CASTable Objects like a DataFrame

CAS Table Introspection

Computing Simple Statistics

Creating Plots from CASTable Data

Exporting CASTables to Other Formats

Sorting, Data Selection, and Iteration

Fetching Data with a Sort Order

Iterating through Columns and Rows

Techniques for Indexing and Selecting Data

Data Wrangling on the Fly

Creating Computed Columns

By Group Processing

Conclusion

In the previous chapter, we introduced the CASAction and CASTable objects. We covered the methods of setting parameters and calling actions on a CASTable object, but our coverage only included the basics. In this chapter we continue to work on CASTable objects on a variety of topics, including the DataFrame APIs, exporting data, creating visualizations, fetching and indexing, creating temporary columns, and by group processing.

Using CASTable Objects like a DataFrame

The Pandas DataFrame has taken the elegance of the Python programming language and applied it beautifully to data structures that are used in everyday statistics and data analysis. It supports many basic statistical analyses on data sets that can fit into memory on your computer, and it also defines interesting ways to index data sets using Python’s “fancy” indexing syntax.

The CASTable object adopts these same APIs and applies them to tables in CAS. This means that you can use the familiar DataFrame API on CAS tables. The result types that are returned by the CASTable versions of the Pandas DataFrame methods are, in most cases, the same as those that are returned by DataFrame methods of the same name. The primary difference is that the amount of data that is referenced by a CASTable object can be much greater than what can be handled on a single desktop machine.

In the following sections, we start with some methods and attributes that describe the CAS table itself, and then move on to methods for simple statistics and other areas of the DataFrame API.

CAS Table Introspection

Basic information about CAS tables includes the names of the columns, the data types, and the number of rows. Until now, we’ve used the tableinfo and columninfo CAS actions to get that information. However, we can also use the DataFrame attributes and methods for the same purpose. Probably the most common attribute that is used to get this type of information is the columns attribute. It simply lists the names of all of the columns in the table. We load the Iris data into both a CAS table and a DataFrame for comparison.

# Import pandas

In [1]: import pandas as pd

 

# Read iris.csv into a DataFrame

In [2]: df = pd.read_csv('/u/username/data/iris.csv')

 

In [3]: df.columns

Out[3]: Index(['sepal_length', 'sepal_width', 'petal_length',

                 'petal_width', 'species'], dtype='object')

 

# Load iris.csv into a CAS table

In [4]: tbl = conn.loadtable('data/iris.csv',

                              caslib='casuser').casTable

NOTE: Cloud Analytic Services made the file data/iris.csv available as table DATA.IRIS in caslib CASUSER(username).

 

In [5]: tbl.columns

Out[5]: Index(['sepal_length', 'sepal_width', 'petal_length',

                 'petal_width', 'species'], dtype='object')

In both cases, you see that the columns attribute returns a Pandas Index object that contains the names of the columns. You can also find out the data types using dtypes.

In [6]: df.dtypes

Out[6]:

sepal_length    float64

sepal_width     float64

petal_length    float64

petal_width     float64

species          object

dtype: object

 

In [7]: tbl.dtypes

Out[7]:

sepal_length     double

sepal_width      double

petal_length     double

petal_width      double

species         varchar

dtype: object

In this case, you’ll see a small difference. The data types that are supported by DataFrames and those that are supported by CAS tables have different names. The different names don’t matter since the Python float64 types is equivalent to the CAS double type. However, if you use the type names for filtering in other parts of the API, you want to use the CAS type names.

You can also get the counts of each data type using the get_dtype_counts method.

In [8]: df.get_dtype_counts()

Out[8]:

float64    4

object     1

dtype: int64

 

In [9]: tbl.get_dtype_counts()

Out[9]:

double     4

varchar    1

dtype: int64

 

To obtain the size or the shape of a CAS table, you use the size and the shape attributes just as in a DataFrame.

In [10]: tbl.size

Out[10]: 750

 

In [11]: tbl.shape

Out[11]: (150, 5)

 

For a quick summary of information about the entire table, you can use the info method.

In [12]: tbl.info()

CASTable('DATA.IRIS', caslib='CASUSER(username)')

Data columns (total 5 columns):

                N   Miss     Type

sepal_length  150  False   double

sepal_width   150  False   double

petal_length  150  False   double

petal_width   150  False   double

species       150  False  varchar

dtypes: double(4), varchar(1)

data size: 8450

vardata size: 1250

memory usage: 8656

Finally, to get a sample of the data, you can use the head or tail methods. Keep in mind that CAS commonly uses distributed data storage that has no particular sort order, so the values that you get from the head and tail methods are not deterministic unless the table has a specified sort order, which is covered later in this chapter.

In [13]: tbl.head(n=3)

Out[13]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width species

0           5.1          3.5           1.4          0.2  setosa

1           4.9          3.0           1.4          0.2  setosa

2           4.7          3.2           1.3          0.2  setosa

 

In [14]: tbl.tail(n=4)

Out[14]:

Selected Rows from Table DATA.IRIS

 

     sepal_length  sepal_width  petal_length  petal_width    species

146           6.4          2.8           5.6          2.2  virginica

147           6.3          2.8           5.1          1.5  virginica

148           6.1          2.6           5.6          1.4  virginica

149           7.7          3.0           6.1          2.3  virginica

As you can see from the preceding code, the head and tail methods accept the n parameter to indicate the number of records to retrieve (the default is 5). CAS commonly deals with large data sets and you might not want to bring all of the variables back. Using the columns parameter, it is possible to subset the columns that are retrieved.

In [15]: tbl.head(columns=['sepal_length', 'petal_length'])

Out[15]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  petal_length

0           5.1           1.4

1           4.9           1.4

2           4.7           1.3

3           4.6           1.5

4           5.0           1.4

Although this part of the DataFrame API is fairly simple, it already shows how easy working with CASTable objects can be. Let’s move on to computing some basic statistics.

Computing Simple Statistics

A common DataFrame method that is used to learn about the characteristics of your data is the describe method. Here is a simple example of getting basic summary statistics from a DataFrame and a CASTable, both using the describe method.

# Run the describe method on a DataFrame

In [16]: desc = df.describe()

 

In [17]: desc

Out[17]:

       sepal_length  sepal_width  petal_length  petal_width

count    150.000000   150.000000    150.000000   150.000000

mean       5.843333     3.054000      3.758667     1.198667

std        0.828066     0.433594      1.764420     0.763161

min        4.300000     2.000000      1.000000     0.100000

25%        5.100000     2.800000      1.600000     0.300000

50%        5.800000     3.000000      4.350000     1.300000

75%        6.400000     3.300000      5.100000     1.800000

max        7.900000     4.400000      6.900000     2.500000

 

In [18]: type(desc)

Out[18]: pandas.core.frame.DataFrame

 

# Run the describe method on a CASTable

In [19]: casdesc = tbl.describe()

 

In [20]: casdesc

Out[20]:

       sepal_length  sepal_width  petal_length  petal_width

count    150.000000   150.000000    150.000000   150.000000

mean       5.843333     3.054000      3.758667     1.198667

std        0.828066     0.433594      1.764420     0.763161

min        4.300000     2.000000      1.000000     0.100000

25%        5.100000     2.800000      1.600000     0.300000

50%        5.800000     3.000000      4.350000     1.300000

75%        6.400000     3.300000      5.100000     1.800000

max        7.900000     4.400000      6.900000     2.500000

 

In [21]: type(casdesc)

Out[21]: pandas.core.frame.DataFrame

As you can see, in both cases, we run the describe method and get back the same results even in the same data type. This might not seem impressive for only 150 records, but how about running it on a larger collection?  Our table contains 70 million rows and 46 columns, 35 of which are numeric. Running the describe method on our grid of 48 nodes requires only seconds.

In [22]: conn.tableinfo('datasources.megacorp5m')

Out[22]:

[TableInfo]

 

                     Name      Rows  Columns Encoding  

0  DATASOURCES.MEGACORP5M  70732833       46    utf-8

 

  CreateTimeFormatted    ModTimeFormatted JavaCharSet    CreateTime  

0  01Mar2016:12:57:08  01Mar2016:12:57:08        UTF8  1.772456e+09

 

        ModTime  Global  Repeated  View                  SourceName  

0  1.772456e+09       0         0     0  datasources/megacorp5m.csv

 

  SourceCaslib  Compressed

0   CASTestTmp           0

 

In [23]: %time mega.describe()

CPU times: user 261 ms, sys: 24 ms, total: 285 ms

Wall time: 12.1 s

Out[23]:

                  Date       DateByYear      DateByMonth  

count  70732833.000000  70732833.000000  70732833.000000

mean      14389.447935     14209.155789     14374.679408

std        2653.775328      2651.783194      2653.766319

min        7305.000000      7305.000000      7305.000000

25%       12460.000000     12419.000000     12450.000000

50%       14605.000000     14245.000000     14579.000000

75%       16517.000000     16437.000000     16496.000000

max       18992.000000     18628.000000     18962.000000

 

...

 

         StateLatitude    CityLongitude     CityLatitude

count  70732833.000000  70119950.000000  70119950.000000

mean         35.090183       -95.995716        34.365083

std           4.313599        12.363752         5.171787

min          31.051683      -122.418330        27.800283

25%          31.169358       -97.396111        30.694170

50%          32.615651       -92.289440        33.509170

75%          37.271892       -88.043060        37.775000

max          47.272951       -71.455280        47.606390

 

[8 rows x 35 columns]

Now that we’ve seen that we can analyze both small and large quantities of data in the familiar Pandas DataFrame API, let’s see what else it can do.

In addition to running the describe method with the default settings, you can use any of the options that it supports including percentiles, include, and exclude to modify the output. Let’s go back to the Iris data set and try these options.

The percentiles that you get with the default options are 25%, 50%, and 75%. We can customize those values by passing in float values between 0 and 1 to the percentiles option.

In [24]: tbl

Out[24]: CASTable('DATA.IRIS', caslib='CASUSER(username)')

 

In [25]: tbl.describe(percentiles=[0.3, 0.8])

Out[25]:

       sepal_length  sepal_width  petal_length  petal_width

count    150.000000   150.000000    150.000000   150.000000

mean       5.843333     3.054000      3.758667     1.198667

std        0.828066     0.433594      1.764420     0.763161

min        4.300000     2.000000      1.000000     0.100000

30%        5.250000     2.800000      1.700000     0.400000

50%        5.800000     3.000000      4.350000     1.300000

80%        6.550000     3.400000      5.350000     1.900000

max        7.900000     4.400000      6.900000     2.500000

As you can see, we now get the 30% and 80% percentiles in addition to 50%, which always gets computed by the DataFrame method. The include and exclude options enable you to specify the data types that should be included in the analysis. You can specify them as a list of strings that indicate the CAS data types that should be included, or you can use the terms ‘numeric’, ‘character’, and ‘all’ to refer to all numeric types, all character types, and all variables, respectively. The values for this option vary slightly from the DataFrame method since the names of the data types are different between Pandas and CAS, but the form of the option is the same. In either case, the default is to analyze numeric data only.

# Analyze only character data in the CAS table

In [26]: tbl.describe(include='character')

Out[26]:

          species

count         150

unique          3

top     virginica

freq           50

 

# Analyze all data in the CAS table

In [27]: tbl.describe(include='all')

Out[27]:

       sepal_length sepal_width petal_length petal_width    species

count           150         150          150         150        150

unique           35          23           43          22          3

top               5           3          1.5         0.2  virginica

freq             10          26           14          28         50

mean        5.84333       3.054      3.75867     1.19867        NaN

std        0.828066    0.433594      1.76442    0.763161        NaN

min             4.3           2            1         0.1     setosa

25%             5.1         2.8          1.6         0.3        NaN

50%             5.8           3         4.35         1.3        NaN

75%             6.4         3.3          5.1         1.8        NaN

max             7.9         4.4          6.9         2.5  virginica

 

# Same as above

In [28]: tbl.describe(include=['numeric', 'character'])

Out[28]:

       sepal_length sepal_width petal_length petal_width    species

count           150         150          150         150        150

unique           35          23           43          22          3

top               5           3          1.5         0.2  virginica

freq             10          26           14          28         50

mean        5.84333       3.054      3.75867     1.19867        NaN

std        0.828066    0.433594      1.76442    0.763161        NaN

min             4.3           2            1         0.1     setosa

25%             5.1         2.8          1.6         0.3        NaN

50%             5.8           3         4.35         1.3        NaN

75%             6.4         3.3          5.1         1.8        NaN

max             7.9         4.4          6.9         2.5  virginica

The last option that is supported is called stats. This is not a DataFrame option, but was added to allow statistics that are supported by the summary action to be included in the results as well. It accepts a list of strings, including ‘count’, ‘std’, ‘min’, ‘pct’, ‘max’, ‘unique’, ‘top’, ‘freq’, ‘nmiss’, ‘sum’, ‘stderr’, ‘var’, ‘uss’, ‘cv’, ‘tvalue’, and ‘probt’. The value of ‘pct’ includes all of the percentiles that have been computed. If ‘all’ is specified, all of the previously mentioned statistics are displayed.

In [29]: tbl.describe(stats=['count', 'nmiss', 'sum',

   ....:                     'probt', 'freq'])

Out[29]:

        sepal_length    sepal_width  petal_length   petal_width

count   1.500000e+02   1.500000e+02  1.500000e+02  1.500000e+02

nmiss   0.000000e+00   0.000000e+00  0.000000e+00  0.000000e+00

sum     8.765000e+02   4.581000e+02  5.638000e+02  1.798000e+02

probt  3.331256e-129  4.374977e-129  1.994305e-57  3.209704e-42

freq    1.000000e+01   2.600000e+01  1.400000e+01  2.800000e+01

 

In [30]: tbl.describe(stats='all')

Out[30]:

         sepal_length    sepal_width  petal_length   petal_width

count    1.500000e+02   1.500000e+02  1.500000e+02  1.500000e+02

unique   3.500000e+01   2.300000e+01  4.300000e+01  2.200000e+01

mean     5.843333e+00   3.054000e+00  3.758667e+00  1.198667e+00

std      8.280661e-01   4.335943e-01  1.764420e+00  7.631607e-01

min      4.300000e+00   2.000000e+00  1.000000e+00  1.000000e-01

25%      5.100000e+00   2.800000e+00  1.600000e+00  3.000000e-01

50%      5.800000e+00   3.000000e+00  4.350000e+00  1.300000e+00

75%      6.400000e+00   3.300000e+00  5.100000e+00  1.800000e+00

max      7.900000e+00   4.400000e+00  6.900000e+00  2.500000e+00

nmiss    0.000000e+00   0.000000e+00  0.000000e+00  0.000000e+00

sum      8.765000e+02   4.581000e+02  5.638000e+02  1.798000e+02

stderr   6.761132e-02   3.540283e-02  1.440643e-01  6.231181e-02

var      6.856935e-01   1.880040e-01  3.113179e+00  5.824143e-01

uss      5.223850e+03   1.427050e+03  2.583000e+03  3.023000e+02

cv       1.417113e+01   1.419759e+01  4.694272e+01  6.366747e+01

tvalue   8.642537e+01   8.626430e+01  2.609020e+01  1.923659e+01

probt   3.331256e-129  4.374977e-129  1.994305e-57  3.209704e-42

It is also possible to access the individual statistics through the DataFrame methods count, mean, std, and max. The methods nmiss, sum, stderr, var, uss, cv, tvalue, and probt are also available in order to complete the set that is provided by the summary action. These methods return a Pandas Series object that contains the values for each variable.

In [31]: tbl.count()

Out[31]:

sepal_length    150

sepal_width     150

petal_length    150

petal_width     150

species         150

dtype: int64

 

In [32]: tbl.mean()

Out[32]:

sepal_length    5.843333

sepal_width     3.054000

petal_length    3.758667

petal_width     1.198667

dtype: float64

 

In [33]: tbl.probt()

Out[33]:

sepal_length    3.331256e-129

sepal_width     4.374977e-129

petal_length     1.994305e-57

petal_width      3.209704e-42

dtype: float64

At the time of this writing, not all of the DataFrame methods were supported on the CASTable object. As development of the SWAT package continues, more of the DataFrame methods will likely become available.

Now that we have seen some basic DataFrame operations working on CASTable objects, let’s move on to some plotting operations.

Creating Plots from CASTable Data

The CASTable objects support all of the same chart types as the DataFrame object because they use the same code. Plotting is done on the client side, so all of the data being visualized must be downloaded first. The plotting DataFrame methods are mirrored in the CASTable objects, and the data is downloaded automatically. Although these methods are convenient, for iterative work that might require tuning your plots, you probably want to bring the data back to a local DataFrame and use the plotting methods on it to avoid downloading data from the server at each iteration.

The easiest way to generate a plot of data in a CASTable is to use the plot method. We’ll use our original Iris data set for this example. The following creates a plot:

In [34]: tbl.plot()

Out[34]: <matplotlib.axes.AxesSubplot at 0x7f4b6650c4e0>

If you use the Jupyter notebook as your client, your plot should show up in-line. If it doesn’t appear in-line, you might try running the following IPython magic command and then do the plot again:

In [35]: %matplotlib inline

If you are using something other than the Jupyter notebook and the plot doesn’t show up, you can try the following:

In [36]: from matplotlib.pyplot import show

 

In [37]: show()

The show function tells Matplotlib, which is the plotting package, to show the figure window. The following screenshot shows the plot that is generated on a Linux machine:

image

Of course, you can display subsets of columns as well using the standard DataFrame syntax.

In [38]: tbl[['sepal_length', 'sepal_width']].plot()

Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x7fa3efc23240>

 

In [39]: show()

Here is a plot of selected columns from the Iris data set:

image

Other chart types include area, bar, horizontal bar, box plot, density, histogram, KDE, and pie. Depending on your version of Pandas, these chart types are accessible in multiple ways. The traditional way is to specify the type of chart in the kind argument of the plot method as follows:

In [40]: tbl[['sepal_length', 'sepal_width']].plot(kind='area')

Out[40]: <matplotlib.axes._subplots.AxesSubplot at 0x7f77480ce908>

 

In [41]: show()

Here is an area plot of selected column of the Iris data set:

image

In newer versions of Pandas (releases 0.17.0 and later), you can use the individual chart type attributes on the plot method.

In [42]: tbl[['sepal_length', 'sepal_width']].plot.area()

Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7747ea1fd0>

 

In [43]: show()

Options to the plotting methods can be used to customize the output (see the Pandas documentation for additional details). However, this interface does not provide publication-ready charts. It is primarily for exploration. For highly customized graphics, you should use the Matplotlib package directly.

Exporting CASTables to Other Formats

The data in CAS tables can be saved to all of the formats that are supported by the Pandas DataFrame’s export methods. It accomplishes this by fetching the data and using the DataFrame export methods to save the data. Export methods include to_pickle, to_csv, to_hdf, to_sql, to_dict, to_excel, to_json, to_html, to_latex, to_stata, to_msgpack, to_gbq, to_records, to_sparse, to_dense, to_string, and to_clipboard. Keep in mind that all of these methods save the output on the client side. To save the data in a CAS table to a data source, you use the save CAS action.

The save action stores the data in the CAS table to the appropriate format for the caslib where the data is stored. The most basic form is the file-based caslib. You can store the data from a CAS table to either the CSV form or the SASHDAT form in file-based caslibs. To store the data, you simply specify the caslib to which the file should be saved, and you specify the path to the file in the name parameter.

In [44]: tbl.save(name='data/irisout.csv', caslib='casuser')

NOTE: Cloud Analytic Services created the file data/irisout.csv in caslib CASUSER(username).

Out[44]:

[caslib]

 

 'CASUSER(username)'

 

[name]

 

 'data/irisout.csv'

 

+ Elapsed: 0.0143s, user: 0.004s, mem: 2.14mb

 

In [45]: tbl.save(name='data/irisout.sashdat', caslib='casuser')

NOTE: Cloud Analytic Services created the file data/irisout.sashdat in caslib CASUSER(username).

Out[45]:

[caslib]

 

 'CASUSER(username)'

 

[name]

 

 'data/irisout.sashdat'

 

+ Elapsed: 0.0214s, user: 0.016s, sys: 0.003s, mem: 16.2mb

For caslibs with other data sources, see the SAS documentation for the options and parameters that are needed to export the data to a specific data source type.

Sorting, Data Selection, and Iteration

Some of the most basic data operations deal with sorting rows, selecting data based on labels or expressions, and iterating over rows or columns of a table. The following sections introduce techniques for accomplishing each data operation using the CASTable object.

Fetching Data with a Sort Order

As mentioned previously, data in a CAS table might be distributed across a grid, or it can be accessible in ways that do not specify a particular row order. This means that anytime you fetch data from CAS, you can’t depend on the order unless you specify a sort order when retrieving it. A sort order can be specified with the sortby parameter on the fetch action, but a different method is used with DataFrames. DataFrame methods preserve order and can be sorted using the sort_values method (or the sort method in older versions of Pandas). To mimic the DataFrame behavior, CASTable objects also support those sorting methods.

The simplest form of sort_values is to specify the names of the columns to sort on. The order is ascending by default.

In [46]: tbl.sort_values(['sepal_length', 'sepal_width'])

Out[46]: CASTable('DATA.IRIS',

                   caslib='CASUSER(username)').sort_values(

                          ['sepal_length', 'sepal_width'])

There is a subtle point in the preceding code that you might not have noticed if you aren’t familiar with Pandas. When you use the sort_values method on a DataFrame, it returns a new DataFrame object with an applied sort order unless you specify the inplace=True option. This is true with CASTable objects as well. The preceding code probably didn’t accomplish its sorting goal. Instead, it created a new CASTable object with the sort order applied, but since we didn’t save it to a variable, it simply disappeared. We have two choices: we can either use the inplace=True option on sort_values, or specify a variable for storing the returned value. Let’s store the value in a variable in the following example:

In [47]: sorttbl = tbl.sort_values(['sepal_length', 'sepal_width'])

 

In [48]: sorttbl

Out[48]: CASTable('DATA.IRIS',

                    caslib='CASUSER(username)').sort_values(

                           ['sepal_length', 'sepal_width'])

Now we have a new CASTable object with the sort order applied as indicated by the sort_values(…) that is shown in the string representation above. To clarify, this sorting does nothing on the server side. In fact, nothing was sent to the server when this operation was invoked. The sort_values method call simply binds a sort order to the CASTable object that gets used anytime that the CASTable object is used to retrieve data from that CAS table.

We can now use the head or tail method to retrieve a sample of the data in the specified sort order.

In [50]: sorttbl.head(10)

Out[50]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width species

0           4.3          3.0           1.1          0.1  setosa

1           4.4          2.9           1.4          0.2  setosa

2           4.4          3.0           1.3          0.2  setosa

3           4.4          3.2           1.3          0.2  setosa

4           4.5          2.3           1.3          0.3  setosa

5           4.6          3.1           1.5          0.2  setosa

6           4.6          3.2           1.4          0.2  setosa

7           4.6          3.4           1.4          0.3  setosa

8           4.6          3.6           1.0          0.2  setosa

9           4.7          3.2           1.6          0.2  setosa

 

In [51]: sorttbl.tail(5)

Out[51]:

Selected Rows from Table DATA.IRIS

 

     sepal_length  sepal_width  petal_length  petal_width    species

145           7.7          2.6           6.9          2.3  virginica

146           7.7          2.8           6.7          2.0  virginica

147           7.7          3.0           6.1          2.3  virginica

148           7.7          3.8           6.7          2.2  virginica

149           7.9          3.8           6.4          2.0  virginica

To change the order to descending, you can specify ascending=False or you can give a list of Booleans to indicate ascending or descending order for each column.  

In [52]: sorttbl = tbl.sort_values(['sepal_length', 'sepal_width'],

   ....:                            ascending=[False, True])

 

In [53]: sorttbl

Out[53]: CASTable('DATA.IRIS',

                   caslib='CASUSER(username)').sort_values(

                           ['sepal_length', 'sepal_width'],

                           ascending=[False, True])

If we fetch a sample of the data with our mixed descending/ascending sort order, you’ll get the following.

In [54]: sorttbl.head(10)

Out[54]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width    species

0           7.9          3.8           6.4          2.0  virginica

1           7.7          2.6           6.9          2.3  virginica

2           7.7          2.8           6.7          2.0  virginica

3           7.7          3.0           6.1          2.3  virginica

4           7.7          3.8           6.7          2.2  virginica

5           7.6          3.0           6.6          2.1  virginica

6           7.4          2.8           6.1          1.9  virginica

7           7.3          2.9           6.3          1.8  virginica

8           7.2          3.0           5.8          1.6  virginica

9           7.2          3.2           6.0          1.8  virginica

Now that we have seen how to apply a sort order to a CASTable, we can go to the next set of DataFrame features that pertain to iteration.

Iterating through Columns and Rows

CASTable objects support all of the DataFrame iteration methods. However, you should use the iteration methods judiciously. Although you can iterate through the rows of a CAS table, you probably don’t want to do that if your table contains millions of rows. Realistically, you should use iteration methods with CAS tables that contain only summarization data or CAS tables that have a constrained number of records.

Basic Iteration

Iterating through the column names is the most basic form of iteration on a CASTable object. Iterating through columns is done by using the table object itself as the iterator just like with DataFrame objects.

In [55]: for col in sorttbl:

   ....:     print(col)

   ....:

sepal_length

sepal_width

petal_length

petal_width

species

Of course, you can also wrap the table in the various Python iteration functions to get a different behavior. For example, you can use the enumerate function to give the index value of each column name, or you can even use the zip function to walk through multiple lists of the same size.

In [56]: for i, col in enumerate(sorttbl):

   ....:     print(i, col)

   ....:

0 sepal_length

1 sepal_width

2 petal_length

3 petal_width

4 species

 

In [57]: for col, dtype in zip(sorttbl, sorttbl.dtypes):

   ....:     print(col, dtype)

   ....:

sepal_length double

sepal_width double

petal_length double

petal_width double

species varchar

Iterating over Columns

Much like the basic iterator, the iteritems method iterates over the columns as well. However, in this case, the returned values are a pairing of the column name and a CASColumn object. CASColumn objects are just like CASTable objects except that they reference only a single column. They are analogous to  Pandas Series objects and are discussed later in the chapter.

In [58]: for col, obj in sorttbl.iteritems():

   ....:     print(col, obj)

   ....:     print('')

   ....:

sepal_length CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['sepal_length'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

 

sepal_width CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['sepal_width'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

 

petal_length CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['petal_length'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

 

petal_width CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['petal_width'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

 

species CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['species'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

Although the previous two iterators can be safely run on large tables (they don’t iterate through the data rows), the following two iterators go through all of the rows in a table. The iterrows method returns a tuple that contains the row index and a Pandas Series object that contains the values of the row.

In [59]: for row in sorttbl.iterrows():

   ....:     print(row)

   ....:

(0, sepal_length       4.3

sepal_width          3

petal_length       1.1

petal_width        0.1

species         setosa

Name: 0, dtype: object)

(1, sepal_length       4.4

sepal_width        2.9

petal_length       1.4

petal_width        0.2

species         setosa

Name: 1, dtype: object)

 

...

 

(149, sepal_length          7.9

sepal_width           3.8

petal_length          6.4

petal_width             2

species         virginica

Name: 149, dtype: object)

This method uses a Python generator to appear as if it fetches all of the data and iterates over it. However, the iterrows method pulls the data down in batches (configurable by the n argument) and waits until a new batch is needed before it fetches the next batch. The itertuples method works in a very similar way, but it returns each row as a tuple of values rather than a Series. The row index value can be disabled by passing index=False.

In [60]: for row in sorttbl.itertuples():

   ....:     print(row)

   ....:

(0, 4.2999999999999998, 3.0, 1.1000000000000001, 0.10000000000000001,  

 'setosa')

(1, 4.4000000000000004, 2.8999999999999999, 1.3999999999999999,

 0.20000000000000001, 'setosa')

(2, 4.4000000000000004, 3.0, 1.3, 0.20000000000000001, 'setosa')

 

. . .

 

(149, 7.9000000000000004, 3.7999999999999998, 6.4000000000000004, 2.0,

 'virginica')

Now that we’ve seen how to iterate through our data, let’s look at how we can retrieve items from the data.

Techniques for Indexing and Selecting Data

Indexing and selecting data in DataFrames comes in many forms. The loc and iloc attributes enable you to select rows that are based on labels and row indexes. The attributes, at and iat, enable you to select a scalar value at a particular position, given a label and a row index, respectively, paired with a column identifier. The ix attribute supports mixed integer and label access. Finally, using the Python __getitem__ functionality (through the bracket [...] syntax) on the DataFrame itself enables you to select columns and ranges of rows.

Selecting Columns by Label and Position

Probably, the most common way that you see the Pandas DataFrame being indexed is directly on the DataFrame itself. DataFrame objects (and similarly CASTables objects) enable you to select columns either by name or index. To select a single column, you simply index the CASTable using the name of the column.

In [61]: col = sorttbl['sepal_width']

 

In [62]: col

Out[62]: CASColumn('data.iris', caslib='casuser')

           ['sepal_width'].sort_values(['sepal_length',

                                        'sepal_width'],

                                       ascending=[False, True])

 

In [63]: col.head()

Out[63]:

0    3.8

1    2.6

2    2.8

3    3.0

4    3.8

Name: sepal_width, dtype: float64

The object that we get back is a CASColumn object. This object is analogous to a Pandas Series. Technically, CAS doesn’t support stand-alone columns. The CASColumn object is a specialized subclass of CASTable object that references a single column of the table. As you can see, when you retrieve a column from a CASTable, the resulting CASColumn takes with it all table and sort information that was on the parent CASTable. We cover more advanced uses of CASColumn objects later in the chapter. For now, let’s move on to more CASTable indexing.

An alternative way of selecting a column is using the Python attribute syntax. You can specify the name of the column on the CASTable in order to have it return a CASColumn.

In [64]: sorttbl.species

Out[64]: CASColumn('DATA.IRIS', caslib='CASUSER(username)')

             ['species'].sort_values(['sepal_length', 'sepal_width'],

                                      ascending=[False, True])

Although this way of accessing columns is prettier than the bracket syntax, it has some restrictions. The name of the column must be a legal Python attribute name, which cannot collide with any existing attributes or methods on the CASTable object. Although this syntax is appropriate for interactive use, it’s much safer to write programs using the bracket syntax.

In addition to retrieving columns from a CASTable, we can also retrieve subsets of a table. To do this, you use a list of columns as the index key.

In [65]: widths = sorttbl[['sepal_width', 'petal_width', 'species']]

In [66]: widths

Out[66]: CASTable('DATA.IRIS', caslib='CASUSER(username)')[

             ['sepal_width', 'petal_width', 'species']

         ].sort_values(['sepal_length', 'sepal_width'],

                        ascending=[False, True])

 

In [67]: widths.head()

Out[67]:

Selected Rows from Table DATA.IRIS

 

   sepal_width  petal_width    species

0          3.8          2.0  virginica

1          2.6          2.3  virginica

2          2.8          2.0  virginica

3          3.0          2.3  virginica

4          3.8          2.2  virginica

If you specify a list of column names (or even a list containing a single column name), you’ll get back a new CASTable object that includes only the referenced columns. The data in this table is still the same as the data in the parent table. The local CASTable object simply references the columns that it needs.

Of course, now that we have this new table, we can use any of the previously described CASTable methods on it (including the CAS action and DataFrame methods).

In [68]: widths.describe()

Out[68]:

       sepal_width  petal_width

count   150.000000   150.000000

mean      3.054000     1.198667

std       0.433594     0.763161

min       2.000000     0.100000

25%       2.800000     0.300000

50%       3.000000     1.300000

75%       3.300000     1.800000

max       4.400000     2.500000

 

In [69]: widths.columninfo()

Out[69]:

[ColumnInfo]

 

         Column  ID     Type  RawLength  FormattedLength  NFL  NFD

 0  sepal_width   2   double          8               12    0    0

 1  petal_width   4   double          8               12    0    0

 2      species   5  varchar         10               10    0    0

 

+ Elapsed: 0.000839s, sys: 0.001s, mem: 0.188mb

In addition to basic column selection such as the examples shown here, you can also use the various accessor properties of  DataFrame objects to select columns in other ways. That’s covered in the next section.

Selecting Data by Label and Position

Although DataFrames enable you to select rows using the loc, iloc, at, iat, and ix accessors, CAS tables do not have the concept of a row label or an index. These accessors still exist on CASTable objects, but they are limited to column selection only.

Let’s use our sorted table from the previous example to select some data points.  

In [70]: sorttbl

Out[70]: CASTable('DATA.IRIS',

                  caslib='CASUSER(username)').sort_values(

                         ['sepal_length', 'sepal_width'],

                         ascending=[False, True])

The loc and iloc data accessors on CASTable objects can select only columns. The loc accessor selects columns by name, whereas the iloc accessor selects columns by index. You must specify an empty slice as the first index parameter.

# Select the petal_width column

In [71]: sorttbl.loc[:, 'petal_width'].head()

Out[71]:

0    2.0

1    2.3

2    2.0

3    2.3

4    2.2

Name: petal_width, dtype: float64

 

# Select a range of columns

In [72]: sorttbl.loc[:, 'sepal_length':'petal_length'].head()

Out[72]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length

0           7.9          3.8           6.4

1           7.7          2.6           6.9

2           7.7          2.8           6.7

3           7.7          3.0           6.1

4           7.7          3.8           6.7

 

# Select a list of columns

In [73]: sorttbl.loc[:, ['petal_width', 'sepal_width']].head()

Out[73]:

Selected Rows from Table DATA.IRIS

 

   petal_width  sepal_width

0          2.0          3.8

1          2.3          2.6

2          2.0          2.8

3          2.3          3.0

4          2.2          3.8

The iloc accessor works much like loc except that it uses integer indexes.

In [74]: sorttbl.iloc[:, 3].head()

Out[74]:

0    2.0

1    2.3

2    2.0

3    2.3

4    2.2

Name: petal_width, dtype: float64

 

In [75]: sorttbl.iloc[:, 0:3].head()

Out[75]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length

0           7.9          3.8           6.4

1           7.7          2.6           6.9

2           7.7          2.8           6.7

3           7.7          3.0           6.1

4           7.7          3.8           6.7

 

In [76]: sorttbl.iloc[:, [3, 1]].head()

Out[76]:

Selected Rows from Table DATA.IRIS

 

   petal_width  sepal_width

0          2.0          3.8

1          2.3          2.6

2          2.0          2.8

3          2.3          3.0

4          2.2          3.8

Finally, the ix accessor enables you to mix labels and position indexes.

In [77]: sorttbl.ix[:, [3, 'sepal_width']].head()

Out[77]:

Selected Rows from Table DATA.IRIS

 

   petal_width  sepal_width

0          2.0          3.8

1          2.3          2.6

2          2.0          2.8

3          2.3          3.0

4          2.2          3.8

 

In [78]: sorttbl.ix[:, 'sepal_width'].head()

Out[78]:

0    3.8

1    2.6

2    2.8

3    3.0

4    3.8

Name: sepal_width, dtype: float64

 

In [79]: sorttbl.ix[:, 'sepal_width':-2].head()

Out[79]:

Selected Rows from Table DATA.IRIS

 

   sepal_width  petal_length

0          3.8           6.4

1          2.6           6.9

2          2.8           6.7

3          3.0           6.1

4          3.8           6.7

 

In [80]: sorttbl.ix[:, ['sepal_width', 3, 4]].head()

Out[80]:

Selected Rows from Table DATA.IRIS

 

   sepal_width  petal_width    species

0          3.8          2.0  virginica

1          2.6          2.3  virginica

2          2.8          2.0  virginica

3          3.0          2.3  virginica

4          3.8          2.2  virginica

Although the preceding accessors can be useful, they are somewhat limited because the nature of distributed data prevents selection by row index. However, there is a more powerful indexing feature on CASTable objects that allows for various forms of advanced selection of data.

Dynamic Data Selection

Rather than using fixed values of rows and columns to select data, we can create conditions that are based on the data in the table to determine which rows to select. The specification of conditions is done using the same syntax as that used by DataFrame objects. CASColumn objects support Python’s various comparison operators and build a filter that subsets the rows in the table. You can then use the result of that comparison to index into a CASTable. It sounds much more complicated than it is, so let’s look at an example.

If we want to get a CASTable that contains only values where petal_length is greater than 6.5, we can use the following expression to create our filter:

In [81]: expr = sorttbl.petal_length > 6.5

This expression creates a computed column that is used in a where expression on the CASTable. This expression can then be used as an index value for a CASTable. Indexing this way essentially creates a Boolean mask. Wherever the expression values are true, the rows of the table are returned. Wherever the expression is false, the rows are filtered out.

In [82]: newtbl = sorttbl[expr]

 

In [83]: newtbl.head()

Out[83]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width    species

0           7.7          2.6           6.9          2.3  virginica

1           7.7          2.8           6.7          2.0  virginica

2           7.7          3.8           6.7          2.2  virginica

3           7.6          3.0           6.6          2.1  virginica

These two steps are commonly entered on one line.

In [84]: newtbl = sorttbl[sorttbl.petal_length > 6.5]

 

In [85]: newtbl.head()

Out[85]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width    species

0           7.7          2.6           6.9          2.3  virginica

1           7.7          2.8           6.7          2.0  virginica

2           7.7          3.8           6.7          2.2  virginica

3           7.6          3.0           6.6          2.1  virginica

We can further filter rows out by indexing another comparison expression.

In [86]: newtbl2 = newtbl[newtbl.petal_width < 2.2]

 

In [87]: newtbl2.head()

Out[87]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width    species

0           7.7          2.8           6.7          2.0  virginica

1           7.6          3.0           6.6          2.1  virginica

Comparisons can be joined using the bitwise comparison operators & (and) and | (or). You must be careful with these operators though due to operator precedence. Bitwise comparison has a higher precedence than comparisons such as greater-than and less-than, so you need to enclose your comparisons in parentheses.

In [88]: sorttbl[(sorttbl.petal_length > 6.5) &

   ....:         (sorttbl.petal_width < 2.2)].head()

Out[88]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width    species

0           7.7          2.8           6.7          2.0  virginica

1           7.6          3.0           6.6          2.1  virginica

In all cases, we are not changing anything about the underlying data in CAS. We are simply constructing a query that is executed with the CASTable when it is used as the parameter in a CAS action. You can see what is happening behind the scenes by displaying the resulting CASTable objects.

In [89]: sorttbl[(sorttbl.petal_length > 6.5) &

   ....:         (sorttbl.petal_width < 2.2)]

Out[89]: CASTable('data.iris', caslib='casuser',  

              computedvars=['_gt_6_', '_and_8_', '_lt_7_'],

              computedvarsprogram='_gt_6_ = (petal_length > 6.5);

                                   _lt_7_ = (petal_width < 2.2);

                                   _and_8_ = (_gt_6_ and _lt_7_); ',

              where='(_and_8_)')

             [['sepal_length', 'sepal_width',

               'petal_length', 'petal_width',

               'species']].sort_values(['sepal_length',

                                        'sepal_width'],  

                                       ascending=[False, True])

You can also do mathematical operations on columns with constants or on other columns within your comparisons.

In [90]: sorttbl[(sorttbl.petal_length + sorttbl.petal_width)

   ....:          * 2 > 17.5].head()

Out[90]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width    species

0           7.7          2.6           6.9          2.3  virginica

1           7.7          3.8           6.7          2.2  virginica

The list of supported operators is shown in the following table:

Operator Numeric Data Character Data
+ (add)
- (subtract)  
* (multiply)  
/ (divide)  
% (modulo)  
** (power)  

The supported comparison operators are shown in the following table.

Operator Numeric Data Character Data
== (equality)
!= (inequality)
< (less than)
> (greater than)
<= (less than or equal to)
>= (greater than or equal to)

As you can see in the preceding tables, you can do comparisons on character columns as well. It is also possible to use many of the Python string methods on the column values. These methods are accessed through the str attribute of the column, just like in DataFrames.

In [91]: sorttbl[sorttbl.species.str.upper().str.startswith('SET')

   ....:         ].head()

Out[91]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width species

0           5.8          4.0           1.2          0.2  setosa

1           5.7          3.8           1.7          0.3  setosa

2           5.7          4.4           1.5          0.4  setosa

3           5.5          3.5           1.3          0.2  setosa

4           5.5          4.2           1.4          0.2  setosa

The following table shows the string methods that are supported on character columns:

Method Description
col.str.capitalize Capitalize the first character, and lowercase the remaining characters.
col.str.contains Return Boolean values indicating whether the pattern exists in the data values.
col.str.count Count occurrences of the pattern in each value.
col.str.endswith Return Boolean values indicating whether the values end with the pattern.
col.str.find Return the lowest index of the pattern in each value, or return -1 if the pattern is not found.
col.str.index Is the same as the find method (the preceding entry) except that a ValueError is raised if the pattern is not found.
col.str.len Compute the length of each value.
col.str.lower Lowercase the entire column.
col.str.lstrip Strip the leading spaces.
col.str.repeat Duplicate the value by the specified number of times.
col.str.rfind Return the highest index of the pattern in each value, or return -1 if the pattern is not found.
col.str.rindex Is the same as the rfind method (the preceding entry) except that a ValueError is raised if the pattern is not found.
col.str.rstrip Strip the trailing whitespace characters.
col.str.slice Slice a substring from the value.
col.str.startswith Return Boolean values indicating the value starts with the pattern.
col.str.strip Strip the leading and trailing whitespace characters.
col.str.title Capitalize each word in the value, and lowercase the remaining characters.
col.str.upper Uppercase the value.
col.str.isalnum Return Boolean values indicating whether the value is all alphanumeric values.
col.str.isalpha Return Boolean values indicating whether the value is all alphabetic characters.
col.str.isdigit Return Boolean values indicating whether the value is all digits.
col.str.isspace Return Boolean values indicating whether the value is all whitespace characters.
col.str.islower Return Boolean values indicating whether the value is all lowercase.
col.str.isupper Return Boolean values indicating whether the value is all uppercase.
col.str.istitle Return Boolean values indicating whether the value is equivalent to the title representation. (See the title method.)
col.str.isnumeric Return Boolean values indicating whether the value is in a numeric representation.
col.str.isdecimal Return Boolean values indicating whether the value is in a decimal representation.

In addition to the string column methods, you can also use the following date/time properties on date, time, and datetime columns.

Property Description
col.dt.year Return the year of the date/time/datetime.
col.dt.month Return the month of the date/time/datetime, where January is 1 and December is 12.
col.dt.day Return the day of the month of the date/time/datetime.
col.dt.hour Return the hours of the date/time/datetime (0-24).
col.dt.minute Return the minutes of the date/time/datetime (0-60).
col.dt.second Return the seconds of the date/time/datetime (0-60).
col.dt.microsecond Return the microseconds of the date/time/datetime (0-999999).
col.dt.nanosecond Return the nanoseconds of the date/time/datetime. Since CAS has a resolution of only one microsecond, this property always returns zero.
col.dt.week Return the week of the year of the date/time/datetime.
col.dt.weekofyear Is the same as the week property.
col.dt.dayofweek Return the day of the week of the date/time/datetime. Monday is 0 and Sunday is 6.
col.dt.weekday Is the same as the dayofweek property.
col.dt.dayofyear Return the ordinal day of the year of the date/time/datetime.
col.dt.quarter Return the year’s quarter of the date/time/datetime.
col.dt.is_month_start Boolean indicating whether the date/time/datetime is the first day of the month.
col.dt.is_month_end Boolean indicating whether the date/time/datetime is the last day of the month.
col.dt.is_quarter_start Boolean indicating whether the date/time/datetime is the first day of the quarter.
col.dt.is_quarter_end Boolean indicating whether the date/time/datetime is the last day of the quarter.
col.dt.is_year_start Boolean indicating whether the date/time/datetime is the first day of the year.
col.dt.is_year_end Boolean indicating whether the date/time/datetime is the last day of the year.
col.dt.daysinmonth Returns the number of days in the month of the date/time/datetime.
col.dt.days_in_month Same as daysinmonth.

Note that when you call any of the previous methods on CAS time values, any method that returns a date-related component uses the current day. When you call time-related methods on a CAS date, the values are always zero.

It’s easy to create powerful filters that are executed in CAS while still using the Python syntax. However, the similarities to DataFrames don’t end there. CASTable objects can also create computed columns and by groups using similar techniques. Those topics are covered in the next section.

Data Wrangling on the Fly

There’s a certain amount of “data wrangling” that is required to prepare your data for reporting or modeling. Data wrangling refers to data manipulation, including the transformation of data from one format to another format, BY-group processing, and data aggregation. Such manipulations on large data sets often become bottlenecks to understanding your data, because most of the steps to manipulate data require that you create a new copy of the data or that you move data around the nodes of your cluster. In CAS, several ways are provided to speed up the data wrangling steps such as temporary computed columns and group by processing. Neither requires copying data or moving data.

Creating Computed Columns

Rather than using the CASColumn-based expressions to filter a CASTable object you can also use them to build computed columns. Instead of comparing the result of an expression to another value, you simply set it to a key on the CASTable object. Here is an example that adds the sepal_length and sepal_width and then multiplies the result by 2. The result is a CASColumn object that we set to sepal_factor.

In [92]: sorttbl['sepal_factor'] = ((sorttbl.sepal_length +

   ....:                             sorttbl.sepal_width) * 2)

 

In [93]: sorttbl.head()

Out[93]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width    species  

0           7.9          3.8           6.4          2.0  virginica

1           7.7          2.6           6.9          2.3  virginica

2           7.7          2.8           6.7          2.0  virginica

3           7.7          3.0           6.1          2.3  virginica

4           7.7          3.8           6.7          2.2  virginica

 

   sepal_factor

0          23.4

1          20.6

2          21.0

3          21.4

4          23.0

The sepal_factor column is a computed column that is not present in the original CAS table. It is a temporary column generated “on the fly” when an action is executed. They exist only for the duration of the action, and are not visible to any other action. Computed columns are efficient because you can use computed columns to dynamically filter or expand the data without actually duplicating the data to create those extra filters or columns. You can create computed columns from other computed columns as well. In the following code, we’ll use our previously computed column as part of the expression for a new computed column.

In [94]: sorttbl['total_factor'] = sorttbl.sepal_factor +

   ....:                           sorttbl.petal_width +

   ....:                           sorttbl.petal_length

 

In [95]: sorttbl.head()

Out[95]:

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width    species  

0           7.9          3.8           6.4          2.0  virginica

1           7.7          2.6           6.9          2.3  virginica

2           7.7          2.8           6.7          2.0  virginica

3           7.7          3.0           6.1          2.3  virginica

4           7.7          3.8           6.7          2.2  virginica

 

   sepal_factor  total_factor

0          23.4          31.8

1          20.6          29.8

2          21.0          29.7

3          21.4          29.8

4          23.0          31.9

The output from the head method shows that the extra columns are now available on the table. You can even create computed columns that are based on constants or Python expressions (evaluated on the client) rather than on data in the table.

In [96]: sorttbl['names'] = 'sepal / petal'

 

In [97]: sorttbl.head()

Out[97]:

   sepal_length  sepal_width  petal_length  petal_width    species  

0           7.9          3.8           6.4          2.0  virginica

1           7.7          2.6           6.9          2.3  virginica

2           7.7          2.8           6.7          2.0  virginica

3           7.7          3.0           6.1          2.3  virginica

4           7.7          3.8           6.7          2.2  virginica

 

   sepal_factor  total_factor          names

0          23.4           8.4  sepal / petal

1          20.6           9.2  sepal / petal

2          21.0           8.7  sepal / petal

3          21.4           8.4  sepal / petal

4          23.0           8.9  sepal / petal

You can use any of the numeric, string, and date/time methods that are described in the filtering section to construct computed columns as well.  

In [98]: sorttbl['cap_names'] = sorttbl.names.str.title()

 

In [99]: sorttbl.head()

Selected Rows from Table DATA.IRIS

 

   sepal_length  sepal_width  petal_length  petal_width species  

0           4.3          3.0           1.1          0.1  setosa

1           4.4          2.9           1.4          0.2  setosa

2           4.4          3.0           1.3          0.2  setosa

3           4.4          3.2           1.3          0.2  setosa

4           4.5          2.3           1.3          0.3  setosa

 

   sepal_factor  total_factor          names      cap_names

0          14.6          15.8  sepal / petal  Sepal / Petal

1          14.6          16.2  sepal / petal  Sepal / Petal

2          14.8          16.3  sepal / petal  Sepal / Petal

3          15.2          16.7  sepal / petal  Sepal / Petal

4          13.6          15.2  sepal / petal  Sepal / Petal

Now that we have seen that we can filter our data and create computed columns, let’s look at the topic of grouping data by data values.

By Group Processing

Many analytic actions and Pandas DataFrame methods support grouping by values in one or more columns in a table. There are a couple of ways to apply by grouping in CASTable objects, one of which is borrowed from the DataFrame API.

The most direct way of applying by groupings to a CASTable object is to use the groupby parameter. We’ll start with the Iris table for this example.

In [100]: tbl.set_param('groupby', ['species'])

 

In [101]: tbl

Out[101]: CASTable('DATA.IRIS', caslib='CASUSER(username)',

                  groupby=['species'])

Now that we have some grouping variables set, we’ll run the summary action.

In [102]: tbl.summary(subset=['min', 'max'])

Out[102]:

[ByGroupInfo]

 

 ByGroupInfo

 

       species   species_f       _key_

 0      setosa      setosa      setosa

 1  versicolor  versicolor  versicolor

 2   virginica   virginica   virginica

 

[ByGroup1.Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                Column  Min  Max

 species

 setosa   sepal_length  4.3  5.8

 setosa    sepal_width  2.3  4.4

 setosa   petal_length  1.0  1.9

 setosa    petal_width  0.1  0.6

 

[ByGroup2.Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                   Column  Min  Max

 species

 versicolor  sepal_length  4.9  7.0

 versicolor   sepal_width  2.0  3.4

 versicolor  petal_length  3.0  5.1

 versicolor   petal_width  1.0  1.8

 

[ByGroup3.Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                  Column  Min  Max

 species

 virginica  sepal_length  4.9  7.9

 virginica   sepal_width  2.2  3.8

 virginica  petal_length  4.5  6.9

 virginica   petal_width  1.4  2.5

 

+ Elapsed: 0.0082s, user: 0.006s, sys: 0.007s, mem: 1.94mb

You see that we now get multiple tables as output rather than just the single table that the summary action produces when running without by groupings. The first table is called ByGroupInfo. This table contains information about all of the by grouping tables that follow. This information is more useful if there are a large number of by groupings and you want to know about them in advance so that you can prepare for the tables as they come from the server. The remaining tables correspond to each of the by groupings.

Each by grouping generates a new output table. The name of the key for that output is the name of the table prefixed by a label ByGroup#, where # is the index of the grouping (starting with 1). Without by grouping applied, the summary action creates a table called Summary. With by grouping applied, the result keys are ByGroup1.Summary, ByGroup2.Summary, and so on.

We mentioned that the DataFrame method for applying by groupings works with CASTable objects as well. The DataFrame groupby method is used to group the data. In the following example, we delete the groupby parameter from the table, and then use the groupby method to apply by groupings instead.

In [103]: tbl.del_param('groupby')

 

In [104]: tbl

Out[104]: CASTable('DATA.IRIS', caslib='CASUSER(username)')

 

In [105]: grptbl = tbl.groupby(['species'])

 

In [106]: grptbl

Out[106]: <swat.cas.table.CASTableGroupBy at 0x7f2148e12358>

You notice that when using this method, we don’t get a CASTable object back. Instead, we get a CASTableGroupBy object. This object corresponds to the Pandas GroupBy object. This object works much like a DataFrame object in many ways. For example, you can still call most DataFrame methods, call CAS actions on it, access columns, and so on, but the CASTableGroupBy object also adds methods for traversing the by-groupings in various ways.

Using the summary action on our grptbl variable shows that we get the same result as when we used the groupby parameter on the CASTable object.

In [107]: grptbl.summary(subset=['min', 'max'])

Out[107]:

[ByGroupInfo]

 

 ByGroupInfo

 

       species   species_f       _key_

 0      setosa      setosa      setosa

 1  versicolor  versicolor  versicolor

 2   virginica   virginica   virginica

 

[ByGroup1.Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                Column  Min  Max

 species

 setosa   sepal_length  4.3  5.8

 setosa    sepal_width  2.3  4.4

 setosa   petal_length  1.0  1.9

 setosa    petal_width  0.1  0.6

 

[ByGroup2.Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                   Column  Min  Max

 species

 versicolor  sepal_length  4.9  7.0

 versicolor   sepal_width  2.0  3.4

 versicolor  petal_length  3.0  5.1

 versicolor   petal_width  1.0  1.8

 

[ByGroup3.Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                  Column  Min  Max

 species

 virginica  sepal_length  4.9  7.9

 virginica   sepal_width  2.2  3.8

 virginica  petal_length  4.5  6.9

 virginica   petal_width  1.4  2.5

 

+ Elapsed: 0.00862s, user: 0.008s, sys: 0.009s, mem: 1.98mb

Concatenating By Groups

If you already use Pandas, you might want to combine all of the by-grouping tables into a single DataFrame so that you can continue your workflow, as usual. Fortunately, there is a method on the CASResults object that can be used to combine by-grouping tables into a single DataFrame: concat_bygroups.

In [108]: grpsumm = grptbl.summary(subset=['min', 'max'])

 

In [109]: grpsumm.concat_bygroups()

Out[109]:

[Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                   Column  Min  Max

 species

 setosa      sepal_length  4.3  5.8

 setosa       sepal_width  2.3  4.4

 setosa      petal_length  1.0  1.9

 setosa       petal_width  0.1  0.6

 versicolor  sepal_length  4.9  7.0

 versicolor   sepal_width  2.0  3.4

 versicolor  petal_length  3.0  5.1

 versicolor   petal_width  1.0  1.8

 virginica   sepal_length  4.9  7.9

 virginica    sepal_width  2.2  3.8

 virginica   petal_length  4.5  6.9

 virginica    petal_width  1.4  2.5

This method preserves all of the DataFrame index information so that it can be treated just like the output of the native DataFrame methods with by-groups enabled.

If you prefer to keep your by-group tables separate, there are a couple of other methods on the CASResults object that can query specific table names or groups.

Selecting Result Keys by Table Name

Unlike the summary action, some CAS actions return multiple result table names, including the correlation action. We won’t explain the correlation action in detail, but we use it in our example to select a particular table name from a result set. We still use our grptbl variable from the previous example as our starting point.

In [110]: grpcorr = grptbl.correlation()

 

In [111]: grpcorr

Out[111]:

[ByGroupInfo]

 

 ByGroupInfo

 

       species   species_f       _key_

 0      setosa      setosa      setosa

 1  versicolor  versicolor  versicolor

 2   virginica   virginica   virginica

 

[ByGroup1.CorrSimple]

 Summary Statistics in Correlation Analysis for DATA.IRIS

 

              Variable     N   Mean    Sum    StdDev  Minimum  

 species

 setosa   sepal_length  50.0  5.006  250.3  0.352490      4.3

 setosa    sepal_width  50.0  3.418  170.9  0.381024      2.3

 setosa   petal_length  50.0  1.464   73.2  0.173511      1.0

 setosa    petal_width  50.0  0.244   12.2  0.107210      0.1

 

          Maximum

 species

 setosa       5.8

 setosa       4.4

 setosa       1.9

 setosa       0.6

 

[ByGroup1.Correlation]

 

 Pearson Correlation Coefficients for DATA.IRIS

 

              Variable  sepal_length  sepal_width  petal_length  

 species

 setosa   sepal_length      1.000000     0.746780      0.263874

 setosa    sepal_width      0.746780     1.000000      0.176695

 setosa   petal_length      0.263874     0.176695      1.000000

 setosa    petal_width      0.279092     0.279973      0.306308

 

          petal_width

 species

 setosa      0.279092

 setosa      0.279973

 setosa      0.306308

 setosa      1.000000

 

... truncated ...

As you can see in the preceding output, the correlation action produces tables that are named CorrSimple and Correlation. These tables appear in each by grouping in the output. If we want to select only the Correlation tables from the output, we can use the get_tables method. This method takes the name of a table and returns a list of all of the tables in the result set with that name. The name of the table should not include the ByGroup#. prefix.

In [112]: grpcorr.get_tables('Correlation')

Out[112]:

[Pearson Correlation Coefficients for DATA.IRIS

 

              Variable  sepal_length  sepal_width  petal_length  

 species

 setosa   sepal_length      1.000000     0.746780      0.263874

 setosa    sepal_width      0.746780     1.000000      0.176695

 setosa   petal_length      0.263874     0.176695      1.000000

 setosa    petal_width      0.279092     0.279973      0.306308

 

          petal_width

 species

 setosa      0.279092

 setosa      0.279973

 setosa      0.306308

 setosa      1.000000  , Pearson Correlation Coefficients for DATA.IRIS

 

                 Variable  sepal_length  sepal_width  petal_length  

 species

 versicolor  sepal_length      1.000000     0.525911      0.754049

 versicolor   sepal_width      0.525911     1.000000      0.560522

 versicolor  petal_length      0.754049     0.560522      1.000000

 versicolor   petal_width      0.546461     0.663999      0.786668

 

             petal_width

 species

 versicolor     0.546461

 versicolor     0.663999

 versicolor     0.786668

 versicolor     1.000000  , Pearson Correlation Coefficients for DATA.IRIS

 

                Variable  sepal_length  sepal_width  petal_length  

 species

 virginica  sepal_length      1.000000     0.457228      0.864225

 virginica   sepal_width      0.457228     1.000000      0.401045

 virginica  petal_length      0.864225     0.401045      1.000000

 virginica   petal_width      0.281108     0.537728      0.322108

 

            petal_width

 species

 virginica     0.281108

 virginica     0.537728

 virginica     0.322108

 virginica     1.000000  ]

In the preceding code, you see that we now have a list of DataFrames that correspond to the Correlation tables. You can use the concat function in SWAT to combine the Correlation tables into a single DataFrame.

In [113]: swat.concat(grpcorr.get_tables('Correlation'))

Out[113]:

Pearson Correlation Coefficients for DATA.IRIS

 

                Variable  sepal_length  sepal_width  petal_length  

species

setosa      sepal_length      1.000000     0.746780      0.263874

setosa       sepal_width      0.746780     1.000000      0.176695

setosa      petal_length      0.263874     0.176695      1.000000

setosa       petal_width      0.279092     0.279973      0.306308

versicolor  sepal_length      1.000000     0.525911      0.754049

versicolor   sepal_width      0.525911     1.000000      0.560522

versicolor  petal_length      0.754049     0.560522      1.000000

versicolor   petal_width      0.546461     0.663999      0.786668

virginica   sepal_length      1.000000     0.457228      0.864225

virginica    sepal_width      0.457228     1.000000      0.401045

virginica   petal_length      0.864225     0.401045      1.000000

virginica    petal_width      0.281108     0.537728      0.322108

             petal_width

species

setosa         0.279092

setosa         0.279973

setosa         0.306308

setosa         1.000000

versicolor     0.546461

versicolor     0.663999

versicolor     0.786668

versicolor     1.000000

virginica      0.281108

virginica      0.537728

virginica      0.322108

virginica      1.000000

We’ve seen how to select specific table names from result sets, but what if we want to select a particular by group?  That’s covered in the next section.

Selecting a Specific By Group

The CASResults object has a method to select by groups from a result set using the values of the by-grouping variables. This method is called get_group. If you specify a list of values, get_group matches the by-grouping variables as they were specified via the groupby method.

In [114]: grpsumm.get_group(['versicolor'])

Out[114]:

[Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                   Column  Min  Max

 species

 versicolor  sepal_length  4.9  7.0

 versicolor   sepal_width  2.0  3.4

 versicolor  petal_length  3.0  5.1

 versicolor   petal_width  1.0  1.8

You can also use keyword parameters on the get_group method to specify the desired by-grouping variables.

In [115]: grpsumm.get_group(species='versicolor')

Out[115]:

[Summary]

 

 Descriptive Statistics for DATA.IRIS

 

                   Column  Min  Max

 species

 versicolor  sepal_length  4.9  7.0

 versicolor   sepal_width  2.0  3.4

 versicolor  petal_length  3.0  5.1

 versicolor   petal_width  1.0  1.8

The final topic for discussion about by groupings is the case in which a CAS action has support for multiple sets of by groupings.

Handling Multiple Sets of By Groups

Some actions, such as mdsummary, enable you to specify multiple sets of by groups. The output for such actions is similar to what we saw in the examples in the preceding section except that the result keys add another prefix: ByGroupSet#.

In [116]: grpmdsumm =

   .....:     tbl.mdsummary(sets=[dict(groupby=['sepal_length']),

   .....:                         dict(groupby=['petal_length'])])

 

In [117]: list(grpmdsumm.keys())

Out[117]:

['ByGroupSet1.ByGroupInfo',

 'ByGroupSet1.ByGroup1.MDSummary',

 'ByGroupSet1.ByGroup2.MDSummary',

 'ByGroupSet1.ByGroup3.MDSummary',

 

 ...

 

 'ByGroupSet2.ByGroupInfo',

 'ByGroupSet2.ByGroup1.MDSummary',

 'ByGroupSet2.ByGroup2.MDSummary',

 'ByGroupSet2.ByGroup3.MDSummary',

 

 ...]

As you can see from the preceding output, we get a ByGroupSet#.ByGroupInfo table for each by-group set followed by all of the tables for that by-group set. We can extract a particular by-group set from the results using the get_set method.

In [118]: grpmdsumm.get_set(1)

Out[118]:

[ByGroupInfo]

 

 ByGroupSet1.ByGroupInfo

 

     sepal_length sepal_length_f         _key_

 0            5.0              5             5

 1            6.0              6             6

 2            7.0              7             7

 3            4.3            4.3           4.3

 4            4.4            4.4           4.4

 5            4.5            4.5           4.5

 6            4.6            4.6           4.6

 7            4.7            4.7           4.7

 8            4.8            4.8           4.8

 9            4.9            4.9           4.9

 ..           ...            ...           ...

 25           6.7            6.7           6.7

 26           6.8            6.8           6.8

 27           6.9            6.9           6.9

 28           7.1            7.1           7.1

 29           7.2            7.2           7.2

 30           7.3            7.3           7.3

 31           7.4            7.4           7.4

 32           7.6            7.6           7.6

 33           7.7            7.7           7.7

 34           7.9            7.9           7.9

 

 [35 rows x 3 columns]

 

[ByGroup1.MDSummary]

 

 Descriptive Statistics for DATA.IRIS

 

                     Column  Min  Max     N  NMiss  Mean   Sum  

 sepal_length

 5             sepal_length  5.0  5.0  10.0    0.0  5.00  50.0

 5              sepal_width  2.0  3.6  10.0    0.0  3.12  31.2

 5             petal_length  1.2  3.5  10.0    0.0  1.84  18.4

 5              petal_width  0.2  1.0  10.0    0.0  0.43   4.3

 

                    Std    StdErr       Var     USS    CSS  

 sepal_length

 5             0.000000  0.000000  0.000000  250.00  0.000

 5             0.543241  0.171788  0.295111  100.00  2.656

 5             0.834266  0.263818  0.696000   40.12  6.264

 5             0.326769  0.103333  0.106778    2.81  0.961

 

                      CV     TValue         ProbT

 sepal_length

 5              0.000000        NaN           NaN

 5             17.411580  18.161922  2.121189e-08

 5             45.340551   6.974502  6.505533e-05

 5             75.992719   4.161290  2.443024e-03

 

[ByGroup2.MDSummary]

 

 Descriptive Statistics for DATA.IRIS

 

                     Column  Min  Max    N  NMiss      Mean   Sum  

 sepal_length

 6             sepal_length  6.0  6.0  6.0    0.0  6.000000  36.0

 6              sepal_width  2.2  3.4  6.0    0.0  2.733333  16.4

 6             petal_length  4.0  5.1  6.0    0.0  4.650000  27.9

 6              petal_width  1.0  1.8  6.0    0.0  1.500000   9.0

 

                    Std    StdErr       Var     USS       CSS  

 sepal_length

 6             0.000000  0.000000  0.000000  216.00  0.000000

 6             0.471876  0.192642  0.222667   45.94  1.113333

 6             0.403733  0.164823  0.163000  130.55  0.815000

 6             0.268328  0.109545  0.072000   13.86  0.360000

 

                      CV     TValue     ProbT

 sepal_length

 6              0.000000        NaN       NaN

 6             17.263745  14.188635  0.000031

 6              8.682421  28.212059  0.000001

 6             17.888544  13.693064  0.000037

 

... truncated ...

You see that when using the get_set method, we get back a new CASResults object that corresponds to the output from that particular by-group set. You can then use get_tables, concat_bygroups, or get_group on that result just as we did in the examples in the previous section.

In [119]: grpmdsumm.get_set(1).concat_bygroups()

Out[119]:

[MDSummary]

 

 Descriptive Statistics for DATA.IRIS

 

                     Column  Min  Max     N  NMiss      Mean   Sum  

 sepal_length

 5             sepal_length  5.0  5.0  10.0    0.0  5.000000  50.0

 5              sepal_width  2.0  3.6  10.0    0.0  3.120000  31.2

 5             petal_length  1.2  3.5  10.0    0.0  1.840000  18.4

 5              petal_width  0.2  1.0  10.0    0.0  0.430000   4.3

 6             sepal_length  6.0  6.0   6.0    0.0  6.000000  36.0

 6              sepal_width  2.2  3.4   6.0    0.0  2.733333  16.4

 6             petal_length  4.0  5.1   6.0    0.0  4.650000  27.9

 6              petal_width  1.0  1.8   6.0    0.0  1.500000   9.0

 7             sepal_length  7.0  7.0   1.0    0.0  7.000000   7.0

 7              sepal_width  3.2  3.2   1.0    0.0  3.200000   3.2

 ...                    ...  ...  ...   ...    ...       ...   ...

 7.6           petal_length  6.6  6.6   1.0    0.0  6.600000   6.6

 7.6            petal_width  2.1  2.1   1.0    0.0  2.100000   2.1

 7.7           sepal_length  7.7  7.7   4.0    0.0  7.700000  30.8

 7.7            sepal_width  2.6  3.8   4.0    0.0  3.050000  12.2

 7.7           petal_length  6.1  6.9   4.0    0.0  6.600000  26.4

 7.7            petal_width  2.0  2.3   4.0    0.0  2.200000   8.8

 7.9           sepal_length  7.9  7.9   1.0    0.0  7.900000   7.9

 7.9            sepal_width  3.8  3.8   1.0    0.0  3.800000   3.8

 7.9           petal_length  6.4  6.4   1.0    0.0  6.400000   6.4

 7.9            petal_width  2.0  2.0   1.0    0.0  2.000000   2.0

 

... truncated ...

Conclusion

If you are familiar with Pandas DataFrames, hopefully, the use of much of the API in CASTable objects helps you transition to the world of CAS. We covered the basics of using some of the supported DataFrame methods, using various forms of indexing and iteration, filtering using expressions, creating computed columns, and using by groups. With all of that “under your belt,” we can look at some more advanced statistical data in the following chapters.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.148.144.228