Chapter 6: Working with CAS Tables

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 6: Working with CAS Tables

Using CASTable Objects like a DataFrame

CAS Table Introspection

Computing Simple Statistics

Creating Plots from CASTable Data

Exporting CASTables to Other Formats

Sorting, Data Selection, and Iteration

Fetching Data with a Sort Order

Iterating through Columns and Rows

Techniques for Indexing and Selecting Data

Data Wrangling on the Fly

Creating Computed Columns

By Group Processing

Conclusion

In the previous chapter, we introduced the CASAction and CASTable objects. We covered the methods of setting parameters and calling actions on a CASTable object, but our coverage only included the basics. In this chapter we continue to work on CASTable objects on a variety of topics, including the DataFrame APIs, exporting data, creating visualizations, fetching and indexing, creating temporary columns, and by group processing.

Using CASTable Objects like a DataFrame

The Pandas DataFrame has taken the elegance of the Python programming language and applied it beautifully to data structures that are used in everyday statistics and data analysis. It supports many basic statistical analyses on data sets that can fit into memory on your computer, and it also defines interesting ways to index data sets using Python’s “fancy” indexing syntax.

The CASTable object adopts these same APIs and applies them to tables in CAS. This means that you can use the familiar DataFrame API on CAS tables. The result types that are returned by the CASTable versions of the Pandas DataFrame methods are, in most cases, the same as those that are returned by DataFrame methods of the same name. The primary difference is that the amount of data that is referenced by a CASTable object can be much greater than what can be handled on a single desktop machine.

In the following sections, we start with some methods and attributes that describe the CAS table itself, and then move on to methods for simple statistics and other areas of the DataFrame API.

CAS Table Introspection

Basic information about CAS tables includes the names of the columns, the data types, and the number of rows. Until now, we’ve used the tableinfo and columninfo CAS actions to get that information. However, we can also use the DataFrame attributes and methods for the same purpose. Probably the most common attribute that is used to get this type of information is the columns attribute. It simply lists the names of all of the columns in the table. We load the Iris data into both a CAS table and a DataFrame for comparison.

# Import pandas

In [1]: import pandas as pd

# Read iris.csv into a DataFrame

In [2]: df = pd.read_csv('/u/username/data/iris.csv')

In [3]: df.columns

Out[3]: Index(['sepal_length', 'sepal_width', 'petal_length',

'petal_width', 'species'], dtype='object')

# Load iris.csv into a CAS table

In [4]: tbl = conn.loadtable('data/iris.csv',

caslib='casuser').casTable

NOTE: Cloud Analytic Services made the file data/iris.csv available as table DATA.IRIS in caslib CASUSER(username).

In [5]: tbl.columns

Out[5]: Index(['sepal_length', 'sepal_width', 'petal_length',

'petal_width', 'species'], dtype='object')

In both cases, you see that the columns attribute returns a Pandas Index object that contains the names of the columns. You can also find out the data types using dtypes.

In [6]: df.dtypes

Out[6]:

sepal_length float64

sepal_width float64

petal_length float64

petal_width float64

species object

dtype: object

In [7]: tbl.dtypes

Out[7]:

sepal_length double

sepal_width double

petal_length double

petal_width double

species varchar

dtype: object

In this case, you’ll see a small difference. The data types that are supported by DataFrames and those that are supported by CAS tables have different names. The different names don’t matter since the Python float64 types is equivalent to the CAS double type. However, if you use the type names for filtering in other parts of the API, you want to use the CAS type names.

You can also get the counts of each data type using the get_dtype_counts method.

In [8]: df.get_dtype_counts()

Out[8]:

float64 4

object 1

dtype: int64

In [9]: tbl.get_dtype_counts()

Out[9]:

double 4

varchar 1

dtype: int64

To obtain the size or the shape of a CAS table, you use the size and the shape attributes just as in a DataFrame.

In [10]: tbl.size

Out[10]: 750

In [11]: tbl.shape

Out[11]: (150, 5)

For a quick summary of information about the entire table, you can use the info method.

In [12]: tbl.info()

CASTable('DATA.IRIS', caslib='CASUSER(username)')

Data columns (total 5 columns):

N Miss Type

sepal_length 150 False double

sepal_width 150 False double

petal_length 150 False double

petal_width 150 False double

species 150 False varchar

dtypes: double(4), varchar(1)

data size: 8450

vardata size: 1250

memory usage: 8656

Finally, to get a sample of the data, you can use the head or tail methods. Keep in mind that CAS commonly uses distributed data storage that has no particular sort order, so the values that you get from the head and tail methods are not deterministic unless the table has a specified sort order, which is covered later in this chapter.

In [13]: tbl.head(n=3)

Out[13]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 setosa

1 4.9 3.0 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

In [14]: tbl.tail(n=4)

Out[14]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

146 6.4 2.8 5.6 2.2 virginica

147 6.3 2.8 5.1 1.5 virginica

148 6.1 2.6 5.6 1.4 virginica

149 7.7 3.0 6.1 2.3 virginica

As you can see from the preceding code, the head and tail methods accept the n parameter to indicate the number of records to retrieve (the default is 5). CAS commonly deals with large data sets and you might not want to bring all of the variables back. Using the columns parameter, it is possible to subset the columns that are retrieved.

In [15]: tbl.head(columns=['sepal_length', 'petal_length'])

Out[15]:

Selected Rows from Table DATA.IRIS

sepal_length petal_length

0 5.1 1.4

1 4.9 1.4

2 4.7 1.3

3 4.6 1.5

4 5.0 1.4

Although this part of the DataFrame API is fairly simple, it already shows how easy working with CASTable objects can be. Let’s move on to computing some basic statistics.

Computing Simple Statistics

A common DataFrame method that is used to learn about the characteristics of your data is the describe method. Here is a simple example of getting basic summary statistics from a DataFrame and a CASTable, both using the describe method.

# Run the describe method on a DataFrame

In [16]: desc = df.describe()

In [17]: desc

Out[17]:

sepal_length sepal_width petal_length petal_width

count 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.054000 3.758667 1.198667

std 0.828066 0.433594 1.764420 0.763161

min 4.300000 2.000000 1.000000 0.100000

25% 5.100000 2.800000 1.600000 0.300000

50% 5.800000 3.000000 4.350000 1.300000

75% 6.400000 3.300000 5.100000 1.800000

max 7.900000 4.400000 6.900000 2.500000

In [18]: type(desc)

Out[18]: pandas.core.frame.DataFrame

# Run the describe method on a CASTable

In [19]: casdesc = tbl.describe()

In [20]: casdesc

Out[20]:

sepal_length sepal_width petal_length petal_width

count 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.054000 3.758667 1.198667

std 0.828066 0.433594 1.764420 0.763161

min 4.300000 2.000000 1.000000 0.100000

25% 5.100000 2.800000 1.600000 0.300000

50% 5.800000 3.000000 4.350000 1.300000

75% 6.400000 3.300000 5.100000 1.800000

max 7.900000 4.400000 6.900000 2.500000

In [21]: type(casdesc)

Out[21]: pandas.core.frame.DataFrame

As you can see, in both cases, we run the describe method and get back the same results even in the same data type. This might not seem impressive for only 150 records, but how about running it on a larger collection? Our table contains 70 million rows and 46 columns, 35 of which are numeric. Running the describe method on our grid of 48 nodes requires only seconds.

In [22]: conn.tableinfo('datasources.megacorp5m')

Out[22]:

[TableInfo]

Name Rows Columns Encoding

0 DATASOURCES.MEGACORP5M 70732833 46 utf-8

CreateTimeFormatted ModTimeFormatted JavaCharSet CreateTime

0 01Mar2016:12:57:08 01Mar2016:12:57:08 UTF8 1.772456e+09

ModTime Global Repeated View SourceName

0 1.772456e+09 0 0 0 datasources/megacorp5m.csv

SourceCaslib Compressed

0 CASTestTmp 0

In [23]: %time mega.describe()

CPU times: user 261 ms, sys: 24 ms, total: 285 ms

Wall time: 12.1 s

Out[23]:

Date DateByYear DateByMonth

count 70732833.000000 70732833.000000 70732833.000000

mean 14389.447935 14209.155789 14374.679408

std 2653.775328 2651.783194 2653.766319

min 7305.000000 7305.000000 7305.000000

25% 12460.000000 12419.000000 12450.000000

50% 14605.000000 14245.000000 14579.000000

75% 16517.000000 16437.000000 16496.000000

max 18992.000000 18628.000000 18962.000000

...

StateLatitude CityLongitude CityLatitude

count 70732833.000000 70119950.000000 70119950.000000

mean 35.090183 -95.995716 34.365083

std 4.313599 12.363752 5.171787

min 31.051683 -122.418330 27.800283

25% 31.169358 -97.396111 30.694170

50% 32.615651 -92.289440 33.509170

75% 37.271892 -88.043060 37.775000

max 47.272951 -71.455280 47.606390

[8 rows x 35 columns]

Now that we’ve seen that we can analyze both small and large quantities of data in the familiar Pandas DataFrame API, let’s see what else it can do.

In addition to running the describe method with the default settings, you can use any of the options that it supports including percentiles, include, and exclude to modify the output. Let’s go back to the Iris data set and try these options.

The percentiles that you get with the default options are 25%, 50%, and 75%. We can customize those values by passing in float values between 0 and 1 to the percentiles option.

In [24]: tbl

Out[24]: CASTable('DATA.IRIS', caslib='CASUSER(username)')

In [25]: tbl.describe(percentiles=[0.3, 0.8])

Out[25]:

sepal_length sepal_width petal_length petal_width

count 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.054000 3.758667 1.198667

std 0.828066 0.433594 1.764420 0.763161

min 4.300000 2.000000 1.000000 0.100000

30% 5.250000 2.800000 1.700000 0.400000

50% 5.800000 3.000000 4.350000 1.300000

80% 6.550000 3.400000 5.350000 1.900000

max 7.900000 4.400000 6.900000 2.500000

As you can see, we now get the 30% and 80% percentiles in addition to 50%, which always gets computed by the DataFrame method. The include and exclude options enable you to specify the data types that should be included in the analysis. You can specify them as a list of strings that indicate the CAS data types that should be included, or you can use the terms ‘numeric’, ‘character’, and ‘all’ to refer to all numeric types, all character types, and all variables, respectively. The values for this option vary slightly from the DataFrame method since the names of the data types are different between Pandas and CAS, but the form of the option is the same. In either case, the default is to analyze numeric data only.

# Analyze only character data in the CAS table

In [26]: tbl.describe(include='character')

Out[26]:

species

count 150

unique 3

top virginica

freq 50

# Analyze all data in the CAS table

In [27]: tbl.describe(include='all')

Out[27]:

sepal_length sepal_width petal_length petal_width species

count 150 150 150 150 150

unique 35 23 43 22 3

top 5 3 1.5 0.2 virginica

freq 10 26 14 28 50

mean 5.84333 3.054 3.75867 1.19867 NaN

std 0.828066 0.433594 1.76442 0.763161 NaN

min 4.3 2 1 0.1 setosa

25% 5.1 2.8 1.6 0.3 NaN

50% 5.8 3 4.35 1.3 NaN

75% 6.4 3.3 5.1 1.8 NaN

max 7.9 4.4 6.9 2.5 virginica

# Same as above

In [28]: tbl.describe(include=['numeric', 'character'])

Out[28]:

sepal_length sepal_width petal_length petal_width species

count 150 150 150 150 150

unique 35 23 43 22 3

top 5 3 1.5 0.2 virginica

freq 10 26 14 28 50

mean 5.84333 3.054 3.75867 1.19867 NaN

std 0.828066 0.433594 1.76442 0.763161 NaN

min 4.3 2 1 0.1 setosa

25% 5.1 2.8 1.6 0.3 NaN

50% 5.8 3 4.35 1.3 NaN

75% 6.4 3.3 5.1 1.8 NaN

max 7.9 4.4 6.9 2.5 virginica

The last option that is supported is called stats. This is not a DataFrame option, but was added to allow statistics that are supported by the summary action to be included in the results as well. It accepts a list of strings, including ‘count’, ‘std’, ‘min’, ‘pct’, ‘max’, ‘unique’, ‘top’, ‘freq’, ‘nmiss’, ‘sum’, ‘stderr’, ‘var’, ‘uss’, ‘cv’, ‘tvalue’, and ‘probt’. The value of ‘pct’ includes all of the percentiles that have been computed. If ‘all’ is specified, all of the previously mentioned statistics are displayed.

In [29]: tbl.describe(stats=['count', 'nmiss', 'sum',

....: 'probt', 'freq'])

Out[29]:

sepal_length sepal_width petal_length petal_width

count 1.500000e+02 1.500000e+02 1.500000e+02 1.500000e+02

nmiss 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

sum 8.765000e+02 4.581000e+02 5.638000e+02 1.798000e+02

probt 3.331256e-129 4.374977e-129 1.994305e-57 3.209704e-42

freq 1.000000e+01 2.600000e+01 1.400000e+01 2.800000e+01

In [30]: tbl.describe(stats='all')

Out[30]:

sepal_length sepal_width petal_length petal_width

count 1.500000e+02 1.500000e+02 1.500000e+02 1.500000e+02

unique 3.500000e+01 2.300000e+01 4.300000e+01 2.200000e+01

mean 5.843333e+00 3.054000e+00 3.758667e+00 1.198667e+00

std 8.280661e-01 4.335943e-01 1.764420e+00 7.631607e-01

min 4.300000e+00 2.000000e+00 1.000000e+00 1.000000e-01

25% 5.100000e+00 2.800000e+00 1.600000e+00 3.000000e-01

50% 5.800000e+00 3.000000e+00 4.350000e+00 1.300000e+00

75% 6.400000e+00 3.300000e+00 5.100000e+00 1.800000e+00

max 7.900000e+00 4.400000e+00 6.900000e+00 2.500000e+00

nmiss 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00

sum 8.765000e+02 4.581000e+02 5.638000e+02 1.798000e+02

stderr 6.761132e-02 3.540283e-02 1.440643e-01 6.231181e-02

var 6.856935e-01 1.880040e-01 3.113179e+00 5.824143e-01

uss 5.223850e+03 1.427050e+03 2.583000e+03 3.023000e+02

cv 1.417113e+01 1.419759e+01 4.694272e+01 6.366747e+01

tvalue 8.642537e+01 8.626430e+01 2.609020e+01 1.923659e+01

probt 3.331256e-129 4.374977e-129 1.994305e-57 3.209704e-42

It is also possible to access the individual statistics through the DataFrame methods count, mean, std, and max. The methods nmiss, sum, stderr, var, uss, cv, tvalue, and probt are also available in order to complete the set that is provided by the summary action. These methods return a Pandas Series object that contains the values for each variable.

In [31]: tbl.count()

Out[31]:

sepal_length 150

sepal_width 150

petal_length 150

petal_width 150

species 150

dtype: int64

In [32]: tbl.mean()

Out[32]:

sepal_length 5.843333

sepal_width 3.054000

petal_length 3.758667

petal_width 1.198667

dtype: float64

In [33]: tbl.probt()

Out[33]:

sepal_length 3.331256e-129

sepal_width 4.374977e-129

petal_length 1.994305e-57

petal_width 3.209704e-42

dtype: float64

At the time of this writing, not all of the DataFrame methods were supported on the CASTable object. As development of the SWAT package continues, more of the DataFrame methods will likely become available.

Now that we have seen some basic DataFrame operations working on CASTable objects, let’s move on to some plotting operations.

Creating Plots from CASTable Data

The CASTable objects support all of the same chart types as the DataFrame object because they use the same code. Plotting is done on the client side, so all of the data being visualized must be downloaded first. The plotting DataFrame methods are mirrored in the CASTable objects, and the data is downloaded automatically. Although these methods are convenient, for iterative work that might require tuning your plots, you probably want to bring the data back to a local DataFrame and use the plotting methods on it to avoid downloading data from the server at each iteration.

The easiest way to generate a plot of data in a CASTable is to use the plot method. We’ll use our original Iris data set for this example. The following creates a plot:

In [34]: tbl.plot()

Out[34]: <matplotlib.axes.AxesSubplot at 0x7f4b6650c4e0>

If you use the Jupyter notebook as your client, your plot should show up in-line. If it doesn’t appear in-line, you might try running the following IPython magic command and then do the plot again:

In [35]: %matplotlib inline

If you are using something other than the Jupyter notebook and the plot doesn’t show up, you can try the following:

In [36]: from matplotlib.pyplot import show

In [37]: show()

The show function tells Matplotlib, which is the plotting package, to show the figure window. The following screenshot shows the plot that is generated on a Linux machine:

Of course, you can display subsets of columns as well using the standard DataFrame syntax.

In [38]: tbl[['sepal_length', 'sepal_width']].plot()

Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x7fa3efc23240>

In [39]: show()

Here is a plot of selected columns from the Iris data set:

Other chart types include area, bar, horizontal bar, box plot, density, histogram, KDE, and pie. Depending on your version of Pandas, these chart types are accessible in multiple ways. The traditional way is to specify the type of chart in the kind argument of the plot method as follows:

In [40]: tbl[['sepal_length', 'sepal_width']].plot(kind='area')

Out[40]: <matplotlib.axes._subplots.AxesSubplot at 0x7f77480ce908>

In [41]: show()

Here is an area plot of selected column of the Iris data set:

In newer versions of Pandas (releases 0.17.0 and later), you can use the individual chart type attributes on the plot method.

In [42]: tbl[['sepal_length', 'sepal_width']].plot.area()

Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7747ea1fd0>

In [43]: show()

Options to the plotting methods can be used to customize the output (see the Pandas documentation for additional details). However, this interface does not provide publication-ready charts. It is primarily for exploration. For highly customized graphics, you should use the Matplotlib package directly.

Exporting CASTables to Other Formats

The data in CAS tables can be saved to all of the formats that are supported by the Pandas DataFrame’s export methods. It accomplishes this by fetching the data and using the DataFrame export methods to save the data. Export methods include to_pickle, to_csv, to_hdf, to_sql, to_dict, to_excel, to_json, to_html, to_latex, to_stata, to_msgpack, to_gbq, to_records, to_sparse, to_dense, to_string, and to_clipboard. Keep in mind that all of these methods save the output on the client side. To save the data in a CAS table to a data source, you use the save CAS action.

The save action stores the data in the CAS table to the appropriate format for the caslib where the data is stored. The most basic form is the file-based caslib. You can store the data from a CAS table to either the CSV form or the SASHDAT form in file-based caslibs. To store the data, you simply specify the caslib to which the file should be saved, and you specify the path to the file in the name parameter.

In [44]: tbl.save(name='data/irisout.csv', caslib='casuser')

NOTE: Cloud Analytic Services created the file data/irisout.csv in caslib CASUSER(username).

Out[44]:

[caslib]

'CASUSER(username)'

[name]

'data/irisout.csv'

+ Elapsed: 0.0143s, user: 0.004s, mem: 2.14mb

In [45]: tbl.save(name='data/irisout.sashdat', caslib='casuser')

NOTE: Cloud Analytic Services created the file data/irisout.sashdat in caslib CASUSER(username).

Out[45]:

[caslib]

'CASUSER(username)'

[name]

'data/irisout.sashdat'

+ Elapsed: 0.0214s, user: 0.016s, sys: 0.003s, mem: 16.2mb

For caslibs with other data sources, see the SAS documentation for the options and parameters that are needed to export the data to a specific data source type.

Sorting, Data Selection, and Iteration

Some of the most basic data operations deal with sorting rows, selecting data based on labels or expressions, and iterating over rows or columns of a table. The following sections introduce techniques for accomplishing each data operation using the CASTable object.

Fetching Data with a Sort Order

As mentioned previously, data in a CAS table might be distributed across a grid, or it can be accessible in ways that do not specify a particular row order. This means that anytime you fetch data from CAS, you can’t depend on the order unless you specify a sort order when retrieving it. A sort order can be specified with the sortby parameter on the fetch action, but a different method is used with DataFrames. DataFrame methods preserve order and can be sorted using the sort_values method (or the sort method in older versions of Pandas). To mimic the DataFrame behavior, CASTable objects also support those sorting methods.

The simplest form of sort_values is to specify the names of the columns to sort on. The order is ascending by default.

In [46]: tbl.sort_values(['sepal_length', 'sepal_width'])

Out[46]: CASTable('DATA.IRIS',

caslib='CASUSER(username)').sort_values(

['sepal_length', 'sepal_width'])

There is a subtle point in the preceding code that you might not have noticed if you aren’t familiar with Pandas. When you use the sort_values method on a DataFrame, it returns a new DataFrame object with an applied sort order unless you specify the inplace=True option. This is true with CASTable objects as well. The preceding code probably didn’t accomplish its sorting goal. Instead, it created a new CASTable object with the sort order applied, but since we didn’t save it to a variable, it simply disappeared. We have two choices: we can either use the inplace=True option on sort_values, or specify a variable for storing the returned value. Let’s store the value in a variable in the following example:

In [47]: sorttbl = tbl.sort_values(['sepal_length', 'sepal_width'])

In [48]: sorttbl

Out[48]: CASTable('DATA.IRIS',

caslib='CASUSER(username)').sort_values(

['sepal_length', 'sepal_width'])

Now we have a new CASTable object with the sort order applied as indicated by the sort_values(…) that is shown in the string representation above. To clarify, this sorting does nothing on the server side. In fact, nothing was sent to the server when this operation was invoked. The sort_values method call simply binds a sort order to the CASTable object that gets used anytime that the CASTable object is used to retrieve data from that CAS table.

We can now use the head or tail method to retrieve a sample of the data in the specified sort order.

In [50]: sorttbl.head(10)

Out[50]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 4.3 3.0 1.1 0.1 setosa

1 4.4 2.9 1.4 0.2 setosa

2 4.4 3.0 1.3 0.2 setosa

3 4.4 3.2 1.3 0.2 setosa

4 4.5 2.3 1.3 0.3 setosa

5 4.6 3.1 1.5 0.2 setosa

6 4.6 3.2 1.4 0.2 setosa

7 4.6 3.4 1.4 0.3 setosa

8 4.6 3.6 1.0 0.2 setosa

9 4.7 3.2 1.6 0.2 setosa

In [51]: sorttbl.tail(5)

Out[51]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

145 7.7 2.6 6.9 2.3 virginica

146 7.7 2.8 6.7 2.0 virginica

147 7.7 3.0 6.1 2.3 virginica

148 7.7 3.8 6.7 2.2 virginica

149 7.9 3.8 6.4 2.0 virginica

To change the order to descending, you can specify ascending=False or you can give a list of Booleans to indicate ascending or descending order for each column.

In [52]: sorttbl = tbl.sort_values(['sepal_length', 'sepal_width'],

....: ascending=[False, True])

In [53]: sorttbl

Out[53]: CASTable('DATA.IRIS',

caslib='CASUSER(username)').sort_values(

['sepal_length', 'sepal_width'],

ascending=[False, True])

If we fetch a sample of the data with our mixed descending/ascending sort order, you’ll get the following.

In [54]: sorttbl.head(10)

Out[54]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 7.9 3.8 6.4 2.0 virginica

1 7.7 2.6 6.9 2.3 virginica

2 7.7 2.8 6.7 2.0 virginica

3 7.7 3.0 6.1 2.3 virginica

4 7.7 3.8 6.7 2.2 virginica

5 7.6 3.0 6.6 2.1 virginica

6 7.4 2.8 6.1 1.9 virginica

7 7.3 2.9 6.3 1.8 virginica

8 7.2 3.0 5.8 1.6 virginica

9 7.2 3.2 6.0 1.8 virginica

Now that we have seen how to apply a sort order to a CASTable, we can go to the next set of DataFrame features that pertain to iteration.

Iterating through Columns and Rows

CASTable objects support all of the DataFrame iteration methods. However, you should use the iteration methods judiciously. Although you can iterate through the rows of a CAS table, you probably don’t want to do that if your table contains millions of rows. Realistically, you should use iteration methods with CAS tables that contain only summarization data or CAS tables that have a constrained number of records.

Basic Iteration

Iterating through the column names is the most basic form of iteration on a CASTable object. Iterating through columns is done by using the table object itself as the iterator just like with DataFrame objects.

In [55]: for col in sorttbl:

....: print(col)

....:

sepal_length

sepal_width

petal_length

petal_width

species

Of course, you can also wrap the table in the various Python iteration functions to get a different behavior. For example, you can use the enumerate function to give the index value of each column name, or you can even use the zip function to walk through multiple lists of the same size.

In [56]: for i, col in enumerate(sorttbl):

....: print(i, col)

....:

0 sepal_length

1 sepal_width

2 petal_length

3 petal_width

4 species

In [57]: for col, dtype in zip(sorttbl, sorttbl.dtypes):

....: print(col, dtype)

....:

sepal_length double

sepal_width double

petal_length double

petal_width double

species varchar

Iterating over Columns

Much like the basic iterator, the iteritems method iterates over the columns as well. However, in this case, the returned values are a pairing of the column name and a CASColumn object. CASColumn objects are just like CASTable objects except that they reference only a single column. They are analogous to Pandas Series objects and are discussed later in the chapter.

In [58]: for col, obj in sorttbl.iteritems():

....: print(col, obj)

....: print('')

....:

sepal_length CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['sepal_length'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

sepal_width CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['sepal_width'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

petal_length CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['petal_length'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

petal_width CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['petal_width'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

species CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['species'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])

Although the previous two iterators can be safely run on large tables (they don’t iterate through the data rows), the following two iterators go through all of the rows in a table. The iterrows method returns a tuple that contains the row index and a Pandas Series object that contains the values of the row.

In [59]: for row in sorttbl.iterrows():

....: print(row)

....:

(0, sepal_length 4.3

sepal_width 3

petal_length 1.1

petal_width 0.1

species setosa

Name: 0, dtype: object)

(1, sepal_length 4.4

sepal_width 2.9

petal_length 1.4

petal_width 0.2

species setosa

Name: 1, dtype: object)

...

(149, sepal_length 7.9

sepal_width 3.8

petal_length 6.4

petal_width 2

species virginica

Name: 149, dtype: object)

This method uses a Python generator to appear as if it fetches all of the data and iterates over it. However, the iterrows method pulls the data down in batches (configurable by the n argument) and waits until a new batch is needed before it fetches the next batch. The itertuples method works in a very similar way, but it returns each row as a tuple of values rather than a Series. The row index value can be disabled by passing index=False.

In [60]: for row in sorttbl.itertuples():

....: print(row)

....:

(0, 4.2999999999999998, 3.0, 1.1000000000000001, 0.10000000000000001,

'setosa')

(1, 4.4000000000000004, 2.8999999999999999, 1.3999999999999999,

0.20000000000000001, 'setosa')

(2, 4.4000000000000004, 3.0, 1.3, 0.20000000000000001, 'setosa')

. . .

(149, 7.9000000000000004, 3.7999999999999998, 6.4000000000000004, 2.0,

'virginica')

Now that we’ve seen how to iterate through our data, let’s look at how we can retrieve items from the data.

Techniques for Indexing and Selecting Data

Indexing and selecting data in DataFrames comes in many forms. The loc and iloc attributes enable you to select rows that are based on labels and row indexes. The attributes, at and iat, enable you to select a scalar value at a particular position, given a label and a row index, respectively, paired with a column identifier. The ix attribute supports mixed integer and label access. Finally, using the Python __getitem__ functionality (through the bracket [...] syntax) on the DataFrame itself enables you to select columns and ranges of rows.

Selecting Columns by Label and Position

Probably, the most common way that you see the Pandas DataFrame being indexed is directly on the DataFrame itself. DataFrame objects (and similarly CASTables objects) enable you to select columns either by name or index. To select a single column, you simply index the CASTable using the name of the column.

In [61]: col = sorttbl['sepal_width']

In [62]: col

Out[62]: CASColumn('data.iris', caslib='casuser')

['sepal_width'].sort_values(['sepal_length',

'sepal_width'],

ascending=[False, True])

In [63]: col.head()

Out[63]:

0 3.8

1 2.6

2 2.8

3 3.0

4 3.8

Name: sepal_width, dtype: float64

The object that we get back is a CASColumn object. This object is analogous to a Pandas Series. Technically, CAS doesn’t support stand-alone columns. The CASColumn object is a specialized subclass of CASTable object that references a single column of the table. As you can see, when you retrieve a column from a CASTable, the resulting CASColumn takes with it all table and sort information that was on the parent CASTable. We cover more advanced uses of CASColumn objects later in the chapter. For now, let’s move on to more CASTable indexing.

An alternative way of selecting a column is using the Python attribute syntax. You can specify the name of the column on the CASTable in order to have it return a CASColumn.

In [64]: sorttbl.species

Out[64]: CASColumn('DATA.IRIS', caslib='CASUSER(username)')

['species'].sort_values(['sepal_length', 'sepal_width'],

ascending=[False, True])

Although this way of accessing columns is prettier than the bracket syntax, it has some restrictions. The name of the column must be a legal Python attribute name, which cannot collide with any existing attributes or methods on the CASTable object. Although this syntax is appropriate for interactive use, it’s much safer to write programs using the bracket syntax.

In addition to retrieving columns from a CASTable, we can also retrieve subsets of a table. To do this, you use a list of columns as the index key.

In [65]: widths = sorttbl[['sepal_width', 'petal_width', 'species']]

In [66]: widths

Out[66]: CASTable('DATA.IRIS', caslib='CASUSER(username)')[

['sepal_width', 'petal_width', 'species']

].sort_values(['sepal_length', 'sepal_width'],

ascending=[False, True])

In [67]: widths.head()

Out[67]:

Selected Rows from Table DATA.IRIS

sepal_width petal_width species

0 3.8 2.0 virginica

1 2.6 2.3 virginica

2 2.8 2.0 virginica

3 3.0 2.3 virginica

4 3.8 2.2 virginica

If you specify a list of column names (or even a list containing a single column name), you’ll get back a new CASTable object that includes only the referenced columns. The data in this table is still the same as the data in the parent table. The local CASTable object simply references the columns that it needs.

Of course, now that we have this new table, we can use any of the previously described CASTable methods on it (including the CAS action and DataFrame methods).

In [68]: widths.describe()

Out[68]:

sepal_width petal_width

count 150.000000 150.000000

mean 3.054000 1.198667

std 0.433594 0.763161

min 2.000000 0.100000

25% 2.800000 0.300000

50% 3.000000 1.300000

75% 3.300000 1.800000

max 4.400000 2.500000

In [69]: widths.columninfo()

Out[69]:

[ColumnInfo]

Column ID Type RawLength FormattedLength NFL NFD

0 sepal_width 2 double 8 12 0 0

1 petal_width 4 double 8 12 0 0

2 species 5 varchar 10 10 0 0

+ Elapsed: 0.000839s, sys: 0.001s, mem: 0.188mb

In addition to basic column selection such as the examples shown here, you can also use the various accessor properties of DataFrame objects to select columns in other ways. That’s covered in the next section.

Selecting Data by Label and Position

Although DataFrames enable you to select rows using the loc, iloc, at, iat, and ix accessors, CAS tables do not have the concept of a row label or an index. These accessors still exist on CASTable objects, but they are limited to column selection only.

Let’s use our sorted table from the previous example to select some data points.

In [70]: sorttbl

Out[70]: CASTable('DATA.IRIS',

caslib='CASUSER(username)').sort_values(

['sepal_length', 'sepal_width'],

ascending=[False, True])

The loc and iloc data accessors on CASTable objects can select only columns. The loc accessor selects columns by name, whereas the iloc accessor selects columns by index. You must specify an empty slice as the first index parameter.

# Select the petal_width column

In [71]: sorttbl.loc[:, 'petal_width'].head()

Out[71]:

0 2.0

1 2.3

2 2.0

3 2.3

4 2.2

Name: petal_width, dtype: float64

# Select a range of columns

In [72]: sorttbl.loc[:, 'sepal_length':'petal_length'].head()

Out[72]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length

0 7.9 3.8 6.4

1 7.7 2.6 6.9

2 7.7 2.8 6.7

3 7.7 3.0 6.1

4 7.7 3.8 6.7

# Select a list of columns

In [73]: sorttbl.loc[:, ['petal_width', 'sepal_width']].head()

Out[73]:

Selected Rows from Table DATA.IRIS

petal_width sepal_width

0 2.0 3.8

1 2.3 2.6

2 2.0 2.8

3 2.3 3.0

4 2.2 3.8

The iloc accessor works much like loc except that it uses integer indexes.

In [74]: sorttbl.iloc[:, 3].head()

Out[74]:

0 2.0

1 2.3

2 2.0

3 2.3

4 2.2

Name: petal_width, dtype: float64

In [75]: sorttbl.iloc[:, 0:3].head()

Out[75]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length

0 7.9 3.8 6.4

1 7.7 2.6 6.9

2 7.7 2.8 6.7

3 7.7 3.0 6.1

4 7.7 3.8 6.7

In [76]: sorttbl.iloc[:, [3, 1]].head()

Out[76]:

Selected Rows from Table DATA.IRIS

petal_width sepal_width

0 2.0 3.8

1 2.3 2.6

2 2.0 2.8

3 2.3 3.0

4 2.2 3.8

Finally, the ix accessor enables you to mix labels and position indexes.

In [77]: sorttbl.ix[:, [3, 'sepal_width']].head()

Out[77]:

Selected Rows from Table DATA.IRIS

petal_width sepal_width

0 2.0 3.8

1 2.3 2.6

2 2.0 2.8

3 2.3 3.0

4 2.2 3.8

In [78]: sorttbl.ix[:, 'sepal_width'].head()

Out[78]:

0 3.8

1 2.6

2 2.8

3 3.0

4 3.8

Name: sepal_width, dtype: float64

In [79]: sorttbl.ix[:, 'sepal_width':-2].head()

Out[79]:

Selected Rows from Table DATA.IRIS

sepal_width petal_length

0 3.8 6.4

1 2.6 6.9

2 2.8 6.7

3 3.0 6.1

4 3.8 6.7

In [80]: sorttbl.ix[:, ['sepal_width', 3, 4]].head()

Out[80]:

Selected Rows from Table DATA.IRIS

sepal_width petal_width species

0 3.8 2.0 virginica

1 2.6 2.3 virginica

2 2.8 2.0 virginica

3 3.0 2.3 virginica

4 3.8 2.2 virginica

Although the preceding accessors can be useful, they are somewhat limited because the nature of distributed data prevents selection by row index. However, there is a more powerful indexing feature on CASTable objects that allows for various forms of advanced selection of data.

Dynamic Data Selection

Rather than using fixed values of rows and columns to select data, we can create conditions that are based on the data in the table to determine which rows to select. The specification of conditions is done using the same syntax as that used by DataFrame objects. CASColumn objects support Python’s various comparison operators and build a filter that subsets the rows in the table. You can then use the result of that comparison to index into a CASTable. It sounds much more complicated than it is, so let’s look at an example.

If we want to get a CASTable that contains only values where petal_length is greater than 6.5, we can use the following expression to create our filter:

In [81]: expr = sorttbl.petal_length > 6.5

This expression creates a computed column that is used in a where expression on the CASTable. This expression can then be used as an index value for a CASTable. Indexing this way essentially creates a Boolean mask. Wherever the expression values are true, the rows of the table are returned. Wherever the expression is false, the rows are filtered out.

In [82]: newtbl = sorttbl[expr]

In [83]: newtbl.head()

Out[83]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 7.7 2.6 6.9 2.3 virginica

1 7.7 2.8 6.7 2.0 virginica

2 7.7 3.8 6.7 2.2 virginica

3 7.6 3.0 6.6 2.1 virginica

These two steps are commonly entered on one line.

In [84]: newtbl = sorttbl[sorttbl.petal_length > 6.5]

In [85]: newtbl.head()

Out[85]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 7.7 2.6 6.9 2.3 virginica

1 7.7 2.8 6.7 2.0 virginica

2 7.7 3.8 6.7 2.2 virginica

3 7.6 3.0 6.6 2.1 virginica

We can further filter rows out by indexing another comparison expression.

In [86]: newtbl2 = newtbl[newtbl.petal_width < 2.2]

In [87]: newtbl2.head()

Out[87]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 7.7 2.8 6.7 2.0 virginica

1 7.6 3.0 6.6 2.1 virginica

Comparisons can be joined using the bitwise comparison operators & (and) and | (or). You must be careful with these operators though due to operator precedence. Bitwise comparison has a higher precedence than comparisons such as greater-than and less-than, so you need to enclose your comparisons in parentheses.

In [88]: sorttbl[(sorttbl.petal_length > 6.5) &

....: (sorttbl.petal_width < 2.2)].head()

Out[88]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 7.7 2.8 6.7 2.0 virginica

1 7.6 3.0 6.6 2.1 virginica

In all cases, we are not changing anything about the underlying data in CAS. We are simply constructing a query that is executed with the CASTable when it is used as the parameter in a CAS action. You can see what is happening behind the scenes by displaying the resulting CASTable objects.

In [89]: sorttbl[(sorttbl.petal_length > 6.5) &

....: (sorttbl.petal_width < 2.2)]

Out[89]: CASTable('data.iris', caslib='casuser',

computedvars=['_gt_6_', '_and_8_', '_lt_7_'],

computedvarsprogram='_gt_6_ = (petal_length > 6.5);

_lt_7_ = (petal_width < 2.2);

_and_8_ = (_gt_6_ and _lt_7_); ',

where='(_and_8_)')

[['sepal_length', 'sepal_width',

'petal_length', 'petal_width',

'species']].sort_values(['sepal_length',

'sepal_width'],

ascending=[False, True])

You can also do mathematical operations on columns with constants or on other columns within your comparisons.

In [90]: sorttbl[(sorttbl.petal_length + sorttbl.petal_width)

....: * 2 > 17.5].head()

Out[90]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 7.7 2.6 6.9 2.3 virginica

1 7.7 3.8 6.7 2.2 virginica

The list of supported operators is shown in the following table:

Operator	Numeric Data	Character Data
+ (add)	✓	✓
- (subtract)	✓
* (multiply)	✓
/ (divide)	✓
% (modulo)	✓
** (power)	✓

The supported comparison operators are shown in the following table.

Operator	Numeric Data	Character Data
== (equality)	✓	✓
!= (inequality)	✓	✓
< (less than)	✓	✓
> (greater than)	✓	✓
<= (less than or equal to)	✓	✓
>= (greater than or equal to)	✓	✓

As you can see in the preceding tables, you can do comparisons on character columns as well. It is also possible to use many of the Python string methods on the column values. These methods are accessed through the str attribute of the column, just like in DataFrames.

In [91]: sorttbl[sorttbl.species.str.upper().str.startswith('SET')

....: ].head()

Out[91]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 5.8 4.0 1.2 0.2 setosa

1 5.7 3.8 1.7 0.3 setosa

2 5.7 4.4 1.5 0.4 setosa

3 5.5 3.5 1.3 0.2 setosa

4 5.5 4.2 1.4 0.2 setosa

The following table shows the string methods that are supported on character columns:

Method	Description
col.str.capitalize	Capitalize the first character, and lowercase the remaining characters.
col.str.contains	Return Boolean values indicating whether the pattern exists in the data values.
col.str.count	Count occurrences of the pattern in each value.
col.str.endswith	Return Boolean values indicating whether the values end with the pattern.
col.str.find	Return the lowest index of the pattern in each value, or return -1 if the pattern is not found.
col.str.index	Is the same as the find method (the preceding entry) except that a ValueError is raised if the pattern is not found.
col.str.len	Compute the length of each value.
col.str.lower	Lowercase the entire column.
col.str.lstrip	Strip the leading spaces.
col.str.repeat	Duplicate the value by the specified number of times.
col.str.rfind	Return the highest index of the pattern in each value, or return -1 if the pattern is not found.
col.str.rindex	Is the same as the rfind method (the preceding entry) except that a ValueError is raised if the pattern is not found.
col.str.rstrip	Strip the trailing whitespace characters.
col.str.slice	Slice a substring from the value.
col.str.startswith	Return Boolean values indicating the value starts with the pattern.
col.str.strip	Strip the leading and trailing whitespace characters.
col.str.title	Capitalize each word in the value, and lowercase the remaining characters.
col.str.upper	Uppercase the value.
col.str.isalnum	Return Boolean values indicating whether the value is all alphanumeric values.
col.str.isalpha	Return Boolean values indicating whether the value is all alphabetic characters.
col.str.isdigit	Return Boolean values indicating whether the value is all digits.
col.str.isspace	Return Boolean values indicating whether the value is all whitespace characters.
col.str.islower	Return Boolean values indicating whether the value is all lowercase.
col.str.isupper	Return Boolean values indicating whether the value is all uppercase.
col.str.istitle	Return Boolean values indicating whether the value is equivalent to the title representation. (See the title method.)
col.str.isnumeric	Return Boolean values indicating whether the value is in a numeric representation.
col.str.isdecimal	Return Boolean values indicating whether the value is in a decimal representation.

In addition to the string column methods, you can also use the following date/time properties on date, time, and datetime columns.

Property	Description
col.dt.year	Return the year of the date/time/datetime.
col.dt.month	Return the month of the date/time/datetime, where January is 1 and December is 12.
col.dt.day	Return the day of the month of the date/time/datetime.
col.dt.hour	Return the hours of the date/time/datetime (0-24).
col.dt.minute	Return the minutes of the date/time/datetime (0-60).
col.dt.second	Return the seconds of the date/time/datetime (0-60).
col.dt.microsecond	Return the microseconds of the date/time/datetime (0-999999).
col.dt.nanosecond	Return the nanoseconds of the date/time/datetime. Since CAS has a resolution of only one microsecond, this property always returns zero.
col.dt.week	Return the week of the year of the date/time/datetime.
col.dt.weekofyear	Is the same as the week property.
col.dt.dayofweek	Return the day of the week of the date/time/datetime. Monday is 0 and Sunday is 6.
col.dt.weekday	Is the same as the dayofweek property.
col.dt.dayofyear	Return the ordinal day of the year of the date/time/datetime.
col.dt.quarter	Return the year’s quarter of the date/time/datetime.
col.dt.is_month_start	Boolean indicating whether the date/time/datetime is the first day of the month.
col.dt.is_month_end	Boolean indicating whether the date/time/datetime is the last day of the month.
col.dt.is_quarter_start	Boolean indicating whether the date/time/datetime is the first day of the quarter.
col.dt.is_quarter_end	Boolean indicating whether the date/time/datetime is the last day of the quarter.
col.dt.is_year_start	Boolean indicating whether the date/time/datetime is the first day of the year.
col.dt.is_year_end	Boolean indicating whether the date/time/datetime is the last day of the year.
col.dt.daysinmonth	Returns the number of days in the month of the date/time/datetime.
col.dt.days_in_month	Same as daysinmonth.

Note that when you call any of the previous methods on CAS time values, any method that returns a date-related component uses the current day. When you call time-related methods on a CAS date, the values are always zero.

It’s easy to create powerful filters that are executed in CAS while still using the Python syntax. However, the similarities to DataFrames don’t end there. CASTable objects can also create computed columns and by groups using similar techniques. Those topics are covered in the next section.

Data Wrangling on the Fly

There’s a certain amount of “data wrangling” that is required to prepare your data for reporting or modeling. Data wrangling refers to data manipulation, including the transformation of data from one format to another format, BY-group processing, and data aggregation. Such manipulations on large data sets often become bottlenecks to understanding your data, because most of the steps to manipulate data require that you create a new copy of the data or that you move data around the nodes of your cluster. In CAS, several ways are provided to speed up the data wrangling steps such as temporary computed columns and group by processing. Neither requires copying data or moving data.

Creating Computed Columns

Rather than using the CASColumn-based expressions to filter a CASTable object you can also use them to build computed columns. Instead of comparing the result of an expression to another value, you simply set it to a key on the CASTable object. Here is an example that adds the sepal_length and sepal_width and then multiplies the result by 2. The result is a CASColumn object that we set to sepal_factor.

In [92]: sorttbl['sepal_factor'] = ((sorttbl.sepal_length +

....: sorttbl.sepal_width) * 2)

In [93]: sorttbl.head()

Out[93]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 7.9 3.8 6.4 2.0 virginica

1 7.7 2.6 6.9 2.3 virginica

2 7.7 2.8 6.7 2.0 virginica

3 7.7 3.0 6.1 2.3 virginica

4 7.7 3.8 6.7 2.2 virginica

sepal_factor

0 23.4

1 20.6

2 21.0

3 21.4

4 23.0

The sepal_factor column is a computed column that is not present in the original CAS table. It is a temporary column generated “on the fly” when an action is executed. They exist only for the duration of the action, and are not visible to any other action. Computed columns are efficient because you can use computed columns to dynamically filter or expand the data without actually duplicating the data to create those extra filters or columns. You can create computed columns from other computed columns as well. In the following code, we’ll use our previously computed column as part of the expression for a new computed column.

In [94]: sorttbl['total_factor'] = sorttbl.sepal_factor +

....: sorttbl.petal_width +

....: sorttbl.petal_length

In [95]: sorttbl.head()

Out[95]:

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 7.9 3.8 6.4 2.0 virginica

1 7.7 2.6 6.9 2.3 virginica

2 7.7 2.8 6.7 2.0 virginica

3 7.7 3.0 6.1 2.3 virginica

4 7.7 3.8 6.7 2.2 virginica

sepal_factor total_factor

0 23.4 31.8

1 20.6 29.8

2 21.0 29.7

3 21.4 29.8

4 23.0 31.9

The output from the head method shows that the extra columns are now available on the table. You can even create computed columns that are based on constants or Python expressions (evaluated on the client) rather than on data in the table.

In [96]: sorttbl['names'] = 'sepal / petal'

In [97]: sorttbl.head()

Out[97]:

sepal_length sepal_width petal_length petal_width species

0 7.9 3.8 6.4 2.0 virginica

1 7.7 2.6 6.9 2.3 virginica

2 7.7 2.8 6.7 2.0 virginica

3 7.7 3.0 6.1 2.3 virginica

4 7.7 3.8 6.7 2.2 virginica

sepal_factor total_factor names

0 23.4 8.4 sepal / petal

1 20.6 9.2 sepal / petal

2 21.0 8.7 sepal / petal

3 21.4 8.4 sepal / petal

4 23.0 8.9 sepal / petal

You can use any of the numeric, string, and date/time methods that are described in the filtering section to construct computed columns as well.

In [98]: sorttbl['cap_names'] = sorttbl.names.str.title()

In [99]: sorttbl.head()

Selected Rows from Table DATA.IRIS

sepal_length sepal_width petal_length petal_width species

0 4.3 3.0 1.1 0.1 setosa

1 4.4 2.9 1.4 0.2 setosa

2 4.4 3.0 1.3 0.2 setosa

3 4.4 3.2 1.3 0.2 setosa

4 4.5 2.3 1.3 0.3 setosa

sepal_factor total_factor names cap_names

0 14.6 15.8 sepal / petal Sepal / Petal

1 14.6 16.2 sepal / petal Sepal / Petal

2 14.8 16.3 sepal / petal Sepal / Petal

3 15.2 16.7 sepal / petal Sepal / Petal

4 13.6 15.2 sepal / petal Sepal / Petal

Now that we have seen that we can filter our data and create computed columns, let’s look at the topic of grouping data by data values.

By Group Processing

Many analytic actions and Pandas DataFrame methods support grouping by values in one or more columns in a table. There are a couple of ways to apply by grouping in CASTable objects, one of which is borrowed from the DataFrame API.

The most direct way of applying by groupings to a CASTable object is to use the groupby parameter. We’ll start with the Iris table for this example.

In [100]: tbl.set_param('groupby', ['species'])

In [101]: tbl

Out[101]: CASTable('DATA.IRIS', caslib='CASUSER(username)',

groupby=['species'])

Now that we have some grouping variables set, we’ll run the summary action.

In [102]: tbl.summary(subset=['min', 'max'])

Out[102]:

[ByGroupInfo]

ByGroupInfo

species species_f _key_

0 setosa setosa setosa

1 versicolor versicolor versicolor

2 virginica virginica virginica

[ByGroup1.Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

setosa sepal_length 4.3 5.8

setosa sepal_width 2.3 4.4

setosa petal_length 1.0 1.9

setosa petal_width 0.1 0.6

[ByGroup2.Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

versicolor sepal_length 4.9 7.0

versicolor sepal_width 2.0 3.4

versicolor petal_length 3.0 5.1

versicolor petal_width 1.0 1.8

[ByGroup3.Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

virginica sepal_length 4.9 7.9

virginica sepal_width 2.2 3.8

virginica petal_length 4.5 6.9

virginica petal_width 1.4 2.5

+ Elapsed: 0.0082s, user: 0.006s, sys: 0.007s, mem: 1.94mb

You see that we now get multiple tables as output rather than just the single table that the summary action produces when running without by groupings. The first table is called ByGroupInfo. This table contains information about all of the by grouping tables that follow. This information is more useful if there are a large number of by groupings and you want to know about them in advance so that you can prepare for the tables as they come from the server. The remaining tables correspond to each of the by groupings.

Each by grouping generates a new output table. The name of the key for that output is the name of the table prefixed by a label ByGroup#, where # is the index of the grouping (starting with 1). Without by grouping applied, the summary action creates a table called Summary. With by grouping applied, the result keys are ByGroup1.Summary, ByGroup2.Summary, and so on.

We mentioned that the DataFrame method for applying by groupings works with CASTable objects as well. The DataFrame groupby method is used to group the data. In the following example, we delete the groupby parameter from the table, and then use the groupby method to apply by groupings instead.

In [103]: tbl.del_param('groupby')

In [104]: tbl

Out[104]: CASTable('DATA.IRIS', caslib='CASUSER(username)')

In [105]: grptbl = tbl.groupby(['species'])

In [106]: grptbl

Out[106]: <swat.cas.table.CASTableGroupBy at 0x7f2148e12358>

You notice that when using this method, we don’t get a CASTable object back. Instead, we get a CASTableGroupBy object. This object corresponds to the Pandas GroupBy object. This object works much like a DataFrame object in many ways. For example, you can still call most DataFrame methods, call CAS actions on it, access columns, and so on, but the CASTableGroupBy object also adds methods for traversing the by-groupings in various ways.

Using the summary action on our grptbl variable shows that we get the same result as when we used the groupby parameter on the CASTable object.

In [107]: grptbl.summary(subset=['min', 'max'])

Out[107]:

[ByGroupInfo]

ByGroupInfo

species species_f _key_

0 setosa setosa setosa

1 versicolor versicolor versicolor

2 virginica virginica virginica

[ByGroup1.Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

setosa sepal_length 4.3 5.8

setosa sepal_width 2.3 4.4

setosa petal_length 1.0 1.9

setosa petal_width 0.1 0.6

[ByGroup2.Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

versicolor sepal_length 4.9 7.0

versicolor sepal_width 2.0 3.4

versicolor petal_length 3.0 5.1

versicolor petal_width 1.0 1.8

[ByGroup3.Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

virginica sepal_length 4.9 7.9

virginica sepal_width 2.2 3.8

virginica petal_length 4.5 6.9

virginica petal_width 1.4 2.5

+ Elapsed: 0.00862s, user: 0.008s, sys: 0.009s, mem: 1.98mb

Concatenating By Groups

If you already use Pandas, you might want to combine all of the by-grouping tables into a single DataFrame so that you can continue your workflow, as usual. Fortunately, there is a method on the CASResults object that can be used to combine by-grouping tables into a single DataFrame: concat_bygroups.

In [108]: grpsumm = grptbl.summary(subset=['min', 'max'])

In [109]: grpsumm.concat_bygroups()

Out[109]:

[Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

setosa sepal_length 4.3 5.8

setosa sepal_width 2.3 4.4

setosa petal_length 1.0 1.9

setosa petal_width 0.1 0.6

versicolor sepal_length 4.9 7.0

versicolor sepal_width 2.0 3.4

versicolor petal_length 3.0 5.1

versicolor petal_width 1.0 1.8

virginica sepal_length 4.9 7.9

virginica sepal_width 2.2 3.8

virginica petal_length 4.5 6.9

virginica petal_width 1.4 2.5

This method preserves all of the DataFrame index information so that it can be treated just like the output of the native DataFrame methods with by-groups enabled.

If you prefer to keep your by-group tables separate, there are a couple of other methods on the CASResults object that can query specific table names or groups.

Selecting Result Keys by Table Name

Unlike the summary action, some CAS actions return multiple result table names, including the correlation action. We won’t explain the correlation action in detail, but we use it in our example to select a particular table name from a result set. We still use our grptbl variable from the previous example as our starting point.

In [110]: grpcorr = grptbl.correlation()

In [111]: grpcorr

Out[111]:

[ByGroupInfo]

ByGroupInfo

species species_f _key_

0 setosa setosa setosa

1 versicolor versicolor versicolor

2 virginica virginica virginica

[ByGroup1.CorrSimple]

Summary Statistics in Correlation Analysis for DATA.IRIS

Variable N Mean Sum StdDev Minimum

species

setosa sepal_length 50.0 5.006 250.3 0.352490 4.3

setosa sepal_width 50.0 3.418 170.9 0.381024 2.3

setosa petal_length 50.0 1.464 73.2 0.173511 1.0

setosa petal_width 50.0 0.244 12.2 0.107210 0.1

Maximum

species

setosa 5.8

setosa 4.4

setosa 1.9

setosa 0.6

[ByGroup1.Correlation]

Pearson Correlation Coefficients for DATA.IRIS

Variable sepal_length sepal_width petal_length

species

setosa sepal_length 1.000000 0.746780 0.263874

setosa sepal_width 0.746780 1.000000 0.176695

setosa petal_length 0.263874 0.176695 1.000000

setosa petal_width 0.279092 0.279973 0.306308

petal_width

species

setosa 0.279092

setosa 0.279973

setosa 0.306308

setosa 1.000000

... truncated ...

As you can see in the preceding output, the correlation action produces tables that are named CorrSimple and Correlation. These tables appear in each by grouping in the output. If we want to select only the Correlation tables from the output, we can use the get_tables method. This method takes the name of a table and returns a list of all of the tables in the result set with that name. The name of the table should not include the ByGroup#. prefix.

In [112]: grpcorr.get_tables('Correlation')

Out[112]:

[Pearson Correlation Coefficients for DATA.IRIS

Variable sepal_length sepal_width petal_length

species

setosa sepal_length 1.000000 0.746780 0.263874

setosa sepal_width 0.746780 1.000000 0.176695

setosa petal_length 0.263874 0.176695 1.000000

setosa petal_width 0.279092 0.279973 0.306308

petal_width

species

setosa 0.279092

setosa 0.279973

setosa 0.306308

setosa 1.000000 , Pearson Correlation Coefficients for DATA.IRIS

Variable sepal_length sepal_width petal_length

species

versicolor sepal_length 1.000000 0.525911 0.754049

versicolor sepal_width 0.525911 1.000000 0.560522

versicolor petal_length 0.754049 0.560522 1.000000

versicolor petal_width 0.546461 0.663999 0.786668

petal_width

species

versicolor 0.546461

versicolor 0.663999

versicolor 0.786668

versicolor 1.000000 , Pearson Correlation Coefficients for DATA.IRIS

Variable sepal_length sepal_width petal_length

species

virginica sepal_length 1.000000 0.457228 0.864225

virginica sepal_width 0.457228 1.000000 0.401045

virginica petal_length 0.864225 0.401045 1.000000

virginica petal_width 0.281108 0.537728 0.322108

petal_width

species

virginica 0.281108

virginica 0.537728

virginica 0.322108

virginica 1.000000 ]

In the preceding code, you see that we now have a list of DataFrames that correspond to the Correlation tables. You can use the concat function in SWAT to combine the Correlation tables into a single DataFrame.

In [113]: swat.concat(grpcorr.get_tables('Correlation'))

Out[113]:

Pearson Correlation Coefficients for DATA.IRIS

Variable sepal_length sepal_width petal_length

species

setosa sepal_length 1.000000 0.746780 0.263874

setosa sepal_width 0.746780 1.000000 0.176695

setosa petal_length 0.263874 0.176695 1.000000

setosa petal_width 0.279092 0.279973 0.306308

versicolor sepal_length 1.000000 0.525911 0.754049

versicolor sepal_width 0.525911 1.000000 0.560522

versicolor petal_length 0.754049 0.560522 1.000000

versicolor petal_width 0.546461 0.663999 0.786668

virginica sepal_length 1.000000 0.457228 0.864225

virginica sepal_width 0.457228 1.000000 0.401045

virginica petal_length 0.864225 0.401045 1.000000

virginica petal_width 0.281108 0.537728 0.322108

petal_width

species

setosa 0.279092

setosa 0.279973

setosa 0.306308

setosa 1.000000

versicolor 0.546461

versicolor 0.663999

versicolor 0.786668

versicolor 1.000000

virginica 0.281108

virginica 0.537728

virginica 0.322108

virginica 1.000000

We’ve seen how to select specific table names from result sets, but what if we want to select a particular by group? That’s covered in the next section.

Selecting a Specific By Group

The CASResults object has a method to select by groups from a result set using the values of the by-grouping variables. This method is called get_group. If you specify a list of values, get_group matches the by-grouping variables as they were specified via the groupby method.

In [114]: grpsumm.get_group(['versicolor'])

Out[114]:

[Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

versicolor sepal_length 4.9 7.0

versicolor sepal_width 2.0 3.4

versicolor petal_length 3.0 5.1

versicolor petal_width 1.0 1.8

You can also use keyword parameters on the get_group method to specify the desired by-grouping variables.

In [115]: grpsumm.get_group(species='versicolor')

Out[115]:

[Summary]

Descriptive Statistics for DATA.IRIS

Column Min Max

species

versicolor sepal_length 4.9 7.0

versicolor sepal_width 2.0 3.4

versicolor petal_length 3.0 5.1

versicolor petal_width 1.0 1.8

The final topic for discussion about by groupings is the case in which a CAS action has support for multiple sets of by groupings.

Handling Multiple Sets of By Groups

Some actions, such as mdsummary, enable you to specify multiple sets of by groups. The output for such actions is similar to what we saw in the examples in the preceding section except that the result keys add another prefix: ByGroupSet#.

In [116]: grpmdsumm =

.....: tbl.mdsummary(sets=[dict(groupby=['sepal_length']),

.....: dict(groupby=['petal_length'])])

In [117]: list(grpmdsumm.keys())

Out[117]:

['ByGroupSet1.ByGroupInfo',

'ByGroupSet1.ByGroup1.MDSummary',

'ByGroupSet1.ByGroup2.MDSummary',

'ByGroupSet1.ByGroup3.MDSummary',

...

'ByGroupSet2.ByGroupInfo',

'ByGroupSet2.ByGroup1.MDSummary',

'ByGroupSet2.ByGroup2.MDSummary',

'ByGroupSet2.ByGroup3.MDSummary',

...]

As you can see from the preceding output, we get a ByGroupSet#.ByGroupInfo table for each by-group set followed by all of the tables for that by-group set. We can extract a particular by-group set from the results using the get_set method.

In [118]: grpmdsumm.get_set(1)

Out[118]:

[ByGroupInfo]

ByGroupSet1.ByGroupInfo

sepal_length sepal_length_f _key_

0 5.0 5 5

1 6.0 6 6

2 7.0 7 7

3 4.3 4.3 4.3

4 4.4 4.4 4.4

5 4.5 4.5 4.5

6 4.6 4.6 4.6

7 4.7 4.7 4.7

8 4.8 4.8 4.8

9 4.9 4.9 4.9

.. ... ... ...

25 6.7 6.7 6.7

26 6.8 6.8 6.8

27 6.9 6.9 6.9

28 7.1 7.1 7.1

29 7.2 7.2 7.2

30 7.3 7.3 7.3

31 7.4 7.4 7.4

32 7.6 7.6 7.6

33 7.7 7.7 7.7

34 7.9 7.9 7.9

[35 rows x 3 columns]

[ByGroup1.MDSummary]

Descriptive Statistics for DATA.IRIS

Column Min Max N NMiss Mean Sum

sepal_length

5 sepal_length 5.0 5.0 10.0 0.0 5.00 50.0

5 sepal_width 2.0 3.6 10.0 0.0 3.12 31.2

5 petal_length 1.2 3.5 10.0 0.0 1.84 18.4

5 petal_width 0.2 1.0 10.0 0.0 0.43 4.3

Std StdErr Var USS CSS

sepal_length

5 0.000000 0.000000 0.000000 250.00 0.000

5 0.543241 0.171788 0.295111 100.00 2.656

5 0.834266 0.263818 0.696000 40.12 6.264

5 0.326769 0.103333 0.106778 2.81 0.961

CV TValue ProbT

sepal_length

5 0.000000 NaN NaN

5 17.411580 18.161922 2.121189e-08

5 45.340551 6.974502 6.505533e-05

5 75.992719 4.161290 2.443024e-03

[ByGroup2.MDSummary]

Descriptive Statistics for DATA.IRIS

Column Min Max N NMiss Mean Sum

sepal_length

6 sepal_length 6.0 6.0 6.0 0.0 6.000000 36.0

6 sepal_width 2.2 3.4 6.0 0.0 2.733333 16.4

6 petal_length 4.0 5.1 6.0 0.0 4.650000 27.9

6 petal_width 1.0 1.8 6.0 0.0 1.500000 9.0

Std StdErr Var USS CSS

sepal_length

6 0.000000 0.000000 0.000000 216.00 0.000000

6 0.471876 0.192642 0.222667 45.94 1.113333

6 0.403733 0.164823 0.163000 130.55 0.815000

6 0.268328 0.109545 0.072000 13.86 0.360000

CV TValue ProbT

sepal_length

6 0.000000 NaN NaN

6 17.263745 14.188635 0.000031

6 8.682421 28.212059 0.000001

6 17.888544 13.693064 0.000037

... truncated ...

You see that when using the get_set method, we get back a new CASResults object that corresponds to the output from that particular by-group set. You can then use get_tables, concat_bygroups, or get_group on that result just as we did in the examples in the previous section.

In [119]: grpmdsumm.get_set(1).concat_bygroups()

Out[119]:

[MDSummary]

Descriptive Statistics for DATA.IRIS

Column Min Max N NMiss Mean Sum

sepal_length

5 sepal_length 5.0 5.0 10.0 0.0 5.000000 50.0

5 sepal_width 2.0 3.6 10.0 0.0 3.120000 31.2

5 petal_length 1.2 3.5 10.0 0.0 1.840000 18.4

5 petal_width 0.2 1.0 10.0 0.0 0.430000 4.3

6 sepal_length 6.0 6.0 6.0 0.0 6.000000 36.0

6 sepal_width 2.2 3.4 6.0 0.0 2.733333 16.4

6 petal_length 4.0 5.1 6.0 0.0 4.650000 27.9

6 petal_width 1.0 1.8 6.0 0.0 1.500000 9.0

7 sepal_length 7.0 7.0 1.0 0.0 7.000000 7.0

7 sepal_width 3.2 3.2 1.0 0.0 3.200000 3.2

... ... ... ... ... ... ... ...

7.6 petal_length 6.6 6.6 1.0 0.0 6.600000 6.6

7.6 petal_width 2.1 2.1 1.0 0.0 2.100000 2.1

7.7 sepal_length 7.7 7.7 4.0 0.0 7.700000 30.8

7.7 sepal_width 2.6 3.8 4.0 0.0 3.050000 12.2

7.7 petal_length 6.1 6.9 4.0 0.0 6.600000 26.4

7.7 petal_width 2.0 2.3 4.0 0.0 2.200000 8.8

7.9 sepal_length 7.9 7.9 1.0 0.0 7.900000 7.9

7.9 sepal_width 3.8 3.8 1.0 0.0 3.800000 3.8

7.9 petal_length 6.4 6.4 1.0 0.0 6.400000 6.4

7.9 petal_width 2.0 2.0 1.0 0.0 2.000000 2.0

... truncated ...

Conclusion

If you are familiar with Pandas DataFrames, hopefully, the use of much of the API in CASTable objects helps you transition to the world of CAS. We covered the basics of using some of the supported DataFrame methods, using various forms of indexing and iteration, filtering using expressions, creating computed columns, and using by groups. With all of that “under your belt,” we can look at some more advanced statistical data in the following chapters.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 6: Working with CAS Tables

Create new playlist

Sign In

Sign Up

Chapter 6: Working with CAS Tables

Using CASTable Objects like a DataFrame

CAS Table Introspection

Computing Simple Statistics

Creating Plots from CASTable Data

Exporting CASTables to Other Formats

Sorting, Data Selection, and Iteration

Fetching Data with a Sort Order

Iterating through Columns and Rows

Basic Iteration

Iterating over Columns

Techniques for Indexing and Selecting Data

Selecting Columns by Label and Position

Selecting Data by Label and Position

Dynamic Data Selection

Data Wrangling on the Fly

Creating Computed Columns

By Group Processing

Concatenating By Groups

Selecting Result Keys by Table Name

Selecting a Specific By Group

Handling Multiple Sets of By Groups

Conclusion

Table of Contents for
Chapter 6: Working with CAS Tables