Using CASTable Objects like a DataFrame
Creating Plots from CASTable Data
Exporting CASTables to Other Formats
Sorting, Data Selection, and Iteration
Fetching Data with a Sort Order
Iterating through Columns and Rows
Techniques for Indexing and Selecting Data
In the previous chapter, we introduced the CASAction and CASTable objects. We covered the methods of setting parameters and calling actions on a CASTable object, but our coverage only included the basics. In this chapter we continue to work on CASTable objects on a variety of topics, including the DataFrame APIs, exporting data, creating visualizations, fetching and indexing, creating temporary columns, and by group processing.
The Pandas DataFrame has taken the elegance of the Python programming language and applied it beautifully to data structures that are used in everyday statistics and data analysis. It supports many basic statistical analyses on data sets that can fit into memory on your computer, and it also defines interesting ways to index data sets using Python’s “fancy” indexing syntax.
The CASTable object adopts these same APIs and applies them to tables in CAS. This means that you can use the familiar DataFrame API on CAS tables. The result types that are returned by the CASTable versions of the Pandas DataFrame methods are, in most cases, the same as those that are returned by DataFrame methods of the same name. The primary difference is that the amount of data that is referenced by a CASTable object can be much greater than what can be handled on a single desktop machine.
In the following sections, we start with some methods and attributes that describe the CAS table itself, and then move on to methods for simple statistics and other areas of the DataFrame API.
Basic information about CAS tables includes the names of the columns, the data types, and the number of rows. Until now, we’ve used the tableinfo and columninfo CAS actions to get that information. However, we can also use the DataFrame attributes and methods for the same purpose. Probably the most common attribute that is used to get this type of information is the columns attribute. It simply lists the names of all of the columns in the table. We load the Iris data into both a CAS table and a DataFrame for comparison.
# Import pandas
In [1]: import pandas as pd
# Read iris.csv into a DataFrame
In [2]: df = pd.read_csv('/u/username/data/iris.csv')
In [3]: df.columns
Out[3]: Index(['sepal_length', 'sepal_width', 'petal_length',
'petal_width', 'species'], dtype='object')
# Load iris.csv into a CAS table
In [4]: tbl = conn.loadtable('data/iris.csv',
caslib='casuser').casTable
NOTE: Cloud Analytic Services made the file data/iris.csv available as table DATA.IRIS in caslib CASUSER(username).
In [5]: tbl.columns
Out[5]: Index(['sepal_length', 'sepal_width', 'petal_length',
'petal_width', 'species'], dtype='object')
In both cases, you see that the columns attribute returns a Pandas Index object that contains the names of the columns. You can also find out the data types using dtypes.
In [6]: df.dtypes
Out[6]:
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species object
dtype: object
In [7]: tbl.dtypes
Out[7]:
sepal_length double
sepal_width double
petal_length double
petal_width double
species varchar
dtype: object
In this case, you’ll see a small difference. The data types that are supported by DataFrames and those that are supported by CAS tables have different names. The different names don’t matter since the Python float64 types is equivalent to the CAS double type. However, if you use the type names for filtering in other parts of the API, you want to use the CAS type names.
You can also get the counts of each data type using the get_dtype_counts method.
In [8]: df.get_dtype_counts()
Out[8]:
float64 4
object 1
dtype: int64
In [9]: tbl.get_dtype_counts()
Out[9]:
double 4
varchar 1
dtype: int64
To obtain the size or the shape of a CAS table, you use the size and the shape attributes just as in a DataFrame.
In [10]: tbl.size
Out[10]: 750
In [11]: tbl.shape
Out[11]: (150, 5)
For a quick summary of information about the entire table, you can use the info method.
In [12]: tbl.info()
CASTable('DATA.IRIS', caslib='CASUSER(username)')
Data columns (total 5 columns):
N Miss Type
sepal_length 150 False double
sepal_width 150 False double
petal_length 150 False double
petal_width 150 False double
species 150 False varchar
dtypes: double(4), varchar(1)
data size: 8450
vardata size: 1250
memory usage: 8656
Finally, to get a sample of the data, you can use the head or tail methods. Keep in mind that CAS commonly uses distributed data storage that has no particular sort order, so the values that you get from the head and tail methods are not deterministic unless the table has a specified sort order, which is covered later in this chapter.
In [13]: tbl.head(n=3)
Out[13]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
In [14]: tbl.tail(n=4)
Out[14]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
146 6.4 2.8 5.6 2.2 virginica
147 6.3 2.8 5.1 1.5 virginica
148 6.1 2.6 5.6 1.4 virginica
149 7.7 3.0 6.1 2.3 virginica
As you can see from the preceding code, the head and tail methods accept the n parameter to indicate the number of records to retrieve (the default is 5). CAS commonly deals with large data sets and you might not want to bring all of the variables back. Using the columns parameter, it is possible to subset the columns that are retrieved.
In [15]: tbl.head(columns=['sepal_length', 'petal_length'])
Out[15]:
Selected Rows from Table DATA.IRIS
sepal_length petal_length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
3 4.6 1.5
4 5.0 1.4
Although this part of the DataFrame API is fairly simple, it already shows how easy working with CASTable objects can be. Let’s move on to computing some basic statistics.
A common DataFrame method that is used to learn about the characteristics of your data is the describe method. Here is a simple example of getting basic summary statistics from a DataFrame and a CASTable, both using the describe method.
# Run the describe method on a DataFrame
In [16]: desc = df.describe()
In [17]: desc
Out[17]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [18]: type(desc)
Out[18]: pandas.core.frame.DataFrame
# Run the describe method on a CASTable
In [19]: casdesc = tbl.describe()
In [20]: casdesc
Out[20]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [21]: type(casdesc)
Out[21]: pandas.core.frame.DataFrame
As you can see, in both cases, we run the describe method and get back the same results even in the same data type. This might not seem impressive for only 150 records, but how about running it on a larger collection? Our table contains 70 million rows and 46 columns, 35 of which are numeric. Running the describe method on our grid of 48 nodes requires only seconds.
In [22]: conn.tableinfo('datasources.megacorp5m')
Out[22]:
[TableInfo]
Name Rows Columns Encoding
0 DATASOURCES.MEGACORP5M 70732833 46 utf-8
CreateTimeFormatted ModTimeFormatted JavaCharSet CreateTime
0 01Mar2016:12:57:08 01Mar2016:12:57:08 UTF8 1.772456e+09
ModTime Global Repeated View SourceName
0 1.772456e+09 0 0 0 datasources/megacorp5m.csv
SourceCaslib Compressed
0 CASTestTmp 0
In [23]: %time mega.describe()
CPU times: user 261 ms, sys: 24 ms, total: 285 ms
Wall time: 12.1 s
Out[23]:
Date DateByYear DateByMonth
count 70732833.000000 70732833.000000 70732833.000000
mean 14389.447935 14209.155789 14374.679408
std 2653.775328 2651.783194 2653.766319
min 7305.000000 7305.000000 7305.000000
25% 12460.000000 12419.000000 12450.000000
50% 14605.000000 14245.000000 14579.000000
75% 16517.000000 16437.000000 16496.000000
max 18992.000000 18628.000000 18962.000000
...
StateLatitude CityLongitude CityLatitude
count 70732833.000000 70119950.000000 70119950.000000
mean 35.090183 -95.995716 34.365083
std 4.313599 12.363752 5.171787
min 31.051683 -122.418330 27.800283
25% 31.169358 -97.396111 30.694170
50% 32.615651 -92.289440 33.509170
75% 37.271892 -88.043060 37.775000
max 47.272951 -71.455280 47.606390
[8 rows x 35 columns]
Now that we’ve seen that we can analyze both small and large quantities of data in the familiar Pandas DataFrame API, let’s see what else it can do.
In addition to running the describe method with the default settings, you can use any of the options that it supports including percentiles, include, and exclude to modify the output. Let’s go back to the Iris data set and try these options.
The percentiles that you get with the default options are 25%, 50%, and 75%. We can customize those values by passing in float values between 0 and 1 to the percentiles option.
In [24]: tbl
Out[24]: CASTable('DATA.IRIS', caslib='CASUSER(username)')
In [25]: tbl.describe(percentiles=[0.3, 0.8])
Out[25]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
30% 5.250000 2.800000 1.700000 0.400000
50% 5.800000 3.000000 4.350000 1.300000
80% 6.550000 3.400000 5.350000 1.900000
max 7.900000 4.400000 6.900000 2.500000
As you can see, we now get the 30% and 80% percentiles in addition to 50%, which always gets computed by the DataFrame method. The include and exclude options enable you to specify the data types that should be included in the analysis. You can specify them as a list of strings that indicate the CAS data types that should be included, or you can use the terms ‘numeric’, ‘character’, and ‘all’ to refer to all numeric types, all character types, and all variables, respectively. The values for this option vary slightly from the DataFrame method since the names of the data types are different between Pandas and CAS, but the form of the option is the same. In either case, the default is to analyze numeric data only.
# Analyze only character data in the CAS table
In [26]: tbl.describe(include='character')
Out[26]:
species
count 150
unique 3
top virginica
freq 50
# Analyze all data in the CAS table
In [27]: tbl.describe(include='all')
Out[27]:
sepal_length sepal_width petal_length petal_width species
count 150 150 150 150 150
unique 35 23 43 22 3
top 5 3 1.5 0.2 virginica
freq 10 26 14 28 50
mean 5.84333 3.054 3.75867 1.19867 NaN
std 0.828066 0.433594 1.76442 0.763161 NaN
min 4.3 2 1 0.1 setosa
25% 5.1 2.8 1.6 0.3 NaN
50% 5.8 3 4.35 1.3 NaN
75% 6.4 3.3 5.1 1.8 NaN
max 7.9 4.4 6.9 2.5 virginica
# Same as above
In [28]: tbl.describe(include=['numeric', 'character'])
Out[28]:
sepal_length sepal_width petal_length petal_width species
count 150 150 150 150 150
unique 35 23 43 22 3
top 5 3 1.5 0.2 virginica
freq 10 26 14 28 50
mean 5.84333 3.054 3.75867 1.19867 NaN
std 0.828066 0.433594 1.76442 0.763161 NaN
min 4.3 2 1 0.1 setosa
25% 5.1 2.8 1.6 0.3 NaN
50% 5.8 3 4.35 1.3 NaN
75% 6.4 3.3 5.1 1.8 NaN
max 7.9 4.4 6.9 2.5 virginica
The last option that is supported is called stats. This is not a DataFrame option, but was added to allow statistics that are supported by the summary action to be included in the results as well. It accepts a list of strings, including ‘count’, ‘std’, ‘min’, ‘pct’, ‘max’, ‘unique’, ‘top’, ‘freq’, ‘nmiss’, ‘sum’, ‘stderr’, ‘var’, ‘uss’, ‘cv’, ‘tvalue’, and ‘probt’. The value of ‘pct’ includes all of the percentiles that have been computed. If ‘all’ is specified, all of the previously mentioned statistics are displayed.
In [29]: tbl.describe(stats=['count', 'nmiss', 'sum',
....: 'probt', 'freq'])
Out[29]:
sepal_length sepal_width petal_length petal_width
count 1.500000e+02 1.500000e+02 1.500000e+02 1.500000e+02
nmiss 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
sum 8.765000e+02 4.581000e+02 5.638000e+02 1.798000e+02
probt 3.331256e-129 4.374977e-129 1.994305e-57 3.209704e-42
freq 1.000000e+01 2.600000e+01 1.400000e+01 2.800000e+01
In [30]: tbl.describe(stats='all')
Out[30]:
sepal_length sepal_width petal_length petal_width
count 1.500000e+02 1.500000e+02 1.500000e+02 1.500000e+02
unique 3.500000e+01 2.300000e+01 4.300000e+01 2.200000e+01
mean 5.843333e+00 3.054000e+00 3.758667e+00 1.198667e+00
std 8.280661e-01 4.335943e-01 1.764420e+00 7.631607e-01
min 4.300000e+00 2.000000e+00 1.000000e+00 1.000000e-01
25% 5.100000e+00 2.800000e+00 1.600000e+00 3.000000e-01
50% 5.800000e+00 3.000000e+00 4.350000e+00 1.300000e+00
75% 6.400000e+00 3.300000e+00 5.100000e+00 1.800000e+00
max 7.900000e+00 4.400000e+00 6.900000e+00 2.500000e+00
nmiss 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
sum 8.765000e+02 4.581000e+02 5.638000e+02 1.798000e+02
stderr 6.761132e-02 3.540283e-02 1.440643e-01 6.231181e-02
var 6.856935e-01 1.880040e-01 3.113179e+00 5.824143e-01
uss 5.223850e+03 1.427050e+03 2.583000e+03 3.023000e+02
cv 1.417113e+01 1.419759e+01 4.694272e+01 6.366747e+01
tvalue 8.642537e+01 8.626430e+01 2.609020e+01 1.923659e+01
probt 3.331256e-129 4.374977e-129 1.994305e-57 3.209704e-42
It is also possible to access the individual statistics through the DataFrame methods count, mean, std, and max. The methods nmiss, sum, stderr, var, uss, cv, tvalue, and probt are also available in order to complete the set that is provided by the summary action. These methods return a Pandas Series object that contains the values for each variable.
In [31]: tbl.count()
Out[31]:
sepal_length 150
sepal_width 150
petal_length 150
petal_width 150
species 150
dtype: int64
In [32]: tbl.mean()
Out[32]:
sepal_length 5.843333
sepal_width 3.054000
petal_length 3.758667
petal_width 1.198667
dtype: float64
In [33]: tbl.probt()
Out[33]:
sepal_length 3.331256e-129
sepal_width 4.374977e-129
petal_length 1.994305e-57
petal_width 3.209704e-42
dtype: float64
At the time of this writing, not all of the DataFrame methods were supported on the CASTable object. As development of the SWAT package continues, more of the DataFrame methods will likely become available.
Now that we have seen some basic DataFrame operations working on CASTable objects, let’s move on to some plotting operations.
The CASTable objects support all of the same chart types as the DataFrame object because they use the same code. Plotting is done on the client side, so all of the data being visualized must be downloaded first. The plotting DataFrame methods are mirrored in the CASTable objects, and the data is downloaded automatically. Although these methods are convenient, for iterative work that might require tuning your plots, you probably want to bring the data back to a local DataFrame and use the plotting methods on it to avoid downloading data from the server at each iteration.
The easiest way to generate a plot of data in a CASTable is to use the plot method. We’ll use our original Iris data set for this example. The following creates a plot:
In [34]: tbl.plot()
Out[34]: <matplotlib.axes.AxesSubplot at 0x7f4b6650c4e0>
If you use the Jupyter notebook as your client, your plot should show up in-line. If it doesn’t appear in-line, you might try running the following IPython magic command and then do the plot again:
In [35]: %matplotlib inline
If you are using something other than the Jupyter notebook and the plot doesn’t show up, you can try the following:
In [36]: from matplotlib.pyplot import show
In [37]: show()
The show function tells Matplotlib, which is the plotting package, to show the figure window. The following screenshot shows the plot that is generated on a Linux machine:
Of course, you can display subsets of columns as well using the standard DataFrame syntax.
In [38]: tbl[['sepal_length', 'sepal_width']].plot()
Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x7fa3efc23240>
In [39]: show()
Here is a plot of selected columns from the Iris data set:
Other chart types include area, bar, horizontal bar, box plot, density, histogram, KDE, and pie. Depending on your version of Pandas, these chart types are accessible in multiple ways. The traditional way is to specify the type of chart in the kind argument of the plot method as follows:
In [40]: tbl[['sepal_length', 'sepal_width']].plot(kind='area')
Out[40]: <matplotlib.axes._subplots.AxesSubplot at 0x7f77480ce908>
In [41]: show()
Here is an area plot of selected column of the Iris data set:
In newer versions of Pandas (releases 0.17.0 and later), you can use the individual chart type attributes on the plot method.
In [42]: tbl[['sepal_length', 'sepal_width']].plot.area()
Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7747ea1fd0>
In [43]: show()
Options to the plotting methods can be used to customize the output (see the Pandas documentation for additional details). However, this interface does not provide publication-ready charts. It is primarily for exploration. For highly customized graphics, you should use the Matplotlib package directly.
The data in CAS tables can be saved to all of the formats that are supported by the Pandas DataFrame’s export methods. It accomplishes this by fetching the data and using the DataFrame export methods to save the data. Export methods include to_pickle, to_csv, to_hdf, to_sql, to_dict, to_excel, to_json, to_html, to_latex, to_stata, to_msgpack, to_gbq, to_records, to_sparse, to_dense, to_string, and to_clipboard. Keep in mind that all of these methods save the output on the client side. To save the data in a CAS table to a data source, you use the save CAS action.
The save action stores the data in the CAS table to the appropriate format for the caslib where the data is stored. The most basic form is the file-based caslib. You can store the data from a CAS table to either the CSV form or the SASHDAT form in file-based caslibs. To store the data, you simply specify the caslib to which the file should be saved, and you specify the path to the file in the name parameter.
In [44]: tbl.save(name='data/irisout.csv', caslib='casuser')
NOTE: Cloud Analytic Services created the file data/irisout.csv in caslib CASUSER(username).
Out[44]:
[caslib]
'CASUSER(username)'
[name]
'data/irisout.csv'
+ Elapsed: 0.0143s, user: 0.004s, mem: 2.14mb
In [45]: tbl.save(name='data/irisout.sashdat', caslib='casuser')
NOTE: Cloud Analytic Services created the file data/irisout.sashdat in caslib CASUSER(username).
Out[45]:
[caslib]
'CASUSER(username)'
[name]
'data/irisout.sashdat'
+ Elapsed: 0.0214s, user: 0.016s, sys: 0.003s, mem: 16.2mb
For caslibs with other data sources, see the SAS documentation for the options and parameters that are needed to export the data to a specific data source type.
Some of the most basic data operations deal with sorting rows, selecting data based on labels or expressions, and iterating over rows or columns of a table. The following sections introduce techniques for accomplishing each data operation using the CASTable object.
As mentioned previously, data in a CAS table might be distributed across a grid, or it can be accessible in ways that do not specify a particular row order. This means that anytime you fetch data from CAS, you can’t depend on the order unless you specify a sort order when retrieving it. A sort order can be specified with the sortby parameter on the fetch action, but a different method is used with DataFrames. DataFrame methods preserve order and can be sorted using the sort_values method (or the sort method in older versions of Pandas). To mimic the DataFrame behavior, CASTable objects also support those sorting methods.
The simplest form of sort_values is to specify the names of the columns to sort on. The order is ascending by default.
In [46]: tbl.sort_values(['sepal_length', 'sepal_width'])
Out[46]: CASTable('DATA.IRIS',
caslib='CASUSER(username)').sort_values(
['sepal_length', 'sepal_width'])
There is a subtle point in the preceding code that you might not have noticed if you aren’t familiar with Pandas. When you use the sort_values method on a DataFrame, it returns a new DataFrame object with an applied sort order unless you specify the inplace=True option. This is true with CASTable objects as well. The preceding code probably didn’t accomplish its sorting goal. Instead, it created a new CASTable object with the sort order applied, but since we didn’t save it to a variable, it simply disappeared. We have two choices: we can either use the inplace=True option on sort_values, or specify a variable for storing the returned value. Let’s store the value in a variable in the following example:
In [47]: sorttbl = tbl.sort_values(['sepal_length', 'sepal_width'])
In [48]: sorttbl
Out[48]: CASTable('DATA.IRIS',
caslib='CASUSER(username)').sort_values(
['sepal_length', 'sepal_width'])
Now we have a new CASTable object with the sort order applied as indicated by the sort_values(…) that is shown in the string representation above. To clarify, this sorting does nothing on the server side. In fact, nothing was sent to the server when this operation was invoked. The sort_values method call simply binds a sort order to the CASTable object that gets used anytime that the CASTable object is used to retrieve data from that CAS table.
We can now use the head or tail method to retrieve a sample of the data in the specified sort order.
In [50]: sorttbl.head(10)
Out[50]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 4.3 3.0 1.1 0.1 setosa
1 4.4 2.9 1.4 0.2 setosa
2 4.4 3.0 1.3 0.2 setosa
3 4.4 3.2 1.3 0.2 setosa
4 4.5 2.3 1.3 0.3 setosa
5 4.6 3.1 1.5 0.2 setosa
6 4.6 3.2 1.4 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
8 4.6 3.6 1.0 0.2 setosa
9 4.7 3.2 1.6 0.2 setosa
In [51]: sorttbl.tail(5)
Out[51]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
145 7.7 2.6 6.9 2.3 virginica
146 7.7 2.8 6.7 2.0 virginica
147 7.7 3.0 6.1 2.3 virginica
148 7.7 3.8 6.7 2.2 virginica
149 7.9 3.8 6.4 2.0 virginica
To change the order to descending, you can specify ascending=False or you can give a list of Booleans to indicate ascending or descending order for each column.
In [52]: sorttbl = tbl.sort_values(['sepal_length', 'sepal_width'],
....: ascending=[False, True])
In [53]: sorttbl
Out[53]: CASTable('DATA.IRIS',
caslib='CASUSER(username)').sort_values(
['sepal_length', 'sepal_width'],
ascending=[False, True])
If we fetch a sample of the data with our mixed descending/ascending sort order, you’ll get the following.
In [54]: sorttbl.head(10)
Out[54]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 7.9 3.8 6.4 2.0 virginica
1 7.7 2.6 6.9 2.3 virginica
2 7.7 2.8 6.7 2.0 virginica
3 7.7 3.0 6.1 2.3 virginica
4 7.7 3.8 6.7 2.2 virginica
5 7.6 3.0 6.6 2.1 virginica
6 7.4 2.8 6.1 1.9 virginica
7 7.3 2.9 6.3 1.8 virginica
8 7.2 3.0 5.8 1.6 virginica
9 7.2 3.2 6.0 1.8 virginica
Now that we have seen how to apply a sort order to a CASTable, we can go to the next set of DataFrame features that pertain to iteration.
CASTable objects support all of the DataFrame iteration methods. However, you should use the iteration methods judiciously. Although you can iterate through the rows of a CAS table, you probably don’t want to do that if your table contains millions of rows. Realistically, you should use iteration methods with CAS tables that contain only summarization data or CAS tables that have a constrained number of records.
Iterating through the column names is the most basic form of iteration on a CASTable object. Iterating through columns is done by using the table object itself as the iterator just like with DataFrame objects.
In [55]: for col in sorttbl:
....: print(col)
....:
sepal_length
sepal_width
petal_length
petal_width
species
Of course, you can also wrap the table in the various Python iteration functions to get a different behavior. For example, you can use the enumerate function to give the index value of each column name, or you can even use the zip function to walk through multiple lists of the same size.
In [56]: for i, col in enumerate(sorttbl):
....: print(i, col)
....:
0 sepal_length
1 sepal_width
2 petal_length
3 petal_width
4 species
In [57]: for col, dtype in zip(sorttbl, sorttbl.dtypes):
....: print(col, dtype)
....:
sepal_length double
sepal_width double
petal_length double
petal_width double
species varchar
Much like the basic iterator, the iteritems method iterates over the columns as well. However, in this case, the returned values are a pairing of the column name and a CASColumn object. CASColumn objects are just like CASTable objects except that they reference only a single column. They are analogous to Pandas Series objects and are discussed later in the chapter.
In [58]: for col, obj in sorttbl.iteritems():
....: print(col, obj)
....: print('')
....:
sepal_length CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['sepal_length'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])
sepal_width CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['sepal_width'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])
petal_length CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['petal_length'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])
petal_width CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['petal_width'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])
species CASColumn('DATA.IRIS',
caslib='CASUSER(username)')['species'].sort_values(['sepal_length',
'sepal_width'], ascending=[False, True])
Although the previous two iterators can be safely run on large tables (they don’t iterate through the data rows), the following two iterators go through all of the rows in a table. The iterrows method returns a tuple that contains the row index and a Pandas Series object that contains the values of the row.
In [59]: for row in sorttbl.iterrows():
....: print(row)
....:
(0, sepal_length 4.3
sepal_width 3
petal_length 1.1
petal_width 0.1
species setosa
Name: 0, dtype: object)
(1, sepal_length 4.4
sepal_width 2.9
petal_length 1.4
petal_width 0.2
species setosa
Name: 1, dtype: object)
...
(149, sepal_length 7.9
sepal_width 3.8
petal_length 6.4
petal_width 2
species virginica
Name: 149, dtype: object)
This method uses a Python generator to appear as if it fetches all of the data and iterates over it. However, the iterrows method pulls the data down in batches (configurable by the n argument) and waits until a new batch is needed before it fetches the next batch. The itertuples method works in a very similar way, but it returns each row as a tuple of values rather than a Series. The row index value can be disabled by passing index=False.
In [60]: for row in sorttbl.itertuples():
....: print(row)
....:
(0, 4.2999999999999998, 3.0, 1.1000000000000001, 0.10000000000000001,
'setosa')
(1, 4.4000000000000004, 2.8999999999999999, 1.3999999999999999,
0.20000000000000001, 'setosa')
(2, 4.4000000000000004, 3.0, 1.3, 0.20000000000000001, 'setosa')
. . .
(149, 7.9000000000000004, 3.7999999999999998, 6.4000000000000004, 2.0,
'virginica')
Now that we’ve seen how to iterate through our data, let’s look at how we can retrieve items from the data.
Indexing and selecting data in DataFrames comes in many forms. The loc and iloc attributes enable you to select rows that are based on labels and row indexes. The attributes, at and iat, enable you to select a scalar value at a particular position, given a label and a row index, respectively, paired with a column identifier. The ix attribute supports mixed integer and label access. Finally, using the Python __getitem__ functionality (through the bracket [...] syntax) on the DataFrame itself enables you to select columns and ranges of rows.
Probably, the most common way that you see the Pandas DataFrame being indexed is directly on the DataFrame itself. DataFrame objects (and similarly CASTables objects) enable you to select columns either by name or index. To select a single column, you simply index the CASTable using the name of the column.
In [61]: col = sorttbl['sepal_width']
In [62]: col
Out[62]: CASColumn('data.iris', caslib='casuser')
['sepal_width'].sort_values(['sepal_length',
'sepal_width'],
ascending=[False, True])
In [63]: col.head()
Out[63]:
0 3.8
1 2.6
2 2.8
3 3.0
4 3.8
Name: sepal_width, dtype: float64
The object that we get back is a CASColumn object. This object is analogous to a Pandas Series. Technically, CAS doesn’t support stand-alone columns. The CASColumn object is a specialized subclass of CASTable object that references a single column of the table. As you can see, when you retrieve a column from a CASTable, the resulting CASColumn takes with it all table and sort information that was on the parent CASTable. We cover more advanced uses of CASColumn objects later in the chapter. For now, let’s move on to more CASTable indexing.
An alternative way of selecting a column is using the Python attribute syntax. You can specify the name of the column on the CASTable in order to have it return a CASColumn.
In [64]: sorttbl.species
Out[64]: CASColumn('DATA.IRIS', caslib='CASUSER(username)')
['species'].sort_values(['sepal_length', 'sepal_width'],
ascending=[False, True])
Although this way of accessing columns is prettier than the bracket syntax, it has some restrictions. The name of the column must be a legal Python attribute name, which cannot collide with any existing attributes or methods on the CASTable object. Although this syntax is appropriate for interactive use, it’s much safer to write programs using the bracket syntax.
In addition to retrieving columns from a CASTable, we can also retrieve subsets of a table. To do this, you use a list of columns as the index key.
In [65]: widths = sorttbl[['sepal_width', 'petal_width', 'species']]
In [66]: widths
Out[66]: CASTable('DATA.IRIS', caslib='CASUSER(username)')[
['sepal_width', 'petal_width', 'species']
].sort_values(['sepal_length', 'sepal_width'],
ascending=[False, True])
In [67]: widths.head()
Out[67]:
Selected Rows from Table DATA.IRIS
sepal_width petal_width species
0 3.8 2.0 virginica
1 2.6 2.3 virginica
2 2.8 2.0 virginica
3 3.0 2.3 virginica
4 3.8 2.2 virginica
If you specify a list of column names (or even a list containing a single column name), you’ll get back a new CASTable object that includes only the referenced columns. The data in this table is still the same as the data in the parent table. The local CASTable object simply references the columns that it needs.
Of course, now that we have this new table, we can use any of the previously described CASTable methods on it (including the CAS action and DataFrame methods).
In [68]: widths.describe()
Out[68]:
sepal_width petal_width
count 150.000000 150.000000
mean 3.054000 1.198667
std 0.433594 0.763161
min 2.000000 0.100000
25% 2.800000 0.300000
50% 3.000000 1.300000
75% 3.300000 1.800000
max 4.400000 2.500000
In [69]: widths.columninfo()
Out[69]:
[ColumnInfo]
Column ID Type RawLength FormattedLength NFL NFD
0 sepal_width 2 double 8 12 0 0
1 petal_width 4 double 8 12 0 0
2 species 5 varchar 10 10 0 0
+ Elapsed: 0.000839s, sys: 0.001s, mem: 0.188mb
In addition to basic column selection such as the examples shown here, you can also use the various accessor properties of DataFrame objects to select columns in other ways. That’s covered in the next section.
Although DataFrames enable you to select rows using the loc, iloc, at, iat, and ix accessors, CAS tables do not have the concept of a row label or an index. These accessors still exist on CASTable objects, but they are limited to column selection only.
Let’s use our sorted table from the previous example to select some data points.
In [70]: sorttbl
Out[70]: CASTable('DATA.IRIS',
caslib='CASUSER(username)').sort_values(
['sepal_length', 'sepal_width'],
ascending=[False, True])
The loc and iloc data accessors on CASTable objects can select only columns. The loc accessor selects columns by name, whereas the iloc accessor selects columns by index. You must specify an empty slice as the first index parameter.
# Select the petal_width column
In [71]: sorttbl.loc[:, 'petal_width'].head()
Out[71]:
0 2.0
1 2.3
2 2.0
3 2.3
4 2.2
Name: petal_width, dtype: float64
# Select a range of columns
In [72]: sorttbl.loc[:, 'sepal_length':'petal_length'].head()
Out[72]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length
0 7.9 3.8 6.4
1 7.7 2.6 6.9
2 7.7 2.8 6.7
3 7.7 3.0 6.1
4 7.7 3.8 6.7
# Select a list of columns
In [73]: sorttbl.loc[:, ['petal_width', 'sepal_width']].head()
Out[73]:
Selected Rows from Table DATA.IRIS
petal_width sepal_width
0 2.0 3.8
1 2.3 2.6
2 2.0 2.8
3 2.3 3.0
4 2.2 3.8
The iloc accessor works much like loc except that it uses integer indexes.
In [74]: sorttbl.iloc[:, 3].head()
Out[74]:
0 2.0
1 2.3
2 2.0
3 2.3
4 2.2
Name: petal_width, dtype: float64
In [75]: sorttbl.iloc[:, 0:3].head()
Out[75]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length
0 7.9 3.8 6.4
1 7.7 2.6 6.9
2 7.7 2.8 6.7
3 7.7 3.0 6.1
4 7.7 3.8 6.7
In [76]: sorttbl.iloc[:, [3, 1]].head()
Out[76]:
Selected Rows from Table DATA.IRIS
petal_width sepal_width
0 2.0 3.8
1 2.3 2.6
2 2.0 2.8
3 2.3 3.0
4 2.2 3.8
Finally, the ix accessor enables you to mix labels and position indexes.
In [77]: sorttbl.ix[:, [3, 'sepal_width']].head()
Out[77]:
Selected Rows from Table DATA.IRIS
petal_width sepal_width
0 2.0 3.8
1 2.3 2.6
2 2.0 2.8
3 2.3 3.0
4 2.2 3.8
In [78]: sorttbl.ix[:, 'sepal_width'].head()
Out[78]:
0 3.8
1 2.6
2 2.8
3 3.0
4 3.8
Name: sepal_width, dtype: float64
In [79]: sorttbl.ix[:, 'sepal_width':-2].head()
Out[79]:
Selected Rows from Table DATA.IRIS
sepal_width petal_length
0 3.8 6.4
1 2.6 6.9
2 2.8 6.7
3 3.0 6.1
4 3.8 6.7
In [80]: sorttbl.ix[:, ['sepal_width', 3, 4]].head()
Out[80]:
Selected Rows from Table DATA.IRIS
sepal_width petal_width species
0 3.8 2.0 virginica
1 2.6 2.3 virginica
2 2.8 2.0 virginica
3 3.0 2.3 virginica
4 3.8 2.2 virginica
Although the preceding accessors can be useful, they are somewhat limited because the nature of distributed data prevents selection by row index. However, there is a more powerful indexing feature on CASTable objects that allows for various forms of advanced selection of data.
Rather than using fixed values of rows and columns to select data, we can create conditions that are based on the data in the table to determine which rows to select. The specification of conditions is done using the same syntax as that used by DataFrame objects. CASColumn objects support Python’s various comparison operators and build a filter that subsets the rows in the table. You can then use the result of that comparison to index into a CASTable. It sounds much more complicated than it is, so let’s look at an example.
If we want to get a CASTable that contains only values where petal_length is greater than 6.5, we can use the following expression to create our filter:
In [81]: expr = sorttbl.petal_length > 6.5
This expression creates a computed column that is used in a where expression on the CASTable. This expression can then be used as an index value for a CASTable. Indexing this way essentially creates a Boolean mask. Wherever the expression values are true, the rows of the table are returned. Wherever the expression is false, the rows are filtered out.
In [82]: newtbl = sorttbl[expr]
In [83]: newtbl.head()
Out[83]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 7.7 2.6 6.9 2.3 virginica
1 7.7 2.8 6.7 2.0 virginica
2 7.7 3.8 6.7 2.2 virginica
3 7.6 3.0 6.6 2.1 virginica
These two steps are commonly entered on one line.
In [84]: newtbl = sorttbl[sorttbl.petal_length > 6.5]
In [85]: newtbl.head()
Out[85]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 7.7 2.6 6.9 2.3 virginica
1 7.7 2.8 6.7 2.0 virginica
2 7.7 3.8 6.7 2.2 virginica
3 7.6 3.0 6.6 2.1 virginica
We can further filter rows out by indexing another comparison expression.
In [86]: newtbl2 = newtbl[newtbl.petal_width < 2.2]
In [87]: newtbl2.head()
Out[87]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 7.7 2.8 6.7 2.0 virginica
1 7.6 3.0 6.6 2.1 virginica
Comparisons can be joined using the bitwise comparison operators & (and) and | (or). You must be careful with these operators though due to operator precedence. Bitwise comparison has a higher precedence than comparisons such as greater-than and less-than, so you need to enclose your comparisons in parentheses.
In [88]: sorttbl[(sorttbl.petal_length > 6.5) &
....: (sorttbl.petal_width < 2.2)].head()
Out[88]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 7.7 2.8 6.7 2.0 virginica
1 7.6 3.0 6.6 2.1 virginica
In all cases, we are not changing anything about the underlying data in CAS. We are simply constructing a query that is executed with the CASTable when it is used as the parameter in a CAS action. You can see what is happening behind the scenes by displaying the resulting CASTable objects.
In [89]: sorttbl[(sorttbl.petal_length > 6.5) &
....: (sorttbl.petal_width < 2.2)]
Out[89]: CASTable('data.iris', caslib='casuser',
computedvars=['_gt_6_', '_and_8_', '_lt_7_'],
computedvarsprogram='_gt_6_ = (petal_length > 6.5);
_lt_7_ = (petal_width < 2.2);
_and_8_ = (_gt_6_ and _lt_7_); ',
where='(_and_8_)')
[['sepal_length', 'sepal_width',
'petal_length', 'petal_width',
'species']].sort_values(['sepal_length',
'sepal_width'],
ascending=[False, True])
You can also do mathematical operations on columns with constants or on other columns within your comparisons.
In [90]: sorttbl[(sorttbl.petal_length + sorttbl.petal_width)
....: * 2 > 17.5].head()
Out[90]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 7.7 2.6 6.9 2.3 virginica
1 7.7 3.8 6.7 2.2 virginica
The list of supported operators is shown in the following table:
Operator | Numeric Data | Character Data |
+ (add) | ✓ | ✓ |
- (subtract) | ✓ | |
* (multiply) | ✓ | |
/ (divide) | ✓ | |
% (modulo) | ✓ | |
** (power) | ✓ |
The supported comparison operators are shown in the following table.
Operator | Numeric Data | Character Data |
== (equality) | ✓ | ✓ |
!= (inequality) | ✓ | ✓ |
< (less than) | ✓ | ✓ |
> (greater than) | ✓ | ✓ |
<= (less than or equal to) | ✓ | ✓ |
>= (greater than or equal to) | ✓ | ✓ |
As you can see in the preceding tables, you can do comparisons on character columns as well. It is also possible to use many of the Python string methods on the column values. These methods are accessed through the str attribute of the column, just like in DataFrames.
In [91]: sorttbl[sorttbl.species.str.upper().str.startswith('SET')
....: ].head()
Out[91]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 5.8 4.0 1.2 0.2 setosa
1 5.7 3.8 1.7 0.3 setosa
2 5.7 4.4 1.5 0.4 setosa
3 5.5 3.5 1.3 0.2 setosa
4 5.5 4.2 1.4 0.2 setosa
The following table shows the string methods that are supported on character columns:
Method | Description |
col.str.capitalize | Capitalize the first character, and lowercase the remaining characters. |
col.str.contains | Return Boolean values indicating whether the pattern exists in the data values. |
col.str.count | Count occurrences of the pattern in each value. |
col.str.endswith | Return Boolean values indicating whether the values end with the pattern. |
col.str.find | Return the lowest index of the pattern in each value, or return -1 if the pattern is not found. |
col.str.index | Is the same as the find method (the preceding entry) except that a ValueError is raised if the pattern is not found. |
col.str.len | Compute the length of each value. |
col.str.lower | Lowercase the entire column. |
col.str.lstrip | Strip the leading spaces. |
col.str.repeat | Duplicate the value by the specified number of times. |
col.str.rfind | Return the highest index of the pattern in each value, or return -1 if the pattern is not found. |
col.str.rindex | Is the same as the rfind method (the preceding entry) except that a ValueError is raised if the pattern is not found. |
col.str.rstrip | Strip the trailing whitespace characters. |
col.str.slice | Slice a substring from the value. |
col.str.startswith | Return Boolean values indicating the value starts with the pattern. |
col.str.strip | Strip the leading and trailing whitespace characters. |
col.str.title | Capitalize each word in the value, and lowercase the remaining characters. |
col.str.upper | Uppercase the value. |
col.str.isalnum | Return Boolean values indicating whether the value is all alphanumeric values. |
col.str.isalpha | Return Boolean values indicating whether the value is all alphabetic characters. |
col.str.isdigit | Return Boolean values indicating whether the value is all digits. |
col.str.isspace | Return Boolean values indicating whether the value is all whitespace characters. |
col.str.islower | Return Boolean values indicating whether the value is all lowercase. |
col.str.isupper | Return Boolean values indicating whether the value is all uppercase. |
col.str.istitle | Return Boolean values indicating whether the value is equivalent to the title representation. (See the title method.) |
col.str.isnumeric | Return Boolean values indicating whether the value is in a numeric representation. |
col.str.isdecimal | Return Boolean values indicating whether the value is in a decimal representation. |
In addition to the string column methods, you can also use the following date/time properties on date, time, and datetime columns.
Property | Description |
col.dt.year | Return the year of the date/time/datetime. |
col.dt.month | Return the month of the date/time/datetime, where January is 1 and December is 12. |
col.dt.day | Return the day of the month of the date/time/datetime. |
col.dt.hour | Return the hours of the date/time/datetime (0-24). |
col.dt.minute | Return the minutes of the date/time/datetime (0-60). |
col.dt.second | Return the seconds of the date/time/datetime (0-60). |
col.dt.microsecond | Return the microseconds of the date/time/datetime (0-999999). |
col.dt.nanosecond | Return the nanoseconds of the date/time/datetime. Since CAS has a resolution of only one microsecond, this property always returns zero. |
col.dt.week | Return the week of the year of the date/time/datetime. |
col.dt.weekofyear | Is the same as the week property. |
col.dt.dayofweek | Return the day of the week of the date/time/datetime. Monday is 0 and Sunday is 6. |
col.dt.weekday | Is the same as the dayofweek property. |
col.dt.dayofyear | Return the ordinal day of the year of the date/time/datetime. |
col.dt.quarter | Return the year’s quarter of the date/time/datetime. |
col.dt.is_month_start | Boolean indicating whether the date/time/datetime is the first day of the month. |
col.dt.is_month_end | Boolean indicating whether the date/time/datetime is the last day of the month. |
col.dt.is_quarter_start | Boolean indicating whether the date/time/datetime is the first day of the quarter. |
col.dt.is_quarter_end | Boolean indicating whether the date/time/datetime is the last day of the quarter. |
col.dt.is_year_start | Boolean indicating whether the date/time/datetime is the first day of the year. |
col.dt.is_year_end | Boolean indicating whether the date/time/datetime is the last day of the year. |
col.dt.daysinmonth | Returns the number of days in the month of the date/time/datetime. |
col.dt.days_in_month | Same as daysinmonth. |
Note that when you call any of the previous methods on CAS time values, any method that returns a date-related component uses the current day. When you call time-related methods on a CAS date, the values are always zero.
It’s easy to create powerful filters that are executed in CAS while still using the Python syntax. However, the similarities to DataFrames don’t end there. CASTable objects can also create computed columns and by groups using similar techniques. Those topics are covered in the next section.
There’s a certain amount of “data wrangling” that is required to prepare your data for reporting or modeling. Data wrangling refers to data manipulation, including the transformation of data from one format to another format, BY-group processing, and data aggregation. Such manipulations on large data sets often become bottlenecks to understanding your data, because most of the steps to manipulate data require that you create a new copy of the data or that you move data around the nodes of your cluster. In CAS, several ways are provided to speed up the data wrangling steps such as temporary computed columns and group by processing. Neither requires copying data or moving data.
Rather than using the CASColumn-based expressions to filter a CASTable object you can also use them to build computed columns. Instead of comparing the result of an expression to another value, you simply set it to a key on the CASTable object. Here is an example that adds the sepal_length and sepal_width and then multiplies the result by 2. The result is a CASColumn object that we set to sepal_factor.
In [92]: sorttbl['sepal_factor'] = ((sorttbl.sepal_length +
....: sorttbl.sepal_width) * 2)
In [93]: sorttbl.head()
Out[93]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 7.9 3.8 6.4 2.0 virginica
1 7.7 2.6 6.9 2.3 virginica
2 7.7 2.8 6.7 2.0 virginica
3 7.7 3.0 6.1 2.3 virginica
4 7.7 3.8 6.7 2.2 virginica
sepal_factor
0 23.4
1 20.6
2 21.0
3 21.4
4 23.0
The sepal_factor column is a computed column that is not present in the original CAS table. It is a temporary column generated “on the fly” when an action is executed. They exist only for the duration of the action, and are not visible to any other action. Computed columns are efficient because you can use computed columns to dynamically filter or expand the data without actually duplicating the data to create those extra filters or columns. You can create computed columns from other computed columns as well. In the following code, we’ll use our previously computed column as part of the expression for a new computed column.
In [94]: sorttbl['total_factor'] = sorttbl.sepal_factor +
....: sorttbl.petal_width +
....: sorttbl.petal_length
In [95]: sorttbl.head()
Out[95]:
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 7.9 3.8 6.4 2.0 virginica
1 7.7 2.6 6.9 2.3 virginica
2 7.7 2.8 6.7 2.0 virginica
3 7.7 3.0 6.1 2.3 virginica
4 7.7 3.8 6.7 2.2 virginica
sepal_factor total_factor
0 23.4 31.8
1 20.6 29.8
2 21.0 29.7
3 21.4 29.8
4 23.0 31.9
The output from the head method shows that the extra columns are now available on the table. You can even create computed columns that are based on constants or Python expressions (evaluated on the client) rather than on data in the table.
In [96]: sorttbl['names'] = 'sepal / petal'
In [97]: sorttbl.head()
Out[97]:
sepal_length sepal_width petal_length petal_width species
0 7.9 3.8 6.4 2.0 virginica
1 7.7 2.6 6.9 2.3 virginica
2 7.7 2.8 6.7 2.0 virginica
3 7.7 3.0 6.1 2.3 virginica
4 7.7 3.8 6.7 2.2 virginica
sepal_factor total_factor names
0 23.4 8.4 sepal / petal
1 20.6 9.2 sepal / petal
2 21.0 8.7 sepal / petal
3 21.4 8.4 sepal / petal
4 23.0 8.9 sepal / petal
You can use any of the numeric, string, and date/time methods that are described in the filtering section to construct computed columns as well.
In [98]: sorttbl['cap_names'] = sorttbl.names.str.title()
In [99]: sorttbl.head()
Selected Rows from Table DATA.IRIS
sepal_length sepal_width petal_length petal_width species
0 4.3 3.0 1.1 0.1 setosa
1 4.4 2.9 1.4 0.2 setosa
2 4.4 3.0 1.3 0.2 setosa
3 4.4 3.2 1.3 0.2 setosa
4 4.5 2.3 1.3 0.3 setosa
sepal_factor total_factor names cap_names
0 14.6 15.8 sepal / petal Sepal / Petal
1 14.6 16.2 sepal / petal Sepal / Petal
2 14.8 16.3 sepal / petal Sepal / Petal
3 15.2 16.7 sepal / petal Sepal / Petal
4 13.6 15.2 sepal / petal Sepal / Petal
Now that we have seen that we can filter our data and create computed columns, let’s look at the topic of grouping data by data values.
Many analytic actions and Pandas DataFrame methods support grouping by values in one or more columns in a table. There are a couple of ways to apply by grouping in CASTable objects, one of which is borrowed from the DataFrame API.
The most direct way of applying by groupings to a CASTable object is to use the groupby parameter. We’ll start with the Iris table for this example.
In [100]: tbl.set_param('groupby', ['species'])
In [101]: tbl
Out[101]: CASTable('DATA.IRIS', caslib='CASUSER(username)',
groupby=['species'])
Now that we have some grouping variables set, we’ll run the summary action.
In [102]: tbl.summary(subset=['min', 'max'])
Out[102]:
[ByGroupInfo]
ByGroupInfo
species species_f _key_
0 setosa setosa setosa
1 versicolor versicolor versicolor
2 virginica virginica virginica
[ByGroup1.Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
setosa sepal_length 4.3 5.8
setosa sepal_width 2.3 4.4
setosa petal_length 1.0 1.9
setosa petal_width 0.1 0.6
[ByGroup2.Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
versicolor sepal_length 4.9 7.0
versicolor sepal_width 2.0 3.4
versicolor petal_length 3.0 5.1
versicolor petal_width 1.0 1.8
[ByGroup3.Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
virginica sepal_length 4.9 7.9
virginica sepal_width 2.2 3.8
virginica petal_length 4.5 6.9
virginica petal_width 1.4 2.5
+ Elapsed: 0.0082s, user: 0.006s, sys: 0.007s, mem: 1.94mb
You see that we now get multiple tables as output rather than just the single table that the summary action produces when running without by groupings. The first table is called ByGroupInfo. This table contains information about all of the by grouping tables that follow. This information is more useful if there are a large number of by groupings and you want to know about them in advance so that you can prepare for the tables as they come from the server. The remaining tables correspond to each of the by groupings.
Each by grouping generates a new output table. The name of the key for that output is the name of the table prefixed by a label ByGroup#, where # is the index of the grouping (starting with 1). Without by grouping applied, the summary action creates a table called Summary. With by grouping applied, the result keys are ByGroup1.Summary, ByGroup2.Summary, and so on.
We mentioned that the DataFrame method for applying by groupings works with CASTable objects as well. The DataFrame groupby method is used to group the data. In the following example, we delete the groupby parameter from the table, and then use the groupby method to apply by groupings instead.
In [103]: tbl.del_param('groupby')
In [104]: tbl
Out[104]: CASTable('DATA.IRIS', caslib='CASUSER(username)')
In [105]: grptbl = tbl.groupby(['species'])
In [106]: grptbl
Out[106]: <swat.cas.table.CASTableGroupBy at 0x7f2148e12358>
You notice that when using this method, we don’t get a CASTable object back. Instead, we get a CASTableGroupBy object. This object corresponds to the Pandas GroupBy object. This object works much like a DataFrame object in many ways. For example, you can still call most DataFrame methods, call CAS actions on it, access columns, and so on, but the CASTableGroupBy object also adds methods for traversing the by-groupings in various ways.
Using the summary action on our grptbl variable shows that we get the same result as when we used the groupby parameter on the CASTable object.
In [107]: grptbl.summary(subset=['min', 'max'])
Out[107]:
[ByGroupInfo]
ByGroupInfo
species species_f _key_
0 setosa setosa setosa
1 versicolor versicolor versicolor
2 virginica virginica virginica
[ByGroup1.Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
setosa sepal_length 4.3 5.8
setosa sepal_width 2.3 4.4
setosa petal_length 1.0 1.9
setosa petal_width 0.1 0.6
[ByGroup2.Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
versicolor sepal_length 4.9 7.0
versicolor sepal_width 2.0 3.4
versicolor petal_length 3.0 5.1
versicolor petal_width 1.0 1.8
[ByGroup3.Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
virginica sepal_length 4.9 7.9
virginica sepal_width 2.2 3.8
virginica petal_length 4.5 6.9
virginica petal_width 1.4 2.5
+ Elapsed: 0.00862s, user: 0.008s, sys: 0.009s, mem: 1.98mb
If you already use Pandas, you might want to combine all of the by-grouping tables into a single DataFrame so that you can continue your workflow, as usual. Fortunately, there is a method on the CASResults object that can be used to combine by-grouping tables into a single DataFrame: concat_bygroups.
In [108]: grpsumm = grptbl.summary(subset=['min', 'max'])
In [109]: grpsumm.concat_bygroups()
Out[109]:
[Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
setosa sepal_length 4.3 5.8
setosa sepal_width 2.3 4.4
setosa petal_length 1.0 1.9
setosa petal_width 0.1 0.6
versicolor sepal_length 4.9 7.0
versicolor sepal_width 2.0 3.4
versicolor petal_length 3.0 5.1
versicolor petal_width 1.0 1.8
virginica sepal_length 4.9 7.9
virginica sepal_width 2.2 3.8
virginica petal_length 4.5 6.9
virginica petal_width 1.4 2.5
This method preserves all of the DataFrame index information so that it can be treated just like the output of the native DataFrame methods with by-groups enabled.
If you prefer to keep your by-group tables separate, there are a couple of other methods on the CASResults object that can query specific table names or groups.
Unlike the summary action, some CAS actions return multiple result table names, including the correlation action. We won’t explain the correlation action in detail, but we use it in our example to select a particular table name from a result set. We still use our grptbl variable from the previous example as our starting point.
In [110]: grpcorr = grptbl.correlation()
In [111]: grpcorr
Out[111]:
[ByGroupInfo]
ByGroupInfo
species species_f _key_
0 setosa setosa setosa
1 versicolor versicolor versicolor
2 virginica virginica virginica
[ByGroup1.CorrSimple]
Summary Statistics in Correlation Analysis for DATA.IRIS
Variable N Mean Sum StdDev Minimum
species
setosa sepal_length 50.0 5.006 250.3 0.352490 4.3
setosa sepal_width 50.0 3.418 170.9 0.381024 2.3
setosa petal_length 50.0 1.464 73.2 0.173511 1.0
setosa petal_width 50.0 0.244 12.2 0.107210 0.1
Maximum
species
setosa 5.8
setosa 4.4
setosa 1.9
setosa 0.6
[ByGroup1.Correlation]
Pearson Correlation Coefficients for DATA.IRIS
Variable sepal_length sepal_width petal_length
species
setosa sepal_length 1.000000 0.746780 0.263874
setosa sepal_width 0.746780 1.000000 0.176695
setosa petal_length 0.263874 0.176695 1.000000
setosa petal_width 0.279092 0.279973 0.306308
petal_width
species
setosa 0.279092
setosa 0.279973
setosa 0.306308
setosa 1.000000
... truncated ...
As you can see in the preceding output, the correlation action produces tables that are named CorrSimple and Correlation. These tables appear in each by grouping in the output. If we want to select only the Correlation tables from the output, we can use the get_tables method. This method takes the name of a table and returns a list of all of the tables in the result set with that name. The name of the table should not include the ByGroup#. prefix.
In [112]: grpcorr.get_tables('Correlation')
Out[112]:
[Pearson Correlation Coefficients for DATA.IRIS
Variable sepal_length sepal_width petal_length
species
setosa sepal_length 1.000000 0.746780 0.263874
setosa sepal_width 0.746780 1.000000 0.176695
setosa petal_length 0.263874 0.176695 1.000000
setosa petal_width 0.279092 0.279973 0.306308
petal_width
species
setosa 0.279092
setosa 0.279973
setosa 0.306308
setosa 1.000000 , Pearson Correlation Coefficients for DATA.IRIS
Variable sepal_length sepal_width petal_length
species
versicolor sepal_length 1.000000 0.525911 0.754049
versicolor sepal_width 0.525911 1.000000 0.560522
versicolor petal_length 0.754049 0.560522 1.000000
versicolor petal_width 0.546461 0.663999 0.786668
petal_width
species
versicolor 0.546461
versicolor 0.663999
versicolor 0.786668
versicolor 1.000000 , Pearson Correlation Coefficients for DATA.IRIS
Variable sepal_length sepal_width petal_length
species
virginica sepal_length 1.000000 0.457228 0.864225
virginica sepal_width 0.457228 1.000000 0.401045
virginica petal_length 0.864225 0.401045 1.000000
virginica petal_width 0.281108 0.537728 0.322108
petal_width
species
virginica 0.281108
virginica 0.537728
virginica 0.322108
virginica 1.000000 ]
In the preceding code, you see that we now have a list of DataFrames that correspond to the Correlation tables. You can use the concat function in SWAT to combine the Correlation tables into a single DataFrame.
In [113]: swat.concat(grpcorr.get_tables('Correlation'))
Out[113]:
Pearson Correlation Coefficients for DATA.IRIS
Variable sepal_length sepal_width petal_length
species
setosa sepal_length 1.000000 0.746780 0.263874
setosa sepal_width 0.746780 1.000000 0.176695
setosa petal_length 0.263874 0.176695 1.000000
setosa petal_width 0.279092 0.279973 0.306308
versicolor sepal_length 1.000000 0.525911 0.754049
versicolor sepal_width 0.525911 1.000000 0.560522
versicolor petal_length 0.754049 0.560522 1.000000
versicolor petal_width 0.546461 0.663999 0.786668
virginica sepal_length 1.000000 0.457228 0.864225
virginica sepal_width 0.457228 1.000000 0.401045
virginica petal_length 0.864225 0.401045 1.000000
virginica petal_width 0.281108 0.537728 0.322108
petal_width
species
setosa 0.279092
setosa 0.279973
setosa 0.306308
setosa 1.000000
versicolor 0.546461
versicolor 0.663999
versicolor 0.786668
versicolor 1.000000
virginica 0.281108
virginica 0.537728
virginica 0.322108
virginica 1.000000
We’ve seen how to select specific table names from result sets, but what if we want to select a particular by group? That’s covered in the next section.
The CASResults object has a method to select by groups from a result set using the values of the by-grouping variables. This method is called get_group. If you specify a list of values, get_group matches the by-grouping variables as they were specified via the groupby method.
In [114]: grpsumm.get_group(['versicolor'])
Out[114]:
[Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
versicolor sepal_length 4.9 7.0
versicolor sepal_width 2.0 3.4
versicolor petal_length 3.0 5.1
versicolor petal_width 1.0 1.8
You can also use keyword parameters on the get_group method to specify the desired by-grouping variables.
In [115]: grpsumm.get_group(species='versicolor')
Out[115]:
[Summary]
Descriptive Statistics for DATA.IRIS
Column Min Max
species
versicolor sepal_length 4.9 7.0
versicolor sepal_width 2.0 3.4
versicolor petal_length 3.0 5.1
versicolor petal_width 1.0 1.8
The final topic for discussion about by groupings is the case in which a CAS action has support for multiple sets of by groupings.
Some actions, such as mdsummary, enable you to specify multiple sets of by groups. The output for such actions is similar to what we saw in the examples in the preceding section except that the result keys add another prefix: ByGroupSet#.
In [116]: grpmdsumm =
.....: tbl.mdsummary(sets=[dict(groupby=['sepal_length']),
.....: dict(groupby=['petal_length'])])
In [117]: list(grpmdsumm.keys())
Out[117]:
['ByGroupSet1.ByGroupInfo',
'ByGroupSet1.ByGroup1.MDSummary',
'ByGroupSet1.ByGroup2.MDSummary',
'ByGroupSet1.ByGroup3.MDSummary',
...
'ByGroupSet2.ByGroupInfo',
'ByGroupSet2.ByGroup1.MDSummary',
'ByGroupSet2.ByGroup2.MDSummary',
'ByGroupSet2.ByGroup3.MDSummary',
...]
As you can see from the preceding output, we get a ByGroupSet#.ByGroupInfo table for each by-group set followed by all of the tables for that by-group set. We can extract a particular by-group set from the results using the get_set method.
In [118]: grpmdsumm.get_set(1)
Out[118]:
[ByGroupInfo]
ByGroupSet1.ByGroupInfo
sepal_length sepal_length_f _key_
0 5.0 5 5
1 6.0 6 6
2 7.0 7 7
3 4.3 4.3 4.3
4 4.4 4.4 4.4
5 4.5 4.5 4.5
6 4.6 4.6 4.6
7 4.7 4.7 4.7
8 4.8 4.8 4.8
9 4.9 4.9 4.9
.. ... ... ...
25 6.7 6.7 6.7
26 6.8 6.8 6.8
27 6.9 6.9 6.9
28 7.1 7.1 7.1
29 7.2 7.2 7.2
30 7.3 7.3 7.3
31 7.4 7.4 7.4
32 7.6 7.6 7.6
33 7.7 7.7 7.7
34 7.9 7.9 7.9
[35 rows x 3 columns]
[ByGroup1.MDSummary]
Descriptive Statistics for DATA.IRIS
Column Min Max N NMiss Mean Sum
sepal_length
5 sepal_length 5.0 5.0 10.0 0.0 5.00 50.0
5 sepal_width 2.0 3.6 10.0 0.0 3.12 31.2
5 petal_length 1.2 3.5 10.0 0.0 1.84 18.4
5 petal_width 0.2 1.0 10.0 0.0 0.43 4.3
Std StdErr Var USS CSS
sepal_length
5 0.000000 0.000000 0.000000 250.00 0.000
5 0.543241 0.171788 0.295111 100.00 2.656
5 0.834266 0.263818 0.696000 40.12 6.264
5 0.326769 0.103333 0.106778 2.81 0.961
CV TValue ProbT
sepal_length
5 0.000000 NaN NaN
5 17.411580 18.161922 2.121189e-08
5 45.340551 6.974502 6.505533e-05
5 75.992719 4.161290 2.443024e-03
[ByGroup2.MDSummary]
Descriptive Statistics for DATA.IRIS
Column Min Max N NMiss Mean Sum
sepal_length
6 sepal_length 6.0 6.0 6.0 0.0 6.000000 36.0
6 sepal_width 2.2 3.4 6.0 0.0 2.733333 16.4
6 petal_length 4.0 5.1 6.0 0.0 4.650000 27.9
6 petal_width 1.0 1.8 6.0 0.0 1.500000 9.0
Std StdErr Var USS CSS
sepal_length
6 0.000000 0.000000 0.000000 216.00 0.000000
6 0.471876 0.192642 0.222667 45.94 1.113333
6 0.403733 0.164823 0.163000 130.55 0.815000
6 0.268328 0.109545 0.072000 13.86 0.360000
CV TValue ProbT
sepal_length
6 0.000000 NaN NaN
6 17.263745 14.188635 0.000031
6 8.682421 28.212059 0.000001
6 17.888544 13.693064 0.000037
... truncated ...
You see that when using the get_set method, we get back a new CASResults object that corresponds to the output from that particular by-group set. You can then use get_tables, concat_bygroups, or get_group on that result just as we did in the examples in the previous section.
In [119]: grpmdsumm.get_set(1).concat_bygroups()
Out[119]:
[MDSummary]
Descriptive Statistics for DATA.IRIS
Column Min Max N NMiss Mean Sum
sepal_length
5 sepal_length 5.0 5.0 10.0 0.0 5.000000 50.0
5 sepal_width 2.0 3.6 10.0 0.0 3.120000 31.2
5 petal_length 1.2 3.5 10.0 0.0 1.840000 18.4
5 petal_width 0.2 1.0 10.0 0.0 0.430000 4.3
6 sepal_length 6.0 6.0 6.0 0.0 6.000000 36.0
6 sepal_width 2.2 3.4 6.0 0.0 2.733333 16.4
6 petal_length 4.0 5.1 6.0 0.0 4.650000 27.9
6 petal_width 1.0 1.8 6.0 0.0 1.500000 9.0
7 sepal_length 7.0 7.0 1.0 0.0 7.000000 7.0
7 sepal_width 3.2 3.2 1.0 0.0 3.200000 3.2
... ... ... ... ... ... ... ...
7.6 petal_length 6.6 6.6 1.0 0.0 6.600000 6.6
7.6 petal_width 2.1 2.1 1.0 0.0 2.100000 2.1
7.7 sepal_length 7.7 7.7 4.0 0.0 7.700000 30.8
7.7 sepal_width 2.6 3.8 4.0 0.0 3.050000 12.2
7.7 petal_length 6.1 6.9 4.0 0.0 6.600000 26.4
7.7 petal_width 2.0 2.3 4.0 0.0 2.200000 8.8
7.9 sepal_length 7.9 7.9 1.0 0.0 7.900000 7.9
7.9 sepal_width 3.8 3.8 1.0 0.0 3.800000 3.8
7.9 petal_length 6.4 6.4 1.0 0.0 6.400000 6.4
7.9 petal_width 2.0 2.0 1.0 0.0 2.000000 2.0
... truncated ...
If you are familiar with Pandas DataFrames, hopefully, the use of much of the API in CASTable objects helps you transition to the world of CAS. We covered the basics of using some of the supported DataFrame methods, using various forms of indexing and iteration, filtering using expressions, creating computed columns, and using by groups. With all of that “under your belt,” we can look at some more advanced statistical data in the following chapters.
3.148.144.228