Importing SWAT and Getting Connected
Executing Actions on CAS Tables
If you are already familiar with Python, have a running CAS server, and just can’t wait to get started, we’ve written this chapter just for you. This chapter is a very quick summary of what you can do with CAS from Python. We don’t provide a lot of explanation of the examples; that comes in the later chapters. This chapter is here for those who want to dive in and work through the details in the rest of the book as needed.
In all of the sample code in this chapter, we are using the IPython interface to Python.
To get connected, you need to know only four things about the CAS server: the host name, the port number, your user name, and your password. The SWAT package contains the CAS class that is used to communicate with the server. The arguments to the CAS class are hostname, port, username, and password¹, in that order. Note that you can use the REST interface by specifying the HTTP port that is used by the CAS server. The CAS class can autodetect the port type for the standard CAS port and HTTP. However, if you use HTTPS, you must specify protocol='https' as a keyword argument to the CAS constructor. You can also specify 'cas' or 'http' to explicitly override autodetection.
In [1]: import swat
In [2]: conn = swat.CAS('server-name.mycompany.com', 5570,
...: 'username', 'password')
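If you prefer not to type connection details into every session, one option is to collect them in a small helper first. The following is only a sketch and is not part of SWAT: the helper names cas_settings and connect, the environment variable names (CAS_HOST, CAS_PORT, CAS_USER, CAS_PASSWORD, CAS_PROTOCOL), and the fallback port of 5570 are all assumptions made for illustration. The protocol keyword is passed only when it is set, so autodetection still applies otherwise.

```python
import os


def cas_settings(environ=os.environ):
    """Collect CAS connection arguments from (hypothetical) environment variables."""
    settings = {
        'hostname': environ['CAS_HOST'],
        # 5570 is the port used in the example above; it is assumed as a fallback.
        'port': int(environ.get('CAS_PORT', '5570')),
        'username': environ['CAS_USER'],
        'password': environ['CAS_PASSWORD'],
    }
    protocol = environ.get('CAS_PROTOCOL')
    if protocol:
        # Only needed to override autodetection, for example with 'https'.
        settings['protocol'] = protocol
    return settings


def connect(environ=os.environ):
    """Open a CAS connection using the collected settings."""
    import swat  # imported here so cas_settings works without SWAT installed
    return swat.CAS(**cas_settings(environ))
```

With the variables set, conn = connect() would play the same role as the swat.CAS call shown above.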
When you connect to CAS, it creates a session on the server. By default, all resources (CAS actions, data tables, options, and so on) are available only to that session. Some resources can be promoted to a global scope, which we discuss later in the book.
To see what CAS actions are available, use the help method on the CAS connection object, which calls the help action on the CAS server.
In [3]: out = conn.help()
NOTE: Available Action Sets and Actions:
NOTE: accessControl
NOTE: assumeRole - Assumes a role
NOTE: dropRole - Relinquishes a role
NOTE: showRolesIn - Shows the currently active role
NOTE: showRolesAllowed - Shows the roles that a user
is a member of
NOTE: isInRole - Shows whether a role is assumed
NOTE: isAuthorized - Shows whether access is authorized
NOTE: isAuthorizedActions - Shows whether access is
authorized to actions
NOTE: isAuthorizedTables - Shows whether access is authorized
to tables
NOTE: isAuthorizedColumns - Shows whether access is authorized
to columns
NOTE: listAllPrincipals - Lists all principals that have
explicit access controls
NOTE: whatIsEffective - Lists effective access and
explanations (Origins)
NOTE: partition - Partitions a table
NOTE: recordCount - Shows the number of rows in a Cloud
Analytic Services table
NOTE: loadDataSource - Loads one or more data source interfaces
NOTE: update - Updates rows in a table
The printed notes describe all of the CAS action sets and the actions in those action sets. The help action also returns the action set and action information as a return value. The return values from all actions are in the form of CASResults objects, which are a subclass of the Python collections.OrderedDict class. To see a list of all of the keys, use the keys method just as you would with any Python dictionary. In this case, the keys correspond to the names of the CAS action sets.
In [4]: list(out.keys())
Out[4]:
['accessControl',
'builtins',
'configuration',
'dataPreprocess',
'dataStep',
'percentile',
'search',
'session',
'sessionProp',
'simple',
'table']
Printing the contents of the return value shows all of the top-level keys as sections. In the case of the help action, the information about each action set is returned in a table in each section. These tables are stored in the dictionary as Pandas DataFrames.
In [5]: out
Out[5]:
[accessControl]
name description
0 assumeRole Assumes a role
1 dropRole Relinquishes a role
2 showRolesIn Shows the currently active role
3 showRolesAllowed Shows the roles that a user is a mem...
4 isInRole Shows whether a role is assumed
5 isAuthorized Shows whether access is authorized
6 isAuthorizedActions Shows whether access is authorized t...
7 isAuthorizedTables Shows whether access is authorized t...
8 isAuthorizedColumns Shows whether access is authorized t...
9 listAllPrincipals Lists all principals that have expli...
10 whatIsEffective Lists effective access and explanati...
11 listAcsData Lists access controls for caslibs, t...
12 listAcsActionSet Lists access controls for an action ...
13 repAllAcsCaslib Replaces all access controls for a c...
14 repAllAcsTable Replaces all access controls for a t...
15 repAllAcsColumn Replaces all access controls for a c...
16 repAllAcsActionSet Replaces all access controls for an ...
17 repAllAcsAction Replaces all access controls for an ...
18 updSomeAcsCaslib Adds, deletes, and modifies some acc...
19 updSomeAcsTable Adds, deletes, and modifies some acc...
... truncated ...
+ Elapsed: 0.0034s, user: 0.003s, mem: 0.164mb
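Because the result behaves like an ordered dictionary of tables, ordinary dictionary idioms apply to it. As a rough sketch, the hypothetical helper actions_per_set below counts the entries under each key. It is demonstrated on a stand-in dictionary in which plain lists replace the DataFrames, and only a few of the action names shown above are included.

```python
from collections import OrderedDict


def actions_per_set(results):
    """Map each action set name to the number of actions listed for it."""
    # len() works on a list here and would give the row count of a DataFrame.
    return OrderedDict((name, len(table)) for name, table in results.items())


# Stand-in for the value returned by conn.help(); the lists are abbreviated.
sample = OrderedDict([
    ('accessControl', ['assumeRole', 'dropRole', 'showRolesIn']),
    ('builtins', ['addNode', 'removeNode']),
])
counts = actions_per_set(sample)
```

With a live connection, actions_per_set(out) would be expected to work the same way on the help results.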
Since the output is based on the dictionary object, you can access each key individually as well.
In [6]: out['builtins']
Out[6]:
name description
0 addNode Adds a machine to the server
1 removeNode Remove one or more machines from the...
2 help Shows the parameters for an action o...
3 listNodes Shows the host names used by the server
4 loadActionSet Loads an action set for use in this ...
5 installActionSet Loads an action set in new sessions ...
6 log Shows and modifies logging levels
7 queryActionSet Shows whether an action set is loaded
8 queryName Checks whether a name is an action o...
9 reflect Shows detailed parameter information...
10 serverStatus Shows the status of the server
11 about Shows the status of the server
12 shutdown Shuts down the server
13 userInfo Shows the user information for your ...
14 actionSetInfo Shows the build information from loa...
15 history Shows the actions that were run in t...
16 casCommon Provides parameters that are common ...
17 ping Sends a single request to the server...
18 echo Prints the supplied parameters to th...
19 modifyQueue Modifies the action response queue s...
20 getLicenseInfo Shows the license information for a ...
21 refreshLicense Refresh SAS license information from...
22 httpAddress Shows the HTTP address for the serve...
The keys are commonly alphanumeric, so the CASResults object was extended to enable you to access keys as attributes as well. This just keeps your code a bit cleaner. However, you should be aware that if a result key has the same name as a Python dictionary method, the dictionary method takes precedence. In the following code, we access the builtins key again, but this time we access it as if it were an attribute.
In [7]: out.builtins
Out[7]:
name description
0 addNode Adds a machine to the server
1 removeNode Remove one or more machines from the...
2 help Shows the parameters for an action o...
3 listNodes Shows the host names used by the server
4 loadActionSet Loads an action set for use in this ...
5 installActionSet Loads an action set in new sessions ...
6 log Shows and modifies logging levels
7 queryActionSet Shows whether an action set is loaded
8 queryName Checks whether a name is an action o...
9 reflect Shows detailed parameter information...
10 serverStatus Shows the status of the server
11 about Shows the status of the server
12 shutdown Shuts down the server
13 userInfo Shows the user information for your ...
14 actionSetInfo Shows the build information from loa...
15 history Shows the actions that were run in t...
16 casCommon Provides parameters that are common ...
17 ping Sends a single request to the server...
18 echo Prints the supplied parameters to th...
19 modifyQueue Modifies the action response queue s...
20 getLicenseInfo Shows the license information for a ...
21 refreshLicense Refresh SAS license information from...
22 httpAddress Shows the HTTP address for the serve...
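To see why dictionary methods take precedence over result keys, consider the following toy class. It is a minimal sketch of the general technique, not the actual CASResults implementation, and the name AttrDict is made up for this illustration.

```python
from collections import OrderedDict


class AttrDict(OrderedDict):
    """Ordered dictionary whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so real dictionary methods such as keys() are always found first.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


res = AttrDict(builtins='builtins table', keys='value stored under "keys"')

table = res.builtins        # falls through to the 'builtins' key
method = res.keys           # still the dictionary method, not the stored value
stored = res['keys']        # bracket syntax always reaches the key
```

When a key might collide with a method name, the bracket syntax is the safe choice.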
Just like the help action, all of the action sets and actions are available as attributes and methods on the CAS connection object. For example, the userinfo action is called as follows.
In [8]: conn.userinfo()
Out[8]:
[userInfo]
{'anonymous': False,
'groups': ['users'],
'hostAccount': True,
'providedName': 'username',
'providerName': 'Active Directory',
'uniqueId': 'username',
'userId': 'username'}
+ Elapsed: 0.000291s, mem: 0.0826mb
The result this time is a CASResults object whose content is a dictionary under a single key (userInfo) that contains information about your user account. Although all actions return a CASResults object, there are no strict rules about what keys and values are in that object. The returned values are determined by the action and vary depending on the type of information returned. Analytic actions typically return one or more DataFrames. If you aren’t using IPython to format your results automatically, you can cast the result to a dictionary and then print it using pprint for a nicer representation.
In [9]: from pprint import pprint
In [10]: pprint(dict(conn.userinfo()))
{'userInfo': {'anonymous': False,
'groups': ['users'],
'hostAccount': True,
'providedName': 'username',
'providerName': 'Active Directory',
'uniqueId': 'username',
'userId': 'username'}}
When calling the help and userinfo actions, we actually used a shortcut. In some cases, you might need to specify the fully qualified name of the action, which includes the action set name. This can happen if two action sets have an action of the same name, or an action name collides with an existing method or attribute name on the CAS object. The userinfo action is contained in the builtins action set. To call it using the fully qualified name, you use builtins.userinfo rather than userinfo on the CAS object. The builtins level in this call corresponds to a CASActionSet object that contains all of the actions in the builtins action set.
In [11]: conn.builtins.userinfo()
The preceding code provides you with the same result as the previous example does.
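Since action sets and actions are reached through ordinary attribute access, a qualified name held in a string can be resolved with getattr. The helper resolve_action below is a hypothetical convenience, not a SWAT function, and the stand-in connection exists only so that the sketch runs without a server.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_action(conn, qualified_name):
    """Look up 'actionset.action' (or just 'action') on a connection object."""
    # 'builtins.userinfo' becomes getattr(getattr(conn, 'builtins'), 'userinfo').
    return reduce(getattr, qualified_name.split('.'), conn)


# Stand-in connection that mimics conn.builtins.userinfo().
fake_conn = SimpleNamespace(
    builtins=SimpleNamespace(
        userinfo=lambda: {'userInfo': {'userId': 'username'}}
    )
)
result = resolve_action(fake_conn, 'builtins.userinfo')()
```

With a real connection, resolve_action(conn, 'builtins.userinfo')() should be equivalent to conn.builtins.userinfo().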
The easiest way to load data into a CAS server is by using the upload method on the CAS connection object. This method uses a file path or URL that points to a file in various possible formats including CSV, Excel, and SAS data sets. You can also pass a Pandas DataFrame object to the upload method in order to upload the data from that DataFrame to a CAS table. We use the classic Iris data set in the following data loading example.
In [12]: out = conn.upload('https://raw.githubusercontent.com/' +
....: 'pydata/pandas/master/pandas/tests/' +
....: 'data/iris.csv')
In [13]: out
Out[13]:
[caslib]
'CASUSER(username)'
[tableName]
'IRIS'
[casTable]
CASTable('IRIS', caslib='CASUSER(username)')
+ Elapsed: 0.0629s, user: 0.037s, sys: 0.021s, mem: 48.4mb
The output from the upload method is, again, a CASResults object. The output contains the name of the created table, the CASLib that the table was created in, and a CASTable object that can be used to interact with the table on the server. CASTable objects have all of the same CAS action set and action methods as the connection that created them. They also include many of the methods that are defined by Pandas DataFrames so that you can operate on them as if they were local DataFrames. However, no data is transferred until you explicitly fetch it or call a method that returns data from the table (such as head or tail). Until then, all operations are simply combined on the client side, essentially creating a client-side view.
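The idea of a client-side view can be pictured with a toy class that records operations and applies them only when data is requested. This is a conceptual sketch under simplifying assumptions and does not reflect how CASTable is implemented: the class name LazyTable and the in-memory list that stands in for server data are made up for illustration.

```python
class LazyTable:
    """Toy table that records operations and applies them only on fetch."""

    def __init__(self, rows, pending=()):
        self._rows = rows                # stands in for data held on the server
        self._pending = tuple(pending)   # operations recorded so far
        self.fetches = 0                 # how often this view retrieved data

    def sort_values(self, key):
        # Nothing is computed here; the request is only remembered.
        return LazyTable(self._rows, self._pending + (('sort', key),))

    def head(self, n=5):
        # Data is "retrieved" only now, with all pending operations applied.
        self.fetches += 1
        rows = list(self._rows)
        for op, key in self._pending:
            if op == 'sort':
                rows.sort(key=lambda row: row[key])
        return rows[:n]


table = LazyTable([{'x': 3}, {'x': 1}, {'x': 2}])
view = table.sort_values('x')    # no data touched yet
first_two = view.head(2)         # the sort is applied during this fetch
```

The original table object is left unchanged; only the derived view carries the recorded sort.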
We can use actions such as tableinfo and columninfo to access general information about the table itself and its columns.
# Store CASTable object in its own variable.
In [14]: iris = out.casTable
# Call the tableinfo action on the CASTable object.
In [15]: iris.tableinfo()
Out[15]:
[TableInfo]
Name Rows Columns Encoding CreateTimeFormatted
0 IRIS 150 5 utf-8 01Nov2016:16:38:59
ModTimeFormatted JavaCharSet CreateTime ModTime
0 01Nov2016:16:38:59 UTF8 1.793638e+09 1.793638e+09
Global Repeated View SourceName SourceCaslib Compressed
0 0 0 0 0
Creator Modifier
0 username
+ Elapsed: 0.000856s, mem: 0.104mb
# Call the columninfo action on the CASTable.
In [16]: iris.columninfo()
Out[16]:
[ColumnInfo]
Column ID Type RawLength FormattedLength NFL NFD
0 SepalLength 1 double 8 12 0 0
1 SepalWidth 2 double 8 12 0 0
2 PetalLength 3 double 8 12 0 0
3 PetalWidth 4 double 8 12 0 0
4 Name 5 varchar 15 15 0 0
+ Elapsed: 0.000727s, mem: 0.175mb
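Because the ColumnInfo table is a DataFrame, it can be filtered like any other. The following sketch selects the names of the numeric columns. The helper name numeric_columns is hypothetical, and treating only the double type as numeric is an assumption based on the output above. A local copy of two columns from that output is used so that the example runs without a server.

```python
import pandas as pd


def numeric_columns(column_info):
    """Return the names of columns whose Type is 'double'."""
    is_double = column_info['Type'] == 'double'
    return column_info.loc[is_double, 'Column'].tolist()


# Local copy of part of the ColumnInfo table printed above.
column_info = pd.DataFrame({
    'Column': ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name'],
    'Type': ['double', 'double', 'double', 'double', 'varchar'],
})
names = numeric_columns(column_info)
```

With a live table, the input would be iris.columninfo()['ColumnInfo'].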
Now that we have some data, let’s run some more interesting CAS actions on it.
The simple action set that comes with CAS contains some basic analytic actions. You can use either the help action or the IPython ? operator to view the available actions.
In [17]: conn.simple?
Type: Simple
String form: <swat.cas.actions.Simple object at 0x4582b10>
File: swat/cas/actions.py
Definition: conn.simple(self, *args, **kwargs)
Docstring:
Analytics
Actions
-------
simple.correlation : Generates a matrix of Pearson product-moment
correlation coefficients
simple.crosstab : Performs one-way or two-way tabulations
simple.distinct : Computes the distinct number of values of the
variables in the variable list
simple.freq : Generates a frequency distribution for one or
more variables
simple.groupby : Builds BY groups in terms of the variable value
combinations given the variables in the variable
list
simple.mdsummary : Calculates multidimensional summaries of numeric
variables
simple.numrows : Shows the number of rows in a Cloud Analytic
Services table
simple.paracoord : Generates a parallel coordinates plot of the
variables in the variable list
simple.regression : Performs a linear regression up to 3rd-order
polynomials
simple.summary : Generates descriptive statistics of numeric
variables such as the sample mean, sample
variance, sample size, sum of squares, and so on
simple.topk : Returns the top-K and bottom-K distinct values of
each variable included in the variable list based
on a user-specified ranking order
Let’s run the summary action on our CAS table.
In [18]: summ = iris.summary()
In [19]: summ
Out[19]:
[Summary]
Descriptive Statistics for IRIS
Column Min Max N NMiss Mean Sum Std
0 SepalLength 4.3 7.9 150.0 0.0 5.843333 876.5 0.828066
1 SepalWidth 2.0 4.4 150.0 0.0 3.054000 458.1 0.433594
2 PetalLength 1.0 6.9 150.0 0.0 3.758667 563.8 1.764420
3 PetalWidth 0.1 2.5 150.0 0.0 1.198667 179.8 0.763161
StdErr Var USS CSS CV TValue
0 0.067611 0.685694 5223.85 102.168333 14.171126 86.425375
1 0.035403 0.188004 1427.05 28.012600 14.197587 86.264297
2 0.144064 3.113179 2583.00 463.863733 46.942721 26.090198
3 0.062312 0.582414 302.30 86.779733 63.667470 19.236588
ProbT
0 3.331256e-129
1 4.374977e-129
2 1.994305e-57
3 3.209704e-42
+ Elapsed: 0.0256s, user: 0.019s, sys: 0.009s, mem: 1.74mb
The summary action displays summary statistics in a form that is familiar to SAS users. If you want them in a form similar to what Pandas users are used to, you can use the describe method (just like on DataFrames).
In [20]: iris.describe()
Out[20]:
SepalLength SepalWidth PetalLength PetalWidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Note that when you call the describe method on a CASTable object, it calls various CAS actions in the background to do the calculations. This includes the summary, percentile, and topk actions. The output of those actions is combined into a DataFrame in the same form that the real Pandas DataFrame describe method returns. This enables you to use CASTable objects and DataFrame objects interchangeably in your workflow for this method and many other methods.
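To give a feel for that kind of reshaping, the sketch below rearranges a few columns of the summary output shown earlier into the describe layout. It is an illustration only and not the code that SWAT runs: the function name summary_to_describe is made up, the percentile rows are left out, and the numbers are copied from the summary output above.

```python
import pandas as pd


def summary_to_describe(summary):
    """Rearrange summary-style statistics into the describe() layout."""
    renamed = summary.set_index('Column').rename(columns={
        'N': 'count', 'Mean': 'mean', 'Std': 'std', 'Min': 'min', 'Max': 'max',
    })
    # Pick the statistics in describe() order, then turn them into rows.
    out = renamed[['count', 'mean', 'std', 'min', 'max']].T
    out.columns.name = None
    return out


# Values copied from the summary action output shown earlier.
summary = pd.DataFrame({
    'Column': ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'],
    'Min': [4.3, 2.0, 1.0, 0.1],
    'Max': [7.9, 4.4, 6.9, 2.5],
    'N': [150.0, 150.0, 150.0, 150.0],
    'Mean': [5.843333, 3.054000, 3.758667, 1.198667],
    'Std': [0.828066, 0.433594, 1.764420, 0.763161],
})
described = summary_to_describe(summary)
```

The result has one column per variable and one row per statistic, as in the describe output above.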
Since the tables that come back from the CAS server are instances of a subclass of the Pandas DataFrame, you can do anything with them that works on DataFrames. You can plot the results of your actions using the plot method or use them as input to more advanced packages such as Matplotlib and Bokeh, which are covered in more detail in a later section.
The following example uses the plot method to download the entire data set and plot it using the default options.
In [21]: iris.plot()
Out[21]: <matplotlib.axes.AxesSubplot at 0x5339050>
If the plot doesn’t show up automatically, you might have to tell Matplotlib to display it.
In [22]: import matplotlib.pyplot as plt
In [23]: plt.show()
The output that is created by the plot method follows.
Even if you loaded the same data set that we have used in this example, your plot might look different since CAS stores data in a distributed manner. Because of this, the ordering of data from the server is not deterministic unless you sort it when it is fetched. If you run the following commands, you plot the data sorted by SepalLength and SepalWidth.
In [24]: iris.sort_values(['SepalLength', 'SepalWidth']).plot()
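The effect of sorting on ordering can be shown without a server. In the sketch below, a shuffled list stands in for rows that arrive in an arbitrary order, and sorting on both columns restores a single, repeatable order. The helper name sort_rows and the three sample rows are illustrative assumptions.

```python
import random


def sort_rows(rows, keys):
    """Return rows ordered by the given column names, in priority order."""
    return sorted(rows, key=lambda row: tuple(row[k] for k in keys))


# Illustrative rows standing in for data fetched from the server.
rows = [
    {'SepalLength': 5.1, 'SepalWidth': 3.5},
    {'SepalLength': 4.9, 'SepalWidth': 3.0},
    {'SepalLength': 5.1, 'SepalWidth': 3.3},
]
shuffled = rows[:]
random.shuffle(shuffled)             # mimic an arbitrary arrival order

keys = ['SepalLength', 'SepalWidth']
same = sort_rows(rows, keys) == sort_rows(shuffled, keys)
```

Rows that tie on every sort key can still appear in either order, so include enough columns to make the order unique when that matters.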
As with any network or file resource in Python, you should close your CAS connections when you are finished. They time out and disappear eventually if left open, but it’s always a good idea to clean them up explicitly.
In [25]: conn.close()
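In a script, one way to make sure that close is called even when an error occurs is the standard library's contextlib.closing, which works with any object that has a close method. The sketch below uses a stand-in class, FakeConnection, so that it runs without a server; that class and the simulated error are assumptions for illustration only.

```python
from contextlib import closing


class FakeConnection:
    """Stand-in that has the same close() method as a connection object."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn_like = FakeConnection()
try:
    with closing(conn_like):
        # Any work done here is followed by close(), even after an error.
        raise RuntimeError('simulated failure while running actions')
except RuntimeError:
    pass

was_closed = conn_like.closed
```

With a real connection, the same pattern would wrap the object returned by swat.CAS in place of the stand-in.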
Hopefully this 10-minute guide was enough to give you an idea of the basic workflow and capabilities of the Python CAS client. In the following chapters, we dig deeper into the details of the Python CAS client and how to blend the power of SAS analytics with the tools that are available in the Python environment.
1 Later in the book, we show you how to store your password so that you do not need to specify it in your programs.