Chapter 7: Data Exploration and Summary Statistics

Overview

Summarizing Continuous Variables

Descriptive Statistics

Histograms

Percentiles

Correlations

Summarizing Categorical Variables

Distinct Counts

Frequency

Top K

Cross Tabulations

Variable Transformation and Dimension Reduction

Variable Binning

Variable Imputation

Conclusion

Overview

Describing the columns of a table with tabular or visual output is typically the first step in a data analysis or statistical modeling process. In this chapter, you learn how to use CAS to explore and summarize data. Topics include summarizing continuous and categorical variables, variable transformation, dimension reduction, and related visualizations using the Python Bokeh package.

Let’s start by uploading the Organics Purchase data set from your local directory to CAS. In this example, conn is a CAS object that is connected to a session on a CAS server. For more information about managing your data in CAS, refer to Chapters 4 through 6.

In [1]: organics = conn.upload('yourSamplePath/organics.sashdat')

We also import the Pandas and Bokeh packages:

In [2]: import pandas as pd

   ...: from bokeh.charts import Bar, Scatter, output_file, show, Area

Summarizing Continuous Variables

The simple action set provides several useful actions for univariate and multivariate data exploration and summarization. Similar to some action sets that you have learned about in previous chapters, such as builtins and table, the simple action set is also preloaded when you start a new CAS server. Therefore, you can use it on the organics variable, which is a CASTable reference to the data set organics.sashdat that you just uploaded.
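If you want to confirm which action sets are loaded in your session, the builtins action set mentioned above provides an actionsetinfo action. A minimal sketch:

# List the action sets that are currently available in the CAS session
conn.builtins.actionsetinfo()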

Descriptive Statistics

Let’s get started with a basic summary action:

In [3]: organics.summary()

Out[3]:

[Summary]

 

 Descriptive Statistics for ORGANICS

 

            Column      Min      Max          N     NMiss  

 0         DemAffl     0.00    34.00  1606488.0   82460.0   

 1          DemAge    18.00    79.00  1574340.0  114608.0   

 2        PromTime     0.00    39.00  1667592.0   21356.0   

 3   purchase_3mon   698.44  1188.06  1688948.0       0.0   

 4   purchase_6mon  1668.77  2370.87  1688948.0       0.0   

 5   purchase_9mon  2624.09  3468.72  1688948.0       0.0   

 6  purchase_12mon  3698.92  4684.88  1688948.0       0.0   

 

           Mean           Sum         Std    StdErr           Var  

 0     8.711893  1.399555e+07    3.421045  0.002699     11.703547   

 1    53.797152  8.469501e+07   13.205734  0.010525    174.391404   

 2     6.564670  1.094719e+07    4.657008  0.003606     21.687723   

 3   950.027539  1.604547e+09   50.067179  0.038525   2506.722438   

 4  2049.979250  3.462308e+09   70.731010  0.054425   5002.875812   

 5  3070.016785  5.185099e+09   86.588115  0.066627   7497.501651   

 6  4189.994798  7.076683e+09  100.009042  0.076954  10001.808545   

 

             USS           CSS         CV        TValue  ProbT  

 0  1.407294e+08  1.880160e+07  39.268672   3227.695308    0.0  

 1  4.830901e+09  2.745512e+08  24.547273   5111.472302    0.0  

 2  1.080310e+08  3.616625e+07  70.940468   1820.333053    0.0  

 3  1.528598e+12  4.233721e+09   5.270077  24659.894243    0.0  

 4  7.106110e+12  8.449592e+09   3.450328  37665.847454    0.0  

 5  1.593100e+13  1.266288e+10   2.820444  46077.679978    0.0  

 6  2.966816e+13  1.689252e+10   2.386854  54448.053001    0.0  

In this example, you execute the summary action to generate descriptive statistics for all the numeric columns in the organics data set. The default statistics that are generated by the summary action are described in the following table:

Statistic  Description
Min        Minimum value
Max        Maximum value
N          Number of observations with nonmissing values
NMiss      Number of observations with missing values
Mean       Mean
Sum        Sum
Std        Standard Deviation
StdErr     Standard Error
Var        Variance
USS        Uncorrected Sum of Squares
CSS        Corrected Sum of Squares
CV         Coefficient of Variation
TValue     Value of the t statistic for hypothesis testing
ProbT      p-value of the t statistic

Let’s work on a specific set of columns. You can set the inputs parameter of the summary action to obtain the descriptive statistics for specified columns:

In [4]: varlist = ['DemAge', 'Purchase_12mon', 'Purchase_6mon']

   ...: organics.summary(inputs=varlist)

Out[4]:

[Summary]

 

 Descriptive Statistics for ORGANICS

 

            Column      Min      Max          N     NMiss  

 0          DemAge    18.00    79.00  1574340.0  114608.0   

 1  purchase_12mon  3698.92  4684.88  1688948.0       0.0   

 2   purchase_6mon  1668.77  2370.87  1688948.0       0.0   

 

           Mean           Sum         Std    StdErr           Var  

 0    53.797152  8.469501e+07   13.205734  0.010525    174.391404   

 1  4189.994798  7.076683e+09  100.009042  0.076954  10001.808545   

 2  2049.979250  3.462308e+09   70.731010  0.054425   5002.875812   

 

             USS           CSS         CV        TValue  ProbT  

 0  4.830901e+09  2.745512e+08  24.547273   5111.472302    0.0  

 1  2.966816e+13  1.689252e+10   2.386854  54448.053001    0.0  

 2  7.106110e+12  8.449592e+09   3.450328  37665.847454    0.0  

This code generates summary statistics only for columns that you have specified: DemAge, Purchase_12mon, and Purchase_6mon. In the next example, we compute the summary statistics for the columns for recent purchase amounts and save the result in a local object.

In [5]: varlist = ['Purchase_3mon', 'Purchase_6mon', 'Purchase_9mon',

   ...:             'Purchase_12mon']

   ...: result = organics.summary(inputs=varlist)

The result is a CASResults object that contains the log, the performance data, the server information, and the output of the submitted CAS action. For example, you can print out the run time of the preceding summary action:

In [6]: print(result.performance.cpu_system_time)

   ...: print(result.performance.cpu_user_time)

Out[6]:

0.766884

0.367944

A CAS action might return one or more tables, referenced by the keys of the CASResults object. In this example, only one table, Summary, is returned by the summary action.

In [7]: list(result.keys())

Out[7]: ['Summary']

Result tables are returned in the form of SASDataFrame objects.  Because the SASDataFrame is a subclass of the Pandas DataFrame, it works seamlessly with Pandas and many Python packages that support DataFrames. For example, you can use the stack method to rearrange the output summary table.

In [8]: df = result['Summary']

   ...: df.columns

 

Out[8]: Index(['Column', 'Min', 'Max', 'N', 'NMiss', 'Mean', 'Sum', 'Std', 'StdErr', 'Var', 'USS', 'CSS', 'CV', 'TValue', 'ProbT'], dtype='object')

 

 

In [9]: df.index = df['Column']

   ...: stackedDf = df[['Min','Mean','Max']].stack()

   ...: print(stackedDf)

 

Out[9]:

Column              

purchase_3mon   Min      698.440000

                Mean     950.006638

                Max     1188.060000

purchase_6mon   Min     1668.770000

                Mean    2049.989809

                Max     2370.870000

purchase_9mon   Min     2624.090000

                Mean    3070.037322

                Max     3468.720000

purchase_12mon  Min     3704.700000

                Mean    4190.004978

                Max     4684.880000

You can also use the Bokeh package to generate a bar chart using the summary statistics that are generated by the summary action.

In [10]: p = Bar(df, 'Column', values='Mean',

    ...:         color='#1f77b4', agg='mean', title='',

    ...:         xlabel='', ylabel='Mean')

    ...: output_file('visual1.html')

    ...: show(p)

image

Sometimes it is useful to summarize your data by segments, which is accomplished with the groupby parameter.

In [11]: organics.groupby='DemGender'

    ...: result = organics.summary(inputs='DemAge')

The groupby parameter divides the data into three segments according to the value of DemGender: F, M, or U. The summary action then summarizes the DemAge column for each segment, and the summary statistics are returned in three separate DataFrames.

In [12]: list(result.keys())

Out[12]: ['ByGroupInfo', 'ByGroup1.Summary', 'ByGroup2.Summary',

          'ByGroup3.Summary']

In this case, the summary action also returns a table called ByGroupInfo that provides basic information about the by groups.

In [13]: result['ByGroupInfo']

Out[13]:

ByGroupInfo

 

  DemGender DemGender_f _key_

0         F           F     F

1         M           M     M

2         U           U     U

To merge these tables into one result table, use the CASResults concat_bygroups method. It returns a new CASResults object with all by-group tables that are concatenated.

In [14]: result2 = result.concat_bygroups()

    ...: result2['Summary'][['N','Min','Mean','Max','Std']]

 

Out[14]:

Descriptive Statistics for ORGANICS

 

                  N   Min       Mean   Max        Std

DemGender                                            

F          861460.0  18.0  52.880723  79.0  13.546649

M          411768.0  18.0  54.535807  79.0  12.817555

U          301112.0  18.0  55.408884  79.0  12.504695

When you specify a group-by variable, the summary action does not require any data shuffle to summarize data by segments. In fact, you can have as many group-by variables as you need. For example, the next action call summarizes DemAge by two columns: DemGender and DemHomeowner.

In [15]: varlist='DemAge'

    ...: organics.groupby=['DemGender','DemHomeowner']

    ...: result = organics.summary(inputs=varlist).concat_bygroups()

    ...: result['Summary'][['N','Min','Mean','Max','Std']]

 

Out[15]:

Descriptive Statistics for ORGANICS

 

                               N   Min       Mean   Max        Std

DemGender DemHomeowner                                            

F         No            560471.0  18.0  52.883482  79.0  13.543064

          Yes           300989.0  18.0  52.875587  79.0  13.553343

M         No            267238.0  18.0  54.525262  79.0  12.814615

          Yes           144530.0  18.0  54.555303  79.0  12.823012

U         No            195894.0  18.0  55.414745  79.0  12.504366

          Yes           105218.0  18.0  55.397974  79.0  12.505360

In this example, we concatenate the by-group tables, and the levels of the group-by variables are automatically assigned as the index of the underlying Pandas DataFrame. The index is shown below:

In [16]: result['Summary'].index

Out[16]:

MultiIndex(levels=[['F', 'M', 'U'], ['No', 'Yes']],

           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],

           names=['DemGender', 'DemHomeowner'])
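Because the concatenated table uses a Pandas MultiIndex, you can extract the statistics for a single segment with the standard .loc accessor. A small usage sketch based on the preceding table:

# Select the summary statistics for female homeowners
result['Summary'].loc[('F', 'Yes'), ['N', 'Mean']]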

To remove the grouping variables from the CASTable object, you can use either the Python del statement or the CASTable del_params method. After the grouping variables are removed, the summary action computes summary statistics for the entire data set.

In [17]: del organics.groupby

You can also use the del_param method:

In [18]: organics.del_param('groupby')

Histograms

Histograms are another useful visualization tool for summarizing continuous data. A histogram estimates the probability distribution of a continuous variable by equal-distance binning. The CAS dataPreprocess action set provides a histogram action. In the next example, we execute the histogram action for the Purchase_3mon column:

In [19]: result = organics.histogram(

    ...:     reqpacks=[{'nicebinning':False, 'nbins':10}],

    ...:     inputs=['Purchase_3mon']

    ...: )

 

In [20]: result['BinDetails']

Out[20]:

Bin Details for ORGANICS

 

        Variable  BinSetId  BinId  BinLowerBnd  BinUpperBnd  

0  purchase_3mon         1      1      698.440      747.402   

1  purchase_3mon         1      2      747.402      796.364   

2  purchase_3mon         1      3      796.364      845.326   

3  purchase_3mon         1      4      845.326      894.288   

4  purchase_3mon         1      5      894.288      943.250   

5  purchase_3mon         1      6      943.250      992.212   

6  purchase_3mon         1      7      992.212     1041.174   

7  purchase_3mon         1      8     1041.174     1090.136   

8  purchase_3mon         1      9     1090.136     1139.098   

9  purchase_3mon         1     10     1139.098     1188.060   

 

   BinWidth  NInBin         Mean        Std      Min      Max  

0    48.962      45   735.920444  11.629722   698.44   747.30   

1    48.962    1802   783.902153  10.767204   747.72   796.36   

2    48.962   28782   830.094716  12.081154   796.37   845.32   

3    48.962  193828   875.761523  13.155664   845.33   894.28   

4    48.962  529051   921.156725  13.782269   894.29   943.24   

5    48.962  597872   966.341785  13.869978   943.25   992.21   

6    48.962  279514  1011.694702  13.360849   992.22  1041.17   

7    48.962   53719  1057.383763  12.429201  1041.18  1090.12   

8    48.962    4214  1103.745873  11.365814  1090.14  1139.07   

9    48.962     121  1151.365785  10.677088  1139.32  1188.06   

 

   MidPoint    Percent  

0   722.921   0.002664  

1   771.883   0.106694  

2   820.845   1.704138  

3   869.807  11.476256  

4   918.769  31.324292  

5   967.731  35.399077  

6  1016.693  16.549592  

7  1065.655   3.180619  

8  1114.617   0.249504  

9  1163.579   0.007164  

Each row of the BinDetails output contains the binning information for a specific bin, including the lower and upper bounds, the bin width, and per-bin summary statistics such as counts, means, standard deviations, minimums, maximums, and midpoints. You can also use the Bokeh package to create a histogram.

In [21]: p = Bar(result['BinDetails'], 'MidPoint', values='Percent',

    ...:         color='#1f77b4', agg='mean', title='', legend=None,

    ...:         xlabel = 'Purchase_3mon',

    ...:         ylabel = 'Percent'

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

image

In the preceding example, we set nicebinning to False. The nicebinning option requests rounded lower and upper bounds for the bins, which usually produces nicer cutpoints for visualization and interpretation. A setting of True for nicebinning generates the following Bokeh visualization:

image
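For reference, the call that produces the preceding figure mirrors the earlier example, with the nicebinning option turned on. A sketch:

# Same histogram request as before, but with rounded bin boundaries
result = organics.histogram(
    reqpacks=[{'nicebinning': True, 'nbins': 10}],
    inputs=['Purchase_3mon']
)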

Compared to the previous graph, the nicebinning option allows bounds only at a step size of 100, which is determined automatically by the range of the variable Purchase_3mon. Sometimes you need to increase the number of bins when you use the nicebinning option to get a better understanding of the data. In this example, let us change the number of bins from 10 to 25.

In [22]: result= organics.histogram(

    ...:     reqpacks=[{'nicebinning':True, 'nbins':25}],

    ...:     inputs=['Purchase_3mon']

    ...: )

 

In [23]: p = Bar(result['BinDetails'], 'MidPoint', values='Percent',

    ...:         color='#1f77b4', agg='mean', title='', legend=None,

    ...:         xlabel = 'Purchase_3mon',

    ...:         ylabel = 'Percent'

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

image

In this example, 25 is chosen as the step size. The histogram action also enables you to set the bins manually. For example, you can force a fixed bin width of 50. In this case, you don’t need to specify the number of bins because it is determined automatically from the bin width and the range of the variable.

In [24]: result= organics.histogram(

    ...:     reqpacks=[{'binwidth':50}],

    ...:     inputs=['Purchase_3mon']

    ...: )

 

In [25]: p = Bar(result['BinDetails'], 'MidPoint', values='Percent',

    ...:         color='#1f77b4', agg='mean', title='', legend=None,

    ...:         xlabel='Purchase_3mon',

    ...:         ylabel='Percent'

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

image

Similar to the summary action, the histogram action also supports group-by processing. You can request a histogram for each data segment that is defined by a set of group-by columns.

In [26]: organics.groupby = ['DemGender', 'DemAgeGroup']

    ...: result = organics.histogram(

    ...:     reqpacks=[{'nicebinning':False, 'nbins':20}],

    ...:     inputs=['DemAffl']

    ...: )

    ...: list(result.keys())

Out[26]:

['ByGroupInfo',

 'ByGroup1.BinDetails',

 'ByGroup2.BinDetails',

 'ByGroup3.BinDetails',

 'ByGroup4.BinDetails',

 'ByGroup5.BinDetails',

 'ByGroup6.BinDetails',

 'ByGroup7.BinDetails',

 'ByGroup8.BinDetails',

 'ByGroup9.BinDetails',

 'ByGroup10.BinDetails',

 'ByGroup11.BinDetails',

 'ByGroup12.BinDetails']

Except for the ByGroupInfo table, each result table contains the histogram statistics for one by group. The ByGroupInfo table provides an overview of the by-group levels:

In [27]: result['ByGroupInfo']

Out[27]:

ByGroupInfo

 

   DemGender DemGender_f DemAgeGroup DemAgeGroup_f     _key_

0          F           F      middle        middle   Fmiddle

1          F           F      senior        senior   Fsenior

2          F           F     unknown       unknown  Funknown

3          F           F       young         young    Fyoung

4          M           M      middle        middle   Mmiddle

5          M           M      senior        senior   Msenior

6          M           M     unknown       unknown  Munknown

7          M           M       young         young    Myoung

8          U           U      middle        middle   Umiddle

9          U           U      senior        senior   Usenior

10         U           U     unknown       unknown  Uunknown

11         U           U       young         young    Uyoung

Next, let’s collect the statistics from the output table and use Bokeh to generate a grid of histograms. The all_df dictionary contains the title of each histogram as well as its data. You must generate visualizations, one by one, using the Bokeh bar chart and then define the layout of the grid using the gridplot function from Bokeh.

In [28]: all_df = {

    ...:     'Gender=Female, AgeGroup=Middle':  result['ByGroup1.BinDetails'],

    ...:     'Gender=Female, AgeGroup=Senior':  result['ByGroup2.BinDetails'],

    ...:     'Gender=Female, AgeGroup=Unknown': result['ByGroup3.BinDetails'],

    ...:     'Gender=Female, AgeGroup=Young':   result['ByGroup4.BinDetails']

    ...: }

    ...: all_pic = []

    ...:

    ...: for this_title in all_df:

    ...:    this_pic = Bar(all_df[this_title], 'MidPoint', values='Percent',

    ...:        color='#1f77b4', agg='mean', title=this_title, legend=None,

    ...:        xlabel='DemAffl',

    ...:        ylabel='Percent'

    ...:    )

    ...:    all_pic.append(this_pic)

    ...:

    ...: from bokeh.io import gridplot

    ...:

    ...: p = gridplot([[all_pic[0], all_pic[1]],

    ...:               [all_pic[2], all_pic[3]]])

    ...: output_file('grid.html')

    ...: show(p)

 

 

image

Percentiles

Percentiles are another useful tool for numeric data exploration. A percentile is the value below which a given percentage of the observations falls. In CAS, such statistics are available in the percentile action set.

In [29]: conn.loadactionset('percentile')

    ...:

    ...: del organics.groupby

    ...: organics.percentile(inputs='DemAge')

NOTE: Added action set 'percentile'.

Out[29]:

[Percentile]

Percentiles for ORGANICS

 

   Variable  Pctl  Value  Converged

 0   DemAge  25.0   44.0        1.0

 1   DemAge  50.0   54.0        1.0

 2   DemAge  75.0   64.0        1.0

By default, the percentile action computes the 25th, the 50th, and the 75th percentiles for the selected columns.  The preceding result shows that 25% of the observations of DemAge are less than 44, and half of the observations of DemAge are less than 54.  Finally, 75% of the observations of DemAge are less than 64. These three percentiles are also known as the first quartile (Q1), the second quartile (median), and the third quartile (Q3).
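For example, you can compute the interquartile range (IQR = Q3 - Q1) from the returned table with ordinary Pandas operations. A small sketch:

# Extract Q1 and Q3 from the Percentile table and compute the IQR
pct = organics.percentile(inputs='DemAge')['Percentile']
q1 = pct.loc[pct['Pctl'] == 25, 'Value'].iloc[0]
q3 = pct.loc[pct['Pctl'] == 75, 'Value'].iloc[0]
print(q3 - q1)   # 64 - 44 = 20 for this data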

The last column of the output table indicates whether the percentile calculation has converged. The CAS percentile action does not sort the data to find the percentiles. Instead, it uses a numeric approximation algorithm to estimate the percentiles, and the last column indicates whether the algorithm has converged. Two parameters, maxiters and epsilon, control the accuracy of the approximation.
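For example, a tighter approximation can be requested by adjusting these two parameters in the action call. A sketch with illustrative values; the parameter names are as given above:

# Request a more accurate percentile approximation
# (the maxiters and epsilon values here are illustrative)
organics.percentile(inputs='DemAge', maxiters=100, epsilon=1e-6)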

You can also specify the percentage values directly using the values option. For example, you can get a list of percentiles for percentages from 5% to 90% with a step size of 5% as follows.

In [30]: result = organics.percentile(inputs='DemAge', values=list(range(5,95,5)))

    ...: result

Out[30]:

[Percentile]

 

Percentiles for ORGANICS

 

    Variable  Pctl  Value  Converged

 0    DemAge   5.0   32.0        1.0

 1    DemAge  10.0   36.0        1.0

 2    DemAge  15.0   39.0        1.0

 3    DemAge  20.0   41.0        1.0

 4    DemAge  25.0   44.0        1.0

 5    DemAge  30.0   46.0        1.0

 6    DemAge  35.0   48.0        1.0

 7    DemAge  40.0   50.0        1.0

 8    DemAge  45.0   52.0        1.0

 9    DemAge  50.0   54.0        1.0

 10   DemAge  55.0   56.0        1.0

 11   DemAge  60.0   58.0        1.0

 12   DemAge  65.0   60.0        1.0

 13   DemAge  70.0   62.0        1.0

 14   DemAge  75.0   64.0        1.0

 15   DemAge  80.0   66.0        1.0

 16   DemAge  85.0   69.0        1.0

 17   DemAge  90.0   72.0        1.0

The percentile action supports group-by processing as well:

In [31]: organics.groupby = ['DemGender']

    ...: result = organics.percentile(inputs='DemAge', values=list(range(5,95,5)))

In this case, we can use the Bokeh package to visually compare the distributions of DemAge across three different levels of the DemGender column.

In [32]: df = result.concat_bygroups()['Percentile']

    ...: df.reset_index(level=0, inplace=True)

 

In [33]: p = Scatter(df, x='Pctl', y='Value', legend='top_center', marker='DemGender')

    ...: output_file('scatter.html')

    ...: show(p)

image

Correlations

The simple action set also provides a correlation action that generates a matrix of Pearson product-moment correlation coefficients for a set of variables.

In [34]: organics = conn.CASTable('ORGANICS')

    ...: organics.correlation()

Out[34]:

[CorrSimple]

 

 Summary Statistics in Correlation Analysis for ORGANICS

 

          Variable          N         Mean           Sum      StdDev  

 0         DemAffl  1606488.0     8.711893  1.399555e+07    3.421045   

 1          DemAge  1574340.0    53.797152  8.469501e+07   13.205734   

 2        PromTime  1667592.0     6.564670  1.094719e+07    4.657008   

 3   purchase_3mon  1688948.0   950.027539  1.604547e+09   50.067179   

 4   purchase_6mon  1688948.0  2049.979250  3.462308e+09   70.731010   

 5   purchase_9mon  1688948.0  3070.016785  5.185099e+09   86.588115   

 6  purchase_12mon  1688948.0  4189.994798  7.076683e+09  100.009042   

 

    Minimum  Maximum  

 0     0.00    34.00  

 1    18.00    79.00  

 2     0.00    39.00  

 3   698.44  1188.06  

 4  1668.77  2370.87  

 5  2624.09  3468.72  

 6  3698.92  4684.88  

 

[Correlation]

 

 Pearson Correlation Coefficients for ORGANICS

 

          Variable   DemAffl    DemAge  PromTime  purchase_3mon  

 0         DemAffl  1.000000 -0.137767 -0.029736       0.001571   

 1          DemAge -0.137767  1.000000  0.207221      -0.001160   

 2        PromTime -0.029736  0.207221  1.000000      -0.000304   

 3   purchase_3mon  0.001571 -0.001160 -0.000304       1.000000   

 4   purchase_6mon  0.001686 -0.001008 -0.000564       0.707469   

 5   purchase_9mon  0.001459 -0.000541 -0.001720       0.578367   

 6  purchase_12mon  0.000965 -0.000587 -0.002255       0.500961   

 

    purchase_6mon  purchase_9mon  purchase_12mon  

 0       0.001686       0.001459        0.000965  

 1      -0.001008      -0.000541       -0.000587  

 2      -0.000564      -0.001720       -0.002255  

 3       0.707469       0.578367        0.500961  

 4       1.000000       0.816600        0.707387  

 5       0.816600       1.000000        0.866058  

 6       0.707387       0.866058        1.000000  

By default, the correlation action uses all of the numeric columns and also computes univariate statistics for each column as a by-product of computing the Pearson correlation coefficients. You can set the simple parameter to False to suppress the univariate statistics. To select a set of columns instead of all numeric columns, set the inputs parameter of the correlation action.

In [35]: varlist = ['DemAffl', 'DemAge', 'purchase_3mon']

    ...: organics.correlation(inputs=varlist, simple=False)

Out[35]:

[Correlation]

 

 Pearson Correlation Coefficients for ORGANICS

 

         Variable   DemAffl    DemAge  purchase_3mon

 0        DemAffl  1.000000 -0.137767       0.001571

 1         DemAge -0.137767  1.000000      -0.001160

 2  purchase_3mon  0.001571 -0.001160       1.000000

Summarizing Categorical Variables

In the previous section, we examined some useful actions in the simple, dataPreprocess, and percentile action sets of CAS for summarizing and exploring continuous variables. The simple action set also provides several actions for categorical data analysis: distinct, freq, topk, and crosstab.

Distinct Counts

Before you work on any categorical data, it is useful to first check the cardinality of the data using the distinct action.

In [1]: organics = conn.CASTable('ORGANICS')

 

In [2]: organics.distinct()

Out[2]:

[Distinct]

 

 Distinct Counts for ORGANICS

 

                Column  NDistinct     NMiss  Trunc

 0                  ID  1688948.0       0.0    0.0

 1             DemAffl       34.0   82460.0    0.0

 2              DemAge       63.0  114608.0    0.0

 3           DemGender        3.0       0.0    0.0

 4        DemHomeowner        2.0       0.0    0.0

 5         DemAgeGroup        4.0       0.0    0.0

 6          DemCluster       56.0       0.0    0.0

 7              DemReg        6.0       0.0    0.0

 8            DemTVReg       14.0       0.0    0.0

 9            DemFlag1        2.0       0.0    0.0

 10           DemFlag2        2.0       0.0    0.0

 11           DemFlag3        2.0       0.0    0.0

 12           DemFlag4        2.0       0.0    0.0

 13           DemFlag5        2.0       0.0    0.0

 14           DemFlag6        2.0       0.0    0.0

 15           DemFlag7        2.0       0.0    0.0

 16           DemFlag8        2.0       0.0    0.0

 17          PromClass        4.0       0.0    0.0

 18           PromTime       40.0   21356.0    0.0

 19          TargetBuy        2.0       0.0    0.0

 20   Bought_Beverages        2.0       0.0    0.0

 21      Bought_Bakery        2.0       0.0    0.0

 22      Bought_Canned        2.0       0.0    0.0

 23       Bought_Dairy        2.0       0.0    0.0

 24      Bought_Baking        2.0       0.0    0.0

 25      Bought_Frozen        2.0       0.0    0.0

 26        Bought_Meat        2.0       0.0    0.0

 27      Bought_Fruits        2.0       0.0    0.0

 28  Bought_Vegetables        2.0       0.0    0.0

 29    Bought_Cleaners        2.0       0.0    0.0

 30  Bought_PaperGoods        2.0       0.0    0.0

 31      Bought_Others        2.0       0.0    0.0

 32      purchase_3mon    32944.0       0.0    0.0

 33      purchase_6mon    44997.0       0.0    0.0

 34      purchase_9mon    54032.0       0.0    0.0

 35     purchase_12mon    61444.0       0.0    0.0

By default, the distinct action calculates two types of information for each column in the table: 1) the number of distinct values, and 2) the number of missing values, for both character and numeric columns. Computing the exact distinct counts for numeric columns and unique identifiers (such as the ID column in the Organics table) can be slow and can require a large memory footprint, but the exact counts for such columns are often not useful: usually we need to know only whether a variable's number of distinct levels exceeds a certain threshold. The distinct action provides the maxnvals option to skip the exact counting of distinct levels for high-cardinality variables.

In [3]: organics.distinct(maxnvals=500)

Out[3]:

[Distinct]

 

 Distinct Counts for ORGANICS

 

                Column  NDistinct     NMiss  Trunc

 0                  ID      500.0       0.0    1.0

 1             DemAffl       34.0   82460.0    0.0

 2              DemAge       63.0  114608.0    0.0

 3           DemGender        3.0       0.0    0.0

 4        DemHomeowner        2.0       0.0    0.0

 5         DemAgeGroup        4.0       0.0    0.0

 6          DemCluster       56.0       0.0    0.0

 7              DemReg        6.0       0.0    0.0

 8            DemTVReg       14.0       0.0    0.0

 9            DemFlag1        2.0       0.0    0.0

 10           DemFlag2        2.0       0.0    0.0

 11           DemFlag3        2.0       0.0    0.0

 12           DemFlag4        2.0       0.0    0.0

 13           DemFlag5        2.0       0.0    0.0

 14           DemFlag6        2.0       0.0    0.0

 15           DemFlag7        2.0       0.0    0.0

 16           DemFlag8        2.0       0.0    0.0

 17          PromClass        4.0       0.0    0.0

 18           PromTime       40.0   21356.0    0.0

 19          TargetBuy        2.0       0.0    0.0

 20   Bought_Beverages        2.0       0.0    0.0

 21      Bought_Bakery        2.0       0.0    0.0

 22      Bought_Canned        2.0       0.0    0.0

 23       Bought_Dairy        2.0       0.0    0.0

 24      Bought_Baking        2.0       0.0    0.0

 25      Bought_Frozen        2.0       0.0    0.0

 26        Bought_Meat        2.0       0.0    0.0

 27      Bought_Fruits        2.0       0.0    0.0

 28  Bought_Vegetables        2.0       0.0    0.0

 29    Bought_Cleaners        2.0       0.0    0.0

 30  Bought_PaperGoods        2.0       0.0    0.0

 31      Bought_Others        2.0       0.0    0.0

 32      purchase_3mon      500.0       0.0    1.0

 33      purchase_6mon      500.0       0.0    1.0

 34      purchase_9mon      500.0       0.0    1.0

 35     purchase_12mon      500.0       0.0    1.0

In this example, the distinct action loops through the data and stops counting distinct values for the columns ID, purchase_3mon, purchase_6mon, purchase_9mon, and purchase_12mon as soon as their running distinct counts reach 500. Therefore, the number of distinct values for these columns is shown as 500 in the output table. The Trunc column is a binary indicator of whether a distinct count has been truncated (that is, whether it is not exact).

Note that in this data, the categorical variable with the highest cardinality is DemCluster, which has 56 distinct levels. Missing-value counts are another useful piece of information in the distinct action output. In this example, three columns (DemAffl, DemAge, and PromTime) have a significant number of missing values.
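Because the Distinct table is a DataFrame, you can filter it with ordinary Pandas expressions, for example, to list only the columns that contain missing values. A small sketch:

# List the columns of the Distinct output that have missing values
dist = organics.distinct()['Distinct']
print(dist[dist['NMiss'] > 0])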

In the preceding examples, we requested that the output be downloaded to the Python client. You can also store the output of the distinct action in a table on the CAS server. You can then use the head method to download results from the output table.

In [4]: result = conn.CASTable('distinctOutput', replace=True)

   ...: organics.distinct(maxnvals=500, casout=result)

   ...: result.head()

Out[4]:

Selected Rows from Table CAS.DISTINCTOUTPUT

 

       _Column_  _NDis_   _NMiss_  _Truncated_

0            ID     0.0       0.0          1.0

1       DemAffl    34.0   82460.0          0.0

2        DemAge    63.0  114608.0          0.0

3     DemGender     3.0       0.0          0.0

4  DemHomeowner     2.0       0.0          0.0

The columninfo action generates basic information about each column, such as whether it is numeric (double, int32, or int64) or character (char or varchar). The distinct action further computes the cardinality of the columns. It is common to combine these two pieces of information to determine whether a column should be used as a categorical variable or a continuous variable. The following example merges the output tables from the columninfo and distinct actions using the Pandas merge function. Then it selects character columns with no more than 128 levels and numeric columns with no more than 16 levels as categorical variables. Also, it selects numeric columns with more than 16 levels as continuous variables. All other columns are dropped into the Others group.

In [5]: out1 = organics.columninfo()['ColumnInfo']

   ...: out2 = organics.distinct(maxnvals=1000)['Distinct']

   ...: out3 = pd.merge(out1, out2, left_on='Column', right_on='Column',

   ...:                 how='left')

   ...:

   ...: varlist = { 'cats': [], 'conts': [], 'others' : [] }

   ...:

   ...: for index, row in out3.iterrows():

   ...:     varname = row['Column'].lower()

   ...:     vartype = row['Type'].lower()

   ...:     if vartype == 'char' and row['NDistinct'] <= 128:

   ...:         varlist['cats'].append(varname)

   ...:     elif vartype == 'double' and row['NDistinct'] <= 16:

   ...:         varlist['cats'].append(varname)

   ...:     elif vartype == 'double' and row['NDistinct'] > 16:

   ...:         varlist['conts'].append(varname)

   ...:     else:

   ...:         varlist['others'].append(varname)

   ...:

   ...: print(varlist)

{'conts': ['demaffl', 'demage', 'promtime', 'purchase_3mon', 'purchase_6mon', 'purchase_9mon', 'purchase_12mon'],

 'cats': ['demgender', 'demhomeowner', 'demagegroup', 'demcluster', 'demreg', 'demtvreg', 'demflag1', 'demflag2', 'demflag3', 'demflag4', 'demflag5', 'demflag6', 'demflag7', 'demflag8', 'promclass', 'targetbuy', 'bought_beverages', 'bought_bakery', 'bought_canned', 'bought_dairy', 'bought_baking', 'bought_frozen', 'bought_meat', 'bought_fruits', 'bought_vegetables', 'bought_cleaners', 'bought_papergoods', 'bought_others'],

 'others': ['id']}

Frequency

The distinct action computes only the number of distinct levels of a categorical variable. To get the frequency distribution of a categorical variable, you must use the freq action in the simple action set.

In [6]: varlist=['TargetBuy']

   ...: organics.freq(inputs=varlist)

Out[6]:

[Frequency]

 

 Frequency for ORGANICS

 

       Column CharVar  FmtVar  Level  Frequency

 0  TargetBuy  Bought  Bought      1   418380.0

 1  TargetBuy      No      No      2  1270568.0

Note that this method is equivalent to using column indexing first and then calling the freq action:

In [7]: organics[['TargetBuy']].freq()

The output table of the freq action contains one row for each level of the categorical variable, with the following columns:

Column Name  Column Label       Description
Column       Analysis Variable  Column name of the variable in the input data set
CharVar      Character Value    Unformatted value of the variable
FmtVar       Formatted Value    Formatted value of the variable (SAS formats)
Level        Level              Index of the distinct variable value
Frequency    Frequency          Frequency count

Next, let’s visualize the result from the freq action. Similar to the data explorations in the previous section, a Bokeh bar chart is used.

In [8]: from bokeh.charts import Bar, output_file, show

   ...:

   ...: df = organics['TargetBuy'].freq()

   ...:

   ...: p = Bar(df['Frequency'], 'FmtVar', values='Frequency',

   ...:         color='#1f77b4', agg='mean', title='', legend=None,

   ...:         xlabel='TargetBuy',

   ...:         ylabel='Frequency'

   ...: )

   ...: output_file('bar.html')

   ...: show(p)

image

You can request multiple frequency tables in a single freq call by adding more variables to the variable list of the CASTable object.

In [9]: organics[['TargetBuy','DemAgeGroup','DemHomeowner']].freq()

Out[9]:

[Frequency]

 

 Frequency for ORGANICS

 

          Column  CharVar   FmtVar  Level  Frequency

 0     TargetBuy   Bought   Bought      1   418380.0

 1     TargetBuy       No       No      2  1270568.0

 2   DemAgeGroup   middle   middle      1   991800.0

 3   DemAgeGroup   senior   senior      2   532456.0

 4   DemAgeGroup  unknown  unknown      3   114608.0

 5   DemAgeGroup    young    young      4    50084.0

 6  DemHomeowner       No       No      1  1097860.0

 7  DemHomeowner      Yes      Yes      2   591088.0

It is valid to use the freq action on numeric variables as well. The next example uses the freq action to generate a frequency table of the DemAge variable. In this example, the freq action is equivalent to a histogram action with integer binning and a step size set to 1.
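For comparison, the corresponding histogram request would fix the bin width at 1, using the binwidth option demonstrated earlier. A sketch:

# A roughly equivalent view of DemAge: integer bins of width 1
result = organics.histogram(reqpacks=[{'binwidth': 1}], inputs=['DemAge'])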

In [10]: df = organics['DemAge'].freq(includemissing=False)

    ...:

    ...: p = Bar(df['Frequency'], 'FmtVar', values='Frequency',

    ...:         color='#1f77b4', agg='mean', title='', legend=None,

    ...:         xlabel='Age',

    ...:         ylabel='Frequency',

    ...:         bar_width=1,

    ...:         plot_width=1200, plot_height=600

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

image

You can see that the age of the customers in this data set has a mode of 51, and the distribution is skewed to the left. The minimum age is 18, and the maximum age is 79. From the distinct action, we also know that this variable has many missing values, so it is helpful to enable the includemissing flag to display a bar for the missing values on the chart.

In [11]: df = organics[['DemAge']].freq(includemissing=True)

    ...:

    ...: p = Bar(df['Frequency'], 'FmtVar', values='Frequency',

    ...:         color='#1f77b4', agg='mean', title='',

    ...:         xlabel='Age',

    ...:         ylabel='Frequency',

    ...:         bar_width=1,

    ...:         plot_width=1200, plot_height=600

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

 

image

Top K

The freq action returns a complete one-way frequency table. For high-cardinality categorical variables, computing the entire frequency table might be time-consuming. In such cases, the simple action set provides the topk action to generate partial frequency tables.

In [12]: organics['purchase_12mon'].topk(topk=5, bottomk=0)

Out[12]:

[Topk]

 

 Top and Bottom Distinct Values for ORGANICS

 

            Column        FmtVar  Rank

 0  purchase_12mon       4684.88     1

 1  purchase_12mon       4653.49     2

 2  purchase_12mon       4646.95     3

 3  purchase_12mon       4641.13     4

 4  purchase_12mon       4631.91     5

 

[TopkMisc]

 

 Miscellaneous Information on Distinct Values for ORGANICS

 

            Column      N  TruncatedTopk  TruncatedBtmk  ScoreOther

 0  purchase_12mon  61444              0              0         NaN

Unlike the freq action, the topk action produces two output tables instead of one. The first table contains the top or bottom values of the input variables, and the second table contains a brief report for each input variable. You can loop through the CASResults object to obtain the name of each table and its columns.

In [13]: result = organics['purchase_12mon'].topK(topk=5, bottomk=0)

    ...: for df in result:

    ...:     print(df + ' table has: ' +

    ...:           ', '.join(result[df].columns.values))

 

Topk table has: Column, NumVar, Rank

TopkMisc table has: Column, N, TruncatedTopk, TruncatedBtmk, ScoreOther

Similar to the freq action, the topk action can be used to work on more than one variable and to compute both the bottom and top values of the variables. Simply specify the preferred variables when you subset the CASTable object, and specify the topk and bottomk parameters.

In [14]: organics[['purchase_12mon','DemAge']].topK(topk=5, bottomk=5)

Out[14]:

[Topk]

 

 Top and Bottom Distinct Values for ORGANICS

 

             Column        FmtVar   Rank

 0   purchase_12mon       4684.88      1

 1   purchase_12mon       4653.49      2

 2   purchase_12mon       4646.95      3

 3   purchase_12mon       4641.13      4

 4   purchase_12mon       4631.91      5

 5   purchase_12mon          3826  61444

 6   purchase_12mon          3833  61443

 7   purchase_12mon          3849  61442

 8   purchase_12mon          3850  61441

 9   purchase_12mon          3851  61440

 10          DemAge            79      1

 11          DemAge            78      2

 12          DemAge            77      3

 13          DemAge            76      4

 14          DemAge            75      5

 15          DemAge             .     63

 16          DemAge            18     62

 17          DemAge            19     61

 18          DemAge            20     60

 19          DemAge            21     59

 

[TopkMisc]

 

 Miscellaneous Information on Distinct Values for ORGANICS

 

            Column      N  TruncatedTopk  TruncatedBtmk  ScoreOther

 0  purchase_12mon  61444              0              0         NaN

 1          DemAge     63              0              0         NaN

The smallest reported value of DemAge is the missing value (formatted as '.'). By default, the includemissing flag for the topk action is enabled, and missing values are treated as the lowest machine value, which is why the missing value is reported as the bottom value of DemAge in the preceding example. To exclude missing values from the report, set includemissing to False.
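A sketch of the same request with missing values excluded:

# Exclude missing values from the bottom-k report
organics['DemAge'].topk(topk=5, bottomk=5, includemissing=False)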

The topk action not only reports the smallest and largest values of a variable; it can also report the levels of a categorical variable that have the largest or smallest values with respect to another score variable. In the following example, the topk action returns the TV regions with the highest and lowest average customer affluence scores:

In [15]: organics['DemTVReg'].topk(topk=3, bottomk=3,

    ...:                               weight='DemAffl', agg='mean')

Out[15]:

[Topk]

 

 Top and Bottom Distinct Values for ORGANICS

 

      Column     FmtVar  Rank     Score

 0  DemTVReg    Unknown     1  8.968397

 1  DemTVReg       East     2  8.921506

 2  DemTVReg  Yorkshire     3  8.752555

 3  DemTVReg     Ulster    14  8.493976

 4  DemTVReg     N East    13  8.512684

 5  DemTVReg     N West    12  8.533601

 

[TopkMisc]

 

 Miscellaneous Information on Distinct Values for ORGANICS

 

      Column   N  TruncatedTopk  TruncatedBtmk  ScoreOther

 0  DemTVReg  14              0              0    8.716037

The first result table shows that Unknown, East, and Yorkshire are the top three TV regions, with the highest average affluence scores (8.97, 8.92, and 8.75). Ulster, N East, and N West are the bottom three TV regions, with the lowest average affluence scores (8.49, 8.51, and 8.53). The second table indicates that there are 14 TV regions in this data set and that the average affluence score for the remaining TV regions (the ScoreOther column) is 8.72.

You can also apply aggregators other than sample average in the topk action. For example, let’s get the TV regions with the highest and lowest total sales amounts.

In [16]: organics['DemTVReg'].topK(topk=3, bottomk=3,

    ...:                               weight='purchase_3mon', agg='sum')

Out[16]:

[Topk]

 

 Top and Bottom Distinct Values for ORGANICS

 

      Column      FmtVar  Rank         Score

 0  DemTVReg      London     1  4.468443e+08

 1  DemTVReg    Midlands     2  2.255111e+08

 2  DemTVReg  S & S East     3  1.765412e+08

 3  DemTVReg      Border    14  1.465313e+07

 4  DemTVReg      Ulster    13  1.919992e+07

 5  DemTVReg      N Scot    12  2.374323e+07

 

[TopkMisc]

 

 Miscellaneous Information on Distinct Values for ORGANICS

 

      Column   N  TruncatedTopk  TruncatedBtmk    ScoreOther

 0  DemTVReg  14              0              0  6.980542e+08

The aggregators that are supported by the topk action are max, mean, min, n, and sum.

Cross Tabulations

Crosstab is another useful action from the simple action set for categorical data exploration. This action requires a row variable and a col variable to generate a two-way frequency table.

In [17]: result = organics.crosstab(row='DemAgeGroup', col='DemGender')

    ...: result

Out[17]:

[Crosstab]

 

   DemAgeGroup      Col1      Col2      Col3

 0      middle  550240.0  258172.0  183388.0

 1      senior  278312.0  143412.0  110732.0

 2     unknown   61864.0   30172.0   22572.0

 3       young   32908.0   10184.0    6992.0

The crosstab output table has a DemAgeGroup column that indicates the levels of the row variable (DemAgeGroup) in the two-way frequency table. However, the levels of the column variable (DemGender) are not displayed in the output table, because the crosstab action stores them in the labels of the output columns instead of in their names:

In [18]: result['Crosstab'].colinfo

    ...:

Out[18]:

{'Col1': SASColumnSpec(attrs=dict(Template='Col'), dtype='double', label='F', name='Col1', size=(1, 1), width=8),

'Col2': SASColumnSpec(attrs=dict(Template='Col'), dtype='double', label='M', name='Col2', size=(1, 1), width=8),

'Col3': SASColumnSpec(attrs=dict(Template='Col'), dtype='double', label='U', name='Col3', size=(1, 1), width=8),

'DemAgeGroup': SASColumnSpec(attrs=dict(Template='Row'), dtype='varchar', name='DemAgeGroup', size=(1, 1), width=7)}

In Chapter 3, we explained that the output tables of CAS actions are extensions of the DataFrames that are defined by the Pandas package. Such extensions allow the output table to contain extra metadata that is contained in typical SAS data sets. Examples of metadata include the column label, the column format, and the data type. Such information is stored in the colinfo attribute of the output DataFrame.

In the preceding crosstab example, the levels of the column variable are stored in the output column labels instead of in the column names, mainly because, in some situations, the levels might contain special characters that are not allowed in column names. You can retrieve the levels from the colinfo dictionary, and you can use the rename function to display the complete two-way frequency table:

In [19]: df = result['Crosstab']

    ...: labels = {k: v.label for k, v in df.colinfo.items() if v.label}

    ...: df = df.rename(columns = labels)

    ...: df

Out[19]:

  DemAgeGroup         F         M         U

0      middle  550240.0  258172.0  183388.0

1      senior  278312.0  143412.0  110732.0

2     unknown   61864.0   30172.0   22572.0

3       young   32908.0   10184.0    6992.0

Similar to the SAS FREQ procedure, you can request chi-square tests of homogeneity and measures of association.

In [20]: organics.crosstab(row='DemAgeGroup', col='DemGender',

                        association=True, chisq=True)

Out[20]:

[Crosstab]

 

   DemAgeGroup      Col1      Col2      Col3

 0      middle  550240.0  258172.0  183388.0

 1      senior  278312.0  143412.0  110732.0

 2     unknown   61864.0   30172.0   22572.0

 3       young   32908.0   10184.0    6992.0

 

[Association]

 

                             Statistic     Value       ASE   LowerCL  

 0                               Gamma  0.022513  0.001227  0.020107   

 1                     Kendall's Tau-B  0.012913  0.000706  0.011529   

 2                      Stuart's Tau-c  0.011092  0.000606  0.009904   

 3                                           NaN       NaN       NaN   

 4                       Somers' D C|R  0.013438  0.000735  0.011998   

 5                       Somers' D R|C  0.012408  0.000678  0.011079   

 6                                           NaN       NaN       NaN   

 7               Lambda Asymmetric C|R  0.000000  0.000000  0.000000   

 8               Lambda Asymmetric R|C  0.000000  0.000000  0.000000   

 9                    Lambda Symmetric  0.000000  0.000000  0.000000   

 10                                          NaN       NaN       NaN   

 11        Uncertainty Coefficient C|R  0.001286  0.000039  0.001210   

 12        Uncertainty Coefficient R|C  0.001332  0.000040  0.001253   

 13  Uncertainty Coefficient Symmetric  0.001309  0.000039  0.001231   

 

      UpperCL  

 0   0.024919  

 1   0.014296  

 2   0.012280  

 3        NaN  

 4   0.014877  

 5   0.013737  

 6        NaN  

 7   0.000000  

 8   0.000000  

 9   0.000000  

 10       NaN  

 11  0.001362  

 12  0.001411  

 13  0.001386  

 

[ChiSq]

 

                      Statistic   DF        Value  Prob

 0                   Chi-Square  6.0  4284.485198   0.0

 1  Likelihood Ratio Chi-Square  6.0  4334.499685   0.0

A two-way frequency table is usually displayed as a grouped or stacked bar chart. We can use the Bokeh package to visualize the output from the crosstab action.

In [21]: from bokeh.charts import Bar, output_file, show

    ...: from bokeh.charts.operations import blend

    ...: from bokeh.charts.attributes import cat, color

    ...:

    ...: result1 = organics.crosstab(row='DemGender', col='DemAgeGroup')

    ...: # rename output columns

    ...: df1 = result1['Crosstab']

    ...: df1.columns = ['DemGender','middle','senior','unknown','young']

    ...:

    ...: bar = Bar(df1,

    ...:            values=blend('middle','senior','unknown','young',

    ...:                         name='counts', labels_name='AgeGroup'),

    ...:           label='DemGender',

    ...:           group='AgeGroup',

    ...:           xlabel='Gender',

    ...:           ylabel='Frequency',

    ...:           legend='top_right')

    ...:

    ...: output_file('bar.html')

    ...: show(bar)

image

Note that we use the blend function from Bokeh to transform a Pandas DataFrame from a wide format to a long format. Likewise, you can use the Pandas melt function to reshape the data before you call the Bokeh package. You can also replace the group option with the stack option to generate a stacked bar chart instead of a grouped bar chart.
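For reference, a sketch of the same wide-to-long reshaping with the Pandas melt function; the resulting long-format DataFrame is the shape that the blend call produces for the chart:

# Reshape the renamed crosstab output from wide to long format
long_df = pd.melt(df1, id_vars='DemGender',
                  var_name='AgeGroup', value_name='counts')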

The crosstab action can do more than frequency counts. Similar to the topk action, the crosstab action can be used to specify an aggregation variable for each cell in the two-way table and to specify the type of aggregation. The following example shows how you can compute the total recent three-month sales amounts for the same two-way frequency table that is shown in the preceding example:

In [22]: result = organics.crosstab(row='DemAgeGroup', col='DemGender',

                                    weight='purchase_3mon', aggregators='sum')

    ...: df = result['Crosstab']

    ...: labels = {k: v.label for k, v in df.colinfo.items() if v.label}

    ...: df = df.rename(columns = labels)

    ...: df

Out[22]:

  DemAgeGroup             F             M             U

0      middle  5.227447e+08  2.452811e+08  1.742326e+08

1      senior  2.643612e+08  1.362322e+08  1.052163e+08

2     unknown  5.878180e+07  2.866586e+07  2.143959e+07

3       young  3.127620e+07  9.675101e+06  6.640414e+06

The results show that the female customers within the middle age group make the largest contribution in terms of purchase amount. Another difference between the crosstab action and the SAS FREQ procedure is that the crosstab action supports binning of numeric variables.

In [23]: result = organics.crosstab(row='DemAgeGroup',

                      col='purchase_3mon', colnbins=4, chisq=True)

    ...: df = result['Crosstab']

    ...: labels = {k: v.label for k, v in df.colinfo.items() if v.label}

    ...: df = df.rename(columns = labels)

    ...: df

Out[23]:

  DemAgeGroup  ( 500,  700]  ( 700,  900]  ( 900, 1100]  

0      middle           1.0      157358.0      833045.0   

1      senior           0.0       84901.0      446820.0   

2     unknown           0.0       18225.0       96230.0   

3       young           0.0        7835.0       42188.0   

 

   (1100, 1300]  

0        1396.0  

1         735.0  

2         153.0  

3          61.0  

 

In [24]: result['ChiSq']

Out[24]:

                     Statistic   DF     Value      Prob

0                   Chi-Square  9.0  6.223867  0.717321

1  Likelihood Ratio Chi-Square  9.0  6.641251  0.674414

In this case, the row variable DemAgeGroup is a categorical variable, and the column variable purchase_3mon is numeric. The crosstab action uses equal-space binning to discretize purchase_3mon into four buckets before it computes the two-way frequency table. The p-values of the chi-square tests show no significant association between these two variables.

Similar to the SAS FREQ procedure, the crosstab action also supports frequency tables with more than two dimensions. You must put the additional dimensions into the group-by list. The next example returns a three-way frequency table as a collection of six different two-way frequency tables, one for each level of DemReg. For each table, you can also compute the chi-square test of homogeneity.

In [25]: organics.groupby='DemReg'

    ...: result2 = organics.crosstab(row='DemAgeGroup',

    ...:                              col='DemGender', chisq=True)

    ...: for table_name in result2:

    ...:     df = result2[table_name]

    ...:     if 'ChiSq' in table_name:

    ...:        print(df[df['Statistic']=='Chi-Square'])

    ...:

 

           Statistic   DF        Value  Prob

DemReg                                      

Midlands  Chi-Square  6.0  2256.130597   0.0

         Statistic   DF        Value           Prob

DemReg                                             

North   Chi-Square  6.0  1096.161298  1.412025e-233

           Statistic   DF       Value          Prob

DemReg                                             

Scottish  Chi-Square  6.0  146.563561  4.120410e-29

             Statistic   DF        Value  Prob

DemReg                                        

South East  Chi-Square  6.0  2959.658738   0.0

             Statistic   DF        Value           Prob

DemReg                                                 

South West  Chi-Square  6.0  1137.124002  1.935146e-242

          Statistic   DF       Value           Prob

DemReg                                             

Unknown  Chi-Square  6.0  913.565675  4.385563e-194

Variable Transformation and Dimension Reduction

Once you have developed some understanding of a data set using the data exploration actions that were introduced in the previous two sections, it is often useful to transform some data columns because of missingness or nonlinearity. The datapreprocess action set provides a set of actions for variable imputation, variable binning, and dimension reduction.

In [1]: conn.loadactionset('datapreprocess')

   ...: conn.help(actionset='dataPreprocess')

Out[1]:

[dataPreprocess]

         Name  Description

0     rustats  Computes robust univariate statistics, centralized
               moments, and quantiles

1      impute  Performs data matrix (variable) imputation

2     outlier  Performs outlier detection and treatment

3     binning  Categorizes continuous variables into bins

4  discretize  Performs supervised and unsupervised variable
               discretization

5    catTrans  Groups and encodes categorical variables using
               unsupervised and supervised grouping techniques

6   histogram  Generates histogram bins and simple bin-based statistics
               for numeric variables

7   transform  Performs pipelined variable imputation, outlier detection
               and treatment, functional transformation, binning, and
               robust univariate statistics to evaluate the quality of
               the transformation

Variable Binning

We have shown the histogram action, which performs equal-space binning on one or more columns and returns lower bounds, upper bounds, and frequencies of the bins for visualization purposes. The histogram action does not actually bin the data or store the binned results; that is accomplished by the binning action from the datapreprocess action set.

In [2]: out_data = conn.CASTable('binnedData', replace=True)

   ...:

   ...: result = organics.binning(

   ...:     inputs='purchase_3mon',

   ...:     tech='bucket',

   ...:     casout=out_data,

   ...:     nBinsArray=10

   ...: )

   ...: out_data.head(10)

Out[2]:

Selected Rows from Table BINNEDDATA

 

   BIN_purchase_3mon

0                5.0

1                7.0

2                6.0

3                5.0

4                5.0

5                4.0

6                5.0

7                5.0

8                7.0

9                6.0

In this example, we perform bucket binning (also known as histogram binning or equal-space binning) on the purchase_3mon column with the number of bins set to 10. You must also specify an output CAS table to contain the binned result; in most CAS actions, the option for specifying an output table is casout. In the preceding example, the result table contains only one column, the bin assignment (BIN_purchase_3mon).

The result object is a CASResults object that is downloaded to the Python client, and binnedData is actually a CAS table that is stored on the CAS server. If the CAS server is remote, the CAS table is not downloaded to the Python client. You can always print out some observations from the casout table using the head method or the fetch action.
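For example, the fetch action provides the same kind of preview as the head method. A sketch using the to parameter to limit the number of rows that are returned:

# Download the first five rows of the binned output table
out_data.fetch(to=5)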

Unlike the histogram action, the binning action produces more than one result table, including VarTransInfo for information about the variable transformations, BinDetails for the actual binning information, and OutputCasTables for the casout table. The last table contains information about the CAS table such as the table name, the library name, the number of observations, and the number of columns.

In [3]: result

Out[3]:

[VarTransInfo]

 

 Variable Discretization Information for ORGANICS

 

         Variable          ResultVar  NBins

 0  purchase_3mon  BIN_purchase_3mon   10.0

 

[BinDetails]

 

 Bin Details for ORGANICS

 

         Variable  BinId  BinLowerBnd  BinUpperBnd  BinWidth  NInBin  

 0  purchase_3mon      1      698.440      747.402    48.962      45   

 1  purchase_3mon      2      747.402      796.364    48.962    1802   

 2  purchase_3mon      3      796.364      845.326    48.962   28782   

 3  purchase_3mon      4      845.326      894.288    48.962  193828   

 4  purchase_3mon      5      894.288      943.250    48.962  529051   

 5  purchase_3mon      6      943.250      992.212    48.962  597872   

 6  purchase_3mon      7      992.212     1041.174    48.962  279514   

 7  purchase_3mon      8     1041.174     1090.136    48.962   53719   

 8  purchase_3mon      9     1090.136     1139.098    48.962    4214   

 9  purchase_3mon     10     1139.098     1188.060    48.962     121   

 

           Mean        Std      Min      Max  

 0   735.920444  11.629722   698.44   747.30  

 1   783.902153  10.767204   747.72   796.36  

 2   830.094716  12.081154   796.37   845.32  

 3   875.761523  13.155664   845.33   894.28  

 4   921.156725  13.782269   894.29   943.24  

 5   966.341785  13.869978   943.25   992.21  

 6  1011.694702  13.360849   992.22  1041.17  

 7  1057.383763  12.429201  1041.18  1090.12  

 8  1103.745873  11.365814  1090.14  1139.07  

 9  1151.365785  10.677088  1139.32  1188.06  

 

[OutputCasTables]

 

                 casLib            Name     Rows  Columns  

 0  CASUSERHDFS(username)  cas.binnedData  1688948        1   

 

                                   casTable  

 0  CASTable('cas.binnedData', caslib='C...  
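The BinDetails table also confirms the arithmetic behind bucket binning: the bin width is the variable's range divided by the number of bins. A minimal sketch of that arithmetic, using the min and max of purchase_3mon from the earlier summary output:

# bucket binning divides [min, max] into equal-width intervals
vmin, vmax, nbins = 698.44, 1188.06, 10
width = (vmax - vmin) / nbins             # 48.962, matching BinDetails
# a value such as 950 falls in bin 6: floor((950 - vmin) / width) + 1
bin_id = int((950.0 - vmin) / width) + 1  # 6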

You can specify multiple variables in the binning action, and you can use a different number of bins for each column as well.

In [4]: result = organics.binning(

   ...:     inputs=['purchase_3mon', 'purchase_6mon',

   ...:             'purchase_9mon', 'purchase_12mon'],

   ...:     tech='bucket',

   ...:     casout=out_data,

   ...:     nbinsarray=[4, 10, 20, 6]

   ...: )

   ...: out_data.head(10)

Out[4]:

Selected Rows from Table CAS.BINNEDDATA

 

   BIN_purchase_12mon  BIN_purchase_3mon  BIN_purchase_6mon  

0                 3.0                2.0                6.0   

1                 5.0                3.0                7.0   

2                 3.0                3.0                6.0   

3                 3.0                2.0                6.0   

4                 3.0                2.0                4.0   

5                 3.0                2.0                5.0   

6                 3.0                2.0                5.0   

7                 3.0                2.0                5.0   

8                 4.0                3.0                7.0   

9                 4.0                3.0                6.0   

 

   BIN_purchase_9mon  

0                9.0  

1               14.0  

2               11.0  

3               10.0  

4               10.0  

5                9.0  

6                8.0  

7               10.0  

8               12.0  

9               12.0  

The binning action supports three binning techniques:

Technique Description
Bucket    Equal-width binning. Each bin has the same width.
Cutpts    Creates bins according to user-specified cutpoints.
Quantile  Equal-frequency binning. Each bin has approximately the same number of observations.

The following example shows how to perform a quantile binning of the four purchase amount columns. It also uses the copyallvars flag to copy all columns from the input table to the casout table.

In [5]: out_data2 = conn.CASTable('binnedData2', replace=True)

   ...: result = organics.binning(

   ...:     inputs = ['purchase_3mon', 'purchase_6mon',

   ...:              'purchase_9mon', 'purchase_12mon'],

   ...:     tech = 'Quantile',

   ...:     casout = out_data2,

   ...:     copyallvars = True,

   ...:     nbinsarray = [4, 4, 4, 4]

   ...: )

   ...:

   ...: out_data2.columns

 

Out[5]:

Index(['ID', 'DemAffl', 'DemAge', 'DemGender', 'DemHomeowner',

       'DemAgeGroup', 'DemCluster', 'DemReg', 'DemTVReg', 'DemFlag1',

       'DemFlag2', 'DemFlag3', 'DemFlag4', 'DemFlag5', 'DemFlag6',

       'DemFlag7', 'DemFlag8', 'PromClass', 'PromTime', 'TargetBuy',

       'Bought_Beverages', 'Bought_Bakery', 'Bought_Canned',

       'Bought_Dairy', 'Bought_Baking', 'Bought_Frozen',

       'Bought_Meat', 'Bought_Fruits', 'Bought_Vegetables',

       'Bought_Cleaners', 'Bought_PaperGoods', 'Bought_Others',

       'purchase_3mon', 'purchase_6mon', 'purchase_9mon',

       'purchase_12mon', 'BIN_purchase_12mon', 'BIN_purchase_3mon',

       'BIN_purchase_6mon', 'BIN_purchase_9mon'],

      dtype='object')

You can get some idea about the binned data using the data exploration actions that were introduced previously. For example, you can generate a cross tabulation of two binned purchase amount columns.

In [6]: out_data2.crosstab(row='bin_purchase_3mon',

                           col='bin_purchase_12mon')

Out[6]:

[Crosstab]

 

   BIN_purchase_3mon      Col1      Col2      Col3      Col4

 0                 1  203175.0  117851.0   70718.0   30483.0

 1                 2  117668.0  124401.0  109169.0   70929.0

 2                 3   71142.0  108972.0  124566.0  117614.0

 3                 4   30231.0   70984.0  117812.0  203233.0

The binning action also provides a few options for cutpoints and for assigning observations to bins. The binning options are listed in the following table:

Binning Option Description Default Value
binmapping='left' | 'right' Controls how values that fall on the boundary between consecutive bins are mapped. left expresses the bins with [), [), ..., [] notation; right expresses them with [], (], ..., (] notation. left
binmissing=True | False When set to True, missing values are stored in a separate bin. The ID for this bin is 0. False
cutpts=list Specifies the user-provided cutpoints for the cutpts binning technique. Null
copyallvars=True | False When set to True, all the variables from the input table are copied to the scored output table. False
copyvars=list Specifies the names of the variables in the input table to use for identifying scored observations in the output table. The specified variables are copied to the output table. Null
includeinputvars=True | False When set to True, the analysis variables from the input table that are specified in the inputs parameter are copied to the output table. False
pctldef=integer Specifies the percentile definition to use. The definitions are numbered from 1 to 5. The default value, 0, specifies not to use a percentile definition. 0
pctlmaxniters=integer Specifies the maximum number of iterations for percentile computation.
pctltol=double Specifies the tolerance for percentile computation. 1e-05
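For instance, here is a hedged sketch that combines the cutpts and binmissing options from the preceding table; the cutpoint values are arbitrary illustrations, not recommendations:

# a minimal sketch of cutpoint binning, assuming illustrative boundaries
result = organics.binning(
    inputs='purchase_3mon',
    tech='cutpts',
    cutpts=[800, 900, 1000, 1100],  # user-specified bin boundaries
    binmissing=True,                # missing values, if any, go to bin ID 0
    casout=out_data
)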

Another difference between the binning action and the histogram action is that the binning action can generate SAS DATA step code that represents the binning model. You can download and save this code for later use in scoring new observations, or apply it to score a CAS table directly. Such DATA step code is often called score code when it represents a data transformation or an analytical model.

The next example requests the score code for a bucket binning of the purchase_3mon variable and saves it in a local result table. You can then convert the score code that is stored in result['CodeGen'] into a text file using Pandas.

In [7]: result = organics.binning(

   ...:     inputs='purchase_3mon',

   ...:     tech='bucket',

   ...:     casout=out_data,

   ...:     code={'comment':True, 'tabform':True},

   ...:     nBinsArray=10

   ...: )

   ...:

   ...: # the score code is saved in this result table
   ...: df = result['CodeGen']
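To write the generated code to a local .sas file, a minimal sketch (the file name is an arbitrary choice):

# join the SASCode column into one string and save it locally;
# fillna('') guards against blank rows in the generated listing
with open('binning_score.sas', 'w') as f:
    f.write('\n'.join(df['SASCode'].fillna('').astype(str)))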

Variable Imputation

The impute action from the datapreprocess action set manages missing values. You can use the distinct action to check which columns have missing values.

In [8]: df = organics.distinct()['Distinct']

   ...: df[df['NMiss'] > 0]

Out[8]:

Distinct Counts for CAS.ORGANICS

     Analysis  Number of Distinct  Number of Missing  Truncated
     Variable              Values             Values     or Not
 1    DemAffl                  34              82460          0
 2     DemAge                  63             114608          0
18   PromTime                  40              21356          0

Let’s run the impute action for the promotion time variable.

In [9]: organics[['PromTime']].impute()

Out[9]:

[ImputeInfo]

 

 Imputation Information for ORGANICS

 

    Variable ImputeTech     ResultVar          N    NMiss  

 0  PromTime       Mean  IMP_PromTime  1667592.0  21356.0   

 

    ImputedValueContinuous  

 0                 6.56467  

The result table of the impute action contains information about the treatment of missing values for each specified variable. PromTime is a numeric variable, so by default the impute action selects the sample mean as the value to fill in the observations with missing PromTime values.

The preceding example only suggests the imputation method; it does not actually change the data. You must specify a casout table for the impute action in order to create a new data set with imputed values.

In [10]: out_data = conn.CASTable('imputedData1', replace=True)

    ...:

    ...: impute1 = organics[['PromTime']].Impute()

    ...: impute1.methodcontinuous = 'Median'

    ...: impute1.copyallvars = True

    ...: impute1.casout = out_data

    ...: impute1()

Out[10]:

[ImputeInfo]

 

 Imputation Information for ORGANICS

 

    Variable ImputeTech     ResultVar          N    NMiss  

 0  PromTime     Median  IMP_PromTime  1667592.0  21356.0   

 

    ImputedValueContinuous  

 0                     5.0  

 

[OutputCasTables]

 

                 casLib          Name     Rows  Columns  

 0  CASUSERHDFS(username)  imputedData1  1688948       37   

 

                                   casTable  

 0  CASTable('imputedData1', caslib='CAS...  

In this example, we create a CASAction object impute1, add a casout table to the impute action, and request that all columns be copied from the input table to the casout table. Also, the imputation technique is changed from mean to median. The impute action also enables you to specify the value to impute.

In [11]: impute1.methodcontinuous = 'Value'

    ...: impute1.valuescontinuous = [0]

    ...: impute1()

Out[11]:

[ImputeInfo]

 

 Imputation Information for ORGANICS

 

    Variable ImputeTech     ResultVar          N    NMiss  

 0  PromTime      Value  IMP_PromTime  1667592.0  21356.0   

 

    ImputedValueContinuous  

 0                     0.0  

 

[OutputCasTables]

 

                 casLib          Name     Rows  Columns  

 0  CASUSERHDFS(username)  imputedData1  1688948       37   

 

                                   casTable  

 0  CASTable('imputedData1', caslib='CAS...  

The methodcontinuous option and the methodnominal option define the imputation treatment for continuous and categorical variables, respectively. The default imputation techniques are mean for continuous variables and mode for categorical variables.

The following table lists techniques for continuous variables:

Technique Description
MAX Replaces missing values with the maximum value.
MEAN Replaces missing values with the mean.
MEDIAN Replaces missing values with the median.
MIDRANGE Replaces missing values with the mean of the maximum value and the minimum value.
MIN Replaces missing values with the minimum value.
RANDOM Replaces missing values with uniform random numbers.
VALUE Replaces missing values with the values that are specified in the valuescontinuous parameters.

The following table lists techniques for categorical variables:

Technique Description
MODE Replaces missing values with the mode.
VALUE Replaces missing values with the values that are specified in the valuesnominal parameters.
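For example, here is a hedged sketch of nominal imputation with the VALUE technique; it assumes a categorical column with missing values, and the replacement value is an arbitrary illustration:

# a minimal sketch, assuming DemAgeGroup contains missing values;
# methodnominal and valuesnominal follow the preceding table
impute2 = organics[['DemAgeGroup']].Impute()
impute2.methodnominal = 'Value'
impute2.valuesnominal = ['Unknown']   # hypothetical replacement value
impute2()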

Similar to other actions in the datapreprocess action set, the impute action also generates score code that you can download so that you can repeat the imputation for a new data source.

In [12]: impute1.methodcontinuous = 'Median'

    ...: impute1.code = {'comment':True}

    ...: impute1()['CodeGen']

Out[12]:

 

                                      Score code for Impute action
                                                  SASCode

 0  /*-----------------------------------------------------------------------------

 1  SAS Code Generated by Cloud Analytic Server for Impute Action

 2  Date : 07Mar2016:11:12:45

 3  Number of variables : 1

 4  -----------------------------------------------------------------------------*/

 5  

 6  _igby_ = 0;

 7  _tnn_ntrans_ = 1;

 8  

 9  _fuzcmp_ = 0.000000000100;

10   

11   array _tnn_vnames_{1} imp_PromTime ;

12   

13   array _vnn_names_{1} PromTime ;

14   

15   array _tnn_ntransvars_{1} _temporary_ (1 );

16   

17   array _tv_nn_indices_{1} _temporary_ (1 );

18   

19   imp_PromTime = .;

20   

21   array _tnn_imputetype_{1} _temporary_ (2 );

22   

23   

24   array _tnn_imputeuniquevals_{1} _temporary_ (5 );

25   

26   

27   /*---------Iterate and score----------------*/

28   

29   /*---------Count variables----------------*/

30   _ct_ = 0;

31   _impct_ = 0;

32   

33   do _i_ = 1 to _tnn_ntrans_;

34   do _j_ = 1 to _tnn_ntransvars_{_i_};

35   

36   if (_tnn_imputetype_{_i_} ~= 0) then

37   _impct_ + 1;

38   _ct_ + 1;

39   _numval_ = _vnn_names_{_tv_nn_indices_{_ct_}};

40   

41   /*-------Apply Imputation phase--------------*/

42   if missing(_numval_) then

43   do;

44   if _tnn_imputetype_{_i_} = 0 then

45   goto _impute_done1_;

46   else _numval_ = _tnn_imputeuniquevals_{1 *_igby_ + _impct_};

47   end;

48   _impute_done1_:;

49   _tnn_vnames_{_ct_} = _numval_;

50   end;

51   end;

52   

53   drop _igby_ _tnn_ntrans_ _fuzcmp_ _ct_ _impct_ _i_ _j_ _numval_ ;
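As noted earlier, you can also apply the generated score code to a CAS table directly. Here is a hedged sketch that uses the runCode action from the dataStep action set; the output table name is a placeholder:

# a minimal sketch, assuming the generated code is joined into one string
score_code = '\n'.join(impute1()['CodeGen']['SASCode'].fillna('').astype(str))

conn.loadactionset('datastep')   # the dataStep action set provides runCode
conn.datastep.runcode(code='data imputedData2; set organics;\n'
                           + score_code + '\nrun;')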

Conclusion

In this chapter, we introduced actions from the simple and dataPreprocess action sets for data exploration, data summarization, and basic variable transformation. These actions enable you to learn about your data using different techniques for summarizing continuous and categorical variables. You can also apply variable transformations such as binning and imputation to reshape your data. Now that you understand your data, let's move on to Chapter 8 and Chapter 9 to build analytic models.
