Chapter 7: Data Exploration and Summary Statistics

Overview

Summarizing Continuous Variables

Descriptive Statistics

Histograms

Percentiles

Correlations

Summarizing Categorical Variables

Distinct Counts

Frequency

Top K

Cross Tabulations

Variable Transformation and Dimension Reduction

Variable Binning

Variable Imputation

Conclusion

Overview

Describing the columns of a table with tabular or visual output is typically the first step in a data analysis or statistical modeling process. In this chapter, you learn how to use CAS to explore and summarize data. Topics include summarizing continuous and categorical variables, variable transformation, dimension reduction, and related visualizations using the Python Bokeh package.

Let’s start by uploading the Organics Purchase data set from your local directory to CAS. In this example, conn is a CAS object that is connected to a session on a CAS server. For more information about managing your data in CAS, refer to Chapters 4 through 6.

In [1]: organics = conn.upload('yourSamplePath/organics.sashdat')

We also import the Pandas and Bokeh packages:

In [2]: import pandas as pd

   ...: from bokeh.charts import Bar, Scatter, output_file, show, Area

Summarizing Continuous Variables

The simple action set provides several useful actions for univariate and multivariate data exploration and summarization. Similar to some action sets that you have learned about in previous chapters, such as builtins and table, the simple action set is also preloaded when you start a new CAS server. Therefore, you can use it on the organics variable, which is a CASTable reference to the data set organics.sashdat that you just uploaded.
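If you want to confirm which action sets are loaded in your session, the builtins action set mentioned above provides an actionsetinfo action. A minimal sketch:

# List the action sets that are currently available in the CAS session
conn.builtins.actionsetinfo()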

Descriptive Statistics

Let’s get started with a basic summary action:

In [3]: organics.summary()

Out[3]:

[Summary]

 

 Descriptive Statistics for ORGANICS

 

            Column      Min      Max          N     NMiss  

 0         DemAffl     0.00    34.00  1606488.0   82460.0   

 1          DemAge    18.00    79.00  1574340.0  114608.0   

 2        PromTime     0.00    39.00  1667592.0   21356.0   

 3   purchase_3mon   698.44  1188.06  1688948.0       0.0   

 4   purchase_6mon  1668.77  2370.87  1688948.0       0.0   

 5   purchase_9mon  2624.09  3468.72  1688948.0       0.0   

 6  purchase_12mon  3698.92  4684.88  1688948.0       0.0   

 

           Mean           Sum         Std    StdErr           Var  

 0     8.711893  1.399555e+07    3.421045  0.002699     11.703547   

 1    53.797152  8.469501e+07   13.205734  0.010525    174.391404   

 2     6.564670  1.094719e+07    4.657008  0.003606     21.687723   

 3   950.027539  1.604547e+09   50.067179  0.038525   2506.722438   

 4  2049.979250  3.462308e+09   70.731010  0.054425   5002.875812   

 5  3070.016785  5.185099e+09   86.588115  0.066627   7497.501651   

 6  4189.994798  7.076683e+09  100.009042  0.076954  10001.808545   

 

             USS           CSS         CV        TValue  ProbT  

 0  1.407294e+08  1.880160e+07  39.268672   3227.695308    0.0  

 1  4.830901e+09  2.745512e+08  24.547273   5111.472302    0.0  

 2  1.080310e+08  3.616625e+07  70.940468   1820.333053    0.0  

 3  1.528598e+12  4.233721e+09   5.270077  24659.894243    0.0  

 4  7.106110e+12  8.449592e+09   3.450328  37665.847454    0.0  

 5  1.593100e+13  1.266288e+10   2.820444  46077.679978    0.0  

 6  2.966816e+13  1.689252e+10   2.386854  54448.053001    0.0  

In this example, you execute the summary action to generate descriptive statistics for all the numeric columns in the organics data set. The default statistics that are generated by the summary action are described in the following table:

Statistic  Description
Min        Minimum value
Max        Maximum value
N          Number of observations with nonmissing values
NMiss      Number of observations with missing values
Mean       Mean
Sum        Sum
Std        Standard Deviation
StdErr     Standard Error
Var        Variance
USS        Uncorrected Sum of Squares
CSS        Corrected Sum of Squares
CV         Coefficient of Variation
TValue     Value of the t statistic for hypothesis testing
ProbT      p-value of the t statistic

Let’s work on a specific set of columns. You can set the inputs parameter of the summary action to obtain the descriptive statistics for specified columns:

In [4]: varlist = ['DemAge', 'Purchase_12mon', 'Purchase_6mon']

   ...: organics.summary(inputs=varlist)

Out[4]:

[Summary]

 

 Descriptive Statistics for ORGANICS

 

            Column      Min      Max          N     NMiss  

 0          DemAge    18.00    79.00  1574340.0  114608.0   

 1  purchase_12mon  3698.92  4684.88  1688948.0       0.0   

 2   purchase_6mon  1668.77  2370.87  1688948.0       0.0   

 

           Mean           Sum         Std    StdErr           Var  

 0    53.797152  8.469501e+07   13.205734  0.010525    174.391404   

 1  4189.994798  7.076683e+09  100.009042  0.076954  10001.808545   

 2  2049.979250  3.462308e+09   70.731010  0.054425   5002.875812   

 

             USS           CSS         CV        TValue  ProbT  

 0  4.830901e+09  2.745512e+08  24.547273   5111.472302    0.0  

 1  2.966816e+13  1.689252e+10   2.386854  54448.053001    0.0  

 2  7.106110e+12  8.449592e+09   3.450328  37665.847454    0.0  

This code generates summary statistics only for columns that you have specified: DemAge, Purchase_12mon, and Purchase_6mon. In the next example, we compute the summary statistics for the columns for recent purchase amounts and save the result in a local object.

In [5]: varlist = ['Purchase_3mon', 'Purchase_6mon', 'Purchase_9mon',

   ...:             'Purchase_12mon']

   ...: result = organics.summary(inputs=varlist)

The result is a CASResults object that contains the log, the performance data, the server information, and the output of the submitted CAS action. For example, you can print out the run time of the preceding summary action:

In [6]: print(result.performance.cpu_system_time)

   ...: print(result.performance.cpu_user_time)

Out[6]:

0.766884

0.367944

A CAS action might return one or more tables, referenced by the keys of the CASResults object. In this example, only one table, Summary, is returned by the summary action.

In [7]: list(result.keys())

Out[7]: ['Summary']

Result tables are returned in the form of SASDataFrame objects.  Because the SASDataFrame is a subclass of the Pandas DataFrame, it works seamlessly with Pandas and many Python packages that support DataFrames. For example, you can use the stack method to rearrange the output summary table.

In [8]: df = result['Summary']

   ...: df.columns

 

Out[8]: Index(['Column', 'Min', 'Max', 'N', 'NMiss', 'Mean', 'Sum', 'Std', 'StdErr', 'Var', 'USS', 'CSS', 'CV', 'TValue', 'ProbT'], dtype='object')

 

 

In [9]: df.index = df['Column']

   ...: stackedDf = df[['Min','Mean','Max']].stack()

   ...: print(stackedDf)

 

Out[9]:

Column              

purchase_3mon   Min      698.440000

                Mean     950.006638

                Max     1188.060000

purchase_6mon   Min     1668.770000

                Mean    2049.989809

                Max     2370.870000

purchase_9mon   Min     2624.090000

                Mean    3070.037322

                Max     3468.720000

purchase_12mon  Min     3704.700000

                Mean    4190.004978

                Max     4684.880000

You can also use the Bokeh package to generate a bar chart using the summary statistics that are generated by the summary action.

In [10]: p = Bar(df, 'Column', values='Mean',

    ...:         color='#1f77b4', agg='mean', title='',

    ...:         xlabel='', ylabel='Mean')

    ...: output_file('visual1.html')

    ...: show(p)

image

Sometimes it is useful to summarize your data by segments, which is accomplished with the groupby parameter.

In [11]: organics.groupby='DemGender'

    ...: result = organics.summary(inputs='DemAge')

The groupby parameter divides the data into three segments according to the value of DemGender: F, M, or U. The summary action then summarizes the DemAge column for each segment, and the summary statistics are returned in three separate DataFrames.

In [12]: list(result.keys())

Out[12]: ['ByGroupInfo', 'ByGroup1.Summary', 'ByGroup2.Summary',

          'ByGroup3.Summary']

In this case, the summary action also returns a table called ByGroupInfo that provides basic information about the by groups.

In [13]: result['ByGroupInfo']

Out[13]:

ByGroupInfo

 

  DemGender DemGender_f _key_

0         F           F     F

1         M           M     M

2         U           U     U

To merge these tables into one result table, use the CASResults concat_bygroups method. It returns a new CASResults object with all by-group tables that are concatenated.

In [14]: result2 = result.concat_bygroups()

    ...: result2['Summary'][['N','Min','Mean','Max','Std']]

 

Out[14]:

Descriptive Statistics for ORGANICS

 

                  N   Min       Mean   Max        Std

DemGender                                            

F          861460.0  18.0  52.880723  79.0  13.546649

M          411768.0  18.0  54.535807  79.0  12.817555

U          301112.0  18.0  55.408884  79.0  12.504695

When you specify a group-by variable, the summary action does not require any data shuffle to summarize data by segments. In fact, you can have as many group-by variables as you need. For example, the next action call summarizes DemAge by two columns: DemGender and DemHomeowner.

In [15]: varlist='DemAge'

    ...: organics.groupby=['DemGender','DemHomeowner']

    ...: result = organics.summary(inputs=varlist).concat_bygroups()

    ...: result['Summary'][['N','Min','Mean','Max','Std']]

 

Out[15]:

Descriptive Statistics for ORGANICS

 

                               N   Min       Mean   Max        Std

DemGender DemHomeowner                                            

F         No            560471.0  18.0  52.883482  79.0  13.543064

          Yes           300989.0  18.0  52.875587  79.0  13.553343

M         No            267238.0  18.0  54.525262  79.0  12.814615

          Yes           144530.0  18.0  54.555303  79.0  12.823012

U         No            195894.0  18.0  55.414745  79.0  12.504366

          Yes           105218.0  18.0  55.397974  79.0  12.505360

In this example, we concatenate the by-group tables, and the levels of the group-by variables are automatically assigned as the index of the underlying Pandas DataFrame. The index is shown below:

In [16]: result['Summary'].index

Out[16]:

MultiIndex(levels=[['F', 'M', 'U'], ['No', 'Yes']],

           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],

           names=['DemGender', 'DemHomeowner'])
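Because the concatenated table uses a Pandas MultiIndex, you can extract the statistics for a single segment with the standard .loc accessor. A small usage sketch based on the preceding table:

# Select the summary statistics for female homeowners
result['Summary'].loc[('F', 'Yes'), ['N', 'Mean']]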

To remove the grouping variables from the CASTable object, you can use either the Python del statement or the CASTable del_params method. After the grouping variables are removed, the summary action computes summary statistics for the entire data set.

In [17]: del organics.groupby

You can also use the del_param method:

In [18]: organics.del_param('groupby')

Histograms

Histograms are another useful visualization tool for summarizing continuous data. A histogram estimates the probability distribution of a continuous variable by equal-distance binning. The CAS dataPreprocess action set provides a histogram action. In the next example, we execute the histogram action for the Purchase_3mon column:

In [19]: result = organics.histogram(

    ...:     reqpacks=[{'nicebinning':False, 'nbins':10}],

    ...:     inputs=['Purchase_3mon']

    ...: )

 

In [20]: result['BinDetails']

Out[20]:

Bin Details for ORGANICS

 

        Variable  BinSetId  BinId  BinLowerBnd  BinUpperBnd  

0  purchase_3mon         1      1      698.440      747.402   

1  purchase_3mon         1      2      747.402      796.364   

2  purchase_3mon         1      3      796.364      845.326   

3  purchase_3mon         1      4      845.326      894.288   

4  purchase_3mon         1      5      894.288      943.250   

5  purchase_3mon         1      6      943.250      992.212   

6  purchase_3mon         1      7      992.212     1041.174   

7  purchase_3mon         1      8     1041.174     1090.136   

8  purchase_3mon         1      9     1090.136     1139.098   

9  purchase_3mon         1     10     1139.098     1188.060   

 

   BinWidth  NInBin         Mean        Std      Min      Max  

0    48.962      45   735.920444  11.629722   698.44   747.30   

1    48.962    1802   783.902153  10.767204   747.72   796.36   

2    48.962   28782   830.094716  12.081154   796.37   845.32   

3    48.962  193828   875.761523  13.155664   845.33   894.28   

4    48.962  529051   921.156725  13.782269   894.29   943.24   

5    48.962  597872   966.341785  13.869978   943.25   992.21   

6    48.962  279514  1011.694702  13.360849   992.22  1041.17   

7    48.962   53719  1057.383763  12.429201  1041.18  1090.12   

8    48.962    4214  1103.745873  11.365814  1090.14  1139.07   

9    48.962     121  1151.365785  10.677088  1139.32  1188.06   

 

   MidPoint    Percent  

0   722.921   0.002664  

1   771.883   0.106694  

2   820.845   1.704138  

3   869.807  11.476256  

4   918.769  31.324292  

5   967.731  35.399077  

6  1016.693  16.549592  

7  1065.655   3.180619  

8  1114.617   0.249504  

9  1163.579   0.007164  

Each row of the BinDetails output contains the binning information for a specific bin, including the lower and upper bounds, the bin width, and per-bin summary statistics such as counts, means, standard deviations, minimums, maximums, and midpoints. You can also use the Bokeh package to create a histogram.

In [21]: p = Bar(result['BinDetails'], 'MidPoint', values='Percent',

    ...:         color='#1f77b4', agg='mean', title='', legend=None,

    ...:         xlabel = 'Purchase_3mon',

    ...:         ylabel = 'Percent'

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

image

In the preceding example, we set nicebinning to False. The nicebinning option requests rounded lower and upper bounds for the bins, which usually produces nicer cutpoints for visualization and interpretation. A setting of True for nicebinning generates the following Bokeh visualization:

image
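For reference, the call that produces the preceding figure mirrors the earlier example, with the nicebinning option turned on. A sketch:

# Same histogram request as before, but with rounded bin boundaries
result = organics.histogram(
    reqpacks=[{'nicebinning': True, 'nbins': 10}],
    inputs=['Purchase_3mon']
)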

Compared to the previous graph, the nicebinning option allows bounds only at a step size of 100, which is determined automatically by the range of the variable Purchase_3mon. Sometimes you need to increase the number of bins when you use the nicebinning option to get a better understanding of the data. In this example, let us change the number of bins from 10 to 25.

In [22]: result= organics.histogram(

    ...:     reqpacks=[{'nicebinning':True, 'nbins':25}],

    ...:     inputs=['Purchase_3mon']

    ...: )

 

In [23]: p = Bar(result['BinDetails'], 'MidPoint', values='Percent',

    ...:         color='#1f77b4', agg='mean', title='', legend=None,

    ...:         xlabel = 'Purchase_3mon',

    ...:         ylabel = 'Percent'

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

image

In this example, 25 is chosen as the step size. The histogram action also enables you to set the bins manually. For example, you can force a fixed bin width of 50. In this case, you don’t need to specify the number of bins because it is determined automatically from the bin width and the range of the variable.

In [24]: result= organics.histogram(

    ...:     reqpacks=[{'binwidth':50}],

    ...:     inputs=['Purchase_3mon']

    ...: )

 

In [25]: p = Bar(result['BinDetails'], 'MidPoint', values='Percent',

    ...:         color='#1f77b4', agg='mean', title='', legend=None,

    ...:         xlabel='Purchase_3mon',

    ...:         ylabel='Percent'

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

image

Similar to the summary action, the histogram action also supports group-by processing. You can request a histogram for each data segment that is defined by a set of group-by columns.

In [26]: organics.groupby = ['DemGender', 'DemAgeGroup']

    ...: result = organics.histogram(

    ...:     reqpacks=[{'nicebinning':False, 'nbins':20}],

    ...:     inputs=['DemAffl']

    ...: )

    ...: list(result.keys())

Out[26]:

['ByGroupInfo',

 'ByGroup1.BinDetails',

 'ByGroup2.BinDetails',

 'ByGroup3.BinDetails',

 'ByGroup4.BinDetails',

 'ByGroup5.BinDetails',

 'ByGroup6.BinDetails',

 'ByGroup7.BinDetails',

 'ByGroup8.BinDetails',

 'ByGroup9.BinDetails',

 'ByGroup10.BinDetails',

 'ByGroup11.BinDetails',

 'ByGroup12.BinDetails']

Except for the ByGroupInfo table, each result table contains the histogram statistics for one by group. The ByGroupInfo table provides an overview of the by-group levels:

In [27]: result['ByGroupInfo']

Out[27]:

ByGroupInfo

 

   DemGender DemGender_f DemAgeGroup DemAgeGroup_f     _key_

0          F           F      middle        middle   Fmiddle

1          F           F      senior        senior   Fsenior

2          F           F     unknown       unknown  Funknown

3          F           F       young         young    Fyoung

4          M           M      middle        middle   Mmiddle

5          M           M      senior        senior   Msenior

6          M           M     unknown       unknown  Munknown

7          M           M       young         young    Myoung

8          U           U      middle        middle   Umiddle

9          U           U      senior        senior   Usenior

10         U           U     unknown       unknown  Uunknown

11         U           U       young         young    Uyoung

Next, let’s collect the statistics from the output table and use Bokeh to generate a grid of histograms. The all_df dictionary contains the title of each histogram as well as its data. You must generate visualizations, one by one, using the Bokeh bar chart and then define the layout of the grid using the gridplot function from Bokeh.

In [28]: all_df = {

    ...:     'Gender=Female, AgeGroup=Middle':  result['ByGroup1.BinDetails'],

    ...:     'Gender=Female, AgeGroup=Senior':  result['ByGroup2.BinDetails'],

    ...:     'Gender=Female, AgeGroup=Unknown': result['ByGroup3.BinDetails'],

    ...:     'Gender=Female, AgeGroup=Young':   result['ByGroup4.BinDetails']

    ...: }

    ...: all_pic = []

    ...:

    ...: for this_title in all_df:

    ...:    this_pic = Bar(all_df[this_title], 'MidPoint', values='Percent',

    ...:        color='#1f77b4', agg='mean', title=this_title, legend=None,

    ...:        xlabel='DemAffl',

    ...:        ylabel='Percent'

    ...:    )

    ...:    all_pic.append(this_pic)

    ...:

    ...: from bokeh.io import gridplot

    ...:

    ...: p = gridplot([[all_pic[0], all_pic[1]],

    ...:               [all_pic[2], all_pic[3]]])

    ...: output_file('grid.html')

    ...: show(p)

 

 

image

Percentiles

Percentiles are another useful tool for numeric data exploration. A percentile is the value below which a given percentage of the observations falls. In CAS, such statistics are available in the percentile action set.

In [29]: conn.loadactionset('percentile')

    ...:

    ...: del organics.groupby

    ...: organics.percentile(inputs='DemAge')

NOTE: Added action set 'percentile'.

Out[29]:

[Percentile]

Percentiles for ORGANICS

 

   Variable  Pctl  Value  Converged

 0   DemAge  25.0   44.0        1.0

 1   DemAge  50.0   54.0        1.0

 2   DemAge  75.0   64.0        1.0

By default, the percentile action computes the 25th, the 50th, and the 75th percentiles for the selected columns.  The preceding result shows that 25% of the observations of DemAge are less than 44, and half of the observations of DemAge are less than 54.  Finally, 75% of the observations of DemAge are less than 64. These three percentiles are also known as the first quartile (Q1), the second quartile (median), and the third quartile (Q3).
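For example, you can compute the interquartile range (IQR = Q3 - Q1) from the returned table with ordinary Pandas operations. A small sketch:

# Extract Q1 and Q3 from the Percentile table and compute the IQR
pct = organics.percentile(inputs='DemAge')['Percentile']
q1 = pct.loc[pct['Pctl'] == 25, 'Value'].iloc[0]
q3 = pct.loc[pct['Pctl'] == 75, 'Value'].iloc[0]
print(q3 - q1)   # 64 - 44 = 20 for this data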

The last column of the output table indicates whether the percentile calculation has converged. The CAS percentile action does not sort the data to find the percentiles. Instead, it uses a numeric approximation algorithm to estimate the percentiles, and the last column indicates whether the algorithm has converged. Two parameters, maxiters and epsilon, control the accuracy of the approximation.
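For example, a tighter approximation can be requested by adjusting these two parameters in the action call. A sketch with illustrative values; the parameter names are as given above:

# Request a more accurate percentile approximation
# (the maxiters and epsilon values here are illustrative)
organics.percentile(inputs='DemAge', maxiters=100, epsilon=1e-6)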

You can also specify the percentage values directly using the values option. For example, you can get a list of percentiles for percentages from 5% to 90% with a step size of 5% as follows.

In [30]: result = organics.percentile(inputs='DemAge', values=list(range(5,95,5)))

    ...: result

Out[30]:

[Percentile]

 

Percentiles for ORGANICS

 

    Variable  Pctl  Value  Converged

 0    DemAge   5.0   32.0        1.0

 1    DemAge  10.0   36.0        1.0

 2    DemAge  15.0   39.0        1.0

 3    DemAge  20.0   41.0        1.0

 4    DemAge  25.0   44.0        1.0

 5    DemAge  30.0   46.0        1.0

 6    DemAge  35.0   48.0        1.0

 7    DemAge  40.0   50.0        1.0

 8    DemAge  45.0   52.0        1.0

 9    DemAge  50.0   54.0        1.0

 10   DemAge  55.0   56.0        1.0

 11   DemAge  60.0   58.0        1.0

 12   DemAge  65.0   60.0        1.0

 13   DemAge  70.0   62.0        1.0

 14   DemAge  75.0   64.0        1.0

 15   DemAge  80.0   66.0        1.0

 16   DemAge  85.0   69.0        1.0

 17   DemAge  90.0   72.0        1.0

The percentile action supports group-by processing as well:

In [31]: organics.groupby = ['DemGender']

    ...: result = organics.percentile(inputs='DemAge', values=list(range(5,95,5)))

In this case, we can use the Bokeh package to visually compare the distributions of DemAge across three different levels of the DemGender column.

In [32]: df = result.concat_bygroups()['Percentile']

    ...: df.reset_index(level=0, inplace=True)

 

In [33]: p = Scatter(df, x='Pctl', y='Value', legend='top_center', marker='DemGender')

    ...: output_file('scatter.html')

    ...: show(p)

image

Correlations

The simple action set also provides a correlation action that generates a matrix of Pearson product-moment correlation coefficients for a set of variables.

In [34]: organics = conn.CASTable('ORGANICS')

    ...: organics.correlation()

Out[34]:

[CorrSimple]

 

 Summary Statistics in Correlation Analysis for ORGANICS

 

          Variable          N         Mean           Sum      StdDev  

 0         DemAffl  1606488.0     8.711893  1.399555e+07    3.421045   

 1          DemAge  1574340.0    53.797152  8.469501e+07   13.205734   

 2        PromTime  1667592.0     6.564670  1.094719e+07    4.657008   

 3   purchase_3mon  1688948.0   950.027539  1.604547e+09   50.067179   

 4   purchase_6mon  1688948.0  2049.979250  3.462308e+09   70.731010   

 5   purchase_9mon  1688948.0  3070.016785  5.185099e+09   86.588115   

 6  purchase_12mon  1688948.0  4189.994798  7.076683e+09  100.009042   

 

    Minimum  Maximum  

 0     0.00    34.00  

 1    18.00    79.00  

 2     0.00    39.00  

 3   698.44  1188.06  

 4  1668.77  2370.87  

 5  2624.09  3468.72  

 6  3698.92  4684.88  

 

[Correlation]

 

 Pearson Correlation Coefficients for ORGANICS

 

          Variable   DemAffl    DemAge  PromTime  purchase_3mon  

 0         DemAffl  1.000000 -0.137767 -0.029736       0.001571   

 1          DemAge -0.137767  1.000000  0.207221      -0.001160   

 2        PromTime -0.029736  0.207221  1.000000      -0.000304   

 3   purchase_3mon  0.001571 -0.001160 -0.000304       1.000000   

 4   purchase_6mon  0.001686 -0.001008 -0.000564       0.707469   

 5   purchase_9mon  0.001459 -0.000541 -0.001720       0.578367   

 6  purchase_12mon  0.000965 -0.000587 -0.002255       0.500961   

 

    purchase_6mon  purchase_9mon  purchase_12mon  

 0       0.001686       0.001459        0.000965  

 1      -0.001008      -0.000541       -0.000587  

 2      -0.000564      -0.001720       -0.002255  

 3       0.707469       0.578367        0.500961  

 4       1.000000       0.816600        0.707387  

 5       0.816600       1.000000        0.866058  

 6       0.707387       0.866058        1.000000  

By default, the correlation action uses all of the numeric columns and also computes univariate statistics for each column as a by-product of computing the Pearson correlation coefficients. You can set the simple parameter to False to suppress the univariate statistics. To select a set of columns instead of all numeric columns, set the inputs parameter of the correlation action.

In [35]: varlist = ['DemAffl', 'DemAge', 'purchase_3mon']

    ...: organics.correlation(inputs=varlist, simple=False)

Out[35]:

[Correlation]

 

 Pearson Correlation Coefficients for ORGANICS

 

         Variable   DemAffl    DemAge  purchase_3mon

 0        DemAffl  1.000000 -0.137767       0.001571

 1         DemAge -0.137767  1.000000      -0.001160

 2  purchase_3mon  0.001571 -0.001160       1.000000

Summarizing Categorical Variables

In the previous section, we examined some useful actions in the simple, dataPreprocess, and percentile action sets of CAS for summarizing and exploring continuous variables. The simple action set also provides several actions for categorical data analysis: distinct, freq, topk, and crosstab.

Distinct Counts

Before you work on any categorical data, it is useful to first check the cardinality of the data using the distinct action.

In [1]: organics = conn.CASTable('ORGANICS')

 

In [2]: organics.distinct()

Out[2]:

[Distinct]

 

 Distinct Counts for ORGANICS

 

                Column  NDistinct     NMiss  Trunc

 0                  ID  1688948.0       0.0    0.0

 1             DemAffl       34.0   82460.0    0.0

 2              DemAge       63.0  114608.0    0.0

 3           DemGender        3.0       0.0    0.0

 4        DemHomeowner        2.0       0.0    0.0

 5         DemAgeGroup        4.0       0.0    0.0

 6          DemCluster       56.0       0.0    0.0

 7              DemReg        6.0       0.0    0.0

 8            DemTVReg       14.0       0.0    0.0

 9            DemFlag1        2.0       0.0    0.0

 10           DemFlag2        2.0       0.0    0.0

 11           DemFlag3        2.0       0.0    0.0

 12           DemFlag4        2.0       0.0    0.0

 13           DemFlag5        2.0       0.0    0.0

 14           DemFlag6        2.0       0.0    0.0

 15           DemFlag7        2.0       0.0    0.0

 16           DemFlag8        2.0       0.0    0.0

 17          PromClass        4.0       0.0    0.0

 18           PromTime       40.0   21356.0    0.0

 19          TargetBuy        2.0       0.0    0.0

 20   Bought_Beverages        2.0       0.0    0.0

 21      Bought_Bakery        2.0       0.0    0.0

 22      Bought_Canned        2.0       0.0    0.0

 23       Bought_Dairy        2.0       0.0    0.0

 24      Bought_Baking        2.0       0.0    0.0

 25      Bought_Frozen        2.0       0.0    0.0

 26        Bought_Meat        2.0       0.0    0.0

 27      Bought_Fruits        2.0       0.0    0.0

 28  Bought_Vegetables        2.0       0.0    0.0

 29    Bought_Cleaners        2.0       0.0    0.0

 30  Bought_PaperGoods        2.0       0.0    0.0

 31      Bought_Others        2.0       0.0    0.0

 32      purchase_3mon    32944.0       0.0    0.0

 33      purchase_6mon    44997.0       0.0    0.0

 34      purchase_9mon    54032.0       0.0    0.0

 35     purchase_12mon    61444.0       0.0    0.0

By default, the distinct action calculates two types of information for each column in the table: 1) the number of distinct values, and 2) the number of missing values, for both character and numeric columns. Computing the exact distinct counts for numeric columns and unique identifiers (such as the ID column in the Organics table) can be slow and can require a large memory footprint, but the exact counts for such columns are often not useful: usually we need to know only whether a variable's number of distinct levels exceeds a certain threshold. The distinct action provides the maxnvals option to skip the exact counting of distinct levels for high-cardinality variables.

In [3]: organics.distinct(maxnvals=500)

Out[3]:

[Distinct]

 

 Distinct Counts for ORGANICS

 

                Column  NDistinct     NMiss  Trunc

 0                  ID      500.0       0.0    1.0

 1             DemAffl       34.0   82460.0    0.0

 2              DemAge       63.0  114608.0    0.0

 3           DemGender        3.0       0.0    0.0

 4        DemHomeowner        2.0       0.0    0.0

 5         DemAgeGroup        4.0       0.0    0.0

 6          DemCluster       56.0       0.0    0.0

 7              DemReg        6.0       0.0    0.0

 8            DemTVReg       14.0       0.0    0.0

 9            DemFlag1        2.0       0.0    0.0

 10           DemFlag2        2.0       0.0    0.0

 11           DemFlag3        2.0       0.0    0.0

 12           DemFlag4        2.0       0.0    0.0

 13           DemFlag5        2.0       0.0    0.0

 14           DemFlag6        2.0       0.0    0.0

 15           DemFlag7        2.0       0.0    0.0

 16           DemFlag8        2.0       0.0    0.0

 17          PromClass        4.0       0.0    0.0

 18           PromTime       40.0   21356.0    0.0

 19          TargetBuy        2.0       0.0    0.0

 20   Bought_Beverages        2.0       0.0    0.0

 21      Bought_Bakery        2.0       0.0    0.0

 22      Bought_Canned        2.0       0.0    0.0

 23       Bought_Dairy        2.0       0.0    0.0

 24      Bought_Baking        2.0       0.0    0.0

 25      Bought_Frozen        2.0       0.0    0.0

 26        Bought_Meat        2.0       0.0    0.0

 27      Bought_Fruits        2.0       0.0    0.0

 28  Bought_Vegetables        2.0       0.0    0.0

 29    Bought_Cleaners        2.0       0.0    0.0

 30  Bought_PaperGoods        2.0       0.0    0.0

 31      Bought_Others        2.0       0.0    0.0

 32      purchase_3mon      500.0       0.0    1.0

 33      purchase_6mon      500.0       0.0    1.0

 34      purchase_9mon      500.0       0.0    1.0

 35     purchase_12mon      500.0       0.0    1.0

In this example, the distinct action loops through the data and stops counting distinct values for the columns ID, purchase_3mon, purchase_6mon, purchase_9mon, and purchase_12mon as soon as their running distinct counts reach 500. Therefore, the number of distinct values for these columns is shown as 500 in the output table. The Trunc column is a binary indicator of whether a distinct count has been truncated (that is, whether it is not exact).

Note that in this data, the categorical variable with the highest cardinality is DemCluster, which has 56 distinct levels. Missing-value counts are another useful piece of information in the distinct action output. In this example, three columns (DemAffl, DemAge, and PromTime) have a significant number of missing values.
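Because the Distinct table is a DataFrame, you can filter it with ordinary Pandas expressions, for example, to list only the columns that contain missing values. A small sketch:

# List the columns of the Distinct output that have missing values
dist = organics.distinct()['Distinct']
print(dist[dist['NMiss'] > 0])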

In the preceding examples, we requested that the output be downloaded to the Python client. You can also store the output of the distinct action in a table on the CAS server. You can then use the head method to download results from the output table.

In [4]: result = conn.CASTable('distinctOutput', replace=True)

   ...: organics.distinct(maxnvals=500, casout=result)

   ...: result.head()

Out[4]:

Selected Rows from Table CAS.DISTINCTOUTPUT

 

       _Column_  _NDis_   _NMiss_  _Truncated_

0            ID     0.0       0.0          1.0

1       DemAffl    34.0   82460.0          0.0

2        DemAge    63.0  114608.0          0.0

3     DemGender     3.0       0.0          0.0

4  DemHomeowner     2.0       0.0          0.0

The columninfo action generates basic information about each column, such as whether it is numeric (double, int32, or int64) or character (char or varchar). The distinct action further computes the cardinality of the columns. It is common to combine these two pieces of information to determine whether a column should be used as a categorical variable or a continuous variable. The following example merges the output tables from the columninfo and distinct actions using the Pandas merge function. Then it selects character columns with no more than 128 levels and numeric columns with no more than 16 levels as categorical variables. Also, it selects numeric columns with more than 16 levels as continuous variables. All other columns are dropped into the Others group.

In [5]: out1 = organics.columninfo()['ColumnInfo']

   ...: out2 = organics.distinct(maxnvals=1000)['Distinct']

   ...: out3 = pd.merge(out1, out2, left_on='Column', right_on='Column',

   ...:                 how='left')

   ...:

   ...: varlist = { 'cats': [], 'conts': [], 'others' : [] }

   ...:

   ...: for index, row in out3.iterrows():

   ...:     varname = row['Column'].lower()

   ...:     vartype = row['Type'].lower()

   ...:     if vartype == 'char' and row['NDistinct'] <= 128:

   ...:         varlist['cats'].append(varname)

   ...:     elif vartype == 'double' and row['NDistinct'] <= 16:

   ...:         varlist['cats'].append(varname)

   ...:     elif vartype == 'double' and row['NDistinct'] > 16:

   ...:         varlist['conts'].append(varname)

   ...:     else:

   ...:         varlist['others'].append(varname)

   ...:

   ...: print(varlist)

{'conts': ['demaffl', 'demage', 'promtime', 'purchase_3mon', 'purchase_6mon', 'purchase_9mon', 'purchase_12mon'],

 'cats': ['demgender', 'demhomeowner', 'demagegroup', 'demcluster', 'demreg', 'demtvreg', 'demflag1', 'demflag2', 'demflag3', 'demflag4', 'demflag5', 'demflag6', 'demflag7', 'demflag8', 'promclass', 'targetbuy', 'bought_beverages', 'bought_bakery', 'bought_canned', 'bought_dairy', 'bought_baking', 'bought_frozen', 'bought_meat', 'bought_fruits', 'bought_vegetables', 'bought_cleaners', 'bought_papergoods', 'bought_others'],

 'others': ['id']}

Frequency

The distinct action computes only the number of distinct levels of a categorical variable. To get the frequency distribution of a categorical variable, you must use the freq action in the simple action set.

In [6]: varlist=['TargetBuy']

   ...: organics.freq(inputs=varlist)

Out[6]:

[Frequency]

 

 Frequency for ORGANICS

 

       Column CharVar  FmtVar  Level  Frequency

 0  TargetBuy  Bought  Bought      1   418380.0

 1  TargetBuy      No      No      2  1270568.0

Note that this method is equivalent to using column indexing first and then calling the freq action:

In [7]: organics[['TargetBuy']].freq()

The output table of the freq action contains one row for each level of the categorical variable, with the following columns:

Column Name  Column Label       Description
Column       Analysis Variable  Column name of the variable in the input data set
CharVar      Character Value    Unformatted value of the variable
FmtVar       Formatted Value    Formatted value of the variable (SAS formats)
Level        Level              Index of the distinct variable value
Frequency    Frequency          Frequency count

Next, let’s visualize the result from the freq action. Similar to the data explorations in the previous section, a Bokeh bar chart is used.

In [8]: from bokeh.charts import Bar, output_file, show

   ...:

   ...: df = organics['TargetBuy'].freq()

   ...:

   ...: p = Bar(df['Frequency'], 'FmtVar', values='Frequency',

   ...:         color='#1f77b4', agg='mean', title='', legend=None,

   ...:         xlabel='TargetBuy',

   ...:         ylabel='Frequency'

   ...: )

   ...: output_file('bar.html')

   ...: show(p)

image

You can request multiple frequency tables in a single freq call by adding more variables to the variable list of the CASTable object.

In [9]: organics[['TargetBuy','DemAgeGroup','DemHomeowner']].freq()

Out[9]:

[Frequency]

 

 Frequency for ORGANICS

 

          Column  CharVar   FmtVar  Level  Frequency

 0     TargetBuy   Bought   Bought      1   418380.0

 1     TargetBuy       No       No      2  1270568.0

 2   DemAgeGroup   middle   middle      1   991800.0

 3   DemAgeGroup   senior   senior      2   532456.0

 4   DemAgeGroup  unknown  unknown      3   114608.0

 5   DemAgeGroup    young    young      4    50084.0

 6  DemHomeowner       No       No      1  1097860.0

 7  DemHomeowner      Yes      Yes      2   591088.0

It is valid to use the freq action on numeric variables as well. The next example uses the freq action to generate a frequency table of the DemAge variable. In this example, the freq action is equivalent to a histogram action with integer binning and a step size set to 1.
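For comparison, the corresponding histogram request would fix the bin width at 1, using the binwidth option demonstrated earlier. A sketch:

# A roughly equivalent view of DemAge: integer bins of width 1
result = organics.histogram(reqpacks=[{'binwidth': 1}], inputs=['DemAge'])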

In [10]: df = organics['DemAge'].freq(includemissing=False)

    ...:

    ...: p = Bar(df['Frequency'], 'FmtVar', values='Frequency',

    ...:         color='#1f77b4', agg='mean', title='', legend=None,

    ...:         xlabel='Age',

    ...:         ylabel='Frequency',

    ...:         bar_width=1,

    ...:         plot_width=1200, plot_height=600

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

image

You can see that the age of the customers in this data set has a mode of 51, and the distribution is skewed to the left. The minimum age is 18, and the maximum age is 79. From the distinct action, we also know that this variable has many missing values, so it is helpful to enable the includemissing flag to display a bar for the missing values on the chart.

In [11]: df = organics[['DemAge']].freq(includemissing=True)

    ...:

    ...: p = Bar(df['Frequency'], 'FmtVar', values='Frequency',

    ...:         color='#1f77b4', agg='mean', title='',

    ...:         xlabel='Age',

    ...:         ylabel='Frequency',

    ...:         bar_width=1,

    ...:         plot_width=1200, plot_height=600

    ...: )

    ...: output_file('bar.html')

    ...: show(p)

 

image

Top K

The freq action returns a complete one-way frequency table. For high-cardinality categorical variables, computing the entire frequency table might be time-consuming. In such cases, the simple action set provides the topk action to generate partial frequency tables.

In [12]: organics['purchase_12mon'].topk(topk=5, bottomk=0)

Out[12]:

[Topk]

 

 Top and Bottom Distinct Values for ORGANICS

 

            Column        FmtVar  Rank

 0  purchase_12mon       4684.88     1

 1  purchase_12mon       4653.49     2

 2  purchase_12mon       4646.95     3

 3  purchase_12mon       4641.13     4

 4  purchase_12mon       4631.91     5

 

[TopkMisc]

 

 Miscellaneous Information on Distinct Values for ORGANICS

 

            Column      N  TruncatedTopk  TruncatedBtmk  ScoreOther

 0  purchase_12mon  61444              0              0         NaN

Unlike the freq action, the topk action produces two output tables instead of one. The first table contains the top or bottom values of the input variables, and the second table contains a brief report for each input variable. You can loop through the CASResults object to obtain the name of each table and its columns.

In [13]: result = organics['purchase_12mon'].topK(topk=5, bottomk=0)

    ...: for df in result:

    ...:     print(df + ' table has: ' +

    ...:           ', '.join(result[df].columns.values))

 

Topk table has: Column, NumVar, Rank

TopkMisc table has: Column, N, TruncatedTopk, TruncatedBtmk, ScoreOther

Similar to the freq action, the topk action can be used to work on more than one variable and to compute both the bottom and top values of the variables. Simply specify the preferred variables when you subset the CASTable object, and specify the topk and bottomk parameters.

In [14]: organics[['purchase_12mon','DemAge']].topK(topk=5, bottomk=5)

Out[14]:

[Topk]

 

 Top and Bottom Distinct Values for ORGANICS

 

             Column        FmtVar   Rank

 0   purchase_12mon       4684.88      1

 1   purchase_12mon       4653.49      2

 2   purchase_12mon       4646.95      3

 3   purchase_12mon       4641.13      4

 4   purchase_12mon       4631.91      5

 5   purchase_12mon          3826  61444

 6   purchase_12mon          3833  61443

 7   purchase_12mon          3849  61442

 8   purchase_12mon          3850  61441

 9   purchase_12mon          3851  61440

 10          DemAge            79      1

 11          DemAge            78      2

 12          DemAge            77      3

 13          DemAge            76      4

 14          DemAge            75      5

 15          DemAge             .     63

 16          DemAge            18     62

 17          DemAge            19     61

 18          DemAge            20     60

 19          DemAge            21     59

 

[TopkMisc]

 

 Miscellaneous Information on Distinct Values for ORGANICS

 

            Column      N  TruncatedTopk  TruncatedBtmk  ScoreOther

 0  purchase_12mon  61444              0              0         NaN

 1          DemAge     63              0              0         NaN

The smallest reported value of DemAge is the missing value (formatted as '.'). By default, the includemissing flag for the topk action is enabled, and missing values are treated as the lowest machine value, which is why the missing value is reported as the bottom value of DemAge in the preceding example. To exclude missing values from the report, set includemissing to False.
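A sketch of the same request with missing values excluded:

# Exclude missing values from the bottom-k report
organics['DemAge'].topk(topk=5, bottomk=5, includemissing=False)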

The topk action not only reports the smallest and largest values of a variable; it can also report the levels of a categorical variable that have the largest or smallest values with respect to another score variable. In the following example, the topk action returns the TV regions with the highest and lowest average customer affluence scores:

In [15]: organics['DemTVReg'].topk(topk=3, bottomk=3,

    ...:                               weight='DemAffl', agg='mean')

Out[15]:

[Topk]

 

 Top and Bottom Distinct Values for ORGANICS

 

      Column     FmtVar  Rank     Score

 0  DemTVReg    Unknown     1  8.968397

 1  DemTVReg       East     2  8.921506

 2  DemTVReg  Yorkshire     3  8.752555

 3  DemTVReg     Ulster    14  8.493976

 4  DemTVReg     N East    13  8.512684

 5  DemTVReg     N West    12  8.533601

 

[TopkMisc]

 

 Miscellaneous Information on Distinct Values for ORGANICS

 

      Column   N  TruncatedTopk  TruncatedBtmk  ScoreOther

 0  DemTVReg  14              0              0    8.716037

The first result table shows that Unknown, East, and Yorkshire are the top three TV regions, with the highest average affluence scores (8.97, 8.92, and 8.75). Ulster, N East, and N West are the bottom three TV regions, with the lowest average affluence scores (8.49, 8.51, and 8.53). The second table indicates that there are 14 TV regions in this data set and that the average affluence score for the remaining TV regions (the ScoreOther column) is 8.72.

You can also apply aggregators other than sample average in the topk action. For example, let’s get the TV regions with the highest and lowest total sales amounts.

In [16]: organics['DemTVReg'].topK(topk=3, bottomk=3,

    ...:                               weight='purchase_3mon', agg='sum')

Out[16]:

[Topk]

 

 Top and Bottom Distinct Values for ORGANICS

 

      Column      FmtVar  Rank         Score

 0  DemTVReg      London     1  4.468443e+08

 1  DemTVReg    Midlands     2  2.255111e+08

 2  DemTVReg  S & S East     3  1.765412e+08

 3  DemTVReg      Border    14  1.465313e+07

 4  DemTVReg      Ulster    13  1.919992e+07

 5  DemTVReg      N Scot    12  2.374323e+07

 

[TopkMisc]

 

 Miscellaneous Information on Distinct Values for ORGANICS

 

      Column   N  TruncatedTopk  TruncatedBtmk    ScoreOther

 0  DemTVReg  14              0              0  6.980542e+08

The aggregators that are supported by the topk action are max, mean, min, n, and sum.

Cross Tabulations

Crosstab is another useful action from the simple action set for categorical data exploration. This action requires a row variable and a col variable to generate a two-way frequency table.

In [17]: result = organics.crosstab(row='DemAgeGroup', col='DemGender')

    ...: result

Out[17]:

[Crosstab]

 

   DemAgeGroup      Col1      Col2      Col3

 0      middle  550240.0  258172.0  183388.0

 1      senior  278312.0  143412.0  110732.0

 2     unknown   61864.0   30172.0   22572.0

 3       young   32908.0   10184.0    6992.0

The crosstab output table has a DemAgeGroup column that indicates the levels of the row variable (DemAgeGroup) in the two-way frequency table. However, the levels of the column variable (DemGender) are not displayed in the output table, because the crosstab action stores them in the labels of the output columns instead of in their names:

In [18]: result['Crosstab'].colinfo

    ...:

Out[18]:

{'Col1': SASColumnSpec(attrs=dict(Template='Col'), dtype='double', label='F', name='Col1', size=(1, 1), width=8),

'Col2': SASColumnSpec(attrs=dict(Template='Col'), dtype='double', label='M', name='Col2', size=(1, 1), width=8),

'Col3': SASColumnSpec(attrs=dict(Template='Col'), dtype='double', label='U', name='Col3', size=(1, 1), width=8),

'DemAgeGroup': SASColumnSpec(attrs=dict(Template='Row'), dtype='varchar', name='DemAgeGroup', size=(1, 1), width=7)}

In Chapter 3, we explained that the output tables of CAS actions are extensions of the DataFrames that are defined by the Pandas package. Such extensions allow the output table to contain extra metadata that is contained in typical SAS data sets. Examples of metadata include the column label, the column format, and the data type. Such information is stored in the colinfo attribute of the output DataFrame.

In the preceding crosstab example, the levels of the column variable are stored in the output column labels instead of in the column names, mainly because, in some situations, the levels might contain special characters that are not allowed in column names. You can retrieve the levels from the colinfo dictionary, and you can use the rename function to display the complete two-way frequency table:

In [19]: df = result['Crosstab']

    ...: labels = {k: v.label for k, v in df.colinfo.items() if v.label}

    ...: df = df.rename(columns = labels)

    ...: df

Out[19]:

  DemAgeGroup         F         M         U

0      middle  550240.0  258172.0  183388.0

1      senior  278312.0  143412.0  110732.0

2     unknown   61864.0   30172.0   22572.0

3       young   32908.0   10184.0    6992.0

Similar to the SAS FREQ procedure, you can request chi-square tests of homogeneity and measures of association.

In [20]: organics.crosstab(row='DemAgeGroup', col='DemGender',

                        association=True, chisq=True)

Out[20]:

[Crosstab]

 

   DemAgeGroup      Col1      Col2      Col3

 0      middle  550240.0  258172.0  183388.0

 1      senior  278312.0  143412.0  110732.0

 2     unknown   61864.0   30172.0   22572.0

 3       young   32908.0   10184.0    6992.0

 

[Association]

 

                             Statistic     Value       ASE   LowerCL  

 0                               Gamma  0.022513  0.001227  0.020107   

 1                     Kendall's Tau-B  0.012913  0.000706  0.011529   

 2                      Stuart's Tau-c  0.011092  0.000606  0.009904   

 3                                           NaN       NaN       NaN   

 4                       Somers' D C|R  0.013438  0.000735  0.011998   

 5                       Somers' D R|C  0.012408  0.000678  0.011079   

 6                                           NaN       NaN       NaN   

 7               Lambda Asymmetric C|R  0.000000  0.000000  0.000000   

 8               Lambda Asymmetric R|C  0.000000  0.000000  0.000000   

 9                    Lambda Symmetric  0.000000  0.000000  0.000000   

 10                                          NaN       NaN       NaN   

 11        Uncertainty Coefficient C|R  0.001286  0.000039  0.001210   

 12        Uncertainty Coefficient R|C  0.001332  0.000040  0.001253   

 13  Uncertainty Coefficient Symmetric  0.001309  0.000039  0.001231   

 

      UpperCL  

 0   0.024919  

 1   0.014296  

 2   0.012280  

 3        NaN  

 4   0.014877  

 5   0.013737  

 6        NaN  

 7   0.000000  

 8   0.000000  

 9   0.000000  

 10       NaN  

 11  0.001362  

 12  0.001411  

 13  0.001386  

 

[ChiSq]

 

                      Statistic   DF        Value  Prob

 0                   Chi-Square  6.0  4284.485198   0.0

 1  Likelihood Ratio Chi-Square  6.0  4334.499685   0.0

A two-way frequency table is usually displayed as a grouped or stacked bar chart. We can use the Bokeh package to visualize the output from the crosstab action.

In [21]: from bokeh.charts import Bar, output_file, show

    ...: from bokeh.charts.operations import blend

    ...: from bokeh.charts.attributes import cat, color

    ...:

    ...: result1 = organics.crosstab(row='DemGender', col='DemAgeGroup')

    ...: # rename output columns

    ...: df1 = result1['Crosstab']

    ...: df1.columns = ['DemGender','middle','senior','unknown','young']

    ...:

    ...: bar = Bar(df1,

    ...:            values=blend('middle','senior','unknown','young',

    ...:                         name='counts', labels_name='AgeGroup'),

    ...:           label='DemGender',

    ...:           group='AgeGroup',

    ...:           xlabel='Gender',

    ...:           ylabel='Frequency',

    ...:           legend='top_right')

    ...:

    ...: output_file('bar.html')

    ...: show(bar)

image

Note that we use the blend function from Bokeh to transform a Pandas DataFrame from a wide format to a long format. Likewise, you can use the Pandas melt function to reshape the data before you call the Bokeh package. You can also replace the group option with the stack option to generate a stacked bar chart instead of a grouped bar chart.
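For reference, a sketch of the same wide-to-long reshaping with the Pandas melt function; the resulting long-format DataFrame is the shape that the blend call produces for the chart:

# Reshape the renamed crosstab output from wide to long format
long_df = pd.melt(df1, id_vars='DemGender',
                  var_name='AgeGroup', value_name='counts')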

The crosstab action can do more than frequency counts. Similar to the topk action, the crosstab action can be used to specify an aggregation variable for each cell in the two-way table and to specify the type of aggregation. The following example shows how you can compute the total recent three-month sales amounts for the same two-way frequency table that is shown in the preceding example:

In [22]: result = organics.crosstab(row='DemAgeGroup', col='DemGender',

                                    weight='purchase_3mon', aggregators='sum')

    ...: df = result['Crosstab']

    ...: labels = {k: v.label for k, v in df.colinfo.items() if v.label}

    ...: df = df.rename(columns = labels)

    ...: df

Out[22]:

  DemAgeGroup             F             M             U

0      middle  5.227447e+08  2.452811e+08  1.742326e+08

1      senior  2.643612e+08  1.362322e+08  1.052163e+08

2     unknown  5.878180e+07  2.866586e+07  2.143959e+07

3       young  3.127620e+07  9.675101e+06  6.640414e+06

The results show that the female customers within the middle age group make the largest contribution in terms of purchase amount. Another difference between the crosstab action and the SAS FREQ procedure is that the crosstab action supports binning of numeric variables.

In [23]: result = organics.crosstab(row='DemAgeGroup',

                      col='purchase_3mon', colnbins=4, chisq=True)

    ...: df = result['Crosstab']

    ...: labels = {k: v.label for k, v in df.colinfo.items() if v.label}

    ...: df = df.rename(columns = labels)

    ...: df

Out[23]:

  DemAgeGroup  ( 500,  700]  ( 700,  900]  ( 900, 1100]  

0      middle           1.0      157358.0      833045.0   

1      senior           0.0       84901.0      446820.0   

2     unknown           0.0       18225.0       96230.0   

3       young           0.0        7835.0       42188.0   

 

   (1100, 1300]  

0        1396.0  

1         735.0  

2         153.0  

3          61.0  

 

In [24]: result['ChiSq']

Out[24]:

                     Statistic   DF     Value      Prob

0                   Chi-Square  9.0  6.223867  0.717321

1  Likelihood Ratio Chi-Square  9.0  6.641251  0.674414

In this case, the row variable DemAgeGroup is a categorical variable, and the column variable purchase_3mon is numeric. The crosstab action uses equal-space binning to discretize purchase_3mon into four buckets before it computes the two-way frequency table. The p-values of the chi-square tests show no significant association between these two variables.

Similar to the SAS FREQ procedure, the crosstab action also supports frequency tables with more than two dimensions. You must put the additional dimensions into the group-by list. The next example returns a three-way frequency table as a collection of six different two-way frequency tables, one for each level of DemReg. For each table, you can also compute the chi-square test of homogeneity.

In [25]: organics.groupby='DemReg'

    ...: result2 = organics.crosstab(row='DemAgeGroup',

    ...:                              col='DemGender', chisq=True)

    ...: for table_name in result2:

    ...:     df = result2[table_name]

    ...:     if 'ChiSq' in table_name:

    ...:        print(df[df['Statistic']=='Chi-Square'])

    ...:

 

           Statistic   DF        Value  Prob

DemReg                                      

Midlands  Chi-Square  6.0  2256.130597   0.0

         Statistic   DF        Value           Prob

DemReg                                             

North   Chi-Square  6.0  1096.161298  1.412025e-233

           Statistic   DF       Value          Prob

DemReg                                             

Scottish  Chi-Square  6.0  146.563561  4.120410e-29

             Statistic   DF        Value  Prob

DemReg                                        

South East  Chi-Square  6.0  2959.658738   0.0

             Statistic   DF        Value           Prob

DemReg                                                 

South West  Chi-Square  6.0  1137.124002  1.935146e-242

          Statistic   DF       Value           Prob

DemReg                                             

Unknown  Chi-Square  6.0  913.565675  4.385563e-194

Variable Transformation and Dimension Reduction

Once you have developed some understanding of a data set using the data exploration actions that were introduced in the previous two sections, it is often useful to transform some data columns because of missingness or nonlinearity. The datapreprocess action set provides a set of actions for variable imputation, variable binning, and dimension reduction.

In [1]: conn.loadactionset('datapreprocess')

   ...: conn.help(actionset='dataPreprocess')

Out[1]:

[dataPreprocess]

         Name  Description

0     rustats  Computes robust univariate statistics, centralized
               moments, and quantiles

1      impute  Performs data matrix (variable) imputation

2     outlier  Performs outlier detection and treatment

3     binning  Categorizes continuous variables into bins

4  discretize  Performs supervised and unsupervised variable
               discretization

5    catTrans  Groups and encodes categorical variables using
               unsupervised and supervised grouping techniques

6   histogram  Generates histogram bins and simple bin-based statistics
               for numeric variables

7   transform  Performs pipelined variable imputation, outlier detection
               and treatment, functional transformation, binning, and
               robust univariate statistics to evaluate the quality of
               the transformation

Variable Binning

We have shown the histogram action, which performs equal-space binning on one or more columns and returns lower bounds, upper bounds, and frequencies of the bins for visualization purposes. The histogram action does not actually bin the data or store the binned results; that is accomplished by the binning action from the datapreprocess action set.

In [2]: out_data = conn.CASTable('binnedData', replace=True)

   ...:

   ...: result = organics.binning(

   ...:     inputs='purchase_3mon',

   ...:     tech='bucket',

   ...:     casout=out_data,

   ...:     nBinsArray=10

   ...: )

   ...: out_data.head(10)

Out[2]:

Selected Rows from Table BINNEDDATA

 

   BIN_purchase_3mon

0                5.0

1                7.0

2                6.0

3                5.0

4                5.0

5                4.0

6                5.0

7                5.0

8                7.0

9                6.0

In this example, we perform bucket binning (also known as histogram binning or equal-space binning) on the purchase_3mon column with the number of bins set to 10. You must also specify an output CAS table to contain the binned result; in most CAS actions, the option for specifying an output table is casout. In the preceding example, the result table contains only one column, the bin assignment (BIN_purchase_3mon).

The result object is a CASResults object that is downloaded to the Python client, and binnedData is actually a CAS table that is stored on the CAS server. If the CAS server is remote, the CAS table is not downloaded to the Python client. You can always print out some observations from the casout table using the head method or the fetch action.
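For example, the fetch action provides the same kind of preview as the head method. A sketch using the to parameter to limit the number of rows that are returned:

# Download the first five rows of the binned output table
out_data.fetch(to=5)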

Unlike the histogram action, the binning action produces more than one result table, including VarTransInfo for information about the variable transformations, BinDetails for the actual binning information, and OutputCasTables for the casout table. The last table contains information about the CAS table such as the table name, the library name, the number of observations, and the number of columns.

In [3]: result

Out[3]:

[VarTransInfo]

 

 Variable Discretization Information for ORGANICS

 

         Variable          ResultVar  NBins

 0  purchase_3mon  BIN_purchase_3mon   10.0

 

[BinDetails]

 

 Bin Details for ORGANICS

 

         Variable  BinId  BinLowerBnd  BinUpperBnd  BinWidth  NInBin  

 0  purchase_3mon      1      698.440      747.402    48.962      45   

 1  purchase_3mon      2      747.402      796.364    48.962    1802   

 2  purchase_3mon      3      796.364      845.326    48.962   28782   

 3  purchase_3mon      4      845.326      894.288    48.962  193828   

 4  purchase_3mon      5      894.288      943.250    48.962  529051   

 5  purchase_3mon      6      943.250      992.212    48.962  597872   

 6  purchase_3mon      7      992.212     1041.174    48.962  279514   

 7  purchase_3mon      8     1041.174     1090.136    48.962   53719   

 8  purchase_3mon      9     1090.136     1139.098    48.962    4214   

 9  purchase_3mon     10     1139.098     1188.060    48.962     121   

 

           Mean        Std      Min      Max  

 0   735.920444  11.629722   698.44   747.30  

 1   783.902153  10.767204   747.72   796.36  

 2   830.094716  12.081154   796.37   845.32  

 3   875.761523  13.155664   845.33   894.28  

 4   921.156725  13.782269   894.29   943.24  

 5   966.341785  13.869978   943.25   992.21  

 6  1011.694702  13.360849   992.22  1041.17  

 7  1057.383763  12.429201  1041.18  1090.12  

 8  1103.745873  11.365814  1090.14  1139.07  

 9  1151.365785  10.677088  1139.32  1188.06  

 

[OutputCasTables]

 

                 casLib            Name     Rows  Columns  

 0  CASUSERHDFS(username)  cas.binnedData  1688948        1   

 

                                   casTable  

 0  CASTable('cas.binnedData', caslib='C...  
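The BinDetails table also confirms the arithmetic behind bucket binning: the bin width is the variable's range divided by the number of bins. A minimal sketch of that arithmetic, using the min and max of purchase_3mon from the earlier summary output:

# bucket binning divides [min, max] into equal-width intervals
vmin, vmax, nbins = 698.44, 1188.06, 10
width = (vmax - vmin) / nbins             # 48.962, matching BinDetails
# a value such as 950 falls in bin 6: floor((950 - vmin) / width) + 1
bin_id = int((950.0 - vmin) / width) + 1  # 6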

You can specify multiple variables in the binning action, and you can use a different number of bins for each column as well.

In [4]: result = organics.binning(

   ...:     inputs=['purchase_3mon', 'purchase_6mon',

   ...:             'purchase_9mon', 'purchase_12mon'],

   ...:     tech='bucket',

   ...:     casout=out_data,

   ...:     nbinsarray=[4, 10, 20, 6]

   ...: )

   ...: out_data.head(10)

Out[4]:

Selected Rows from Table CAS.BINNEDDATA

 

   BIN_purchase_12mon  BIN_purchase_3mon  BIN_purchase_6mon  

0                 3.0                2.0                6.0   

1                 5.0                3.0                7.0   

2                 3.0                3.0                6.0   

3                 3.0                2.0                6.0   

4                 3.0                2.0                4.0   

5                 3.0                2.0                5.0   

6                 3.0                2.0                5.0   

7                 3.0                2.0                5.0   

8                 4.0                3.0                7.0   

9                 4.0                3.0                6.0   

 

   BIN_purchase_9mon  

0                9.0  

1               14.0  

2               11.0  

3               10.0  

4               10.0  

5                9.0  

6                8.0  

7               10.0  

8               12.0  

9               12.0  

The binning action supports three binning techniques:

Technique Description
Bucket    Equal-width binning. Each bin has the same width.
Cutpts    Creates bins according to user-specified cutpoints.
Quantile  Equal-frequency binning. Each bin has approximately the same number of observations.

The following example shows how to perform a quantile binning of the four purchase amount columns. It also uses the copyallvars flag to copy all columns from the input table to the casout table.

In [5]: out_data2 = conn.CASTable('binnedData2', replace=True)

   ...: result = organics.binning(

   ...:     inputs = ['purchase_3mon', 'purchase_6mon',

   ...:              'purchase_9mon', 'purchase_12mon'],

   ...:     tech = 'Quantile',

   ...:     casout = out_data2,

   ...:     copyallvars = True,

   ...:     nbinsarray = [4, 4, 4, 4]

   ...: )

   ...:

   ...: out_data2.columns

 

Out[5]:

Index(['ID', 'DemAffl', 'DemAge', 'DemGender', 'DemHomeowner',

       'DemAgeGroup', 'DemCluster', 'DemReg', 'DemTVReg', 'DemFlag1',

       'DemFlag2', 'DemFlag3', 'DemFlag4', 'DemFlag5', 'DemFlag6',

       'DemFlag7', 'DemFlag8', 'PromClass', 'PromTime', 'TargetBuy',

       'Bought_Beverages', 'Bought_Bakery', 'Bought_Canned',

       'Bought_Dairy', 'Bought_Baking', 'Bought_Frozen',

       'Bought_Meat', 'Bought_Fruits', 'Bought_Vegetables',

       'Bought_Cleaners', 'Bought_PaperGoods', 'Bought_Others',

       'purchase_3mon', 'purchase_6mon', 'purchase_9mon',

       'purchase_12mon', 'BIN_purchase_12mon', 'BIN_purchase_3mon',

       'BIN_purchase_6mon', 'BIN_purchase_9mon'],

      dtype='object')

You can get some idea about the binned data using the data exploration actions that were introduced previously. For example, you can generate a cross tabulation of two binned purchase amount columns.

In [6]: out_data2.crosstab(row='bin_purchase_3mon',

                           col='bin_purchase_12mon')

Out[6]:

[Crosstab]

 

   BIN_purchase_3mon      Col1      Col2      Col3      Col4

 0                 1  203175.0  117851.0   70718.0   30483.0

 1                 2  117668.0  124401.0  109169.0   70929.0

 2                 3   71142.0  108972.0  124566.0  117614.0

 3                 4   30231.0   70984.0  117812.0  203233.0

The binning action also provides a few options for cutpoints and for assigning observations to bins. The binning options are listed in the following table:

Binning Option Description Default Value
binmapping='left' | 'right' Controls how values that fall on the boundary between consecutive bins are mapped. left expresses the bins with [), [), ..., [] notation; right expresses them with [], (], ..., (] notation. left
binmissing=True | False When set to True, missing values are stored in a separate bin. The ID for this bin is 0. False
cutpts=list Specifies the user-provided cutpoints for the cutpts binning technique. Null
copyallvars=True | False When set to True, all the variables from the input table are copied to the scored output table. False
copyvars=list Specifies the names of the variables in the input table to use for identifying scored observations in the output table. The specified variables are copied to the output table. Null
includeinputvars=True | False When set to True, the analysis variables from the input table that are specified in the inputs parameter are copied to the output table. False
pctldef=integer Specifies the percentile definition to use. The definitions are numbered from 1 to 5. The default value, 0, specifies not to use a percentile definition. 0
pctlmaxniters=integer Specifies the maximum number of iterations for percentile computation.
pctltol=double Specifies the tolerance for percentile computation. 1e-05
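For instance, here is a hedged sketch that combines the cutpts and binmissing options from the preceding table; the cutpoint values are arbitrary illustrations, not recommendations:

# a minimal sketch of cutpoint binning, assuming illustrative boundaries
result = organics.binning(
    inputs='purchase_3mon',
    tech='cutpts',
    cutpts=[800, 900, 1000, 1100],  # user-specified bin boundaries
    binmissing=True,                # missing values, if any, go to bin ID 0
    casout=out_data
)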

Another difference between the binning action and the histogram action is that the binning action can generate SAS DATA step code that represents the binning model. You can download and save this code for later use in scoring new observations, or apply it to score a CAS table directly. Such DATA step code is often called score code when it represents a data transformation or an analytical model.

The next example requests the score code for a bucket binning of the purchase_3mon variable and saves it in a local result table. You can then convert the score code that is stored in result['CodeGen'] into a text file using Pandas.

In [7]: result = organics.binning(

   ...:     inputs='purchase_3mon',

   ...:     tech='bucket',

   ...:     casout=out_data,

   ...:     code={'comment':True, 'tabform':True},

   ...:     nBinsArray=10

   ...: )

   ...:

   ...: # the score code is saved in this result table
   ...: df = result['CodeGen']
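To write the generated code to a local .sas file, a minimal sketch (the file name is an arbitrary choice):

# join the SASCode column into one string and save it locally;
# fillna('') guards against blank rows in the generated listing
with open('binning_score.sas', 'w') as f:
    f.write('\n'.join(df['SASCode'].fillna('').astype(str)))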

Variable Imputation

The impute action from the datapreprocess action set manages missing values. You can use the distinct action to check which columns have missing values.

In [8]: df = organics.distinct()['Distinct']

   ...: df[df['NMiss'] > 0]

Out[8]:

Distinct Counts for CAS.ORGANICS

     Analysis  Number of Distinct  Number of Missing  Truncated
     Variable              Values             Values     or Not
 1    DemAffl                  34              82460          0
 2     DemAge                  63             114608          0
18   PromTime                  40              21356          0

Let’s run the impute action for the promotion time variable.

In [9]: organics[['PromTime']].impute()

Out[9]:

[ImputeInfo]

 

 Imputation Information for ORGANICS

 

    Variable ImputeTech     ResultVar          N    NMiss  

 0  PromTime       Mean  IMP_PromTime  1667592.0  21356.0   

 

    ImputedValueContinuous  

 0                 6.56467  

The result table of the impute action contains information about the treatment of missing values for each specified variable. PromTime is a numeric variable, so by default the impute action selects the sample mean as the value to fill in the observations with missing PromTime values.

The preceding example only suggests the imputation method; it does not actually change the data. You must specify a casout table for the impute action in order to create a new data set with imputed values.

In [10]: out_data = conn.CASTable('imputedData1', replace=True)

    ...:

    ...: impute1 = organics[['PromTime']].Impute()

    ...: impute1.methodcontinuous = 'Median'

    ...: impute1.copyallvars = True

    ...: impute1.casout = out_data

    ...: impute1()

Out[10]:

[ImputeInfo]

 

 Imputation Information for ORGANICS

 

    Variable ImputeTech     ResultVar          N    NMiss  

 0  PromTime     Median  IMP_PromTime  1667592.0  21356.0   

 

    ImputedValueContinuous  

 0                     5.0  

 

[OutputCasTables]

 

                 casLib          Name     Rows  Columns  

 0  CASUSERHDFS(username)  imputedData1  1688948       37   

 

                                   casTable  

 0  CASTable('imputedData1', caslib='CAS...  

In this example, we create a CASAction object impute1, add a casout table to the impute action, and request that all columns be copied from the input table to the casout table. Also, the imputation technique is changed from mean to median. The impute action also enables you to specify the value to impute.

In [11]: impute1.methodcontinuous = 'Value'

    ...: impute1.valuescontinuous = [0]

    ...: impute1()

Out[11]:

[ImputeInfo]

 

 Imputation Information for ORGANICS

 

    Variable ImputeTech     ResultVar          N    NMiss  

 0  PromTime      Value  IMP_PromTime  1667592.0  21356.0   

 

    ImputedValueContinuous  

 0                     0.0  

 

[OutputCasTables]

 

                 casLib          Name     Rows  Columns  

 0  CASUSERHDFS(username)  imputedData1  1688948       37   

 

                                   casTable  

 0  CASTable('imputedData1', caslib='CAS...  

The methodcontinuous option and the methodnominal option define the imputation treatment for continuous and categorical variables, respectively. The default imputation techniques are mean for continuous variables and mode for categorical variables.

The following table lists techniques for continuous variables:

Technique Description
MAX Replaces missing values with the maximum value.
MEAN Replaces missing values with the mean.
MEDIAN Replaces missing values with the median.
MIDRANGE Replaces missing values with the mean of the maximum value and the minimum value.
MIN Replaces missing values with the minimum value.
RANDOM Replaces missing values with uniform random numbers.
VALUE Replaces missing values with the values that are specified in the valuescontinuous parameters.

The following table lists techniques for categorical variables:

Technique Description
MODE Replaces missing values with the mode.
VALUE Replaces missing values with the values that are specified in the valuesnominal parameters.
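For example, here is a hedged sketch of nominal imputation with the VALUE technique; it assumes a categorical column with missing values, and the replacement value is an arbitrary illustration:

# a minimal sketch, assuming DemAgeGroup contains missing values;
# methodnominal and valuesnominal follow the preceding table
impute2 = organics[['DemAgeGroup']].Impute()
impute2.methodnominal = 'Value'
impute2.valuesnominal = ['Unknown']   # hypothetical replacement value
impute2()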

Similar to other actions in the datapreprocess action set, the impute action also generates score code that you can download so that you can repeat the imputation for a new data source.

In [12]: impute1.methodcontinuous = 'Median'

    ...: impute1.code = {'comment':True}

    ...: impute1()['CodeGen']

Out[12]:

 

                                      Score code for Impute action
                                                  SASCode

 0  /*-----------------------------------------------------------------------------

 1  SAS Code Generated by Cloud Analytic Server for Impute Action

 2  Date : 07Mar2016:11:12:45

 3  Number of variables : 1

 4  -----------------------------------------------------------------------------*/

 5  

 6  _igby_ = 0;

 7  _tnn_ntrans_ = 1;

 8  

 9  _fuzcmp_ = 0.000000000100;

10   

11   array _tnn_vnames_{1} imp_PromTime ;

12   

13   array _vnn_names_{1} PromTime ;

14   

15   array _tnn_ntransvars_{1} _temporary_ (1 );

16   

17   array _tv_nn_indices_{1} _temporary_ (1 );

18   

19   imp_PromTime = .;

20   

21   array _tnn_imputetype_{1} _temporary_ (2 );

22   

23   

24   array _tnn_imputeuniquevals_{1} _temporary_ (5 );

25   

26   

27   /*---------Iterate and score----------------*/

28   

29   /*---------Count variables----------------*/

30   _ct_ = 0;

31   _impct_ = 0;

32   

33   do _i_ = 1 to _tnn_ntrans_;

34   do _j_ = 1 to _tnn_ntransvars_{_i_};

35   

36   if (_tnn_imputetype_{_i_} ~= 0) then

37   _impct_ + 1;

38   _ct_ + 1;

39   _numval_ = _vnn_names_{_tv_nn_indices_{_ct_}};

40   

41   /*-------Apply Imputation phase--------------*/

42   if missing(_numval_) then

43   do;

44   if _tnn_imputetype_{_i_} = 0 then

45   goto _impute_done1_;

46   else _numval_ = _tnn_imputeuniquevals_{1 *_igby_ + _impct_};

47   end;

48   _impute_done1_:;

49   _tnn_vnames_{_ct_} = _numval_;

50   end;

51   end;

52   

53   drop _igby_ _tnn_ntrans_ _fuzcmp_ _ct_ _impct_ _i_ _j_ _numval_ ;
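As noted earlier, you can also apply the generated score code to a CAS table directly. Here is a hedged sketch that uses the runCode action from the dataStep action set; the output table name is a placeholder:

# a minimal sketch, assuming the generated code is joined into one string
score_code = '\n'.join(impute1()['CodeGen']['SASCode'].fillna('').astype(str))

conn.loadactionset('datastep')   # the dataStep action set provides runCode
conn.datastep.runcode(code='data imputedData2; set organics;\n'
                           + score_code + '\nrun;')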

Conclusion

In this chapter, we introduced actions from the simple and dataPreprocess action sets for data exploration, data summarization, and basic variable transformation. These actions enable you to learn about your data using different techniques for summarizing continuous and categorical variables. You can also apply variable transformations such as binning and imputation to reshape your data. Now that you understand your data, let's move on to Chapter 8 and Chapter 9 to build analytic models.
