Performing advanced analytics

Python offers many libraries for statistics, data mining, and machine learning. Probably the best known is the scikit-learn package. It provides most of the commonly used algorithms, as well as tools for data preparation and model evaluation.

In scikit-learn, you work with data in a tabular representation by using pandas data frames. The input table (actually a two-dimensional array, not a table in the relational sense) contains the columns used to train the model. Because these columns, or attributes, represent the features of the cases, this table is also called the features matrix. There is no prescribed naming convention; however, in most Python code you will see the features matrix stored in a variable named X.

If you use a directed, or supervised, algorithm, you also need the target variable. It is represented as a vector, or one-dimensional target array, and is commonly stored in a variable named y.
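
For illustration only, here is a minimal sketch of the two shapes, using a small made-up data frame that is not part of the target mail example that follows:

# A minimal sketch with made-up data (not the target mail dataset)
import pandas as pd

df = pd.DataFrame({'Age': [34, 51, 29],
                   'YearlyIncome': [60000, 90000, 45000],
                   'BikeBuyer': [1, 0, 1]})

X = df[['Age', 'YearlyIncome']]   # features matrix: two-dimensional, shape (3, 2)
y = df['BikeBuyer']               # target array: one-dimensional, shape (3,)
print(X.shape, y.shape)           # (3, 2) (3,)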

Without further ado, let's create some mining models. First, the following code imports all necessary libraries for this section:

import numpy as np 
import pandas as pd 
import matplotlib as mpl 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score 
from sklearn.linear_model import LinearRegression 
from sklearn.naive_bayes import GaussianNB 
from sklearn.mixture import GaussianMixture 

Next, let's re-read the target mail data from the CSV file, using the following code:

TM = pd.read_csv(r"C:\SQL2017DevGuide\Chapter15_TM.csv") 
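
It is worth a quick sanity check that the file loaded as expected; this optional step is not part of the original example:

TM.shape    # the number of rows and columns read from the CSV file
TM.head()   # the first few rows, to verify column names and values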

The next step is to prepare the features matrix and the target array. The following code also checks the shape of both:

X = TM[['TotalChildren', 'NumberChildrenAtHome', 
        'HouseOwnerFlag', 'NumberCarsOwned', 
        'YearlyIncome', 'Age']] 
X.shape 
y = TM['BikeBuyer'] 
y.shape 

The first model will be a supervised one, using Naïve Bayes classification. To test the accuracy of the model, you need to split the data into a training set and a test set. You can use the train_test_split() function from the scikit-learn library for this task:

Xtrain, Xtest, ytrain, ytest = train_test_split( 
    X, y, random_state = 0, train_size = 0.7) 

Note that the previous code puts 70% of the data into the training set and 30% into the test set.
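
If you want to verify the split, you can check the shapes of the four resulting objects; this is an optional addition, not part of the original example:

# The row counts of the training and test sets should sum to the row count of X
Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape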

The next step is to initialize and train the model with the training data set, as shown in the following code:

model = GaussianNB() 
model.fit(Xtrain, ytrain) 

That's it. The model is prepared and trained, and you can start using it to make predictions. You can use the test set for the predictions and for checking the accuracy of the model. In the previous chapter, Chapter 14, Data Exploration and Predictive Modeling with R in SQL Server, you learned about the classification matrix, from which you can derive many measures. A very well-known measure is accuracy: the proportion of predictions that were correct, defined as the sum of the true positive and true negative predictions divided by the total number of cases predicted. The following code uses the test set for the predictions and then measures the accuracy:

ymodel = model.predict(Xtest) 
accuracy_score(ytest, ymodel) 
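
To see where that accuracy number comes from, you can derive it yourself from the classification (confusion) matrix; the following lines are an illustrative addition, not part of the original example:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(ytest, ymodel)   # rows: actual classes, columns: predicted classes
# Accuracy = correct predictions (the diagonal) divided by all predictions
cm.diagonal().sum() / cm.sum()         # should match accuracy_score(ytest, ymodel)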

You can see that you can do quite advanced analyses with just a few lines of code. Let's build another model, this time an undirected one, using a clustering algorithm. For this one, you need neither training and test sets nor the target array. The only thing you need to prepare is the features matrix, as shown in the following code:

# .copy() prevents a pandas SettingWithCopyWarning when the Cluster
# column is added to X later in this section
X = TM[['TotalChildren', 'NumberChildrenAtHome', 
        'HouseOwnerFlag', 'NumberCarsOwned', 
        'YearlyIncome', 'Age', 'BikeBuyer']].copy() 

Again, you need to initialize and fit the model. Note that the following code tries to group the cases into two clusters:

model = GaussianMixture(n_components = 2, covariance_type = 'full') 
model.fit(X) 
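
After fitting, scikit-learn exposes the estimated mixture parameters as attributes of the model; inspecting them is an optional extra step, not part of the original example:

model.weights_   # the share of cases assigned to each of the two components
model.means_     # the mean of every input column within each component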

The predict() function for the clustering model returns the cluster membership of each case as a vector. The following code creates this vector and shows it:

ymodel = model.predict(X) 
ymodel 
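
A quick, optional way to see how the cases are distributed between the two clusters is to count the predicted labels:

# numpy is already imported as np at the beginning of this section
np.bincount(ymodel)   # the number of cases assigned to cluster 0 and cluster 1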

You can add the cluster information to the input feature matrix, as shown in the following code:

X['Cluster'] = ymodel 
X.head() 
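
Before turning to a chart, a per-cluster summary of the features can already hint at what separates the two groups; this aggregation is an illustrative addition, not part of the original example:

X.groupby('Cluster').mean()   # the average of each feature within each cluster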

Now you need to understand the clusters, and you can also get this understanding graphically. The following code shows how you can use the seaborn lmplot() function to create a scatterplot showing the cluster membership of the cases spread over income and age:

sns.set(font_scale = 3) 
lm = sns.lmplot(x = 'YearlyIncome', y = 'Age',  
                hue = 'Cluster',  markers = ['o', 'x'], 
                palette = ["orange", "blue"], scatter_kws={"s": 200}, 
                data = X, fit_reg = False, 
                sharex = False, legend = True) 
axes = lm.axes 
axes[0,0].set_xlim(0, 190000) 
plt.show() 

The following figure shows the result. You can see that cluster 0 contains older people with lower income, while cluster 1 consists of younger people whose income is only slightly higher:

Understanding clusters