Creating scalable solutions

You can use the scalable server resources from the client. You start development in RStudio or R Tools for Visual Studio by setting the execution context to the server. You do this with a function from the RevoScaleR package, which is loaded into memory at the beginning of the following R code. The code sets the execution context to SQL Server, in the context of the AdventureWorksDW2014 database. Remember, the dbo.vTargetMail view comes from this database. Also note that the RUser login used to connect to SQL Server needs permission to execute the sys.sp_execute_external_script procedure. For the sake of simplicity, I simply added the RUser database user in the AdventureWorksDW2014 database on my local SQL Server instance to the db_owner database role. The following code changes the execution context to SQL Server:

library(RevoScaleR); 
sqlConnStr <- "Driver=SQL Server;Server=SQL2017EIM;Database=AdventureWorksDW2014;Uid=RUser;Pwd=Pa$$w0rd"; 
# Note the doubled backslashes: a single backslash is an escape character in R strings 
sqlShare <- "C:\\SQL2017DevGuide"; 
chunkSize <- 1000; 
srvEx <- RxInSqlServer(connectionString = sqlConnStr, shareDir = sqlShare, 
                       wait = TRUE, consoleOutput = FALSE); 
rxSetComputeContext(srvEx); 

The parameters define the connection string to my SQL Server instance, the shared folder used to exchange data between SQL Server and the R engine, and the chunk size, which is used later when reading the data. Please note that you need to change the server name to the name of your own SQL Server instance. The RxInSqlServer() function creates a compute context object for SQL Server. The wait parameter defines whether the execution in SQL Server is blocking, meaning that control does not return to the client until the execution finishes, or non-blocking. The consoleOutput parameter defines whether the output of the R code started by SQL Server should be returned to the user console. Finally, the rxSetComputeContext() function actually sets the execution context to SQL Server.
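
If you want to confirm which compute context is currently active before running anything on the server, you can query it with the rxGetComputeContext() function (a quick check; the exact printout depends on your RevoScaleR version):

# Check the currently active compute context 
rxGetComputeContext(); 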

After the execution context has been set to SQL Server, you can try the other scalable functions. For example, rxImport() can be used to import data from a comma-separated values (CSV) file into a data frame. The rowsPerRead parameter tells the function to read the data in batches, using the chunk size defined earlier in the code. A batch size of 1,000 rows is quite small; it is used here just to show how the import in chunks works. For larger datasets, you should use much larger batches, and test which size works best for your datasets and the processing power you have:

TMCSV <- rxImport(inData = "C:\\SQL2017DevGuide\\Chapter13_TM.csv", 
                  stringsAsFactors = TRUE, type = "auto", 
                  rowsPerRead = chunkSize, reportProgress = 3); 

The reportProgress parameter set to 3 produces detailed per-chunk progress output. The abbreviated result of the previous code is as follows:

ReadNum=1, StartRowNum=1, CurrentNumRows=1000, TotalRowsProcessed=1000, ReadTime=0.01, ProcessDataTime = 0, LoopTime = 0.01
ReadNum=2, StartRowNum=1001, CurrentNumRows=1000, TotalRowsProcessed=2000, ReadTime=0.007, ProcessDataTime = 0.002, LoopTime = 0.007
...
Overall compute summaries time: 0.133 secs.
Total loop time: 0.132
Total read time for 19 reads: 0.115
Total process data time: 0.041
Average read time per read: 0.00605263
Average process data time per read: 0.00215789
Number of threads used: 2

You can see that the chunk size was really 1,000 rows, how much time was needed for each chunk, the total time, the number of threads used, and more. This confirms that RevoScaleR functions use parallelism. Note that you might get a different number of threads, depending on your system configuration and available resources.
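
If you want to experiment with the degree of parallelism, you can cap the number of cores RevoScaleR functions use with the rxOptions() function (a sketch with an illustrative value; the effective limit also depends on your SQL Server edition and available system resources):

# Limit RevoScaleR functions to two cores (illustrative value only) 
rxOptions(numCoresToUse = 2); 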

The next code reads the same data again, this time from SQL Server. Note that a separate ODBC connection is not needed; the code already executes on the server side, in the context of the AdventureWorksDW2014 database. The RxSqlServerData() function generates a SQL Server data source object. You can think of it as a proxy object for the SQL Server rowset that is the result of the query:

TMquery <-  
"SELECT CustomerKey, MaritalStatus, Gender, 
  TotalChildren, NumberChildrenAtHome, 
  EnglishEducation AS Education, 
  EnglishOccupation AS Occupation, 
  HouseOwnerFlag, NumberCarsOwned, CommuteDistance, 
  Region, BikeBuyer, 
  YearlyIncome, Age 
 FROM dbo.vTargetMail"; 
sqlTM <- RxSqlServerData(sqlQuery = TMquery, 
                         connectionString = sqlConnStr, 
                         stringsAsFactors = TRUE, 
                         rowsPerRead = chunkSize); 
TMSQL <- rxImport(inData = sqlTM, reportProgress = 3); 

The sqlTM object is a pointer to the SQL Server data and exposes the metadata of the query's result set to the client R code. Note that the last line creates a new data frame and physically transfers the data to the client. Therefore, if you executed the code in this section step by step, you should now have two data frames, TMCSV and TMSQL, with data in local client memory, plus the sqlTM data source object, which you can use like a data frame. You can see the difference between a local data frame and the data source object if you get info about them with the rxGetInfo() function:

rxGetInfo(TMCSV); 
rxGetInfo(sqlTM); 

The previous code returns the following result:

Data frame: TMCSV 
Number of observations: 18484 
Number of variables: 14 
    
Connection string: Driver=SQL Server;Server=SQL2017EIM;
 Database=AdventureWorksDW2014;Uid=RUser;Pwd=Pa$$w0rd 
Data Source: SQLSERVER 

You get the details about the variables of the SQL Server data source object with the rxGetVarInfo() function. You can get summary statistics and different cross tabulations of the SQL Server data with the rxSummary(), rxCrossTabs(), and rxCube() functions, and you can create histograms with the rxHistogram() function. All of these functions use the SQL Server execution context. The following code shows how to use them, starting with the variable metadata:

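# Variable metadata of the SQL Server data source object 
rxGetVarInfo(sqlTM); 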
sumOut <- rxSummary( 
  formula = ~ NumberCarsOwned + Occupation + F(BikeBuyer), 
  data = sqlTM); 
sumOut; 
cTabs <- rxCrossTabs(formula = BikeBuyer ~ 
                     Occupation : F(HouseOwnerFlag),  
                     data = sqlTM); 
print(cTabs, output = "counts"); 
print(cTabs, output = "sums"); 
print(cTabs, output = "means"); 
summary(cTabs, output = "sums"); 
summary(cTabs, output = "counts"); 
summary(cTabs, output = "means"); 
cCube <- rxCube(formula = BikeBuyer ~ 
                Occupation : F(HouseOwnerFlag),  
                data = sqlTM); 
cCube; 
rxHistogram(formula = ~ BikeBuyer | MaritalStatus, 
            data = sqlTM); 

Note that all of these scalable functions accept data from the SQL Server data source object. Because they execute on SQL Server, you cannot feed them local data frames; if you try, you get an error. If you want to use the scalable functions with local datasets, you need to switch the execution context back to local.
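
As a quick illustration, you can wrap such a call in try() to observe the failure without interrupting your session (a sketch; the exact error message depends on your RevoScaleR version):

# Still in the SQL Server compute context: a local data frame 
# is not a valid data source here, so this call fails 
errOut <- try(rxSummary(formula = ~ NumberCarsOwned, data = TMCSV)); 
class(errOut); 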

The following code shows how to set the execution context back to the client machine:

rxSetComputeContext("local"); 

The RevoScaleR package also includes a function for finding clusters of similar cases based on the values of the input variables: rxKmeans(), which implements the K-means clustering algorithm. In the following code, it works on a local data frame, defines two clusters, and assigns each case to one of them. The summary() function gives you the details of the clustering model:

TwoClust <- rxKmeans(formula = ~ BikeBuyer + TotalChildren + NumberCarsOwned, 
                     data = TMSQL, 
                     numClusters = 2); 
summary(TwoClust); 
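
Beyond the summary, the object returned by rxKmeans() also exposes the results directly, for instance the cluster centroids and the within-cluster sums of squares (a brief sketch based on the standard components of a K-means result object):

# Centroids of the two clusters in the input variable space 
TwoClust$centers; 
# Within-cluster sums of squares, one value per cluster 
TwoClust$withinss; 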

You can add the cluster membership to the original data frame and rename the variable to a friendlier name:

TMClust <- cbind(TMSQL, TwoClust$cluster); 
names(TMClust)[15] <- "ClusterID"; 
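
To verify that the cluster membership was added correctly, you can peek at a few rows of the new column with base R (an optional quick check):

# Show the first few customers with their assigned cluster 
head(TMClust[, c("CustomerKey", "ClusterID")]); 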

In order to understand the meaning of the clusters, you need to analyze them. The following code creates a figure consisting of three small charts, one per input variable, showing the distribution of that variable in each cluster:

attach(TMClust); 
oldpar <- par(no.readonly = TRUE); 
par(mfrow=c(1,3)); 
 
# NumberCarsOwned and clusters 
nofcases <- table(NumberCarsOwned, ClusterID); 
nofcases; 
barplot(nofcases, 
        main='Number of cars owned and cluster ID',     
        xlab='Cluster Id', ylab ='Number of Cars', 
        legend=rownames(nofcases), 
        col=c("black", "blue", "red", "orange", "yellow"), 
        beside=TRUE); 
# BikeBuyer and clusters 
nofcases <- table(BikeBuyer, ClusterID); 
nofcases; 
barplot(nofcases, 
        main='Bike buyer and cluster ID',     
        xlab='Cluster Id', ylab ='BikeBuyer', 
        legend=rownames(nofcases), 
        col=c("blue", "yellow"), 
        beside=TRUE); 
# TotalChildren and clusters 
nofcases <- table(TotalChildren, ClusterID); 
nofcases; 
barplot(nofcases, 
        main='Total children and cluster ID',     
        xlab='Cluster Id', ylab ='Total Children', 
        legend=rownames(nofcases), 
        col=c("black", "blue", "green", "red", "orange", "yellow"), 
        beside=TRUE); 
 
# Clean up 
par(oldpar); 
detach(TMClust); 

You should already be familiar with this code from the examples earlier in this chapter. The next screenshot shows the results:

The analysis of the clusters