Basic visualizations

In R, you build a graph step by step: you start with a simple graph and then add features to it with additional commands. You can also combine multiple graphs into a single one. Besides viewing a graph interactively, you can save it to a file. There are many functions for drawing graphs. Let's start with the most basic one, the plot() function.
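As a minimal sketch of saving a graph to a file, the following code opens a PDF graphics device, draws a plot with plot(), and closes the device. The output location (a temporary file) and the sample values are assumptions made for illustration only; they are not part of the TM dataset.

```r
# Hypothetical output location; a temporary file is used for illustration
outfile <- tempfile(fileext = ".pdf");

# Open a PDF graphics device; subsequent plots are written to the file
pdf(outfile);
# Sample values, not taken from the TM dataset
plot(c(2, 4, 3, 5), main = "Sample plot");
# Close the device to finish writing the file
dev.off();

# Check that the file was created
file.exists(outfile);
```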

Note that if you removed the TM dataset from memory or didn't save the workspace when you exited the last session, you need to re-read the dataset. The following code reads the dataset, adds it to the search path so that you can refer to its variables by name only, and draws a graph for the Education variable:

TM = read.table("C:\\SQL2017DevGuide\\Chapter13_TM.csv", 
                sep=",", header=TRUE, 
                stringsAsFactors = TRUE); 
attach(TM); 
 
# A simple distribution 
plot(Education);

The following screenshot shows the graph:

Basic plot for Education

The plot does not look too good. The Education variable is correctly identified as a factor; however, the variable is ordinal, so the order of its levels should be defined. You can see the problem in the previous figure, where the values are sorted alphabetically. In addition, the graph and the axes could have titles, and a different color might look better. The plot() function accepts many parameters. The following code introduces the parameters main for the main title, xlab for the x axis label, ylab for the y axis label, and col for the fill color:

Education = factor(Education, ordered=TRUE,  
                   levels=c("Partial High School",  
                            "High School","Partial College", 
                            "Bachelors", "Graduate Degree")); 
plot(Education, main = 'Education', 
     xlab='Education', ylab ='Number of Cases', 
     col="purple");
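To see the effect of defining the levels in isolation, here is a small sketch on a handful of sample values. The vector edu and its contents are assumptions made for illustration; they are not taken from the TM dataset.

```r
# Sample values, not taken from the TM dataset
edu <- c("High School", "Bachelors", "Partial College", "Bachelors");

# Without explicit levels, the levels are sorted alphabetically
levels(factor(edu));

# With explicit levels, the defined order is kept
edu_ord <- factor(edu, ordered = TRUE,
                  levels = c("High School", "Partial College", "Bachelors"));
levels(edu_ord);

# Ordered factors support comparisons between levels
edu_ord[1] < edu_ord[2];
```

Because the factor is ordered, the last comparison checks whether the first value's level comes before the second value's level in the defined order.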

This code produces a nicer graph, as you can see in the following screenshot:

Enhanced plot for Education

Now, let's make some more complex visualizations with multiple lines of code! For a start, the following code generates a new data frame TM1 as a subset of the TM data frame, selecting only 10 rows and 3 columns. This data frame will be used for line plots, where each case is plotted. The code also renames the variables to get unique names and adds the data frame to the search path:

cols1 <- c("CustomerKey", "NumberCarsOwned", "TotalChildren"); 
TM1 <- TM[TM$CustomerKey < 11010, cols1]; 
names(TM1) <- c("CustomerKey1", "NumberCarsOwned1", "TotalChildren1"); 
attach(TM1);

The next code cross-tabulates the NumberCarsOwned and the BikeBuyer variables and stores the result in an object. This cross-tabulation is used later for a bar plot:

nofcases <- table(NumberCarsOwned, BikeBuyer); 
nofcases;

The cross-tabulation result is shown here:

               BikeBuyer
NumberCarsOwned    0    1
              0 1551 2687
              1 2187 2696
              2 3868 2589
              3  951  694
              4  795  466
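The following sketch shows how table() arranges its output, using a few sample values. The vectors cars and buyer are assumptions made for illustration and do not come from the TM dataset.

```r
# Sample values, not taken from the TM dataset
cars  <- c(0, 1, 1, 2, 2, 2);
buyer <- c(1, 0, 1, 0, 0, 1);

# Rows come from the first argument, columns from the second
xt <- table(cars, buyer);
xt;

# Row and column totals of the cross-tabulation
rowSums(xt);
colSums(xt);
```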

You can specify graphical parameters directly or through the par() function. When you set parameters with this function, these parameters are valid for all subsequent graphs until you reset them. You can get a list of parameters by simply calling the function without any arguments. The following line of code saves all modifiable parameters to an object by using the no.readonly = TRUE argument when calling the par() function. This way, it is possible to restore the default parameters later, without exiting the session:

oldpar <- par(no.readonly = TRUE);

The next line specifies that the next four graphs are combined into a single graph, arranged in an invisible 2 x 2 grid that is filled by rows:

par(mfrow=c(2,2));
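The save, change, and restore pattern for graphical parameters can be sketched on its own as follows. The sketch runs on a separate null PDF device, so it should not affect the grid defined above; the object names saved, grid_set, and grid_restored are assumptions made for illustration.

```r
# Open a null PDF device so that no file or window is created
pdf(NULL);

# Save the modifiable parameters, then switch to a 2 x 2 grid
saved <- par(no.readonly = TRUE);
par(mfrow = c(2, 2));
grid_set <- par("mfrow");

# Restore the saved parameters; mfrow returns to a single plot
par(saved);
grid_restored <- par("mfrow");

# Close the null device and return to the previous one
dev.off();
```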

Now let's start filling the grid with smaller graphs. The next command creates a stacked bar chart showing the marital status distribution at different education levels (R draws a spine plot when both variables are factors). It also adds a title and labels for the x and y axes, and it changes the default colors used for the marital statuses to blue and yellow. This graph appears in the top-left cell of the invisible grid:

plot(Education, MaritalStatus, 
     main='Education and marital status', 
     xlab='Education', ylab ='Marital Status', 
     col=c("blue", "yellow"));

The hist() function produces a histogram for numeric variables. Histograms are especially useful if they don't have too many bars. You can define breakpoints for continuous variables with many distinct values. However, the NumberCarsOwned variable has only five distinct values, and therefore defining breakpoints is not necessary. This graph fills the top-right cell of the grid:

hist(NumberCarsOwned, main = 'Number of cars owned', 
     xlab='Number of Cars Owned', ylab ='Number of Cases', 
     col="blue");
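For a continuous variable, breakpoints can be passed through the breaks parameter of hist(). The following is a minimal sketch on sample values; the income vector is an assumption made for illustration and is not taken from the TM dataset. Because plot = FALSE is used, nothing is drawn and no cell of the grid is consumed.

```r
# Sample values, not taken from the TM dataset
income <- c(12, 18, 25, 31, 38, 44, 52, 67, 71, 95);

# Explicit breakpoints: five bins of width 20
# plot = FALSE returns the histogram object without drawing it
h <- hist(income, breaks = seq(0, 100, by = 20), plot = FALSE);
h$breaks;
h$counts;
```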

The next part of the code is slightly longer. It produces a line chart with two lines: one for the TotalChildren1 variable and one for the NumberCarsOwned1 variable. Note that the limited dataset is used in order to get just a small number of plotting points. First, a vector of colors is defined; it is used later for the colors of the legend. Then, the plot() function generates a line chart for the TotalChildren1 variable. Two new parameters are introduced here: type="o" draws over-plotted points and lines, and lwd=2 defines the line width. Next, the lines() function adds a line for the NumberCarsOwned1 variable to the current graph, in the current cell of the grid. A legend is then added to the same graph with the legend() function. The cex=1.4 parameter defines the character expansion factor relative to the current character size in the graph, and bty="n" specifies that no box is drawn around the legend. The lty and lwd parameters define the line type and the line width for the legend. Finally, a title is added. This graph is positioned in the bottom-left cell of the grid:

plot_colors=c("blue", "red"); 
plot(TotalChildren1,  
     type="o",col='blue', lwd=2, 
     xlab="Key",ylab="Number"); 
lines(NumberCarsOwned1,  
      type="o",col='red', lwd=2); 
legend("topleft",  
       c("TotalChildren", "NumberCarsOwned"), 
       cex=1.4,col=plot_colors,lty=1:2,lwd=1, bty="n"); 
title(main="Total children and number of cars owned line chart",  
      col.main="DarkGreen", font.main=4);
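The plot() function sets the axis limits from the first series only, so values of a series added later with lines() can fall outside the plotting region. One possible way to avoid this is to compute a common y axis range and pass it through the ylim parameter. The following is a sketch on sample values; the vectors children and cars and the object yrange are assumptions made for illustration, and the sketch draws on a separate null device so that it does not consume a cell of the grid.

```r
# Sample values, not taken from the TM dataset
children <- c(2, 3, 3, 0, 5);
cars     <- c(0, 1, 1, 1, 4);

# Common y axis range covering both series
yrange <- range(c(children, cars));

pdf(NULL);  # null device: nothing is written or displayed
plot(children, type = "o", col = "blue", lwd = 2,
     ylim = yrange, xlab = "Key", ylab = "Number");
lines(cars, type = "o", col = "red", lwd = 2);
dev.off();
```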

There is one more cell in the grid to fill. The barplot() function generates a bar chart of the number of cases for each value of the NumberCarsOwned variable, grouped by the BikeBuyer variable, with the bars shown side by side. Note that the input for this function is the cross-tabulation object generated with the table() function. The legend() function adds a legend in the top-right corner of the chart. This chart fills the bottom-right cell of the grid:

barplot(nofcases, 
        main='Number of cars owned and bike buyer grouped', 
        xlab='BikeBuyer', ylab ='NumberCarsOwned', 
        col=c("black", "blue", "red", "orange", "yellow"), 
        beside=TRUE); 
legend("topright",legend=rownames(nofcases),  
       fill = c("black", "blue", "red", "orange", "yellow"),  
       ncol = 1, cex = 0.75);

The following figure shows the overall results, all four graphs combined into one:

Four graphs in an invisible grid

The last part of the code is the cleanup part. It restores the old graphics parameters and removes both data frames from the search path:

par(oldpar); 
detach(TM); 
detach(TM1);
