Every data mining project is incomplete without proper data visualization. Numbers and summary statistics can tell a similar story for the variables under study across different cuts of the data, yet plotting the relationship between variables and factors can reveal a different story altogether. Data visualization therefore conveys a message that numbers and statistics alone fail to deliver. From a data mining perspective, data visualization has many advantages, which can be summarized in three important points:
In this chapter, the reader will learn the basics of data visualization and how to create advanced visualizations using existing libraries in the R programming language. Typically, data visualization can be approached in two different ways:
Based on the preceding two points, let's have a look at the data visualization rules and theories behind each visualization, and then we are going to look at the practical aspect of implementing the graphs and charts using R Script.
From a functional point of view, the following are the graphs and charts which a data scientist would like the audience to look at to infer the information:
Keeping in mind the preceding functions that people use to display insights to readers, we can see that one graph can serve several functions. These graphs and charts can be produced with various open source R packages, such as ggplot2, ggvis, rCharts, plotly, and googleVis, using a single open source dataset.
In the light of the previously mentioned ten points, the data visualization rules can be created to select the best representation depending on what you want to represent:
In this chapter, we will primarily focus on the ggplot2 library and the plotly library. We will also cover a few more interesting libraries for creating data visualizations. The graphics packages in R can be organized as follows:
A detailed explanation of the various libraries supporting the previous functionalities can be found at https://cran.r-project.org/web/views/Graphics.html. A good visual needs enough data points for the graphs to be reasonably dense. In this context, we will be using two datasets, diamonds.csv and cars93.csv, to show data visualization.
There are two approaches to data visualization, horizontal and vertical. Horizontal drill down means creating different charts and graphs using ggplot2, and vertical drill down means creating one graph and adding different components to it. First, we will understand how to add and modify different components of a graph, and then we will move horizontally to create different types of charts.
Let's look at the dataset and libraries required to create data visualization:
> #getting the library
> library(ggplot2);head(diamonds);names(diamonds)
  X carat       cut color clarity depth table price    x    y    z
1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
 [1] "X"       "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"   "x"
[10] "y"       "z"
The ggplot2 library implements the grammar of graphics for data visualization. A graph starts from a dataset and two variables, and then different components can be added to the base graph by using the + sign. Let's create a nice visualization using the diamonds.csv dataset:
> #starting a basic ggplot plot object
> gg<-ggplot(diamonds,aes(price,carat))+geom_point(color="brown4")
> gg
In the preceding script, diamonds is the dataset, and price and carat are the two variables, mapped to the x axis and y axis respectively. The ggplot function creates the base graph, a point layer is added to it, and the result is stored in the object gg.
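To make the layering idea concrete, here is a minimal sketch. It assumes the diamonds data frame bundled with ggplot2 (rather than the diamonds.csv file, which carries an extra X index column), and it inspects the plot object's layers element, an internal detail used here purely for illustration.

```r
library(ggplot2)

# Base object: data and aesthetic mappings only, no layers yet
base <- ggplot(diamonds, aes(x = price, y = carat))
n_before <- length(base$layers)

# Each "+" returns a new plot object with one more component
with_points <- base + geom_point(color = "brown4")
n_after <- length(with_points$layers)

c(before = n_before, after = n_after)
```

Because each addition returns a new object, the intermediate plot can be stored and reused as the starting point for several variants.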
Now we are going to add various components to the ggplot graph to make it more interesting:
After creating the base plot, the next step is to add a title and labels to the graph. This can be done using either of two functions, ggtitle or labs. Then, let's add a theme to the plot to customize the text elements:
> #adding a title or label to the graph
> gg<-gg+ggtitle("Diamond Carat & Price")
> gg
> #labs() needs a named argument to know which label to set
> gg<-gg+labs(title="Diamond Carat & Price")
> gg
> #adding theme to the plot
> gg<-gg+theme(plot.title= element_text(size = 20, face = "bold"))
> gg
Currently, the graph looks a little congested. To make it more intuitive, we add labels to the x axis and y axis and remove ticks and text from an axis where they only add clutter. Rotating the text on an axis helps when the row or column names contain long text or large numbers that are difficult to read in full:
> #adding labels to the graph
> gg<-gg+labs(x="Price in Dollar", y="Carat")
> gg
> #removing text and ticks from an axis
> gg<-gg+theme(axis.ticks.y=element_blank(),axis.text.y=element_blank())
> gg
> #rotating and coloring the axis text
> gg<-gg + theme(axis.text.x=element_text(angle=50, size=10, vjust=0.5))
> gg
> gg<-gg + theme(axis.text.x=element_text(color = "chocolate", vjust=0.45),
+ axis.text.y=element_text(color = "brown1", vjust=0.45))
> gg
To focus on a specific portion of the plot, the x axis and y axis limits can be changed as follows. The warning reports the number of rows removed when the limits are applied to both axes:
> #setting limits to both axis
> gg<-gg + ylim(0,0.8)+xlim(250,1500)
> gg
Warning message:
Removed 33937 rows containing missing values (geom_point).
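The warning arises because xlim() and ylim() discard observations outside the limits before drawing. As a hedged sketch of an alternative, coord_cartesian() zooms into the same window without dropping data. The example assumes the diamonds data frame bundled with ggplot2, so the row count it computes may differ from the number in the warning above.

```r
library(ggplot2)

# Rows that fall inside the window used in the text
inside <- with(diamonds,
               price >= 250 & price <= 1500 & carat >= 0 & carat <= 0.8)
n_outside <- sum(!inside)   # rows that xlim()/ylim() would drop

# Zoom on the same window while keeping every row in the data
zoomed <- ggplot(diamonds, aes(price, carat)) +
  geom_point(color = "brown4") +
  coord_cartesian(xlim = c(250, 1500), ylim = c(0, 0.8))

n_outside
```

The distinction matters once statistical layers such as smoothers are added, because they are computed only on the data that survives the limits.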
If both the x axis and the y axis represent continuous data, a third variable can be introduced into the ggplot object as a factor to create a legend and to see how the data is distributed across the factor variable:
> #how to set legends in a graph
> gg<-ggplot(diamonds,aes(price,carat,color=factor(cut)))+geom_point()
> gg
> gg<-ggplot(diamonds,aes(price,carat,color=factor(color)))+geom_point()
> gg
> gg<-ggplot(diamonds,aes(price,carat,color=factor(clarity)))+geom_point()
> gg
> gg<-gg+theme(legend.title=element_blank())
> gg
> gg<-gg+theme(legend.title = element_text(colour="darkblue", size=16,
+ face="bold"))+scale_color_discrete(name="By Different Grids of Clarity")
> #changing the background boxes in legend
> gg<-gg+theme(legend.key=element_rect(fill='dodgerblue1'))
> gg
> #changing the size of the symbols used in legend
> gg<-gg+guides(colour = guide_legend(override.aes = list(size=4)))
> gg
In addition to the previous visualization, we can connect the points of the scatterplot and change the background. Adding lines to the scatterplot helps to reveal any sequence or pattern that exists between the related variables:
> #adding line to the data points
> gg<-gg+geom_line(color="darkcyan")
> gg
> #changing the background of an image
> gg<-gg+theme(panel.background = element_rect(fill = 'chocolate3'))
> gg
> #changing plot background
> gg<-gg+theme(plot.background = element_rect(fill = 'skyblue'))
> gg
Another important aspect of data visualization is how to display multi-dimensional cuts in a plot. For example, in the diamonds dataset, we are currently looking at the relationship between the price of a diamond and its carat weight. There are three more variables: cut, color, and clarity. It makes sense to check whether the relationship is consistent across those three variables, that is, whether we can plot the relationship between carat and price by cut, color, and clarity category. Let's look at the distribution of the three categorical variables:
> table(diamonds$cut);table(diamonds$clarity);table(diamonds$color)
     Fair      Good Very Good   Premium     Ideal
     1610      4906     12082     13791     21551
   I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
  741  9194 13065 12258  8171  5066  3655  1790
    D     E     F     G     H     I     J
 6775  9797  9542 11292  8304  5422  2808
> #adding a multi-variable cut to the graph
> gg<-gg+facet_wrap(~cut, nrow=4)
> gg
In the preceding graph, the cut variable is used to split the relationship between carat and price into panels. Four rows are requested so that the panels are displayed clearly. If we add one more variable, clarity, alongside cut, the graph becomes more insightful:
> #adding two variables as cut to display the relationship
> gg<-gg+facet_wrap(~cut+clarity, nrow=4)
> gg
When creating graphs using a multi-dimensional cut variable, the panels need not all share the same scale. By default, one common scale is applied to every panel, so some panels can appear compressed. Freeing the scales lets each panel set its axes from its own observed values:
> #scale free graphs in multi-panels
> gg<-gg+facet_wrap(~color, ncol=2, scales="free")
> gg
Using the facet_grid() option, we can display the relationship across two categorical variables at once using the ggplot2 library:
> #bi-variate plotting using ggplot2
> gg<-gg+facet_grid(color~cut)
> gg
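One detail worth knowing when running the facet examples in sequence: a plot holds a single facet specification, so adding a second facet call replaces the first rather than combining with it. The following small sketch assumes the diamonds data frame bundled with ggplot2 and peeks at the plot's internal facet element for illustration only.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(price, carat)) + geom_point()

wrapped <- p + facet_wrap(~cut, nrow = 4)
# Adding facet_grid() afterwards overrides the earlier facet_wrap()
gridded <- wrapped + facet_grid(color ~ cut)

class(wrapped$facet)[1]
class(gridded$facet)[1]
```

This is why the successive gg<-gg+facet_... statements each produce a different panel layout instead of accumulating.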
Certain external graphical themes can be used with ggplot2, such as those provided by library(ggthemes). The colors and themes of Tableau, a tool known for data visualization, can also be used with ggplot2 plots:
> #changing discrete category colors
> ggplot(diamonds, aes(price, carat, color=factor(cut)))+
+ geom_point() +
+ scale_color_brewer(palette="Set1")
> #Using tableau colors
> library(ggthemes)
> ggplot(diamonds, aes(price, carat, color=factor(cut)))+
+ geom_point() +
+ scale_color_tableau()
Plots can be further modified by using a color gradient and by showing the distribution of a variable on the graph itself:
> #using color gradient (a variable must be mapped to color for the scale to apply)
> ggplot(diamonds, aes(price, carat, color=depth))+
+ geom_point() +
+ scale_color_gradient(low = "blue", high = "red")
> #plotting a distribution on a graph
> #the midpoint is taken from depth, the variable mapped to color
> mid<-mean(diamonds$depth)
> ggplot(diamonds, aes(price, carat, color=depth))+geom_point()+
+ scale_color_gradient2(midpoint=mid,
+ low="blue", mid="white", high="red" )
Having discussed the components of the graph building process in depth, let's now see how to create different charts and graphs using ggplot2. qplot() is a basic plotting function in ggplot2 that acts as a wrapper for creating different types of plots. A user can plot with either the qplot() or the ggplot() function. To create different graphs, we are going to use the Cars93.csv dataset.
Bar charts are the preferred visualization for categorical variables and are used to represent the count or percentage of each group. The horizontal axis represents the categories and the vertical axis represents either the count or the percentage:
> #creating bar chart
> barplot <- ggplot(Cars93,aes(Type))+
+ geom_bar(width = 0.5,fill="royalblue4",color="red")+
+ ggtitle("Vehicle Count by Category")
> barplot
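geom_bar() counts rows by default. To put percentages on the vertical axis instead, one option is to compute them first and draw the bars with geom_col(). This is a sketch under the assumption that the Cars93 data frame from the MASS package stands in for Cars93.csv; the names share, pct, and pct_bar are illustrative.

```r
library(ggplot2)
library(MASS)   # assumed source of the Cars93 data frame

# Share of vehicles in each category, as a percentage
share <- prop.table(table(Cars93$Type)) * 100
pct <- data.frame(Type = names(share), Percent = as.numeric(share))

# geom_col() draws bars whose heights come straight from the data
pct_bar <- ggplot(pct, aes(Type, Percent)) +
  geom_col(width = 0.5, fill = "royalblue4", color = "red") +
  ggtitle("Vehicle Share by Category (%)")
pct_bar
```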
Boxplots created with the ggplot2 package are easy both to interpret and to customize. The outliers plotted on each boxplot are easy to recognize:
> #creating boxplot
> boxplot <- ggplot(Cars93,aes(Type,Price))+
+ geom_boxplot(width = 0.5,fill="firebrick",color="cadetblue2",
+ outlier.colour = "purple",outlier.shape = 2)+
+ ggtitle("Boxplot of Price by Car Type")
> boxplot
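The points drawn as outliers follow the usual boxplot rule: values beyond the whiskers, which extend at most 1.5 times the interquartile range from the box. The sketch below lists them with base R's boxplot.stats(), assuming Cars93 from the MASS package. ggplot2 computes its hinges slightly differently, so borderline points may not match the plot exactly.

```r
library(MASS)   # assumed source of the Cars93 data frame

# Outlying prices within each vehicle type
outliers_by_type <- tapply(Cars93$Price, Cars93$Type,
                           function(p) boxplot.stats(p)$out)

# Whisker ends (lowest and highest non-outlying values) for all prices pooled
s <- boxplot.stats(Cars93$Price)
whiskers <- s$stats[c(1, 5)]

outliers_by_type
```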
The bubble chart belongs to the scatterplot family. It is preferred when three quantitative variables need to be represented. Two quantitative variables are represented on the two axes, and the third determines the size of each bubble:
> #creating bubble chart
> #color is a constant, so it is set outside aes()
> bubble<-ggplot(Cars93, aes(x=EngineSize, y=MPG.city)) +
+ geom_point(aes(size=Price),color="red") +
+ scale_size_continuous(range=c(2,15)) +
+ theme(legend.position = "bottom")
> bubble
A donut chart is used in place of a pie chart when the number of categories exceeds five:
> #creating Donut charts
> ggplot(Cars93) + geom_rect(aes(fill=Cylinders, ymax=Max.Price,
+ ymin=Min.Price, xmax=4, xmin=3)) +
+ coord_polar(theta="y") + xlim(c(0, 4))
Any dataset that has a city name, state name, or country name can be plotted on a geographical map using the Google visualization library in R. Using another open source dataset, state.x77, which is built into R, we can show how a geo map looks. The Google visualization library uses the Google Maps API to plot geographic locations along with the enterprise data. It publishes the output in a browser, from which it can be saved as an image for further use:
> library(googleVis)
> head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
> states <- data.frame(state.name, state.x77)
> gmap <- gvisGeoMap(states, "state.name", "Area",
+ options=list(region="US", dataMode="regions",
+ width=900, height=600))
> plot(gmap)
A histogram is probably the easiest plot, and one that every data mining professional ends up creating. The following code shows how a histogram can be created using the ggplot2 library:
> #creating histograms
> #bins sets the number of bars; geom_histogram has no width argument
> histog <- ggplot(Cars93,aes(RPM))+
+ geom_histogram(fill="firebrick",color="cadetblue2",
+ bins = 20)+
+ ggtitle("Histogram")
> histog
A line chart is not a preferred chart for showing raw data. However, it is useful for showing variation in some metric across different categories. Ultimately, it depends on how the practitioner wants to display the data and tell a story to the reader:
> #creating line charts
> linechart <- ggplot(Cars93,aes(RPM,Price))+
+ geom_line(color="cadetblue4")+
+ ggtitle("Line Charts")
> linechart
A pie chart represents a categorical variable and works best when the variable has fewer than 10 categories. With more than 10, a histogram or bar plot is suggested for comparison. The pie chart can be created using the ggplot2 library. The script is as follows:
> #creating pie charts
> pp <- ggplot(Cars93, aes(x = factor(1), fill = factor(Type))) +
+ geom_bar(width = 1)
> pp + coord_polar(theta = "y")
> # 3D Pie Chart from data frame
> library(plotrix)
> t <- table(Cars93$Type);par(mfrow=c(1,2))
> pct <- paste(names(t), " ", t, sep="")
> pie(t, labels = pct, main="Pie Chart of Type of cars")
> pie3D(t,labels=pct,main="Pie Chart of Type of cars")
A scatterplot is a very important plot for understanding the bivariate relationship that exists in data. It can also show how the data is patterned over a period of time. The data must be presented properly for a scatterplot to be useful. The following example shows how a bivariate relationship can be displayed with a third dimension driving the visualization. The third dimension could be a continuous variable or a categorical variable. Using the gridExtra library, an additional graphing window can be created in which two or more related plots are shown side by side:
> library(gridExtra)
> sp <- ggplot(Cars93,aes(Horsepower,MPG.highway))+
+ geom_point(color="dodgerblue",size=5)+ggtitle("Basic Scatterplot")+
+ theme(plot.title= element_text(size = 12, face = "bold"))
> sp
> #adding a continuous variable Length to scale the scatterplot points
> sp2<-sp+geom_point(aes(color=Length), size=5)+
+ ggtitle("Scatterplot: Adding Length Variable")+
+ theme(plot.title= element_text(size = 12, face = "bold"))
> sp2
> grid.arrange(sp,sp2,nrow=1)
In the second graph of the preceding plot, the continuous Length variable colors the relationship between horsepower and highway miles per gallon. The light blue dots indicate longer cars, while the darker dots indicate shorter cars. If we use a factor variable instead of a continuous variable to color the relationship between the two variables, we will see a plot like the first graph in the following plot:
> #adding a factor variable Origin to scale the scatterplot points
> sp3<-sp+geom_point(aes(color=factor(Origin)),size=5)+
+ ggtitle("Scatterplot: Adding Origin Variable")+
+ theme(plot.title= element_text(size = 12, face = "bold"))
> sp3
> #adding custom color to the scatterplot
> sp4<-sp+geom_point(aes(color=factor(Origin)),size=5)+
+ scale_color_manual(values = c("red","blue"))+
+ ggtitle("Scatterplot: Adding Custom Color")+
+ theme(plot.title= element_text(size = 12, face = "bold"))
> sp4
> grid.arrange(sp3,sp4,nrow=1)
To show how one variable trends with another, a trendline or regression line can be displayed on a scatterplot. Using the ggplot2 library, different regression lines can be plotted, such as linear, non-linear, generalized linear, and so on. When no method is specified and the dataset has fewer than 1000 observations, the loess regression method is applied by default. With more than 1000 observations, the generalized additive model is applied. The trend lines are displayed next. The first plot shows a line graph connecting all the points, and the second plot shows the linear and robust linear model fits:
> library(MASS) #provides rlm, the robust linear model
> sp5<-sp+geom_point(color="blue",size=5)+geom_line()+
+ ggtitle("Scatterplot: Adding Lines")+
+ theme(plot.title= element_text(size = 12, face = "bold"))
> sp5
> #adding regression lines to the scatterplot
> sp6<-sp+geom_point(color="firebrick",size=5)+
+ geom_smooth(method = "lm",se =T)+
+ geom_smooth(method = "rlm",se =T)+
+ ggtitle("Adding Regression Lines")+
+ theme(plot.title= element_text(size = 12, face = "bold"))
> sp6
> grid.arrange(sp5,sp6,nrow=1)
Adding the generalized linear model, the generalized additive model, and loess as a non-linear regression method, we can modify the scatterplots as follows:
> sp7<-sp+geom_point(color="firebrick",size=5)+
+ geom_smooth(method = "auto",se =T)+
+ geom_smooth(method = "glm",se =T)+
+ ggtitle("Adding Regression Lines")+
+ theme(plot.title= element_text(size = 20, face = "bold"))
> sp7
> #adding regression lines to the scatterplot
> sp8<-sp+geom_point(color="firebrick",size=5)+
+ geom_smooth(method = "gam",se =T)+
+ ggtitle("Adding Regression Lines")+
+ geom_smooth(method = "loess",se =T)+
+ theme(plot.title= element_text(size = 20, face = "bold"))
> sp8
> grid.arrange(sp7,sp8,nrow=1)
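A fitted line on a plot can also be checked numerically. With its default formula, geom_smooth(method = "lm") fits a straight line of y on x, which corresponds to the lm() call in the sketch below. The example assumes Cars93 from the MASS package, and the 150-horsepower car is a hypothetical input chosen only for illustration.

```r
library(MASS)   # assumed source of the Cars93 data frame

# Straight-line model matching geom_smooth(method = "lm") on these axes
fit <- lm(MPG.highway ~ Horsepower, data = Cars93)
slope <- unname(coef(fit)["Horsepower"])

# Predicted highway mileage for a hypothetical 150-horsepower car
pred_150 <- unname(predict(fit, newdata = data.frame(Horsepower = 150)))

c(slope = slope, pred_150 = pred_150)
```

A negative slope here corresponds to the downward-sloping line in the scatterplot: more powerful cars tend to travel fewer highway miles per gallon.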
A 3D scatterplot is another addition to the list of scatterplot functions we are looking at. The 3D scatterplot library lets users rotate the plot to view the data points from different angles. Once executed, the following script opens a new rgl device window in which the graph can be rotated:
> library(scatterplot3d);library(Rcmdr)
> scatter3d(MPG.highway~Length+Width|Origin, data=Cars93,
+ fit="linear",residuals=TRUE, parallel=FALSE,
+ bg="black", axis.scales=TRUE, grid=TRUE, ellipsoid=FALSE)
Stacked bar charts are a variant of bar charts in which two or more variables can be plotted using different combinations of color. The following examples show two ways to create a stacked bar chart:
> qplot(factor(Type), data=Cars93, geom="bar", fill=factor(Origin))
> #or
> ggplot(Cars93, aes(Type, fill=Origin)) + geom_bar()
A stem and leaf plot is a textual representation of a quantitative variable that segments the values by their most significant digits. For example, the stem and leaf plot for the mileage within the city variable from the Cars93.csv dataset is represented as follows:
> stem(Cars93$MPG.city)
  The decimal point is 1 digit(s) to the right of the |
  1 | 55666777777778888888888889999999999
  2 | 0000000011111122222223333333344444
  2 | 5555556688999999
  3 | 01123
  3 | 9
  4 | 2
  4 | 6
To interpret a stem and leaf plot, suppose we need to know how many observations are 30 or greater. The answer is 8. The digit to the left of the pipe gives the tens and each digit to the right gives the units, so the respective numbers are 30, 31, 31, 32, 33, 39, 42, and 46.
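The count read off the plot can be confirmed directly from the data. This quick check assumes the Cars93 data frame from the MASS package matches the Cars93.csv file used in the text.

```r
library(MASS)   # assumed source of the Cars93 data frame

# City mileage values of 30 or more, in ascending order
high_mpg <- sort(Cars93$MPG.city[Cars93$MPG.city >= 30])

length(high_mpg)
high_mpg
```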
A word cloud is a data visualization method that is preferred for representing textual data. For example, across a set of text files, the few words that appear most frequently tend to summarize the topic of discussion. A word cloud is therefore a visual summary of unstructured textual data. It is mostly used to represent social media posts, such as Twitter tweets, Facebook posts, and so on. Various pre-processing tasks come before creating a word cloud, and the final output of a text mining exercise is a data frame of words and their respective frequencies:
#Word cloud representation
library(wordcloud)
words<-c("data","data mining","analytics","statistics","graphs",
         "visualization","predictive analytics","modeling","data science",
         "R","Python","Shiny","ggplot2","data analytics")
freq<-c(123,234,213,423,142,145,156,176,214,218,213,234,256,324)
d<-data.frame(words,freq)
set.seed(1234)
wordcloud(words = d$words, freq = d$freq, scale=c(8,.3), min.freq = 1,
          max.words=200, random.order=F, rot.per=0.35,
          colors=brewer.pal(7, "Dark2"))
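The frequencies above are typed in by hand. As a rough sketch of the pre-processing step, the following base R code turns raw text into the words and freq data frame that wordcloud() expects. The two sentences in docs are made up for illustration, and a real text mining exercise would usually also remove stop words and apply stemming.

```r
# Hypothetical documents standing in for a real corpus
docs <- c("Data mining needs data visualization",
          "Visualization of data tells a story")

# Lower-case the text and split on anything that is not a letter
tokens <- unlist(strsplit(tolower(docs), "[^a-z]+"))
tokens <- tokens[nchar(tokens) > 0]

# Count each word and arrange as a words/freq data frame
counts <- sort(table(tokens), decreasing = TRUE)
d <- data.frame(words = names(counts), freq = as.integer(counts),
                stringsAsFactors = FALSE)
d
```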
The coxcomb chart, also known as a polar chart or rose chart, is a combination of a pie chart and a bar chart. The area of each section is adjusted to the value of that segment by changing its radius. The insights shown in a coxcomb chart are easy to understand and require no technical knowledge:
> #coxcomb chart = bar chart + pie chart
> cox<- ggplot(Cars93, aes(x = factor(Type))) +
+ geom_bar(width = 1, colour = "goldenrod1",fill="darkviolet")
> cox + coord_polar()
A variant of the coxcomb plot can be obtained by changing the polar coordinate argument, theta:
> #coxcomb chart = bar chart + pie chart
> cox<- ggplot(Cars93, aes(x = factor(Type))) +
+ geom_bar(width = 1, colour = "goldenrod1",fill="darkred")
> cox + coord_polar()
> #a second variant of coxcomb plot
> cox + coord_polar(theta = "y")