Introducing data structures in R

A matrix is a two-dimensional array of values of the same type, or mode. You generate matrices from vectors with the matrix() function. Columns and rows can have labels. You can fill a matrix from a vector by rows or by columns (the default). The following code creates several matrices and shows the difference between filling by rows and filling by columns:

x = c(1,2,3,4,5,6); x;              # A vector
Y = array(x, dim=c(2,3)); Y;        # A two-dimensional array - filled by columns
Z = matrix(x,2,3,byrow=FALSE); Z;   # A matrix - fill by columns
U = matrix(x,2,3,byrow=TRUE); U;    # A matrix - fill by rows
rnames = c("Row1", "Row2");         # Row names
cnames = c("Col1", "Col2", "Col3"); # Column names
V = matrix(x,2,3,byrow=TRUE, dimnames = list(rnames, cnames)); V;

The first line generates and shows a one-dimensional vector. The second line creates a two-dimensional array, which is the same as a matrix, with two rows and three columns. The array is filled from the vector, column by column. The third line uses the matrix() function to create a matrix and fill it by columns. This matrix is equivalent to the previous one. The fourth line fills the matrix by rows. The fifth and sixth lines define row and column names. The last line again creates a matrix filled by rows; however, this time it adds row and column names, passed as a list of two vectors. Here are the results:

[1] 1 2 3 4 5 6
    
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
    
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
    
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
    
     Col1 Col2 Col3
Row1    1    2    3
Row2    4    5    6

You can see the difference between filling by rows and filling by columns. The following code shows how you can refer to matrix elements by position, or by name if you have named the rows and columns:

U[1,]; 
U[1,c(2,3)]; 
U[,c(2,3)]; 
V[,c("Col2", "Col3")];

The results are:

[1] 1 2 3
    
[1] 2 3
    
     [,1] [,2]
[1,]    2    3
[2,]    5    6
    
     Col2 Col3
Row1    2    3
Row2    5    6
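Note that selecting a single row, as in U[1,], returns a plain vector rather than a one-row matrix. The following sketch illustrates this and a few related indexing forms. It rebuilds the named matrix so that it runs on its own; the variable names one_value, row_vec, row_mat, and big_rows are illustrative assumptions:

```r
# Sketch: more ways to index a named matrix.
V <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("Row1", "Row2"), c("Col1", "Col2", "Col3")))

one_value <- V["Row2", "Col3"]                   # a single element by name
row_vec   <- V["Row1", ]                         # dimension dropped: a named vector
row_mat   <- V["Row1", , drop = FALSE]           # drop = FALSE keeps a 1 x 3 matrix
big_rows  <- V[V[, "Col1"] > 1, , drop = FALSE]  # rows where Col1 is greater than 1

is.matrix(row_vec)
is.matrix(row_mat)
dim(row_mat)
```

Using drop = FALSE is worth considering in code that expects a matrix regardless of how many rows or columns a selection returns.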

As you can see from the matrix examples, a matrix is just a two-dimensional array. You generate arrays with the array() function. This function again accepts a vector of values as the first input parameter, then a vector specifying the number of elements in each dimension, and then a list of vectors with the names of the dimensions' elements. An array is filled so that the first index varies fastest: each column is filled from top to bottom, then the next column, and when a two-dimensional slice is full, filling continues along the third dimension (let's call it pages), and so on. Here is an example that generates a three-dimensional array:

rnames = c("Row1", "Row2"); 
cnames = c("Col1", "Col2", "Col3"); 
pnames = c("Page1", "Page2", "Page3"); 
Y = array(1:18, dim=c(2,3,3), dimnames = list(rnames, cnames,  pnames)); Y;

The result is as follows:

, , Page1
    
     Col1 Col2 Col3
Row1    1    3    5
Row2    2    4    6
  
, , Page2
    
     Col1 Col2 Col3
Row1    7    9   11
Row2    8   10   12
    
, , Page3
    
     Col1 Col2 Col3
Row1   13   15   17
Row2   14   16   18
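Elements of a multidimensional array can be addressed with one index per dimension, by position or by name. The following sketch rebuilds the same array and also checks the fill order with a position formula; the variables i, j, k, pos, and page2 are introduced here purely for illustration:

```r
# Sketch: indexing a three-dimensional array and checking the fill order.
Y <- array(1:18, dim = c(2, 3, 3),
           dimnames = list(c("Row1", "Row2"),
                           c("Col1", "Col2", "Col3"),
                           c("Page1", "Page2", "Page3")))

Y[2, 3, 1]                   # row 2, column 3, page 1
Y["Row1", "Col1", "Page2"]   # the same kind of lookup, by name
page2 <- Y[, , "Page2"]      # a whole page is an ordinary 2 x 3 matrix
dim(page2)

# Position in the source vector of element [i, j, k] in a 2 x 3 x 3 array:
# the row index varies fastest, then the column, then the page.
i <- 2; j <- 3; k <- 2
pos <- i + (j - 1) * 2 + (k - 1) * 2 * 3
Y[i, j, k] == (1:18)[pos]
```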

Variables can store discrete or continuous values. Discrete values can be nominal, or categorical, where they represent labels only, or ordinal, where there is an intrinsic order in the values. In R, factors represent nominal and ordinal variables. Levels of a factor represent distinct values. You create factors from vectors with the factor() function. It is important to properly determine the factors because advanced data mining and machine learning algorithms treat discrete and continuous variables differently. Here are some examples of factors:

x = c("good", "moderate", "good", "bad", "bad", "good"); 
y = factor(x); y;   
z = factor(x, ordered=TRUE); z; 
w = factor(x, ordered=TRUE,  
           levels=c("bad", "moderate","good")); w;

The first line defines a vector of six values denoting whether the observed person was in a good, moderate, or bad mood. The second line generates a factor from the vector and shows it. The third line generates an ordinal variable. Note that in the results the order of the levels is alphabetical, which is not the intended order. The command in the last two lines generates another ordinal variable from the same vector, this time specifying the order of the levels explicitly. Here are the results:

[1] good     moderate good     bad      bad      good    
Levels: bad good moderate
   
[1] good     moderate good     bad      bad      good    
Levels: bad < good < moderate
    
[1] good     moderate good     bad      bad      good    
Levels: bad < moderate < good
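To see why the order of the levels matters, it can help to inspect what a factor stores. The following is a small sketch using only base R functions; it recreates the explicitly ordered factor so that it is self-contained:

```r
# Sketch: inspecting an ordered factor.
x <- c("good", "moderate", "good", "bad", "bad", "good")
w <- factor(x, ordered = TRUE, levels = c("bad", "moderate", "good"))

levels(w)       # the distinct values, in the declared order
as.integer(w)   # internal codes: the position of each value in levels(w)
table(w)        # frequency of each level
w[1] > w[2]     # ordered factors support comparisons: is good > moderate?
max(w)          # the highest level present in the data
```

With the alphabetical ordering, the same comparison would rank moderate above good, which is one reason to declare the levels explicitly for ordinal variables.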

Lists are the most complex data structures in R. Lists are ordered collections of different data structures. You typically do not work with them a lot. You need to know them because some functions return multiple results, or complex results, packed in a list, and you need to extract specific parts. You create lists with the list() function. You refer to objects of a list by position, using the index number enclosed in double square brackets. If an element is a vector or a matrix, you can additionally use the position of a value in the vector, enclosed in single square brackets. Here is an example:

L = list(name1="ABC", name2="DEF", 
         no.children=2, children.ages=c(3,6)); 
L; 
L[[1]]; 
L[[4]]; 
L[[4]][2];

The example produces the following result:

$name1
[1] "ABC"
    
$name2
[1] "DEF"
    
$no.children
[1] 2
    
$children.ages
[1] 3 6
    
[1] "ABC"
    
[1] 3 6
    
[1] 6
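Because the elements of this list are named, they can also be retrieved by name. The sketch below shows this, along with the difference between single and double square brackets; the variable sub is an illustrative name only:

```r
# Sketch: accessing list elements by name.
L <- list(name1 = "ABC", name2 = "DEF",
          no.children = 2, children.ages = c(3, 6))

names(L)                  # the element names
L$children.ages           # access by name with $
L[["children.ages"]][2]   # access by name with [[ ]], then index the vector

sub <- L["name1"]         # single brackets return a list with the selected elements
class(sub)
class(L[["name1"]])       # double brackets return the element itself
```

Access by name tends to be more robust than access by position when you extract parts of a result returned by a function, since it does not depend on the order of the elements.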

Finally, the most important data structure is the data frame. Most of the time, you analyze data stored in a data frame. A data frame resembles a matrix, except that each variable can be of a different type. Remember, a variable is stored in a column, and all values of a single variable must be of the same type. Data frames are very similar to SQL Server tables. However, like matrices, they are ordered, and you can refer to their elements by position. (Internally, R stores a data frame as a list of columns of equal length.) You create a data frame with the data.frame() function from multiple vectors of the same length. Here is an example of generating a data frame:

CategoryId = c(1,2,3,4); 
CategoryName = c("Bikes", "Components", "Clothing", "Accessories"); 
ProductCategories = data.frame(CategoryId, CategoryName); 
ProductCategories;

The result is as follows:

  CategoryId CategoryName
1          1        Bikes
2          2   Components
3          3     Clothing
4          4  Accessories
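Once a data frame exists, you can address its columns by name and its cells by position, and you can filter rows with a logical condition. The following sketch recreates the data frame so that it runs on its own. It sets stringsAsFactors = FALSE explicitly, which is an assumption made here so that the text column stays a character vector in every R version; the variable filtered is an illustrative name:

```r
# Sketch: basic operations on a data frame.
CategoryId <- c(1, 2, 3, 4)
CategoryName <- c("Bikes", "Components", "Clothing", "Accessories")
ProductCategories <- data.frame(CategoryId, CategoryName,
                                stringsAsFactors = FALSE)

ProductCategories$CategoryName         # one column as a vector
ProductCategories[2, "CategoryName"]   # row 2 of a named column
filtered <- ProductCategories[ProductCategories$CategoryId > 2, ]  # filter rows
filtered

dim(ProductCategories)                 # number of rows and columns
str(ProductCategories)                 # structure: column names and types
is.list(ProductCategories)             # a data frame is stored as a list of columns
```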

Most of the time, you get a data frame from your data source, for example from a SQL Server database. The results of the earlier example that reads from SQL Server can also be stored in a data frame. You can also enter the data manually, or read it from many other sources, including text files and Excel. The following code reads the data from a comma-separated values (CSV) file into a data frame, and then displays the first four columns for the first five rows of the data frame. The CSV file is provided in the download for the accompanying code for this book, as described in the preface of the book:

TM = read.table("C:\\SQL2017DevGuide\\Chapter13_TM.csv", 
                sep=",", header=TRUE, row.names = "CustomerKey", 
                stringsAsFactors = TRUE); 
TM[1:5,1:4];

The code specifies that the first row of the file holds column names (header), and that the CustomerKey column represents the row names (or row identifications). If you are interested, the data comes from the dbo.vTargetMail view from the AdventureWorksDW2014 demo SQL Server database you can download from Microsoft CodePlex at https://msftdbprodsamples.codeplex.com/. The first five rows are presented here:

      MaritalStatus Gender TotalChildren NumberChildrenAtHome
11000             M      M             2                    0
11001             S      M             3                    3
11002             M      M             3                    3
11003             S      F             0                    0
11004             S      F             5                    5
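If you do not have the CSV file at hand, you can try the same read.table() parameters on a small temporary file. The following sketch uses a few hypothetical rows that only mimic the shape of the real data; the names csv_lines, tmp, and demo are assumptions made for this illustration:

```r
# Sketch: the effect of header, row.names, and stringsAsFactors.
# The rows below are made-up sample data, not the contents of the real file.
csv_lines <- c("CustomerKey,MaritalStatus,Gender,TotalChildren",
               "11000,M,M,2",
               "11001,S,M,3",
               "11002,M,M,3",
               "11003,S,F,0")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

demo <- read.table(tmp, sep = ",", header = TRUE,
                   row.names = "CustomerKey", stringsAsFactors = TRUE)
unlink(tmp)                      # remove the temporary file

rownames(demo)                   # CustomerKey values became the row names
ncol(demo)                       # CustomerKey is no longer a regular column
class(demo$Gender)               # strings were converted to factors
levels(demo$MaritalStatus)
demo["11001", "TotalChildren"]   # look up a value by row name
```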
