Getting sorted with data management

After you've imported the data to be analyzed in a data frame, you need to prepare it for further analysis. There are quite a few possible ways to retrieve values from a data frame. You can refer to the data by position, or by using the subscript notation. However, if you use a single subscript only, then you retrieve columns defined by the subscript and all rows. The same is true if you use a vector of column names only, without specifying the rows. If you specify two indexes or index ranges, then the first one is used for rows and the second one for columns. The following code shows these options:

TM[1:2];                              # Two columns 
TM[c("MaritalStatus", "Gender")];     # Two columns 
TM[1:3,1:2];                          # Three rows, two columns 
TM[1:3,c("MaritalStatus", "Gender")];

The first command returns all rows and two columns only. The second command produces the same result. The third command returns three rows, again for the MaritalStatus and Gender columns only. The fourth row produces the same result as the third one.

The most common notation uses the data frame name and column name, separated by the dollar ($) sign, such as TM$Gender. You can also avoid excessive writing of the data frame name by using the attach() or with() functions. With the attach() function, you add the data frame to the search path that R uses to find the objects. You can refer to a variable name directly without the data frame name, if the variable name is unique in the search path. The detach() function removes the data frame from the search path, to avoid possible ambiguity with duplicate variable names later. The with() function allows you to name the data frame only once, and then use the variables in a set of statements enclosed in {} brackets inside the body of the function. The following code shows these approaches:

table(TM$MaritalStatus, TM$Gender); 
attach(TM); 
table(MaritalStatus, Gender); 
detach(TM); 
with(TM, 
     {table(MaritalStatus, Gender)});

The first line produces a cross-tabulation of MaritalStatus and Gender. Please note the dataframe$variable notation. The second line adds the data frame to the search path. The third command produces the same cross-tabulation as the first one, however, this time referring to variable names only. The fourth command removes the data frame from the search path. The last command uses the with() function to allow you to define the data frame name only once and then use only variable names in the commands inside the function. Note that, because there is only one command in the function, the brackets {} can be omitted. All three cross-tabulations return the same result:

                 Gender
MaritalStatus    F    M
            M 4745 5266
            S 4388 4085

Sometimes, you get numeric categorical variable values and you want to use character labels. The factor() function can help you here. For example, in the TM data frame, there is the BikeBuyer variable. For this variable, 0 means the person never purchased a bike and 1 means this person is a bike buyer. The following code shows you how to add labels to the numerical values:

table(TM$BikeBuyer, TM$Gender); 
TM$BikeBuyer <- factor(TM$BikeBuyer, 
                       levels = c(0,1), 
                       labels = c("No","Yes")); 
table(TM$BikeBuyer, TM$Gender);

The results are shown here. Note that, the second time, the labels for the values of the BikeBuyer variable are used:

     F    M
0 4536 4816
1 4597 4535
    
       F    M
No  4536 4816
Yes 4597 4535

You can easily get the metadata about your objects. Some useful functions that give you information about your objects include the following:

class(): This function returns the type of object
names(): This function returns the names of the components, such as variable names in a data frame
length(): This function returns the number of elements, for example, the number of variables in a data frame
dim(): This function returns the dimensionality of an object, for example, the number of rows and columns in a data frame
str(): This function gives details about the structure of an object

Here are examples of using these metadata functions:

class(TM); 
names(TM); 
length(TM); 
dim(TM); 
str(TM);

The results are as follows:

 [1] "data.frame"
    
 [1] "MaritalStatus"        "Gender"               "TotalChildren"       
 [4] "NumberChildrenAtHome" "Education"            "Occupation"          
 [7] "HouseOwnerFlag"       "NumberCarsOwned"      "CommuteDistance"     
[10] "Region"               "BikeBuyer"            "YearlyIncome"        
[13] "Age"                 
    
 [1] 13
    
 [1] 18484    13
    
'data.frame':     18484 obs. of  13 variables:
$ MaritalStatus   : Factor w/ 2 levels "M","S": 1 2 1 2 2 2 2 1 2 2 ...
$ Gender          : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 1 2 1 2 ...
$ TotalChildren   : int  2 3 3 0 5 0 0 3 4 0 ...

Note that the CustomerKey column is not listed among the 13 columns of the data frame, because when the data was imported, this column was set to row names. In addition, only string variables were converted to factors. In the abbreviated result of the last command, you can see that the TotalChildren is an integer and not a factor, although it can occupy only values from 0 to 5.

Many times, calculated variables are much more meaningful for an analysis than just the base ones you read from your data source. For example, in medicine, the body mass index (BMI), defined as weight divided by the square of the height, is much more meaningful than the base variables of height and weight it is derived from. You can add new variables to a data frame, recode continuous values to a list of discrete values, change the data type of variable values, and more.

The following example uses the within() function, which is similar to the with() function. It's just that it allows updates of a data frame, to add a new variable MartitalStatusInt, derived from the MaritalStatus as an integer. This variable tells us the number of additional people in the household of the case observed:

TM <- within(TM, { 
  MaritalStatusInt <- NA 
  MaritalStatusInt[MaritalStatus == "S"] <- 0 
  MaritalStatusInt[MaritalStatus == "M"] <- 1 
}); 
str(TM);

In the body of the function, firstly the new variable is defined as missing. Then, the MaritalStatus values are used to define the number of additional persons in the household. If the person in the case observed is married, then the value is 1; if the person is single, then 0. The last line of the code shows the new structure of the data frame. The abbreviated structure is:

'data.frame':    18484 obs. of  14 variables:
$ MaritalStatus    : Factor w/ 2 levels "M","S": 1 2 1 2 2 2 2 1 2 2...
$ MaritalStatusInt : num  1 0 1 0 0 0 0 1 0 0 ...

You can see that the new variable values are correct; however, the mode is defined as numeric. You can change the data type with one of the as.targettype() functions, where targettype() is a placeholder for the actual target type function, as the following example shows:

TM$MaritalStatusInt <- as.integer(TM$MaritalStatusInt); 
str(TM);

Now, the abbreviated structure of the TM data frame shows that the mode of the new column is integer.

In the next example, a new variable is added to the data frame, just as a simple calculation. The new variable, HouseholdNumber, is used to define the total number of people in the household of the person in the case observed. The calculation summarizes the number of children at home plus 1 if the person is married, plus 1 for the person herself/himself. Finally, the mode is changed to an integer:

TM$HouseholdNumber = as.integer( 
  1 + TM$MaritalStatusInt + TM$NumberChildrenAtHome); 
str(TM);

The structure of the data frame shows that the calculation is correct:

'data.frame':    18484 obs. of  15 variables:
$ MaritalStatus     : Factor w/ 2 levels "M","S": 1 2 1 2 2 2 2 1 2 2 ...
$ MaritalStatusInt  : int  1 0 1 0 0 0 0 1 0 0 ...
$ HouseholdNumber   : int  2 4 5 1 6 1 1 5 5 1 ...

On many occasions, you have to deal with missing values, denoted with a literal NA. R treats missing values as completely unknown. This influences the results of the calculations. For example, adding a missing value to a known integer produces a missing value. You have to decide how to deal with missing data. You can exclude rows with missing data completely, you can recode the missing values to a predefined value, or you can exclude the missing values from a single calculation. The following code defines a vector of six values; however, the last value is missing. You can use the is.na() function to check for each value, whether it is missing or not. Then, the code tries to calculate the mean value for all values of the vector. The last line of code tries to calculate the mean again, this time by disregarding the missing value by using the na.rm = TRUE option. This option is available in most numeric functions and simply removes missing values from the calculation:

x <- c(1,2,3,4,5,NA); 
is.na(x); 
mean(x); 
mean(x, na.rm = TRUE);

The results of the previous code are here:

[1] FALSE FALSE FALSE FALSE FALSE  TRUE
    
[1] NA
    
[1] 3

You can see that the is.na() function evaluated each element separately, and returned a vector of the same dimensionality as the vector checked for the missing values. The mean() function returned a missing value when, in the calculation, a missing value was present, and the result you might have expected when the missing values were removed.

Frequently, you need to merge two datasets, or to define a new dataset as a projection of an existing one. Merging is similar to joining two tables in SQL Server, and a projection means selecting a subset of variables only. The merge() function joins two data frames based on a common identification of each case. Of course, the identification must be unique. The following code shows how to do the projection:

TM = read.table("C:\SQL2017DevGuide\Chapter13_TM.csv", 
                sep=",", header=TRUE, 
                stringsAsFactors = TRUE); 
TM[1:3,1:3]; 
cols1 <- c("CustomerKey", "MaritalStatus"); 
TM1 <- TM[cols1]; 
cols2 <- c("CustomerKey", "Gender"); 
TM2 <- TM[cols2]; 
TM1[1:3, 1:2]; 
TM2[1:3, 1:2];

The code first re-reads the TM data frame, this time without using the CustomerKey column for the row names. This column must be available in the data frame, because this is the unique identification of each case. Then the code defines the columns for the two projection data frames and shows the first three rows of each new data frame, as you can see in the results of the last two commands:

  CustomerKey MaritalStatus
1       11000             M
2       11001             S
3       11002             M
    
  CustomerKey Gender
1       11000      M
2       11001      M
3       11002      M

Now, let's join the two new datasets:

TM3 <- merge(TM1, TM2, by = "CustomerKey"); 
TM3[1:3, 1:3];

The results show that the join was done correctly:

  CustomerKey MaritalStatus Gender
1       11000             M      M
2       11001             S      M
3       11002             M      M

A data frame is a matrix. Sort order is important. Instead of merging two data frames by columns, you can bind them by columns. However, you need to be sure that both data frames are sorted in the same way; otherwise you might bind variables from one case with variables from another case. The following code binds two data frames by columns:

TM4 <- cbind(TM1, TM2); 
TM4[1:3, 1:4];

The results show that, unlike the merge() function, the cbind() function did not use the CustomerKey for a common identification. It has blindly bound columns case by case, and preserved all variables from both source data frames. That's why the CustomerKey column appears twice in the result:

  CustomerKey MaritalStatus CustomerKey.1 Gender
1       11000             M         11000      M
2       11001             S         11001      M
3       11002             M         11002      M

You can also use the rbind() function to bind two data frames by rows. This is equal to the union of two rowsets in SQL Server. The following code shows how to filter a dataset by creating two new datasets, each one with two rows and two columns only. Then the code uses the rbind() function to unite both data frames:

TM1 <- TM[TM$CustomerKey < 11002, cols1]; 
TM2 <- TM[TM$CustomerKey > 29481, cols1]; 
TM5 <- rbind(TM1, TM2); 
TM5;

The results are here:

        CustomerKey      MaritalStatus
1           11000             M
2           11001             S
18483       29482             M
18484       29483             M

Finally, what happens if you want to bind two data frames by columns but you are not sure about the ordering? Of course, you can sort the data. The following code shows how to create a new data frame from the TM data frame, this time sorted by the Age column in descending order. Note the usage of the minus (-) sign in the order() function to achieve the descending sort:

TMSortedByAge <- TM[order(-TM$Age),c("CustomerKey", "Age")]; 
TMSortedByAge[1:5,1:2];

The result is shown here:

     CustomerKey Age
1726       12725  99
5456       16455  98
3842       14841  97
3993       14992  97
7035       18034  97

Table of Contents for Getting sorted with data management

Create new playlist

Sign In

Sign Up

Table of Contents for
Getting sorted with data management