Chapter 2. Managing and Understanding Data

A key early component of any machine learning project involves managing and understanding data. Although this may not be as gratifying as building and deploying models—the stages in which you begin to see the fruits of your labor—it is unwise to ignore this important preparatory work.

Any learning algorithm is only as good as its input data, and in many cases, the input data is complex, messy, and spread across multiple sources and formats. Because of this complexity, often the largest portion of effort invested in machine learning projects is spent on data preparation and exploration.

This chapter approaches these topics in three ways. The first section discusses the basic data structures R uses to store data. You will become very familiar with these structures as you create and manipulate datasets. The second section is practical, as it covers several functions that are useful to get data in and out of R. In the third section, methods for understanding data are illustrated while exploring a real-world dataset.

By the end of this chapter, you will understand:

  • How to use R's basic data structures to store and extract data
  • Simple functions to get data into R from common source formats
  • Typical methods to understand and visualize complex data

Since the way R thinks about data will define the way you work with data, it is helpful to know R's data structures before jumping directly into data preparation. However, if you are already familiar with R programming, feel free to skip ahead to the section on data preprocessing.

R data structures

There are numerous types of data structures across programming languages, each with strengths and weaknesses suited to particular tasks. Since R is a programming language used widely for statistical data analysis, the data structures it utilizes were designed with this type of work in mind.

The R data structures used most frequently in machine learning are vectors, factors, lists, arrays and matrices, and data frames. Each is tailored to a specific data management task, which makes it important to understand how they will interact in your R project. In the sections that follow, we will review their similarities and differences.

Vectors

The fundamental R data structure is the vector, which stores an ordered set of values called elements. A vector can contain any number of elements, but all of the elements must be of the same type of values. For instance, a vector cannot contain both numbers and text. To determine the type of vector v, use the typeof(v) command.

Several vector types are commonly used in machine learning: integer (numbers without decimals), double (numbers with decimals), character (text data), and logical (TRUE or FALSE values). There are also two special values: NULL, which is used to indicate the absence of any value, and NA, which indicates a missing value.
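As a minimal sketch of the types and special values just described, typeof() reports how each vector is stored, NA keeps its position in a vector, and NULL contributes nothing. The variable names are illustrative assumptions and are not part of the patient example.

```r
# Illustrative vectors, one per commonly used type
int_vec <- c(1L, 2L, 3L)        # the L suffix requests integer storage
dbl_vec <- c(1.5, 2.5, 3.5)
chr_vec <- c("a", "b", "c")
lgl_vec <- c(TRUE, FALSE, TRUE)

types <- c(typeof(int_vec), typeof(dbl_vec), typeof(chr_vec), typeof(lgl_vec))
print(types)

# NA marks a missing value but still occupies a position in the vector
with_na <- c(98.1, NA, 101.4)
print(length(with_na))
print(is.na(with_na))

# NULL represents the absence of a value and is dropped by c()
with_null <- c(98.1, NULL, 101.4)
print(length(with_null))
```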

Tip

Some R functions will report both integer and double vectors as numeric, while others will distinguish between the two. As a result, although all double vectors are numeric, not all numeric vectors are double type.
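The distinction in this tip can be checked directly. The following small sketch uses illustrative variable names: is.numeric() treats both storage types alike, whereas typeof() and class() tell them apart.

```r
int_vec <- c(1L, 2L, 3L)          # integer storage
dbl_vec <- c(98.1, 98.6, 101.4)   # double storage

# is.numeric() is TRUE for both integer and double vectors
both_numeric <- c(is.numeric(int_vec), is.numeric(dbl_vec))
print(both_numeric)

# typeof() distinguishes the two storage types
storage <- c(typeof(int_vec), typeof(dbl_vec))
print(storage)

# class() labels the double vector "numeric" and the integer vector "integer"
classes <- c(class(int_vec), class(dbl_vec))
print(classes)
```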

It is tedious to enter large amounts of data manually, but small vectors can be created by using the c() combine function. The vector can also be given a name using the <- arrow operator, which is R's way of assigning values, much like the = assignment operator is used in many other programming languages.

For example, let's construct several vectors to store the diagnostic data of three medical patients. We'll create a character vector named subject_name to store the three patient names, a double vector named temperature to store each patient's body temperature, and a logical vector named flu_status to store each patient's diagnosis (TRUE if he or she has influenza, FALSE otherwise). Let's have a look at the following code to create these three vectors:

> subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
> temperature <- c(98.1, 98.6, 101.4)
> flu_status <- c(FALSE, FALSE, TRUE)

Because R vectors are inherently ordered, the records can be accessed by counting the item's number in the set, beginning at one, and surrounding this number with square brackets (that is, [ and ]) after the name of the vector. For instance, to obtain the body temperature for patient Jane Doe (the second element in the temperature vector) simply type:

> temperature[2]
[1] 98.6

R offers a variety of convenient methods to extract data from vectors. A range of values can be obtained using the colon operator (:). For instance, to obtain the body temperature of Jane Doe and Steve Graves, type:

> temperature[2:3]
[1]  98.6 101.4

Items can be excluded by specifying a negative item number. To exclude Jane Doe's temperature data, type:

> temperature[-2]
[1]  98.1 101.4

Finally, it is also sometimes useful to specify a logical vector indicating whether each item should be included. For example, to include the first two temperature readings but exclude the third, type:

> temperature[c(TRUE, TRUE, FALSE)]
[1] 98.1 98.6
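In practice, the logical vector is usually produced by a comparison rather than typed by hand. The sketch below assumes an illustrative threshold of 100 degrees, which is not part of the original example, and uses a hypothetical variable named has_fever.

```r
temperature <- c(98.1, 98.6, 101.4)

# The comparison is applied to every element and yields a logical vector
has_fever <- temperature > 100   # 100 is an assumed, illustrative threshold
print(has_fever)

# The logical vector then selects only the elements where it is TRUE
fevers <- temperature[has_fever]
print(fevers)
```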

As you will see shortly, the vector provides the foundation for many other R data structures. Therefore, the knowledge of the various vector operations is crucial to work with data in R.

Factors

Recall from Chapter 1, Introducing Machine Learning, that features representing a characteristic with categories of values are known as nominal. Although it is possible to use a character vector to store nominal data, R provides a data structure specifically for this purpose. A factor is a special case of a vector that is used solely to represent categorical or ordinal variables. In the medical dataset we are building, we might use a factor to represent gender, because it uses two categories: MALE and FEMALE.

Why not use character vectors? An advantage of factors is that the category labels are stored only once. For instance, rather than storing MALE, MALE, FEMALE, the computer can store 1, 1, 2, which reduces the size of memory needed to store the same information. Additionally, many machine learning algorithms treat nominal and numeric data differently. Coding as factors is often needed to inform an R function to treat categorical data appropriately.

Tip

A factor should not be used for character vectors that are not truly categorical. If a vector stores mostly unique values like names or identification strings, keep it as a character vector.

To create a factor from a character vector, simply apply the factor() function. For example:

> gender <- factor(c("MALE", "FEMALE", "MALE"))
> gender
[1] MALE   FEMALE MALE
Levels: FEMALE MALE

Notice that when the gender factor was displayed, R printed additional information about its levels. The levels comprise the set of possible categories the factor could take, in this case MALE or FEMALE.
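To see how the labels and the underlying codes relate, the following sketch inspects the gender factor with levels() and as.integer(). By default R sorts the levels alphabetically, so FEMALE receives code 1 and MALE receives code 2.

```r
gender <- factor(c("MALE", "FEMALE", "MALE"))

# The category labels are stored once, as the factor's levels
gender_levels <- levels(gender)
print(gender_levels)

# Each element is stored as an integer code pointing into the levels
gender_codes <- as.integer(gender)
print(gender_codes)
```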

When we create factors, we can add additional levels that may not appear in the data. Suppose we add another factor for the blood type, as shown in the following example:

> blood <- factor(c("O", "AB", "A"),
                  levels = c("A", "B", "AB", "O"))
> blood[1:2]
[1] O  AB
Levels: A B AB O

Notice that when we defined the blood factor for the three patients, we specified an additional vector of four possible blood types using the levels parameter. As a result, even though our data included only types O, AB, and A, all the four types are stored with the blood factor as indicated by the output. Storing the additional level allows for the possibility of adding data with the other blood types in the future. It also ensures that if we were to create a table of blood types, we would know that the B type exists, despite it not being recorded in our data.
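The point about tables can be sketched with the table() function, which counts the occurrences of each level. Because B was declared as a level, it appears in the result with a count of zero.

```r
blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))

# table() reports every declared level, including those with no observations
blood_counts <- table(blood)
print(blood_counts)
```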

The factor data structure also allows us to include information about the order of a nominal variable's categories, which provides a convenient way to store ordinal data. For example, suppose we have data on the severity of a patient's symptoms coded in an increasing level of severity from mild, to moderate, to severe. We indicate the presence of ordinal data by providing the factor's levels in the desired order, listed in ascending order from lowest to highest, and setting the ordered parameter to TRUE, as shown:

> symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                     levels = c("MILD", "MODERATE", "SEVERE"),
                     ordered = TRUE)

The resulting symptoms factor now includes information about the order we requested. Unlike our prior factors, the levels of this factor are separated by < symbols to indicate the presence of a sequential order from mild to severe:

> symptoms
[1] SEVERE   MILD     MODERATE
Levels: MILD < MODERATE < SEVERE

A helpful feature of ordered factors is that logical tests work as you would expect. For instance, we can test whether each patient's symptoms are greater than moderate:

> symptoms > "MODERATE"
[1]  TRUE FALSE FALSE

Machine learning algorithms capable of modeling ordinal data expect ordered factors, so be sure to code your data accordingly.
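Beyond the greater-than test, the stored order also drives other comparisons as well as sorting. The following is a brief sketch using the same symptoms factor; the result variable names are illustrative.

```r
symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                   levels = c("MILD", "MODERATE", "SEVERE"),
                   ordered = TRUE)

# Comparisons follow the level order, not alphabetical order
at_least_moderate <- symptoms >= "MODERATE"
print(at_least_moderate)

# sort() arranges the values from the lowest level to the highest
sorted_symptoms <- sort(symptoms)
print(sorted_symptoms)

# max() is defined for ordered factors and returns the highest level present
worst <- max(symptoms)
print(worst)
```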

Lists

A list is a data structure, much like a vector, in that it is used for storing an ordered set of elements. However, where a vector requires all its elements to be the same type, a list allows different types of elements to be collected. Due to this flexibility, lists are often used to store various types of input and output data and sets of configuration parameters for machine learning models.

To illustrate lists, consider the medical patient dataset we have been constructing, with the data for three patients stored in six vectors. If we want to display all the data on John Doe (subject 1), we would need to enter six R commands:

> subject_name[1]
[1] "John Doe"
> temperature[1]
[1] 98.1
> flu_status[1]
[1] FALSE
> gender[1]
[1] MALE
Levels: FEMALE MALE
> blood[1]
[1] O
Levels: A B AB O
> symptoms[1]
[1] SEVERE
Levels: MILD < MODERATE < SEVERE

This seems like a lot of work to display one patient's medical data. The list structure allows us to group all of the patient's data into one object that we can use repeatedly.

Similar to creating a vector with c(), a list is created using the list() function, as shown in the following example. One notable difference is that when a list is constructed, each component in the sequence is almost always given a name. The names are not technically required, but allow the list's values to be accessed later on by name rather than by numbered position. To create a list with named components for all of the first patient's data, type the following:

> subject1 <- list(fullname = subject_name[1],
                   temperature = temperature[1],
                   flu_status = flu_status[1],
                   gender = gender[1],
                   blood = blood[1],
                   symptoms = symptoms[1])

This patient's data is now collected in the subject1 list:

> subject1
$fullname
[1] "John Doe"

$temperature
[1] 98.1

$flu_status
[1] FALSE

$gender
[1] MALE
Levels: FEMALE MALE

$blood
[1] O
Levels: A B AB O

$symptoms
[1] SEVERE
Levels: MILD < MODERATE < SEVERE

Note that the values are labeled with the names we specified in the preceding command. However, a list can still be accessed using methods similar to a vector. To access the temperature value, use the following command:

> subject1[2]
$temperature
[1] 98.1

The result of using vector-style operators on a list object is another list object, which is a subset of the original list. For example, the preceding code returned a list with a single temperature component. To return a single list item in its native data type, use double brackets ([[ and ]]) when attempting to select the list component. For example, the following returns a numeric vector of length one:

> subject1[[2]]
[1] 98.1

For clarity, it is often easier to access list components directly, by appending a $ and the component's name to the name of the list, as follows:

> subject1$temperature
[1] 98.1

Like the double bracket notation, this returns the list component in its native data type (in this case, a numeric vector of length one).
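The difference between the three access styles can be confirmed by inspecting what each one returns. This sketch uses a shortened version of the subject1 list with literal values, and the result variable names are illustrative.

```r
# A shortened version of subject1, built from literal values
subject1 <- list(fullname = "John Doe",
                 temperature = 98.1,
                 flu_status = FALSE)

sub_list    <- subject1[2]            # single brackets: a list of length one
inner_value <- subject1[[2]]          # double brackets: the element itself
by_name     <- subject1$temperature   # $ notation: also the element itself

print(class(sub_list))
print(class(inner_value))
print(identical(inner_value, by_name))
```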

Tip

Accessing the value by name also ensures that the correct item is retrieved, even if the order of the list's elements is changed later on.

It is possible to obtain several items in a list by specifying a vector of names. The following returns a subset of the subject1 list, which contains only the temperature and flu_status components:

> subject1[c("temperature", "flu_status")]
$temperature
[1] 98.1

$flu_status
[1] FALSE

Entire datasets could be constructed using lists and lists of lists. For example, you might consider creating a subject2 and subject3 list, and combining these into a single list object named pt_data. However, constructing a dataset in this way is common enough that R provides a specialized data structure specifically for this task.
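Before turning to that structure, the list-of-lists approach can be sketched as follows. This is only an illustration under stated assumptions: it uses three of the six patient vectors for brevity, the name pt_list is hypothetical, and lapply() and sapply() are used as one convenient way to build and query the nested list.

```r
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature  <- c(98.1, 98.6, 101.4)
flu_status   <- c(FALSE, FALSE, TRUE)

# Build one list per patient, then collect them in an outer list
pt_list <- lapply(seq_along(subject_name), function(i) {
  list(fullname = subject_name[i],
       temperature = temperature[i],
       flu_status = flu_status[i])
})

# Reaching a single value requires two levels of access
second_temp <- pt_list[[2]]$temperature
print(second_temp)

# Gathering one field across all patients requires looping over the outer list
all_temps <- sapply(pt_list, function(p) p$temperature)
print(all_temps)
```

The extra steps needed to reach a value or to collect one field across patients suggest why a dedicated tabular structure is more convenient.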

Data frames

By far, the most important R data structure utilized in machine learning is the data frame, a structure analogous to a spreadsheet or database, since it has both rows and columns of data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the same number of values. Because the data frame is literally a list of vector type objects, it combines aspects of both vectors and lists.

Let's create a data frame for our patient dataset. Using the patient data vectors we created previously, the data.frame() function combines them into a data frame:

> pt_data <- data.frame(subject_name, temperature, flu_status,
      gender, blood, symptoms, stringsAsFactors = FALSE)

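You might notice something new in the preceding code. We included an additional parameter: stringsAsFactors = FALSE. In versions of R before 4.0.0, if we do not specify this option, R will automatically convert every character vector to a factor. Since R 4.0.0 the default is FALSE, so stating the option explicitly keeps the behavior the same across versions.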

This feature is occasionally useful, but also sometimes unwarranted. Here, for example, the subject_name field is definitely not categorical data, as names are not categories of values. Therefore, setting the stringsAsFactors option to FALSE allows us to convert character vectors to factors only where it makes sense for the project.
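One way to confirm the effect of this option is to check the class of each column after the data frame is built. The sketch below uses four of the patient vectors, and the name col_classes is illustrative.

```r
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature  <- c(98.1, 98.6, 101.4)
flu_status   <- c(FALSE, FALSE, TRUE)
gender       <- factor(c("MALE", "FEMALE", "MALE"))

pt_data <- data.frame(subject_name, temperature, flu_status, gender,
                      stringsAsFactors = FALSE)

# subject_name stays a character vector, while gender remains a factor
# because it was explicitly created as one
col_classes <- sapply(pt_data, class)
print(col_classes)
```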

When we display the pt_data data frame, we see that the structure is quite different from the data structures we worked with previously:

> pt_data
  subject_name temperature flu_status gender blood symptoms
1     John Doe        98.1      FALSE   MALE     O   SEVERE
2     Jane Doe        98.6      FALSE FEMALE    AB     MILD
3 Steve Graves       101.4       TRUE   MALE     A MODERATE

Compared to the one-dimensional vectors, factors, and lists, a data frame has two dimensions and is displayed in matrix format. This particular data frame has one column for each vector of patient data and one row for each patient. In machine learning terms, the data frame's columns are the features or attributes and the rows are the examples.

To extract entire columns (vectors) of data, we can take advantage of the fact that a data frame is simply a list of vectors. Similar to lists, the most direct way to extract a single element is by referring to it by name. For example, to obtain the subject_name vector, type:

> pt_data$subject_name
[1] "John Doe"     "Jane Doe"     "Steve Graves"

Also similar to lists, a vector of names can be used to extract several columns from a data frame:

> pt_data[c("temperature", "flu_status")]
  temperature flu_status
1        98.1      FALSE
2        98.6      FALSE
3       101.4       TRUE

When we access the data frame in this way, the result is a data frame containing all the rows of data for all the requested columns. Alternatively, the pt_data[2:3] command will also extract the temperature and flu_status columns. However, requesting the columns by name results in clear and easy-to-maintain R code that will not break if the data frame is restructured in the future.
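The two extraction styles return different kinds of objects, which matters when passing the result to other functions. As a small sketch with a shortened pt_data and illustrative result names: single brackets with a name return a data frame, while $ and double brackets return the column vector itself.

```r
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

temp_df  <- pt_data["temperature"]     # a one-column data frame
temp_vec <- pt_data[["temperature"]]   # the numeric vector itself

print(class(temp_df))
print(class(temp_vec))
print(identical(temp_vec, pt_data$temperature))
```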

To extract values in the data frame, methods like those for accessing values in vectors are used. However, there is an important exception. Because the data frame is two-dimensional, both the desired rows and columns to be extracted must be specified. Rows are specified first, followed by a comma and then the columns in a format like this: [rows, columns]. As with vectors, rows and columns are counted beginning at one.

For instance, to extract the value in the first row and second column of the patient data frame (the temperature value for John Doe), use the following command:

> pt_data[1, 2]
[1] 98.1

If you would like more than a single row or column of data, specify vectors for the rows and columns desired. The following command will pull data from the first and third rows and the second and fourth columns:

> pt_data[c(1, 3), c(2, 4)]
  temperature gender
1        98.1   MALE
3       101.4   MALE

To extract all the rows or columns, simply leave the row or column portion blank. For example, to extract all the rows of the first column:

> pt_data[, 1]
[1] "John Doe"     "Jane Doe"     "Steve Graves"

To extract all the columns of the first row, use the following command:

> pt_data[1, ]
  subject_name temperature flu_status gender blood symptoms
1     John Doe        98.1      FALSE   MALE     O   SEVERE

To extract everything, use the following command:

> pt_data[ , ]
  subject_name temperature flu_status gender blood symptoms
1     John Doe        98.1      FALSE   MALE     O   SEVERE
2     Jane Doe        98.6      FALSE FEMALE    AB     MILD
3 Steve Graves       101.4       TRUE   MALE     A MODERATE

Other methods to access values in lists and vectors can also be used to retrieve data frame rows and columns. For example, columns can be accessed by name rather than position, and negative signs can be used to exclude rows or columns of data. Therefore, the following command:

> pt_data[c(1, 3), c("temperature", "gender")]

is equivalent to:

> pt_data[-2, c(-1, -3, -5, -6)]
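The equivalence can be checked programmatically. The following sketch rebuilds the full pt_data data frame and compares the two results with all.equal(); the names by_name, by_exclusion, and same_result are illustrative.

```r
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature  <- c(98.1, 98.6, 101.4)
flu_status   <- c(FALSE, FALSE, TRUE)
gender       <- factor(c("MALE", "FEMALE", "MALE"))
blood        <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))
symptoms     <- factor(c("SEVERE", "MILD", "MODERATE"),
                       levels = c("MILD", "MODERATE", "SEVERE"),
                       ordered = TRUE)

pt_data <- data.frame(subject_name, temperature, flu_status,
                      gender, blood, symptoms, stringsAsFactors = FALSE)

# Select rows 1 and 3 by position and two columns by name
by_name <- pt_data[c(1, 3), c("temperature", "gender")]

# Reach the same cells by excluding row 2 and columns 1, 3, 5, and 6
by_exclusion <- pt_data[-2, c(-1, -3, -5, -6)]

same_result <- isTRUE(all.equal(by_name, by_exclusion))
print(same_result)
```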

To become more familiar with data frames, try practicing similar operations with the patient dataset, or even better, use data from one of your own projects. These types of operations are crucial for much of the work we will do in the upcoming chapters.

Matrices and arrays

In addition to data frames, R provides other structures that store values in a tabular form. A matrix is a data structure that represents a two-dimensional table with rows and columns of data. Like vectors, R matrices can contain only one type of data, although they are most often used for mathematical operations and, therefore, typically store only numeric data.

To create a matrix, simply supply a vector of data to the matrix() function along with a parameter specifying the number of rows (nrow) or number of columns (ncol). For example, to create a 2 x 2 matrix storing the numbers one through four, we can use the nrow parameter to request the data to be divided into two rows:

> m <- matrix(c(1, 2, 3, 4), nrow = 2)
> m
     [,1] [,2]
[1,]    1    3
[2,]    2    4

This is equivalent to the matrix produced using ncol = 2:

> m <- matrix(c(1, 2, 3, 4), ncol = 2)
> m
     [,1] [,2]
[1,]    1    3
[2,]    2    4

You will notice that R loaded the first column of the matrix first before loading the second column. This is called column-major order, and is R's default method for loading matrices.

Tip

To override this default setting and load a matrix by rows, set the parameter byrow = TRUE when creating the matrix.
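As a short sketch of this option, the same four values can be loaded both ways and compared. The names m_col and m_row are illustrative.

```r
# Default column-major loading: values fill the first column, then the second
m_col <- matrix(c(1, 2, 3, 4), nrow = 2)
print(m_col)

# With byrow = TRUE, values fill the first row, then the second
m_row <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
print(m_row)

# For this square example, loading by rows gives the transpose of the default
print(identical(m_row, t(m_col)))
```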

To illustrate this further, let's see what happens if we add more values to the matrix.

With six values, requesting two rows creates a matrix with three columns:

> m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Requesting two columns creates a matrix with three rows:

> m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

As with data frames, values in matrices can be extracted using [row, column] notation. For instance, m[1, 1] will return the value 1 and m[3, 2] will extract 6 from the m matrix. Additionally, entire rows or columns can be requested:

> m[1, ]
[1] 1 4
> m[, 1]
[1] 1 2 3

Closely related to the matrix structure is the array, which is a multidimensional table of data. Where a matrix has rows and columns of values, an array has rows, columns, and any number of additional layers of values. Although we will occasionally use matrices in the upcoming chapters, the use of arrays is outside the scope of this book.
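For readers curious about what an additional layer looks like, here is a minimal sketch using the array() function. The dimensions and the name a are arbitrary choices for illustration.

```r
# An array with 2 rows, 3 columns, and 4 layers, filled in column-major order
a <- array(1:24, dim = c(2, 3, 4))
print(dim(a))

# Indexing extends the [row, column] notation with a third position
value <- a[1, 2, 3]
print(value)

# Leaving rows and columns blank extracts an entire layer as a matrix
first_layer <- a[, , 1]
print(first_layer)
```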
