Chapter 14

Predictive Modeling with R

IN THIS CHAPTER

Doing basic R programming

Setting up your programming environment

Manipulating data

Developing a regression model

Developing classification tree models

A book covering the major facets of predictive analytics isn't complete unless it covers the R programming language. Our goal is to get you up and running as quickly as possible. That goal entails getting you started making predictions and experimenting with predictive analytics, using standard tools such as R and the algorithms that data scientists and statisticians use to build predictive models.

So do you have to know how to program to create predictive models? We would answer, “probably not, but it surely helps.” Relax. We think you'll have fun learning R. Granted, this chapter is pretty high-level, so you may read it just to boost your understanding of how data scientists and statisticians use R.

In an enterprise environment, you'll most likely use commercial tools available from industry vendors. Getting familiar with a free, open-source, widely used, powerful tool like R prepares you to use the commercial tools with ease. By that point, you'll have gotten a big dose of the terminology, understand how to handle the data, and know all the steps of predictive modeling. After doing all those steps “by hand,” you'll be well prepared to use the commercial tools.

Open-source R has memory and computational limits for enterprise-level big data analysis. Even so, open-source R is more than capable of handling most datasets for learning on a standard personal computer. On a powerfully equipped machine, it can handle datasets up to a few gigabytes. Beyond a few gigabytes, the computer may either run out of memory or take a very long time to run the machine learning algorithms.

Commercial enterprise tools with R aim to smooth out all the complexities of storing, analyzing, and processing big data; they even choose the algorithms for you. Learning some R helps you understand what the software tools are doing under the hood. That may give you the confidence to experiment even more by adjusting the default values and tweaking stuff here and there to see how it changes your predictive model.

R is an easy language with which to start learning programming. It has most, if not all, of the same features you'll find in the programming languages commonly used for commercial software — Java, C++, Python, and such. So if you already have a programming language under your belt, this should be a cakewalk.

If R is the first programming language you've been exposed to, it'll still be easy, but it'll require some time spent playing around with it. But that's the beauty of the R language: You can play around with it and learn as you go. There is no boilerplate code that you have to remember to put in your code to make it work. You don't have to compile the program for it to work. It just works.

R is an interpreted language, which means you can run it interactively. You can run each line of code one by one and get an instant output (provided the code isn't for an intensive operation, and most operations in R shouldn't be intensive).
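For example, typing a simple expression at the console prompt evaluates it immediately:

```r
# Each line below evaluates as soon as you press Enter at the console
1 + 1       # prints: [1] 2
sqrt(16)    # prints: [1] 4
```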

We also introduce you to the RStudio integrated development environment (IDE). Using an IDE to write code will give you further help in learning R. IDEs have a vast number of features that even the most seasoned developers depend on. Using a commercial IDE for a software development class may seem over the top to seasoned programmers who prefer a plain text editor like vi or emacs with programming-language support and syntax highlighting. Professional software development, however, benefits from the use of an IDE because the volume of production-level code is enormous. Even if you wrote the entire code base, there's no way to fully understand and navigate all that code without an IDE. An IDE also lets you do things much faster and more easily.

Programming in R

R is a programming language originally written for statisticians to do statistical analysis. It's open-source software, used extensively in academia to teach such disciplines as statistics, bio-informatics, and economics. From its humble beginnings, it has since been extended to do data modeling, data mining, and predictive analysis.

R has a very active community; free code contributions are made constantly and consistently. One of the benefits of using an open-source tool such as R is that most of the data analysis that you'll want to do has already been done by someone. Code samples are posted on many message boards and by universities. If you're stuck on some problematic code, simply post a question on a message board (such as Stack Exchange or Stack Overflow) and you'll have an answer in no time.

Because R is free to use, it's the perfect tool to use to build a rapid prototype to show management the benefits of predictive analytics. You don't have to ask management to buy anything in order to get started right away. Any one of your data scientists, business analysts, statisticians, or software engineers can do the prototype without any further investment in software.

Therefore R can be an inexpensive way to experiment with predictive analytics without having to purchase enterprise software. After you prove that predictive analytics can add (or is adding) value, you should be able to convince management to consider getting a commercial-grade tool for your newly minted data-science team.

Installing R

Installing R is an easy process that takes less than thirty minutes. Most of the default settings can be accepted during the installation process. You can install R by downloading the installation program for Windows and other operating systems from the R project website at https://cran.r-project.org. From there, you can choose any one of the download mirrors to get the R installation binary. The mirror for this example is https://cran.rstudio.com.

This chapter guides you through the installation process for the Windows 10 operating system and the latest R release at the time of writing (version R-3.3.1). After you get to the download mirror, look for the download link to get the file. After you've downloaded the file, just double-click it to begin the installation process.

Here is a direct link to the download page:

https://cran.rstudio.com/bin/windows/base/

After you install R, there should be two icons for R on your desktop: a 32-bit version and a 64-bit version. Choose the appropriate version for your computer. You can find the system type of your machine by going to the system settings and looking at the “About” section. The example computer has a 64-bit operating system with an x64-based processor, so we click the R x64 3.3.1 icon to confirm that the installation was successful.

If your installation succeeded, you should see the R Console, as shown in Figure 14-1.


FIGURE 14-1: RGui console.

Installing RStudio

After you've finished the R installation process, you may install RStudio. Installing the RStudio IDE is just as easy as installing R. You can download RStudio Desktop open-source edition from their website at www.rstudio.com. You'll want to install the desktop version appropriate for your operating system (for example, RStudio version 0.99.902 for Windows 10). After you've downloaded the file, just double-click it to begin the installation process.

Here is a direct link to the download page:

www.rstudio.com/products/rstudio/download

If your installation succeeded, the RStudio program should have been added to the Start menu. If you can't find it, use the search tool to look for “RStudio.”

Getting familiar with the environment

RStudio is a graphical user interface for developing R programs. The default interface (the way it looks when you first start the program) has three window panes. Go to the file menu and create a new R script by choosing File -> New File -> R Script, or simply press Ctrl+Shift+N; this opens a fourth pane, and you'll see a screen similar to what is shown in Figure 14-2. You'll use all four panes frequently; we describe what each one is used for and how to use it.

  • The top-left window is your script window.

    This is where you can copy and paste R code. You can run the code line-by-line or in chunks by highlighting the lines you want to execute. The script window is also where you can view the values of data frames. When you click a data frame from the workspace pane, it will open a new tab in the script pane with the data frame values.

  • The bottom-left window is your console window.

    This is where you can type your R code one line at a time. Each line executes immediately after you press Enter, unlike the script window, where you have to click the Run button to execute the line(s) of code. The output (if there is any) is printed on the next line right after the command finishes execution.

  • The top-right window is your workspace and history window.

    It has two tabs:

    • The History tab stores the history of all the code you've executed in the current session.
    • The Environment tab lists all the variables in the memory. Here you can click the variables to see their values and (if you so choose) load datasets interactively.
  • The bottom-right window is where you'll find five tabs of interest:
    • A Help tab offers documentation such as descriptions of functions.
    • The Packages tab shows all the packages installed and available to load by your program. The checked packages are the ones that have been loaded for your program to use. You can search and install new packages here.
    • The Plots tab is where the output of any plots will appear.
    • The Files tab is your file explorer inside RStudio.
    • The Viewer tab is used to display local web content. It’s like a web browser that can display either local web files or local web applications.

FIGURE 14-2: RStudio in the default view of the graphical user interface.

Learning just a bit of R

It doesn't take much to program something useful in R. The nice thing about R is that you can learn just what you need at the moment. And if you need something more, you can learn that, as you need it. So, at minimum, you'll need to know how to:

  • Assign values to variables
  • Do operations on those variables
  • Access and manipulate data types and structures
  • Call a function to do something

The upcoming subsections detail these operations. Everything else you can learn as you go.
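As a preview, all four of those minimum operations fit into a few lines (the variable names are our own):

```r
x <- 42            # assign a value to a variable
y <- x / 2         # do an operation on that variable
v <- c(x, y)       # store both values in a data structure (a vector)
max(v)             # call a function on it; prints: [1] 42
```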

Assigning variables

Variables can be assigned in several ways in R. The convention for R is to use the less-than sign (<) and the minus sign (-) together, making an arrow-like sign (<-). Another way (standard for other programming languages) is to use just the equal sign (=). One other feature of R is that you may assign in either direction, using the less-than or greater-than symbol with the minus sign to make an arrow pointing either way (<- or ->).

As described in the preceding paragraph, you can assign a value to a variable in one of three ways:

> x <- "hello there"
> x = "hello there"
> "hello there" -> x

The preceding three lines of code all do the same thing; knowing all the different ways will come in handy because you'll surely come across all of them when you start reading other people's code. As mentioned previously, the preferred way in R is the first one shown.

To print out the value of the variable, simply type the variable and then press Enter, like this:

> x
[1] "hello there"

In the RStudio console pane, a line that has a number between [ ] symbols shows you the output from the execution of the preceding line(s) of code.

technicalstuff The R interpreter has a leading > sign as the command prompt in the console window. You can't copy and paste whole sections of code that include the > prompt into that window, because the prompt characters will be interpreted as part of the code and R will output an error. Use the script window of RStudio if you want to copy and paste sections of code.

Operating on variables

For most of its arithmetic and logical operators, R uses syntax that is standard to most other programming languages:

  • These are examples of the arithmetic operators in R:

    > w <- 5 + 5 # addition
    > x <- w * 5 # multiplication
    > y <- x / 5 # division
    > z <- y - 5 # subtraction

    The # symbol is the start of a comment; the R interpreter ignores comments. You can print out the values by using the concatenate function, like this:

    > c(w,x,y,z)
    [1] 10 50 10 5

    A function is called when its name is followed by its parameters or arguments inside enclosing parentheses. In the preceding example, c is the name of the concatenate function, and w, x, y, z are its arguments.

  • These are examples of the logical operators in R:

    w == y # is equal to
    x > z # is greater than
    y >= 10 # is greater than or equal to
    z < w # is less than
    10 <= y # is less than or equal to

    All these comparisons evaluate to TRUE values; you can see the results by executing them:

    > w == y
    [1] TRUE
    > x > z
    [1] TRUE
    > y >= 10
    [1] TRUE
    > z < w
    [1] TRUE
    > 10 <= y
    [1] TRUE

Working with data types and structures

Data types are sometimes confused with data structures. Each variable in the program's memory has a data type. Sure, you can get away with having several individual variables in your program and still keep things manageable. But that approach won't work so well if you have hundreds (or thousands) of variables; you'd have to give every variable a name so you can access it. It's more efficient to store all those values in a logical collection.

For example, if you have a million customers, you can use the vector data structure (described in the “Data structures” section later in this chapter) to store the names of all your customers and name the vector customer. That's one variable (customer), with a character data type, that stores all million customer names. You can use array notation to access each customer. (For more about array notation, see the upcoming discussion of data structures.)
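Scaled down from a million names to three, that idea looks like this (the names are invented for illustration):

```r
# One character vector holds all the customer names
customer <- c("Ada Lopez", "Ben Kim", "Cara Singh")
customer[2]        # array notation retrieves one name; prints: [1] "Ben Kim"
length(customer)   # prints: [1] 3
```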

DATA TYPES

Like other full-fledged programming languages, R offers many data types and data structures. You'll be working with just a handful of them in this chapter. There is no need to specify the type that you're assigning to a variable; the interpreter will do that for you. However, you can specify or convert the type if the need arises; this is called casting. You'll see examples of casting when you load the dataset in a later section of this chapter. The data types that we'll be dealing with in this chapter are as follows:

  • Numerical: These are your typical decimal numbers. These are called floats (short for floating-point numbers) or doubles in other languages.
  • Characters: These are your strings formed with combinations of letters, characters, and numbers. They aren't meant to have any numerical meaning. These are called strings in other languages.
  • Logical: TRUE or FALSE. Always capitalize these values in R. These values are called Booleans in other languages.

technicalstuff Comparing a character string of digits to a number results in the interpreter converting the number into a character string and then comparing the two strings.

Examples of data types are as follows:

> i <- 10 # numeric
> j <- 10.0 # numeric
> k <- "10" # character
> m <- i == j # logical
> n <- i == k # logical

After you execute those lines of code, you can find out their values and types by using the structure function str(). It displays the data type and value of the variable. That operation looks like this:

> str(i)
num 10
> str(j)
num 10
> str(k)
chr "10"
> str(m)
logi TRUE
> str(n)
logi TRUE

technicalstuff The expression in the n assignment is an example of the interpreter temporarily converting the numeric i into a character string so that it can be compared with the character k.

DATA STRUCTURES

R needs a place to store groups of values in order to work with them efficiently. These places are called data structures. A real-life example of this concept is a parking garage: It's a structure that stores automobiles efficiently. It's designed to park as many automobiles as possible, and it allows automobiles to enter and exit the structure efficiently (in theory, at least). Also, no objects besides automobiles should be parked in a parking structure. The data structures we use in this chapter are as follows:

  • Vectors: Vectors store a set of values of a single data type. Think of a vector as a weekly pillbox: Each compartment can store only a certain type of object. After you put pills in one compartment, every other compartment can hold only pills (zero or more); you can't put coins in that same box — you'd use a different “pillbox” (vector) for coins. Likewise, after you store a number in a vector, all the other values must also be numbers; otherwise, the interpreter converts all your numbers to characters.
  • Matrices: A matrix looks like an Excel spreadsheet: Essentially, it's a table consisting of rows and columns. The data populates the empty cells in row or column order, which you specify when you create the matrix.

    remember All columns in a matrix must have the same data type.

  • Data frames: A data frame is similar to a matrix, except a data frame's columns can contain different data types. The datasets used in predictive modeling are loaded into data frames and stored there for use in the model.
  • Factors: A factor is like a vector with a limited number of distinct values; those distinct values are referred to as its levels. You can use factors to treat a column that has a limited and known set of values as categorical. By default, character data is loaded into data frames as factors.
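Here is a minimal sketch of creating each of these four structures (the variable names and values are our own examples):

```r
v <- c(10, 20, 30, 40, 50)                # vector: one data type only
m <- matrix(1:6, nrow=2, ncol=3)          # matrix: 2 rows x 3 columns, filled by column
df <- data.frame(id=c(1, 2),              # data frame: columns of different types
                 name=c("a", "b"))
f <- factor(c("low", "high", "low"))      # factor: two levels, "high" and "low"
```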

remember You access vectors, matrices, and data frames by using array notation. For example, you would type v[5] to access the fifth element of vector v. For a two-dimensional matrix and data frame, you put in the row number and column number, separated by a comma, inside the square brackets. For example, you type m[2,3] to access the second row, third column value for matrix m.

remember R vectors, matrices, and data frames start indexing at 1. Python and many other programming languages index at 0.
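Putting the array notation and the start-at-1 rule together, with small example vectors and matrices of our own:

```r
v <- c("alpha", "beta", "gamma", "delta", "epsilon")
v[5]                 # fifth element; prints: [1] "epsilon" (indexing starts at 1)

m <- matrix(1:12, nrow=3, ncol=4)   # filled by column: the first column is 1, 2, 3
m[2, 3]              # second row, third column; prints: [1] 8
m[1, ]               # leaving an index empty selects the whole first row
```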

Data structures are an advanced subject in computer science. For now, we're sticking to the practical. Just remember that data structures were built to store specific types of data and they have functions for data insertion, deletion, and retrieval.

Calling a function

Functions are lines of code that do something useful and concrete. Because these operations are often repeated, they are usually saved with a name so you can call (use) them again. Typically a function takes an input parameter, does something with it, and outputs a value. You save functions in your own programs or libraries for future use.

There are many built-in functions in the R programming language for developers to use. You've already used the str() function to find out the structure of a variable. It can be used to find the structure of data objects like data frames, vectors, and matrices, not just simple data types like numerical, characters, and logicals. The following section shows an example of using the structure function on a data frame.

You can use the built-in help function to find out more about R commands and functions. In the RStudio console, typing in help with the parameter of str will bring up documentation on the str function in the bottom-right pane. The command looks like this:

> help(str)

remember Calling a function always takes the same form: the function name followed by parentheses. (Inside the parentheses is the parameter list, which may or may not be empty.)

Whenever you start to suspect that the operation you're doing must be common, check to see whether a function exists already that does it for you. Getting to know all the built-in functions in R will greatly improve your productivity and your code. Having a cheat sheet or reference card of all the built-in functions is handy when you're learning a new language. Here is a link to an R reference card:

http://cran.r-project.org/doc/contrib/Short-refcard.pdf

Making Predictions Using R

We use R to make three predictive models in this chapter. All the R commands are entered into the RStudio console (the bottom-left pane onscreen) for execution. The first predictive model uses a regression algorithm to predict an automobile's fuel economy as miles per gallon. The second and third predictive models use decision tree and random forest classifiers to predict which category of wheat a particular seed belongs to.

Predicting using regression

A crucial task in predictive analytics is to predict the future value of something — such as the value of a house, the price of a stock at a future date, or sales you want to forecast for a product. You can do such tasks by using regression analysis — a statistical method that investigates the relationship between variables.

Introducing the data

The dataset we use to make predictions is the Auto MPG dataset. This dataset has 398 observations and nine attributes, one of which serves as the label. The label is the expected outcome; it's used to train the model and to evaluate its accuracy. The outcome that we're trying to predict is the expected mpg (attribute 1) of an automobile, given the values of the other attributes.

Here are the attributes and label in the column order in which they are provided:

  1. mpg
  2. cylinders
  3. displacement
  4. horsepower
  5. weight
  6. acceleration
  7. model year
  8. origin
  9. car name

You can get the dataset from the UCI machine-learning repository at

http://archive.ics.uci.edu/ml/datasets/Auto+MPG

To get the dataset from the UCI repository and load it into memory, type the following command into the console:

> autos <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", header=FALSE, sep="", as.is=TRUE)

You'll see that the dataset was loaded into memory as the data frame variable autos, by looking at your workspace pane (the top-right pane). Click the autos variable to view the data values in the source pane (the top-left pane). Figure 14-3 shows how the data looks in the source pane.


FIGURE 14-3: View of the autos data loaded into memory.

technicalstuff The data is from Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Using the head and tail functions comes in handy when you just want to see the first and last five rows of the data. It's also a quick way to verify that you loaded the correct file and that it was read correctly. The summary function gives you basic statistics on each column of the data. You can copy and paste the following three lines of code, separately, into the source pane and have the output shown in the console:

head(autos,5)
tail(autos,5)
summary(autos)

The output of head and tail isn't shown here. This is the output of the summary function:

V1 V2 V3 V4
Min. : 9.00 Min. :3.000 Min. : 68.0 Length:398
1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2 Class :character
Median :23.00 Median :4.000 Median :148.5 Mode :character
Mean :23.51 Mean :5.455 Mean :193.4
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0
Max. :46.60 Max. :8.000 Max. :455.0
V5 V6 V7 V8
Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00 1st Qu.:1.000
Median :2804 Median :15.50 Median :76.00 Median :1.000
Mean :2970 Mean :15.57 Mean :76.01 Mean :1.573
3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00 3rd Qu.:2.000
Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
V9
Length:398
Class :character
Mode :character

Preparing the data

You have to get the data into a form that the algorithm can use to build a model. To do so, you have to take some time to understand the data and to know the structure of the data. Type in the str function to find out the structure of the autos data. The command and its output look like this:

> str(autos)
'data.frame': 398 obs. of 9 variables:
$ V1: num 18 15 18 16 17 15 14 14 14 15 …
$ V2: int 8 8 8 8 8 8 8 8 8 8 …
$ V3: num 307 350 318 304 302 429 454 440 455 390 …
$ V4: chr "130.0" "165.0" "150.0" "150.0" …
$ V5: num 3504 3693 3436 3433 3449 …
$ V6: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 …
$ V7: int 70 70 70 70 70 70 70 70 70 70 …
$ V8: int 1 1 1 1 1 1 1 1 1 1 …
$ V9: chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" …

From looking at the structure, we can tell that there is some data preparation and cleanup to do. Here's a list of the needed tasks:

  • Rename the column names.

    This isn't strictly necessary, but for the purposes of this example, it's better to use column names we can understand and remember.

  • Change the data type of predictor variable V4 (horsepower) to a numeric data type.

    In this example, horsepower is supposed to be a continuous numerical value, not the character data type that the read.csv function assigned it by default.

  • Handle missing values.

    Here horsepower has six missing values. You can find this by looking for “?” in the V4 column. Here is the code and output for that:

    > sum(autos["V4"] == "?")
    [1] 6

  • Change the attributes that have discrete values to factors.

    Here cylinders, modelyear, and origin have discrete values.

  • Discard the V9 (carName) attribute.

    Here carName doesn't add value to the model that we're creating. (If the origin attribute weren't given, we could have derived it from the carName attribute.)

To rename the columns, type in the following code:

> colnames(autos) <-
c("mpg","cylinders","displacement","horsepower",
"weight","acceleration","modelYear","origin",
"carName")

Next, we change the data type of horsepower to numeric with the following code:

> autos$horsepower <- as.numeric(autos$horsepower)

The program will complain because not all the values in horsepower were string representations of numbers. There were some missing values that were represented as the “?” character. That's fine for now because R converts each instance of ? into NA. The following line of code confirms that six missing values were converted into NA:

> sum(is.na(autos["horsepower"]))
[1] 6

A common way to handle the missing values of continuous variables is to replace each missing value with the mean of the entire column. The following line of code does that:

> autos$horsepower[is.na(autos$horsepower)] <-
mean(autos$horsepower,na.rm=TRUE)

It's important to have na.rm=TRUE in the mean function. It tells the function to remove NA values before computing the mean; without it, the function returns NA. The following line of code verifies that the operation worked and that there are no missing values left in the horsepower attribute:

> sum(is.na(autos["horsepower"]))
[1] 0

Next, we change the attributes with discrete values to factors. We have identified three attributes as discrete. The following three lines of code change the attributes.

> autos$origin <- factor(autos$origin)
> autos$modelYear <- factor(autos$modelYear)
> autos$cylinders <- factor(autos$cylinders)

Finally, we remove the carname attribute from the data frame with this line of code:

> autos$carName <- NULL

At this point, the data is prepared for the modeling process. The following is a view of the structure after the data-preparation process:

> str(autos)
'data.frame': 398 obs. of 8 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 …
$ cylinders : Factor w/ 5 levels "3","4","5","6",..:
5 5 5 5 5 5 5 5 5 5 …
$ displacement: num 307 350 318 304 302 429 454 440 455
390 …
$ horsepower : num 130 165 150 150 140 198 220 215 225
190 …
$ weight : num 3504 3693 3436 3433 3449 …
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5

$ modelYear : Factor w/ 13 levels "70","71","72",..:
1 1 1 1 1 1 1 1 1 1 …
$ origin : Factor w/ 3 levels "1","2","3":
1 1 1 1 1 1 1 1 1 1 …

In the preceding output, you can see that the attributes cylinders, modelYear, and origin are now factors and the attribute carName has been removed.

Creating the model

We want to create a model that we can evaluate by using known outcomes. To do that, we're going to split our autos dataset into two sets: one for training the model and one for testing the model. A 70/30 split between training and testing datasets will suffice. The next two lines of code calculate and store the sizes of each set:

> trainSize <- round(nrow(autos) * 0.7)
> testSize <- nrow(autos) - trainSize

To output the values, type in the name of the variable used to store the value and press Enter. Here is the output:

> trainSize
[1] 279
> testSize
[1] 119

This code determines the sizes of the datasets that we intend to use as our training and test datasets; we still haven't actually created those sets. Also, we don't want simply to call the first 279 observations the training set and the last 119 observations the test set. That would create a bad model, because the autos dataset appears to be ordered; specifically, the modelYear column is ordered from smallest to largest.

From examining the data, you can see that most of the heavier, eight-cylinder, larger-displacement, greater-horsepower autos reside at the top of the dataset. From this observation, without having to run any algorithms on the data, you can already tell that (in general, for this dataset) older cars, compared to newer cars:

  • Are heavier
  • Have eight cylinders
  • Have larger displacement
  • Have greater horsepower

Okay, obviously many people know something about automobiles, so a guess as to what the correlations are isn't too farfetched after you see the data. Someone with a lot of automobile knowledge may have known all this without even looking at the data. This is just a simple example of a domain (cars) that many people can relate to. If this were data about cancer, however, most people wouldn't immediately understand what each attribute means.

tip This is where a domain expert and a data modeler are vital to the modeling process. Domain experts may have the best knowledge of which attributes may be the most (or least) important — and how attributes correlate with each other. They can suggest to the data modeler which variables to experiment with. They can give bigger weights to more important attributes and/or smaller weights to attributes of least importance.

So we have to make a training dataset and a test dataset that are truly representative of the entire set. One way to do so is to create the training set from a random selection of the entire dataset. Additionally, we want to make this test reproducible so we can learn from the same example. Thus we set the seed for the random generator so we'll have the same “random” training set. The following code does that task:

> set.seed(123)
> training_indices <- sample(seq_len(nrow(autos)),
size=trainSize)
> trainSet <- autos[training_indices, ]
> testSet <- autos[-training_indices, ]

Because the same seed was used to create the random sample of training indices, your output should be like this:

> training_indices
[1] 115 313 162 349 371 18 208 395 216 178 372 176 262 221 40 345
[17] 95 17 125 362 337 386 241 373 246 265 203 385 107 55 355 332
[33] 253 291 9 174 275 79 398 84 52 148 357 131 54 49 83 164
[49] 94 300 16 154 277 43 193 71 44 257 305 127 225 32 130 92
[65] 273 150 269 367 263 145 248 206 232 1 155 72 123 197 113 36
[81] 78 212 132 249 33 137 308 278 379 369 41 201 106 307 98 57
[97] 237 29 141 153 179 99 329 282 142 261 268 389 120 383 270 87
[113] 393 271 205 312 324 331 351 320 181 89 85 61 102 281 42 25
[129] 39 186 166 239 180 196 138 363 390 327 255 114 81 306 3 47
[145] 215 59 310 20 62 183 211 322 96 252 28 382 139 53 370 376
[161] 290 359 347 346 302 380 172 358 243 245 144 272 375 168 333 264
[177] 352 118 344 128 184 68 299 303 219 103 254 238 192 189 58 67
[193] 204 214 210 230 222 133 31 360 48 190 394 101 361 170 70 56
[209] 314 321 91 198 283 233 364 129 65 75 356 165 51 171 315 121
[225] 10 69 202 374 119 295 104 323 90 330 289 182 209 134 285 284
[241] 287 338 27 297 353 111 229 147 146 109 288 396 158 266 77 279
[257] 24 392 207 175 159 242 335 387 293 228 45 46 256 163 112 325
[273] 298 21 177 188 220 317 218

The training set trainSet contains 279 observations, along with the outcome (mpg) of each observation. The regression algorithm uses the outcome to train the model by looking at the relationships between the predictor variables (any of the seven attributes) and the response variable (mpg).

The test set testSet contains the rest of the data (that is, the portion not included in the training set). You should notice that the test set also includes the response (mpg) variable. When you use the predict function (from the model) with the test set, it ignores the response variable and only uses the predictor variables as long as the column names are the same as those in the training set.

To create a linear regression model, we use the lm function, which stands for linear models. The formula uses the mpg attribute as the response variable and all the other variables as predictor variables. Type in the following line of code:

> model <- lm(formula=mpg ~ . , data=trainSet)

Explaining the results

To see some useful information about the model you just created, type in the following code:

> summary(model)

The output provides information that you can explore if you want to tweak your model further. For now, we'll leave the model as it is. Here are the last two lines of the output:

Multiple R-squared: 0.8741, Adjusted R-squared: 0.8633
F-statistic: 80.82 on 22 and 256 DF, p-value: < 2.2e-16

A couple of data points stand out here:

  • The Multiple R-squared value tells you how well the regression line fits the data (goodness of fit). A value of 1 means that it's a perfect fit. So an r-squared value of 0.874 is good; it says that 87.4 percent of the variability in mpg is explained by the model.
  • The p-value tells you whether the predictor variables have a statistically significant effect on the response variable. A p-value of less than (typically) 0.05 means that we can reject the null hypothesis that the predictor variables collectively have no effect on the response variable (mpg). The p-value of 2.2e-16 (that is, 2.2 with the decimal point moved 16 places to the left) is much smaller than 0.05, so the predictors clearly have an effect on the response.

With the model created, we can make predictions against it with the test data we partitioned from the full dataset. To use this model to predict the mpg for each row in the test set, you issue the following command:

> predictions <- predict(model, testSet,
interval="prediction", level=.95)

This is the code and output of the first six predictions:

> head(predictions)
fit lwr upr
2 16.48993 10.530223 22.44964
4 18.16543 12.204615 24.12625
5 18.39992 12.402524 24.39732
6 12.09295 6.023341 18.16257
7 11.37966 5.186428 17.57289
8 11.66368 5.527497 17.79985

The output is a matrix that shows the predicted values in the fit column and the prediction interval in the lwr and upr columns — with a confidence level of 95 percent. The higher the confidence level, the wider the range, and vice versa. The predicted value is in the middle of the range; so changing the confidence level doesn't change the predicted value. The first column is the row number of the full dataset.
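If you want to see that effect for yourself, here's a quick side experiment (not part of the main walkthrough): re-run the prediction at a 99 percent confidence level and compare the interval widths. It assumes the model and testSet objects created above.

```r
# Sketch: compare prediction-interval widths at 95% and 99% confidence.
# Assumes model and testSet exist as created earlier in this chapter.
pred95 <- predict(model, testSet, interval="prediction", level=.95)
pred99 <- predict(model, testSet, interval="prediction", level=.99)

# The fit column is identical; only the lwr and upr bounds move.
head(pred95[, "upr"] - pred95[, "lwr"])  # narrower ranges at 95%
head(pred99[, "upr"] - pred99[, "lwr"])  # wider ranges at 99%
```

The predicted values in the fit column don't change; only the lower and upper bounds spread farther apart at the higher confidence level.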

To see the actual and predicted values side by side so we can easily compare them, you can type in the following lines of code:

> comparison <- cbind(testSet$mpg, predictions[,1])
> colnames(comparison) <- c("actual", "predicted")

The first line creates a two-column matrix with the actual and predicted values. The second line changes the column names to actual and predicted. Type head(comparison) to get the output of the first six lines of comparison, as follows:

> head(comparison)
actual predicted
2 15 16.48993
4 16 18.16543
5 17 18.39992
6 15 12.09295
7 14 11.37966
8 14 11.66368

We also want to see a summary of the two columns to compare their means. This is the code and output of the summary:

> summary(comparison)
actual predicted
Min. :10.00 Min. : 8.849
1st Qu.:16.00 1st Qu.:17.070
Median :21.50 Median :22.912
Mean :22.79 Mean :23.048
3rd Qu.:28.00 3rd Qu.:29.519
Max. :44.30 Max. :37.643

Next we use the mean absolute percent error (MAPE) to measure the accuracy of our regression model. The formula for mean absolute percent error is

(Σ(|Y-Y'|/|Y|)/N)*100

where Y is the actual score, Y' is the predicted score, and N is the number of predicted scores. After plugging the values into the formula, we get an error of only 10.94 percent. Here is the code and the output from the R console:

> mape <- (sum(abs(comparison[,1]-comparison[,2]) /
abs(comparison[,1]))/nrow(comparison))*100
> mape
[1] 10.93689

The following code enables you to view the results and errors in a table view:

> mapeTable <- cbind(comparison, abs(comparison[,1]-
comparison[,2])/comparison[,1]*100)
> colnames(mapeTable)[3] <- "absolute percent error"
> head(mapeTable)
actual predicted absolute percent error
2 15 16.48993 9.932889
4 16 18.16543 13.533952
5 17 18.39992 8.234840
6 15 12.09295 19.380309
7 14 11.37966 18.716708
8 14 11.66368 16.688031

Here's the code that enables you to see the percent error again:

> sum(mapeTable[,3])/nrow(comparison)
[1] 10.93689
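Incidentally, the same MAPE calculation can be written more compactly with R's mean function. This is just an equivalent one-liner for the computation above, not a different measure:

```r
# Equivalent MAPE using mean(); assumes the comparison matrix built above.
mape <- mean(abs(comparison[,1] - comparison[,2]) /
             abs(comparison[,1])) * 100
mape   # mathematically identical to the sum()/nrow() version
```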

Making new predictions

To make predictions with new data, you simply use the predict function with a list of the attribute values. We don’t have any new test observations so we’ll just make up some values for the attributes. The following code does that job:

> newPrediction <- predict(model,
list(cylinders=factor(4), displacement=370,
horsepower=150, weight=3904, acceleration=12,
modelYear=factor(70), origin=factor(1)),
interval="prediction", level=.95)

This is the code and output of the new prediction value:

> newPrediction
fit lwr upr
1 14.90128 8.12795 21.67462

What you have here is your first real prediction from the regression model. The model predicts that this fictional car will have around 14.90 miles per gallon. Because it's from unseen data and you don't know the outcome, you can't compare it against anything else to find out whether it is correct.

After you've evaluated the model with the testing dataset, and you're happy with its accuracy, you’d be confident that you built a good predictive model. You'll have to wait for business results to measure the effectiveness of your predictive model.

tip There may be optimizations you can make to build a better and more efficient predictive model. By experimenting, you may find the best combination of predictors to create a faster and more accurate model. One way to construct a subset of the features is to compute the correlations between the variables and remove the highly correlated ones. By removing redundant variables that add little or no information to the fit, you can increase the speed of the model. This is especially true when you're dealing with many observations (rows of data) and processing power or speed is an issue: the more attributes in a row of data, the slower the processing. So try to eliminate as much redundant information as possible.
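As an illustration of that idea (a sketch, not a required step in the walkthrough), you could inspect the pairwise correlations of the numeric columns in the training set and consider dropping one variable from any highly correlated pair before refitting:

```r
# Sketch: inspect correlations among the numeric columns of trainSet.
# Assumes trainSet exists as created earlier in this chapter.
numericCols <- sapply(trainSet, is.numeric)
round(cor(trainSet[, numericCols]), 2)

# If, say, two predictors turn out to be highly correlated, you might
# refit without one of them and compare the fit (hypothetical example):
# reducedModel <- lm(mpg ~ . - displacement, data=trainSet)
# summary(reducedModel)
```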

Using classification to predict

Another task in predictive analytics is to classify new data by predicting what class a target item of data belongs to, given a set of independent variables. You can, for example, classify a customer by type — say, as a high-value customer, a regular customer, or a customer who is ready to switch to a competitor — by using a decision tree.

Setting up the environment

The party package is one of several packages in R that create decision trees. It isn't installed by default in the base distribution of R. Other common decision-tree packages include rpart, tree, and randomForest. (The randomForest package is used at the end of this chapter.) The first step is to set up the environment by installing the party package and loading it into the R session.

Type the following lines of code to install and load the party package:

> install.packages("party")
> library(party)

The party package may take a couple of minutes to download and install. You will see a message on the console after the package has been downloaded and loaded into RStudio.

Introducing the data

The dataset we use to make a prediction on is the Seeds dataset. This dataset has 210 observations and 7 attributes plus the label. The label is the expected outcome and is used to train and evaluate the accuracy of the predictive model. The outcome that we're trying to predict is the type of seed (attribute 8), given the values of the seven attributes. The three possible values for the seed type are labeled 1, 2, and 3, and represent the Kama, Rosa, and Canadian varieties of wheat.

The attributes, in the column order in which they are provided, are the following:

  1. area
  2. perimeter
  3. compactness
  4. length of kernel
  5. width of kernel
  6. asymmetry coefficient
  7. length of kernel groove
  8. class of wheat

You can get the dataset from the UCI machine-learning repository at

http://archive.ics.uci.edu/ml/datasets/seeds

To get the dataset from the UCI repository and load it into memory, type the following command into the console:

> seeds <-
read.csv("http://archive.ics.uci.edu/ml/machine
-learning-databases/00236/seeds_dataset.txt",
header=FALSE, sep="", as.is=TRUE)

You see that the dataset was loaded into memory as the data frame variable seeds, by looking at your workspace pane (the top-right). Click the seeds variable to see the data values in the source pane (the top-left). Figure 14-4 shows how the data looks in the source pane.

image

FIGURE 14-4: View of the seeds data loaded into memory.

You can find more information about the data you just loaded by using the summary() function.

> summary(seeds)
V1 V2 V3
Min. :10.59 Min. :12.41 Min. :0.8081
1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569
Median :14.36 Median :14.32 Median :0.8734
Mean :14.85 Mean :14.56 Mean :0.8710
3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878
Max. :21.18 Max. :17.25 Max. :0.9183

Preparing the data

You have to get the data into a form that the algorithm can use to build a model. To do that, you have to take some time to understand the data and to know its structure. Type in the str function to find out the structure of the seeds data. Here's what it looks like:

> str(seeds)
'data.frame': 210 obs. of 8 variables:
$ V1: num 15.3 14.9 14.3 13.8 16.1 …
$ V2: num 14.8 14.6 14.1 13.9 15 …
$ V3: num 0.871 0.881 0.905 0.895 0.903 …
$ V4: num 5.76 5.55 5.29 5.32 5.66 …
$ V5: num 3.31 3.33 3.34 3.38 3.56 …
$ V6: num 2.22 1.02 2.7 2.26 1.35 …
$ V7: num 5.22 4.96 4.83 4.8 5.17 …
$ V8: int 1 1 1 1 1 1 1 1 1 1 …

From looking at the structure, we can tell that the data needs one pre-processing step and one convenience step:

  • Rename the column names. Once again, this isn't strictly necessary, but it makes sense to provide meaningful column names that we can understand and remember.
  • Change the attribute with categorical values to a factor. The label has three possible categories.

To rename the columns, type in the following code:

> colnames(seeds) <-
c("area","perimeter","compactness","length",
"width","asymmetry","length2","seedType")

Next, we change the attribute that has categorical values to a factor. (We've identified the label as categorical.) The following code changes the data type to a factor:

> seeds$seedType <- factor(seeds$seedType)

This command finishes the preparation of the data for the modeling process. The following is a view of the structure after the data-preparation process:

> str(seeds)
'data.frame': 210 obs. of 8 variables:
$ area : num 15.3 14.9 14.3 13.8 16.1 …
$ perimeter : num 14.8 14.6 14.1 13.9 15 …
$ compactness: num 0.871 0.881 0.905 0.895 0.903 …
$ length : num 5.76 5.55 5.29 5.32 5.66 …
$ width : num 3.31 3.33 3.34 3.38 3.56 …
$ asymmetry : num 2.22 1.02 2.7 2.26 1.35 …
$ length2 : num 5.22 4.96 4.83 4.8 5.17 …
$ seedType : Factor w/ 3 levels "1","2","3":
1 1 1 1 1 1 1 1 1 1 …

Creating the model

We want to create a model that we can evaluate using known outcomes. To do that, we're going to split our seeds dataset into two sets: one for training the model and one for testing the model. We’ll use a 70/30 split between training and testing datasets. The next two lines of code calculate and store the sizes of each dataset:

> trainSize <- round(nrow(seeds) * 0.7)
> testSize <- nrow(seeds) - trainSize

To output the values, type in the name of the variable that we used to store the value and press Enter. Here is the output:

> trainSize
[1] 147
> testSize
[1] 63

This code determines the sizes of the training and testing datasets; we haven't actually created the sets yet. Also, we don't want the first 147 observations to be the training set and the last 63 observations to be the test set. That would create a bad model, because the seeds dataset is sorted by the label column.

Thus we have to make both the training set and the test set representative of the entire dataset. One way to do that is to create the training set from a random selection of the entire dataset. Additionally, we want to make this test reproducible so we can learn from the same example. We do that by setting the seed for the random number generator so we have the same “random” training set, like this:

> set.seed(123)
> training_indices <- sample(seq_len(nrow(seeds)),
size=trainSize)
> trainSet <- seeds[training_indices, ]
> testSet <- seeds[-training_indices, ]

Because the same seed was used to create the random sample of training indices, your output should be like this:

> training_indices
[1] 61 165 86 183 194 10 108 182 112 92 192 91 135 113 21
[16] 176 48 9 63 207 170 131 121 186 122 132 101 109 53 27
[31] 174 162 123 141 5 84 185 38 55 40 25 71 70 62 26
[46] 23 39 76 44 139 8 169 127 20 88 33 157 116 137 57
[61] 100 15 151 41 119 66 117 144 197 200 106 156 99 1 65
[76] 30 51 82 47 149 32 87 54 184 13 172 178 110 191 22
[91] 16 78 147 77 173 188 90 11 203 148 145 37 202 103 52
[106] 94 96 177 42 189 105 198 6 201 168 14 140 89 128 187
[121] 59 29 28 154 130 175 126 160 12 56 50 159 204 111 171
[136] 80 210 58 79 208 115 93 152 85 81 120 136

The training set trainSet we get from this code contains 147 observations along with an outcome (seedType) of each observation. When we create the model, we'll tell the algorithm which variable is the outcome. The classification algorithm uses those outcomes to train the model by looking at the relationships between the predictor variables (any of the seven attributes) and the label (seedType).

The test set testSet contains the rest of the data, that is, all data not included in the training set. Notice that the test set also includes the label (seedType). When you use the predict function (from the model) with the test set, it ignores the label and only uses the predictor variables, as long as the column names are the same as they are in the training set.

Now it's time to train the model. The next step is to use the party package to create a decision-tree model, using seedType as the target variable and all the other variables as predictor variables. Type in the following line of code:

> model <- ctree(seedType ~ . , data=trainSet)

Explaining the results

To see some useful information about the model you just created, type in the following code:

> summary(model)
Length Class Mode
1 BinaryTree S4

The Class column tells you that you've created a decision tree. To see how the splits are being determined, you can simply type in the name of the variable in which you assigned the model, in this case model, like this:

> model
Conditional inference tree with 6 terminal nodes
Response: seedType
Inputs: area, perimeter, compactness, length, width,
asymmetry, length2
Number of observations: 147
1) area <= 16.2; criterion = 1, statistic = 123.423
2) area <= 13.37; criterion = 1, statistic = 63.549
3) length2 <= 4.914; criterion = 1, statistic = 22.251
4)* weights = 11
3) length2 > 4.914
5)* weights = 45
2) area > 13.37
6) length2 <= 5.396; criterion = 1, statistic = 16.31
7)* weights = 33
6) length2 > 5.396
8)* weights = 8
1) area > 16.2
9) length2 <= 5.877; criterion = 0.979, statistic =
8.764
10)* weights = 10
9) length2 > 5.877
11)* weights = 40

Even better, you can visualize the model by creating a plot of the decision tree with this code:

> plot(model)

You can see the output in the plots tab (bottom-right) of RStudio. (Click the zoom button to make it look more to scale.)

Figure 14-5 shows a graphical representation of a decision tree. You can see that the overall shape mimics that of a real tree. It's made of nodes (the circles and rectangles) and links or edges (the connecting lines). The very first node (starting at the top) is called the root node and the nodes at the bottom of the tree (rectangles) are called terminal nodes. There are five decision nodes and six terminal nodes.

image

FIGURE 14-5: Decision tree of the seeds model.

At each node, the model makes a decision based on the criteria in the circle and the links, and chooses a way to go. When the model reaches a terminal node, a verdict or a final decision is reached. In this particular case, two attributes, length2 and area, are used to decide whether a given seed is of type 1, 2, or 3.

For example, take observation #2 from the seeds dataset. It has a length2 of 4.956 and an area of 14.88. We can use the tree we just built to decide which particular seed type this observation belongs to. Here's the sequence of steps:

  1. We start at the root node, which is node 1 (the number is shown in the small square at the top of the circle). We decide based on the area attribute: Is the area of observation #2 less than or equal to (denoted by <=) 16.2? The answer is yes, so we move along the path to node 2.
  2. At node 2, the model asks: Is the area <= 13.37? The answer is no, so we try the next link which asks: Is the area > 13.37? The answer is yes, so we move along the path to node 6. At this node the model asks: Is the length2 <= 5.396? It is, and we move to terminal node 7 and the verdict is that observation #2 is of seed type 1. And it is, in fact, seed type 1.

    The model does that process for all other observations to predict their classes.

  3. To find out whether we trained a good model, we check it against the training data. We can view the results in a table with the following code:

    > table(predict(model),trainSet$seedType)
    1 2 3
    1 45 4 3
    2 3 47 0
    3 1 0 44

    The results show that the error (or misclassification rate) is 11 out of 147, or 7.48 percent.

  4. With the results calculated, the next step is to read the table.

    The correct predictions are the ones that show the column and row numbers as the same. Those results show up as a diagonal line from top-left to bottom-right; for example, [1,1], [2,2], [3,3] are the number of correct predictions for that class. So for seed type 1, the model correctly predicted it 45 times, while misclassifying the seed 7 times (4 times as seed type 2, and 3 times as type 3). For seed type 2, the model correctly predicted it 47 times, while misclassifying it 3 times. For seed type 3, the model correctly predicted it 44 times, while misclassifying it only once.

We find that this is a good model. So now we evaluate it with the test data. Here is the code that uses the test data to predict and store it in a variable (testPrediction) for later use:

> testPrediction <- predict(model, newdata=testSet)

To evaluate how the model performed with the test data, we view it in a table and calculate the error, for which the code looks like this:

> table(testPrediction, testSet$seedType)
testPrediction 1 2 3
1 18 4 4
2 0 15 0
3 3 0 19

The results show that the error is 11 out of 63, or 17.46 percent. This is a bit worse than what was produced with the training data.
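Rather than counting the misclassifications by hand, you can compute the error rate directly from the confusion table. This is a small convenience sketch that uses the objects created above:

```r
# Sketch: misclassification rate from a confusion table.
# Assumes testPrediction and testSet exist as created above.
confusion <- table(testPrediction, testSet$seedType)

# Correct predictions sit on the diagonal; everything else is an error.
errorRate <- 1 - sum(diag(confusion)) / sum(confusion)
errorRate * 100   # percent misclassified
```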

Making new predictions

To make predictions with new data, you simply use the predict function with a list of the seven attribute values for the seed. (It's just a coincidence that this dataset has seven attributes, like the autos dataset; a dataset can have any number of attributes.) The following code does that:

> newPrediction <- predict(model, list(area=11,
perimeter=13, compactness=0.855, length=5,
width=2.8, asymmetry=6.5, length2=5))

This is the code and output of the new prediction value.

> newPrediction

[1] 3
Levels: 1 2 3

The prediction was seed type 3, which isn't surprising because values were deliberately chosen that were close to observation #165.

We can also get this prediction value by walking through the decision tree shown in Figure 14-5. Here are the steps:

  1. Start at the root node and look at the attribute the node evaluates: Is area > 16.2 or area <= 16.2? The value of area is 11, so the answer is <=, and we go to the left child, node #2.
  2. At node #2, we are asked whether area > 13.37 or area <= 13.37. The answer is <=, so we go to the left child, node #3.
  3. At node #3, we are asked whether length2 > 4.914 or length2 <= 4.914. The value of length2 is 5, so the answer is >, and we go to the right child, node #5.
  4. At terminal node #5, you can see in the bar chart, that the bar is the highest for 3. So the decision tree model predicts that the data it was given belongs to seed type 3.

Classification by random forest

The last model in this chapter is the random forest model. The random forest algorithm uses an ensemble technique to build its model. (To learn more about ensemble techniques, refer to Chapter 7.) This example continues from the preceding section by creating a random forest model to classify the Seeds dataset.

You've already done all the hard work of setting up the data in the preceding section on decision trees, so this section mostly repeats that process. Trying a different classification algorithm is just a matter of changing a few lines of code; the steps are much the same from model to model, and most of your time goes into data preparation. However, you do need to know the required parameters for each algorithm, and to get better performance, you'll need to dive deeper into the algorithms and tune them.

Preparing the data

Before we can create the model, let’s take the steps to get the data ready for modeling. The steps will look similar to the preceding section on decision trees:

  1. Install the randomForest package and load it into RStudio.

    > install.packages("randomForest")
    > library(randomForest)

  2. Load the data.

    > seeds <-
    read.csv("http://archive.ics.uci.edu/ml/machine
    -learning-databases/00236/seeds_dataset.txt",
    header=FALSE, sep="", as.is=TRUE)

  3. Prepare the data.

    Rename the variables to meaningful names.

    > colnames(seeds) <-
    c("area","perimeter","compactness","length",
    "width","asymmetry","length2","seedType")

    Change the label to a factor data type.

    > seeds$seedType <- factor(seeds$seedType)

  4. Split the data into training and test sets.

    > trainSize <- round(nrow(seeds) * 0.7)
    > testSize <- nrow(seeds) - trainSize
    > set.seed(123)
    > training_indices <- sample(seq_len(nrow(seeds)),
    size=trainSize)
    > trainSet <- seeds[training_indices, ]
    > testSet <- seeds[-training_indices, ]

With the preprocessing steps completed, you're ready to create the random forest model.

Creating the model

To create the random forest model, enter the following code into the console:

> set.seed(123)
> model <- randomForest(seedType ~ . , data=trainSet)

There are many tuning parameters for the random forest algorithm. The most commonly used parameter is ntree, which specifies the number of trees to grow (or number of voters). The default value for ntree is 500. To learn more about the ntree parameter, refer to the random forest section (n_estimators parameter) in Chapter 12. You can find a list of all the parameters for the random forest algorithm by using the help function, as shown below.

> help(randomForest)
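For example, to experiment with a larger forest, you could grow 1,000 trees instead of the default 500. This is a sketch of the call; whether the extra trees actually help on a dataset this small is something you'd have to test:

```r
# Sketch: grow a larger forest by setting ntree.
# Uses the same trainSet and seed as above so results are comparable.
set.seed(123)
biggerModel <- randomForest(seedType ~ . , data=trainSet,
                            ntree=1000)
biggerModel   # compare its OOB error estimate with the 500-tree model
```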

Evaluating the results

When the random forest model has been fitted, you can enter the model variable into the console to find the estimate of error rate:

> model
Call:
randomForest(formula = seedType ~ ., data = trainSet)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 8.16%
Confusion matrix:
1 2 3 class.error
1 43 2 4 0.12244898
2 2 49 0 0.03921569
3 4 0 43 0.08510638

The model estimates that it will have an error rate of 8.16 percent (or be 91.8 percent accurate, if you're a positive sort of person). It also gives the error rate of each class in the confusion matrix. Each class is represented by a row:

  • Class 1 has an error rate of 12.2 percent.
  • Class 2 has an error rate of 3.9 percent.
  • Class 3 has an error rate of 8.5 percent.

The random forest algorithm allows the error rate of the trees to be plotted, as shown in Figure 14-6. The plot shows the number of trees on the x-axis and the error rate on the y-axis. Because the forest has 500 trees, the limit on the x-axis is 500. There are four color-coded curves that show the error for each class and the overall error rate, from top to bottom:

  • Class 1 (the highest error rate)
  • Class 3 (about 8.51 percent)
  • Overall error rate (about 8.16 percent)
  • Class 2 (the lowest error rate)
image

FIGURE 14-6: Error rate plot of random forest model.

To create the plot, enter the following code:

> plot(model)
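The plot itself doesn't label the curves. One way to add a legend, using the err.rate matrix that randomForest stores in the model object (its columns are OOB plus the class labels), looks like this; the color and line-type choices here assume the plot's defaults:

```r
# Sketch: label the error-rate curves on the random forest plot.
# model$err.rate has one column per curve: "OOB", "1", "2", "3".
plot(model)
legend("topright", legend=colnames(model$err.rate),
       col=1:ncol(model$err.rate), lty=1:ncol(model$err.rate))
```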

Now you can check the test dataset. Enter the following code into the console:

> testPrediction <- predict(model, newdata=testSet)
> table(testPrediction, testSet$seedType)
testPrediction 1 2 3
1 20 1 0
2 0 18 0
3 1 0 23

The results show that the error is 2 out of 63, or 3.17 percent. This is much better than expected, as the error rate was estimated to be around 8.16 percent. The results of the random forest model are also better than the results of the decision tree created earlier in this chapter with the party package.

Making new predictions

After you've created the model and determined that it's the one you'll deploy, you feed it new, unlabeled data to get predictions of the seed type. Because we don't have any new data, we'll use the same attribute values that we made up for the decision tree earlier in this chapter. This gives us the opportunity to make a simple comparison of the decision tree model and the random forest model.

The following code makes a new prediction:

> newPrediction <- predict(model, list(area=11,
perimeter=13, compactness=0.855, length=5,
width=2.8, asymmetry=6.5, length2=5))

This is the code and output of the new prediction value:

> newPrediction
1
3
Levels: 1 2 3

The prediction was seed type 3, as shown in the second row of the output. (You can ignore the first and last rows; they're the row number of the list and the possible outcomes, respectively.) The result is the same as the prediction from the party decision tree. That isn't surprising; this should be an easy prediction because the attribute values were very close to those of observation #165. Of the two classification tree models created in this chapter, the random forest model performed better on the Seeds dataset with the default parameters, because it has a lower error rate.

Making new predictions on seeds data may seem a bit abstract and it may be difficult to understand how it relates to solving business problems. A real-life example of using classification you may be able to relate to is email spam detection.

This is a simplified example of creating and deploying a spam detection model. Suppose you're the operator of an email service, and the business problem you want to solve is determining whether the email a customer receives is spam. Customers hate receiving spam email because it's annoying, wastes time, and wastes disk space. So you want a model that produces the fewest errors.

You create a spam detection model using a labeled spam email dataset. After you're confident with the model’s results on the test data, you will deploy it to the production servers. Now, for every new email that is sent to the customer, the model will make a prediction on whether the email is spam before it's delivered. So the email will be either put into the customer’s inbox or spam folder (if there are no other predefined filters already set). If spam email lands in the inbox or if valid email lands in the spam folder, it would be considered an error or misclassification.

By allowing customers to mark email as spam when it lands in the inbox, or as not-spam when it lands in the spam folder, the customers are essentially labeling the data for you to measure the model's results. Obviously, not every customer will take the time to mark every misclassified email, but you'll have enough to work with, and you can also use internal emails to measure the results.
