One of the key challenges in supervised learning is the generation or extraction of an appropriate training set. Despite the effort and best intentions of the data scientist, raw labeled data is often not directly usable.
Let's take, for example, the problem of predicting the click-through rate for an online display ad. Typically, 95-99% of the events are labeled as no-click (the negative class) while only 1-5% are labeled as clicked (the positive class). Such an imbalanced training set may produce an erroneous model unless the negatively labeled events are reduced through sampling.
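Reducing the negative class through sampling can be sketched as random undersampling of the majority class. The following is a minimal illustration, not a production implementation; the function name `downsample_negatives`, the `ratio` parameter, and the synthetic click log are all hypothetical, introduced only for this example.

```python
import random

def downsample_negatives(events, ratio=1.0, seed=42):
    """Randomly undersample the majority (no-click) class so that the number
    of retained negatives is at most `ratio` times the number of positives.
    `events` is a list of (features, label) pairs, with label 1 = click.
    """
    rng = random.Random(seed)
    positives = [e for e in events if e[1] == 1]
    negatives = [e for e in events if e[1] == 0]
    # Keep at most ratio * |positives| negatives, chosen uniformly at random
    k = min(len(negatives), int(ratio * len(positives)))
    balanced = positives + rng.sample(negatives, k)
    rng.shuffle(balanced)
    return balanced

# Hypothetical click log: roughly 3% clicks, 97% no-clicks
events = [((i,), 1 if i % 33 == 0 else 0) for i in range(10_000)]
balanced = downsample_negatives(events, ratio=1.0)
```

With `ratio=1.0`, the resulting training set contains as many no-click events as click events, removing the class imbalance at the cost of discarding most of the negative examples.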
This chapter deals with the need for, the role of, and some common methods of sampling a dataset. It covers the following topics:
Although random number generators are of critical importance in statistics and machine learning, they are not covered in this book. There is a wealth of references on the benefits and pitfalls of the various schemes for generating uniformly distributed random values [8:1].
Sampling is the process of extracting a subset of a dataset, chosen so that inferences can be drawn about the properties of the full dataset. It is not always practical to use an entire dataset, for the following reasons:
The most commonly cited benefits of sampling are reduced computational cost and lower execution latency.
The challenge is to devise a procedure that generates a sample accurately representing the original dataset, so that any inference drawn from the sample applies equally to the original dataset.
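The property that inferences drawn from a representative sample transfer to the original dataset can be illustrated with a simple random sample: the sample mean should closely approximate the population mean. This is a minimal sketch; the function `simple_random_sample` and the synthetic population are assumptions made for the example, not part of the text.

```python
import random

def simple_random_sample(data, n, seed=1):
    """Draw a simple random sample of size n, without replacement,
    in which every element has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(data, n)

# Hypothetical population of 100,000 values cycling over 0..99 (mean 49.5)
population = [float(i % 100) for i in range(100_000)]
sample = simple_random_sample(population, 1_000)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
```

For a sample of 1,000 values, the standard error of the mean is well under 1, so `sample_mean` lands very close to `pop_mean` while touching only 1% of the data.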