Chapter 8. Monte Carlo Inference

One of the key challenges in supervised learning is the generation or extraction of an appropriate training set. Despite the effort and best intentions of the data scientist, the labeled data is often not directly usable.

Let's take, for example, the problem of predicting the click-through rate for an online display ad. Between 95% and 99% of the events are labeled as no-click (the negative class), while only 1% to 5% are labeled as clicked (the positive class). Such an unbalanced training set may produce an erroneous model unless the number of negatively labeled events is reduced through sampling.
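
The down-sampling idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed method: the function name `downsample_negatives`, the `keep_ratio` parameter, and the synthetic click events are all hypothetical and introduced only for this example.

```python
import random

def downsample_negatives(events, keep_ratio, seed=42):
    """Keep every positive event and a random fraction of the negative ones.

    events: list of (features, label) pairs, label 1 = click, 0 = no click.
    keep_ratio: probability of keeping each negative event (assumed value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = []
    for features, label in events:
        if label == 1 or rng.random() < keep_ratio:
            sampled.append((features, label))
    return sampled

# Synthetic data (assumption): 10,000 events with a 2% click rate.
events = [({"id": i}, 1 if i % 50 == 0 else 0) for i in range(10_000)]

balanced = downsample_negatives(events, keep_ratio=0.05)
positives = sum(label for _, label in balanced)
negatives = len(balanced) - positives
print(f"positives={positives}, negatives={negatives}")
```

All clicked events are retained while roughly 5% of the no-click events survive, which brings the two classes much closer in size. A model trained on such a sample typically needs its predicted probabilities recalibrated, because the class ratio no longer matches the original data.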

This chapter deals with the need, role, and some common methods of sampling a dataset. It covers the following topics:

  • Generation of random samples from a given distribution
  • Application of Monte Carlo numerical sampling to approximation
  • Bootstrapping
  • Markov Chain Monte Carlo for estimating parametric distributions

Although random number generators are of critical importance in statistics and machine learning, they are not covered in this book. There is a wealth of references on the benefits and pitfalls of the various schemes for generating uniform random values [8:1].

The purpose of sampling

Sampling is the process of extracting a subset of a dataset in order to draw inferences about the properties of the whole dataset. It is not always practical to use an entire dataset, for the following reasons:

  • Dataset is too large
  • Dataset is not available in a timely fashion
  • Extraction of complex features is very computationally intensive
  • A very large percentage of the training data is labeled with a single class, which requires down-sampling
  • Data is a continuous signal

The most commonly cited benefits of sampling are reduced computation cost and lower execution latency.

Note

Independent and identical distribution

It is generally assumed that the observations in the original dataset are independent and identically distributed (i.i.d.).

The challenge is to devise a procedure that generates a sample which accurately represents the original dataset, so that any inference derived from the sample applies equally to the original dataset.
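
As a rough illustration of what "representative" means here, the sketch below draws a simple random sample without replacement and compares its mean with the mean of the full dataset. The Gaussian population, the sample size, and the helper name `simple_random_sample` are assumptions made for this example only; simple random sampling is just one of several possible procedures.

```python
import random
import statistics

def simple_random_sample(data, n, seed=0):
    """Draw n items from data uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(data, n)

# Synthetic i.i.d. population (assumption): 100,000 Gaussian observations.
gen = random.Random(1)
population = [gen.gauss(10.0, 2.0) for _ in range(100_000)]

sample = simple_random_sample(population, 1_000)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
print(f"population mean={pop_mean:.3f}, sample mean={sample_mean:.3f}")
```

With 1% of the data, the sample mean should land close to the population mean, since the standard error shrinks with the square root of the sample size. The same check can be repeated for other statistics, such as the variance, to judge how well the sample reflects the original dataset.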
