One of the key challenges in supervised learning is the generation or extraction of an appropriate training set. Despite the effort and best intentions of the data scientist, raw labeled data is often not directly usable.
Let's take, for example, the problem of predicting the click-through rate for an online display ad. Typically, 95-99% of the events are labeled as no-click (the negative class) while only 1-5% are labeled as clicked (the positive class). Such an imbalanced training set may produce an erroneous model unless the negatively labeled events are reduced through sampling.
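Reducing the negative class through sampling can be sketched as random undersampling of the majority class. The following is a minimal illustration, not a production implementation; the function name `downsample_negatives`, the `ratio` parameter, and the synthetic click log are all hypothetical, introduced only for this example.

```python
import random

def downsample_negatives(events, ratio=1.0, seed=42):
    """Randomly undersample the majority (no-click) class so that the number
    of retained negatives is at most `ratio` times the number of positives.
    `events` is a list of (features, label) pairs, with label 1 = click.
    """
    rng = random.Random(seed)
    positives = [e for e in events if e[1] == 1]
    negatives = [e for e in events if e[1] == 0]
    # Keep at most ratio * |positives| negatives, chosen uniformly at random
    k = min(len(negatives), int(ratio * len(positives)))
    balanced = positives + rng.sample(negatives, k)
    rng.shuffle(balanced)
    return balanced

# Hypothetical click log: roughly 3% clicks, 97% no-clicks
events = [((i,), 1 if i % 33 == 0 else 0) for i in range(10_000)]
balanced = downsample_negatives(events, ratio=1.0)
```

With `ratio=1.0`, the resulting training set contains as many no-click events as click events, removing the class imbalance at the cost of discarding most of the negative examples.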
This chapter deals with the need for, the role of, and some common methods of sampling a dataset. It covers the following topics:
Although random number generators are of critical importance in statistics and machine learning, they are not covered in this book. There is a wealth of references on the benefits and pitfalls of the various schemes for generating uniformly distributed random values [8:1].
Sampling is the process of extracting a subset of a dataset, chosen so that inferences can be drawn about the properties of the full dataset. It is not always practical to use an entire dataset, for the following reasons:
The most commonly cited benefits of sampling are reduced computational cost and lower execution latency.
The challenge is to devise a procedure that generates a sample accurately representing the original dataset, so that any inference drawn from the sample applies equally to the original dataset.
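The property that inferences drawn from a representative sample transfer to the original dataset can be illustrated with a simple random sample: the sample mean should closely approximate the population mean. This is a minimal sketch; the function `simple_random_sample` and the synthetic population are assumptions made for the example, not part of the text.

```python
import random

def simple_random_sample(data, n, seed=1):
    """Draw a simple random sample of size n, without replacement,
    in which every element has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(data, n)

# Hypothetical population of 100,000 values cycling over 0..99 (mean 49.5)
population = [float(i % 100) for i in range(100_000)]
sample = simple_random_sample(population, 1_000)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
```

For a sample of 1,000 values, the standard error of the mean is well under 1, so `sample_mean` lands very close to `pop_mean` while touching only 1% of the data.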