Pseudorandom numbers

Now we will show you how to simulate random numbers, or more accurately, pseudorandom numbers. Unlike true random numbers, which cannot be predicted, these numbers are produced by a deterministic random number generator algorithm, so anyone who knows the algorithm and its starting value can predict them. Let's review how to generate a series of pseudorandom numbers between 0 and 1. There are many methods for generating pseudorandom numbers; one of the simplest is to simulate independent uniform random variables using a multiplicative congruential pseudorandom number generator. Let m be a large prime integer and k be another integer less than m, preferably close to the square root of m. We can generate pseudorandom numbers by applying the following formula:

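$$x_{i+1} = (k \, x_i) \bmod m, \qquad u_{i+1} = \frac{x_{i+1}}{m}$$

In the preceding formula, i = 0, 1, 2, 3, … , n, where x_i is the current integer state of the generator and u_i is the corresponding pseudorandom number between 0 and 1.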

Now let's apply the previous formula to generate a short series of pseudorandom numbers. Before we begin, we need to set the seed, which is the starting value of the sequence, x_0 (the value at i = 0). The seed x_0 should be an integer between 1 and m - 1, because a seed of 0 or m would make every subsequent value 0. Once the seed has been chosen, we can generate each pseudorandom number with the previous equation. To translate this mathematical equation into code, we can write the getRandomNbs() function in R as follows:

> getRandomNbs <- function(n, m, seed){
  # create a numeric vector to store the numbers
  pseudorandom.numbers <- numeric(n)

  # set k near square root of m
  k <- round(sqrt(m)) - 2

  # use a for loop to generate the numbers
  for(i in 1:n){
    seed <- (k*seed) %% m
    pseudorandom.numbers[i] <- seed/m
  }
  return(pseudorandom.numbers)
}

Now let's use our function to generate five pseudorandom numbers, as shown in the following code, with the seed set to 27000 and 334753 as our large prime integer. You can verify that 334,753 is a prime number at http://www.onlineconversion.com/prime.htm.

> getRandomNbs(5, 334753, 27000)
[1] 0.5387913 0.8825731 0.2446909 0.1866272 0.6838684
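The first value in this output can be checked by hand. The following sketch applies a single step of the recurrence with the same inputs; the names k, x1, and u1 are introduced here only for illustration.

```r
# One step of the multiplicative congruential recurrence, done by hand
m    <- 334753               # modulus used in the example above
seed <- 27000                # starting value x_0
k    <- round(sqrt(m)) - 2   # same rule as in getRandomNbs(); gives 577

x1 <- (k * seed) %% m        # next integer state of the generator: 180362
u1 <- x1 / m                 # scaled into the unit interval

k
x1
round(u1, 7)                 # 0.5387913, the first number returned above
```

Feeding x1 back in as the new seed would give the second number, which is all that the for loop inside getRandomNbs() does.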

When writing your own pseudorandom number generator, it is important to make sure that the numbers generated follow a uniform distribution and that these values are independent of each other. We can test whether the pseudorandom numbers generated with our getRandomNbs() function follow a uniform distribution using a Chi-square test. Instead of the chisq.test() function available in R, we will use the rng.chisq() function, which was written specifically to test random number generators for uniformity. To add this function to your R session, enter the following code:

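> rng.chisq <- function(x, m) {
  # assign each value to one of m equal-width bins, numbered 0 to m-1;
  # fixing the factor levels keeps empty bins in the table with a count of 0
  Obs <- table(factor(trunc(m*x), levels=0:(m-1)))
  # expected count in each bin under a uniform distribution
  p <- rep(1,m)/m
  Exp <- length(x)*p
  chisq <- sum((Obs-Exp)^2/Exp)
  pvalue <- 1-pchisq(chisq, m-1)
  results <- list(test.statistic=chisq, p.value=pvalue, df=m-1)
  return(results)
}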

In the rng.chisq() function, x is the output from a pseudorandom number generator, where the output is in the [0, 1] interval, and m is the number of subintervals to use for the Chi-square test. To test our random number generator for uniformity, we will use our getRandomNbs() function to simulate 1,000 pseudorandom numbers, and then use the rng.chisq() function to test for uniformity using five subintervals. Let's take a look at this in the following lines of code:

> v <- getRandomNbs(1000, 334753, 27000)
> rng.chisq(v, m=5)
$test.statistic
[1] 2.9

$p.value
[1] 0.5746972

$df
[1] 4
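As a cross-check, the same kind of test can be sketched with the chisq.test() function that ships with R, applied to binned counts. This is only an illustrative alternative, not the method used in this section; the generator is repeated in compact form so that the block runs on its own, and the names bins, counts, and res are assumptions made for the example.

```r
# Compact copy of the generator from this section (same recurrence)
getRandomNbs <- function(n, m, seed) {
  k <- round(sqrt(m)) - 2
  out <- numeric(n)
  for (i in seq_len(n)) {
    seed <- (k * seed) %% m
    out[i] <- seed / m
  }
  out
}

v <- getRandomNbs(1000, 334753, 27000)

# Count how many values fall in each of five equal-width bins
bins   <- cut(v, breaks = seq(0, 1, by = 0.2), right = FALSE, include.lowest = TRUE)
counts <- table(bins)

# Given a one-dimensional table and no p argument, chisq.test() tests the
# counts against equal probabilities for every bin
res <- chisq.test(counts)
res$statistic
res$parameter   # degrees of freedom: number of bins minus 1
res$p.value
```

Because both approaches bin the same values and compare them with the same expected counts, the statistic and p value reported here should agree with the rng.chisq() output.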

Since our p value is large, we don't have sufficient evidence to reject the null hypothesis that our random variables follow a uniform distribution. Next, we want to make sure that the pseudorandom numbers generated are independent of each other. A simple way to check this is to draw a lag plot with the lag.plot() function, which plots each value against the value that preceded it. Random data should not show any underlying structure in the lag plot. To inspect our random numbers for independence, let's draw the lag plot for the 1,000 pseudorandom numbers we generated with the getRandomNbs() function as follows:

> lag.plot(v)

The result is shown in the following plot:

[Figure: lag plot of the 1,000 pseudorandom numbers stored in v]

Overall, the data looks randomly distributed. There are more formal ways to test for independence that are beyond the scope of this book, but if you are interested in learning more on the topic, we suggest you read about Spectral Tests at http://en.wikipedia.org/wiki/Spectral_test.
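A lag plot is a visual check; a rough numerical companion is the sample autocorrelation, which should be close to zero at every lag if neighboring values are unrelated. The sketch below is an informal illustration only (it is not a substitute for the spectral test), and it repeats the generator in compact form so that it runs on its own. The names lag1 and serial are introduced just for this example.

```r
# Compact copy of the generator from this section (same recurrence)
getRandomNbs <- function(n, m, seed) {
  k <- round(sqrt(m)) - 2
  out <- numeric(n)
  for (i in seq_len(n)) {
    seed <- (k * seed) %% m
    out[i] <- seed / m
  }
  out
}

v <- getRandomNbs(1000, 334753, 27000)
n <- length(v)

# Correlation between each value and the one that follows it (lag 1)
lag1 <- cor(v[-n], v[-1])
lag1

# Sample autocorrelations for lags 0 to 5, computed without drawing the plot
serial <- acf(v, lag.max = 5, plot = FALSE)
round(drop(serial$acf), 3)
```

Values near zero at lags 1 and above are consistent with independence, although they only rule out linear dependence between values.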

The runif() function

Instead of writing your own pseudorandom number generator, you can use the runif() function to generate random numbers from the uniform distribution on the interval from a to b, where min = a and max = b. The runif() function selects a seed internally and uses a different formula than the one we used in our getRandomNbs() function to generate pseudorandom numbers. Let's take a look at this in the following example:

> runif(n=5, min=0, max=1)
[1] 0.4562942 0.1861085 0.4779453 0.6313259 0.9768385
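The min and max arguments simply stretch and shift values drawn from the unit interval. The following sketch illustrates this by comparing a direct call with a manual rescaling; the endpoints a = 2 and b = 5 and the seed 101 are arbitrary choices made for this example, and set.seed() is explained in the text that follows.

```r
a <- 2
b <- 5

# Draw five values directly on the interval from a to b
set.seed(101)
direct <- runif(5, min = a, max = b)

# Draw five values on the unit interval from the same starting state,
# then rescale them by hand
set.seed(101)
rescaled <- a + (b - a) * runif(5)

all.equal(direct, rescaled)
range(direct)
```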

If you have not set a seed, R creates one from the current time and the process ID the first time random numbers are needed in a session, so the values you get will differ from one session to the next. If you would like to use a specific seed, you can set it using the set.seed() function. By setting the seed, you will generate a sequence of numbers that looks random, but that you can reproduce whenever you call the set.seed() function with the same seed. Here is an example to illustrate this point.

First, let's generate three pseudorandom numbers without setting the seed as follows:

> runif(3)
[1] 0.9744210 0.4709912 0.1204069

Now generate three pseudorandom numbers after setting the seed to 27000 as follows:

> set.seed(27000)
> runif(3)
[1] 0.5522500 0.5538553 0.4528518

Now generate three more random numbers as follows:

> runif(3)
[1] 0.6177212 0.4572295 0.4544682

Notice that you do not get the same numbers as in the previous run. To get the same three numbers that you obtained right after setting the seed to 27000, you need to rerun the set.seed(27000) function before entering runif(3) as follows:

> set.seed(27000)
> runif(3)
[1] 0.5522500 0.5538553 0.4528518

Let's generate five random numbers, as follows:

> runif(5)
[1] 0.6177212 0.4572295 0.4544682 0.9808293 0.5509730

Notice that the first three of these five numbers match the output of the second runif(3) call we made after set.seed(27000) earlier, because the generator simply continues from where it left off. Now, say we wanted to regenerate the three numbers that were initially produced after setting the seed to 27000 and then continue the sequence. All we need to do is enter set.seed(27000) and then runif(5) as follows:

> set.seed(27000)
> runif(5)
[1] 0.5522500 0.5538553 0.4528518 0.6177212 0.4572295

As you can see, we get the same first three numbers as we got when we ran runif(3) after setting the seed to 27000 because, similar to the getRandomNbs() function that we wrote, the runif() function uses a random number generation algorithm. Therefore, by setting the seed, you will obtain a predefined sequence of numbers. Hence, we refer to these values as pseudorandom numbers because they can be predicted using a specific algorithm, though they appear random to the untrained eye.
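The behavior described above can be condensed into a short check: drawing three numbers twice after setting the seed gives the same values as drawing six numbers at once from the same seed. The object names below are chosen for this sketch only.

```r
# Two consecutive draws of three values from a known starting state
set.seed(27000)
first.three <- runif(3)
next.three  <- runif(3)

# One draw of six values from the same starting state
set.seed(27000)
all.six <- runif(6)

# The generator walks through one fixed sequence, however it is sliced
identical(all.six, c(first.three, next.three))
```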

We just showed you how to generate random numbers using a preset seed so that you can regenerate the same random numbers in a later session and make your script reproducible for others. However, what if you don't want to use a preselected seed in your script but you still want to be able to reproduce the data later on? You can save the state of the random number generator, which is stored in the global environment variable .Random.seed, in a new object that you then use to reset the .Random.seed variable at a later point in your script. Let's go through an example to illustrate this point.

First, we generate five random numbers to create the .Random.seed variable because when you start a new session, this variable doesn't exist until you generate random numbers for the first time. Let's take a look at this in the following code:

> runif(5)
[1] 0.49442371 0.48252765 0.44946379 0.96708434 0.04600508

Next, we save the state of the current random number generator in a separate object, as shown in the following code. This will allow us to save the seeds used for the subsequent simulations:

> saved.seed <- .Random.seed 

Now, we can simulate other numbers as follows:

> runif(5)
[1] 0.41319284 0.57805579 0.11691655 0.09548216 0.75445132
> runif(2)
[1] 0.04699241 0.82974142

Now, say we want to regenerate the same numbers after the first time we saved the state of the random number generator. All we need to do is update the .Random.seed variable with the saved.seed variable, as shown in the following code, and rerun the commands that we ran earlier:

> .Random.seed <- saved.seed

# Rerun the commands from earlier 
> runif(5)
[1] 0.41319284 0.57805579 0.11691655 0.09548216 0.75445132
> runif(2)
[1] 0.04699241 0.82974142

As you can see, you get the same values you obtained earlier because R resumes from the generator state saved in the saved.seed object. However, the same data can only be resimulated within the same R session: since the seed was not set with the set.seed() function before we saved the .Random.seed variable, you will not get the same values if you run the code in a new R session. This is because the starting seed is taken from that session and is different each time you start a new R session. Therefore, a better way to save the state of the random number generator is to set the seed first with the set.seed() function and then save the .Random.seed variable. This way, the pseudorandom numbers you generate will be reproducible in other R sessions as follows:

> set.seed(245)
> saved.seed <- .Random.seed
> runif(5)
[1] 0.92701730 0.48499598 0.23385692 0.67666045 0.02424925
> runif(2)
[1] 0.2860802 0.9330553
> .Random.seed <- saved.seed 
> runif(5)
[1] 0.92701730 0.48499598 0.23385692 0.67666045 0.02424925
> runif(2)
[1] 0.2860802 0.9330553

When writing a function to check or restore the .Random.seed variable, it is important to remember that we need to inspect or change the variable in the global environment and not a local copy inside the function. To retrieve and assign values to global environment variables in R, we can use the get() and assign() functions. For example, let's write a function that returns n random numbers between a and b, and then resets the state of the random number generator by restoring the .Random.seed variable to the value it had in the global environment before the function was called. Let's take a look at this in the following example:

returnRandomNbs <- function(n, a, b){

  # By default we assume the .Random.seed variable was not set in the global environment
  seed.found <- FALSE
  # Look for .Random.seed in the global environment only
  if (exists(".Random.seed", envir=.GlobalEnv)) {
    saved.seed <- get(".Random.seed", .GlobalEnv)
    seed.found <- TRUE
  }
  v <- runif(n, min=a, max=b)
  # Put the generator back in the state it had before runif() was called
  if (seed.found) {
    assign(".Random.seed", saved.seed, .GlobalEnv)
  }

  return(v)

}

Now we can set the seed and run our function as follows:

> set.seed(753)
> returnRandomNbs(10, 0, 2)
 [1] 1.0074840 1.7143867 1.0060674 1.0500559 0.6218600 0.3472834 0.7659655 1.1762890 0.7091655 0.6026619
> returnRandomNbs(10, 0, 2)
 [1] 1.0074840 1.7143867 1.0060674 1.0500559 0.6218600 0.3472834 0.7659655 1.1762890 0.7091655 0.6026619
> returnRandomNbs(10, 0, 2)
 [1] 1.0074840 1.7143867 1.0060674 1.0500559 0.6218600 0.3472834 0.7659655 1.1762890 0.7091655 0.6026619

Notice that our function always returns the same values because the random number generator is reset each time to the state established by set.seed(753). Calling runif() directly confirms that the generator is still in that state:

> runif(10, 0, 2)
 [1] 1.0074840 1.7143867 1.0060674 1.0500559 0.6218600 0.3472834 0.7659655 1.1762890 0.7091655 0.6026619

Because this direct call did not restore the saved state, the generator has now advanced, and the next calls to the runif() function return new values as follows:

> runif(10, 0, 2)
 [1] 1.7727796 0.1779565 0.4772016 0.3180717 0.7163699 0.1385670 0.3976083 1.1769520 1.5818106 0.8082054
> runif(10, 0, 2)
 [1] 0.4378311 0.5187945 0.6013744 1.5483450 1.0333229 1.2363457 1.8626718 0.5275801 0.2906055 0.8752832
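One possible refinement of returnRandomNbs(), offered as a sketch rather than as the approach used in this section, is to register the restore step with on.exit() so that the generator state is put back even if the function stops with an error. The function name runifPreservingState is hypothetical.

```r
# Hypothetical variant of returnRandomNbs(): the restore step is registered
# with on.exit(), so it runs however the function exits
runifPreservingState <- function(n, a, b) {
  # look only in the global environment, where R keeps .Random.seed
  if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
    saved.seed <- get(".Random.seed", envir = .GlobalEnv)
    on.exit(assign(".Random.seed", saved.seed, envir = .GlobalEnv), add = TRUE)
  }
  runif(n, min = a, max = b)
}

set.seed(753)
first  <- runifPreservingState(10, 0, 2)
second <- runifPreservingState(10, 0, 2)
direct <- runif(10, 0, 2)   # the state was restored, so this repeats the draw

identical(first, second)
identical(first, direct)
```

The trade-off is the same as before: a function written this way never advances the generator, so repeated calls return identical values until runif() is called directly or the seed is changed.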

Bernoulli random variables

We can also simulate other random variables such as Bernoulli random variables. A Bernoulli trial has only two possible outcomes, usually called success and failure. We can simulate guessing the right answer on a test using the runif() function. For example, let's simulate the score a high school student would get if he guessed on all 30 questions of a test when the probability of getting each answer right is 0.25. We begin by setting the seed:

> set.seed(23457)

Since each question can be considered as an independent Bernoulli trial, we can use the runif() function to simulate the student's answer to each question as follows:

> guessed.correctly <- runif(30)

If the number is less than 0.25, the student guesses correctly because the probability that a uniform random variable is less than 0.25 is exactly 0.25. Let's take a look at the results in the following code:

>  table(guessed.correctly < 0.25)

FALSE  TRUE 
   24     6

As you can see, the simulated high school student score would be 6 divided by 30, or 20 percent.
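A single simulated test tells us little about what to expect on average. As a rough illustration, the same idea can be repeated for many simulated students with replicate(); the number of students (10,000) and the object names are assumptions made for this sketch.

```r
set.seed(23457)

n.students <- 10000

# For each simulated student, draw 30 uniform numbers and count how many
# fall below 0.25, that is, how many questions were guessed correctly
scores <- replicate(n.students, sum(runif(30) < 0.25))

# The average score should be close to 30 * 0.25 = 7.5 correct answers
mean(scores)

# Proportion of students at each score
round(table(scores) / n.students, 3)
```

Seen against this distribution, a score of 6 out of 30 is an ordinary result for a student who guesses every answer.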

Alternatively, we could use the rbern() function to simulate the number of questions the student would answer correctly by using the p argument to specify the probability of success, or in this case, the probability that he guesses the correct answer. Note that rbern() is not part of base R; it is provided by add-on packages such as Rlab, which must be installed and loaded first. The rbern() function returns 1 for success with the probability defined by the p argument, and 0 for failure with the probability 1 - p. Therefore, a value of 1 means he answered correctly, and 0 means he did not. Let's take a look at the following code:

> set.seed(23457)
> guessed.correctly <- rbern(n=30, p=.25)
> guessed.correctly
 [1] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0

Now, we can total the number of correctly answered questions as follows:

> sum(guessed.correctly)
[1] 6

Binomial random variables

We can also use the rbinom() function to simulate Binomial random variables. The Bernoulli distribution models the outcome of a single trial, whereas the Binomial distribution models the total number of successes in a fixed number of repeated, independent Bernoulli trials. A Bernoulli random variable is therefore a Binomial random variable with one trial. For example, we could have used the rbinom() function to simulate which questions the student guessed correctly by setting the size argument to 1 to specify one trial, as follows:

> set.seed(23457)
> rbinom(n=30, size=1, p=.25)
 [1] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0

Now let's simulate the number of cracked beer bottles per hour in a manufacturing plant that produces 100 bottles an hour if the probability that a bottle cracks is 0.05. If the plant is open 10 hours a day, we can simulate the number of cracked beer bottles each hour with the rbinom() function as follows:

> set.seed(23457)
> rbinom(n=10, size=100, p=0.05)
 [1] 7 4 6 4 5 1 6 8 5 5
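The relationship between the two distributions can be sketched numerically: counting successes in 100 individual Bernoulli trials should behave like a single Binomial draw with size = 100. The number of simulated hours (10,000) and the object names below are assumptions made for this illustration.

```r
set.seed(23457)

n.hours <- 10000

# Number of cracked bottles per hour, drawn directly from the Binomial
direct <- rbinom(n.hours, size = 100, prob = 0.05)

# The same quantity built from individual Bernoulli trials:
# each column holds the 100 bottles produced in one hour
trials <- matrix(rbinom(100 * n.hours, size = 1, prob = 0.05), nrow = 100)
summed <- colSums(trials)

# Both means should be near size * prob = 5,
# and both variances near size * prob * (1 - prob) = 4.75
c(mean(direct), mean(summed))
c(var(direct), var(summed))
```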

Poisson random variables

Poisson random variables are often used to model count data, that is, the number of times an event occurs in a fixed interval of time. For example, we could simulate the number of shower-related injuries in the United States for each of the next 15 years using the rpois() function. We will assume there are approximately 43,600 cases per year. Let's take a look at the following code:

> rpois(15, 43600)
 [1] 43700 43476 43770 43928 43546 43443 43512 43627 43637 43795 43778 43799 43400 43959 43870
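A property worth keeping in mind when reading this output is that the mean and the variance of a Poisson random variable are both equal to its rate. The sketch below checks this on a larger sample; the seed 2024, the sample size, and the object names are arbitrary choices made for this example.

```r
set.seed(2024)   # arbitrary seed chosen for this sketch

lambda   <- 43600
injuries <- rpois(10000, lambda)

# Both of these should be close to lambda
mean(injuries)
var(injuries)

# The standard deviation is sqrt(lambda), roughly 209 cases, so yearly
# counts will typically fall within a few hundred of 43,600
sqrt(lambda)
lambda + c(-3, 3) * sqrt(lambda)
```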

Exponential random variables

Exponential random variables are often used to model the time until an event occurs. For example, if we assume the mean time to failure of a computer is 6 years, we can simulate the lifetime of 25 computers in a classroom using the rexp() function. In this case, we would set the rate argument to 1/6, where rate = 1/(mean time to failure). Let's take a look at the following code:

> set.seed(453)
> computer.lifetime <- rexp(25, 1/6)

We can plot a histogram of these results on the density scale, by setting the probability argument to TRUE, as follows:

> hist(computer.lifetime, probability=TRUE, col="gray", main="Exponential curve for computers with a mean time to failure of 6 years", cex.lab=1.5, cex.main=1.5)

We can then add the theoretical density curve, computed with the dexp() function, to the histogram using the curve() function with the add argument set to TRUE, as follows:

> curve(dexp(x, 1/6), add=TRUE)

The result is shown in the following plot:

[Figure: histogram of computer.lifetime with the theoretical exponential density curve overlaid]