Example 9.9 Selecting Observations at Random from a Data Set without Replacement

Goal

Create a subset of a data set by randomly choosing observations from the data set. Do not allow an observation to be selected more than once. This method is commonly referred to as simple random sampling without replacement.

Examples 9.9 and 9.10 show different methods of sampling observations from data sets.

Example Features

Featured Step: PROC SURVEYSELECT
Featured Step Options and Statements: PROC SURVEYSELECT statement with METHOD=SRS, OUT=, SAMPSIZE=, and SEED= options
Related Technique 1: DATA step with RANUNI function and PROC SORT
Related Technique 2: DATA step with RANUNI function, OUTPUT and STOP statements, SET statement with NOBS= option

Input Data Set

Data set STUDY_POOL contains demographic data on 443 patients with a disease who are eligible to be included in a study. The first 25 observations are shown.

               STUDY_POOL (first 25 observations)

                               disease_
 Obs studyid       dob  dx_age   years  duration     gender  age
   1 PK773Q  10/26/1960   40       9    late           F      49
   2 XU878R  05/06/1961   47       1    early          F      48
   3 FV465N  10/20/1977   23       9    late           F      32
   4 JP518G  04/16/1973   35       1    early          F      36
   5 PV818S  05/25/1974   25      10    late           M      35
   6 PV176X  07/19/1973   35       1    early          M      36
   7 QJ683W  11/23/1957   45       7    late           F      52
   8 QF971X  06/15/1975   33       1    early          M      34
   9 XW767L  05/27/1955   53       1    early          F      54
  10 IB738B  08/18/1962   45       2    intermediate   M      47
  11 DQ570Z  03/02/1978   30       1    early          M      31
  12 SU192H  02/14/1987   20       2    intermediate   M      22
  13 SW758U  03/05/1955   51       3    intermediate   F      54
  14 AW080T  05/04/1962   45       2    intermediate   F      47
  15 QO391M  11/04/1971   33       5    intermediate   F      38
  16 KK571Z  09/08/1962   45       2    intermediate   F      47
  17 GM905F  08/05/1959   45       5    intermediate   F      50
  18 GJ437V  01/12/1986   22       1    early          F      23
  19 OG052C  06/11/1971   37       1    early          M      38
  20 YP992Q  06/30/1972   36       1    early          M      37
  21 UN028V  04/22/1979   20      10    late           F      30
  22 SE317H  01/21/1948   55       6    late           M      61
  23 GZ756K  08/25/1953   55       1    early          M      56
  24 RZ619Q  12/03/1960   48       1    early          F      49
  25 VK658B  11/12/1986   20       3    intermediate   F      23

Resulting Data Set

Output 9.9a SAMPLE25 Data Set

Example 9.9 SAMPLE25 Data Set Created with PROC SURVEYSELECT

                               disease_
 Obs studyid        dob dx_age   years  duration     gender age
   1 IB738B  08/18/1962   45       2    intermediate   M     47
   2 SW758U  03/05/1955   51       3    intermediate   F     54
   3 YP992Q  06/30/1972   36       1    early          M     37
   4 QX762Q  06/17/1948   59       2    intermediate   M     61
   5 MF193R  09/14/1954   52       3    intermediate   M     55
   6 AG642K  01/18/1987   21       1    early          F     22
   7 CH513Z  04/07/1951   57       1    early          M     58
   8 FT520R  11/25/1951   53       5    intermediate   F     58
   9 BT232M  07/11/1985   20       4    intermediate   M     24
  10 LS696T  10/18/1981   26       2    intermediate   F     28
  11 EI188W  06/12/1945   61       3    intermediate   M     64
  12 QN271N  06/22/1985   23       1    early          M     24
  13 QI177T  08/27/1948   60       1    early          M     61
  14 EC092I  02/16/1987   21       1    early          F     22
  15 NR337X  06/25/1955   53       1    early          F     54
  16 AO783K  01/05/1964   40       5    intermediate   F     45
  17 AT949T  08/14/1970   27      12    late           F     39
  18 LQ219I  01/27/1948   59       2    intermediate   F     61
  19 JV329F  05/12/1945   63       1    early          F     64
  20 DU189X  05/02/1969   39       1    early          M     40
  21 IQ901W  08/09/1956   52       1    early          F     53
  22 JB893X  11/17/1947   57       5    intermediate   F     62
  23 YV984F  08/24/1949   57       3    intermediate   M     60
  24 AT193G  05/29/1984   17       8    late           M     25
  25 UU780P  07/12/1945   62       2    intermediate   M     64


Example Overview

The following PROC SURVEYSELECT step illustrates simple random sampling without replacement. A specific number of observations are selected at random from a data set with the condition that an observation can be selected only once.

PROC SURVEYSELECT has multiple options that can customize the sampling of a data set. This example specifies two options, METHOD= and SAMPSIZE=, that tell the procedure to select 25 observations at random from data set STUDY_POOL without replacement. A third option, OUT=, saves the selected observations in data set SAMPLE25. A fourth option, SEED=, supplies a seed value to the random number generator so that the sampling is reproducible.

Program

Select observations at random from STUDY_POOL. Use simple random sampling, which selects with equal probability and without replacement. Save the selected observations in data set SAMPLE25. Select 25 observations. Specify an initial seed for random number generation so that the selection is reproducible. To have SURVEYSELECT initialize the seed from the clock instead, either omit the SEED= option or set it to 0 or a negative number.

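proc surveyselect data=study_pool
                  method=srs       /* simple random sampling without replacement */
                  out=sample25     /* output data set for the selected observations */
                  sampsize=25      /* number of observations to select */
                  seed=2525;       /* fixed seed for a reproducible sample */
run;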
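
One way to confirm that no patient appears more than once in the sample is to compare the number of rows in SAMPLE25 with the number of distinct identifiers. The following sketch assumes that STUDYID uniquely identifies each observation in STUDY_POOL, which the example does not state explicitly. The column names N_OBS and N_DISTINCT_IDS are illustrative only.

```sas
/* Sketch: check the sample for repeated identifiers.              */
/* Assumption: STUDYID is unique within STUDY_POOL.                 */
proc sql;
  select count(*)                as n_obs,
         count(distinct studyid) as n_distinct_ids
    from sample25;
quit;
```

If the assumption holds, the two counts should match, because sampling without replacement never selects the same observation twice.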

Related Technique 1

The following program randomly selects 25 observations from a data set without replacement. It starts by assigning a random number to every observation in data set STUDY_POOL. The resulting data set is sorted by the random numbers. A DATA step then creates the random sample of 25 observations by selecting the first 25 observations from the data set sorted by the random numbers.

While PROC SURVEYSELECT performs simple random sampling without replacement, you might want to use a DATA step instead if you need to modify the selected observations with programming statements.

The program in Related Technique 2 also uses a DATA step to select 25 observations at random. It does not require sorting of the data set and reads the data set only once. While the DATA step in Related Technique 1 is easy to understand, depending on your computer resources, it might be more efficient to use Related Technique 2 when you need to select a small sample from a very large data set.

Create data set STUDY_POOL2. Read the observations from STUDY_POOL. Assign to variable RANDOM a random number drawn from the uniform distribution. Specify a value greater than 0 for the seed so that the stream of numbers is reproducible. If you want a stream of numbers that is not reproducible, specify the seed as 0 so that the random number generator initializes the generation process from the clock time.

Sort the observations by the random numbers.

Create the sample data set. Save the first 25 observations from STUDY_POOL2 in SAMPLE25. Do not save the variable that contains the random numbers.

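/* Assign a uniform random number to every observation */
data study_pool2;
  set study_pool;
  random=ranuni(54918);
run;

/* Put the observations in random order */
proc sort data=study_pool2;
  by random;
run;

/* Keep the first 25 observations and drop the random number */
data sample25;
  set study_pool2(obs=25 drop=random);
run;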

Output 9.9b SAMPLE25 Data Set Created with PROC SORT and DATA Step

Example 9.9 SAMPLE25 Related Technique 1 Data Set Created with PROC SORT and DATA Step

                                       disease_
  Obs  studyid          dob   dx_age     years    duration      gender   age

    1  VK162Z    12/31/1963     34        12      late             F      46
    2  YB377F    02/21/1950     58         1      early            M      59
    3  IB557W    11/29/1968     40         1      early            M      41
    4  CS710N    08/10/1960     48         1      early            F      49
    5  WI301T    09/17/1958     40        11      late             M      51
    6  EW595I    02/07/1986     21         2      intermediate     F      23
    7  VZ638C    08/16/1947     61         1      early            M      62
    8  JB893X    11/17/1947     57         5      intermediate     F      62
    9  QL460E    06/13/1973     33         3      intermediate     M      36
   10  BG914M    01/19/1984     24         1      early            F      25
   11  UV157Y    09/10/1967     41         1      early            F      42
   12  RM904X    10/16/1979     28         2      intermediate     M      30
   13  DT459U    01/28/1977     30         2      intermediate     M      32
   14  KP962X    11/04/1970     33         6      late             M      39
   15  VK949Q    11/13/1959     45         5      intermediate     F      50
   16  AY000Y    09/20/1963     42         4      intermediate     F      46
   17  SJ380M    03/23/1968     37         4      intermediate     M      41
   18  KN379R    10/28/1971     35         3      intermediate     M      38
   19  TP259E    03/31/1987     19         3      intermediate     F      22
   20  UW581J    10/24/1958     49         2      intermediate     M      51
   21  QF115M    03/20/1971     29         9      late             F      38
   22  IN550Y    04/01/1983     18         8      late             M      26
   23  CO796N    09/11/1971     37         1      early            M      38
   24  ME619D    09/03/1959     46         4      intermediate     M      50
   25  UL539V    05/30/1971     33         5      intermediate     M      38
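
The DATA step approach is suggested above for cases where the selected observations need to be modified with programming statements. As a minimal sketch of that idea, the final DATA step could add a variable while it reads the sorted data. The variable name SELECTION_ORDER and the data set name SAMPLE25_ORDERED are hypothetical and are not part of the original example. The step assumes that STUDY_POOL2 has already been sorted by RANDOM as shown earlier.

```sas
/* Sketch: keep the first 25 randomly ordered observations and     */
/* record each one's position in the random ordering.              */
/* SELECTION_ORDER and SAMPLE25_ORDERED are hypothetical names.    */
data sample25_ordered;
  set study_pool2(obs=25 drop=random);
  selection_order = _n_;   /* 1 through 25 */
run;
```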


Related Technique 2

The following DATA step reads all the observations from STUDY_POOL and computes a probability conditional on the number of observations that remain in the data set and the number that is needed to complete the sample. For example, if 50 observations remain to be processed in the data set and you need five more in your sample, you select the next observation with probability 5/50.

The DATA step assigns the total number of observations in STUDY_POOL to variable N. Variable K tracks how many observations still need to be selected. When an observation is selected, an assignment statement decrements the value of K by 1. The value of N is decremented by 1 with each iteration of the DATA step. The probability of selecting the observation that is currently being processed is K/N. If the value that is returned by the RANUNI function is less than or equal to that value, the observation is selected and output to SAMPLE25.
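
The conditional probabilities described above still give every observation the same overall chance of selection. The short derivation below is a sketch of why, using symbols that are not in the original text: N for the total number of observations and k for the sample size. It works through the first two observations. The same argument extends to later observations.

```latex
% N = number of observations in the data set, k = sample size
% (N = 443 and k = 25 in this example)
\begin{align*}
P(\text{obs. 1 selected}) &= \frac{k}{N} \\[4pt]
P(\text{obs. 2 selected}) &= \frac{k}{N}\cdot\frac{k-1}{N-1}
                            + \frac{N-k}{N}\cdot\frac{k}{N-1} \\[4pt]
                          &= \frac{k\,\bigl[(k-1) + (N-k)\bigr]}{N(N-1)}
                           = \frac{k}{N}
\end{align*}
```

With N = 443 and k = 25, each observation has an overall selection probability of 25/443, or about 0.056.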

Create data set SAMPLE25. Read the observations in STUDY_POOL. Assign the total number of observations in STUDY_POOL to temporary variable TOTAL.

Exclude variables K and N from the output data set. Retain the values of K and N across iterations of the DATA step. Initialize K to the number of observations to select from STUDY_POOL.

On the first iteration of the DATA step, copy the value of temporary variable TOTAL, which holds the total number of observations in STUDY_POOL, to variable N.

Compute the probability of selecting the current observation by dividing the number of observations that still need to be selected (K) by the number of observations that remain to be processed, including the current observation (N). Obtain a random number from the uniform distribution on the interval (0,1), which excludes the endpoints 0 and 1. If the random number is less than or equal to the probability, select the observation.

Write the selected observation to the output data set.

Decrement the value of K by 1 because one fewer observation remains to be selected from STUDY_POOL.

Decrement the number of observations yet to process in STUDY_POOL by 1 with each iteration of the DATA step after the processing of the current observation has been completed. When all the observations needed in the sample have been selected, stop the DATA step.

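data sample25;
  /* NOBS= stores the number of observations in STUDY_POOL in TOTAL */
  set study_pool nobs=total;

  drop k n;          /* keep the counters out of SAMPLE25 */
  retain k 25 n;     /* K = observations still needed, N = observations left to process */

  if _n_=1 then n=total;

  /* Select the current observation with probability K/N */
  if ranuni(381705) <= k/n then do;
    output;
    k=k-1;
  end;
  n=n-1;

  /* Stop once the sample is complete */
  if k=0 then stop;
run;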

Output 9.9c displays the results of the preceding program.

Output 9.9c SAMPLE25 Data Set Created with DATA Step

   Example 9.9 SAMPLE25 Related Technique 2 Data Set Created with DATA Step

                                        disease_
  Obs   studyid          dob   dx_age     years    duration       gender  age

    1   PV176X    07/19/1973     35         1      early            M      36
    2   OG052C    06/11/1971     37         1      early            M      38
    3   YP992Q    06/30/1972     36         1      early            M      37
    4   SE317H    01/21/1948     55         6      late             M      61
    5   WH928R    01/01/1951     57         1      early            F      58
    6   GA537T    10/13/1945     62         2      intermediate     F      64
    7   FO764G    08/14/1959     48         2      intermediate     F      50
    8   ZD500L    10/13/1973     35         1      early            M      36
    9   VD401X    08/30/1964     44         1      early            M      45
   10   GA105Z    03/21/1961     45         3      intermediate     F      48
   11   JX876R    09/05/1954     54         1      early            F      55
   12   AO783K    01/05/1964     40         5      intermediate     F      45
   13   SP922W    10/29/1978     20        11      late             F      31
   14   DC238T    09/16/1954     53         2      intermediate     M      55
   15   BF950R    12/14/1953     52         4      intermediate     F      56
   16   WW597U    08/15/1949     59         1      early            F      60
   17   KA204V    01/24/1971     36         2      intermediate     F      38
   18   NM423D    06/04/1954     53         2      intermediate     M      55
   19   WL948D    05/03/1985     21         3      intermediate     M      24
   20   JE523E    02/13/1968     37         4      intermediate     F      41
   21   LC395W    04/26/1968     40         1      early            M      41
   22   GR407P    11/02/1950     53         6      late             F      59
   23   TP462Q    07/11/1956     52         1      early            M      53
   24   VX080R    10/14/1953     52         4      intermediate     M      56
   25   UU780P    07/12/1945     62         2      intermediate     M      64
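
The RAND function, seeded with CALL STREAMINIT, is a newer alternative to RANUNI. The following sketch applies the same selection logic with RAND('UNIFORM') and assumes a SAS release that supports both calls. The output data set name SAMPLE25_RAND is hypothetical. Because the two generators differ, the selected observations are not expected to match Output 9.9c, even with the same seed value.

```sas
/* Sketch: Related Technique 2 rewritten with RAND and STREAMINIT. */
/* SAMPLE25_RAND is a hypothetical data set name.                  */
data sample25_rand;
  set study_pool nobs=total;

  drop k n;
  retain k 25 n;

  if _n_ = 1 then do;
    call streaminit(381705);   /* seed the RAND stream once */
    n = total;
  end;

  /* Select the current observation with probability K/N */
  if rand('uniform') <= k/n then do;
    output;
    k = k - 1;
  end;
  n = n - 1;

  if k = 0 then stop;
run;
```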

