Example 9.10 Selecting Equal-Sized Samples from Different Groups

Goal

Randomly select the same number of observations from different groups in a data set. Do not allow an observation to be selected more than once. This method is commonly referred to as stratified random sampling without replacement and with equal allocation.

Additionally, create replicate samples by repeating the process of randomly selecting the same number of observations from different groups.

Examples 9.9 and 9.10 show different methods of sampling observations from data sets.

Example Features

Featured Step: PROC SURVEYSELECT
Featured Step Options and Statements: PROC SURVEYSELECT statement options METHOD=SRS, OUT=, REPS=, SAMPSIZE=, and SEED=; STRATA statement
Related Technique: PROC SORT and DATA step with RANUNI function

Input Data Set

Data set JINGLE_WINNERS contains information on 208 winners of an advertising jingle contest for three products. The first 30 observations are shown.

             JINGLE_WINNERS (first 30 observations)

Obs   contestant     address                         category
  1   Adams BK       Osseo, WI 54758               Energy Bar
  2   Adams OT       Morganfield, KY 42437         Energy Bar
  3   Alexander IE   Columbus, WI 53925            Energy Bar
  4   Allen WN       Norfolk, VA 23518             Sport Drink
  5   Anderson EL    McIntire, IA 50455            Energy Bar
  6   Anderson FW    Reelsville, IN 46171          Energy Bar
  7   Anderson GC    Augusta, ME 04330             Sport Drink
  8   Anderson VU    Greensboro, NC 27416          Sport Drink
  9   Bailey DA      Branson, MO 65615             Protein Shake
 10   Bailey IA      Glen Mills, PA 19342          Sport Drink
 11   Bailey NJ      Dover, OK 73734               Protein Shake
 12   Bailey VI      Atkinson, NE 68713            Protein Shake
 13   Bailey ZO      Fleetwood, PA 19522           Sport Drink
 14   Baker KE       Oxford, MD 21654              Sport Drink
 15   Bell FK        Temple, ME 04984              Sport Drink
 16   Bell MQ        Burdick, KS 66838             Protein Shake
 17   Bennett QB     Mount Enterprise, TX 75681    Protein Shake
 18   Brown FU       Browning, MO 64630            Protein Shake
 19   Brown QD       Catlettsburg, KY 41129        Energy Bar
 20   Brown TF       Carnelian Bay, CA 96140       Protein Shake
 21   Brown TX       Walthourville, GA 31333       Sport Drink
 22   Brown WG       Crenshaw, MS 38621            Energy Bar
 23   Brown ZZ       Tilden, IL 62292              Protein Shake
 24   Campbell ZB    Beaverton, OR 97077           Protein Shake
 25   Carter KC      Antimony, UT 84712            Protein Shake
 26   Clark DA       Cawker City, KS 67430         Protein Shake
 27   Clark DO       Lake Worth, FL 33466          Sport Drink
 28   Collins QN     West Memphis, AR 72303        Protein Shake
 29   Cooper GP      Claudville, VA 24076          Sport Drink
 30   Cooper PM      Jackson, NC 27845             Sport Drink

Input Frequencies

Output 9.10a Frequency of CATEGORY in JINGLE_WINNERS

                         JINGLE_WINNERS

                       The FREQ Procedure

                                        Cumulative   Cumulative
 category        Frequency    Percent    Frequency     Percent
 --------------------------------------------------------------
 Energy Bar            56      26.92           56        26.92
 Protein Shake         82      39.42          138        66.35
 Sport Drink           70      33.65          208       100.00


Resulting Data Set

Output 9.10b GRANDPRIZES Data Set

             Example 9.10 GRANDPRIZES Data Set Created by PROC SURVEYSELECT

                                                                      Selection Sampling
 Obs   category    Replicate contestant    address                       Prob    Weight

   1 Energy Bar        1     Adams OT      Morganfield, KY 42437       0.089286   11.2
   2 Energy Bar        1     Alexander IE  Columbus, WI 53925          0.089286   11.2
   3 Energy Bar        1     Smith UC      Raymond, IA 50667           0.089286   11.2
   4 Energy Bar        1     Ward EN       Elberfeld, IN 47613         0.089286   11.2
   5 Energy Bar        1     Wright BR     Silver Star, MT 59751       0.089286   11.2
   6 Energy Bar        2     Adams BK      Osseo, WI 54758             0.089286   11.2
   7 Energy Bar        2     Green KX      Milwaukee, WI 53220         0.089286   11.2
   8 Energy Bar        2     Harris EL     Le Grand, IA 50142          0.089286   11.2
   9 Energy Bar        2     Hernandez GB  Scott, MS 38772             0.089286   11.2
  10 Energy Bar        2     Jones IL      Pontotoc, MS 38863          0.089286   11.2
  11 Protein Shake     1     Bennett QB    Mount Enterprise, TX 75681  0.060976   16.4
  12 Protein Shake     1     Garcia FR     Aimwell, LA 71401           0.060976   16.4
  13 Protein Shake     1     Green DN      Seattle, WA 98111           0.060976   16.4
  14 Protein Shake     1     Johnson XJ    Larkspur, CO 80118          0.060976   16.4
  15 Protein Shake     1     Lewis RE      Imperial, MO 63052          0.060976   16.4
  16 Protein Shake     2     Hill LH       Sherrill, AR 72152          0.060976   16.4
  17 Protein Shake     2     Martin IZ     Dallas, TX 75379            0.060976   16.4
  18 Protein Shake     2     Ramirez QF    Sparks, NV 89435            0.060976   16.4
  19 Protein Shake     2     Robinson CL   Lincoln, NE 68522           0.060976   16.4
  20 Protein Shake     2     Smith TM      Camp, AR 72520              0.060976   16.4
  21 Sport Drink       1     Jackson CP    New Kingstown, PA 17072     0.071429   14.0
  22 Sport Drink       1     Rivera BK     West Willow, PA 17583       0.071429   14.0
  23 Sport Drink       1     Robinson KO   Auburn, NY 13021            0.071429   14.0
  24 Sport Drink       1     Smith HQ      Newtonville, MA 02460       0.071429   14.0
  25 Sport Drink       1     Washington FG Kingston, PA 18704          0.071429   14.0
  26 Sport Drink       2     Garcia IK     West Poland, ME 04291       0.071429   14.0
  27 Sport Drink       2     Johnson TS    Colts Neck, NJ 07722        0.071429   14.0
  28 Sport Drink       2     Miller SV     Hanover, PA 17333           0.071429   14.0
  29 Sport Drink       2     Thomas OF     New York, NY 10003          0.071429   14.0
  30 Sport Drink       2     Thompson YV   Allison Park, PA 15101      0.071429   14.0


Resulting Output

Output 9.10c Frequency of CATEGORY in GRANDPRIZES

        Example 9.10 GRANDPRIZES Data Set Created by PROC SURVEYSELECT

                               The FREQ Procedure

                                                       Cumulative    Cumulative
category         replicate    Frequency     Percent     Frequency      Percent
-------------------------------------------------------------------------------
Energy Bar               1           5       16.67             5        16.67
Energy Bar               2           5       16.67            10        33.33
Protein Shake            1           5       16.67            15        50.00
Protein Shake            2           5       16.67            20        66.67
Sport Drink              1           5       16.67            25        83.33
Sport Drink              2           5       16.67            30       100.00


Example Overview

The following PROC SURVEYSELECT step illustrates stratified random sampling without replacement and with equal allocation. A specific number of observations are selected at random from each stratum in a data set with the condition that an observation can be selected only once.

PROC SURVEYSELECT has multiple options that can customize the sampling of a data set. The STRATA statement names the variable whose values partition the data set into groups that do not overlap.

Input data set JINGLE_WINNERS has 208 observations. The observations contain the names of winners of a jingle contest for three product categories identified by the values of variable CATEGORY. Two rounds of prizes will be awarded with five prizes awarded in each category each time.

Variable CATEGORY has three values, and the process of selecting five observations for each value of CATEGORY is performed twice. Therefore, a total of 30 observations (3 categories x 5 observations x 2 replicates) will be selected.

In the following program, the two PROC SURVEYSELECT statement options, METHOD= and SAMPSIZE=, in conjunction with the STRATA statement tell the procedure to select five observations at random without replacement from each group that is defined by the three values of variable CATEGORY in data set JINGLE_WINNERS. A third option, OUT=, saves the selected observations in data set GRANDPRIZES. A fourth option, REPS=, tells the procedure to perform the complete selection process twice. A fifth option, SEED=, specifies a seed value for the random number generator so that the sampling is reproducible.
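
The Selection Prob and Sampling Weight columns in Output 9.10b can be checked by hand from the stratum sizes in Output 9.10a. The worked arithmetic below is a sketch: the symbols N_h (stratum size), n_h (sample size per stratum), pi_h (selection probability), and w_h (sampling weight) are notation introduced here for illustration and do not appear in the original example.

```latex
\begin{align*}
\pi_h &= \frac{n_h}{N_h}, &
w_h &= \frac{1}{\pi_h} = \frac{N_h}{n_h} \\[4pt]
\text{Energy Bar:}\quad \pi_h &= \frac{5}{56} \approx 0.089286, & w_h &= \frac{56}{5} = 11.2 \\
\text{Protein Shake:}\quad \pi_h &= \frac{5}{82} \approx 0.060976, & w_h &= \frac{82}{5} = 16.4 \\
\text{Sport Drink:}\quad \pi_h &= \frac{5}{70} \approx 0.071429, & w_h &= \frac{70}{5} = 14.0
\end{align*}
```

Because the allocation is equal (five per stratum) while the strata differ in size, the selection probability is smallest, and the weight largest, in the largest stratum.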

Within one replicate of selection, observations are selected without replacement. However, each replicate is drawn independently of the other, so selection is not without replacement across replicates. Therefore, it is possible for a winner in the first round to also be a winner in the second round.
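
One way to see whether any contestant was in fact drawn in both rounds is to count how often each contestant appears in GRANDPRIZES once the program in the Program section has created it. The following is a minimal sketch, not part of the original example. It assumes that the combination of CATEGORY, CONTESTANT, and ADDRESS identifies one person, and the column name TIMES_SELECTED is made up for illustration.

```sas
/* Sketch: list contestants selected in more than one replicate.        */
/* Assumes CATEGORY + CONTESTANT + ADDRESS identifies a single person.  */
proc sql;
  select category, contestant, address,
         count(*) as times_selected   /* hypothetical column name */
    from grandprizes
    group by category, contestant, address
    having count(*) > 1;
quit;
```

For the 30 observations listed in Output 9.10b, no contestant appears twice, so this query would return no rows for that particular sample. A different seed could produce repeats.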

The input data set must be sorted by the variables that are named in the STRATA statement before executing PROC SURVEYSELECT. The procedure does have additional statements and options that can be used to tell the procedure to do the sorting. For more information about the sampling methods, statements, and options that allow this, see SAS documentation.

Program

Sort JINGLE_WINNERS by the variable named in the STRATA statement. Select observations at random. Within each stratum, select the observations with simple random sampling, which is selection with equal probability and without replacement. Save the selected observations in data set GRANDPRIZES. Perform the stratified random sampling process twice. Select five observations from each stratum that is defined by the values of variable CATEGORY. Specify an initial seed for random number generation so that the selection is reproducible. If you want SURVEYSELECT to use the clock to initialize the seed, do not specify the SEED= option or specify it as 0 or a negative number. Name the variable whose values define the groups in which to randomly select the five observations.

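/* Sort by the variable named in the STRATA statement */
proc sort data=jingle_winners;
  by category;
run;

proc surveyselect data=jingle_winners
                  method=srs       /* simple random sampling, without replacement */
                  out=grandprizes  /* save the selected observations */
                  reps=2           /* perform the selection process twice */
                  sampsize=5       /* select five observations per stratum */
                  seed=65;         /* fixed seed so the selection is reproducible */
  strata category;                 /* variable whose values define the strata */
run;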
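
The original example shows Output 9.10c but not the step that produced it. A report of that shape could be obtained with a PROC FREQ step like the sketch below. The TITLE text is copied from Output 9.10c, and the use of the LIST option is an assumption based on the list-style layout of that output.

```sas
/* Sketch: confirm that each CATEGORY x REPLICATE cell holds five observations. */
title "Example 9.10 GRANDPRIZES Data Set Created by PROC SURVEYSELECT";
proc freq data=grandprizes;
  /* LIST prints the two-way table one row per combination,
     with cumulative frequency and percent columns */
  tables category*replicate / list;
run;
title;   /* clear the title */
```

Each of the six combinations should show a frequency of 5, as in Output 9.10c.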

Related Technique

The following program randomly selects without replacement 10 observations from each BY group. The first five observations selected within a BY group are assigned to replicate 1, and the second five observations are assigned to replicate 2.

The first DATA step assigns a random number to every observation in data set JINGLE_WINNERS. The resulting data set is sorted by the random numbers. The second DATA step then selects from the data set that was sorted by the random numbers the first 10 observations in each of the three BY groups.

The preceding PROC SURVEYSELECT step selected observations without replacement only within the replicate. An observation could be selected again in the second replicate. The following program does not select a person more than once across the replicates.

Create data set WINNERS2. Read the observations from JINGLE_WINNERS. Assign to variable RN a random number drawn from the uniform distribution. Specify a value greater than 0 for the seed so that the stream of numbers is reproducible. If you want to produce a stream of numbers that is not reproducible, specify the seed as 0 so that the random number generator initializes the generation process by using the clock time.

Sort by the random numbers within each BY group.

Create the sample data set.

Process WINNERS2 in BY groups.

Tally the number of observations selected with the COUNT variable. Initialize COUNT to 0 at the start of each BY group. Increment the COUNT variable by 1 each time an observation is processed. Assign the first five observations selected to replicate 1. Assign the remaining observations to replicate 2. Save the first 10 observations in each BY group.

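data winners2;
  set jingle_winners;
  rn=ranuni(6209193);   /* uniform random number; a positive seed is reproducible */
run;

proc sort data=winners2;
  by category rn;       /* random order within each category */
run;

data grandprizes;
  set winners2;
  by category;
  drop count rn;
  if first.category then count=0;   /* restart the tally for each BY group */
  count+1;                          /* sum statement: COUNT is retained across observations */
  if count le 5 then replicate=1;   /* first five selected go to replicate 1 */
  else replicate=2;                 /* remaining observations go to replicate 2 */
  if count le 10 then output;       /* keep the first 10 observations per BY group */
run;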

Output 9.10d displays the results of the preceding program.

Output 9.10d GRANDPRIZES Data Set Created by PROC SORT and DATA Step

   Example 9.10 GRANDPRIZES Related Technique Data Set Created by PROC SORT and DATA Step

     Obs    contestant       address                        category       replicate

       1    Green KX         Milwaukee, WI 53220          Energy Bar           1
       2    Moore AH         Twin Brooks, SD 57269        Energy Bar           1
       3    Miller IC        Vadnais Heights, MN 55127    Energy Bar           1
       4    Adams OT         Morganfield, KY 42437        Energy Bar           1
       5    King FT          Rockford, IL 61103           Energy Bar           1
       6    Rivera JT        Bath, OH 44210               Energy Bar           2
       7    Davis VH         Ten Mile, TN 37880           Energy Bar           2
       8    Brown QD         Catlettsburg, KY 41129       Energy Bar           2
       9    Johnson CJ       Edmonton, KY 42129           Energy Bar           2
      10    Johnson BT       Alexandria, SD 57311         Energy Bar           2
      11    Morris LI        Wamego, KS 66547             Protein Shake        1
      12    Miller SZ        Ferriday, LA 71334           Protein Shake        1
      13    Sanchez DJ       Geronimo, TX 78115           Protein Shake        1
      14    Moore GG         San Rafael, CA 94903         Protein Shake        1
      15    Hernandez GS     St. Louis, MO 63115          Protein Shake        1
      16    Lee JJ           Garrison, UT 84728           Protein Shake        2
      17    Carter KC        Antimony, UT 84712           Protein Shake        2
      18    Hill LH          Sherrill, AR 72152           Protein Shake        2
      19    Edwards SX       Bosler, WY 82051             Protein Shake        2
      20    Torres JL        Lacey, WA 98503              Protein Shake        2
      21    Clark DO         Lake Worth, FL 33466         Sport Drink          1
      22    Smith UL         Minden, WV 25879             Sport Drink          1
      23    Flores ZA        Patten, ME 04765             Sport Drink          1
      24    Garcia IW        Springwater, NY 14560        Sport Drink          1
      25    Washington FG    Kingston, PA 18704           Sport Drink          1
      26    Johnson TS       Colts Neck, NJ 07722         Sport Drink          2
      27    Washington NV    Tampa, FL 33672              Sport Drink          2
      28    Richardson QY    Sparkill, NY 10976           Sport Drink          2
      29    Cooper GP        Claudville, VA 24076         Sport Drink          2
      30    Jones WQ         Milton, WV 25541             Sport Drink          2
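
As a variation on the first DATA step of the related technique, the random numbers could be drawn with the newer RAND function instead of RANUNI. This is a sketch and not part of the original example. It reuses the seed value from the program above only for convenience, and because RAND uses a different generator, the selected contestants would not match Output 9.10d.

```sas
/* Sketch: same idea as the first DATA step, using RAND instead of RANUNI. */
data winners2;
  set jingle_winners;
  /* Seed the RAND stream once; a positive seed makes the stream reproducible */
  if _n_=1 then call streaminit(6209193);
  rn=rand('uniform');   /* uniform random number between 0 and 1 */
run;
```

The PROC SORT step and the second DATA step would stay the same, since they depend only on the variable RN.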

