Randomly select the same number of observations from different groups in a data set. Do not allow an observation to be selected more than once. This method is commonly referred to as stratified random sampling without replacement and with equal allocation.
Additionally, create replicate samples by repeating the process of randomly selecting the same number of observations from different groups.
Examples 9.9 and 9.10 show different methods of sampling observations from data sets.
Featured Step | PROC SURVEYSELECT |
Featured Step Options and Statements | PROC SURVEYSELECT statement options: METHOD=SRS, OUT=, REPS=, SAMPSIZE=, SEED=; STRATA statement |
Related Technique | PROC SORT and DATA step with RANUNI function |
Data set JINGLE_WINNERS contains information on 208 winners of an advertising jingle contest for three products. The first 30 observations are shown.
JINGLE_WINNERS (first 30 observations)
Obs | contestant | address | category |
1 | Adams BK | Osseo, WI 54758 | Energy Bar |
2 | Adams OT | Morganfield, KY 42437 | Energy Bar |
3 | Alexander IE | Columbus, WI 53925 | Energy Bar |
4 | Allen WN | Norfolk, VA 23518 | Sport Drink |
5 | Anderson EL | McIntire, IA 50455 | Energy Bar |
6 | Anderson FW | Reelsville, IN 46171 | Energy Bar |
7 | Anderson GC | Augusta, ME 04330 | Sport Drink |
8 | Anderson VU | Greensboro, NC 27416 | Sport Drink |
9 | Bailey DA | Branson, MO 65615 | Protein Shake |
10 | Bailey IA | Glen Mills, PA 19342 | Sport Drink |
11 | Bailey NJ | Dover, OK 73734 | Protein Shake |
12 | Bailey VI | Atkinson, NE 68713 | Protein Shake |
13 | Bailey ZO | Fleetwood, PA 19522 | Sport Drink |
14 | Baker KE | Oxford, MD 21654 | Sport Drink |
15 | Bell FK | Temple, ME 04984 | Sport Drink |
16 | Bell MQ | Burdick, KS 66838 | Protein Shake |
17 | Bennett QB | Mount Enterprise, TX 75681 | Protein Shake |
18 | Brown FU | Browning, MO 64630 | Protein Shake |
19 | Brown QD | Catlettsburg, KY 41129 | Energy Bar |
20 | Brown TF | Carnelian Bay, CA 96140 | Protein Shake |
21 | Brown TX | Walthourville, GA 31333 | Sport Drink |
22 | Brown WG | Crenshaw, MS 38621 | Energy Bar |
23 | Brown ZZ | Tilden, IL 62292 | Protein Shake |
24 | Campbell ZB | Beaverton, OR 97077 | Protein Shake |
25 | Carter KC | Antimony, UT 84712 | Protein Shake |
26 | Clark DA | Cawker City, KS 67430 | Protein Shake |
27 | Clark DO | Lake Worth, FL 33466 | Sport Drink |
28 | Collins QN | West Memphis, AR 72303 | Protein Shake |
29 | Cooper GP | Claudville, VA 24076 | Sport Drink |
30 | Cooper PM | Jackson, NC 27845 | Sport Drink |
Output 9.10a Frequency of CATEGORY in JINGLE_WINNERS
JINGLE_WINNERS
The FREQ Procedure
category | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
Energy Bar | 56 | 26.92 | 56 | 26.92 |
Protein Shake | 82 | 39.42 | 138 | 66.35 |
Sport Drink | 70 | 33.65 | 208 | 100.00 |
Output 9.10b GRANDPRIZES Data Set
Example 9.10 GRANDPRIZES Data Set Created by PROC SURVEYSELECT
Obs | category | Replicate | contestant | address | Selection Prob | Sampling Weight |
1 | Energy Bar | 1 | Adams OT | Morganfield, KY 42437 | 0.089286 | 11.2 |
2 | Energy Bar | 1 | Alexander IE | Columbus, WI 53925 | 0.089286 | 11.2 |
3 | Energy Bar | 1 | Smith UC | Raymond, IA 50667 | 0.089286 | 11.2 |
4 | Energy Bar | 1 | Ward EN | Elberfeld, IN 47613 | 0.089286 | 11.2 |
5 | Energy Bar | 1 | Wright BR | Silver Star, MT 59751 | 0.089286 | 11.2 |
6 | Energy Bar | 2 | Adams BK | Osseo, WI 54758 | 0.089286 | 11.2 |
7 | Energy Bar | 2 | Green KX | Milwaukee, WI 53220 | 0.089286 | 11.2 |
8 | Energy Bar | 2 | Harris EL | Le Grand, IA 50142 | 0.089286 | 11.2 |
9 | Energy Bar | 2 | Hernandez GB | Scott, MS 38772 | 0.089286 | 11.2 |
10 | Energy Bar | 2 | Jones IL | Pontotoc, MS 38863 | 0.089286 | 11.2 |
11 | Protein Shake | 1 | Bennett QB | Mount Enterprise, TX 75681 | 0.060976 | 16.4 |
12 | Protein Shake | 1 | Garcia FR | Aimwell, LA 71401 | 0.060976 | 16.4 |
13 | Protein Shake | 1 | Green DN | Seattle, WA 98111 | 0.060976 | 16.4 |
14 | Protein Shake | 1 | Johnson XJ | Larkspur, CO 80118 | 0.060976 | 16.4 |
15 | Protein Shake | 1 | Lewis RE | Imperial, MO 63052 | 0.060976 | 16.4 |
16 | Protein Shake | 2 | Hill LH | Sherrill, AR 72152 | 0.060976 | 16.4 |
17 | Protein Shake | 2 | Martin IZ | Dallas, TX 75379 | 0.060976 | 16.4 |
18 | Protein Shake | 2 | Ramirez QF | Sparks, NV 89435 | 0.060976 | 16.4 |
19 | Protein Shake | 2 | Robinson CL | Lincoln, NE 68522 | 0.060976 | 16.4 |
20 | Protein Shake | 2 | Smith TM | Camp, AR 72520 | 0.060976 | 16.4 |
21 | Sport Drink | 1 | Jackson CP | New Kingstown, PA 17072 | 0.071429 | 14.0 |
22 | Sport Drink | 1 | Rivera BK | West Willow, PA 17583 | 0.071429 | 14.0 |
23 | Sport Drink | 1 | Robinson KO | Auburn, NY 13021 | 0.071429 | 14.0 |
24 | Sport Drink | 1 | Smith HQ | Newtonville, MA 02460 | 0.071429 | 14.0 |
25 | Sport Drink | 1 | Washington FG | Kingston, PA 18704 | 0.071429 | 14.0 |
26 | Sport Drink | 2 | Garcia IK | West Poland, ME 04291 | 0.071429 | 14.0 |
27 | Sport Drink | 2 | Johnson TS | Colts Neck, NJ 07722 | 0.071429 | 14.0 |
28 | Sport Drink | 2 | Miller SV | Hanover, PA 17333 | 0.071429 | 14.0 |
29 | Sport Drink | 2 | Thomas OF | New York, NY 10003 | 0.071429 | 14.0 |
30 | Sport Drink | 2 | Thompson YV | Allison Park, PA 15101 | 0.071429 | 14.0 |
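The Selection Prob and Sampling Weight values in Output 9.10b can be checked against the stratum counts in Output 9.10a. The arithmetic below is a sketch that assumes the usual relationships for simple random sampling within a stratum: the selection probability is the sample size divided by the stratum size, and the sampling weight is its reciprocal. The symbols n, N_h, \pi_h, and w_h are notation introduced here for illustration and do not appear in the source.

```latex
% n = 5 observations per stratum per replicate; N_h = stratum size from Output 9.10a
\begin{align*}
\pi_h &= \frac{n}{N_h}, \qquad w_h = \frac{1}{\pi_h} = \frac{N_h}{n} \\[4pt]
\text{Energy Bar } (N_h = 56):    &\quad \pi_h = \tfrac{5}{56} \approx 0.089286, \quad w_h = \tfrac{56}{5} = 11.2 \\
\text{Protein Shake } (N_h = 82): &\quad \pi_h = \tfrac{5}{82} \approx 0.060976, \quad w_h = \tfrac{82}{5} = 16.4 \\
\text{Sport Drink } (N_h = 70):   &\quad \pi_h = \tfrac{5}{70} \approx 0.071429, \quad w_h = \tfrac{70}{5} = 14.0
\end{align*}
```

The values are the same in both replicates because each replicate draws five observations from the full stratum.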
Output 9.10c Frequency of CATEGORY in GRANDPRIZES
Example 9.10 GRANDPRIZES Data Set Created by PROC SURVEYSELECT
The FREQ Procedure
category | replicate | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
Energy Bar | 1 | 5 | 16.67 | 5 | 16.67 |
Energy Bar | 2 | 5 | 16.67 | 10 | 33.33 |
Protein Shake | 1 | 5 | 16.67 | 15 | 50.00 |
Protein Shake | 2 | 5 | 16.67 | 20 | 66.67 |
Sport Drink | 1 | 5 | 16.67 | 25 | 83.33 |
Sport Drink | 2 | 5 | 16.67 | 30 | 100.00 |
The following PROC SURVEYSELECT step illustrates stratified random sampling without replacement and with equal allocation. A specific number of observations are selected at random from each stratum in a data set with the condition that an observation can be selected only once.
PROC SURVEYSELECT has multiple options that can customize the sampling of a data set. The STRATA statement names the variable whose values partition the data set into groups that do not overlap.
Input data set JINGLE_WINNERS has 208 observations. The observations contain the names of winners of a jingle contest for three product categories identified by the values of variable CATEGORY. Two rounds of prizes will be awarded with five prizes awarded in each category each time.
Variable CATEGORY has three values. The process of selecting five observations for each value of CATEGORY is performed twice. Therefore, a total of 30 observations (3 categories × 5 observations × 2 replicates) will be selected.
In the following program, the PROC SURVEYSELECT statement options METHOD= and SAMPSIZE=, in conjunction with the STRATA statement, tell the procedure to select five observations at random without replacement from each group that is defined by the three values of variable CATEGORY in data set JINGLE_WINNERS. A third option, OUT=, saves the selected observations in data set GRANDPRIZES. A fourth option, REPS=, tells the procedure to perform the complete selection process twice. A fifth option, SEED=, specifies a seed value for the random number generator so that the sampling is reproducible.
Within one replicate, observations are selected without replacement. Across replicates, however, selection is independent: each replicate draws from all the observations in a stratum. Therefore, it is possible for a winner in the first round to also be a winner in the second round.
The input data set must be sorted by the variables that are named in the STRATA statement before executing PROC SURVEYSELECT. The procedure does have additional statements and options that can be used to tell the procedure to do the sorting. For more information about the sampling methods, statements, and options that allow this, see SAS documentation.
Sort JINGLE_WINNERS by the variable named in the STRATA statement.
Select observations at random.
Within each stratum, select the observations with simple random sampling, which is selection with equal probability and without replacement.
Save the selected observations in data set GRANDPRIZES.
Perform the stratified random sampling process twice.
Select five observations from each stratum that is defined by the values of variable CATEGORY.
Specify an initial seed for random number generation so that the selection is reproducible. If you want SURVEYSELECT to use the clock to initialize the seed, do not specify the SEED= option or specify it as 0 or a negative number.
Name the variable whose values define the groups in which to randomly select the five observations.
proc sort data=jingle_winners;
  by category;
run;

proc surveyselect data=jingle_winners
                  method=srs
                  out=grandprizes
                  reps=2
                  sampsize=5
                  seed=65;
  strata category;
run;
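The sample created by the preceding step can be checked with a few short steps. The source does not show the code behind Outputs 9.10a and 9.10c, so the PROC FREQ requests below are an assumption about how such tables could be produced. The PROC SQL query is an added illustration, not part of the original example, that lists any contestant drawn in both replicates.

```sas
/* Stratum sizes in the input data set (compare with Output 9.10a) */
proc freq data=jingle_winners;
  tables category;
run;

/* Sample counts by stratum and replicate (compare with Output 9.10c) */
proc freq data=grandprizes;
  tables category*replicate / list;
run;

/* List any contestant selected in more than one replicate. */
/* No rows returned means no repeat winners in this sample.  */
proc sql;
  select category, contestant, address,
         count(distinct replicate) as n_replicates
    from grandprizes
    group by category, contestant, address
    having count(distinct replicate) > 1;
quit;
```

In the sample listed in Output 9.10b, no contestant appears in both replicates of a category, so the query would be expected to return no rows for that sample. A different seed could produce repeat winners.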
The following program randomly selects without replacement 10 observations from each BY group. The first five observations selected within a BY group are assigned to replicate 1, and the second five observations are assigned to replicate 2.
The first DATA step assigns a random number to every observation in data set JINGLE_WINNERS. The resulting data set is sorted by CATEGORY and, within each category, by the random numbers. The second DATA step then selects the first 10 observations in each of the three BY groups of the sorted data set.
The preceding PROC SURVEYSELECT step selected observations without replacement only within the replicate. An observation could be selected again in the second replicate. The following program does not select a person more than once across the replicates.
Create data set WINNERS2. Read the observations from JINGLE_WINNERS. Assign to variable RN a random number drawn from the uniform distribution. Specify a value greater than 0 for the seed so that the stream of numbers is reproducible. If you want to produce a stream of numbers that is not reproducible, specify the seed as 0 so that the random number generator initializes the generation process by using the clock time.
Sort by the random numbers within each BY group.
Create the sample data set.
Process WINNERS2 in BY groups.
Tally the number of observations selected with the COUNT variable. Initialize COUNT to 0 at the start of each BY group. Increment the COUNT variable by 1 each time an observation is processed. Assign the first five observations selected to replicate 1. Assign the remaining observations to replicate 2. Save the first 10 observations in each BY group.
data winners2;
  set jingle_winners;
  rn=ranuni(6209193);
run;

proc sort data=winners2;
  by category rn;
run;

data grandprizes;
  set winners2;
  by category;
  drop count rn;
  if first.category then count=0;
  count+1;
  if count le 5 then replicate=1;
  else replicate=2;
  if count le 10 then output;
run;
Output 9.10d displays the results of the preceding program.
Output 9.10d GRANDPRIZES Data Set Created by PROC SORT and DATA Step
Example 9.10 GRANDPRIZES Related Technique Data Set Created by PROC SORT and DATA Step
Obs | contestant | address | category | replicate |
1 | Green KX | Milwaukee, WI 53220 | Energy Bar | 1 |
2 | Moore AH | Twin Brooks, SD 57269 | Energy Bar | 1 |
3 | Miller IC | Vadnais Heights, MN 55127 | Energy Bar | 1 |
4 | Adams OT | Morganfield, KY 42437 | Energy Bar | 1 |
5 | King FT | Rockford, IL 61103 | Energy Bar | 1 |
6 | Rivera JT | Bath, OH 44210 | Energy Bar | 2 |
7 | Davis VH | Ten Mile, TN 37880 | Energy Bar | 2 |
8 | Brown QD | Catlettsburg, KY 41129 | Energy Bar | 2 |
9 | Johnson CJ | Edmonton, KY 42129 | Energy Bar | 2 |
10 | Johnson BT | Alexandria, SD 57311 | Energy Bar | 2 |
11 | Morris LI | Wamego, KS 66547 | Protein Shake | 1 |
12 | Miller SZ | Ferriday, LA 71334 | Protein Shake | 1 |
13 | Sanchez DJ | Geronimo, TX 78115 | Protein Shake | 1 |
14 | Moore GG | San Rafael, CA 94903 | Protein Shake | 1 |
15 | Hernandez GS | St. Louis, MO 63115 | Protein Shake | 1 |
16 | Lee JJ | Garrison, UT 84728 | Protein Shake | 2 |
17 | Carter KC | Antimony, UT 84712 | Protein Shake | 2 |
18 | Hill LH | Sherrill, AR 72152 | Protein Shake | 2 |
19 | Edwards SX | Bosler, WY 82051 | Protein Shake | 2 |
20 | Torres JL | Lacey, WA 98503 | Protein Shake | 2 |
21 | Clark DO | Lake Worth, FL 33466 | Sport Drink | 1 |
22 | Smith UL | Minden, WV 25879 | Sport Drink | 1 |
23 | Flores ZA | Patten, ME 04765 | Sport Drink | 1 |
24 | Garcia IW | Springwater, NY 14560 | Sport Drink | 1 |
25 | Washington FG | Kingston, PA 18704 | Sport Drink | 1 |
26 | Johnson TS | Colts Neck, NJ 07722 | Sport Drink | 2 |
27 | Washington NV | Tampa, FL 33672 | Sport Drink | 2 |
28 | Richardson QY | Sparkill, NY 10976 | Sport Drink | 2 |
29 | Cooper GP | Claudville, VA 24076 | Sport Drink | 2 |
30 | Jones WQ | Milton, WV 25541 | Sport Drink | 2 |
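The first DATA step of the related technique draws its random numbers with the RANUNI function. As a hedged alternative that is not part of the original example, the same step could be sketched with the RAND function and CALL STREAMINIT, which SAS documentation describes as the newer random number interface. The seed value is simply reused from the original program. Because RAND produces a different stream of numbers than RANUNI, the contestants selected would not match Output 9.10d.

```sas
data winners2;
  /* Seed the RAND stream; only the first call in the step takes effect */
  call streaminit(6209193);
  set jingle_winners;
  /* Uniform random number on (0,1), used only as a sort key */
  rn = rand('uniform');
run;
```

The PROC SORT step and the second DATA step would remain unchanged, because they depend only on variable RN holding a random sort key within each value of CATEGORY.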