Create a subset of a data set by randomly choosing observations from the data set. Do not allow an observation to be selected more than once. This method is commonly referred to as simple random sampling without replacement.
Examples 9.9 and 9.10 show different methods of sampling observations from data sets.
Featured Step | PROC SURVEYSELECT |
Featured Step Options and Statements | PROC SURVEYSELECT statement options: METHOD=SRS, OUT=, SAMPSIZE=, SEED= |
Related Technique 1 | DATA step with RANUNI function and PROC SORT |
Related Technique 2 | DATA step with RANUNI function, OUTPUT and STOP statements, SET statement with NOBS= option |
Data set STUDY_POOL contains demographic data on 443 patients with a disease who are eligible to be included in a study. The first 25 observations are shown.
STUDY_POOL (first 25 observations)

    Obs  studyid  dob         dx_age  disease_years  duration      gender  age
      1  PK773Q   10/26/1960      40              9  late          F        49
      2  XU878R   05/06/1961      47              1  early         F        48
      3  FV465N   10/20/1977      23              9  late          F        32
      4  JP518G   04/16/1973      35              1  early         F        36
      5  PV818S   05/25/1974      25             10  late          M        35
      6  PV176X   07/19/1973      35              1  early         M        36
      7  QJ683W   11/23/1957      45              7  late          F        52
      8  QF971X   06/15/1975      33              1  early         M        34
      9  XW767L   05/27/1955      53              1  early         F        54
     10  IB738B   08/18/1962      45              2  intermediate  M        47
     11  DQ570Z   03/02/1978      30              1  early         M        31
     12  SU192H   02/14/1987      20              2  intermediate  M        22
     13  SW758U   03/05/1955      51              3  intermediate  F        54
     14  AW080T   05/04/1962      45              2  intermediate  F        47
     15  QO391M   11/04/1971      33              5  intermediate  F        38
     16  KK571Z   09/08/1962      45              2  intermediate  F        47
     17  GM905F   08/05/1959      45              5  intermediate  F        50
     18  GJ437V   01/12/1986      22              1  early         F        23
     19  OG052C   06/11/1971      37              1  early         M        38
     20  YP992Q   06/30/1972      36              1  early         M        37
     21  UN028V   04/22/1979      20             10  late          F        30
     22  SE317H   01/21/1948      55              6  late          M        61
     23  GZ756K   08/25/1953      55              1  early         M        56
     24  RZ619Q   12/03/1960      48              1  early         F        49
     25  VK658B   11/12/1986      20              3  intermediate  F        23
Output 9.9a SAMPLE25 Data Set

Example 9.9 SAMPLE25 Data Set Created with PROC SURVEYSELECT

    Obs  studyid  dob         dx_age  disease_years  duration      gender  age
      1  IB738B   08/18/1962      45              2  intermediate  M        47
      2  SW758U   03/05/1955      51              3  intermediate  F        54
      3  YP992Q   06/30/1972      36              1  early         M        37
      4  QX762Q   06/17/1948      59              2  intermediate  M        61
      5  MF193R   09/14/1954      52              3  intermediate  M        55
      6  AG642K   01/18/1987      21              1  early         F        22
      7  CH513Z   04/07/1951      57              1  early         M        58
      8  FT520R   11/25/1951      53              5  intermediate  F        58
      9  BT232M   07/11/1985      20              4  intermediate  M        24
     10  LS696T   10/18/1981      26              2  intermediate  F        28
     11  EI188W   06/12/1945      61              3  intermediate  M        64
     12  QN271N   06/22/1985      23              1  early         M        24
     13  QI177T   08/27/1948      60              1  early         M        61
     14  EC092I   02/16/1987      21              1  early         F        22
     15  NR337X   06/25/1955      53              1  early         F        54
     16  AO783K   01/05/1964      40              5  intermediate  F        45
     17  AT949T   08/14/1970      27             12  late          F        39
     18  LQ219I   01/27/1948      59              2  intermediate  F        61
     19  JV329F   05/12/1945      63              1  early         F        64
     20  DU189X   05/02/1969      39              1  early         M        40
     21  IQ901W   08/09/1956      52              1  early         F        53
     22  JB893X   11/17/1947      57              5  intermediate  F        62
     23  YV984F   08/24/1949      57              3  intermediate  M        60
     24  AT193G   05/29/1984      17              8  late          M        25
     25  UU780P   07/12/1945      62              2  intermediate  M        64
The following PROC SURVEYSELECT step illustrates simple random sampling without replacement. A specific number of observations are selected at random from a data set with the condition that an observation can be selected only once.
PROC SURVEYSELECT has many options that customize how a data set is sampled. This example specifies two of them, METHOD= and SAMPSIZE=, to tell the procedure to select 25 observations at random from data set STUDY_POOL without replacement. A third option, OUT=, saves the selected observations in data set SAMPLE25. A fourth option, SEED=, supplies a seed value to the random number generator so that the sampling is reproducible.
Select observations at random from STUDY_POOL. Select the observations by using simple random sampling, which is selection with equal probability and without replacement. Save the selected observations in data set SAMPLE25. Select 25 observations. Specify an initial seed for random number generation so that the selection is reproducible. If you want SURVEYSELECT to use the clock to initialize the seed, either omit the SEED= option or assign it a value of 0 or a negative number.
    proc surveyselect data=study_pool
                      method=srs
                      out=sample25
                      sampsize=25
                      seed=2525;
    run;
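One way to confirm that the step above sampled without replacement is to compare the number of observations in SAMPLE25 with the number of distinct patient identifiers. The following is a minimal sketch, not part of the original example. It assumes that STUDYID uniquely identifies each patient in STUDY_POOL, and the column names N_OBS and N_UNIQUE_IDS are hypothetical.

```sas
/* Sketch: check the sample size and look for repeated patients.    */
/* Assumption: STUDYID is unique for each observation in STUDY_POOL */
proc sql;
  select count(*)                as n_obs,        /* hypothetical name */
         count(distinct studyid) as n_unique_ids  /* hypothetical name */
    from sample25;
quit;
```

Under that assumption, both counts should equal 25. A value of N_UNIQUE_IDS smaller than N_OBS would indicate that a patient appears more than once in the sample.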
The following program randomly selects 25 observations from a data set without replacement. It starts by assigning a random number to every observation in data set STUDY_POOL. The resulting data set is sorted by the random numbers. A DATA step then creates the random sample of 25 observations by selecting the first 25 observations from the data set sorted by the random numbers.
While PROC SURVEYSELECT performs simple random sampling without replacement, you might want to use a DATA step instead if you need to modify the selected observations with programming statements.
The program in Related Technique 2 also uses a DATA step to select 25 observations at random. It does not require sorting the data set, and it reads the data set only once. Although the program in Related Technique 1 is easy to understand, Related Technique 2 might be more efficient, depending on your computer resources, when you need to select a small sample from a very large data set.
Create data set STUDY_POOL2. Read the observations from STUDY_POOL. Assign to variable RANDOM a random number drawn from the uniform distribution. Specify a value greater than 0 for the seed so that the stream of numbers is reproducible. If you want to produce a stream of numbers that is not reproducible, specify the seed as 0 so that the random number generator initializes the stream from the clock time.
Sort the observations by the random numbers.
Create the sample data set. Save the first 25 observations from STUDY_POOL2 in SAMPLE25. Do not save the variable that contains the random numbers.
    data study_pool2;
      set study_pool;
      random=ranuni(54918);
    run;
    proc sort data=study_pool2;
      by random;
    run;
    data sample25;
      set study_pool2(obs=25 drop=random);
    run;
Output 9.9b SAMPLE25 Data Set Created with PROC SORT and DATA Step

Example 9.9 SAMPLE25 Related Technique 1 Data Set Created with PROC SORT and DATA Step

    Obs  studyid  dob         dx_age  disease_years  duration      gender  age
      1  VK162Z   12/31/1963      34             12  late          F        46
      2  YB377F   02/21/1950      58              1  early         M        59
      3  IB557W   11/29/1968      40              1  early         M        41
      4  CS710N   08/10/1960      48              1  early         F        49
      5  WI301T   09/17/1958      40             11  late          M        51
      6  EW595I   02/07/1986      21              2  intermediate  F        23
      7  VZ638C   08/16/1947      61              1  early         M        62
      8  JB893X   11/17/1947      57              5  intermediate  F        62
      9  QL460E   06/13/1973      33              3  intermediate  M        36
     10  BG914M   01/19/1984      24              1  early         F        25
     11  UV157Y   09/10/1967      41              1  early         F        42
     12  RM904X   10/16/1979      28              2  intermediate  M        30
     13  DT459U   01/28/1977      30              2  intermediate  M        32
     14  KP962X   11/04/1970      33              6  late          M        39
     15  VK949Q   11/13/1959      45              5  intermediate  F        50
     16  AY000Y   09/20/1963      42              4  intermediate  F        46
     17  SJ380M   03/23/1968      37              4  intermediate  M        41
     18  KN379R   10/28/1971      35              3  intermediate  M        38
     19  TP259E   03/31/1987      19              3  intermediate  F        22
     20  UW581J   10/24/1958      49              2  intermediate  M        51
     21  QF115M   03/20/1971      29              9  late          F        38
     22  IN550Y   04/01/1983      18              8  late          M        26
     23  CO796N   09/11/1971      37              1  early         M        38
     24  ME619D   09/03/1959      46              4  intermediate  M        50
     25  UL539V   05/30/1971      33              5  intermediate  M        38
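The sort-by-random-key approach in Related Technique 1 does not depend on RANUNI specifically. The sketch below is a variation that is not part of the original example: it draws the sort key with the RAND function, seeded through CALL STREAMINIT. The data set names STUDY_POOL3 and SAMPLE25_RAND are hypothetical, and because the generator differs, the selected observations would not be expected to match Output 9.9b.

```sas
/* Sketch: Related Technique 1 with RAND in place of RANUNI.        */
/* STUDY_POOL3 and SAMPLE25_RAND are hypothetical data set names.   */
data study_pool3;
  /* Seed the RAND stream on the first iteration only */
  if _n_=1 then call streaminit(54918);
  set study_pool;
  random=rand('uniform');   /* uniform random sort key */
run;

proc sort data=study_pool3;
  by random;
run;

data sample25_rand;
  /* Keep the first 25 observations and drop the sort key */
  set study_pool3(obs=25 drop=random);
run;
```

As with RANUNI, a positive seed makes the stream, and therefore the sample, reproducible from run to run.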
The following DATA step reads all the observations from STUDY_POOL and computes a probability conditional on the number of observations that remain in the data set and the number that is needed to complete the sample. For example, if 50 observations remain to be processed in the data set and you need five more in your sample, you select the next observation with probability 5/50.
The DATA step assigns the total number of observations in STUDY_POOL to variable N. Variable K tracks how many observations still need to be selected. When an observation is selected, an assignment statement decrements the value of K by 1. The value of N is decremented by 1 with each iteration of the DATA step. The probability of selecting the observation that is currently being processed is K/N. If the value that is returned by the RANUNI function is less than or equal to that value, the observation is selected and output to SAMPLE25.
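To see how the K/N rule behaves, it can help to trace it on a very small case. The following sketch is an illustration only and does not read STUDY_POOL. It assumes a hypothetical pool of 5 observations from which 2 must be selected, and it writes the counters, the selection probability, and the random draw to the SAS log at each step. The variable names OBS, P, U, and SELECTED are hypothetical.

```sas
/* Sketch: trace the K/N selection rule for a hypothetical pool of 5 */
/* observations with 2 to be selected. Nothing is read or written    */
/* except messages in the SAS log.                                   */
data _null_;
  k=2;   /* observations still needed in the sample                */
  n=5;   /* observations not yet processed, including the current  */
  do obs=1 to 5 while (k>0);
    p=k/n;                 /* probability of selecting this one    */
    u=ranuni(381705);      /* uniform random number between 0 and 1 */
    selected=(u<=p);       /* 1 if selected, 0 otherwise           */
    put obs= n= k= p= u= selected=;
    if selected then k=k-1;
    n=n-1;
  end;
run;
```

The first observation is considered with probability 2/5. Whenever the number still needed equals the number remaining, K/N is 1 and every remaining observation is selected, which is why the method always delivers exactly the requested sample size.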
Create data set SAMPLE25. Read the observations in STUDY_POOL. Assign the total number of observations in STUDY_POOL to temporary variable TOTAL.
Retain the values of variables K and N across iterations of the DATA step. Initialize K to the number of observations to select from STUDY_POOL. On the first iteration of the DATA step, copy the value of temporary variable TOTAL, which holds the total number of observations in STUDY_POOL, to data set variable N. Compute the probability of selecting the observation that is currently being processed by dividing the number of observations that still need to be selected (K) by the number of observations that remain to be processed, including the current observation (N). Obtain a random number from the uniform distribution between 0 and 1, exclusive of both endpoints. If the random number is less than or equal to the probability, select the observation for output. Write the selected observation to the output data set. Decrement the value of K by 1 because one fewer observation remains to be selected from STUDY_POOL.
Decrement the number of observations yet to process in STUDY_POOL by 1 with each iteration of the DATA step after the processing of the current observation has been completed. When all the observations needed in the sample have been selected, stop the DATA step.
    data sample25;
      set study_pool nobs=total;
      drop k n;
      retain k 25 n;
      if _n_=1 then n=total;
      if ranuni(381705) <= k/n then do;
        output;
        k=k-1;
      end;
      n=n-1;
      if k=0 then stop;
    run;
Output 9.9c displays the results of the preceding program.
Output 9.9c SAMPLE25 Data Set Created with DATA Step

Example 9.9 SAMPLE25 Related Technique 2 Data Set Created with DATA Step

    Obs  studyid  dob         dx_age  disease_years  duration      gender  age
      1  PV176X   07/19/1973      35              1  early         M        36
      2  OG052C   06/11/1971      37              1  early         M        38
      3  YP992Q   06/30/1972      36              1  early         M        37
      4  SE317H   01/21/1948      55              6  late          M        61
      5  WH928R   01/01/1951      57              1  early         F        58
      6  GA537T   10/13/1945      62              2  intermediate  F        64
      7  FO764G   08/14/1959      48              2  intermediate  F        50
      8  ZD500L   10/13/1973      35              1  early         M        36
      9  VD401X   08/30/1964      44              1  early         M        45
     10  GA105Z   03/21/1961      45              3  intermediate  F        48
     11  JX876R   09/05/1954      54              1  early         F        55
     12  AO783K   01/05/1964      40              5  intermediate  F        45
     13  SP922W   10/29/1978      20             11  late          F        31
     14  DC238T   09/16/1954      53              2  intermediate  M        55
     15  BF950R   12/14/1953      52              4  intermediate  F        56
     16  WW597U   08/15/1949      59              1  early         F        60
     17  KA204V   01/24/1971      36              2  intermediate  F        38
     18  NM423D   06/04/1954      53              2  intermediate  M        55
     19  WL948D   05/03/1985      21              3  intermediate  M        24
     20  JE523E   02/13/1968      37              4  intermediate  F        41
     21  LC395W   04/26/1968      40              1  early         M        41
     22  GR407P   11/02/1950      53              6  late          F        59
     23  TP462Q   07/11/1956      52              1  early         M        53
     24  VX080R   10/14/1953      52              4  intermediate  M        56
     25  UU780P   07/12/1945      62              2  intermediate  M        64