Chapter 6: INTO

6.1 SELECT Executes Immediately

6.2 Numeric-to-Character Conversion

6.3 SELECTing Multiple Values

6.4 DISTINCT Differences

6.5 How Many Macro Variables?

6.6 Zero Incoming Observations

6.6.1 Programming Challenge #5

6.6.2 Solution

6.7 An Unusual Application: Separated by Else

6.8 Dictionary Tables

6.9 Extremely Long Lists

6.10 Blanks vs. Nulls

If you like CALL SYMPUT, you’ll love the INTO : operator in PROC SQL. It also transfers information from a data set to macro variables. But it possesses unique capabilities, such as creating a series of macro variables rather than just one, and extracting data from a collection of observations into a single macro variable. Sections 6.1 through 6.5 illustrate these basic principles, while subsequent material explores more subtle nuances.

6.1 SELECT Executes Immediately

Because SELECT statements execute immediately, this code works:

%let tot_sales=;

proc sql noprint;

   %do quarter=1 %to 4;

       select sum(sales) into : q_sales

       from full_year where quarter=&quarter; 

       %let tot_sales = &tot_sales &q_sales;

   %end;

quit;

During each iteration of the %DO loop, a SELECT statement populates &Q_SALES. Next, the %LET statement immediately utilizes that value of &Q_SALES. Thus, this section of a macro concatenates the total sales from each quarter into a single macro variable.

6.2 Numeric-to-Character Conversion

Although macro variables hold text strings, INTO : often passes numeric values. This forces SQL to convert numeric values to character. To illustrate, consider this simple example where the incoming data set contains a single observation:

data test;

   pi=3.14;

run;

proc sql noprint; 

   select pi               into : pi           from test; 

   select pi + 2        into : pi_plus_2 from test; 

   select ceil(pi)      into : ceil_pi      from test; 

   select mean(pi)   into : mean_pi   from test; 

quit;

These statements display the results:

%put**&pi**;                 **    3.14**

%put**&pi_plus_2**;     **    5.14**

%put**&ceil_pi**;          **         4**

%put**&mean_pi**;       **    3.14**

Numeric-to-character conversion takes place even though there is no such message on the log. And PROC SQL makes the conversion using the best8 format, which is different from both CALL SYMPUT and CALL SYMPUTX (best12 format).

6.3 SELECTing Multiple Values

Many INTO : applications select a set of values, rather than a single value. There are two approaches for these cases. Either:

•    Create just one macro variable, stringing together information from many incoming observations, or

•    Create a set of macro variables

The following example creates a single macro variable. It takes all distinct values for STATE, trims off trailing blanks, puts double quotes around them, and inserts commas between:

proc sql noprint;

     select distinct("trim(state)") separated by ', ' 

     into : state_list 

     from popfile;

quit;

One possible result:

"New Jersey", "New York", "Oregon"

This result might be inserted into a later IN operator:

where state in (&state_list);

The INTO : operator can create a set of macro variables as well:

data id_list;

   do idvar=20 to 1 by -1;

      output;

   end;

run;

proc sql noprint; 

   select idvar into : v1 - : v20 from id_list;

   %put **&v1**;     **20**

   %put **&v2**;     **19**

   %put **&v19**;   **2**

   %put **&v20**;   **1**

quit;

INTO : still transforms numeric values into character strings. So where did the leading blanks go? When creating a set of macro variables, the leading blanks are automatically removed. In the unlikely event that the program must preserve these leading blanks, add the NOTRIM option:

proc sql noprint; 

   select idvar into : v1 - : v20 NOTRIM from id_list; 

   %put **&v1**;     **      20**

   %put **&v2**;     **      19**

   %put **&v19**;   **       2**

   %put **&v20**;   **       1**

quit;

However, be sure to add NOTRIM in the right place. The next variation removes the leading blanks, without generating an error message:

proc sql noprint; 

   select idvar into : v1 - : v20 from id_list NOTRIM; 

   %put **&v1**;     **20**

   %put **&v2**;     **19**

   %put **&v19**;   **2**

   %put **&v20**;   **1**

quit;

6.4 DISTINCT Differences

When SELECT and SELECT DISTINCT create a set of macro variables, the order may be different:

proc sql noprint; 

   select idvar into : v1 - : v20 from id_list;

   %put **&v1**;     **20**

   %put **&v2**;     **19**

   %put **&v19**;   **2**

   %put **&v20**;   **1**

   select distinct(idvar) into : v1 - : v20 from id_list; 

   %put **&v1**;     **1**

   %put **&v2**;     **2**

   %put **&v19**;   **19**

   %put **&v20**;   **20**

quit;

The first SELECT statement copies the first 20 values of IDVAR into 20 macro variables. But the second SELECT statement has to hold those values aside so it can determine which values are distinct. As a result:

•    SELECT DISTINCT assigns values in ascending order

•    SELECT assigns values according to their order in the data

6.5 How Many Macro Variables?

What if the number of macro variables does not match the number of data values that PROC SQL selects?

proc sql noprint;

     select distinct(trim(state)) into : state1 - : state80

     from popfile;

quit;

This program creates as many as 80 macro variables (&STATE1 through &STATE80), depending on the number of STATE values in the incoming data. When the incoming data set contains more than 80 states, only 80 get assigned to macro variables. The rest get ignored. When the incoming data set contains fewer than 80 states, the program creates just the right number of macro variables. (It does not create extra macro variables with null values.) How do you know the number of macro variables that were created? You could try to calculate it. For example, the program could add this statement:

select count(distinct(state)) into : n_states from popfile;

But this approach has its drawbacks. It must process the data. And the number might be more than 80, even though the number of macro variables is limited to 80. Regardless, this statement is absolutely unnecessary. The software counts for you, using the original SELECT statement. An automatic variable, &SQLOBS, contains the number of macro variables created. So the very next statement inside a macro could be:

%do i=1 %to &sqlobs;

Or, simply:

%let n_states = &sqlobs;

6.6 Zero Incoming Observations

A final pitfall: zero incoming observations. This code is trickier than it looks:

proc sql noprint;

   select count(*) into : pop_count

          from population; 

   select distinct(trim(state)) 

          into : state_list

          separated by ' '

          from population;

quit; 

When the incoming data set POPULATION contains zero observations, the two SELECT statements behave differently. The first SELECT statement accurately assigns 0 as the value for &POP_COUNT. But the second SELECT statement does not execute at all. So if &STATE_LIST already has a value (whether null or not), that value remains unchanged. Statistical functions such as COUNT always execute. But data extraction does not execute when the source data set contains zero observations.

6.6.1 Programming Challenge #5

Where does this program go wrong as it attempts to create two macro variables?

proc sql noprint;

     select count(distinct(state)) into : n_states

     from popfile; 

     select distinct(trim(state)) into : all&n_states

     separated by ' ' from popfile;

quit;

Ultimately the intent is to create a single macro variable holding a list of all STATE values in the incoming data. However, the name of the macro variable should reflect the number of STATEs found. For example, if there were 55 STATEs, the macro variable should be named &ALL55.

One feature might appear suspect: creating &N_STATES in one statement and referring to it in the next statement. This is actually not a problem. As shown in Section 6.1, PROC SQL executes its SELECT statements immediately and automatically. The first SELECT statement runs before the second begins. So &N_STATES exists in time. Take a minute to look for the problem, before reading on.

6.6.2 Solution

The problem lies in the value assigned to &N_STATES. A numeric-to-character conversion uses the best8 format to create &N_STATES, assigning it (for example) a value of six blanks followed by 55. So the second SELECT statement sees:

   select distinct(trim(state)) into : all      55

   separated by ' ' from popfile;

Clearly, this is not a legitimate SELECT statement. Here is one viable fix:

proc sql noprint;

     select count(distinct(state)) into : n_states

     from popfile; 

     %let n_states = &n_states;

     select distinct(trim(state)) into : all&n_states

     separated by ' ' from popfile;

quit;

The %LET statement ignores any leading and trailing blanks and changes &N_STATES to the digits only. Alternatively, ignore &N_STATES and use &SQLOBS instead. The software automatically removes any leading or trailing blanks when assigning a value to &SQLOBS:

proc sql noprint;

     select count(distinct(state)) into : n_states

     from popfile; 

     select distinct(trim(state)) into : all&sqlobs

     separated by ' ' from popfile;

quit;

6.7 An Unusual Application: Separated by Else

Let’s revisit a problem from Section 3.3. The data set CUTOFFS holds key information (and is in order by descending SAT_CUTOFF):

Obs

SAT_cutoff

Group_name

1

1200

Honor students

2

900

Regular track

3

300

Challenged

Macro language must use these data values to construct the following program:

data student_groups;

     set all_students;

     length group $ 14;

     if score >= 1200 then group='Honor students';

     else if score >= 900 then group='Regular track';

     else if score >= 300 then group='Challenged';

run;

Believe it or not, PROC SQL can help! Here, it assembles all the IF / THEN / ELSE statements into a single macro variable:

proc sql noprint;

   select 'if score >= ' || put(sat_cutoff, 4.) ||

          " then group='" || trim(group_name) || "';"

          into : logic

          separated by 'else '

          from cutoffs;

run;

data student_groups;

     set all_students;

     length group $ 14;

     &logic

run;

The macro variable &LOGIC contains the entire set of IF / THEN / ELSE statements making this a viable solution to the original problem.

6.8 Dictionary Tables

Explaining dictionary tables is beyond the scope of this book. On the other hand, SQL is the only SAS procedure that can access dictionary tables, and a fair number of applications that use INTO : will read from dictionary tables. As a compromise, then, this section illustrates a few dictionary table applications. But it will tread lightly on the syntax, focusing more on the objectives and strategies.

The first example captures the name of every SAS data set in a library. The macro below would need one SELECT statement or the other (not both). The intent is to create a single macro variable holding the names of every SAS data set. The only parameter is the name assigned by an earlier LIBNAME statement:

%macro all_datasets (library=);

   %global ds_list;

   proc sql noprint; 

      select trim(memname) into : ds_list

         separated by ' '

         from dictionary.tables 

         where libname="%upcase(&library)"; 

      select distinct(trim(memname)) into : ds_list

         separated by ' '

         from dictionary.columns

         where libname="%upcase(&library)";

   quit;

%mend all_datasets;

All dictionary tables hold information about the current SAS program or session. DICTIONARY.TABLES contains one observation for each SAS data set, but DICTIONARY.COLUMNS contains one observation for each variable. Therefore, by reading one observation per variable, the second SELECT statement must add DISTINCT (and takes longer to execute as well). If necessary, this macro could examine the variable MEMTYPE to distinguish between data sets, views, and catalogs. That level of complexity extends beyond the focus of this section.

Why do this? What good is a list of all data sets in a library? The list helps when the objective is to process every data set with the same logic. Data cleaning would be a typical application. Here are some steps that might apply to every single data set:

•    Find any variables that are always missing.

•    Find any variables that have a nonmissing value but never change.

•    Get a brief report showing the distribution for each variable.

•    Find any character variables that contain leading blanks.

To address the first item on this list (locating variables that are always missing), the following approach would work for a single data set:

proc format;

   value there  low-high='There'

                other='Missing';

   value $there ' '='Missing'

                other='There';

run;

proc freq data=suspect.dataset;

   tables _all_ / missing;

   format _numeric_ there.

          _character_ $there.;

run;

Note that the FORMAT statement requires each data set to contain at least one numeric variable and at least one character variable. Overcoming that limitation is beyond the scope of this section.

Macro enhancements could embellish the results in a number of ways:

•    Keep these tables in the .lst file but produce a separate output file listing just those variables that are always missing.

•    Expand the process to cover all data sets in the library, adding titles to indicate which data set is the source for each set of tables.

•    Add a cutoff parameter. When using a parameter value of 0.98, the macro would report variables that are missing more than 98% of the time.

One final example of using dictionary tables concerns the variable PATIENT_ID. A folder holds many SAS data sets, some of which contain PATIENT_ID. A program must locate which data sets contain PATIENT_ID and then check each one to find any PATIENT_IDs that do not appear on a master list. (A similar, practical objective not addressed here: Verify that PATIENT_ID has the same length, type, label, and format in all data sets.)

The initial objective is to locate all SAS data sets that contain PATIENT_ID and place that list of data set names into a macro variable. DICTIONARY.COLUMNS already identifies every variable in every SAS data set. The trick is to select the proper subset:

•    Data sets in just one library

•    Variables named PATIENT_ID

A macro might obtain that subset like this:

proc sql noprint; 

   select trim(memname) 

          into : datasets_with_patient_id

          separated by ' '

   from dictionary.columns 

        where libname="%upcase(&library)"

          and upcase(name)= 'PATIENT_ID'; 

quit;

Next, process each data set to find patient IDs not on the master list. Because PROC SQL populates &SQLOBS, the macro definition might continue with:

%do i=1 %to &sqlobs;

    %let next_dataset = %scan(&datasets_with_patient_id, &i, %str( ));

    proc sort data=&library..&next_dataset (keep=patient_id) 

              out=patients NODUPKEY;

       by patient_id; 

    run;

    data nomatch;

       merge patients

             master_list (in=in_master);

       by patient_id;

       if in_master=0;

    run;

    proc print data=nomatch;

       var patient_id;

       title "Patient IDs in &next_dataset but NOT in Master_List";

    run;

%end;

Each time the loop iterates, the %LET statement selects the name of the next SAS data set that contains PATIENT_ID. From that point, it is straightforward to sort and merge with the master list and to identify the mismatches.

6.9 Extremely Long Lists

Rarely, a list of names becomes so long that it will not fit into a macro variable. Consider this program, for example:

proc sql noprint; 

   select trim(name) 

          into : unformatted_numerics

          separated by ' '

   from dictionary.columns 

        where libname="%upcase(&library)"

          and upcase(memname)="%upcase(&memname)"

          and format=" " and type="num"; 

quit;

The intent is to extract the names of all unformatted numeric variables from a single SAS data set. Normally, this works smoothly. But data sets exist with tens of thousands of numeric variables. It is possible that the selected string will exceed the longest possible length for a macro variable. What can be done about that?

Let's take a simple case where the objective is to simply subset the variables. The intention would have been to add:

data nums;

   set &library..&memname (keep=&unformatted_numerics);

run;

While the documentation states that the MVARSIZE option can increase the maximum length of a macro variable, the maximum value for MVARSIZE is 64,534 bytes. PROC SQL can still extract the names of all the unformatted numerics, but those names may be too long to fit into a single macro variable. Instead, change the SELECT statement slightly to create an output table:

proc sql noprint; 

   create table unformatted_numerics as 

   select name 

   from dictionary.columns 

        where libname="%upcase(&library)"

          and upcase(memname)="%upcase(&memname)"

          and format=" " and type="num"; 

quit;

Then use CALL EXECUTE to construct the final program:

data _null_;

   call execute("data nums; set &library..&memname (keep=");

   do until (done);

      set unformatted_numerics end=done;

      call execute(name);

   end;

   call execute ("); run;"); 

   stop;

run;

Notice how the STOP statement is necessary here. By halting the DATA step, it prevents execution of the first CALL EXECUTE an extra time.

You may never encounter a data set with 20,000 variables. But if you do, this workaround can become a vital tool.

6.10 Blanks vs. Nulls

These statements all generate the same results:

if state=' ';

if state='     ';

if state='';

In most cases, the number of blanks between quotes does not matter. However, the opposite is true for SEPARATED BY. The SELECT clauses below generate different results:

proc sql noprint; 

   select distinct(trim(state)) 

          into : one_blank

          separated by ' '

   from popfile; 

   select distinct(trim(state)) 

          into : three_blanks

          separated by '   '

   from popfile; 

   select distinct(trim(state)) 

          into : one_null

          separated by ''

   from popfile;

quit;

A series of %PUT statements illustrates the difference:

%put **&one_blank**;        **Alaska Maine Vermont**

%put **&three_blanks**;    **Alaska   Maine   Vermont**

%put **&one_null**;           **AlaskaMaineVermont**

While it is unusual to need a null separator, it is possible. Section 8.5 contains a realistic example.

INTO : applications cover a tremendously broad spectrum. Whenever the need to extract data overlaps with the need to create macro variables, INTO : comes into play. Expect INTO : to be your ticket OUT OF many programming problems.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.146.178.165