Chapter 2: Introduction to the DS2 Language

2.1 Introduction

2.2 DS2 Programming Basics

2.2.1 General Considerations

2.2.2 Program Structure

2.2.3 Procedure Options and Global Statements

2.2.4 Program Blocks

2.2.5 Methods

2.2.6 User-Defined Methods

2.2.7 Variable Identifiers and Scope

2.2.8 Data Program Execution

2.3 Converting a SAS DATA Step to a DS2 Data Program

2.3.1 A Traditional SAS DATA Step

2.3.2 Considerations

2.3.3 The Equivalent DS2 Data Program

2.3.4 More Complex Data Program Processing

2.3.5 Automatic Conversion with PROC DSTODS2

2.4 Review of Key Concepts

2.1 Introduction

In this chapter, we will describe the basic components and construction of DS2 programs. Along the way, we’ll note similarities and differences between DS2 data programs and traditional Base SAS DATA steps. We’ll also convert an existing DATA step to a DS2 data program and execute our first DS2 program using PROC DS2.

2.2 DS2 Programming Basics

2.2.1 General Considerations

I like to describe DS2 as a next-generation language that combines the flexibility, control, and power of DATA step programming; the rich data palette of ANSI SQL; and the benefits of object-based code modularity. At first glance, the DS2 language is comfortingly similar to the DATA step. It is fundamentally a high-level imperative, procedural language that is designed for manipulating rectangular data sets and that includes features for working with arrays, hash objects, and matrices. Like the DATA step, most DS2 data program statements begin with a keyword, and all statements end with a semicolon. However, there are significant differences, highlighted in Table 2.1.

Table 2.1: DATA Step versus DS2

DATA Step | DS2
There are almost no reserved words. | All keywords are reserved words.
Data rows are processed individually and sequentially in a single compute thread. | Several data rows can be processed in parallel, using multiple concurrent compute threads.
All variables referenced in a DATA step are global in scope. | Variables referenced in a data program can be global or local in scope.
All variables referenced in a DATA step are in the program data vector (PDV) and will become part of the result set unless explicitly dropped. | Variables with local scope are not added to the PDV and are never part of the result set.
Creating reusable code with variable encapsulation requires the use of a separate procedure, PROC FCMP, which has its own syntax. | Reusable code modules with variable encapsulation are possible using standard PROC DS2 syntax in a package program.
The DATA step can consume a table produced by an SQL query as input to the SET statement. | DS2 can directly accept the result set of an SQL query as input to the SET statement.
The DATA step can process only double-precision numeric or fixed-width character data. DBMS ANSI SQL data types must be converted to one of these data types before processing can occur. | DS2 processes most ANSI SQL data types in their native format at full precision.
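To make the SQL row of the table concrete, here is a minimal sketch of a data program whose SET statement reads directly from a query enclosed in braces. It assumes the sas_data.banks table used later in this chapter is available; the query itself is only an illustration.

```sas
proc ds2;
/* No output table named, so the result set is rendered as a report */
data;
   method run();
      /* The query in braces supplies the input rows for the RUN loop */
      set {select * from sas_data.banks};
   end;
enddata;
run;
quit;
```

In a traditional DATA step, the query would first have to be materialized as a table or view (for example, with PROC SQL) before the SET statement could read it.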

2.2.2 Program Structure

A quick comparison of a DATA step and the equivalent DS2 data program clearly shows that the languages are closely related, but that DS2 data programs are more rigidly structured:

data _null_;

   Message='Hello World!';

   put Message=;

run;

proc ds2;

data _null_;

   method run();

      Message='Hello World!';

      put Message=;

   end;

enddata;

run;

quit;

The primary structural difference is that DS2 programs are written in code blocks. In Base SAS, the DS2 language is invoked with a PROC DS2 block, which begins with a PROC DS2 statement and ends with a QUIT statement:

proc ds2;

   <ds2 program blocks>

quit;

Within a PROC DS2 block, you can define and execute three fundamental types of program blocks, which are described in Table 2.2.

Table 2.2: DS2 Program Blocks

Program Block | Brief Description
Data | The heart of the DS2 language, data programs manipulate input data sets to produce output result sets. They can accept input from tables, thread program result sets, or SQL query result sets.
Package | Package programs create collections of variables and methods stored in SAS libraries, enabling an object-oriented approach to development. Easy and effective reuse of proven code modules can ensure standardization of important proprietary processes, decrease time required to write new programs, and improve overall code quality.
Thread | Thread programs manipulate input data sets to produce output result sets that are returned to a data program. Used to simultaneously process several rows of data in parallel, threads can accept input from tables or SQL queries.

A more detailed description of each of these program blocks is provided in Section 2.2.4. Each program block is delimited by the appropriate DATA, PACKAGE, or THREAD statement and the corresponding ENDDATA, ENDPACKAGE, or ENDTHREAD statement. DS2 uses RUN-group processing and requires an explicitly coded RUN statement to cause the preceding program block to execute. A program block may be preceded by a global DS2 statement to modify the subsequent program block’s behavior:

proc ds2;

   <global DS2 statement(s)>

   package package_name;

      <DS2 programming statements to create the package here>

   endpackage;

   run;

   <global DS2 statement(s)>

   thread thread_name;

      <DS2 programming statements to create the thread here>

   endthread;

   run;

   <global DS2 statement(s)>

   data output_dataset_name;

      <DS2 programming statements to process data here>

   enddata;

   run;

quit;

Each program block consists of a combination of global declarative statements, followed by one or more uniquely named executable method blocks. In DS2, executable statements are valid only in the context of a method block. Method blocks are delimited by METHOD and END statements:

proc ds2;

   <global DS2 statement(s)>

   data output_dataset_name;

      <global declarative statements>

      method method_name(<method parameters>);

         <local variable declarations>

         <executable DS2 programming statements>

      end;

   enddata;

   run;

quit;

2.2.3 Procedure Options and Global Statements

The DS2 procedure includes procedure options. Procedure options are listed in the PROC DS2 statement and affect how the procedure operates. Examples of DS2 procedure options include ANSIMODE, DS2ACCEL, and NOLIBS. These options are discussed in detail later in the book, when the context will best demonstrate their usefulness.

In addition, the DS2 language includes a few global statements. Global statements are valid only outside of a program block, execute immediately, and either immediately perform a specific action or modify the behavior of the DATA, THREAD, or PACKAGE program block immediately following the global statement. Examples of DS2 global statements include DROP PACKAGE, DROP THREAD, and DS2_OPTIONS.

Let’s start with a discussion of how PROC DS2 makes data connections and how we can exercise some control over the process with the DS2 procedure options NOLIBS and LIBS=.

2.2.3.1 DS2 Procedure Options NOLIBS and LIBS=

When PROC DS2 is invoked, the default behavior is to scan the SAS session metadata for any LIBREFs currently available and set up connections to those libraries using the appropriate DS2 driver. If your SAS session has a lot of LIBREFs active, it might take more time than you would like to make all of those connections.

DS2 drivers can’t access concatenated LIBREFs. These LIBREFs refer to a collection of SAS data files located in more than one physical location. SASHELP is a good example of a concatenated LIBREF. Trying to access a data set in a concatenated library in a DS2 program results in an error:

ERROR: BASE driver, schema name SASHELP was not found for this

       connection

To access data from these libraries in DS2, you could locate the individual physical location where the data set of interest resides and set up a separate LIBREF pointing to that location. Another option would be to use the NOLIBS option and write your own connection string. This gives you precise control over what data connections the DS2 procedure makes. It can also speed up the DS2 invocation process when a large number of LIBREFs are pre-defined in the SAS session. As connection strings were not well documented before SAS 9.4M5, the easiest way to determine how to write one was to make DS2 show you the connection strings it automatically generated using the SAS system option MSGLEVEL=I:

options msglevel=i;

proc ds2;

quit;

The SAS log shows the connection strings generated by PROC DS2:

NOTE: Connection string:

NOTE: DRIVER=DS2;CONOPTS=(

      DRIVER=FEDSQL;CONOPTS=(

(DRIVER=BASE;CATALOG=SAS_DATA; SCHEMA=(NAME=SAS_DATA;

  PRIMARYPATH={Z:ds2_wranglerdata})); (DRIVER=BASE;CATALOG=MAPS;SCHEMA=(NAME=MAPS;

  PRIMARYPATH={C:Program FilesSASHomeSASFoundation9.4maps}));

(DRIVER=BASE;CATALOG=MAPSSAS;SCHEMA=(NAME=MAPSSAS;

  PRIMARYPATH={C:Program FilesSASHomeSASFoundation9.4maps})); (DRIVER=BASE;CATALOG=MAPSGFK;SCHEMA=(NAME=MAPSGFK;

 PRIMARYPATH={C:Program FilesSASHomeSASFoundation9.4mapsgfk}));

(DRIVER=BASE;CATALOG=SASUSER;SCHEMA=(NAME=SASUSER;

  PRIMARYPATH={C:UsersmyIDDocumentsMy SAS Files9.4}));

(DRIVER=BASE;CATALOG=WORK;SCHEMA=(NAME=WORK;

  PRIMARYPATH={<path to your work library>}))))

Wow! That’s a bit intimidating, even after reformatting the log output to make it easier to read. Note that the connection information is supplied as a single text string—hence the name “connection string.” A connection string consists of three tokens: a DRIVER token specifying which DS2 driver to use to access the data, a CATALOG token specifying the LIBREF that will be used to access the data, and a SCHEMA token specifying the catalog name and the physical path to be associated with it.

Let’s steal the connection string for the SAS_DATA LIBREF and try to invoke DS2 using only that library:

proc ds2 NOLIBS conn="(

         DRIVER=BASE;CATALOG=SAS_DATA;     

         SCHEMA=(NAME=SAS_DATA;PRIMARYPATH={&pathdata}))";

quit;

The SAS log shows that this was successful:

NOTE: Connection string:

NOTE: DRIVER=DS2;CONOPTS= (DRIVER=FEDSQL;CONOPTS=

    ((DRIVER=base;CATALOG=SAS_DATA;

      SCHEMA=(NAME=SAS_DATA;PRIMARYPATH={Z:ds2_wranglerdata}))))

NOTE: Current catalog set to SAS_DATA

That worked! But what if I wanted to access more than one (but not all) of the LIBREFs in my PROC DS2 session? The original SAS log generated by the MSGLEVEL=I showed a connection string with connections to more than one LIBREF. Let’s try to add the libref WORK to my PROC DS2 session:

proc ds2 NOLIBS conn="(

         DRIVER=BASE;CATALOG=SAS_DATA;

         SCHEMA=(NAME=SAS_DATA;PRIMARYPATH={&pathdata});

         DRIVER=BASE;CATALOG=WORK;

         SCHEMA=(NAME=WORK;

         PRIMARYPATH={<path to work library>}))";

quit;

The SAS log shows that this was not successful:

ERROR: Connection string is malformed.  Multiple CATALOG= specifications occur in the same scope:

       Catalog SAS_DATA and catalog WORK.

ERROR: PROC DS2 initialization failed.

Well! Rather than digging around for more information, let’s just try stealing the whole connection string syntax from the original log, and delete the LIBREFS that we don’t want:

proc ds2 nolibs conn="DRIVER=DS2;CONOPTS=(DRIVER=FEDSQL;CONOPTS=(

        (DRIVER=BASE;CATALOG=SAS_DATA;

        SCHEMA=(NAME=SAS_DATA;PRIMARYPATH={D:ds2_wrangler_2nddata}));

       (DRIVER=BASE;CATALOG=WORK;

        SCHEMA=(NAME=WORK;PRIMARYPATH={<path to work library>}))))";

quit;

The SAS log shows that this was successful:

NOTE: Connection string:

NOTE: DRIVER=DS2;CONOPTS=(DRIVER=FEDSQL;CONOPTS=((

      (DRIVER=BASE;CATALOG=SAS_DATA;

        SCHEMA=(NAME=SAS_DATA;PRIMARYPATH={D:ds2_wrangler_2nddata}));

       (DRIVER=BASE;CATALOG=WORK;

        SCHEMA=(NAME=WORK;PRIMARYPATH={<path to work library>}))))

NOTE: Current catalog set to SAS_DATA

The online documentation for PROC DS2 contains extensive documentation for building connection strings, including connection strings for accessing data using drivers for data sources other than Base SAS data sets. But if all of this seems a bit complex for day-to-day use, rejoice! The SAS 9.4M4 release brings us the LIBS= option.

With the LIBS= option, instead of telling PROC DS2 to disregard all LIBREFs in the system and writing our own connection strings, we can tell PROC DS2 what LIBREFs we want to use in this DS2 session, and let the procedure write the connection strings for us.

proc ds2 LIBS=(SAS_DATA WORK);

quit;

The SAS log shows that this was successful:

NOTE: Connection string:

NOTE: DRIVER=DS2;CONOPTS=(DRIVER=FEDSQL;CONOPTS=(

     (DRIVER=BASE;CATALOG=SAS_DATA;

      SCHEMA=(NAME=SAS_DATA;PRIMARYPATH={D:ds2_wrangler_2nddata}));

     (DRIVER=BASE;CATALOG=WORK;

      SCHEMA=(NAME=WORK;PRIMARYPATH={<path to work library>}))))

I’m excited about this new option—it’s going to make connection control in DS2 so much easier!

2.2.3.2 DROP PACKAGE Statement

The DROP PACKAGE statement deletes a package from a SAS library. For example, the following program creates a package named work.mypackage then deletes the package when it is no longer required. Note that the DROP PACKAGE statement requires a RUN statement to execute:

proc ds2;

/*create the package*/

package work.mypackage;

   method mymethod();

      /* More DS2 Statements */

   end;

endpackage;

run;

/*delete the package*/

drop package work.mypackage;

run;

quit;

2.2.3.3 DROP THREAD Statement

The DROP THREAD statement deletes a thread from a SAS library. For example, the following program creates a thread named work.mythread then deletes the thread when it is no longer required. Note that the DROP THREAD statement also requires a RUN statement to execute:

proc ds2;

/*create the thread*/

thread work.mythread;

   dcl double num;

   method run();

      /* More DS2 Statements */

   end;

endthread;

run;

/*delete the thread*/

drop thread work.mythread;

run;

quit;

2.2.3.4 DS2_OPTIONS Statement

The DS2_OPTIONS statement modifies the behavior of the program block that immediately follows the DS2_OPTIONS statement. After execution of the subsequent DATA, PACKAGE, or THREAD program block, behavior of subsequent program blocks reverts to the default behavior unless another DS2_OPTIONS statement is issued.

Table 2.3: DS2_OPTIONS Statement Options

Options | Brief Description
DIVBYZERO=ERROR|IGNORE | When division by zero occurs, DS2 generates an unknown value, writes an error to the SAS log, and stops processing. Setting this option to IGNORE suppresses the error condition and SAS log entry, allowing processing to continue.
MISSING_NOTE | By default, an error message is written to the SAS log when an invalid function argument generates a missing value. This option produces a NOTE in the SAS log instead of an error message.
SAS | When DS2 is operating in ANSIMODE, unknown values are not represented by a SAS missing value; instead, these values are NULL. The SAS option overrides ANSIMODE and causes unknown values to be processed as SAS missing values instead of NULLs.
SCOND=ERROR|WARNING|NOTE|NONE | The level of message displayed when undeclared variables are encountered in a DS2 program is controlled by the SAS system option DS2SCOND= or the DS2 procedure option SCOND=. This DS2_OPTIONS statement option overrides those settings for the next program block. ERROR: writes an error to the SAS log, and the program fails to compile. WARNING: writes a warning to the SAS log. NOTE: writes a note to the SAS log. NONE: ignores undeclared variables.
TRACE | Valid only when using SAS In-Database Code Accelerator, this option provides a lot of information on how the statements were executed in the database. Output is most useful when working with SAS Technical Support.
TYPEWARN | There are so many data types in DS2 that implicit data type conversions are the norm. To avoid adverse performance impacts from writing a large number of messages to the SAS log, implicit data type conversion in DS2 occurs without a note. This can make troubleshooting an inadvertent loss of numeric precision due to implicit conversion a bit difficult. TYPEWARN prints a warning in the SAS log to aid in troubleshooting whenever the code causes an implicit data type conversion.

Here is an example of a DS2_OPTIONS statement in context in a DS2 program:

proc ds2;

ds2_options TYPEWARN;

data work.mydata;

  /* This is a pseudocode sample*/

enddata;

run;

quit;
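As a further illustration, here is a small sketch of the DIVBYZERO=IGNORE option from the table applied to a single data program. The variable names n, d, and q are made up for this example, and the exact log output is not shown because it may vary with your settings.

```sas
proc ds2;
/* Applies only to the data program that immediately follows */
ds2_options divbyzero=ignore;
data _null_;
   method init();
      dcl double n d q;   /* hypothetical local variables */
      n=10;
      d=0;
      q=n/d;              /* division by zero: no error is raised with IGNORE */
      put q=;
   end;
enddata;
run;
quit;
```

Removing the DS2_OPTIONS statement and rerunning the program is a quick way to see the default DIVBYZERO=ERROR behavior for comparison.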

2.2.4 Program Blocks

A brief description of each of the three program blocks is provided here to help you interpret the simple programs included in this chapter. Most of this book is dedicated to the data program. Package programs are discussed in detail in Chapter 5, and thread programs are discussed in Chapter 6.

2.2.4.1 DS2 Data Programs

A DS2 data program begins with a DATA statement, ends with an ENDDATA statement, includes at least one system method definition, and can generate a result set. It is the fundamental programming tool in the DS2 language. As in a Base SAS DATA step, the DS2 data program DATA statement normally lists the name(s) of the table(s) to which the result set will be written. Using the special table name _NULL_ to suppress the result set is optional. If no destination table is named in a Base SAS DATA step, SAS directs the result set to the WORK library, using an automatically generated data set name (DATA1, DATA2, and so on). A DS2 data program without a destination table name sends its result set to the Output Delivery System (ODS) for rendering as a report, much like an SQL query.

data;

   set sas_data.banks;

run;

proc ds2;

data;

   method run();

      set sas_data.banks;

   end;

enddata;

run;

quit;

The SAS log for the traditional DATA step indicates that the result set was written to a data set named DATA1 in the WORK library:

NOTE: There were 3 observations read from the data set SAS_DATA.BANKS.

NOTE: The data set WORK.DATA1 has 3 observations and 2 variables.

The output from the DS2 data program appears in the Results tab instead, as shown in Figure 2.1.

Figure 2.1: Output of the DS2 Data Program


2.2.4.2 DS2 Package Programs

A DS2 package program begins with a PACKAGE statement, ends with an ENDPACKAGE statement, and generates a package as a result. DS2 packages are used to store reusable code, including user-defined methods and variables. Packages are stored in SAS libraries and look like data sets. However, the contents of the package are merely a couple of rows of clear text header information followed by more rows containing encrypted source code. Packages make creating and sharing platform-independent reusable code modules easy and secure, and they provide an excellent means for users to extend the capabilities of the DS2 language.

Packages can be used for more than just sharing user-defined methods—they are the “objects” of the DS2 programming language. Global package variables (variables declared outside the package methods) are private to the package instance and can act as state variables. Each time you instantiate a package, the instance has a set of private global variables accessible by any method within the package instance and can use those variables to keep track of its state. Packages can also accept constructor arguments to initialize the package when it is instantiated and can include custom destructor methods. DS2 packages enable SAS users to easily create and reuse objects in their DS2 programs. DS2 packages are covered in detail in Chapters 4 and 5 of this book.
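The following sketch shows the idea of package instances carrying their own state. The package name work.mycounter, its increment method, and the variable names are all hypothetical, chosen only for illustration. The constructor is a method with the same name as the package.

```sas
proc ds2;
/* Hypothetical package: a counter that remembers its own state */
package work.mycounter / overwrite=yes;
   dcl int n;   /* global package variable, private to each instance */

   /* Constructor: same name as the package, receives the starting value */
   method mycounter(int start);
      n=start;
   end;

   /* Adds 1 to this instance's counter and returns the new value */
   method increment() returns int;
      n=n+1;
      return n;
   end;
endpackage;
run;

data _null_;
   /* Two instances, each with its own private copy of n */
   dcl package work.mycounter a(0);
   dcl package work.mycounter b(100);
   dcl int va vb;
   method init();
      va=a.increment();
      va=a.increment();
      vb=b.increment();
      put va= vb=;
   end;
enddata;
run;
quit;
```

Because each instance keeps its own n, incrementing instance a should have no effect on the value tracked by instance b.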

2.2.4.3 DS2 Thread Programs

A DS2 thread program begins with a THREAD statement, ends with an ENDTHREAD statement, and generates a thread as a result. Much like DS2 packages, threads are stored in SAS libraries as data sets and their contents consist of clear text header information followed by encrypted source code. Threads are structured much like a DS2 data program in that they contain at least one system method definition and can include package references and user-defined methods.  

Once a thread is created, it can be executed from a DS2 data program using the SET FROM statement. The THREADS= option in the SET FROM statement enables several copies of the thread program to run in parallel on the SAS compute platform for easy parallel processing, with each thread returning processed observations to the data program as soon as computations are complete. Thread programs are covered in detail in Chapter 6 of this book.
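Here is a minimal sketch of that pattern. The thread name work.mythread, the output table work.results, and the choice of four threads are assumptions made for illustration; sas_data.one_day is the input table used elsewhere in this chapter.

```sas
proc ds2;
/* Create the thread program (stored in the WORK library) */
thread work.mythread / overwrite=yes;
   method run();
      set sas_data.one_day;
      /* Row-level computations would go here */
   end;
endthread;
run;

/* Execute the thread from a data program */
data work.results;
   dcl thread work.mythread t;   /* declare an instance of the thread */
   method run();
      /* Run four copies of the thread in parallel */
      set from t threads=4;
   end;
enddata;
run;
quit;
```

Because each thread returns rows as soon as they are ready, the order of rows in the output table is not guaranteed to match the order of the input.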

2.2.5 Methods

Methods are named code blocks within a DS2 program, delimited by a METHOD statement and an END statement. Method blocks cannot contain nested method blocks, and all method identifiers (names) must be unique within their DS2 data, package, or thread program block. There are two types of methods:

1.   System methods execute automatically, and only at prescribed times in a DS2 program. They cannot be called by name.

2.   User-defined methods execute only when called by name.

2.2.5.1 System Methods

There are three system methods that are included in every DS2 data program, either implicitly or explicitly: INIT, RUN, and TERM. These methods provide a DS2 data program with a more structured framework than the SAS DATA step. In the Base SAS DATA step, the entire program is included in the implicit, data-driven loop. In a DS2 data program, the RUN method provides the implicit, data-driven loop that will be most familiar to the traditional DATA step programmer. The INIT and TERM methods are not included in the loop, and provide a place to execute program initialization and finalization code.

System methods execute automatically and do not accept parameters. You must explicitly define at least one of these methods in your data or thread program or the program will not execute. If you do not write explicit code for one or more system method blocks, the DS2 compiler creates an empty version of the missing system method for you at compile time. An empty method contains only the appropriate METHOD statement followed by an END statement.

2.2.5.1.1 The INIT Method

The INIT method executes once, and only once, immediately upon commencement of program execution. It provides a standard place to execute program initialization routines. The following DATA step and DS2 data programs produce the same results, but the DS2 data program does not require any conditional logic:

DATA step:

data _null_;

   if _n_=1 then do;

      put 'Execution is beginning';

   end;

run;

DS2 data program:

proc ds2;

data _null_;

   method init();

      put 'Execution is beginning';

   end;

enddata;

run;

quit;

2.2.5.1.2 The RUN Method

The RUN method best emulates the behavior of a traditional SAS DATA step. It begins operation as soon as the INIT method has completed execution and acts as a data-driven loop. The RUN method iterates once for every data row (observation) in the input data set. The RUN method is the only method that includes an implicit output at the END statement. This DATA step and DS2 data program produce the same results:

data new_data;

   if _n_=1 then do;

      put 'Execution is beginning';

   end;

   set sas_data.one_day;

run;

 

proc ds2;

data new_data;

   method init();

      put 'Execution is beginning';

   end;

   method run();

      set sas_data.one_day;

   end;

enddata;

run;

quit;
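The implicit output at the END of the RUN method can be replaced with an explicit OUTPUT statement, just as in a DATA step. The sketch below assumes the Amount variable in sas_data.one_day (it appears later in this chapter). The output table name and the threshold of 100 are arbitrary choices for illustration.

```sas
proc ds2;
data work.large_payments;
   method run();
      set sas_data.one_day;
      /* An explicit OUTPUT replaces the implicit output at END, */
      /* so only rows that meet the condition are written        */
      if Amount > 100 then output;
   end;
enddata;
run;
quit;
```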

2.2.5.1.3 The TERM Method

The TERM method executes once, and only once, immediately after the RUN method completes execution and before the data or thread program terminates execution. It provides an appropriate place to execute program finalization code. This DATA step and DS2 data program would produce the same results, but the DATA step requires the use of the END= SET statement option, the associated automatic variable, and a conditional logic decision to accomplish what the DS2 data program does without requiring any additional resources or code:

data new_data;

   if _n_=1 then do;

      put 'Execution is beginning';

   end;

   set sas_data.one_day end=last;

   if last=1 then do;

      put 'Execution is ending';

   end;

run;

 

proc ds2;

data new_data;

   method init();

      put 'Execution is beginning';

   end;

   method run();

      set sas_data.one_day;

   end;

   method term();

      put 'Execution is ending';

   end;

enddata;

run;

quit;

2.2.6 User-Defined Methods

In DS2, you can easily define and use your own reusable code blocks. These code blocks are called user-defined methods, and they can accept parameter values either by reference or by value. When all parameters are passed into a method by value, the values are available inside the method for use in calculations, and the method can return a single value to the calling process—much like a Base SAS function. This data program uses a user-defined method to convert temperatures from Celsius to Fahrenheit:

proc ds2;

/* No output DATA set. Results returned as a report (like SQL) */

data;

   dcl double DegC DegF;

   /* Method returns a value */

   method c2f(double Tc) returns double;

   /* Celsius to Fahrenheit */

      return (((Tc*9)/5)+32);

   end;

   method init();

      do DegC=0 to 30 by 15;

         DegF=c2f(DegC);

         output;

      end;

   end;

enddata;

run;

quit;

Figure 2.2: Output of Temperature Conversion


If one or more parameters are passed by reference, the values are available inside the method for use in calculations, and those values can be modified by the method at the call site, much like a Base SAS CALL routine. In DS2, parameters passed by reference are called IN_OUT parameters. A method that has IN_OUT parameters can modify several of its IN_OUT parameters during execution. In earlier versions of DS2, a user-defined method with IN_OUT parameters could not return a value, so you had to choose one or the other. SAS 9.4M5 brought with it the ability to create a user-defined method with IN_OUT parameters that also returns a value. This is a very useful update, allowing a method that modifies values at the call site to also provide a return code.

The IN_OUT modifier for a parameter instructs the method to expect this parameter to be passed in by reference. During execution, we’ll need to supply the name of a variable for this parameter (a reference to the memory location where the value to be processed is located) instead of an actual value. With the location information available, the method can both read and modify the value stored there. Now, this could certainly complicate things a bit. When we call the method, we’ll probably pass in a different variable name each time it’s used. To make internal coding of the method possible, we won’t try to guess or restrict the names of the variables that we’re going to pass in. Instead, we’ll give the parameter an alias (nickname) that we can use while processing internally.

The following data program uses user-defined method f2c to convert temperatures from Fahrenheit to Celsius. In the f2c method, the IN_OUT parameter is given the alias T. When we call method f2c, we pass in variable Tc as the parameter. While the method is executing, every reference to T is actually processing the value stored in Tc:

proc ds2;

data;

   dcl double Tf Tc;

   /* Method modifies a value at the call site */

   method f2c(in_out double T);

   /* Fahrenheit to Celsius (Rounded) */

      T=round((T-32)*5/9);

   end;

   method init();

      do Tf=0 to 212 by 100;

         Tc=Tf;

         f2c(Tc);

         output;

      end;

   end;

enddata;

run;

quit;

Figure 2.3: Output of Temperature Conversion Program


When calling this type of method, you must supply a variable name for IN_OUT parameters. Constant values produce a syntax error:

proc ds2;

/* No output DATA set. Results returned as a report (like SQL) */

data;

   dcl double Tf Tc;

   /* Method modifies a value at the call site */

   method f2c(in_out double T);

   /* Fahrenheit to Celsius (Rounded) */

      T=round((T-32)*5/9);

   end;

   method init();

   /* Method f2c requires a variable as a parameter */

   /* Passing in a constant causes an error         */

      f2c(37.6);

   end;

enddata;

run;

quit;

SAS Log:

ERROR: Compilation error.

ERROR: In call of f2c: argument 1 is 'in_out'; therefore, the argument must be a modifiable value.

Here is an example of a user-defined method with IN_OUT parameters that also returns a value:

proc ds2;

data _null_;

   dcl double t1 t2 ;

   dcl int rc;

   method test(in_out double x, double y) returns int;

      dcl double last_x;

      last_x=x;

      x=x**y;

      /* Return code:

         0 - value not changed

         1 - value changed */

      if x=last_x then return 0;

      else return 1;

   end;

   method init();

      t1=5;

      rc=test(t1,t2);

      put rc= t1= t2=;

      do t2=1 to 5 by 2;

         t1=5;

         rc=test(t1,t2);

         put rc= t1= t2=;

      end;

   end;

enddata;

run;

quit;

SAS Log:

rc=1 t1=. t2=.

rc=0 t1=5 t2=1

rc=1 t1=125 t2=3

rc=1 t1=3125 t2=5

A method of this type can be safely called while ignoring the return code, making this type of user-defined method quite flexible in use. In the example below, user-defined method test returns a value. In a traditional SAS DATA step program, a similar function would have to be called as part of an expression that evaluated or stored the value returned. In the DS2 program, no error is generated when the method is called without an expression to handle the return code—the return code is merely ignored:

proc ds2;

data _null_;

   dcl double t1 t2 ;

   method test(in_out double x, double y) returns int;

      dcl double last_x;

      last_x=x;

      x=x**y;

      /* Return code:

         0 - value not changed

         1 - value changed */

      if x=last_x then return 0;

      else return 1;

   end;

   method init();

      t1=5;

      test(t1,t2);

      put t1= t2=;

      do t2=1 to 5 by 2;

         t1=5;

         test(t1,t2);

         put t1= t2=;

      end;

   end;

enddata;

run;

quit;

SAS Log:

t1=. t2=.

t1=5 t2=1

t1=125 t2=3

t1=3125 t2=5

2.2.7 Variable Identifiers and Scope

In a DS2 program, all objects, variables, and code blocks must have identifiers (names). Within the DS2 program, an identifier’s scope is either global or local, and identifiers are unique within their scope. An identifier’s scope determines where in the program that identifier can be successfully referenced. Figure 2.4 shows a DS2 data program with variable identifiers that are both global and local in scope and indicates which values will be returned when referenced.

Figure 2.4: DS2 Data Program Variable Scope


If the program illustrated in Figure 2.4 is executed, the following results are produced in the SAS log:

INIT method  var1=Global

INIT method  var2=Global

RUN method   var1=Local

RUN method   var2=Global

TERM method  var1=Global

TERM method  var2=Global
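A program along the following lines would be expected to produce log output of this kind. This is a reconstruction sketched from the log above, not the exact code shown in the figure; the VARCHAR(6) declarations and the literal text in the PUT statements are assumptions.

```sas
proc ds2;
data _null_;
   /* Declared outside any method: global scope */
   dcl varchar(6) var1 var2;
   method init();
      var1='Global';
      var2='Global';
      put 'INIT method ' var1=;
      put 'INIT method ' var2=;
   end;
   method run();
      /* Declared inside RUN: local scope, hides the global var1 here */
      dcl varchar(6) var1;
      var1='Local';
      put 'RUN method  ' var1=;
      put 'RUN method  ' var2=;
   end;
   method term();
      /* The local var1 no longer exists; the global value is seen again */
      put 'TERM method ' var1=;
      put 'TERM method ' var2=;
   end;
enddata;
run;
quit;
```

Inside RUN, the local var1 takes precedence over the global variable of the same name, while var2 still resolves to the global variable.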

Variable identifiers are also global in scope if the variable is introduced to the program via a SET statement or the variable is undeclared. In DS2, an undeclared variable is created whenever a variable is first referenced in a program statement other than in a SET or DECLARE statement, such as an assignment statement. The use of undeclared variables in DS2 is discouraged; doing so will produce warnings in the SAS log. Although this might at first seem strange to a SAS DATA step programmer, if you’ve ever executed a long-running SAS program only to be greeted by the dreaded NOTE: Variable var is uninitialized in the SAS log, you will understand the benefit of this new behavior. Generally, that message means you’ve mistyped a variable name somewhere in your code, and the processing time for this run of the DATA step has been wasted.  

In DS2, the default behavior is to issue a WARNING when undeclared variables are encountered at compile time, but you can control this behavior to suit your own programming style and preferences.

Table 2.3: Controlling DS2 Behavior for Undeclared Variables

Duration                       Control method
SAS session                    SAS system option DS2SCOND=
Current PROC DS2 invocation    DS2 procedure option SCOND=
Next DS2 program block         DS2_OPTIONS statement with SCOND= option

Valid settings for these options are NONE, NOTE, WARNING (the default), and ERROR. When writing, prototyping, or testing code, I personally prefer the ERROR setting so that if my DS2 programs contain undeclared variables, they will fail to compile and execute, and will instead produce this message in the SAS log:

ERROR: Compilation error.

ERROR: Line nn: No DECLARE for referenced variable var; creating it as a global variable of type double.
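As a rough sketch, the three controls listed in Table 2.3 might be set as follows. All three are shown together only for illustration; in practice, one of them is normally enough. The small data program and its variable x are hypothetical.

```sas
/* Session-wide: applies to later PROC DS2 steps in this SAS session */
options ds2scond=error;

/* One invocation: applies to the program blocks in this PROC DS2 step */
proc ds2 scond=error;

   /* Next program block only */
   ds2_options scond=error;

   data _null_;
      declare double x;   /* declared, so no undeclared-variable message */
      method init();
         x=1;
         put x=;
      end;
   enddata;
   run;
quit;
```

With the ERROR setting in effect, removing the DECLARE statement from this program should cause compilation to fail instead of producing only a warning.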

In a DS2 data program, variable scope plays one additional important role: only global variables are included in the PDV, and only PDV variables are eligible to become part of the data program result set. You can explicitly remove global variables from the program result set using the DROP or KEEP statements. Variables with local scope are never included in the PDV, so there is no need to drop them.

For example, in the following DS2 program, the variables Total and Count are declared globally and have global scope. The variables Payee and Amount are introduced via the SET statement, so they also have global scope. All of these variables can be referenced in both the RUN and TERM methods, and all are included in the program result set.

proc ds2;

   data;

      declare double Total Count;

      method run();

         set sas_data.one_day (keep=(Payee Amount));

         Total+Amount;

         Count+1;

      end;

      method term();

         put Total= Count=;

      end;

   enddata;

   run;

quit;

SAS Log:

Total=7230.5 Count=6

Figure 2.5: Report Produced by the Data Program


In the next DS2 program, the variables Total and Count are declared locally in the RUN method. As a result, they have scope that is local to RUN and can be referenced only by the RUN method. When the TERM method attempts to reference variables Total and Count, they are not available in the PDV, so the DS2 compiler treats these as new, undeclared variables. Warning messages are produced in the SAS log and, because undeclared variables have global scope, Total and Count are included in the PDV and in the program result set. However, because the global versions of these variables were never assigned a value, Total and Count contain missing values in the output:

proc ds2;

   data;

      method run();

         declare double Total Count;

         set sas_data.one_day (keep=(Payee Amount));

         Total+Amount;

         Count+1;

      end;

      method term();

         put Total= Count=;

      end;

   enddata;

   run;

quit;

SAS Log:

Total=. Count=.

WARNING: Line nn: No DECLARE for referenced variable total; creating it as a global variable of type double.

WARNING: Line nn: No DECLARE for referenced variable count; creating it as a global variable of type double.

Figure 2.6: Report Produced by the Data Program Showing Missing Values


If we delete the TERM method from the program, the only references to Total and Count are to the local variables in the RUN method, so they are not included in the PDV at all. No warnings about undeclared variables are issued in the SAS log, and the result set contains only the global variables Payee and Amount:

proc ds2;

   data;

      method run();

         declare double Total Count;

         set sas_data.one_day (keep=(Payee Amount));

         Total+Amount;

         Count+1;

      end;

   enddata;

   run;

quit;

Figure 2.7: Report Produced by the Data Program with Local Variables Excluded

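For comparison, here is a sketch of the other way to keep accumulator variables out of the result set: declare them globally so that TERM can still reference them, and remove them explicitly with a DROP statement. This variation is an illustration that assumes the same sas_data.one_day table used above.

```sas
proc ds2;
   data;
      /* Global scope: available to both RUN and TERM, and part of the PDV */
      declare double Total Count;

      /* Explicitly exclude the accumulators from the result set */
      drop Total Count;

      method run();
         set sas_data.one_day (keep=(Payee Amount));
         Total+Amount;
         Count+1;
      end;

      method term();
         /* The global values are still available here */
         put Total= Count=;
      end;
   enddata;
   run;
quit;
```

Under these assumptions, the result set should contain only Payee and Amount, while the totals are still written to the SAS log by the TERM method.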

User-defined methods can accept parameters. Parameters passed by value are treated as variables with local scope within the method. For example, in the following program, the user-defined method fullname has two parameters, first and last, which act as local variables. There is also one locally declared variable, FinalText. The main data program has three global variables that are included in the PDV: WholeName, which is declared globally, and GivenName and Surname, which are introduced by the SET statement. The program result set contains only these three global variables.

proc ds2 ;

   data;

      declare varchar(100) WholeName;

      method fullname(varchar(50) first, varchar(50) last)

             returns varchar(100);

         dcl varchar(100) FinalText;

         FinalText=catx(', ', last, first);

         Return FinalText;

      end;

      method run();

         if _n_=4 then stop;

         set sas_data.customer (keep=(GivenName Surname));

         WholeName=fullname(GivenName, Surname);

      end;

   enddata;

   run;

quit;

Figure 2.8: Report Produced by the Data Program


If you have ever stored snippets of code in a SAS program file for inclusion in a traditional DATA step, you have probably experienced what I refer to as PDV contamination. When the included code references a variable that already exists in the main program, PDV values for existing variables can inadvertently be modified by the included code. When the code includes new variable references, unwanted variables can be added to the PDV and appear in the output data set.

When reusing DS2 methods, the method’s local variables never affect the PDV, a concept often referred to as variable encapsulation. Because method parameters and locally declared variables are local in scope, they are encapsulated in your method code and won’t contaminate the PDV. In Chapter 4, we will store our user-defined methods in a DS2 package for simple reuse in future programs. Because of variable encapsulation, you will never need to worry about PDV contamination when reusing properly written DS2 methods.

2.2.8 Data Program Execution

DS2 data programs are delivered to the DS2 compiler for syntax checking, compilation, and execution. At compile time, resources are reserved for the PDV, the code is compiled for execution and, if an output data set is being produced, the output data set descriptor is written. After compilation, execution begins with the INIT method code, and it is automatically followed by the RUN and TERM method code. Only system methods execute automatically; any user-defined methods must be called from the INIT, RUN, or TERM methods or else the user-defined method will not be executed.
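The execution order described above can be observed with a small sketch like the following. The method name greet and the message text are hypothetical, chosen only for this illustration.

```sas
proc ds2;
   data _null_;
      /* User-defined method: runs only when it is called */
      method greet();
         put 'greet was called';
      end;

      method init();
         put 'INIT';
         greet();   /* without this call, greet would never execute */
      end;

      method run();
         /* No SET statement here, so RUN executes a single time */
         put 'RUN';
      end;

      method term();
         put 'TERM';
      end;
   enddata;
   run;
quit;
```

The log should show the INIT message first, then the message from greet, followed by the RUN and TERM messages. The order in which the methods are defined in the program does not change the order in which the system methods execute.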

2.3 Converting a SAS DATA Step to a DS2 Data Program

2.3.1 A Traditional SAS DATA Step

Here is a traditional SAS DATA step with three subsections, which we will convert into a DS2 data program:

data _null_;

   /* Section 1 */

   if _n_=1 then

      do;

         put '**********';

         put 'Starting';

         put '**********';

      end;

 

   /* Section 2 */

   set sas_data.banks end=last;

   put Bank Rate;

 

   /* Section 3 */

   if last then

      do;

         put '**********';

         put 'Ending';

         put '**********';

      end;

run;

2.3.2 Considerations

1.   Section 1 consists of a DO group of statements that will be executed only when _N_=1. The automatic variable _N_ counts the number of times that the DATA step has iterated, so this block will execute only one time, when the DATA step first starts execution.

2.   Section 2 consists of unconditionally executed statements, which execute once for every observation in the input data set. In this section, the SET statement uses the END= option to create last, a temporary variable that serves as an end-of-file indicator. The variable last is initialized to 0 and remains 0 until the SET statement reads the last observation of the last data set listed, at which point it is set to 1. Because last is a temporary variable, it is automatically dropped and will not appear in the output data set.

3.   Section 3 consists of a DO group of statements that will execute only if the variable last contains a value other than 0 or missing.

If we think about this, Section 1 code sounds like a great candidate for the INIT system method, Section 2 for the RUN method, and Section 3 for the TERM method.

2.3.3 The Equivalent DS2 Data Program

Here is a DS2 data program equivalent to the original SAS DATA step:

proc ds2 ;

   data _null_;

 

      /* Section 1 */

      method init();

         put '**********';

         put 'Starting';

         put '**********';

      end;

 

      /* Section 2 */

      method run();

         set sas_data.banks;

         put Bank Rate;

      end;

 

      /* Section 3 */

      method term();

         put '**********';

         put 'Ending';

         put '**********';

      end;

   enddata;

   run;

quit;

2.3.4 More Complex Data Program Processing

DS2 can handle other traditional DATA step processes, such as concatenating tables with a SET statement and interleaving them with SET and BY statements:

proc ds2;

data concatenated;

   method run();

      set one two;

   end;

enddata;

run;

data interleaved;

   method run();

      set one two;

      by id;

   end;

enddata;

run;

quit;

Figure 2.9: Concatenation versus Interleaving with the SET Statement


2.3.5 Automatic Conversion with PROC DSTODS2

The SAS 9.4M5 release includes a new procedure, PROC DSTODS2. This procedure accepts a traditional SAS DATA step program file as input, and it produces the equivalent DS2 data program as its output. PROC DSTODS2 is primarily designed to translate DATA step scoring code generated by SAS Enterprise Miner to a DS2 data program that can then be used for in-database scoring, and as such, supports a (fairly extensive) subset of the SAS DATA step programming syntax. Where the original DATA step code is too complex for PROC DSTODS2 to convert completely, the procedure automatically converts as much of the code as possible, inserting the untranslatable DATA step code segments as comments at appropriate places in the DS2 data program that it produces. The user can then edit the DS2 data program to complete the conversion process. This can be a tremendous help when converting large, traditional DATA step programs to DS2.

Let’s try out PROC DSTODS2:

Here is a SAS program that we want to convert:

/*2.3.5.sas*/

data AU DE US;

   set sas_data.Customer;

   select (country);

      when ('US') output US;

      when ('AU') output AU;

      when ('DE') output DE;

      otherwise;

   end;

run;

 

data master;

   set AU DE US;

run;

Here is our first attempt to convert the program using PROC DSTODS2:

/*2.3.5.dstods2.sas*/

proc dstods2 in="&path/programs/2.3.5.sas"

             out="&path/programs/2.3.5.converted.sas";

run;

Here is the error message we get in the log:

ERROR: Multiple DATA statements encountered.

ERROR: DATA statement must appear at beginning of program.

It looks like we’ll have to convert one step at a time. Let’s split the program into two separate DATA step program files:

/*2.3.5.a.sas*/

data AU DE US;

   set sas_data.Customer;

   select (country);

      when ('US') output US;

      when ('AU') output AU;

      when ('DE') output DE;

      otherwise;

   end;

run;

 

/*2.3.5.b.sas*/

data master;

   set AU DE US;

run;

and convert them using PROC DSTODS2:

proc dstods2 in="&path/programs/2.3.5.a.sas"

             out="&path/programs/2.3.5.a.converted.sas";

run;

proc dstods2 in="&path/programs/2.3.5.b.sas"

             out="&path/programs/2.3.5.b.converted.sas";

run;

Here is the converted (and prettily formatted) program code:

/*2.3.5.a.converted.sas*/

data AU DE US;

   method run();

      set SAS_DATA.CUSTOMER;

      select (COUNTRY);

         when ('US') output US;

         when ('AU') output AU;

         when ('DE') output DE;

         otherwise ;

      end;

      ;

      _return: ;

   end;

 enddata;

 

/*2.3.5.b.converted.sas*/

data MASTER;

   method run();

      set AU DE US;

      ;

      _return: ;

   end;

enddata;

There is an unused _return: label and a null statement (just a semicolon) in each of the converted data programs, but the code is syntactically correct. We just need to add the RUN statements, and we’re ready to run in PROC DS2!
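As a sketch of that final step, the two converted data programs could be submitted in a single PROC DS2 invocation as shown here. Removing the null statements and the unused _return: label is an optional tidy-up added for this illustration; it is not something PROC DSTODS2 does.

```sas
proc ds2;
   /* Converted from 2.3.5.a.sas */
   data AU DE US;
      method run();
         set SAS_DATA.CUSTOMER;
         select (COUNTRY);
            when ('US') output US;
            when ('AU') output AU;
            when ('DE') output DE;
            otherwise ;
         end;
      end;
   enddata;
   run;

   /* Converted from 2.3.5.b.sas; reads the tables created above */
   data MASTER;
      method run();
         set AU DE US;
      end;
   enddata;
   run;
quit;
```

Each data program is followed by its own RUN statement, so the first program finishes creating AU, DE, and US before the second program reads them.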

PROC DSTODS2 can handle arrays, DO loops, BY statements, conditional logic, MERGE, and explicit OUTPUT statements, along with many others, so this procedure should prove to be quite useful, especially if you have a number of DATA steps that you need to convert to DS2 data programs.

2.4 Review of Key Concepts

   All DS2 programs are structured in blocks.  

   There are three types of DS2 program blocks: data, package, and thread.

   A program block begins with the appropriate DATA, PACKAGE, or THREAD statement, and ends with the corresponding ENDDATA, ENDPACKAGE, or ENDTHREAD statement. The remainder of the program consists of a combination of global declarative statements and method definitions. All executable statements must be part of a method block definition.

   There are three system methods: INIT, RUN, and TERM. Every data and thread program must contain explicit coding for at least one of these methods. System methods execute automatically and do not accept parameters.

   You can write user-defined methods, keeping the following in mind:

      User-defined methods do not execute automatically; they execute only when called.

      User-defined methods can accept parameters with values passed either by value or by reference (IN_OUT parameters).

      A method that has no IN_OUT parameters can return a value, much like a SAS function.

      A method can modify the values of its IN_OUT parameters, and those changes are visible to the caller, much like a SAS CALL routine.

      User-defined methods can be stored for easy reuse in a DS2 package.

   Variables should be declared in a DS2 program using a DECLARE (DCL) statement. Where the variable is declared determines the variable’s scope.

   Variables introduced to a DS2 program via a SET statement, declared in the global program space (before method definitions begin), or that appear undeclared in the program code will have global scope. Global variables can be referenced anywhere inside the DS2 program, are part of the PDV, and are included in the program result set by default.

   Variables declared inside a METHOD block and method parameter variables are local in scope and can be referenced only within that method. Local variables are never included in the PDV and cannot become part of the program result set.

   You can use PROC DSTODS2 to help speed up the process of converting a traditional SAS DATA step to a DS2 data program.
