Reading and Writing Only Essential Data

Overview

Best practices specify that you should write programs that read and write only essential data. If you process fewer observations and variables, you conserve resources. This topic covers many different techniques that can improve performance.

Selecting Observations Using Subsetting IF versus WHERE Statement

You can use WHERE statements or subsetting IF statements to subset data. Although both statements test a condition to determine whether SAS should select an observation, the WHERE statement is more efficient.
The following graphic illustrates differences between these statements.
Selecting Observations Using Subsetting IF versus WHERE Statement
I/O operations are measured as data moves between the disk that contains input SAS data and the input buffer in memory, and when data moves from the output buffer to the disk that contains output data sets. Input data is not affected by the WHERE statement or subsetting IF statement. However, output data is affected by both.
CPU time is measured when data must be processed in the program data vector. CPU time can be saved if fewer observations are processed.
A WHERE statement and a subsetting IF statement make different use of the program data vector. The WHERE statement selects observations before they are loaded into the program data vector, which results in a savings in CPU operations. The subsetting IF statement loads each observation sequentially into the program data vector. If the subsetting condition is true, the observation is processed and is written to the output page buffer.
WHERE statements work on existing variables in existing SAS data sets. Subsetting IF statements can work on any variable in the program data vector, including new or existing variables.

Comparative Example: Creating a Subset of a SAS Data Set

Overview

Suppose you want to create a subset of the data set Retail.Customer. You want to include data for only the United Kingdom. The subset contains approximately 6% of the Retail.Customer data.
The following sample programs compare two techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site.

Programming Techniques

1 Subsetting IF Statement
This program uses the IF statement to select observations if the value for Country is GB.
data retail.UnitedKingdom;
   set retail.customer;
   if country='GB';
run;
2 WHERE Statement
This program uses the WHERE statement to select observations when the value for Country is GB. This can be more efficient than using a subsetting IF statement.
data retail.UnitedKingdom;
   set retail.customer;
   where country='GB';
run;

General Recommendations

  • To save CPU resources when the subset is small, use a WHERE statement instead of a subsetting IF statement to subset a SAS data set.

Other Differences between the IF and WHERE Statements

Review the following table to note other differences between the IF and WHERE statements.
 Action
The Subsetting IF Statement
The WHERE Statement
Selecting Data
Can select records from external files, observations from SAS data sets, observations created with an INPUT statement, or observations based on the value of a computed or derived variable.
Can select only observations from SAS data sets.
Conditional Execution
Is an executable statement
Is not an executable statement
Grouping Data Using a BY Statement
Has no effect on FIRST. or LAST. flags.
Affects FIRST. or LAST. flags, which are set after processing the WHERE expression.
Merging Data
Selects observations after combining current observations.
Applies the selection criteria to each input data set before combining observations.
Note: If you use the WHERE= data set option and the WHERE statement in the same DATA step, SAS ignores the WHERE statement for input data sets. The WHERE= data set option and the WHERE statement call the same SAS routine.

Using the WHERE Statement with the OBS= and FIRSTOBS= Options

Another way to read and write only essential data is to process a segment of subsetted data. You accomplish this specialized task by using a WHERE expression in conjunction with the OBS= and FIRSTOBS= data set options.
In the following example, the WHERE expression selects observations before the OBS= and FIRSTOBS= options are applied. The values specified for OBS= and FIRSTOBS= are the logical observation numbers in the subset, not the physical observation numbers in the data set.
options fmtsearch=(formats);

proc print 
     data=company.organization_dim(firstobs=5 obs=8);
     var employee_id employee_gender salary;
     where salary>40000;
run;
Using the Where Statement with the OBS= and FIRSTOBS= Options
FIRSTOBS = 5 is the fifth observation in the subset, whereas it was observation 101 in the data set Company.Organization.
OBS = 8 is the eighth observation in the subset, whereas it was observation 159 in the data set Company.Company.Organization..

Selecting Observations When Reading Data from External Files

Positioning a subsetting IF statement in a DATA step so that it reads only the variables that are needed to select the subset—before reading all the data—can reduce the overhead required for processing data.
The following graphic illustrates how data is read from an external file, loaded into the input buffer, and read into the program data vector.
Remember that I/O operations are measured as data moves between disks and buffers—for both input and output data. Each record is loaded into the input buffer before moving to the program data vector for processing, so I/O is not affected by the placement of a subsetting IF statement in the DATA step.
You can reduce the CPU resources that are required for processing data by limiting what is read into the program data vector. Position a subsetting IF statement after an INPUT statement that reads only the data that is required in order to check for specific conditions. Subsequent statements do not execute for unwanted observations.
Note: Converting raw character fields to SAS character variables requires less CPU time than converting raw numeric fields to the real binary encoding of SAS numeric variables.

Comparative Example: Creating a Subset of Data by Reading Data from an External File

Overview

Suppose you want to create a SAS data set by reading a subset of data from an external file that is referenced by the fileref Customerdata. You want the subset to contain only customers in the United Kingdom.
The subset is approximately 6% of the countries in the external file, which contains 89,954 records and 12 fields.
The following sample programs compare two techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site.

Programming Techniques

1 Reading All Variables and Subsetting
In this program, the INPUT statement reads all variables before the subsetting IF statement checks the value of Country. Then, if the value for Country is GB, the observation is written to the output data set Retail.UnitedKingdom.
data retail.UnitedKingdom;
   infile customerdata;
   input @1   Customer_ID         12.
         @13  Country             $2.
         @15  Gender              $1.
         @16  Personal_ID        $15.
         @31  Customer_Name      $40.
         @71  Customer_FirstName $20.
         @91  Customer_LastName  $30.
         @121 Birth_Date       date9.
         @130 Customer_Address   $45.
         @175 Street_ID           12.
         @199 Street_Number       $8.
         @207 Customer_Type_ID     8.;
    if country='GB';
run;
2 Reading Selected Variables and Subsetting
In this program, the first INPUT statement reads only Country and holds the record in the input buffer using the single trailing @ sign. Then the program uses a subsetting IF statement to check the value of Country. If the value for Country is not GB, other variables are not read in or written to the output data set Retail.UnitedKingdom. If the value for Country is GB, values for other variables are input and written to the output data set Retail.UnitedKingdom.
data retail.UnitedKingdom;
   infile customerdata;
   input @13 Country $2. @;
   if country='GB';
   input @1   Customer_ID         12.
         @15  Gender              $1.
         @16  Personal_ID        $15.
         @31  Customer_Name      $40.
         @71  Customer_FirstName $20.
         @91  Customer_LastName  $30.
         @121 Birth_Date       date9.
         @130 Customer_Address   $45.
         @175 Street_ID           12.
         @199 Street_Number       $8.
         @207 Customer_Type_ID     8.;
run;

General Recommendations

  • Position a subsetting IF statement in a DATA step so that only variables that are necessary to select the record are read before subsetting. This can result in significant savings in CPU time. There is no difference in I/O or memory usage between the two techniques.
  • When selecting records from an external file, read the field or fields on which the selection is being made before reading all the fields into the program data vector.
  • Use the single trailing @ sign to hold the input buffer so that you can continue to read the record when the variable or variables satisfy the IF condition.
  • Reset the pointer so that you can begin reading the record in the first position by using @1 Customer_ID.
In addition to subsetting observations, you can subset variables by using statements or options that efficiently read and write only essential data.

Subsetting Variables with the KEEP and DROP Statements and Options

To subset variables, you can use either of the following:
  • the DROP and KEEP statements
  • the DROP= and KEEP= data set options
Subsetting Variables with the KEEP= and DROP= Statements and Options
Use of the KEEP= data set option and the DROP= data set option can affect resource usage, depending on whether they are used in a SET or MERGE statement or in a DATA statement.
The following figure shows how options in these statements process data.
Subsetting Variables with the KEEP= and DROP= Statements and Options
When used in the SET or MERGE statement, the KEEP= and DROP= data set options affect which variables are read into the program data vector. Reading only the variables that need to be processed in the DATA step can sometimes improve efficiency.
When used in the DATA statement, these same options put drop flags on variables to be excluded and affect which variables are written to the output data set.
The DROP and KEEP statements work just like the KEEP= or DROP= options in the DATA statement.
The following table describes differences in how the KEEP statement and the KEEP= data set option write variables to SAS data sets.
DROP or KEEP Statement
DROP= or KEEP= Output Data Set Option
DROP= or KEEP= Input Data Set Option
Writes only the selected variables to all output data sets.
Can write different variables to different output data sets.
Reads only the selected variables into the PDV.
Available only in the DATA step.
Available in the DATA step or most PROC steps.
Available in the DATA step or most PROC steps.

Comparative Example: Creating a Report That Contains Average and Median Statistics

Overview

Suppose you want to create a report that contains the average and median values for the variable Profit, based on the data set Retail.Order_fact. Depending on the number of variables eliminated, it might be more efficient to use the KEEP= option in a SET statement to limit which variables are read.
The following sample programs compare two techniques for reading and writing variables to a data set. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site.

Programming Techniques

1 Without the KEEP= option
This program reads all variables from the data set Retail.Order_fact and does not restrict which variables are written to the output data set Retail.Profit. PROC MEANS reads all the variables from the data set.
data retail.profit;
   set retail.order_fact;  
   if discount=. then    
      Profit=(total_retail_price-costPrice_Per_Unit)*quantity;  
   else Profit=((total_retail_price*discount)-costprice_per_unit)*quantity;
run;
proc means data=retail.profit mean median maxdec=2;
   title 'Order Information';
   class employee_id;
   var profit;
run;
2 KEEP= in the DATA Statement
This program uses the KEEP= data set option in the DATA statement to write two variables to the output data set Retail.Profit. PROC MEANS reads only two variables from the data set.
data retail.profit(keep=employee_id profit);
   set retail.order_fact;
   if discount=. then
      Profit=(total_retail_price-costprice_per_unit)*quantity;
   else Profit=((total_retail_price*discount)-costprice_per_unit)*quantity;
run;
proc means data=retail.profit mean median maxdec=2;
   title 'Order Information';
   class employee_id;
   var profit;
run;
3 KEEP= in the DATA and SET Statements
This program uses the KEEP= option in the SET statement to read six variables from Retail.Order_fact, and it uses the KEEP= data set option in the DATA statement to write two variables to the output data set Retail.Profits. PROC MEANS reads only two variables from the data set.
data retail.profits(keep=employee_id profit);  
   set retail.order_fact(keep=employee_id total_retail_price discount 
                              costprice_per_unit quantity);  
      if discount=. then    
         Profit=(total_retail_price-costprice_per_unit)*quantity;  
      else Profit=((total_retail_price*discount)-costprice_per_unit)*quantity;
run;
proc means data=retail.profit mean median maxdec=2;
   title 'Order Information';
   class employee_id;
   var profit;
run;
4 KEEP= in the SET and MEANS Statements
This program uses the KEEP= option in the SET statement to read selected variables from Retail.Order_fact, and it uses the KEEP= data set option in the MEANS statement to process only the variables that are needed for the statistical report. You might do this if you need additional variables in Retail.Profits for further processing, but only two variables for processing by PROC MEANS.
data retail.profit;
  set retail.order_fact(keep=employee_id total_retail_price discount
                             costprice_per_unit quantity);
  if discount=. then
     Profit=(total_retail_price-costprice_per_unit)*quantity;
  else Profit=((total_retail_price*discount)-costprice_per_unit)*quantity;
run;
proc means data=retail.profit(keep=employee_id profit) mean median maxdec=2;
   title 'Order Information';
   class employee_id;
   var profit;
run;

General Recommendations

  • To reduce both CPU time and I/O operations, avoid reading and writing variables that are not needed.

Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables

Overview

Suppose you want to read data from an external file that is referenced by the fileref Rawdata and create a SAS data set that contains only the variables Customer_ID, Country, Gender, and Customer_Name.
The following sample programs compare two techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site.

Programming Techniques

1 Reading All Fields
In this program, the KEEP= data set option writes only the variables that are needed to the output data set, whereas the INPUT statement reads all fields from the external file.
data retail.customers(keep=Customer_ID Country Gender Customer_Name);
   infile rawdata;
   input @1   Customer_ID         12.
         @13  Country             $2.
         @15  Gender              $1.
         @16  Personal_ID        $15.
         @31  Customer_Name      $40.
         @71  Customer_FirstName $20.
         @91  Customer_LastName  $30.
         @121 Birth_Date       date9.
         @130 Customer_Address   $45.
         @175 Street_ID           12.
         @199 Street_Number       $8.
         @207 Customer_Type_ID     8.;
run;
2 Reading Selected Fields
In this program, the INPUT statement reads selected fields from the external file, and by default, these are written to the output data set. This program is an example of efficient processing.
data retail.customers;
   infile rawdata;
   input @1   Customer_ID         12.
         @13  Country             $2.
         @15  Gender              $1.
         @31  Customer_Name      $40.;
run;

General Recommendations

  • Read only the fields you need from an external data file to save CPU and real-time resources.
  • To save CPU resources, avoid converting numeric data that you do not need in further processing.
Note: Remember that numeric data is moved into the program data vector after being converted to real binary, floating point numbers; multiple digits are stored in one byte. Character data is moved into the program data vector with no conversion; one character is stored in one byte.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.220.98.14