Reducing Data Storage Space for Numeric Variables

Another way to reduce data storage space is to reduce the length of numeric variables. In addition to conserving data storage space, reduced-length numeric variables use less I/O, both when data is written and when it is read. For a file that is read frequently, this savings can be significant. However, in order to safely reduce the length of numeric variables, you need to understand how SAS stores numeric data.

How SAS Stores Numeric Variables

To store numbers of large magnitude and to perform computations that require many digits of precision to the right of the decimal point, SAS stores all numeric values using double-precision floating-point representation. SAS numeric variables have a maximum length of 8 bytes and a minimum length of 2 or 3 bytes, depending on your operating environment. The default length is 8 bytes. Multiple digits are stored per byte.
Floating-point representation is an implementation of scientific notation. For example, the number 234 might be written as .234*10**3 with a base of 10. In this example, .234 is referred to as the mantissa, 10 is the base, and 3 is the exponent. The figures below show how SAS stores a numeric value in 8 bytes. Mainframe environments use base 16. The first bit stores the sign, the next seven bits store the exponent, and the remaining 56 bits store the mantissa.
Floating-point representation
Non-mainframe environments use base 2. The first bit stores the sign, the next 11 bits store the exponent, and the remaining 52 bits store the mantissa.
Floating-point representation
Note: The minimum length for a numeric variable is 2 bytes in mainframe environments and 3 bytes in non-mainframe environments.
Now that you have seen how SAS stores numeric variables, consider how you can assign a length to your numeric variables that is less than the default length of 8 bytes.

Assigning Lengths to Numeric Variables

You can use a LENGTH statement to assign a length from 2 to 8 bytes to numeric variables. Remember, the minimum length of numeric variables depends on the operating environment. Also, keep in mind that the LENGTH statement affects the length of a numeric variable only in the output data set. Numeric variables always have a length of 8 bytes in the program data vector and during processing.
General form, LENGTH statement for numeric variables:
LENGTH variable(s) length <DEFAULT=n>;
Here is an explanation of the syntax:
variable(s)
specifies the name of one or more numeric SAS variables, separated by spaces.
length
is an integer that specifies the length of the variable(s).
DEFAULT=n
this optional argument changes the default number of bytes that SAS uses to store any newly created numeric variables. If you use the DEFAULT= argument, you do not need to list any variables.
DEFAULT= applies only to numeric variables that are created after the LENGTH statement. List specific variables, and their lengths, along with the DEFAULT= argument, if you want the listed variables to receive a specified length. Then the non-listed variables receive the DEFAULT= length.
CAUTION:
Numeric values lose precision if truncated. You learn more about the loss of precision with reduced-length numeric variables in the next section of this chapter.

Example

The following program assigns a length of 4 to the new variable Sale_Percent in the data set ReducedSales. The LENGTH statement in this DATA step does not apply to the variables that are read from the Sales data set; those variables maintain whatever length they had in Sales when they are read into ReducedSales.
data reducedsales;
   length default=4;
   set sales;
   Sale_Percent=15;
run;

Maintaining Precision in Reduced-Length Numeric Variables

There is a limit to the values that you can precisely store in a reduced-length numeric variable. You have learned that reducing the number of bytes that are used for storing a numeric variable does not affect how the numbers are stored in the program data vector. Instead, specifying a value of less than 8 in the LENGTH statement causes the number to be truncated to the specified length when the value is written to the SAS data set.
You should never use the LENGTH statement to reduce the length of your numeric variables if the values are not integers. Fractional numbers lose precision if truncated. Even if the values are integers, you should keep in mind that reducing the length of a numeric variable limits the integer values that can accurately be stored as a value.
The following table lists the possible storage length for integer values on UNIX or Windows operating environments.
Table 21.1 UNIX or Windows
Length (bytes)
Largest Integer Represented Exactly
3
8,192
4
2,097,152
5
536,870,912
6
137,438,953,472
7
35,184,372,088,832
8
9,007,199,254,740,992
The following table lists the possible storage length for integer values on the z/OS operating environment.
Table 21.2 z/OS
Length (bytes)
Largest Integer Represented Exactly
2
256
3
65,536
4
16,777,216
5
4,294,967,296
6
1,099,511,627,776
7
281,474,946,710,656
8
72,057,594,037,927,936
Suppose you store an integer that is equal to or less than the number listed above as the largest integer that can be represented exactly in a reduced-length variable. In such a case, SAS truncates bytes that contain only zeros. If the integer that is stored in a reduced-length variable is larger than the recommended limit, SAS truncates bytes that contain numbers other than zero, and the integer value is changed. Similarly, you should not reduce the stored size of non-integer data because it can result in a loss of precision due to the truncation of nonzero bytes.
If you decide to reduce the length of your numeric variables, you might want to verify that you have not lost any precision in your values. Here is one way to do this action.

Using PROC COMPARE

You can use PROC COMPARE to gauge the precision of the values that are stored in a shortened numeric variable. You do this by comparing the original variable with the shortened variable. The COMPARE procedure compares the contents of two SAS data sets, selected variables in different data sets, or variables within the same data set.
General form, PROC COMPARE step to compare two data sets:
PROC COMPARE BASE=SAS-data-set-one
COMPARE=SAS-data-set-two;
RUN;
Here is an explanation of the syntax:
SAS-data-set-one and SAS-data-set-two
specifies the two SAS data sets that you want to compare.
PROC COMPARE is a good technique to use for gauging the loss of precision in shortened numeric variables because it shows you whether there are differences in the stored numeric values even if these differences do not appear once the numeric variables have been formatted. PROC COMPARE looks at the two data sets and compares the following:
  • data set attributes
  • variable attributes for matching variables
  • observations
  • values in matching variables
Output from the COMPARE procedure includes the following information:
  • a data set summary
  • a variables summary
  • a listing of common variables that have different attributes
  • an observation summary
  • a values comparison summary
  • a listing of variables that have unequal values
  • a detailed list of value comparison results for variables

Example

The data set Company.Discount contains data about sale dates and discounts for certain retail products. There are 35 observations in Company.Discount, which is described below.
Variable
Type
Length
Description
Product_ID
num
8
product ID number
Start_Date
num
4
start date of sale
End_Date
num
5
end date of sale
Unit_Sales_Price
num
8
discounted sales price per unit
Discount
num
8
discount as percent of normal sales price
Suppose you shorten the length of the numeric variable Discount. The DATA step below creates a new data set named Company.Discount_Short, whose only difference from Company.Discount is that the length of the variable Discount is 4 instead of 8.
data company.discount_short;
   length Discount 4;
   set Company.Discount;
run;
You can use PROC COMPARE to evaluate whether shortening the length of Discount affects the precision of its values by comparing Company.Discount to Company.Discount_Short.
proc compare base=company.discount
             compare=company.discount_short;
run;
If you were to print these two data sets (Company.Discount and Company.Discount_Short), the values might appear to be identical. However, there are differences in the values as they are stored, but those differences are not apparent in the formatted output.
In the partial output below, you can see that shortening the length of Discount results in a loss of precision in its values; the values for Discount in Company.Discount_Short differ by a maximum of 1.9073E-07. The value comparison results show that although the values for Discount in the first five observations appear as 70% in both data sets, the precise (unformatted) values differ by −1.907E-7.
Figure 21.1 Partial PROC COMPARE Output
PROC COMPARE Output
Values Comparison Summary
Value Comparison Results

Comparative Example: Creating a SAS Data Set That Contains Reduced-Length Numeric Variables

Default versus Reduced-Length Numeric Variables

Suppose you want to create a SAS data set to store retail data about a group of orders. Suppose that the data that you want to include in your data set is all numeric data and that it is currently stored in a raw data file.
The following sample programs compare two techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site.
Note: Throughout this book, the keyword _NULL_ is often used in place of the data set name in sample programs. Using _NULL_ suppresses the creation of an output data set. Using _NULL_ when benchmarking enables you to determine what resources are used to read a SAS data set.

Programming Techniques

1 Default-Length Numeric Variables
This program reads the external data file that is referenced by the fileref Flat1 and creates a new data set called Retail.Longnums that contains 12 numeric variables. Each of the variables in Retail.Longnums has the default storage length of 8 bytes. The second DATA step in this program reads the numeric variables from Retail.Longnums.
data retail.longnums;
   infile flat1;
   input  Customer_ID          12.
          Employee_ID          12.
          Street_ID            12.
          Order_Date           date9.
          Delivery_Date        date9.
          Order_ID             12. 
          Order_Type           comma16.
          Product_ID           12.     
          Quantity             4.     
          Total_Retail_Price   dollar13.2
          CostPrice_Per_Unit   dollar13.2
          Discount             5.        ;
run;

data _null_;
   set retail.longnums;
run;
2 Reduced-Length Numeric Variables
This program reads the external data file that is referenced by the fileref Flat1 and creates a new SAS data set called Retail.Shortnums that contains 12 numeric variables. A LENGTH statement is used to reduce the storage length of most of the numeric variables in Retail.Shortnums, as follows:
  • Total_Retail_Price and CostPrice_Per_Unit have a storage length of 8 bytes.
  • Product_ID has a storage length of 7 bytes.
  • Street_ID and Order_ID have a storage length of 6 bytes.
  • Employee_ID has a storage length of 5 bytes.
  • Customer_ID, Order_Date, Delivery_Date, and Discount have a storage length of 4 bytes.
  • Order_Type and Quantity have a storage length of 3 bytes.
The second DATA step reads the reduced-length numeric variables from Retail.Shortnums.
data retail.shortnums;
   infile flat1;
          length Quantity  Order_Type   3
          Customer_ID  Order_Date
          Delivery_Date  Discount       4
          Employee_ID                   5
          Street_ID  Order_ID           6
          Product_ID                    7
          Total_Retail_Price
          CostPrice_Per_Unit            8;
   input  Customer_ID          12.
          Employee_ID          12.
          Street_ID            12.
          Order_Date           date9.
          Delivery_Date        date9.
          Order_ID             12. 
          Order_Type           comma16.
          Product_ID           12.     
          Quantity             4.     
          Total_Retail_Price   dollar13.2
          CostPrice_Per_Unit   dollar13.2
          Discount             5.        ;
run;

data _null_;
   set retail.shortnums;
run;
Note: Remember that when you reduce the storage length of numeric variables, you risk losing precision in their values. You can use PROC COMPARE to verify the precision of shortened numeric variables.
proc compare base=retail.longnums;
             compare=retail.shortnums;
run;
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.217.107.229