© Randy Betancourt, Sarah Chen 2019
R. Betancourt, S. Chen, Python for SAS Users, https://doi.org/10.1007/978-1-4842-5001-3_2

2. Python Types and Formatting

Randy Betancourt (1) and Sarah Chen (2)
(1) Chadds Ford, PA, USA
(2) Livingston, NJ, USA

In this chapter we discuss Python “types” along with string and numeric formatting. Python has different “types,” some of which are built into the language, some of which are added by third parties, and some of which are created by Python code. These types can represent all kinds of data and allow Python to be used as a general-purpose programming language in addition to its use as a data processing language. As a general rule, SAS programmers need not concern themselves with data types. This is because the data model used to store variables in Foundation SAS datasets (.sas7bdat) is a simple one. The data “types” for SAS are either numeric or character. Internally SAS uses floating-point representation for numeric values.

The SAS language handles a tremendous amount of detail without user intervention when reading and writing numeric or character data. For example, SAS informats (both SAS-supplied and user-defined ones) provide the mappings for reading various data types for numeric and character data inputs. Similarly, SAS formats provide the mappings needed to write various data types.

We begin by examining numerics which include Boolean type operators used for truth testing. This is followed by an overview of what Python refers to as strings and SAS refers to as character variables. Finally we will discuss formatting for both numerics and strings.

Numerics

Python has three distinct numeric data types:
  1. Integers
  2. Floating point
  3. Complex numbers

In addition, Booleans are a subtype of integers, which we discuss in detail later in this chapter. Complex numbers will not be discussed as they are outside the scope of this book. In Python, numbers are either numeric literals or created as the result of built-in operators or functions. Any numeric literal containing an exponent sign or a decimal point is mapped to a floating-point type. Whole numbers, including hexadecimal, octal, and binary numbers, are mapped to integer types.
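The literal rules just described can be checked with a short sketch. The variable names and values are illustrative only and do not come from the listings in this chapter.

```python
# Literals containing a decimal point or an exponent are floats.
a = 1.0
b = 1e3            # exponent notation

# Whole-number literals are integers, whatever base they are written in.
c = 255
d = 0xFF           # hexadecimal
e = 0o377          # octal
f = 0b11111111     # binary

for name, value in [('a', a), ('b', b), ('c', c), ('d', d), ('e', e), ('f', f)]:
    print(name, '=', value, type(value))
```

The last four literals are different spellings of the same integer value.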

Python permits “mixed” arithmetic operations, meaning numerics with different types used in expressions are permitted. In Listing 2-1, the built-in function type() is used to return the object’s data type. (Throughout this book we will rely on the built-in type() function extensively to return information from a Python object.)
>>> nl = '\n'
>>>
>>> x = 1
>>> y = 1.5
>>> z = x * y
>>>
>>> print(nl                     ,
...       'x type is:' , type(x) , nl,
...       'y type is:' , type(y) , nl,
...       'z type is:' , type(z))
 x type is: <class 'int'>
 y type is: <class 'float'>
 z type is: <class 'float'>
Listing 2-1

Mixed Types

In this example, x is an integer and y is a float. The product of x and y is assigned to z, which Python casts as a float. This illustrates the rule Python uses for mixed arithmetic operations: when an integer and a float are combined, the result is a float. This example also neatly illustrates the compactness of Python as a language. Similar to the SAS language, there is no need to declare variables and their associated data types as they are inferred from their usage.

In contrast, the Base SAS language does not make a distinction between integers and floats. Newer SAS language implementations such as DS2 and Cloud Analytic Services (CAS) define a range of numeric data types such as integer and floats. These newer language implementations are outside the scope of this book. Listing 2-2 illustrates this same program logic written in SAS. The line numbers from the SAS log have been included.
4   data types;
5      x = 1;
6      y = 1.5;
7      z = x*y;
8
9      Put "x is: " x /
10         "y is: " y/
11         "z is: " z;
12
13 title "SAS Reported 'type' for numeric variables";
14 proc print data=sashelp.vcolumn(where=(libname='WORK' and
15                                 memname='TYPES'));
16    id name;
17    var type;
18 run;
OUTPUT:
x is: 1
y is: 1.5
z is: 1.5
Listing 2-2

SAS Data Types

This SAS program creates the temporary SAS dataset WORK.TYPES. With the creation of the SAS dataset, we can search the SAS DICTIONARY table SASHELP.VCOLUMN and return the “type” associated with the SAS variables x, y, and z. The results from PROC PRINT are displayed in Figure 2-1. It shows variables x, y, and z are defined as num, indicating they are numerics.
Figure 2-1

SAS Data Types for Numerics

You are not likely to encounter issues related to data type differences when working with SAS and Python. For the most part, issues related to mapping data types arise when reading data from external environments, particularly with relational databases. We will discuss these issues in detail in Chapter 6, “pandas Readers and Writers.”

Python Operators

Similar to SAS, the Python interpreter permits a wide range of mathematical expressions and functions to be combined. Python’s expression syntax is very similar to the SAS language, using the operators +, -, *, and / for addition, subtraction, multiplication, and division, respectively. As in SAS, parentheses are used to group operations and control precedence.

Table 2-1 displays the Python floating-point and numeric type operations precedence (excluding complex numbers).
Table 2-1

Python Arithmetic Operations Precedence1

Precedence   Operation      Results
1            x ** y         x to the power y
2            divmod(x, y)   The pair (x // y, x % y)
3            float(x)       x converted to floating point
4            int(x)         x converted to integer
5            abs(x)         Absolute value of x
6            +x             x unchanged
7            -x             x negated
8            x % y          Remainder of x / y
9            x // y         Floored quotient of x and y
10           x * y          Product of x and y
11           x / y          Quotient of x and y
12           x - y          Difference of x and y
13           x + y          Sum of x and y
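A few expressions show how grouping changes a result. This is offered as a quick check with made-up operands, not as a complete statement of Python’s precedence rules.

```python
print(2 + 3 * 4)              # multiplication is applied before addition
print((2 + 3) * 4)            # parentheses override the default grouping
print(-2 ** 2)                # ** is applied before the unary minus
print(7 / 2, 7 // 2, 7 % 2)   # true division, floored quotient, remainder
print(divmod(7, 2))           # the pair (7 // 2, 7 % 2)
print(-7 // 2)                # floor division rounds toward negative infinity
```

When in doubt about how an expression groups, adding parentheses makes the intent explicit, just as it does in SAS.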

Boolean

As stated previously Python’s Boolean data type is a subtype of integer. Because of its utility in data cleansing tasks and general-purpose testing, we will cover this data type in detail. Python’s two Boolean values are True and False with the capitalization as shown. In a numerical context, for example, when used as an argument to arithmetic operations, they behave like integers with values 0 for False and 1 for True. This is illustrated in Listing 2-3.
>>> print(bool(0))
False
>>> print(bool(1))
True
Listing 2-3

Boolean Value Tests for 0 and 1
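The integer behavior of Booleans in arithmetic can be sketched as follows. The list named values is made up for illustration.

```python
print(isinstance(True, int))   # bool is a subtype of int
print(True + True)             # behaves like 1 + 1
print(False * 10)              # behaves like 0 * 10

# Because True counts as 1, summing comparison results counts
# how many items pass a test.
values = [3, -1, 7, 0, 12]
positives = sum(v > 0 for v in values)
print(positives)
```

Counting with Booleans in this way is a common idiom in data cleansing tasks.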

In contrast, SAS does not have a Boolean data type. As a result, SAS Data Step code is often constructed as a series of cascading IF-THEN/DO blocks used to perform Boolean-style truth tests. SAS does have implied Boolean test operators, however. An example is the END= variable option on the SET statement. This feature is used as an end-of-file indicator when reading a SAS dataset. The value assigned to the END= variable option is initialized to 0 and set to 1 when the SET statement reads the last observation in a SAS dataset.

Other SAS functions also use implied Boolean logic. For example, the FINDC function used to search strings for characters returns a value of 0, or false, if the search excerpt is not found in the target string. Every Python object can be interpreted as a Boolean and is either True or False.

These Python objects are always False:
  • None

  • False

  • 0 (for integer, float, and complex)

  • Empty strings

  • Empty collections such as "", (), [], {}
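The list above can be confirmed by passing each object to bool(). The two lists below are illustrative; the second shows objects that may look empty but are not.

```python
# Objects that always evaluate False.
falsy = [None, False, 0, 0.0, 0j, '', (), [], {}, set()]

# Objects that evaluate True even though they may look "empty".
truthy = [' ', '0', [0], (None,), {'k': None}, -1]

for obj in falsy + truthy:
    print(repr(obj), '->', bool(obj))
```

Note that a string holding a single blank, a list holding a zero, and any nonzero number all evaluate True.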

Comparison Operators

Python has eight comparison operators, shown in Table 2-2. They all have the same priority, which is higher than that of the Boolean operators in Table 2-3.
Table 2-2

Python Comparison Operations

Operation   Meaning
<           Strictly less than
<=          Less than or equal
>           Strictly greater than
>=          Greater than or equal
==          Equal
!=          Not equal
is          Object identity
is not      Negated object identity

The last two Python comparison operators, is and is not, do not have direct analogs in SAS. You can think of Python’s is and is not as testing object identity, that is, whether two names refer to the same object. In other words, do both names point to the same memory location? A Python object can be thought of as a memory location holding a data value and a set of associated operations. This concept is further illustrated in Listings 2-4 and 2-5.
>>> x = 32.0
>>> y = 32
>>> if (x == y):
...   print ("True. 'x' and 'y' are equal")
... else:
...   print("False. 'x' and 'y' are not equal")
...
True. 'x' and 'y' are equal
Listing 2-4

Python Equivalence Test

In this example, x is assigned the value 32.0 and y is assigned 32. Lines 3 through 6 illustrate the Python IF/ELSE construct. As one expects x and y evaluate to the same arithmetic value. Note Python uses == to test for equality in contrast to SAS which uses =.

Listing 2-5 illustrates Python’s is identity test for x and y.
>>> x = 32.0
>>> y = 32
>>> x is y
False
Listing 2-5

Python IS Comparison

The is operator does not test if the values assigned to x and y are equivalent (we have already shown they are equivalent in Listing 2-4), rather we are testing to determine if the Python objects x and y are the same object. In other words, do x and y point to the same memory location? Listing 2-6 helps further illustrate this point.
>>> x = 32.0
>>> y = x
>>> x is y
True
Listing 2-6

Python IS Comparison 2

In the preceding example, x and y are the same object by virtue of the assignment statement
y = x
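The difference between equality and identity can also be seen with the built-in id() function, which returns an integer identifying an object for as long as that object exists. The sketch below reuses the values from the listings and adds a third name, z, for illustration.

```python
x = 32.0
y = 32
z = x                    # z is a second name for the object x refers to

print(x == y, x is y)    # equal values, two distinct objects
print(x == z, x is z)    # equal values, one object
print(id(x) == id(z))    # 'is' compares the identities that id() exposes
```

As a rule of thumb, use == to compare values and reserve is for identity checks such as testing against None.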
Let’s further examine examples of Python’s Boolean comparison operators along with contrasting SAS examples. As stated previously, empty objects in Python evaluate False. This is illustrated for strings in Listing 2-7.
>>> print(bool(''))
False
>>> print(bool(' '))
True
>>> print(bool('Arbitrary String'))
True
Listing 2-7

Boolean Tests for Empty and Non-empty Sets

The first Boolean test returns False because the string is empty (null). The second Boolean test returns True because the string contains a single blank. This is a departure from how SAS handles missing character variables. In SAS, a character variable assigned zero or more blanks (ASCII 32) is considered a missing value.

Chapter 3, “pandas Library,” goes into further detail on missing value detection and replacement.

Next let’s examine simple Boolean comparison operations. In Python Boolean comparison operations can be chained together. This is illustrated in Listing 2-8.
>>> x = 20
>>> 1 < x < 100
True
Listing 2-8

Boolean Chained Comparisons

Boolean comparisons are performed between each pair of terms. In Listing 2-8, 1 < x evaluates True and x < 100 evaluates True making the expression True.
>>> x = 20
>>> 10 < x < 20
False
Listing 2-9

Boolean Chained Comparisons 2

In Listing 2-9, 10 < x evaluates True and x < 20 evaluates False making the expression False.
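A chained comparison behaves like the individual comparisons joined with and, except that the middle term is evaluated only once. The sketch below makes the equivalence explicit; the helper function in_range is a hypothetical name used only for illustration.

```python
x = 20

chained = 10 < x <= 20
expanded = (10 < x) and (x <= 20)
print(chained, expanded)

def in_range(value, low, high):
    """Return True when low <= value <= high."""
    return low <= value <= high

print(in_range(x, 1, 100))
print(in_range(0, 1, 100))
```

This form reads much like the SAS range test `1 <= x <= 100` in an IF statement.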

A fairly common type of Boolean expression is testing for equality and inequality among numbers and strings. For Python the inequality comparison uses != for evaluation and the SAS language uses ^=. Listing 2-10 illustrates a simple example.
>>> x = 2
>>> y = 3
>>> x != y
True
Listing 2-10

Python Numeric Inequality

Listing 2-11 illustrates this same program written in SAS.
4   data _null_;
5   /* inequality comparison */
6   x = 2;
7   y = 3;
8
9   if x ^= y then
10       put 'True';
11   else put 'False';
12  run;
OUTPUT:
True
Listing 2-11

SAS Numeric Inequality

Using a _NULL_ Data Step, variables x and y are assigned numeric values 2 and 3, respectively. The IF-THEN/ELSE statement is used along with a PUT statement to write to the SAS log. Since the value 2 does not equal 3, the inequality test with ^= evaluates true and ‘True’ is written. The ELSE condition is not executed.

Further in this chapter, we will discuss strings and string formatting in more detail. Python’s Boolean tests for string equality and inequality follow the same pattern used for numerics. Listing 2-12 uses the Boolean comparison operator “==” in contrast to the SAS comparison operator “=”.
>>> s1 = 'String'
>>> s2 = 'string'
>>> s1 == s2
False
Listing 2-12

Boolean String Equality

This Boolean comparison returns False since the first character in object s1 is “S” and the first character in object s2 is “s”.

Listing 2-13 illustrates this same Python program written with SAS.
4   data _null_;
5
6    /* string equality comparison */
7    s1 = 'String';
8    s2 = 'string';
9
10   if s1 = s2 then
11      put 'True';
12   else put 'False';
13   run;
OUTPUT:
False
Listing 2-13

SAS String Equality

Using a _NULL_ Data Step, the variables s1 and s2 are assigned the character values ‘String’ and ‘string’, respectively. The IF-THEN/ELSE statement is used along with a PUT statement to write to the SAS log. Since the character variable s1 value of ‘String’ does not match the character variable s2 value of ‘string’, the IF statement evaluates false. The ELSE statement is executed, resulting in ‘False’ written to the SAS log.

IN/NOT IN

Listing 2-14 illustrates the membership operators in and not in.
>>> 'on' in 'Python is easy to learn'
True
>>> 'on' not in 'Python is easy to learn'
False
Listing 2-14

IN and NOT IN Comparisons

in evaluates True if the specified sequence is found in the target string; otherwise, it evaluates False. not in evaluates False if the specified sequence is found in the target string; otherwise, it evaluates True.

AND/OR/NOT

Python’s Boolean operations not, and, and or are listed in order of precedence in Table 2-3. The evaluation rules for each operator follow the table.
Table 2-3

Python Boolean Operations Precedence

Precedence   Operation   Results
1            not x       If x is false, then True; False otherwise.
2            x and y     If x is false, its value is returned; otherwise, y is evaluated and the resulting value is returned.
3            x or y      If x is true, its value is returned; otherwise, y is evaluated and the resulting value is returned.

The operator not yields True if its argument is false; otherwise, it yields False.

The expression x and y first evaluates x; if x is False, its value is returned; otherwise, y is evaluated and the resulting value is returned.

The expression x or y first evaluates x; if x is True, its value is returned; otherwise, y is evaluated and the resulting value is returned.
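These rules mean that and and or return one of their operands rather than always returning True or False, and that the second operand is evaluated only when needed. A minimal sketch with made-up values:

```python
print(0 and 5)               # x is false, so x is returned
print(3 and 5)               # x is true, so y is returned
print('' or 'default')       # x is false, so y is returned
print('abc' or 'default')    # x is true, so x is returned
print(not '')                # not always yields True or False

# Evaluation stops as soon as the result is known, so the
# division below is never attempted when denominator is 0.
denominator = 0
safe = denominator != 0 and (10 / denominator) > 1
print(safe)
```

The last example shows why the order of predicates matters: placing the guard first avoids a ZeroDivisionError.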

Let’s examine how Boolean operation precedence works. This is illustrated in Listing 2-15.
>>> True and False or True
True
>>> (True or False) or True
True
Listing 2-15

Boolean AND/OR Precedence

The Boolean and operator has a higher precedence than or. In the first example, True and False evaluates False. Therefore, the second evaluation becomes False or True, which evaluates True. The second example in Listing 2-15 illustrates the use of parentheses to make the grouping of the Boolean operation explicit. Parentheses have the highest precedence.

The Python Boolean and operator returns True if both predicates evaluate True. Otherwise, it returns False. Listing 2-16 tests the condition for finding both the character ‘r’ and a blank in a Python sequence (string).
>>> s3 = 'Longer String'
>>> 'r' in s3 and " " in s3
True
Listing 2-16

Python Boolean and

The same logic is shown using SAS in Listing 2-17.
4   data _null_;
5
6     /* SAS 'and' operator */
7     s3 = 'Longer String';
8
9     if findc(s3,'r') ^= 0 and findc(s3,' ') ^= 0 then
10           put 'True';
11     else put 'False';
12   run;
OUTPUT:
True
Listing 2-17

SAS Boolean AND Operator

The FINDC function searches the character variable s3 left to right for the character ‘r’. This function returns the location for the first occurrence where the character ‘r’ is found, in this case, position 6. This causes the first half of the IF predicate to evaluate true. Following AND is the second half of the IF predicate using the FINDC function to search for a blank character (ASCII 32) which is found at position 7. This predicate evaluates true. Since both IF predicates evaluate true, this results in the statement following THEN to execute and write ‘True’ to the SAS log.

The Python or Boolean operator returns True when one or the other predicate evaluates True. Listing 2-18 illustrates the Boolean or operation.
>>> s4 = 'Skinny'
>>> s5 = 'Hunger'
>>> 'y' in s4 or 'y' in s5
True
Listing 2-18

Python Boolean or

This same logic is shown using SAS in Listing 2-19.
4   data _null_;
5
6   /* Equivalent in comparison with 'or' operator */
7     s4 = 'Skinny';
8     s5 = 'hunger';
9
10   if findc(s4,'y') ^= 0 or findc(s5,'y') ^= 0 then
11      put 'True';
12   else put 'False';
13   run;
OUTPUT:
True
Listing 2-19

SAS Boolean OR

The FINDC function searches the character variable s4 left to right for the character ‘y’. This function returns the location of the first occurrence of the character ‘y’, in this case, position 6. This causes the first half of the IF predicate to evaluate true. Since the first IF predicate evaluates true, the statement following THEN executes and writes ‘True’ to the SAS log. The ELSE statement is not executed.

Numerical Precision

It is a mathematical truth that .1 multiplied by 10 produces 1 (for base-10, of course). Consider Listing 2-20.
>>> x = [.1] * 10
>>> x == 1
False
Listing 2-20

Boolean Equivalence

So how is this possible? Let’s begin by closely examining the first line of the program. x defines a Python list, a data structure containing an ordered collection of items. When the Python interpreter executes the first line of the program, the list x is expanded to contain ten items (floats), each with the value of 0.1. This is illustrated in Listing 2-21 where we use the print() function to display the list. Strictly speaking, the test x == 1 compares the whole list with the integer 1, which is False whatever the list contains; the more telling comparison is between 1 and the sum of the ten items.
>>> x = [.1] * 10
>>> print(x)
[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
Listing 2-21

Contents of List x

The result of adding the ten items together is illustrated in Listing 2-22.
>>> .1 + .1 + .1 + .1 + .1 + .1 + .1 + .1 + .1 + .1
0.9999999999999999
Listing 2-22

Python Sum Operation

As an aside, this is another example illustrating the simplicity of Python. There are no variables to declare and, in this case, no assignments are made.

The explanation for these results is how floating-point numbers are represented in computer hardware as base 2 fractions. And as it turns out 0.1 cannot be represented exactly as a base 2 fraction. It is an infinitely repeating fraction.2
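The exact value stored for the literal 0.1 can be inspected with the decimal and fractions modules from the Python Standard Library. These modules are not otherwise covered in this chapter; the sketch is offered only to make the representation visible.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(0.1) converts the stored binary value exactly, digit for digit.
print(Decimal(0.1))

# Fraction(0.1) shows the same value as a ratio whose denominator
# is a power of 2.
stored = Fraction(0.1)
print(stored)

# The stored value differs slightly from the decimal value 0.1.
print(Decimal(0.1) == Decimal('0.1'))

# The same effect appears in other sums of decimal fractions.
print(0.1 + 0.2 == 0.3)
```

Both comparisons print False, for the same reason the sum in Listing 2-22 falls just short of 1.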

Fortunately, there are straightforward remedies to this challenge. Similar to SAS, the Python Standard Library has a number of built-in numeric functions such as round().3 A built-in function means it is available to the Python interpreter and does not require the importing of any additional packages.

Python’s round() function returns a number rounded to a given precision after the decimal point. If the number of digits after the decimal is omitted from the function call or is None, the function returns the nearest integer to its input value.
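The behavior of round() with and without the second argument can be sketched as follows. The call to math.isclose() is an alternative not discussed in the text; it compares two floats within a tolerance rather than rounding first.

```python
import math

# Accumulate 0.1 ten times, one addition at a time.
total = 0.0
for _ in range(10):
    total += 0.1

print(round(total))               # digits omitted: nearest integer, returned as an int
print(round(total, 2))            # two digits after the decimal point, returned as a float
print(round(2.5), round(3.5))     # exact halves round to the nearest even integer
print(math.isclose(total, 1.0))   # tolerance-based comparison
```

Note that Python rounds exact halves to the nearest even integer, so round(2.5) returns 2 rather than 3.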

Listing 2-23 is a re-write of the Python program from Listing 2-20.
>>> nl = '\n'
>>>
>>> total = 0
>>> items = [.1] * 10
>>>
>>> for i in items:
...     total += i
...
>>> print(nl,
...       "Boolean expression: 1 == total is:       ", 1 == total,
...       nl,
...       "Boolean expression: 1 == round(total) is:", 1 == round(total),
...       nl,
...       "total is:", total,
...       nl,
...       "total type is:", type(total))
 Boolean expression: 1 == total is:        False
 Boolean expression: 1 == round(total) is: True
 total is: 0.9999999999999999
 total type is: <class 'float'>
Listing 2-23

Python round Function

The object total is an accumulator used in the for loop. The construct += as part of the accumulation is equivalent to the SAS expression
total = total + i;

The print() function contains a Boolean comparison operation similar to the one found in Listing 2-20. Without the use of the built-in round() function, this Boolean equivalency test returns False.

The round() function rounds the total object to the nearest integer. In contrast to the line above it, here the Boolean equality operator == returns True since 0.999… has been rounded to the integer value of 1.

The numerical precision issue raised here is not unique to Python. The same challenge exists for SAS, or any other language utilizing floating-point arithmetic, which is to say nearly all computer languages. Listing 2-24 uses the same logic to illustrate numerical accuracy.
4   data _null_;
5   one = 1;
6   total = 0;
7   inc = .1;
8
9   do i = 1 to 10;
10      total + inc;
11      put inc ', ' @@;
12   end;
13
14   put;
15   if total = one then put 'True';
16     else put 'Comparison of "one = total" evaluates to False';
17
18   if round(total) = one then put 'Comparison of "one = round(total)" evaluates to True';
19      else put 'False';
20
21   put 'Total is: ' total 8.3;
22   run;
OUTPUT:
0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 ,
Comparison of "one = total" evaluates to False
Comparison of "one = round(total)" evaluates to True
Total is:    1.000
Listing 2-24

SAS Round Function

This SAS program uses a DO/END loop to accumulate values into the total variable. The inc variable, set to a numeric value of 0.1, is a stand-in for the items in the Python list. The IF statement on line 15 compares the accumulated value in the variable total with the numeric variable one, which has an integer value of 1. Similar to the Python example, this IF predicate (.999… = 1) evaluates false, so the ELSE statement on line 16 executes, indicating the comparison is false.

The IF statement on line 18 uses the ROUND function to round the variable total value (.999…) to the nearest integer value (1). This IF predicate now evaluates true and writes to the log. Line 19 does not execute.

The last line of the program writes the value of the variable total using the SAS-supplied 8.3 format which displays the value 1.000. The internal representation for the variable total remains .9999999999.

Strings

In Python strings are referred to as an ordered sequence of Unicode characters. Strings are immutable, meaning they cannot be updated in place. Any method applied to a string such as replace() or split() used to modify a string returns a copy of the modified string. Strings are enclosed in either single quotes (') or double quotes (").
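Immutability can be seen in a short sketch: string methods return new objects and leave the original untouched, while an attempt to assign to a position in the string raises a TypeError. The names s and t are illustrative.

```python
s = 'Hello World'

t = s.replace('World', 'Python')   # returns a new string
print(s)                           # the original is unchanged
print(t)
print(s.split())                   # returns a list of new strings

# Assigning to an index is not permitted for strings.
error_seen = False
try:
    s[0] = 'J'
except TypeError as err:
    error_seen = True
    print('TypeError:', err)
```

To keep a modified string, assign the method’s return value to a name, as with t above.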

If a string needs to include quotes as a part of the string literal, then the backslash (\) is used as an escape character. Alternatively, like SAS, one can use a mixture of single quotes (') and double quotes (") assuming they are balanced.

Let’s start with some simple examples for both Python and their SAS analogs. Listing 2-25 illustrates a “Hello World” example.
>>> s5 = 'Hello'
>>> s6 = "World"
>>>
>>> print(s5,s6)
Hello World
>>> print(s5 + s6)
HelloWorld
>>> print('Type() for s5 is:', type(s5))
Type() for s5 is: <class 'str'>
Listing 2-25

Python String Assignment and Concatenation

Python uses the plus symbol (+) for string concatenation. SAS provisions an extensive set of functions for string concatenation to provide finer control over output appearance.

Listing 2-26 uses the CAT function to concatenate the character variables s5 and s6.
4  data _null_;
5     s5 = 'Hello';
6     s6 = 'World';
7
8  concat1 = cat(s5, s6);
9  concat2 = cat(s5, ' ', s6);
10
11  put s5= s6= /
12  concat1= /
13  concat2=;
14  run;
s5=Hello s6=World
concat1=HelloWorld
concat2=Hello World
Listing 2-26

SAS Character Assignment and Concatenation

Similar to SAS, Python has an extensive set of string manipulation methods. Listing 2-27 illustrates the Python upper() method.
>>> print(s5 + " " + s6.upper())
Hello WORLD
Listing 2-27

Python upper() Method

The SAS program in Listing 2-28 illustrates the same program logic.
4   data _null_;
5      s5 = 'Hello';
6      s6 = 'World';
7
8      upcase = cat(s5, ' ', upcase(s6));
9
10   put upcase;
11   run;
OUTPUT:
Hello WORLD
Listing 2-28

SAS UPCASE Function

Listing 2-29 illustrates the ability to create strings that preserve spacing.
>>> s7 = '''Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... Flat is better than nested.
... Sparse is better than dense.
... Readability counts, and so on...'''
>>> print(s7)
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts, and so on...
Listing 2-29

Python Multiline String

Observe how three consecutive single quotes (''') open and close the multiline string; three consecutive double quotes work equally well. A triple-quoted string, the same form used for docstrings, preserves the spacing and line breaks in the string literal.

Continuing with the Python program from Listing 2-29, Listing 2-30 illustrates the use of the count() method for counting occurrences of an excerpt (‘c’ in this case) in a target string.
>>> print('Occurrences of the letter "c":', s7.count('c'))
Occurrences of the letter "c": 6
Listing 2-30

Python count Method

The count() method illustrated in the preceding example is one of a number of methods used by Python for sophisticated string manipulation tasks. You can think of these methods as being similar to SAS functions. In the case of Python, methods are associated with and act upon a particular object. In the preceding example, s7 is a string object used to hold the value of the multiline string. For the remainder of this book, we use the Python nomenclature object rather than variable when referring to code elements assigned values.

The methods available for the built-in string object are found in the Python Standard Library 3.7 documentation at https://docs.python.org/3/library/stdtypes.html#string-methods .

String Slicing

Python uses indexing methods on a number of different objects having similar behaviors depending on the object. With a sequence of characters (string), Python automatically creates an index with a start position of zero (0) for the first character in the sequence and increments to the end position of the string (length - 1). The index can be thought of as a one-dimensional array.

The general form for Python string slicing is
string[start : stop : step]

Python string slicing is a sophisticated form of parsing. By indexing a string using offsets separated by a colon, Python returns a new object identified by these offsets. start identifies the lower-bound position of the string, which is inclusive. stop identifies the upper-bound position of the string, which is non-inclusive. step indicates every nth item, with a default value of one (1). Python also permits negative index values to count from right to left. Seeing a few examples will help to clarify.

At times you may find it easier to refer to characters toward the end of the string. Python provides an “end-to-beginning” indexer with a start position of –1 for the last character in the string and decrements to the beginning position.

Table 2-4 illustrates this concept.
Table 2-4

Python Sequence Indexing

Character               H    e    l    l    o (sp)    W    o    r    l    d
Index value             0    1    2    3    4    5    6    7    8    9   10
Index value R to L    -11  -10   -9   -8   -7   -6   -5   -4   -3   -2   -1

(sp) denotes the blank separating the two words.

A number of SAS character handling functions have modifiers to enable scanning from right to left as opposed to the default behavior of scanning left to right.

Listing 2-31 illustrates finding the first character in a sequence (index 0 position).
>>> s = 'Hello World'
>>> s[0]
'H'
Listing 2-31

Python String Slicing, Example 1

In this example, the brackets contain a single index value and no colon, so Python returns the single character at that position. The index value 0 returns the first letter of the string.

Listing 2-32 is the analog example using the SAS SUBSTR function to extract the first letter from the variable s.
4   data _null_;
5      s = 'Hello World';
6      extract = substr(s, 1, 1);
7
8    put extract;
9    run;
OUTPUT:
H
Listing 2-32

SAS SUBSTR Function

In contrast to Python, with an index start position of 0, SAS uses an index start position of 1. The SAS SUBSTR function scans the character variable s (first argument), starts at position 1 (second argument), and extracts 1 character position (third argument).

Consider another example, Listing 2-33, which has no start position and therefore defaults to 0. The stop position following the colon (:) goes up to but does not include index position 5.
>>> s = 'Hello World'
>>> s[:5]
'Hello'
Listing 2-33

Python String Slicing, Example 2

In other words, in this example, index position 5 maps to the whitespace (blank) separating the two words and is not returned.

Listing 2-34 illustrates what happens when an index start value is greater than the length of sequence being sliced.
>>> s = 'Hello World'
>>> print(len(s))
11
>>> empty = s[12:]
>>> print(empty)
>>>
>>> bool(empty)
False
Listing 2-34

Python String Slicing, Example 3

When the index start value is greater than the length of the sliced sequence, Python does not raise an error, rather, it returns an empty (null) string. Recall from the preceding discussion on Boolean comparisons that empty objects evaluate False.

Now consider Listing 2-35 illustrating the case of when the stop value for string slicing is greater than the length of the actual string.
>>> s = 'Hello World'
>>> s[:12]
'Hello World'
Listing 2-35

Python String Slicing, Example 4

When the index stop value is greater than the length of the sliced sequence, then the entire sequence is returned.

Listing 2-36 identifies the start index position 3, which is included, and the stop index position of -1 (indicating the last character in the sequence), which is not included.
>>> s = 'Hello World'
>>> s[3:-1]
'lo Worl'
Listing 2-36

Python String Slicing, Example 5

Since the stop index position is not inclusive, the last character in the sequence is not included.

If we want to include the last letter in this sequence, then we would leave the stop index value blank. Listing 2-37 illustrates how to return a sequence beginning at start position 3 to the end of the sliced sequence.
>>> s = 'Hello World'
>>> s[3:]
'lo World'
Listing 2-37

Python String Slicing, Example 6

Listing 2-38 illustrates scanning a sequence from right to left.
>>> s = 'Hello World'
>>> s[-11]
'H'
>>> s[-12]
IndexError: string index out of range
Listing 2-38

Python String Slicing, Example 7

With the first slice operation, because there is a single index value, it defaults to the start value. With a negative value, the slice operation begins at the end of the sequence and proceeds right to left decrementing the index value by 1 (assuming the step value remains the default value of 1).

In the second slice operation, a negative start value larger than the sequence length to be sliced is out of range and therefore raises an IndexError.
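The listings above leave the step value at its default of 1. The following sketch, using the same string, shows a step other than 1 along with a few equivalent ways of reaching the same characters.

```python
s = 'Hello World'

print(s[0:5])     # start 0 (inclusive) up to stop 5 (non-inclusive)
print(s[6:])      # start 6 to the end of the string
print(s[-5:])     # the last five characters, counted from the right
print(s[::2])     # every second character, starting at index 0
print(s[::-1])    # a step of -1 walks the string right to left
```

A step of -1 with no start or stop is a common idiom for reversing a sequence.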

Listing 2-39 illustrates use of the backslash (\) to escape the single quote (') so that it becomes a part of the returned sequence.
>>> q = 'Python\'s capabilities'
>>> print(q)
Python's capabilities
>>> q1 = "Python's features"
>>> print(q1)
Python's features
Listing 2-39

Python String Quoting

Formatting

In the day-to-day work of data analysis, a good deal of energy is devoted to formatting of numerics and strings in our reports and analysis. We often need values used in our program output formatted for a more pleasing presentation. This often includes aligning text and numbers, adding symbols for monetary denominations, and mapping numerics into character strings. In this section we introduce the basics of Python string formatting. Throughout the rest of the book, we will encounter additional examples.

In the preceding examples, we saw illustrations of basic string manipulation methods. Python also provisions string formatting through both function calls and methods.

Formatting Strings

Formatting Python strings involves defining a string constant containing one or more format codes. The format codes are replacement fields enclosed by curly braces ({ }). Anything not contained in a replacement field is considered literal text, which is unchanged on output. The format arguments to be substituted into the replacement fields can be either keyword arguments (e.g., {gender}) or positional arguments (e.g., {0}, {1}).

A simple example for calling the format() method with a positional argument is illustrated in Listing 2-40.
>>> 'The subject\'s gender is {0}'.format("Female")
"The subject's gender is Female"
Listing 2-40

Format Method with a Positional Argument

The argument "Female" passed to the format() method is substituted into the replacement field designated by {0} inside the string constant. Also notice the use of the backslash (\) to escape the single quote so that it is treated as the possessive apostrophe in subject's rather than as the end of the string literal.

Format specifications, introduced by a colon (:), are used to further enhance and control output appearances. This is illustrated in Listing 2-41.
>>> 'The subject\'s gender is {0:>10}'.format("Female")
"The subject's gender is     Female"
Listing 2-41

Format Method Specification

In Listing 2-41, the format specification in the replacement field uses the alignment option {0:>10} to force the replacement field to be right aligned with a width of ten characters. By default the field width is the same size as the string used to fill it. In subsequent examples we use this same pattern for format specifications to control the field width and appearances of numerics.
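As a brief sketch of the other alignment options that follow the same pattern, the example below pads the same value on the right, on the left, and on both sides. The square brackets are only there to make the padding visible, and the variable names are illustrative.

```python
value = "Female"

left = "[{:<10}]".format(value)     # left aligned, padded on the right
right = "[{:>10}]".format(value)    # right aligned, padded on the left
center = "[{:^10}]".format(value)   # centered, padded on both sides
filled = "[{:*>10}]".format(value)  # right aligned, padded with '*'

for text in (left, right, center, filled):
    print(text)
```

A fill character, when supplied, is placed immediately before the alignment symbol.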

Listing 2-42 illustrates multiple positional arguments. Further, these positional arguments can be called in any order.
>>> scale = 'Ratings are: {0} {1} or {2}'
>>> scale.format('1. Agree', '2. Neutral', '3. Disagree')
'Ratings are: 1. Agree 2. Neutral or 3. Disagree'
>>>
>>> scale = 'Ratings are: {2} {0} {1}'
>>> scale.format('1. Agree', '2. Neutral', '3. Disagree')
'Ratings are: 3. Disagree 1. Agree 2. Neutral'
Listing 2-42

Format Method with Positional Arguments

The following syntax calls the string format() method with three positional arguments:
scale.format('1. Agree', '2. Neutral', '3. Disagree')
The format() method also accepts keyword= arguments as illustrated in Listing 2-43.
>>> location = 'Subject is in {city}, {state} {zip}'
>>> location.format(city='Denver', state="CO", zip="80218")
'Subject is in Denver, CO 80218'
Listing 2-43

Format Method with Keyword Arguments

Combining positional and keyword arguments together is illustrated in Listing 2-44.
>>> location = 'Subject is in {city}, {state}, {0}'
>>> location.format(80218, city="Denver", state="CO")
'Subject is in Denver, CO, 80218'
Listing 2-44

Combining Format Method Keyword and Positional Arguments

Notice that when combining positional and keyword arguments, the positional arguments are listed first, followed by the keyword arguments.
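Python requires positional arguments to precede keyword arguments in any call, and format() is no exception. The sketch below repeats the call from Listing 2-44, builds the same call from a tuple and a dictionary, and uses the built-in compile() function to show that the reversed order is rejected as a syntax error. The names args, fields, and bad_call are our own.

```python
template = 'Subject is in {city}, {state}, {0}'

# Positional argument first, then keyword arguments.
direct = template.format(80218, city='Denver', state='CO')

# The same call built from a tuple and a dictionary.
args = (80218,)
fields = {'city': 'Denver', 'state': 'CO'}
unpacked = template.format(*args, **fields)

# Reversing the order is rejected before the code ever runs.
bad_call = "template.format(city='Denver', state='CO', 80218)"
try:
    compile(bad_call, '<example>', 'eval')
    order_error = None
except SyntaxError as err:
    order_error = err.msg

print(direct)
print(unpacked == direct)
print(order_error)
```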

Beginning with Python 3.6, formatted string literals or f-strings were introduced as an improved method for formatting. f-strings are designated with a preceding f and curly braces containing the replacement expression. f-strings are evaluated at runtime allowing the use of any valid expression inside the string. Consider Listing 2-45.

>>> radius = 4
>>> pi     = 3.14159
>>>
>>> print("Area of a circle with radius:", radius,
...       '\n',
...   f"is: {pi * radius **2}")
Area of a circle with radius: 4
 is: 50.26544
Listing 2-45

f-string Formatting

In this example, the formula for calculating the area of a circle is enclosed within a set of curly braces ({ }). At execution time, the results are calculated and printed as a result of calling the print( ) function.
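Replacement fields in an f-string accept the same format specifications used with the format() method. As a small sketch, the example below reuses the radius and pi values from Listing 2-45 and rounds the displayed area; the names area, message, and wide are our own.

```python
radius = 4
pi = 3.14159
area = pi * radius ** 2

# Two digits after the decimal point.
message = f"Area of a circle with radius {radius} is: {area:.2f}"

# Right aligned in a field of 12 with three digits after the decimal point.
wide = f"{area:>12.3f}"

print(message)
print(repr(wide))
```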

Formatting Integers

The pattern for applying formats to integers is similar to that of strings. The main difference is that the replacement field deals with formatting numeric values. As indicated previously, some format specifications apply regardless of the data type being formatted. For example, field padding is common to all data types, whereas a comma separator (to indicate thousands) is only applied to integers and floats.

Consider Listing 2-46.
>>> int = 123456789
>>> nl  = '\n'
>>> print(nl,
...       'int unformatted:', int,
...       nl,
...       'int formatted:' , "{:,d}".format(int))
 int unformatted: 123456789
 int formatted: 123,456,789
Listing 2-46

Decimal with Thousands Separator

In this example, we pass a positional argument to the format() method along with the format specification {:,d} to indicate we want the decimal integer displayed with a comma as the thousands separator.

Listing 2-47 illustrates combining multiple format specifications to achieve the desired appearance.
>>> print("{:>10,d}\n".format(123456789),
... "{:>10,d}".format(1089))
123,456,789
      1,089
Listing 2-47

Combining Format Specifications

In this example, the format specification {:>10,d} indicates the field is right justified with a width of 10. The ,d part of the specification indicates the digits use a comma as the thousands separator. This example uses a single print() function requiring a new line indicator after the first number in order to see the effect of the alignment.
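In Listing 2-47 the first value, 123,456,789, is 11 characters wide once the commas are added, so it overflows a field width of 10 rather than being truncated. The sketch below assumes a width of 12 instead, which is wide enough for every value, and builds one line per number. The list values and the name report are illustrative.

```python
values = [123456789, 1089, 42]

# A width of 12 leaves room for the longest formatted value (11 characters).
lines = ["{:>12,d}".format(v) for v in values]
report = "\n".join(lines)

print(report)
```

Joining the formatted strings with a newline avoids the extra space that print() inserts between its arguments.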

Integers can be displayed with their corresponding octal, hexadecimal, and binary representation. This feature is illustrated in Listing 2-48.
>>> int = 99
>>> nl = '\n'
>>>
>>> print (nl,
...        'decimal:    ', int,
...        nl,
...        'hexadecimal:', "{0:x}".format(int),
...        nl,
...        'octal:      ', "{0:o}".format(int),
...        nl,
...        'binary:     ', "{0:b}".format(int))
 decimal:     99
 hexadecimal: 63
 octal:       143
 binary:      1100011
Listing 2-48

Python Displaying Different Base Values
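A few variations on these base conversions are sketched below. The # option adds the base prefix, a zero-padded width produces a fixed number of binary digits, and the built-in int() function converts a string in a given base back to an integer. This sketch uses the name value rather than int so that the built-in int() remains available.

```python
value = 99

plain_hex = "{:x}".format(value)       # hexadecimal digits only
prefixed_hex = "{:#x}".format(value)   # with the 0x prefix
prefixed_oct = "{:#o}".format(value)   # with the 0o prefix
padded_bin = "{:08b}".format(value)    # eight binary digits, zero padded

# Convert the hexadecimal string back to an integer.
round_trip = int(plain_hex, 16)

print(plain_hex, prefixed_hex, prefixed_oct, padded_bin, round_trip)
```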

The analogous SAS program is shown in Listing 2-49.
4   data _null_;
5     input int 8.;
6     int_left = left(put(int, 8.));
7     put 'int:      ' int_left /
8         'hex:      ' int hex2. /
9         'octal:    ' int  octal. /
10        'binary:   ' int binary8. /
11        'Original: ' _infile_;
12  list;
13  datalines;
OUTPUT:
int:      99
hex:      63
octal:    143
binary:   01100011
Original: 99
RULE:      ----+----1----+----2----+----3----+----4----+----5
14         99
244  ;;;;
245  run;
Listing 2-49

SAS Displaying Different Base Values

This example reads the numeric value 99 from the input data and uses a PUT statement to write this value to the log using the 8., hex2., octal., and binary8. formats.

For both Python and SAS, the default is to display integer values without leading zeros or a plus (+) sign to indicate positive integer values. Listing 2-50 illustrates how to alter these default behaviors.
>>> 'Integer 99 displayed as {:04d}'.format(99)
'Integer 99 displayed as 0099'
Listing 2-50

Python Format for Leading 0’s

The format specification {:04d} indicates leading zeros (0) are to be added to fill a field width of 4. The analogous SAS program is shown in Listing 2-51.
4  data _null_;
5    input int 8.;
6    /* display_int is a character variable */
7    display_int = put(int, z4.);
8    put 'int:      ' display_int/
9        'Original: ' _infile_;
10  list;
11  datalines;
int:      0099
Original: 99
RULE:      ----+----1----+----2----+----3----+----4----+----5
12         99
13   ;;;;
14   run;
Listing 2-51

SAS Format for Leading 0’s

This example uses the SAS-supplied z4. format shown on line 7.

To display the plus (+) sign for positive integers in Python, consider Listing 2-52.
>>> '{:+3d}'.format(99)
'+99'
Listing 2-52

Python Leading Plus Sign

The format specification {:+3d} indicates a preceding plus sign (+) using a field width of 3.
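The sign option can be compared side by side in a short sketch. The values and variable names here are illustrative only.

```python
default = "{:d}".format(99)      # no sign for positive values
plus = "{:+d}".format(99)        # always show a sign
plus_neg = "{:+d}".format(-99)   # negative values keep their minus sign
space = "{: d}".format(99)       # leading space reserved for the sign
padded = "{:+05d}".format(99)    # sign first, then zero padding to width 5

print(default, plus, plus_neg, repr(space), padded)
```

The field width counts the sign character, so a width of 5 leaves four positions for digits.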

The corresponding SAS program in Listing 2-53 illustrates creating and calling a user-defined plussign. format.
4  proc format;
5     picture plussign
6             0 - 99 = '  00' (prefix='+');
NOTE: Format PLUSSIGN has been output.
7
8  data _null_;
9     input int 8.;
10
11  put 'int:     ' int plussign. /
12      'Original: ' _infile_;
13  list;
14  datalines;
int:      +99
Original: 99
RULE:      ----+----1----+----2----+----3----+----4----+----5
15         99
16  ;;;;
17  run;
Listing 2-53

SAS Leading Plus Sign

PROC FORMAT is used to create the PICTURE format plussign., which is applied on line 11.

Formatting Floats

Consider Listing 2-54. This example illustrates a format specification for floats to display one digit after the decimal using {0:.1f} or four places after the decimal {0:.4f}. Regardless of how the value is displayed using one or four places to the right of the decimal, the internal representation of the value remains the same.
>>> "precision: {0:.1f} or {0:.4f}".format(3.14159265)
'precision: 3.1 or 3.1416'
Listing 2-54

Python Decimal Places
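The point that formatting changes only the displayed string, not the stored value, can be checked directly. This is a minimal sketch with variable names of our own choosing.

```python
x = 3.14159265

shown = "{:.1f}".format(x)   # a new string; x itself is untouched
still_same = (x == 3.14159265)

# Converting the displayed string back loses the extra digits.
from_display = float(shown)

print(shown, still_same, from_display == x)
```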

Listing 2-55 illustrates a format specification for percentages. In the case of both Python and SAS, the percent format multiplies the resulting number by 100 and places a trailing percent (%) sign.
>>> "6.33 as a Percentage of 150: {0:.2%}".format(6.33/150)
'6.33 as a Percentage of 150: 4.22%'
Listing 2-55

Python Percent Format

The analogous SAS program, in Listing 2-56, uses the SAS-supplied percent8.2 format to indicate two places after the decimal are displayed followed by a percent sign (%).
4  data _null_;
5     pct = 6.33 / 150;
6
7  put '6.33 as a percentage of 150: ' pct percent8.2;
8  run;
OUTPUT:
6.33 as a percentage of 150:   4.22%
Listing 2-56

SAS Percent Format

Datetime Formatting

Strictly speaking, datetime is not a Python built-in type; rather, it refers to the datetime module, whose classes supply objects for manipulating date and datetime values. The next examples illustrate using the strftime(format) method for date, datetime, and time handling. Python date, datetime, and time objects all support the strftime(format) method, which returns a string representing the date or time. Directives in the format argument control the appearance of that string. In other words, the strftime(format) method constructs strings from the date, datetime, and time objects rather than manipulating these objects directly.

We will cover date and time arithmetic in much more detail in Chapter 7, “Date and Time.” For now, consider Listing 2-57.
>>> from datetime import datetime, date, time
>>> now = datetime.now()
>>> print(now)
2018-08-01 12:19:47.326261
>>> print(type(now))
<class 'datetime.datetime'>
Listing 2-57

Python Import Datetime

Up to this point, all of the Python examples we have seen use only features built into the interpreter; we have not needed to rely on additional Python modules or programs. To load other Python programs or modules, we use the import statement. Here, the first line in our example imports the datetime, date, and time classes from the Python datetime module.

In our example we also create the now object on line 2. In our program, the value associated with the now object is like a snapshot of time (assuming we did not execute line 2 again).

Calling the print() function for the now object displays the date and time at which the object was created.

Consider Listing 2-58. Here we introduce formatting directives for date and time formatting.
>>> from datetime import datetime, date, time
>>>
>>> nl = '\n'
>>> now = datetime.now()
>>>
>>> print('now: ' , now,
...       nl      ,
...       'Year:' , now.strftime("%Y"),
...       nl      ,
...       'Month:' , now.strftime("%B"),
...       nl       ,
...       'Day:  ' , now.strftime("%d"),
...       nl, nl   ,
...       'concat1:'    , now.strftime("%A, %B %d, %Y A.D."),
...       nl,
...       'datetime:'   , now.strftime("%c"))
 now:   2019-02-19 17:14:17.752075
 Year:  2019
 Month: February
 Day:   19
 concat1: Tuesday, February 19, 2019 A.D.
 datetime: Tue Feb 19 17:14:17 2019
Listing 2-58

STRFTIME(format)

The strftime(format) directives are used to control the appearances of the value associated with the now object. The now object holds the datetime returned from the datetime.now() method. The same directives are also used when parsing strings into datetime objects, in addition to formatting and controlling the appearances of the output.

Also notice the nl object assigned the value '\n' used in this example. This is a new line indicator for the print() function to go to a new line, enabling a single call of the print() function to print multiple physical lines of output.

For example, the format directive %Y returns the century and year, such as 2018. Table 2-5 lists the formatting directives used in the preceding example.
Table 2-5

Formatting Directives Used in Listing 2-58

Directive    Meaning
%A           Weekday
%B           Month
%d           Day of month
%Y           Century and year
%c           Date and time
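Because datetime.now() returns a different value on every run, the directives in Table 2-5 are easier to experiment with against a fixed value. The sketch below assumes the timestamp shown in the output of Listing 2-58 and also shows that the format() method and f-strings accept the same directives. The name moment and the additional numeric directives %m, %H, %M, and %S are not part of the listing.

```python
from datetime import datetime

# A fixed datetime taken from the output of Listing 2-58.
moment = datetime(2019, 2, 19, 17, 14, 17)

iso_date = moment.strftime("%Y-%m-%d")       # year, month, day as numbers
clock = moment.strftime("%H:%M:%S")          # hours, minutes, seconds
via_format = "{:%d/%m/%Y}".format(moment)    # directives inside format()
via_fstring = f"{moment:%Y%m%d}"             # directives inside an f-string

# Weekday and month names depend on the locale in effect.
long_form = moment.strftime("%A, %B %d, %Y")

print(iso_date, clock, via_format, via_fstring)
print(long_form)
```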

Listing 2-59 is the analogous SAS program to the Python example.
4         data _null_;
5
6         * Get Year;
7         date = today();
8         year_pt = year(date);
9
10        * Get Month;
11        month_nm = put(date, monname.);
12        month_pt2 = put(date, monname3.);
13
14        * Get Day;
15        day_pt = put(day(date), z2.);
16        date2 = day(date);
17        dow1 = put(date, downame.);
18        dow2 = put(date, downame3.);
19
20        * Get time;
21        now = time();
22        tm = put(now, time8.);
23
24        * whitespace, comma, suffix;
25        ws = ' ';
26        comma = ',';
27        ad = 'A.D.';
28
29        put 'Default output: ' date ' for time ' now;
30        put 'Year: ' year_pt;
31        put 'Month: ' month_nm;
32        put 'Day: ' day_pt ;
33
34        concat1 = cat(dow1, comma, month_nm, ws, day_pt, comma, ws, year_pt, ws, ad);
35        concat2 = cat(dow2, ws, month_pt2, ws, date2, ws, tm, ws, year_pt);
36
37        put 'concat1: ' concat1;
38        put 'current datetime: ' concat2;
39        run;
OUTPUT:
Default output: 20736  for time 71660.228
Year: 2016
Month: October
Day: 09
concat1: Sunday,  October 09, 2016 A.D.
current datetime: Sun Oct 9 19:54:20 2016
Listing 2-59

SAS Datetime Example

The output for the SAS variable date, 20736, is obviously not a readable date; rather, it represents the number of days since January 1, 1960, which is the epoch start used by SAS. Likewise, the output for the SAS variable now, 71660.228, is the number of seconds since midnight, which the time8. format displays as 19:54:20.
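The arithmetic behind these raw SAS values can be reproduced in Python. The sketch below assumes the two values printed in Listing 2-59, counts the date value as days from January 1, 1960, and treats the time value as seconds since midnight. The names SAS_EPOCH, sas_date, and sas_time are our own.

```python
from datetime import datetime, timedelta

# SAS counts dates as days from January 1, 1960.
SAS_EPOCH = datetime(1960, 1, 1)

sas_date = 20736       # value of the SAS variable date
sas_time = 71660.228   # value of the SAS variable now (seconds since midnight)

# Add the day count and the seconds to the epoch start.
moment = SAS_EPOCH + timedelta(days=sas_date, seconds=sas_time)

as_date = moment.strftime("%Y-%m-%d")
as_time = moment.strftime("%H:%M:%S")

print(as_date, as_time)
```

The result agrees with the formatted date and time in the SAS output, October 9, 2016 at 19:54:20.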

Summary

In this chapter we covered Python types and basic formatting. We saw that both Python and SAS variable assignments can be made without declarations, since types are inferred rather than declared. We also introduced Python string slicing and formatting. Throughout the remainder of the book, we will build on these concepts.

Up to this point, we have discussed various features from the Python Standard Library related to data analysis. Chapter 3, “pandas Library,” describes the pandas data structure. The pandas library is a “higher-level” capability which makes Python an outstanding language for conducting real-world data analysis tasks.
