© Randy Betancourt, Sarah Chen 2019
R. Betancourt, S. Chen, Python for SAS Users, https://doi.org/10.1007/978-1-4842-5001-3_6

6. pandas Readers and Writers

Randy Betancourt1  and Sarah Chen2
(1)
Chadds Ford, PA, USA
(2)
Livingston, NJ, USA
 

In this chapter we discuss methods for reading data from a range of input data sources such as comma-separated values (.csv) files, database tables, JSON, and other sources of input to create DataFrames. The pandas readers are a collection of input/output methods for writing and loading values into DataFrames. These input/output methods are analogous to the family of SAS/Access Software used by SAS to read data values into SAS datasets and write SAS datasets into target output formats.

The main pandas facilities for input and output to DataFrames are the readers, such as read_csv(), read_json(), and read_sql_table(), which we cover in detail. These readers use similar syntax. The first argument is the path or location of the target, which can be a fully qualified filename, a relative filename (relative to the current working directory executing Python), a database table, a URL (including HTTPS, SFTP, and S3 access methods), and so on. In many cases, the defaults for the readers’ arguments are enough for read/write operations.

The reader methods have arguments to specify:
  • Input and output locations

  • Column and index location and names

  • Parsing rules for handling incoming data

  • Missing data rules

  • Datetime handling

  • Quoting rules

  • Compression and file formats

  • Error handling

We start by examining a series of use cases for reading .csv files, beginning with simple examples followed by additional examples combining various arguments to enable a more complex set of parsing rules.

Reading .csv Files

Consider Listing 6-1.
>>> import pandas as pd
>>> url = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/messy_input.csv"
>>> df1 = pd.read_csv(url, skiprows=2)
>>> print(df1)
     ID       Date   Amount  Quantity   Status Unnamed: 5
0    42  16-Oct-17  $23.99      123.0   Closed     Jansen
1  7731  15-Jan-17  $49.99        NaN  Pending        Rho
2  8843   9-Mar-17      129      45.0      NaN      Gupta
3  3013  12-Feb-17               15.0  Pending   Harrison
4  4431   1-Jul-17  $99.99        1.0   Closed       Yang
>>> print(df1.dtypes)
ID              int64
Date           object
Amount         object
Quantity      float64
Status         object
Unnamed: 5     object
dtype: object
Listing 6-1

Basic .csv Read

In this example, the call to the read_csv() reader uses two arguments. The URL object holds the string value for the input location to be read followed by the skiprows= argument. As you can see from Figure 6-1, the column headers are on row 3.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig1_HTML.jpg
Figure 6-1

.csv File Input

Also observe how the ID values for column “A” are left aligned which is the default for character strings in Excel. Column “F” is missing its header label and there are several missing values.

Listing 6-1 displays the column types. These types are inferred from the input data read from the .csv file. The read_csv() reader in this example maps the ID column to an int64 type despite the fact that these values are stored as strings in the .csv file. In subsequent examples we will see how to explicitly control type mappings when reading values into DataFrames or unloading values from DataFrames.

Listing 6-1 displays the type for the Date column as object indicating these values are stored in the df1 DataFrame as strings.

The read_csv() reader provisions the na_values= argument to define how missing values are handled on input. The na_values= argument allows the definition of any value to represent missing for both character and numeric column types. This feature is illustrated in Listing 6-2.
>>> miss = {'Amount' : [' ', 'NA']}
>>> df2 = pd.read_csv(url, skiprows=2, na_values=miss)
>>> print(df2)
     ID       Date   Amount  Quantity   Status Unnamed: 5
0    42  16-Oct-17  $23.99      123.0   Closed     Jansen
1  7731  15-Jan-17  $49.99        NaN  Pending        Rho
2  8843   9-Mar-17      129      45.0      NaN      Gupta
3  3013  12-Feb-17      NaN      15.0  Pending   Harrison
4  4431   1-Jul-17  $99.99        1.0   Closed       Yang
>>> print(df2.dtypes)
ID              int64
Date           object
Amount         object
Quantity      float64
Status         object
Unnamed: 5     object
dtype: object
>>>
>>> df1[3:4]
     ID       Date Amount  Quantity   Status Unnamed: 5
3  3013  12-Feb-17             15.0  Pending   Harrison
>>> df2[3:4]
     ID       Date Amount  Quantity   Status Unnamed: 5
3  3013  12-Feb-17    NaN      15.0  Pending   Harrison
Listing 6-2

Setting na_values for read_csv()

In this example, the syntax
miss = {'Amount' : [' ', 'NA']}

creates the miss object as a Dictionary key:value pair where the key value Amount is the DataFrame column name and the values blank (' ') and NA designate both as missing values. The na_values= parameter uses this miss Dictionary to map values to missing when they are encountered in the Amount column.

In Listing 6-1, the value for the Amount column on row 3 in the df1 DataFrame is blank
>>> df1[3:4]
     ID       Date Amount  Quantity   Status Unnamed: 5
3  3013  12-Feb-17             15.0  Pending   Harrison
whereas in Listing 6-2, the value for the Amount column on row 3 in the df2 DataFrame is NaN. These values are displayed with the row-slicing operations
>>> df2[3:4]
     ID       Date Amount  Quantity   Status Unnamed: 5
3  3013  12-Feb-17    NaN      15.0  Pending   Harrison

The read_csv() reader has two parameters to explicitly set column types: the dtype= and the converters= arguments. Both accept Dictionary key:value pairs with the keys identifying the target column(s). For dtype=, the values specify how to treat the incoming values, for example, as strings or numeric types. For converters=, the values name a conversion function which in turn maps the data onto the desired column type based on the function’s logic, for example, parsing a string value to be read as a datetime. The converters= argument takes precedence over the dtype= argument in the event both are used together.
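As a minimal sketch of that precedence rule (reusing the url object from Listing 6-1), supplying both arguments for the ID column results in the converter winning and the dtype= entry being ignored, typically with a ParserWarning:
df_prec = pd.read_csv(url, skiprows=2,
                      dtype={'ID' : 'int64'},
                      converters={'ID' : str})
print(df_prec['ID'].dtype)       # object -- the str converter wins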

Listing 6-3 illustrates using the dtype= argument to map the ID column type to object, in other words a sequence of characters.
>>> df3 = pd.read_csv(url, skiprows=2, na_values=miss, dtype={'ID' : object})
>>> print(df3)
     ID       Date   Amount  Quantity   Status Unnamed: 5
0  0042  16-Oct-17  $23.99      123.0   Closed     Jansen
1  7731  15-Jan-17  $49.99        NaN  Pending        Rho
2  8843   9-Mar-17      129      45.0      NaN      Gupta
3  3013  12-Feb-17      NaN      15.0  Pending   Harrison
4  4431   1-Jul-17  $99.99        1.0   Closed       Yang
>>> print(df3.dtypes)
ID             object
Date           object
Amount         object
Quantity      float64
Status         object
Unnamed: 5     object
dtype: object
>>>
>>> print(df2['ID'].dtype)
int64
>>> print(df3['ID'].dtype)
object
Listing 6-3

Explicit Type Mapping for read_csv() Reader

In this example, the ID column in the df3 DataFrame has a type of object compared to the ID column type in the df2 DataFrame which is int64.

The converters= argument permits a function to convert the incoming data, with the function results determining the column’s type. In Listing 6-4 which follows, regardless of whether the incoming values for the ID column are numeric or character, the str converter stores them in the target DataFrame as strings.

To further illustrate this point, consider Listing 6-4. Here we define the strip_sign() function to remove the dollar sign ($) from the incoming values for the Amount column. The converters= argument contains the Dictionary key:value pair with the key identifying the Amount column and the corresponding value naming the converter function, in this case, strip_sign.
>>> import math
>>>
>>> def strip_sign(x):
...         y = x.strip()
...         if not y:
...             return math.nan
...         else:
...             if y[0] == '$':
...                 return float(y[1:])
...             else:
...                 return float(y)
...
>>> df4 = pd.read_csv(url, skiprows=2, converters={'ID' : str, 'Amount': strip_sign})
>>>
>>> print(df4,
... ' ', ' ',
... 'Amount dtype:', df4['Amount'].dtype)
     ID       Date  Amount  Quantity   Status Unnamed: 5
0  0042  16-Oct-17   23.99     123.0   Closed     Jansen
1  7731  15-Jan-17   49.99       NaN  Pending        Rho
2  8843   9-Mar-17  129.00      45.0      NaN      Gupta
3  3013  12-Feb-17     NaN      15.0  Pending   Harrison
4  4431   1-Jul-17   99.99       1.0   Closed       Yang
 Amount dtype: float64
Listing 6-4

read_csv() with converters=

The strip_sign() function returns a float or a NaN when the incoming value is missing. Because the strip_sign() function returns numeric values, the resulting type for the Amount column in the df4 DataFrame is float64.

Listing 6-5 illustrates mapping the ID column as a string and setting this column as the DataFrame index at the end of the read/parse operation. Chapter 4, “Indexing and GroupBy,” contains details for working effectively with DataFrame indexes.
>>> df5 = pd.read_csv(url, skiprows=2, na_values=miss, converters={'ID' : str}).set_index('ID')
>>> print(df5)
           Date   Amount  Quantity   Status Unnamed: 5
ID
0042  16-Oct-17  $23.99      123.0   Closed     Jansen
7731  15-Jan-17  $49.99        NaN  Pending        Rho
8843   9-Mar-17      129      45.0      NaN      Gupta
3013  12-Feb-17      NaN      15.0  Pending   Harrison
4431   1-Jul-17  $99.99        1.0   Closed       Yang
>>> print(df5.dtypes)
Date           object
Amount         object
Quantity      float64
Status         object
Unnamed: 5     object
dtype: object
>>>
>>> df5.loc['0042']
Date          16-Oct-17
Amount          $23.99
Quantity            123
Status           Closed
Unnamed: 5       Jansen
Name: 0042, dtype: object
Listing 6-5

read_csv() with set_index()

With the ID column as the index, rows can be sliced with their labeled values:
df5.loc['0042']
Date          16-Oct-17
Amount          $23.99
Quantity            123
Status           Closed
Unnamed: 5       Jansen
Name: 0042, dtype: object

Date Handling in .csv Files

A common requirement when reading input data is preserving date and datetime values rather than treating the values as strings. This can be done via the converters= argument, but the read_csv() reader also has specialized built-in parameters for datetime handling. In most cases, the default datetime parser simply needs to know which column, or list of columns, composes the input date or datetime values, supplied with the parse_dates= argument. In those cases where date or datetime values are non-standard, the companion date_parser= argument accepts a user-defined function to apply custom date and datetime formatting instructions.
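As a hedged sketch of the non-standard case (the dates.csv file and its day-first layout are hypothetical, and date_parser= is deprecated in pandas 2.0 and later):
from datetime import datetime

parse_dmy = lambda s: datetime.strptime(s, '%d.%m.%Y')   # e.g. '16.10.2017'
df_dates = pd.read_csv('dates.csv', parse_dates=['Date'], date_parser=parse_dmy)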

Listing 6-6 illustrates the parse_dates= argument for the read_csv() method.
>>> miss = {'Amount' : [' ', 'NA']}
>>> df5 = pd.read_csv(url, skiprows=2, na_values=miss, converters={'ID' : str}).set_index('ID')
>>>
>>> df6 = pd.read_csv(url, skiprows=2, na_values=miss, converters={'ID' : str}, parse_dates=['Date']).set_index('ID')
>>> print(df6)
           Date   Amount  Quantity   Status Unnamed: 5
ID
0042 2017-10-16  $23.99      123.0   Closed     Jansen
7731 2017-01-15  $49.99        NaN  Pending        Rho
8843 2017-03-09      129      45.0      NaN      Gupta
3013 2017-02-12      NaN      15.0  Pending   Harrison
4431 2017-07-01  $99.99        1.0   Closed       Yang
>>>
>>> print(df6.dtypes)
Date          datetime64[ns]
Amount                object
Quantity             float64
Status                object
Unnamed: 5            object
dtype: object
>>>
>>> print(df5['Date'].dtype)
object
>>> print(df6['Date'].dtype)
datetime64[ns]
Listing 6-6

Parsing Dates with parse_dates=

The parse_dates= parameter accepts a Python list of column names, in this case, the Date column. In Listing 6-5, the Date column in the df5 DataFrame has the default object type as shown:
>>> print(df5['Date'].dtype)
object
In contrast, with this example, the Date column is now mapped as a datetime64 column type as a result of using the parse_dates= argument.
>>> print(df6['Date'].dtype)
datetime64[ns]

Chapter 7, “Date and Time,” covers the details on date and datetime types and how they are manipulated.

Listing 6-7 illustrates the ability to use custom labels for column headings.
>>> cols=['ID', 'Trans_Date', 'Amt', 'Quantity', 'Status', 'Name']
>>> df7 = pd.read_csv(url, skiprows=3, na_values=miss,
       converters={'ID' : str},
       parse_dates=['Trans_Date'], header=None, names=cols,
       usecols=[0, 1, 2, 3, 4, 5]).set_index('ID')
>>> print(df7)
     Trans_Date      Amt  Quantity   Status      Name
ID
0042 2017-10-16  $23.99      123.0   Closed    Jansen
7731 2017-01-15  $49.99        NaN  Pending       Rho
8843 2017-03-09      129      45.0      NaN     Gupta
3013 2017-02-12               15.0  Pending  Harrison
4431 2017-07-01  $99.99        1.0   Closed      Yang
>>> print(df7.dtypes)
Trans_Date    datetime64[ns]
Amt                   object
Quantity             float64
Status                object
Name                  object
dtype: object
Listing 6-7

Controlling Column Heading

The example begins by defining a Python list of column headings in the order in which they are read from the .csv file with the syntax
cols=['ID', 'Trans_Date', 'Amt', 'Quantity', 'Status', 'Name']

The names= argument accepts the cols object to label the DataFrame columns. You must set the header= argument to None to inform the reader not to use the column labels found in the .csv file. Next, the usecols= argument specifies the input columns to be read. Remember, Python uses a 0-based index for start positions, so column A in the .csv file maps to column 0 in the DataFrame, column B in the .csv file maps to column 1 in the DataFrame, and so on.

Finally, notice that since we are not reading the column labels found in the .csv file, the skiprows= argument is set to 3 for this example, which positions the reader at the first line of data values (line 4 in the .csv file). Since Python uses a 0-based index as the start position, row 1 in the .csv file maps to row 0 in the DataFrame, row 2 in the .csv file maps to row 1 in the DataFrame, and so on.

Read .xls Files

As you might expect, pandas provides the read_excel() reader for reading Microsoft Excel files. The parameters and arguments for the read_excel() reader are similar to those of the read_csv() reader.

Take a look at Listing 6-8.
df8 = pd.read_excel('C:\data\messy_input.xlsx', sheet_name="Trans1", skiprows=2,
                    converters={'ID' : str},
                    parse_dates={'Date' : ['Month', 'Day', 'Year']},
                    keep_date_col=True).set_index('ID')
print(df8)
print(df8.dtypes)
Listing 6-8

Read .xls File into DataFrame

The results are rendered in Figure 6-2, executed in a Jupyter notebook. In order to execute this example, download the messy_input.xlsx file from

https://github.com/RandyBetancourt/PythonForSASUsers/blob/master/data/messy_input.xlsx

and copy it into a location where you have read/write access on the local filesystem. In this example, we chose “C:\data\messy_input.xlsx”.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig2_HTML.jpg
Figure 6-2

Read .xls File DataFrame Output

In this example, we use the parse_dates= parameter to define the Date column with the syntax
parse_dates={'Date' :['Month', 'Day', 'Year']}

The parse_dates= argument accepts a Dictionary key:value pair where the key is the name of the output datetime DataFrame column being defined, in this case, Date. The Dictionary value is a list naming the input columns that are combined to form the Date column values. In this case, the Month, Day, and Year columns are parsed together to define the Date column.

We also illustrate keeping the Year, Month, and Day columns with the syntax
keep_date_col=True

The keep_date_col= argument determines the keep disposition for the columns used with the parse_dates= argument. In this case, Year, Month, and Day columns are kept on the df8 DataFrame.

It is not uncommon to read multiple Excel files together to form a single DataFrame. Listing 6-9 illustrates this use case.

Consider sales transactions for each month stored in separate Excel files, each with an identical layout. In this example, we introduce the glob module from the Python Standard Library. The glob module finds all the filenames matching a specified pattern based on the rules used by the Unix shell. This is useful when you need to search for filenames using string expressions and wildcard characters, as sketched below.
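A brief, hedged sketch of the pattern rules (the filenames shown are illustrative): '*' matches any run of characters within a single path component, '?' matches exactly one character, and '[...]' matches a character set.
import glob

glob.glob(r'C:\data\*_2018.xlsx')    # e.g. January_2018.xlsx, February_2018.xlsx
glob.glob(r'C:\data\Sales_Q?.csv')   # e.g. Sales_Q1.csv, Sales_Q2.csv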

The three .xlsx files for this example are located at

https://github.com/RandyBetancourt/PythonForSASUsers/tree/master/data

and in this case are copied to the folder C:\data.
>>> import glob
>>> input = glob.glob('C:\data\*_2*.xlsx')
>>> print(input)
['C:\data\February_2018.xlsx', 'C:\data\January_2018.xlsx', 'C:\data\March_2018.xlsx']
>>>
>>> final = pd.DataFrame()
>>> for f in glob.glob('C:\data\*_2*.xlsx'):
...     df = pd.read_excel(f, converters={'ID' : str}).set_index("ID")
...     final = final.append(df, ignore_index=False, sort=False)
...
>>> print(final)
      Amount  Quantity
ID
1042   99.89        21
3311   59.99        12
9846   12.99        25
2222   19.19       115
8931   79.99         2
0044  199.89        10
8731   49.99         2
7846  129.00        45
1111   89.19        15
2231   99.99         1
0002   79.89        43
2811   19.99        19
8468  112.99        25
3333  129.99        11
9318   69.99        12
Listing 6-9

Appending Multiple .xls Files into a DataFrame

The input object is defined by calling the glob module to search the filenames using the pattern
'C:\data\*_2*.xlsx'
This call returns a Python list of filenames matching our pattern subsequently displayed by calling the first print() method. The files to append together are
['C:\data\February_2018.xlsx', 'C:\data\January_2018.xlsx', 'C:\data\March_2018.xlsx']
In order to read these .xls files in a single operation, an empty DataFrame called final is defined with the syntax
final = pd.DataFrame()

Next, a for loop calling the glob module returns the Python list of our three input files; for each filename, the read_excel() reader reads the current .xlsx file into the df DataFrame. Finally, the append() method of the final DataFrame is called, appending each df DataFrame to the end of the final DataFrame. Also notice that inside the for loop, the set_index() method maps the ID column as the DataFrame index.
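Note that in recent pandas releases the DataFrame append() method has been removed; the same result can be produced by collecting the DataFrames in a list and calling pd.concat() once, as in this sketch (assuming the same file pattern as above):
import glob
import pandas as pd

frames = []
for f in glob.glob(r'C:\data\*_2*.xlsx'):
    # read each workbook, keeping ID as a string index as before
    frames.append(pd.read_excel(f, converters={'ID' : str}).set_index('ID'))

final = pd.concat(frames, sort=False)   # one concatenation instead of repeated appends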

An alternative to retrieving a Python list of fully qualified pathnames for this type of iterative processing is retrieving input filenames relative to the current working directory executing your Python session. This involves importing the os module from the Python Standard Library to first determine the location of the current working directory, followed by a call to change the current working directory to a new location. This operation is illustrated in Listing 6-10.
>>> import os
>>>
>>> wd = os.getcwd()
>>> print (wd)
C:\Users\randy\Desktop\Py_Source
>>> os.chdir('C:\data')
>>> wd = os.getcwd()
>>> print(wd)
C:\data
>>> csv_files = glob.glob('*.csv')
>>> print(csv_files)
['dataframe.csv', 'dataframe2.csv', 'Left.csv', 'messy_input.csv', 'Right.csv', 'Sales_Detail.csv', 'School_Scores.csv', 'scores.csv', 'Tickets.csv']
Listing 6-10

Absolute vs. Relative Filenames

The syntax
wd = os.getcwd()
creates the wd object holding the value for the location of the current working directory executing the Python session, in this case
C:\Users\randy\Desktop\Py_Source
The syntax
os.chdir('C:\data')
calls the os.chdir() function and changes the working directory for the current Python session to
C:\data

The call to the glob module uses *.csv as an argument to define the csv_files object holding a Python list of all the filenames in the current directory having the filename extension .csv.

A similar approach for reading and appending multiple .csv files with SAS is illustrated in Listing 6-11. The .csv files in this example can be downloaded from https://github.com/RandyBetancourt/PythonForSASUsers/tree/master/data

The SAS FILENAME statement calls the FILENAME PIPE access method to return the fully qualified filenames of the .csv files found in the directory
c:\data
1    filename tmp pipe 'dir "c:\data\*_2*.csv" /s/b';
2
3    data _null_;
4      infile tmp;
5      length file_name $ 128;
6      input file_name $;
7
8      length imported $500;
9      retain imported ' ';
10
11     imported = catx(' ',imported,cats('temp',put(_n_,best.)));
12     call symput('imported',strip(imported));
13
14     call execute('proc import datafile="'
15                  || strip(file_name)
16                  || '" out=temp'
17                  || strip(put(_n_,best.))
18                  || ' dbms=csv replace; run;'
19                 );
20   run;
NOTE: The infile TMP is:
      Unnamed Pipe Access Device,
      PROCESS=dir "c:\data\*_2*.csv" /s/b,RECFM=V,
      LRECL=32767
NOTE: 3 records were read from the infile TMP.
      The minimum record length was 22.
      The maximum record length was 25.
5 rows created in WORK.TEMP1 from c:\data\February_2018.csv.
5 rows created in WORK.TEMP2 from c:\data\January_2018.csv.
5 rows created in WORK.TEMP3 from c:\data\March_2018.csv.
94   data final;
95      set &imported;
96      format amount dollar8.2;
97   run;
NOTE: There were 5 observations read from the dataset WORK.TEMP1.
NOTE: There were 5 observations read from the dataset WORK.TEMP2.
NOTE: There were 5 observations read from the dataset WORK.TEMP3.
NOTE: The dataset WORK.FINAL has 15 observations and 3 variables.
Listing 6-11

Appending Multiple .csv Files to a Dataset

The _null_ Data Step reads the list of filenames returned by the FILENAME PIPE statement and assigns these fully qualified filename values to the file_name variable. The CALL SYMPUT statement creates the Macro variable &imported. On each iteration of the Data Step, it concatenates the values “temp1”, “temp2”, and so on. This Macro variable value becomes the list of input datasets used on line 95 to create the output dataset final.

The syntax
call execute('proc import datafile="'
                  || strip(file_name)
                  || '" out=temp'
                  || strip(put(_n_,best.))
                  || ' dbms=csv replace; run;'
                 );

builds the PROC IMPORT statement and corresponding options as a parameter to the CALL EXECUTE statement. The CALL EXECUTE statement executes the call to PROC IMPORT three times, one for each record read from the FILENAME PIPE statement.

Figure 6-3 uses PROC PRINT to display the contents of the final dataset.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig3_HTML.jpg
Figure 6-3

Final Dataset Output

Write .csv Files

There are cases when you need to export the contents of a DataFrame to a .csv file or an .xls file. In those cases, use the to_csv() writer and the to_excel() writer. The to_csv() writer is illustrated in Listing 6-12.
>>> final.to_csv('C:\data\final.csv', header=True)
Listing 6-12

Output DataFrame to .csv

The writer does not return any information to the console to indicate the operation’s success. Figure 6-4 displays the first five rows of the output file, in this case, C:\data\final.csv.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig4_HTML.jpg
Figure 6-4

Results from to_csv()
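The to_csv() writer also accepts arguments to control the layout of the output file; the following is a hedged sketch with illustrative values (the final_custom.csv filename is hypothetical):
final.to_csv(r'C:\data\final_custom.csv',
             sep='|',              # use a pipe delimiter instead of a comma
             na_rep='NA',          # write NA for missing values
             float_format='%.2f',  # two decimal places for floating-point columns
             index=True,           # keep the ID index as the first column
             header=True)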

Use PROC EXPORT to write the contents of a SAS dataset to a .csv file. This capability is illustrated in Listing 6-13.
4  filename out_csv "c:\data\final_ds.csv";
5  proc export data = final
6              outfile = out_csv
7              dbms = csv;
8  run;
NOTE: The file OUT_CSV is:
      Filename=c:\data\final_ds.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=0,
      Last Modified=12Nov2018:10:13:47,
      Create Time=12Nov2018:10:13:47
NOTE: 16 records were written to the file OUT_CSV.
      The minimum record length was 11.
      The maximum record length was 18.
NOTE: There were 15 observations read from the dataset WORK.FINAL.
15 records created in OUT_CSV from FINAL.
NOTE: "OUT_CSV" file was successfully created.
Listing 6-13

Output SAS Dataset to .csv

The SAS log indicates the location and physical file characteristics of the output .csv along with the number of records successfully written.

Write .xls Files

You can write a DataFrame to an Excel sheet, or multiple DataFrames to a book of sheets, using the to_excel() writer. The arguments are largely the same as the to_csv() writer. This capability is illustrated in Listing 6-14.
>>> final.to_excel('C:\data\final.xls', merge_cells=False)
Listing 6-14

Output DataFrame to .xls File

The results from the to_excel() operation are displayed in Figure 6-5.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig5_HTML.jpg
Figure 6-5

Results from to_excel()
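To write multiple DataFrames into a single workbook, the to_excel() writer can be combined with pd.ExcelWriter; a minimal sketch, assuming an Excel engine such as openpyxl is installed and using an illustrative output filename and sheet names:
with pd.ExcelWriter(r'C:\data\final_book.xlsx') as writer:
    final.to_excel(writer, sheet_name='Summary')
    df8.to_excel(writer, sheet_name='Transactions')   # df8 from Listing 6-8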

There are multiple ways to output a SAS dataset to .xls files. PROC EXPORT provides a convenient approach; however, it limits the format control you can exert on the appearance of the output .xls file. Alternatively, if you want much finer control over the output format in Excel, the ODS tagsets.ExcelXP destination can be used. Listing 6-15 illustrates calling tagsets.ExcelXP to write the contents of a SAS dataset to an Excel file.
4 ods tagsets.ExcelXP
5     file="c:\data\final_ds.xls"
6     style=statistical
7     options(frozen_headers='1'
8                  embedded_titles='yes'
9                  default_column_width='18');
NOTE: Writing TAGSETS.EXCELXP Body file: c:\data\final_ds.xls
NOTE: This is the Excel XP tagset (Compatible with SAS 9.1.3 and above, v1.131, 04/23/2015). Add
options(doc='help') to the ods statement for more information.
10
11 proc print data=final;
12 run;
NOTE: There were 15 observations read from the dataset WORK.FINAL.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.08 seconds
      cpu time            0.01 seconds
13 ods tagsets.excelxp close;
14 run;
Listing 6-15

Output Final Dataset to .xls

Note that if you are using SAS Display Manager to generate this example, you may need to disable the “View results as they are generated” feature. To do so, go to the Tools menu in Display Manager, select Options ➤ Preferences, and then select the Results tab. Uncheck the box labeled “View results as they are generated” as shown in Figure 6-6.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig6_HTML.jpg
Figure 6-6

SAS Preferences Menu

Read JSON

JSON stands for JavaScript Object Notation and is a well-defined structure for exchanging data among different applications. JSON is designed to be read by humans and easily parsed by programs. It is relied upon to transmit data through RESTful web services and APIs.

The flexibility of calling a REST API to acquire data is illustrated in Listing 6-16. This example creates the jobs DataFrame by calling the GitHub Jobs API over HTTPS using the read_json() reader to return a set of posted positions along with a detailed job description.

Details on the GitHub Jobs API are at

https://jobs.github.com/api
>>> jobs = pd.read_json("https://jobs.github.com/positions.json?description=python")
>>> print(jobs.shape)
(41, 11)
>>> print(jobs.columns)
Index(['company', 'company_logo', 'company_url', 'created_at', 'description',
       'how_to_apply', 'id', 'location', 'title', 'type', 'url'],
      dtype='object')
>>> print(jobs[['company', 'location']].head(5))
                       company                  location
0                       Sesame          New York; Berlin
1                     BlueVine          Redwood City, Ca
2          New York University  New York, New York 10001
3      University of Cambridge             Cambridge, UK
4  Norwegian Block Exchange AS    Oslo, Norway or Remote
Listing 6-16

pandas Read JSON API

In this example, the read_json() reader makes a call over HTTPS to the GitHub Jobs API, returning records where the job description contains the string ‘python’. At the time this script was executed, there were 41 job positions posted. This is shown by the call to the shape attribute showing 41 rows and 11 columns in the jobs DataFrame.

Similarly, SAS provides several different methods for reading JSON files. By far, the simplest method is to use the JSON LIBNAME access method, illustrated in Listing 6-17.
4  filename response temp;
5
6  proc http
7   url="https://jobs.github.com/positions.json?description=python"
8   method= "GET"
9   out=response;
10  run;
NOTE: 200 OK
11
12  libname in_json JSON fileref=response;
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
NOTE: Libref IN_JSON was successfully assigned as follows:
      Engine:        JSON
      Physical Name: C:\Users\randy\AppData\Local\Temp\SAS
      Temporary Files\_TD9088_RANDY-PC_#LN00063
13  proc copy in=in_json
14            out=work;run;
NOTE: Copying IN_JSON.ALLDATA to WORK.ALLDATA (memtype=DATA).
NOTE: There were 462 observations read from the dataset IN_JSON.ALLDATA.
NOTE: The dataset WORK.ALLDATA has 462 observations and 4 variables.
NOTE: Copying IN_JSON.ROOT to WORK.ROOT (memtype=DATA).
NOTE: There were 42 observations read from the dataset IN_JSON.ROOT.
NOTE: The dataset WORK.ROOT has 42 observations and 12 variables.
15
16  proc print data=work.root (obs=5);
17     id company;
18     var location;
19  run;
NOTE: There were 5 observations read from the dataset WORK.ROOT.
Listing 6-17

SAS Read JSON Access Method

Figure 6-7 uses PROC PRINT to display the first five observations read from the GitHub Jobs API.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig7_HTML.jpg
Figure 6-7

SAS Read of JSON Records

The SAS program has three steps in order to load the JSON records into a SAS dataset. They are
  1. Call PROC HTTP with the GET method to fetch the records returned by the query https://jobs.github.com/positions.json?description=python. The OUT=response statement stores the incoming records in a temporary file associated with the response FILEREF.

  2. Assign the JSON LIBNAME engine to the response FILEREF and copy the incoming JSON.ROOT and JSON.ALLDATA datasets to the WORK library. Obviously, you can copy these datasets into a permanent SAS data library. The copy operation is needed since the JSON access method reads the incoming records only once.

  3. Continue the processing logic to manipulate the ROOT dataset created by the JSON LIBNAME engine. In our case, we simply call PROC PRINT to display the first five observations from the dataset.

In order to make this particular example work, the SAS session must be executed with an encoding of UTF-8. By default, SAS sessions executing under Windows use WLATIN1 encoding which can lead to transcoding errors when calling the JSON LIBNAME engine to read UTF-8 formatted records. Use the following PROC OPTIONS statement to determine the encoding method used by the SAS session:
proc options option=encoding; run;
    SAS (r) Proprietary Software Release 9.4  TS1M5
ENCODING=UTF-8    Specifies the default character-set encoding for the SAS session.
Note that the encoding option can only be changed at SAS initialization time. This is generally controlled by the sasv9.cfg configuration file. The default location on Windows is
C:\Program Files\SASHome\SASFoundation\9.4\nls\en\sasv9.cfg

Write JSON

The pandas library provides the to_json() writer to write the contents of a DataFrame to a JSON file. This feature is illustrated in Listing 6-18.
>>> df8.drop(columns = ['Day', 'Month', 'Year'], inplace = True)
>>> df8.to_json("c:\data\df8_output.json", orient="records", lines=True)
Listing 6-18

Write DataFrame to JSON

The results of writing the df8 DataFrame to a JSON-formatted file are displayed in Figure 6-8.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig8_HTML.jpg
Figure 6-8

Write DataFrame to JSON
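As a quick check, the file can be read back with the read_json() reader using matching orient= and lines= arguments; a minimal sketch, assuming the file written in Listing 6-18 exists:
df8_back = pd.read_json(r'c:\data\df8_output.json', orient='records', lines=True)
print(df8_back.head())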

In order to write a SAS dataset to a series of JSON records, use PROC JSON as illustrated in Listing 6-19. PROC JSON was introduced with Base SAS release 9.4.
4  proc json out='c:\data\sas_final.json' pretty
5                                         nosastags;
6     export final;
7  run;
Listing 6-19

Output a SAS Dataset to JSON

The first two JSON records are displayed in Figure 6-9.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig9_HTML.jpg
Figure 6-9

Write SAS Dataset to JSON

Read RDBMS Tables

Common input sources of data for analysis are relational database tables. The pandas.io.sql module provides a set of query wrappers to enable data retrieval while minimizing dependencies on RDBMS-specific APIs. Another way of saying this is, the clever folks who brought you pandas also figured out they could avoid re-inventing the wheel, so they utilize a library called SQLAlchemy as an abstraction layer to the various databases needing to be read. This approach reduces the amount of database-dependent code pandas needs internally to read and write data using ODBC-compliant engines, which is to say, nearly all databases.

By using the SQLAlchemy library to read RDBMS tables (and queries), you pass SQLAlchemy Expression language constructs which are database-agnostic to the target database. This is analogous to PROC SQL’s behavior of using general SAS SQL constructs which in turn are translated on your behalf for a specific database without you having to know the RDBMS SQL dialect.

In order to execute these Read RDBMS Table examples, you need three components to be set up. They are
  1. The SQLAlchemy library

  2. The appropriate RDBMS driver library, in our case, the pyodbc library

  3. The appropriately configured Windows ODBC client connecting to a target database


The following examples use Microsoft SQL Server 2017 for Windows as the target RDBMS for reading and writing. With a little bit of effort, you can convert these examples to execute on a different RDBMS running either locally or remotely.

The SQLAlchemy library is a part of the Anaconda distribution described in Chapter 1, “Why Python?”

To confirm a version of the SQLAlchemy library is available, start a Python session and enter
>>> import sqlalchemy
>>> sqlalchemy.__version__
'1.2.7'

Notice the two consecutive underscores before and after version.

The second part of the setup is the client-side library, which is the pyodbc library in our case. To confirm a version of the pyodbc library is available, start a Python session and enter
>>> import pyodbc
>>> pyodbc.version
'4.0.23'

Our setup has both Python 3.6 and SQL Server 2017 executing on the same Windows 10 machine. If the SQL Server instance is running remotely, you will need to make the appropriate adjustments.

The third and final step is to configure an ODBC DSN to connect to the target RDBMS. In order to set up the appropriate ODBC client interface on Windows 10, launch the ODBC Data Source Administrator by going to the Windows Start Menu to click

Start ➤ Windows Administrative Tools ➤ ODBC Data Sources (64-bit) to launch the ODBC Data Source Administrator shown in Figure 6-10.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig10_HTML.jpg
Figure 6-10

ODBC Data Source Administrator

If you are using Windows 7, navigate to Control Panel ➤ System and Security ➤ Administrative Tools and select Data Sources (ODBC). These examples illustrate the ODBC setup for Windows 10.

Make sure that the System DSN tab is selected and then click the Add… button to select the driver for the SQL Server Database. See Figure 6-11.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig11_HTML.jpg
Figure 6-11

Create New Data Source

Press the Finish button which displays the Create a New Data Source to SQL Server dialog box presented in Figure 6-12.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig12_HTML.jpg
Figure 6-12

Create a New Data Source to SQL Server

For Name:, we choose AdventureWorksDW to indicate we are using the AdventureWorksDW2017 database that SQL Server ships as an illustration for its Analysis Services. For Server:, supply the instance name of the SQL Server Database you want to connect to. In our case, the SQL Server instance is a local instance named RANDY-PC. Notice we use AdventureWorksDW as the ODBC data source name, which in turn connects to the SQL Server AdventureWorksDW2017 database.

For more information about the AdventureWorks2017 sample databases, see

https://github.com/Microsoft/sql-server-samples/releases/tag/adventureworks

Press the Next> button and you are presented with the authentication dialog box shown in Figure 6-13.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig13_HTML.jpg
Figure 6-13

SQL Server Authentication

In our case, we select Windows authentication using the network login ID. Press the Next> button. This action presents the Change Default Database dialog shown in Figure 6-14.

If your dialog box indicates “master” as the default database, then check the box labeled “Change the default database to:” and select AdventureWorksDW2017, assuming it is available and your account has been granted read access.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig14_HTML.jpg
Figure 6-14

Change Default Database

Check the Change the default database tick box and supply the database name AdventureWorksDW2017. This is illustrated in Figure 6-15.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig15_HTML.jpg
Figure 6-15

Change the Default Database

Press Finish> to be presented with the “Change the language of SQL Server messages to:” tick box. In our case, we chose to log the ODBC driver statistics. This is illustrated in Figure 6-16.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig16_HTML.jpg
Figure 6-16

Change SQL Server Language

Press Finish> to be presented with the ODBC Data Source Configuration panel shown in Figure 6-17.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig17_HTML.jpg
Figure 6-17

Confirm Connection Details

Press Test Data Source... box to test the ODBC Connection. You should see the Test Results dialog box shown in Figure 6-18.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig18_HTML.jpg
Figure 6-18

Confirming Test Results

In order to connect to the SQL Server AdventureWorksDW2017 database with SQLAlchemy, we use the create_engine function to create an engine object from the database URI. Database-specific information for the create_engine function and URI formatting is at

https://docs.sqlalchemy.org/en/latest/core/engines.html

You only need to create the engine once per database instance you are connecting to. The create_engine function containing the SQL Server Database URI is illustrated in Listing 6-20.
>>> import pyodbc
>>> from sqlalchemy import create_engine, MetaData, Table, select
>>>
>>> ServerName = "RANDY-PC"
>>> Database = "AdventureWorksDW2017?driver=SQL+Server"
>>> TableName = "DimCustomer"
>>>
>>> engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
>>> print(engine)
Engine(mssql+pyodbc://RANDY-PC/AdventureWorksDW2017?driver=SQL Server)
Listing 6-20

create_engine Function

The engine object is defined by concatenating the ServerName and Database objects to form the database connection string:
mssql+pyodbc://RANDY-PC/AdventureWorksDW2017?driver=SQL Server
In cases where the SQL Server instance is remote to the Python session, the engine object string will be similar to
mssql+pyodbc://USER:PW@SERVER/AdventureWorksDW?driver=ODBC+Driver+13+for+SQL+Server

where USER:PW is the userid and password pair and SERVER is the remote hostname or IP address running the SQL Server instance. Also notice the ODBC driver name could be different, depending on which driver you use. You may need to contact your database administrator for additional information.

Once the engine object is correctly defined, use the read_sql_table() method to read all or subsets of database artifacts such as tables and views. Consider Listing 6-21, which needs just two arguments: the target database table, in this case, the DimCustomer table, and the engine object defining the connection string to the target database instance.
>>> import pandas as pd
>>> t0 = pd.read_sql_table('DimCustomer', engine)
>>> t0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18484 entries, 0 to 18483
Data columns (total 29 columns):
CustomerKey             18484 non-null int64
GeographyKey            18484 non-null int64
CustomerAlternateKey    18484 non-null object
Title                   101 non-null object
FirstName               18484 non-null object
MiddleName              10654 non-null object
LastName                18484 non-null object
NameStyle               18484 non-null bool
BirthDate               18484 non-null datetime64[ns]
MaritalStatus           18484 non-null object
Suffix                  3 non-null object
Gender                  18484 non-null object
EmailAddress            18484 non-null object
YearlyIncome            18484 non-null float64
TotalChildren           18484 non-null int64
NumberChildrenAtHome    18484 non-null int64
EnglishEducation        18484 non-null object
SpanishEducation        18484 non-null object
FrenchEducation         18484 non-null object
EnglishOccupation       18484 non-null object
SpanishOccupation       18484 non-null object
FrenchOccupation        18484 non-null object
HouseOwnerFlag          18484 non-null object
NumberCarsOwned         18484 non-null int64
AddressLine1            18484 non-null object
AddressLine2            312 non-null object
Phone                   18484 non-null object
DateFirstPurchase       18484 non-null datetime64[ns]
CommuteDistance         18484 non-null object
dtypes: bool(1), datetime64[ns](2), float64(1), int64(5), object(20)
memory usage: 4.0+ MB
Listing 6-21

Basic Read with pd.read_sql_table

The info() method for the t0 DataFrame displays the names of all the columns read from the database table, the number of rows with non-null values, and the columns’ corresponding types. For example, the BirthDate column is mapped to a datetime64 type. This indicates the BirthDate column can be used in datetime arithmetic expressions without needing to cast it to datetime64 using the to_datetime() function.

If needed, the read_sql_table() reader accepts the parse_dates= argument to coerce date and datetime columns into a datetime64 type with
parse_dates={'BirthDate': {'format': '%Y-%m-%d'}}

using nested Dictionary key:value pairs where the outer key is the name of the column and its value is another Dictionary whose 'format' key holds the date parsing instructions, known as format directives.
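A minimal sketch of that call, assuming the BirthDate values arrive as 'YYYY-MM-DD' strings:
t0_dates = pd.read_sql_table('DimCustomer', engine,
                             parse_dates={'BirthDate': {'format': '%Y-%m-%d'}})
print(t0_dates['BirthDate'].dtype)   # datetime64[ns]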

The to_datetime() function along with format directives is discussed in detail in Chapter 7, “Date and Time.”

To return a subset of columns from a database table, use the columns= argument illustrated in Listing 6-22.

The call to the read_sql_table() method contains three arguments. The first argument is the target table, DimCustomer; the second argument is the engine object containing the connection information needed to access the SQL Server Database instance; and the third argument is columns= which forms the SELECT list that is ultimately executed as a T-SQL query on the SQL Server instance.
>>> col_names = ['FirstName', 'LastName', 'BirthDate', 'Gender', 'YearlyIncome', 'CustomerKey']
>>> tbl = 'DimCustomer'
>>> t1 = pd.read_sql_table(tbl, engine, columns=col_names)
>>> t1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18484 entries, 0 to 18483
Data columns (total 6 columns):
FirstName       18484 non-null object
LastName        18484 non-null object
BirthDate       18484 non-null datetime64[ns]
Gender          18484 non-null object
YearlyIncome    18484 non-null float64
CustomerKey     18484 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 866.5+ KB
Listing 6-22

Returning Column Subset

Use the index_col= argument to map an input column as the DataFrame index to create row labels. This feature is shown in Listing 6-23. This example extends Listing 6-22 by adding a fourth parameter, index_col=, to the call to the read_sql_table() reader.
>>> t2 = pd.read_sql_table(tbl, engine, columns=col_names, index_col="CustomerKey")
>>> print(t2[['FirstName', 'LastName', 'BirthDate']].head(5))
             FirstName LastName  BirthDate
CustomerKey
11000              Jon     Yang 1971-10-06
11001           Eugene    Huang 1976-05-10
11002            Ruben   Torres 1971-02-09
11003          Christy      Zhu 1973-08-14
11004        Elizabeth  Johnson 1979-08-05
>>> t2.index
Int64Index([11000, 11001, 11002, 11003, 11004, 11005, 11006, 11007, 11008, 11009,
. . .
29474, 29475, 29476, 29477, 29478, 29479, 29480, 29481, 29482,
29483],
dtype='int64', name="CustomerKey", length=18484)
Listing 6-23

Mapping Column to an Index

Observe how the CustomerKey column read from the database is now the row label for the t2 DataFrame.

Query RDBMS Tables

As part of an analysis effort, we often need to construct SQL queries against the target database to return just the rows and columns needed to form a DataFrame. Use the read_sql_query() reader to send an SQL query to the database and form a DataFrame from the returned result set. This feature is illustrated in Listing 6-24.
>>> q1 = pd.read_sql_query('SELECT FirstName, LastName, Gender, BirthDate, YearlyIncome '
...                  'FROM dbo.DimCustomer '
...                  'WHERE YearlyIncome > 50000; '
...                         , engine)
>>> print(q1[['FirstName', 'LastName', 'BirthDate']].tail(5))
     FirstName LastName   BirthDate
9853     Edgar    Perez  1964-05-17
9854     Alvin      Pal  1963-01-11
9855    Wesley    Huang  1971-02-10
9856     Roger    Zheng  1966-03-02
9857    Isaiah  Edwards  1971-03-11
Listing 6-24

DataFrame from Query to RDBMS

In this example, the first argument to read_sql_query() is a valid T-SQL query for SQL Server. This is followed by the engine object created previously, which holds the connection information to the database. This call creates the q1 DataFrame. Notice how each line of the SQL query uses single quotes and a space before the close quote.
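When the filter value comes from Python code, it can be passed as a bind parameter rather than assembled into the query string; a hedged sketch using SQLAlchemy's text() construct (the :income name and the threshold value are illustrative):
from sqlalchemy import text

threshold = 50000
q2 = pd.read_sql_query(text('SELECT FirstName, LastName, Gender, BirthDate, YearlyIncome '
                            'FROM dbo.DimCustomer '
                            'WHERE YearlyIncome > :income'),
                       engine, params={'income': threshold})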

Listing 6-25 illustrates the SAS analog to Listing 6-22. This example has the same SELECT list as Listing 6-22. Use PROC PWENCODE to encode your password string to ensure your password is not stored in clear text. Alternatively, you can assign the encoded password string to a SAS Macro variable with
proc pwencode in="YOUR_PASSWORD_HERE" out=pwtemp;
run;
This defines the Macro variable &pwtemp holding the encoded password string.
4   proc pwencode in=XXXXXXXXXXX;
5   run;
    {SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522
6   libname sqlsrvr odbc
7           uid=randy
8           pwd={SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522
9           datasrc=AdventureWorksDW
10          bulkload=yes;
NOTE: Libref SQLSRVR was successfully assigned as follows:
      Engine:        ODBC
      Physical Name: AdventureWorksDW
11 title1 "Default Informats and Formats";
12 proc sql;
13    create table customers as
14    select FirstName
15           ,LastName
16           ,BirthDate
17           ,Gender
18           ,YearlyIncome
19           ,CustomerKey
20        from sqlsrvr.DimCustomer;
NOTE: Table WORK.CUSTOMERS created, with 18484 rows and 6 columns.
21     select name
22           ,informat
23           ,format
24        from DICTIONARY.COLUMNS
25        where libname = 'WORK' &
26              memname = 'CUSTOMERS';
27 quit;
Listing 6-25

SAS ODBC Access to SQL Server

The second set of PROC SQL statements, lines 21–26, queries the SAS-maintained table, DICTIONARY.COLUMNS, to return format and informat assignments for the WORK.CUSTOMERS columns. SAS formats and informats are analogous to column types, in that informats direct how values are read on input and formats direct how values are written on output.

In this case, the incoming SQL Server table values for the BirthDate column are read using the SAS $10. informat treating the values as a 10-byte long character string. The output from PROC SQL is displayed in Figure 6-19.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig19_HTML.jpg
Figure 6-19

Customer Table Default Formats and Informats Assigned by SAS

In order to utilize the incoming BirthDate variable with SAS datetime expressions, a subsequent Data Step is needed to copy the BirthDate variable values into a different variable formatted for datetime handling. Listing 6-26 illustrates copying the customers dataset and using the INPUT function to load the original character values for the BirthDate variable to a numeric variable and assign it a permanent format of YYMMDD10.

Unlike Python, existing SAS variables cannot be recast. To output the customers dataset with the BirthDate variable with a datetime format requires copying the dataset, renaming the incoming BirthDate variable to, in this case, dob, and assigning dob’s values to the output BirthDate variable with the INPUT function.
4  data customers(drop = dob);
5     set customers(rename=(BirthDate = dob));
6  length BirthDate 8;
7
8  BirthDate = input(dob,yymmdd10.);
9  format BirthDate yymmdd10.;
10  run;
NOTE: There were 18484 observations read from the dataset WORK.CUSTOMERS.
NOTE: The dataset WORK.CUSTOMERS has 18484 observations and 6 variables.
11  title1 "Modified Informats and Formats";
12  proc sql;
13     select name
14           ,informat
15           ,format
16        from DICTIONARY.COLUMNS
17        where libname = 'WORK' &
18           memname = 'CUSTOMERS';
19  quit;
Listing 6-26

Mapping BirthDate Variable to Datetime Variable

Figure 6-20 uses PROC SQL to query the SAS DICTIONARY tables in order to display the informats and formats assigned to the modified customers dataset.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig20_HTML.jpg
Figure 6-20

Customer Dataset Modified Formats and Informats

Not all SQL queries return a result set. Listing 6-27 illustrates using the sql.execute() function. This is useful for queries that do not return a result set such as CREATE TABLE, DROP, INSERT statements, and so on. The SQL statements are specific to the target RDBMS.
>>> from pandas.io import sql
>>> sql.execute('USE AdventureWorksDW2017; ', engine)
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0DE80>
>>> sql.execute('DROP TABLE CustomerPy', engine)
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0DDA0>
>>> sql.execute("CREATE TABLE CustomerPy (ID int, \
...             Name nvarchar(255),               \
...             StartDate date);", engine)
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0D898>
Listing 6-27

sql.execute() Statement

In this case, the T-SQL syntax for SQL Server DROPs the CustomerPy table with
sql.execute('DROP TABLE CustomerPy', engine)
followed by a CREATE TABLE statement to define the CustomerPy table
sql.execute("CREATE TABLE CustomerPy (ID int, \
             Name nvarchar(255),               \
             StartDate date);", engine)
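A data-loading statement that returns no result set can be issued the same way; a minimal sketch with illustrative values (note that later pandas releases deprecate pandas.io.sql.execute in favor of running statements through the SQLAlchemy engine directly):
sql.execute("INSERT INTO CustomerPy (ID, Name, StartDate) "
            "VALUES (1, 'Jansen', '2018-12-01');", engine)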
Figure 6-21 displays the physical attributes of the CustomerPy SQL Server table created with the sql.execute() statement.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig21_HTML.jpg
Figure 6-21

sql.execute() Results

Listing 6-28 illustrates the use of the SQL pass-thru facility as a wrapper to pass T-SQL statements directly to SQL Server through the ODBC API. This example is the analog to Listing 6-27 for passing SQL to the database which does not return a result set.
4  proc sql;
5     connect to odbc as sqlsrvr
6                     (dsn=AdventureWorksDW
7                      uid=randy
8                      password=
8!                 {SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522);
9
10       exec
11        (CREATE TABLE CustomerPy (ID int,
12                  Name nvarchar(255),
13                  StartDate date)) by sqlsrvr;
14
15    %put Note:  Return Code for SQL Server is: &sqlxrc;
Note:  Return Code for SQL Server is: 0
16     %put Note:  Return Message for SQL Server is: &sqlxmsg;
Note:  Return Message for SQL Server is:
17
18     select * from connection to sqlsrvr
19        (ODBC::SQLTables (,,"CustomerPy",));
20      disconnect from sqlsrvr;
21
22     %put Note:  Return Code for SQL Server is: &sqlxrc;
Note:  Return Code for SQL Server is: 0
23     %put Note:  Return Message for SQL Server is: &sqlxmsg;
Note:  Return Message for SQL Server is:
24     quit;
Listing 6-28

SAS Pass-Thru to SQL Server

With SAS SQL Pass-Thru, any statement inside a parenthesized expression is passed directly to the database library API, in this case, ODBC.
(CREATE TABLE CustomerPy (ID int,
                    Name nvarchar(255),
                    StartDate date)
SAS/Access to ODBC supports calls to the ODBC::SQLAPI. This interface acts as an alternative method for querying the RDBMS catalog tables. In this example
select * from connection to sqlsrvr
      (ODBC::SQLTables (,,"CustomerPy",));
returns information about the created CustomerPy table. Figure 6-22 displays the query results.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig22_HTML.jpg
Figure 6-22

CustomerPy Attributes

Read SAS Datasets

Organizations often have large numbers of permanent SAS datasets stored on the filesystem. pandas provides the read_sas() reader for creating DataFrames by reading permanent SAS datasets. Permanent SAS datasets are often referred to as .sas7bdat files (after the extension SAS uses to name dataset files on Windows and Unix filesystems). Listing 6-29 uses a Data Step to create the permanent out_data.to_df SAS dataset, where out_data maps to the filesystem location C:\data and to_df is the name of the dataset, which is recognized by Windows as the file C:\data\to_df.sas7bdat.
4 libname out_data 'c:\data';
NOTE: Libref OUT_DATA was successfully assigned as follows:
      Engine:        V9
      Physical Name: c:\data
5 data out_data.to_df;
6    length even_odd $ 4;
7    call streaminit(987650);
8    do datetime = '01Dec2018 00:00'dt to '02Dec2018 00:00'dt by 60;
9       amount = rand("Uniform", 50, 1200);
10       quantity = int(rand("Uniform", 1000, 5000));
11       if int(mod(quantity,2)) = 0 then even_odd = 'Even';
12          else even_odd = 'Odd';
13       output;
14    end;
15 format datetime datetime16.
16        amount dollar12.2
17        quantity comma10.;
18 run;
NOTE: The dataset OUT_DATA.TO_DF has 1441 observations and 4 variables.
19 proc contents data=out_data.to_df;
20 run;
21 proc print data=out_data.to_df(obs=5);
22 run;
NOTE: There were 5 observations read from the dataset OUT_DATA.TO_DF.
Listing 6-29

Create SAS Dataset for read_sas()

The datetime variable is formatted to display as datetime constants; the amount variable is formatted to precede the value with a dollar sign ($); and the quantity variable is formatted to embed a comma (,) as part of its value. Figure 6-23 uses PROC CONTENTS to display the attribute information for the dataset. Notice that since we are using Windows, the SAS dataset written to the filesystem uses WLATIN1 encoding.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig23_HTML.jpg
Figure 6-23

Attributes for OUT_DATA.TO_DF

Figure 6-24 uses PROC PRINT to display the first five observations of the dataset.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig24_HTML.jpg
Figure 6-24

Output for OUT_DATA.TO_DF

With the OUT_DATA.TO_DF SAS dataset written to the filesystem, it can be used as input into the call to read_sas() reader to create a DataFrame. Listing 6-30 illustrates creating the from_sas DataFrame.
>>> from_sas = pd.read_sas("c:\Data\to_df.sas7bdat",
...                        format='SAS7BDAT',
...                        encoding = 'latin-1')
>>> from_sas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1441 entries, 0 to 1440
Data columns (total 4 columns):
even_odd    1441 non-null object
datetime    1441 non-null datetime64[ns]
amount      1441 non-null float64
quantity    1441 non-null float64
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 45.1+ KB
>>> print(from_sas.head(5))
  even_odd            datetime       amount  quantity
0      Odd 2018-12-01 00:00:00   340.629296    3955.0
1     Even 2018-12-01 00:01:00  1143.526378    1036.0
2     Even 2018-12-01 00:02:00   209.394104    2846.0
3     Even 2018-12-01 00:03:00   348.086155    1930.0
4     Even 2018-12-01 00:04:00   805.929860    3076.0
Listing 6-30

Create DataFrame from SAS Dataset

The call to the read_sas() reader has as its first argument the full Windows pathname to the OUT_DATA.TO_DF dataset (as it is known to SAS). The value for the second argument, format=, is SAS7BDAT. If the SAS dataset is in transport format, then this value is set to XPORT. The third argument, encoding=, is set to latin-1 to match the encoding for the to_df.sas7bdat dataset (as it is known to Windows).
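For SAS datasets too large to hold in memory at once, read_sas() also accepts a chunksize= argument and returns an iterator of DataFrames; a hedged sketch (the chunk size of 500 rows is arbitrary):
reader = pd.read_sas(r'c:\Data\to_df.sas7bdat', format='SAS7BDAT',
                     encoding='latin-1', chunksize=500)
total_amount = sum(chunk['amount'].sum() for chunk in reader)
print(total_amount)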

The read_sas() reader acquires a file lock on the target input in order to read the SAS dataset. This can cause file contention issues if you attempt to open the SAS dataset for output after executing the read_sas() reader. In that case, the SAS log issues the following error:
ERROR: A lock is not available for OUT_DATA.TO_DF.DATA.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The dataset OUT_DATA.TO_DF was only partially opened and will not be saved.

If this situation is encountered, ending the Python session used to call the read_sas() reader releases the file lock.
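If ending the session is not practical, one possible workaround, sketched below under the assumption that the lock stems from the file handle held open by the reader, is to read the .sas7bdat file into an in-memory buffer and pass the buffer to read_sas(); the operating system handle is then released as soon as the with block exits.
import io
import pandas as pd

with open(r"c:\Data\to_df.sas7bdat", "rb") as f:
    buffer = io.BytesIO(f.read())

from_sas = pd.read_sas(buffer, format='sas7bdat', encoding='latin-1')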

The pandas library does not provide a write_sas writer. In Chapter 8, “SASPy Module,” we cover details for executing bi-directional interchange between pandas DataFrames and SAS datasets using SAS Institute’s open source SASPy module.

Write RDBMS Tables

The pandas library provides the ability to write DataFrames as RDBMS tables with the to_sql() writer. Consider Listing 6-31, which uses the from_sas DataFrame created in Listing 6-30. In this example, the from_sas DataFrame calls the to_sql() method, attempting to create the SQLTableFromDF SQL Server table. The call returns an ODBC error indicating the to_sql() syntax is correct, but the underlying SQLAlchemy call to the ODBC API failed.
>>> from_sas.to_sql('SQLTableFromDF', engine, if_exists="replace", index=False)
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request. (8003)')
Listing 6-31

Write DataFrame to Table

If you are using an ODBC driver other than the SQL Server driver used here, you may encounter a different error message indicating the SQL inserts were rejected. Fortunately, this error is easily eliminated by using the chunksize= argument to the to_sql() writer, illustrated in Listing 6-32.
>>> from_sas.to_sql('SQLTableFromDF', engine, if_exists="replace", chunksize=100, index=False)
Listing 6-32

Chunk DataFrame Writes to Table

The arguments to the to_sql() method call are
  1. The RDBMS target table to write, in this case, SQLTableFromDF.

  2. The engine object containing the RDBMS connection string defined in Listing 6-20.

  3. if_exists='replace', which eliminates the need to explicitly DROP the target table if it already exists.

  4. chunksize=100, which writes rows in increments of 100 to avoid size limitations when loading larger DataFrames into an RDBMS table (a chunk-size sketch follows this list).

  5. index=False, which does not define a primary key column when loading the DataFrame values into the RDBMS table.

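As a rough guide, a chunk size can be derived from the 2,100-parameter limit reported in Listing 6-31. This sketch assumes the parameter count per batch scales with rows per chunk times the number of columns, which is the case for multi-row inserts; adjust to suit your driver.
# Keep each batch comfortably under SQL Server's 2,100-parameter limit
max_params = 2100
safe_chunksize = max(1, max_params // len(from_sas.columns))
from_sas.to_sql('SQLTableFromDF', engine,
                if_exists='replace',
                chunksize=safe_chunksize,
                index=False)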
Unlike SAS, the to_sql() writer does not return messages or provide a return code indicating success when it writes to the RDBMS table. Of course, we can send queries to the RDBMS to validate the existence of the SQLTableFromDF SQL Server table, which is the purpose of Listing 6-33.
>>> confirm1 = pd.read_sql_query("SELECT column_name as 'COL_NAME', "
...            "data_type as 'Data_Type', "
...            "IS_NULLABLE as 'Nulls Valid' "
...            "FROM information_schema.columns "
...            "WHERE table_name = 'SQLTableFromDF' ", engine)
>>> print(confirm1)
   COL_NAME Data_Type Nulls Valid
0  even_odd   varchar         YES
1  datetime  datetime         YES
2    amount     float         YES
3  quantity     float         YES
Listing 6-33

Confirm to_sql() Writes

Notice the quoting. Each physical line in the call to the read_sql_query() reader is enclosed in double quotes. Single quotes label the column headings following the T-SQL AS keyword, along with the single quotes used in the T-SQL WHERE clause. The single quotes are passed to the RDBMS engine since they are a required element of a valid T-SQL query.
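The same query can be written as a single triple-quoted string, a sketch of which follows; this avoids splitting the statement across adjacent double-quoted literals while leaving the embedded single quotes untouched.
query = """
SELECT column_name AS 'COL_NAME',
       data_type   AS 'Data_Type',
       IS_NULLABLE AS 'Nulls Valid'
FROM information_schema.columns
WHERE table_name = 'SQLTableFromDF'
"""
confirm1_alt = pd.read_sql_query(query, engine)
print(confirm1_alt)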

Figure 6-25 displays the output from SQL Server Management Studio execution of this query.
../images/440803_1_En_6_Chapter/440803_1_En_6_Fig25_HTML.jpg
Figure 6-25

SQL Server Table Write Confirmation

In the case of Listing 6-32, there are no missing values in the from_sas DataFrame. Handling missing values across data formats can be challenging since each format uses different sentinel values to indicate “missing”.
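A small sketch, not drawn from the examples above, illustrates the sentinels pandas recognizes: np.nan for floats, NaT for datetimes, and None for objects. The isnull() method treats all three as missing.
import numpy as np
import pandas as pd

sentinels = pd.DataFrame({'flt': [1.0, np.nan],
                          'dt':  [pd.Timestamp('2018-12-01'), pd.NaT],
                          'obj': ['a', None]})
print(sentinels.isnull())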

In the case of a DataFrame, missing values are represented by NaN for float64 columns and NaT (Not a Time) for datetime64 columns. Consider Listing 6-34. The goal of this example is to load the resulting to_mssql DataFrame into a SQL Server table and ensure subsequent SQL queries properly handle these missing values.
>>> import numpy as np
>>> from_sas.loc[from_sas['quantity'] < 2000, 'amount'] = np.NaN
>>> from_sas.loc[from_sas['quantity'] < 2000, 'datetime'] = np.NaN
>>>
>>> to_mssql = from_sas.copy()
>>>
>>> from_sas.isnull().sum()
even_odd      0
datetime    363
amount      363
quantity      0
dtype: int64
>>> print(to_mssql.head(5))
  even_odd            datetime      amount  quantity
0      Odd 2018-12-01 00:00:00  340.629296    3955.0
1     Even                 NaT         NaN    1036.0
2     Even 2018-12-01 00:02:00  209.394104    2846.0
3     Even                 NaT         NaN    1930.0
4     Even 2018-12-01 00:04:00  805.929860    3076.0
Listing 6-34

Insert Missing Values into DataFrame

This example uses the loc() indexer to find rows meeting the logical condition where the quantity column value is less than 2000. For the rows found, the amount and datetime column values are set to np.NaN. For more details on using the loc() indexer to conditionally update values, see the section “Conditional Updates” in Chapter 5, “Data Management.”

Notice that the missing value for the datetime column, whose type is datetime64, is displayed as NaT. The .isnull().sum() chained calls indicate there are rows with missing values in the from_sas DataFrame and, therefore, in its copy, the to_mssql DataFrame.
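For clarity, the same conditional update can be written with pd.NaT assigned to the datetime column explicitly, as in the sketch below; pandas coerces np.NaN to NaT in a datetime64 column either way.
# Equivalent update using the type-appropriate missing-value sentinels
mask = from_sas['quantity'] < 2000
from_sas.loc[mask, 'amount'] = np.NaN
from_sas.loc[mask, 'datetime'] = pd.NaT
print(from_sas[['datetime', 'amount']].isnull().sum())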

Consider Listing 6-35. In this example, we copy the from_sas DataFrame to the to_mssql DataFrame. This is followed by a call to the to_sql() writer to load the to_mssql DataFrame into the SQLTableFromDF2 SQL Server table.
>>> to_mssql = from_sas.copy()
>>>
>>> to_mssql.to_sql('SQLTableFromDF2', engine, if_exists="replace", chunksize=100, index=False)
>>> confirm2 = pd.read_sql_query("SELECT TOP 5 * "
...                              "FROM SQLTableFromDF2 "
...                              "WHERE amount is NULL ", engine)
>>>
>>> print(confirm2)
  even_odd datetime amount  quantity
0     Even     None   None    1036.0
1     Even     None   None    1930.0
2      Odd     None   None    1677.0
3     Even     None   None    1352.0
4     Even     None   None    1876.0
Listing 6-35

Validate Reading Missing Values

The syntax
to_mssql.to_sql('SQLTableFromDF2', engine, if_exists="replace", chunksize=100, index=False)
creates the SQLTableFromDF2 SQL Server table. The syntax
confirm2 = pd.read_sql_query("SELECT TOP 5 * "
...                          "FROM SQLTableFromDF2 "
...                          "WHERE amount is NULL ", engine)
executes a T-SQL query to create the confirm2 DataFrame, returning the first five rows from the table that satisfy the T-SQL WHERE clause
  WHERE amount is NULL

The call to the print() function for the confirm2 DataFrame shows that the missing values in the datetime and amount columns made the round trip from DataFrame to SQL Server table and back to a DataFrame, maintaining the integrity of missing values between both data formats.
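The round trip can also be verified programmatically, as sketched below: read the full table back and count the NULLs per column; the counts should match the 363 missing values reported in Listing 6-34.
round_trip = pd.read_sql_query("SELECT * FROM SQLTableFromDF2", engine)
print(round_trip.isnull().sum())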

Summary

Data analysis tasks often require building a data processing pipeline to acquire and format data among a number of different data formats. This chapter provides a solid grounding for acquiring data and loading it into a DataFrame using the different pandas readers, and it illustrates writing DataFrame values to .csv files, Excel files, and RDBMS tables. Adding this know-how to one’s skill set enables the user to build more elaborate scripts to tackle a range of data analysis tasks.
