7
Using SQL with SAS and R

7.1 What is SQL?

SQL (Structured Query Language) is a language for querying and modifying data in relational database management systems (RDBMS). SQL is also used outside traditional databases, for example in Apache Hive, Python, and PySpark. The pandasql package lets you query pandas DataFrames using SQL syntax. In Spark 1.x, the entry point into all SQL functionality is the SQLContext class; from Spark 2.0 onward this role is taken by SparkSession. The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

7.1.1 Basic Terminology

A database is a collection of information that is organized so that it can be easily accessed, managed and updated.

A relational database is a set of tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables.

7.1.2 CAP Theorem

The CAP theorem states that a distributed database system can guarantee at most two of the following three properties: consistency, availability, and partition tolerance.

Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc.

Eventually‐consistent services are sometimes classified as providing BASE (Basically Available, Soft state, Eventual consistency), in contrast to traditional ACID.

Database systems designed with traditional ACID guarantees in mind, such as RDBMSs, choose consistency over availability, whereas systems designed around the BASE philosophy, common among NoSQL databases, choose availability over consistency.

7.1.3 SQL in SAS and R

SQL can be used in R through the sqldf() function of the sqldf package, whereas in SAS we use PROC SQL. PROC SQL does not require a RUN statement because its statements are executed immediately; the procedure is ended with a QUIT statement.

7.2 SQL Select

The select statement is used to tell the database what data we want from it.

The basic syntax of a select statement is:

We use the * wildcard to select all columns:

In R:

We will use the built-in mtcars dataset in R. I have also created an mtcars dataset in SAS as follows:

Downloaded the data from:
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv
Uploaded it to a folder in SAS.
Created a dataset named mtcars using PROC IMPORT.

FILENAME REFFILE '/home/ajay4/sasuser.v94/mtcars.csv';

PROC IMPORT DATAFILE=REFFILE
     DBMS=CSV
     OUT=WORK.mtcars;
     GETNAMES=YES;
RUN;

proc sql number outobs=10;
select * from mtcars;
quit;

Row   VAR1                mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
1     Mazda RX4           21    6    160    110  3.9   2.62   16.46  0   1   4     4
2     Mazda RX4 Wag       21    6    160    110  3.9   2.875  17.02  0   1   4     4
3     Datsun 710          22.8  4    108    93   3.85  2.32   18.61  1   1   4     1
4     Hornet 4 Drive      21.4  6    258    110  3.08  3.215  19.44  1   0   3     1
5     Hornet Sportabout   18.7  8    360    175  3.15  3.44   17.02  0   0   3     2
6     Valiant             18.1  6    225    105  2.76  3.46   20.22  1   0   3     1
7     Duster 360          14.3  8    360    245  3.21  3.57   15.84  0   0   3     4
8     Merc 240D           24.4  4    146.7  62   3.69  3.19   20     1   0   4     2
9     Merc 230            22.8  4    140.8  95   3.92  3.15   22.9   1   0   4     2
10    Merc 280            19.2  6    167.6  123  3.92  3.44   18.3   1   0   4     4

In SQL, * denotes all columns.

The OUTOBS= option limits the output to 10 rows, and the NUMBER option adds the Row column.

Using select, we choose the variables from a particular table, optionally restricted to rows that satisfy a condition. The basic syntax for using SQL in SAS is:

The syntax for using SQL in R is:

> install.packages("sqldf")
> library(sqldf)
> sqldf("select * from mtcars limit 10")

    mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
1   21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
2   21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
3   22.8  4    108.0  93   3.85  2.320  18.61  1   1   4     1
4   21.4  6    258.0  110  3.08  3.215  19.44  1   0   3     1
5   18.7  8    360.0  175  3.15  3.440  17.02  0   0   3     2
6   18.1  6    225.0  105  2.76  3.460  20.22  1   0   3     1
7   14.3  8    360.0  245  3.21  3.570  15.84  0   0   3     4
8   24.4  4    146.7  62   3.69  3.190  20.00  1   0   4     2
9   22.8  4    140.8  95   3.92  3.150  22.90  1   0   4     2
10  19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4

Note that we used LIMIT to restrict the number of rows returned.

We can select particular columns by specifying their names separated by commas in the select statement:

To select only the mpg, vs, and cyl columns:

In R:
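A minimal sketch of this column selection, assuming the sqldf package is installed as shown earlier; the object name three_cols is only illustrative:

```r
library(sqldf)  # assumes install.packages("sqldf") has already been run

# List the wanted columns by name instead of using the * wildcard;
# limit 5 keeps the printed output short.
three_cols <- sqldf("select mpg, vs, cyl from mtcars limit 5")
print(three_cols)
```

The result contains only the three named columns, in the order they are listed in the select statement.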

In SAS:

7.2.1 SQL WHERE

We use the WHERE clause along with SELECT to conditionally select rows.

For example:

To select only rows which have cyl=6

In R

> sqldf("select * from mtcars where cyl=6")

   mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
1  21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
2  21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
3  21.4  6    258.0  110  3.08  3.215  19.44  1   0   3     1
4  18.1  6    225.0  105  2.76  3.460  20.22  1   0   3     1
5  19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4
6  17.8  6    167.6  123  3.92  3.440  18.90  1   0   4     4
7  19.7  6    145.0  175  3.62  2.770  15.50  0   1   5     6

Rows which had cyl=6 are selected.

7.2.2 SQL Order By

ORDER BY sorts the output by one or more columns. The sort is ascending by default and descending when DESC is added after the column name; character columns are sorted alphabetically.

In SAS:

Sort/Order Data in SAS

In R

> sqldf("select * from mtcars where cyl=6 order by disp")

   mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
1  19.7  6    145.0  175  3.62  2.770  15.50  0   1   5     6
2  21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
3  21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
4  19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4
5  17.8  6    167.6  123  3.92  3.440  18.90  1   0   4     4
6  18.1  6    225.0  105  2.76  3.460  20.22  1   0   3     1
7  21.4  6    258.0  110  3.08  3.215  19.44  1   0   3     1
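The query above sorts in the default ascending order. A short sketch of a descending sort with a second sort key to break ties, assuming sqldf is loaded; the object name sorted_desc is illustrative:

```r
library(sqldf)

# Sort by disp from largest to smallest; where disp is tied,
# sort those rows by mpg from smallest to largest.
sorted_desc <- sqldf("select mpg, cyl, disp from mtcars
                      where cyl = 6
                      order by disp desc, mpg asc")
print(sorted_desc)
```

Each column in the ORDER BY list can carry its own ASC or DESC keyword, so the two keys here are sorted in opposite directions.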

7.2.3 AND, OR, NOT in SQL

AND, OR, and NOT operators can be used along with the WHERE clause.

AND operator displays only those rows which meet all conditions in the WHERE clause.

OR operator displays those rows which meet at least one of the conditions in the WHERE clause.

NOT operator displays those rows which do not meet the condition specified in the WHERE clause.

In SAS:

To select only those rows from mtcars which have cyl=6 as well as gear=3:

proc sql number;
select * from mtcars where cyl=6 and gear=3
order by disp;
quit;

Row  VAR1            mpg   cyl  disp  hp   drat  wt     qsec   vs  am  gear  carb
1    Valiant         18.1  6    225   105  2.76  3.46   20.22  1   0   3     1
2    Hornet 4 Drive  21.4  6    258   110  3.08  3.215  19.44  1   0   3     1

To select only those rows which have either carb = 1 or carb = 4

To select all rows that do not have am=0:

In R:

To select only those rows from mtcars which have cyl=6 as well as gear=3

> sqldf("select * from mtcars where cyl=6 and gear=3 order by disp")

   mpg   cyl  disp  hp   drat  wt     qsec   vs  am  gear  carb
1  18.1  6    225   105  2.76  3.460  20.22  1   0   3     1
2  21.4  6    258   110  3.08  3.215  19.44  1   0   3     1

To select only those rows which have either carb=1 or carb=4

> sqldf("select * from mtcars where carb=1 or carb=4 order by disp")
   mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
1  33.9  4    71.1   65   4.22  1.835  19.90  1   1   4     1
2  32.4  4    78.7   66   4.08  2.200  19.47  1   1   4     1
3  27.3  4    79.0   66   4.08  1.935  18.90  1   1   4     1
4  22.8  4    108.0  93   3.85  2.320  18.61  1   1   4     1
5  21.5  4    120.1  97   3.70  2.465  20.01  1   0   3     1
6  21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
7  21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
8  19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4
9  17.8  6    167.6  123  3.92  3.440  18.90  1   0   4     4
10 18.1  6    225.0  105  2.76  3.460  20.22  1   0   3     1
11 21.4  6    258.0  110  3.08  3.215  19.44  1   0   3     1
12 13.3  8    350.0  245  3.73  3.840  15.41  0   0   3     4
13 15.8  8    351.0  264  4.22  3.170  14.50  0   1   5     4
14 14.3  8    360.0  245  3.21  3.570  15.84  0   0   3     4
15 14.7  8    440.0  230  3.23  5.345  17.42  0   0   3     4
16 10.4  8    460.0  215  3.00  5.424  17.82  0   0   3     4
17 10.4  8    472.0  205  2.93  5.250  17.98  0   0   3     4

To select all rows that do not have am=0:

> sqldf("select * from mtcars where not am=0 order by disp")

   mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
1  33.9  4    71.1   65   4.22  1.835  19.90  1   1   4     1
2  30.4  4    75.7   52   4.93  1.615  18.52  1   1   4     2
3  32.4  4    78.7   66   4.08  2.200  19.47  1   1   4     1
4  27.3  4    79.0   66   4.08  1.935  18.90  1   1   4     1
5  30.4  4    95.1   113  3.77  1.513  16.90  1   1   5     2
6  22.8  4    108.0  93   3.85  2.320  18.61  1   1   4     1
7  26.0  4    120.3  91   4.43  2.140  16.70  0   1   5     2
8  21.4  4    121.0  109  4.11  2.780  18.60  1   1   4     2
9  19.7  6    145.0  175  3.62  2.770  15.50  0   1   5     6
10 21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
11 21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
12 15.0  8    301.0  335  3.54  3.570  14.60  0   1   5     8
13 15.8  8    351.0  264  4.22  3.170  14.50  0   1   5     4

7.2.4 SQL Select Distinct

We can select only the unique values of a column using the SELECT DISTINCT statement:

For example:

To find the distinct values that the variable gear takes in mtcars:

In SAS:

In R
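A minimal sketch of SELECT DISTINCT with sqldf. The ORDER BY clause is an addition here, included only so that the values come back in a predictable order; the object name gears is illustrative:

```r
library(sqldf)

# Return each value of gear once, sorted from smallest to largest
gears <- sqldf("select distinct gear from mtcars order by gear")
print(gears)
```

In mtcars the gear column takes only the values 3, 4, and 5, so the result has three rows.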

7.2.5 SQL INSERT INTO

The INSERT INTO statement is used to add rows to a data table.

When using statements such as INSERT INTO, ALTER TABLE, and UPDATE, which modify a data table, it is important to remember that in R sqldf() works on a copy of the data frame provided to it: it applies the changes to the copy and leaves the original untouched.
To get the modified table back, we add a final select statement that returns the copy; to change the original, we assign that result to the object that stored the original.
To run multiple SQL statements in one sqldf() call, we pass them as a character vector built with the c() function.
PROC SQL, on the other hand, makes the changes on the original dataset itself.
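The copy behaviour of sqldf() can be checked directly. The following is a small sketch, assuming sqldf is loaded; cars, rows_before, and after_delete are illustrative names, not objects used elsewhere in this chapter:

```r
library(sqldf)

cars <- mtcars              # an ordinary data frame to experiment on
rows_before <- nrow(cars)

# The delete runs on sqldf's internal copy of cars;
# the trailing select returns that modified copy.
after_delete <- sqldf(c("delete from cars where cyl = 4",
                        "select * from cars"))

nrow(cars) == rows_before   # TRUE: the data frame in R was not modified
nrow(after_delete)          # fewer rows: only the returned copy reflects the delete
```

Because only the returned value carries the change, the examples in this section assign the result of sqldf() back to a data frame.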

To Insert a Row in SAS:

Note that VAR1 holds the values of row.names from the original R dataset.

To delete the existing dataset, we use PROC DELETE:

proc delete data=mtcars;
run;

FILENAME REFFILE '/home/ajay4/sasuser.v94/mtcars.csv';

PROC IMPORT DATAFILE=REFFILE
     DBMS=CSV
     OUT=WORK.mtcars;
     GETNAMES=YES;
RUN;

proc sql;
insert into mtcars values('Maserati Bora',19.0,6,315.0,355,3.84,3.170,14.55,1,0,4,2);
select * from mtcars;
quit;

proc print data=mtcars (firstobs=27 obs=33);
run;

Obs  VAR1           mpg    cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
27   Porsche 914-2  26     4    120.3  91   4.43  2.14   16.7   0   1   5     2
28   Lotus Europa   30.4   4    95.1   113  3.77  1.513  16.9   1   1   5     2
29   Ford Pantera L 15.8   8    351    264  4.22  3.17   14.5   0   1   5     4
30   Ferrari Dino   19.7   6    145    175  3.62  2.77   15.5   0   1   5     6
31   Maserati Bora  15     8    301    335  3.54  3.57   14.6   0   1   5     8
32   Volvo 142E     21.4   4    121    109  4.11  2.78   18.6   1   1   4     2
33   Maserati Bora  19     6    315    355  3.84  3.17   14.55  1   0   4     2

To insert a row in mtcars in R:

data(mtcars)

mtcars=sqldf(c("insert into mtcars values(19.0,6,315.0,355,3.84,3.170,14.55,1,0,4,2)","select * from mtcars"))

tail(mtcars)

    mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
28  30.4  4    95.1   113  3.77  1.513  16.90  1   1   5     2
29  15.8  8    351.0  264  4.22  3.170  14.50  0   1   5     4
30  19.7  6    145.0  175  3.62  2.770  15.50  0   1   5     6
31  15.0  8    301.0  335  3.54  3.570  14.60  0   1   5     8
32  21.4  4    121.0  109  4.11  2.780  18.60  1   1   4     2
33  19.0  6    315.0  355  3.84  3.170  14.55  1   0   4     2

Note the use of the c() function and that we assigned the result of sqldf() to mtcars. The output has 33 rows instead of the initial 32, and the new row holds the inserted values.

7.2.6 SQL Delete

The DELETE statement is used along with WHERE to conditionally delete rows.

In SAS:

Obs	VAR1	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
1	Mazda RX4	21	6	160	110	3.9	2.62	16.46	0	1	4	4
2	Mazda RX4 Wag	21	6	160	110	3.9	2.875	17.02	0	1	4	4
10	Merc 280	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
11	Merc 280C	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
29	Ford Pantera L	15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
30	Ferrari Dino	19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
31	Maserati Bora	15	8	301	335	3.54	3.57	14.6	0	1	5	8

In R:

data(mtcars)
mtcars2=sqldf(c("delete from mtcars where gear=3 or cyl=4","select * from mtcars"))
mtcars2

> mtcars2
   mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
1  21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4
2  21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4
3  19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4
4  17.8  6    167.6  123  3.92  3.440  18.90  1   0   4     4
5  15.8  8    351.0  264  4.22  3.170  14.50  0   1   5     4
6  19.7  6    145.0  175  3.62  2.770  15.50  0   1   5     6
7  15.0  8    301.0  335  3.54  3.570  14.60  0   1   5     8

7.2.7 SQL Aggregate Functions

Aggregate functions perform a calculation on a set of values and return a single value. Some common aggregate functions are min(), max(), avg(), sum(), and count().

Aggregate functions ignore missing values; the one exception is count(*), which counts all rows.

min() gives the minimum value of the column.
max() gives the maximum value of the column.
avg() gives the average of the values in the column.
sum() gives the sum of all values in the column.
count() gives the number of non-missing values in a column.

In SAS:

Here we overwrite the earlier contents of mtcars by using the replace option in PROC IMPORT:

FILENAME REFFILE '/home/ajay4/sasuser.v94/mtcars.csv';

PROC IMPORT DATAFILE=REFFILE replace
     DBMS=CSV
     OUT=WORK.mtcars;
     GETNAMES=YES;
RUN;



proc sql;
create table mtcars2 as
select min(mpg), max(mpg), avg(mpg) from mtcars;
quit;

Proc print data=mtcars2;
Run;

Obs	_TEMG001	_TEMG002	_TEMG003
1	10.4	33.9	20.0906

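Here, create table creates a new table mtcars2. Because the computed columns are not named in the query, SAS assigns them automatic names such as _TEMG001.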

In R:
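A minimal sketch of the same aggregates with sqldf; count(mpg) is added here to show the number of non-missing values, and the object name mpg_stats is illustrative:

```r
library(sqldf)

# One row is returned, with one column per aggregate function
mpg_stats <- sqldf("select min(mpg), max(mpg), avg(mpg), count(mpg) from mtcars")
print(mpg_stats)
```

The minimum, maximum, and average agree with the SAS output above (10.4, 33.9, and about 20.09), and the count is 32 because mpg has no missing values in mtcars.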

7.2.8 SQL ALIASES

Aliases give a column (or a table) a temporary name for the duration of a query. The AS keyword is used within SELECT to do so.

In SAS:

proc sql;
create table mtcars2 as
select min(mpg) as minimum, max(mpg) as maximum, avg(mpg) as average from mtcars;
quit;

Proc print data=mtcars2;
Run;

Obs    minimum    maximum    average
1      10.4       33.9       20.0906

In R:
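A minimal sketch of column aliases with sqldf, mirroring the SAS query above; the object name mpg_summary is illustrative:

```r
library(sqldf)

# AS gives each computed column a readable name in the result
mpg_summary <- sqldf("select min(mpg) as minimum,
                             max(mpg) as maximum,
                             avg(mpg) as average
                      from mtcars")
print(mpg_summary)
names(mpg_summary)
```

Without the aliases, the columns of the returned data frame would be named after the expressions themselves, for example avg(mpg), which is awkward to refer to in later R code.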

7.2.9 SQL ALTER TABLE

ALTER TABLE is used to add, drop, or modify the columns of a table.

ADD, DROP, and MODIFY are the clauses that can be used with ALTER TABLE in PROC SQL.

A newly added column contains null (missing) values by default.

To add a new column called name, of type char, to mtcars:

In SAS:

Variables in Creation Order
#	Variable	Type	Len	Format	Informat
1	VAR1	Char	21	$21.	$21.
2	mpg	Num	8	BEST12.	BEST32.
3	cyl	Num	8	BEST12.	BEST32.
4	disp	Num	8	BEST12.	BEST32.
5	hp	Num	8	BEST12.	BEST32.
6	drat	Num	8	BEST12.	BEST32.
7	wt	Num	8	BEST12.	BEST32.
8	qsec	Num	8	BEST12.	BEST32.
9	vs	Num	8	BEST12.	BEST32.
10	am	Num	8	BEST12.	BEST32.
11	gear	Num	8	BEST12.	BEST32.
12	carb	Num	8	BEST12.	BEST32.
13	name	Char	8

In R:

mtcars3=sqldf(c("alter table mtcars add name char","select * from mtcars"))
head(mtcars3)

   mpg   cyl  disp  hp   drat  wt     qsec   vs  am  gear  carb  name
1  21.0  6    160   110  3.90  2.620  16.46  0   1   4     4     <NA>
2  21.0  6    160   110  3.90  2.875  17.02  0   1   4     4     <NA>
3  22.8  4    108   93   3.85  2.320  18.61  1   1   4     1     <NA>
4  21.4  6    258   110  3.08  3.215  19.44  1   0   3     1     <NA>
5  18.7  8    360   175  3.15  3.440  17.02  0   0   3     2     <NA>
6  18.1  6    225   105  2.76  3.460  20.22  1   0   3     1     <NA>

> str(mtcars3)
'data.frame': 32 obs. of 12 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
$ name: chr NA NA NA NA ...

7.2.10 SQL UPDATE

UPDATE is used to make changes to the rows of a table and is used with the SET and WHERE clause.

In SAS:

Data mtcars;
Set import;
run;

proc sql;
alter table mtcars add name char;
update mtcars set name="Economy" where cyl=4;
update mtcars set name="Luxury" where cyl=6;
update mtcars set name="Muscle" where cyl=8;
quit;

In R:

mtcars=sqldf(c("alter table mtcars add name char","select * from mtcars"))

mtcars=sqldf(c("update mtcars set name ='Economy' where cyl=4","select * from mtcars"))
mtcars=sqldf(c("update mtcars set name ='Luxury' where cyl=6","select * from mtcars"))
mtcars=sqldf(c("update mtcars set name ='Muscle' where cyl=8","select * from mtcars"))
head(mtcars)


> head(mtcars)

   mpg   cyl  disp  hp   drat  wt     qsec   vs  am  gear  carb  name
1  21.0  6    160   110  3.90  2.620  16.46  0   1   4     4     Luxury
2  21.0  6    160   110  3.90  2.875  17.02  0   1   4     4     Luxury
3  22.8  4    108   93   3.85  2.320  18.61  1   1   4     1     Economy
4  21.4  6    258   110  3.08  3.215  19.44  1   0   3     1     Luxury
5  18.7  8    360   175  3.15  3.440  17.02  0   0   3     2     Muscle
6  18.1  6    225   105  2.76  3.460  20.22  1   0   3     1     Luxury

7.2.11 SQL IS NULL

Missing values in SQL are checked by IS NULL or IS NOT NULL. We take the airquality dataset from https://vincentarelbundock.github.io/Rdatasets/csv/datasets/airquality.csv

Example:

To select all rows with missing values in the Ozone column of airquality:

In SAS, we first replace the NA missing-value marker from R with a SAS missing value using compress. We use outobs in PROC SQL to limit the output to five rows.

data import2;
set import2 ;
Ozone=compress(Ozone,"NA","");
Run;

proc sql outobs=5 ;
select * from import2 where Ozone IS NULL;

VAR1      Ozone      SolarR      Wind      Temp      Month      Day
5                    NA          14.3      56        5          5
10                   194         8.6       69        5          10
25                   66          16.6      57        5          25
26                   266         14.9      58        5          26
27                   NA          8         57        5          27

In R:

> sqldf("select * from airquality where Ozone IS NULL LIMIT 5")
     Ozone      Solar.R      Wind      Temp      Month      Day
1    NA         NA           14.3      56        5          5
2    NA         194          8.6       69        5          10
3    NA         66           16.6      57        5          25
4    NA         266          14.9      58        5          26
5    NA         NA           8.0       57        5          27

We chose top 5 rows using LIMIT 5 for printing purposes

7.2.12 SQL LIKE and BETWEEN

The LIKE operator is used along with WHERE and the wildcards % and _ to select rows whose values in a column match a specified pattern:

% matches any string of zero or more characters.
_ matches exactly one character.

In SAS:

In R:

> sqldf("select * from mtcars where name like 'L%' LIMIT 5;")

   mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb  name
1  21.0  6    160.0  110  3.90  2.620  16.46  0   1   4     4     Luxury
2  21.0  6    160.0  110  3.90  2.875  17.02  0   1   4     4     Luxury
3  21.4  6    258.0  110  3.08  3.215  19.44  1   0   3     1     Luxury
4  18.1  6    225.0  105  2.76  3.460  20.22  1   0   3     1     Luxury
5  19.2  6    167.6  123  3.92  3.440  18.30  1   0   4     4     Luxury

The BETWEEN operator can also be used with WHERE; both endpoints are included in the range:

> sqldf("select * from mtcars where mpg between 10 and 14;")

   mpg   cyl  disp  hp   drat  wt     qsec   vs  am  gear  carb  name
1  10.4  8    472   205  2.93  5.250  17.98  0   0   3     4     Muscle
2  10.4  8    460   215  3.00  5.424  17.82  0   0   3     4     Muscle
3  13.3  8    350   245  3.73  3.840  15.41  0   0   3     4     Muscle
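The _ wildcard is not exercised in the examples above. A small sketch, assuming sqldf is loaded: because the car names are stored as row names in mtcars, they are first copied into an ordinary column of an illustrative data frame called models.

```r
library(sqldf)

# Illustrative data frame: copy the row names into a real column
# so that the SQL query can see them.
models <- data.frame(model = rownames(mtcars), mpg = mtcars$mpg,
                     stringsAsFactors = FALSE)

# 'Merc 2_0%' reads: "Merc 2", then any single character, then "0",
# then any string (possibly empty).
merc_2x0 <- sqldf("select model, mpg from models where model like 'Merc 2_0%'")
print(merc_2x0)
```

The pattern matches the Merc models numbered 2x0 (such as Merc 230 and Merc 280C) but not the Merc 450 models, because the character after the space must be 2.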

7.2.13 SQL GROUP BY

The GROUP BY clause is used with aggregate functions to calculate the aggregate separately for each group of values of a column. It is similar to the CLASS statement in PROC MEANS, to summarize() in the Hmisc package in R, and to the by argument in data.table. For multiple grouping variables in R, we use llist() in Hmisc::summarize and .(var1,var2) in data.table.

To calculate the average of mpg for each of the distinct values that cyl can take:

In SAS:

proc sql;
select avg(mpg) as Avg_mpg ,cyl from mtcars group by cyl;

Avg_mpg    cyl
26.6636    4
19.74286   6
15.1       8

proc sql;
select avg(mpg) as Avg_mpg,cyl,gear from mtcars group by cyl,gear ;

Avg_mpg     cyl     gear
21.5        4       3
26.925      4       4
28.2        4       5
19.75       6       3
19.75       6       4
19.7        6       5
15.05       8       3
15.4        8       5

In R:


> sqldf("select avg(mpg),cyl from mtcars group by cyl ;")

   avg(mpg)   cyl
1  26.66364   4
2  19.74286   6
3  15.10000   8

> sqldf("select avg(mpg),cyl,gear from mtcars group by cyl,gear ;")

     avg(mpg)    cyl    gear
1    21.500      4      3
2    26.925      4      4
3    28.200      4      5
4    19.750      6      3
5    19.750      6      4
6    19.700      6      5
7    15.050      8      3
8    15.400      8      5

7.2.14 SQL HAVING

HAVING is used after GROUP BY to select only certain groups from the grouped aggregate values.

To display the average mpg for each value of cyl, keeping only the groups with avg(mpg) > 19 and sorting them in ascending order of avg(mpg):

In SAS:

In R:

> sqldf("select avg(mpg),cyl from mtcars group by cyl having avg(mpg)>19 order by avg(mpg);")

      avg(mpg)     cyl
1     19.74286     6
2     26.66364     4

> sqldf("select avg(mpg),cyl from mtcars group by cyl,gear having avg(mpg)>19 order by avg(mpg);")

      avg(mpg)     cyl
1     19.700       6
2     19.750       6
3     19.750       6
4     21.500       4
5     26.925       4
6     28.200       4

7.2.15 SQL CREATE TABLE and SQL CONSTRAINTS

The CREATE TABLE statement is used to create a table in SQL. NOT NULL and UNIQUE are constraints in SQL. In the example below they are written after the column's type to specify that values in the user_id column cannot be missing and cannot be repeated, respectively.

In SAS:

proc sql;
create table user(
 user_id integer not null unique ,
 name character,
 age integer)
 ;

proc sql;
insert into user values(1,'John',19);
insert into user values(2,'Sarah',20);
insert into user values(3,'Jack',21) ;

proc print data=user;run;

Obs	user_id	Name	Age
1	1	John	19
2	2	Sarah	20
3	3	Jack	21

proc sql;
create table book(
book_id integer not null unique,
book_name character,
book_author character
);

proc sql;
insert into book values (1,'Inferno', 'Dan Brown');
insert into book values (2,'Deception Point', 'Dan Brown');
insert into book values (3,'Witches','Roald Dahl');
insert into book values (4,'Hunger Games','Suzanne Collins') ;

proc sql;
alter table book
add primary key (book_id);

proc print data=book;run;

Figure 7.1 Proc SQL in SAS.

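Note: The string values have been truncated because a character column declared without a length in PROC SQL defaults to 8 characters; declaring the column as, for example, char(20) avoids this.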

In R:

In R, a single INSERT INTO statement can insert several rows, with the data values for each row separated by commas, whereas the SAS example above uses one INSERT INTO statement per row.
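A hedged sketch of the R version, assuming sqldf with its default SQLite backend (a SQLite version that supports multi-row VALUES lists is also assumed). The table and column names follow the SAS example; the object name user_tbl is illustrative:

```r
library(sqldf)

# All statements run in the same temporary SQLite database;
# the result of the last statement is returned to R.
user_tbl <- sqldf(c(
  "create table user(
     user_id integer not null unique,
     name character,
     age integer)",
  # One INSERT INTO statement, three rows separated by commas
  "insert into user values (1, 'John', 19), (2, 'Sarah', 20), (3, 'Jack', 21)",
  "select * from user"
))
print(user_tbl)
```

The table exists only for the duration of the sqldf() call, so the final select is what brings its contents back into R as a data frame.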

PRIMARY KEY is another constraint in SQL which is used to uniquely identify a row in a table.

Columns with only the UNIQUE constraint can have null values, whereas PRIMARY KEY columns cannot. A table can have several columns with a UNIQUE constraint but only one PRIMARY KEY (which may itself span more than one column).

7.2.16 SQL UNION

The UNION operator combines the result sets of two select statements and returns every observation that lies in at least one of them, with duplicate rows removed. UNION ALL keeps duplicates, so an observation that lies in both result sets appears twice.

To select all observations which have either cyl=6 or gear=4 or both

In SAS:

In R:
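A minimal sketch of UNION and UNION ALL with sqldf on the built-in mtcars data; the object names either and either_all are illustrative:

```r
library(sqldf)

# UNION removes duplicate rows, so cars with cyl = 6 AND gear = 4 appear once.
either <- sqldf("select * from mtcars where cyl = 6
                 union
                 select * from mtcars where gear = 4")

# UNION ALL keeps duplicates, so those cars appear twice.
either_all <- sqldf("select * from mtcars where cyl = 6
                     union all
                     select * from mtcars where gear = 4")

nrow(either)      # rows matching at least one condition
nrow(either_all)  # larger, by the number of rows matching both conditions
```

The same set of rows as the UNION result could also be obtained with a single query using where cyl = 6 or gear = 4; UNION is most useful when the two result sets come from different tables with the same columns.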

7.2.17 SQL JOINS

SQL JOINS can be used to merge data from more than one table into a single table.

INNER JOIN

An inner join selects only those records whose value in the join column appears in both tables.

LEFT JOIN

A left join selects all records from the first (left) table, together with the matching records from the second table; where there is no match, the columns from the second table are null.

Let's take three tables: issued, book, and user. The SAS code that creates them is given in Section 7.3.

In SAS – Inner Join:

In SAS – Left Join:

In R:
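A hedged sketch of both joins with sqldf. The two data frames are typed in by hand from the issued and user tables listed in Section 7.3, and the result names inner_j and left_j are illustrative:

```r
library(sqldf)

# Hand-built copies of the issued and user tables from Section 7.3
issued <- data.frame(issue_id = 1:3,
                     user_id  = c(2L, 2L, 3L),
                     book_id  = c(1L, 3L, 4L))
user <- data.frame(user_id = 1:3,
                   name    = c("John", "Sarah", "Jack"),
                   age     = c(19L, 20L, 21L),
                   stringsAsFactors = FALSE)

# Inner join: only users who appear in issued are returned
inner_j <- sqldf("select u.user_id, u.name, i.issue_id, i.book_id
                  from user u inner join issued i
                  on u.user_id = i.user_id
                  order by u.user_id, i.issue_id")

# Left join: every user is returned; users with no issued book
# get missing values in the columns that come from issued
left_j <- sqldf("select u.user_id, u.name, i.issue_id, i.book_id
                 from user u left join issued i
                 on u.user_id = i.user_id
                 order by u.user_id, i.issue_id")

print(inner_j)
print(left_j)
```

John has not been issued a book, so he is absent from the inner join and appears in the left join with missing values for issue_id and book_id.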

There are many types of JOINs in SQL:

Inner Join
Full Outer Join
Left Outer Join
Right Outer Join
Self-Join
Cross Join

7.3 Merges

We use the MERGE statement in SAS and the merge() function in R and compare them with SQL joins. First, we create the tables in SAS and then export the data so that it can be imported into R.

proc sql;
create table book(
book_id integer not null unique,
book_name character,
book_author character
);

proc sql;
insert into book values(1,'Inferno', 'Dan Brown');
insert into book values(2,'Deception Point','Dan Brown');
insert into book values (3,'Witches','Roald Dahl');
insert into book values (4,'Hunger Games','Suzanne Collins') ;

proc sql;
alter table book
add primary key (book_id);

proc print data=book;run;

proc sql;
create table user(
user_id integer not null unique ,
name character,
age integer)
;

proc sql;
insert into user values(1,'John',19);
insert into user values(2,'Sarah',20);
insert into user values(3,'Jack',21) ;

proc print data=user;run;

proc sql;
create table issued(
issue_id integer not null unique,
user_id integer ,
book_id integer
);

proc sql;
insert into issued values(1,2, 1);
insert into issued values(2,2,3);
insert into issued values (3,3,4);

proc sql;
alter table issued
add primary key (issue_id );

proc print data=issued ;run;

Obs	book_id	book_name	book_author
1	1	Inferno	Dan Brow
2	2	Deceptio	Dan Brow
3	3	Witches	Roald Da
4	4	Hunger G	Suzanne

Obs	user_id	name	age
1	1	John	19
2	2	Sarah	20
3	3	Jack	21

Obs	issue_id	user_id	book_id
1	1	2	1
2	2	2	3
3	3	3	4

We first use the libname statement to make the datasets permanent.

libname book '/home/ajay4/book';

data book.issued;
set issued;
run;

data book.user;
set user;
run;

data book.book;
set book;
run;

We merge the issued and user datasets by user_id. Both datasets must first be sorted by the BY variable:

proc sort data=issued;
by user_id;
run;


proc sort data=user;
by user_id;
run;

data issueduser;
merge issued(in=a) user(in=b);
by user_id;
run;

proc print data=issueduser;
Run;

Obs   issue_id     user_id     book_id     name     age
1     .            1           .           John     19
2     1            2           1           Sarah    20
3     2            2           3           Sarah    20
4     3            3           4           Jack     21

Let's try a left join, keeping only the rows that come from issued (in=a):

data issueduser;
merge issued(in=a) user(in=b);
by user_id;
if a;
run;

proc print data=issueduser;
Run;

Obs     issue_id     user_id     book_id     name     age
1       1            2           1           Sarah    20
2       2            2           3           Sarah    20
3       3            3           4           Jack     21

We download the datasets using the download button.

Figure 7.13 Download Data in SAS Studio.

We then read them in R using the following:

> install.packages("sas7bdat")
> library("sas7bdat")
> setwd('C:/Users/ajayohri/Music')

> dir()
[1] "book.csv" "book.sas7bdat" "desktop.ini" "issued.sas7bdat"
[5] "user.sas7bdat" "Videos - Shortcut.lnk"

> book=read.sas7bdat('book.sas7bdat')
> issued=read.sas7bdat('issued.sas7bdat')
> user=read.sas7bdat('user.sas7bdat')

> book
   book_id  book_name  book_author
1  1        Inferno    Dan Brow
2  2        Deceptio   Dan Brow
3  3        Witches    Roald Da
4  4        Hunger G   Suzanne

> issued
   issue_id  user_id  book_id
1  1         2        1
2  2         2        3
3  3         3        4

> user
   user_id  name   age
1  1        John   19
2  2        Sarah  20
3  3        Jack   21

> merged1= merge(issued,user,by='user_id',all.x=T)
> merged1
   user_id  issue_id  book_id  name   age
1  2        1         1        Sarah  20
2  2        2         3        Sarah  20
3  3        3         4        Jack   21

The all.x=T argument creates the left join here.

For other joins between two data frames df1 and df2:

Inner join: merge(df1, df2, by="common_key_column")
Outer join: merge(df1, df2, by="common_key_column", all=TRUE)
Left outer join: merge(df1, df2, by="common_key_column", all.x=TRUE)
Right outer join: merge(df1, df2, by="common_key_column", all.y=TRUE)
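The four merge() variants can be compared on the small tables from this section. The following sketch uses base R only; the data frames are typed in by hand rather than read from the SAS files, and the result names are illustrative:

```r
# Hand-built copies of the issued and user tables shown above
issued <- data.frame(issue_id = 1:3,
                     user_id  = c(2L, 2L, 3L),
                     book_id  = c(1L, 3L, 4L))
user <- data.frame(user_id = 1:3,
                   name    = c("John", "Sarah", "Jack"),
                   age     = c(19L, 20L, 21L),
                   stringsAsFactors = FALSE)

inner_m <- merge(issued, user, by = "user_id")                # matches only
outer_m <- merge(issued, user, by = "user_id", all = TRUE)    # all rows of both
left_m  <- merge(issued, user, by = "user_id", all.x = TRUE)  # all rows of issued
right_m <- merge(issued, user, by = "user_id", all.y = TRUE)  # all rows of user

# Number of rows produced by each kind of join
sapply(list(inner = inner_m, outer = outer_m,
            left = left_m, right = right_m), nrow)
```

Every user_id in issued also exists in user, so the inner and left joins give the same three rows here, while the outer and right joins add a fourth row for John with missing issue_id and book_id.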

7.4 Summary

SQL can be used in both R and SAS. We can use the sqldf package to use SQL in R and PROC SQL to use SQL in SAS.

The basic syntax for using sql in R is:

The basic syntax for using SQL in SAS is:

SAS needs the data to be sorted by the BY variable before a merge. We can also merge data using SQL joins.

7.5 Quiz Questions

What does SQL stand for?
Name a package that can be used to use SQL in R.
Which proc statement is used to use SQL in SAS?
Which function is used to combine multiple SQL queries in a single call to sqldf()?
What is * used for in the following select statement?
```
select * from mydata;
```
Which clause is used to add new values and make changes to a row of a table?
Name some SQL constraints.
What is HAVING used for in SQL?
Name some SQL aggregate functions.
Explain the difference between INNER JOIN and LEFT JOIN in SQL.

Quiz Answers

Structured Query Language
sqldf
PROC SQL
c()
* is used to select all columns from a data table.
UPDATE is used to make changes to the rows of a table and is used with the SET and WHERE clauses.
NOT NULL, UNIQUE, PRIMARY KEY.
HAVING is used after GROUP BY to select only certain groups from the grouped aggregate values.
min(), max(), avg(), sum(), and count() are some SQL aggregate functions.
INNER JOIN is used to select only those records which have matching values in the specified column whereas LEFT JOIN selects all records from the first table along with the records which have matching values in the specified column.
