2.2. Scatter Plots

Scatter plots are among the most basic and useful graphical techniques. A scatter plot of two variables is simply a two-dimensional display of the observed pairs as points in a plane, showing the relationship between the two variables. It is most useful for identifying the type of relationship (linear or nonlinear) between them. Further, if the relationship is linear, the plot helps determine whether it is positive or negative. This section uses various SAS procedures to draw scatter plots in two and three dimensions. When there are more than two variables, scatter plots of two variables at a time can be displayed in a matrix of plots.

2.2.1. Two-Dimensional Scatter Plots

Two-dimensional scatter plots can be drawn using the PLOT or GPLOT procedures. The SAS code shown in the first two parts of Program 2.1 produces two scatter plots using the GPLOT procedure. The first of the optional statements that precede the GPLOT procedure in the program, the FILENAME statement, specifies the name of the file where the graph will be stored in PostScript form (PROG21a.GRAPH in our program). The GOPTIONS statement that follows is essentially used to specify the device name (DEV=PSLMONO in our program). The DEV=PSLMONO specification instructs SAS to store the graph in black and white in PostScript form. The choice of DEV=PS can be used for a color graph. Of course, if the PLOT procedure is used to produce the plots, then there is no need to include any of the GOPTIONS statements in the program.

/* Program 2.1 */

filename gsasfile "prog21a.graph";
    goptions reset=all gaccess=gsasfile autofeed dev=pslmono;
    goptions horigin=1in vorigin=2in;
    goptions hsize=6in vsize=8in;
    options ls=64 ps=40 nodate nonumber;
    title1 h=1.5 'Two Dimensional Scatter Plot ';
    title2 j=l 'Output 2.1';
    title3 'Cork Data: Source C.R. Rao (1948)';
    data cork;
    infile 'cork.dat';
    input n e s w; * n:north,e:east,s:south,w:west;
    run;
    proc gplot data=cork;
    plot n*e='star';
    label n='Direction: North'
          e='Direction: East';
    run;

    data d1;
    set cork;
    y1=n-s;
    y2=e-w;
    run;
    filename gsasfile "prog21b.graph";
    title1 h=1.5 'Two Dimensional Scatter Plot of Contrasts';
    title2 j=l 'Output 2.1';
    title3 'Cork Data: Source C.R. Rao (1948)';
    proc gplot data=d1;
    plot y1*y2='star';
    label y1='Contrast: North-South'
          y2='Contrast: East-West';
    run;

    data d2 d3;
    set cork;
    proc sort data=d2;
    by n;
    data d3;
    set d3;
    n_decr=n;
    drop n;
    proc sort data=d3;
    by descending n_decr;
    data both;
    merge d2 d3;
    filename gsasfile "prog21c.graph";
    title1 h=1.5 'Testing Symmetry of Data on North Direction';
    title2 j=l 'Output 2.1';
    title3 'Cork Data: Source C.R. Rao (1948)';
    proc gplot data=both;
    plot n_decr*n='star';
    label n_decr='Descending Ordered Data'
            n='Ascending Ordered Data';
    run;

The first part of the program plots the data corresponding to the directions of north (N) and east (E), and the second part plots the contrasts of the directions north (N) and south (S) (Y1=N-S) against those of the directions east (E) and west (W) (Y2=E-W). These are shown in Output 2.1. The statement PLOT Y1*Y2 in the program plots the variable Y1 versus variable Y2. That is, the variable listed first in the PLOT statement is plotted on the vertical axis and the other variable is plotted on the horizontal axis. The code

proc gplot;
plot y1*y2;

uses the default symbol, +, in the plot. A statement of the form

plot y1*y2='char';

can be used to specify a plotting symbol, where 'CHAR' stands for the user-specified character or symbol. The choice of 'star' for 'CHAR' is used in Output 2.1. The size of the plot can be controlled by the PAGESIZE= (or simply PS=) and LINESIZE= (or LS=) options.
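For comparison, here is a minimal sketch of the first plot of Program 2.1 drawn with the PLOT procedure. It writes to the output listing, so no FILENAME or GOPTIONS statements are needed. The sketch assumes that the data set CORK from Program 2.1 has already been created; the plotting character '*' and the title text are our own choices.

```sas
/* Sketch only: line-printer version of the first plot in Program 2.1. */
/* Assumes the data set CORK (variables n, e, s, w) already exists.    */
options ls=64 ps=40 nodate nonumber;   /* LS= and PS= set the plot size  */
title1 'Two Dimensional Scatter Plot (PROC PLOT)';
proc plot data=cork;
   plot n*e='*';                       /* n on vertical, e on horizontal */
   label n='Direction: North'
         e='Direction: East';
run;
```

Changing the LS= and PS= values in the OPTIONS statement changes the width and height of the printed plot.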

In a scatter plot, if the points follow an increasing straight line pattern then there may be a positive correlation between the two variables. This pattern indicates that as one variable increases the other increases also. On the other hand, if the points follow a decreasing straight line pattern then there may be a negative correlation indicating that one variable is decreasing as the other variable is increasing. If the points are randomly scattered in the plane then there may be only a weak or no correlation between the two variables. The first scatter plot in Output 2.1 indicates that there is a positive correlation between the cork weights in the directions of north and east. On the other hand, the second scatter plot in Output 2.1 suggests the possibility of a weak or no correlation between the two contrasts, Y1=N-S and Y2=E-W.

Several variations of the scatter plot exist for special purposes. For example, scatter plots have been used to examine the symmetry of the distribution of univariate data (Gnanadesikan, 1997). We briefly discuss this approach. Suppose y1,..., yn are n observations on a variable y. To examine whether the distribution of y is symmetric, we order the data from smallest to largest as y(1) ≤ ... ≤ y(n). If the distribution of y is symmetric about a number, say μ, then the points of the paired data (y(1), y(n)), (y(2), y(n - 1)),..., (y(n), y(1)) should fall approximately around a line with slope −1 and intercept 2μ.
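The slope and intercept follow from a short argument, sketched here under the idealization that the ordered sample is exactly symmetric about μ (real data satisfy this only approximately). Symmetry means that the i-th smallest observation lies about as far below μ as the i-th largest lies above it:

```latex
\begin{align*}
y_{(n-i+1)} - \mu &\approx \mu - y_{(i)}, \qquad i = 1, \dots, n, \\
y_{(n-i+1)} &\approx 2\mu - y_{(i)} .
\end{align*}
```

Plotting y(n - i + 1) on the vertical axis against y(i) on the horizontal axis therefore gives points near a line with slope −1 and intercept 2μ.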

Example 2.1. Output 2.1


In Program 2.1, the last few SAS statements examine the symmetry of the cork data in the north direction (N) only. Using the sorted data sets D2 and D3, which arrange the observations on N in increasing and decreasing order, respectively, we create a data set named BOTH that pairs the observations as (y(1), y(n)), (y(2), y(n - 1)),..., (y(n), y(1)). These pairs are then plotted using the GPLOT procedure. The scatter plot shows a certain degree of departure from symmetry.



Gnanadesikan also suggests another scatter plot for symmetry, in which the pairs are defined not in terms of the original observations but in terms of deviations from the median, say m, of the data. Specifically, the paired values (m - y(1), y(n) - m), (m - y(2), y(n - 1) - m),..., (m - y(n), y(1) - m) are plotted. If the original distribution is symmetric, the points should form a linear pattern along a line with slope 1 and intercept zero. The SAS code of Program 2.1 can be easily modified for this plot.
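One possible modification is sketched below. It is not the authors' code. It assumes that the data set CORK from Program 2.1 exists, and the names MED, D_ASC, D_DESC, DEV, M, LOWER, and UPPER are hypothetical names introduced only for this illustration.

```sas
/* Sketch only: symmetry plot based on deviations from the median.   */
/* Assumes the data set CORK of Program 2.1 already exists.          */
proc univariate data=cork noprint;
   var n;
   output out=med median=m;          /* one-row data set holding m   */
run;

proc sort data=cork out=d_asc(keep=n);
   by n;                             /* y(1), ..., y(n)              */
run;

proc sort data=cork out=d_desc(keep=n);
   by descending n;                  /* y(n), ..., y(1)              */
run;

data dev;
   if _n_ = 1 then set med(keep=m);  /* m is retained on every row   */
   merge d_asc d_desc(rename=(n=n_decr));  /* pairs y(i), y(n-i+1)   */
   lower = m - n;                    /* m - y(i)                     */
   upper = n_decr - m;               /* y(n-i+1) - m                 */
run;

proc gplot data=dev;
   plot upper*lower;                 /* upper on the vertical axis   */
   label upper='Upper Value minus Median'
         lower='Median minus Lower Value';
run;
```

Under symmetry the points should lie near the line through the origin with slope 1, so systematic departures from that line suggest skewness.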



2.2.2. Three-Dimensional Scatter Plots

A three-dimensional scatter plot is needed to simultaneously display the relationships between three variables. The SCATTER statement in the G3D procedure can be used to draw a three-dimensional scatter plot. The code given in Program 2.2 produces a scatter plot of the variables N, E, and S by taking the variables N and S on the horizontal plane and E on the axis perpendicular to the plane as displayed in Output 2.2.

/* Program 2.2 */

filename gsasfile "prog22.graph";
    goptions reset=all gaccess=gsasfile autofeed dev=pslmono;
    goptions horigin=1in vorigin=2in;
    goptions hsize=6in vsize=8in;
    options ls=64 ps=45 nodate nonumber;
    data cork;
    infile 'cork.dat';
    input n e s w;
    title1 h=1.5 'Three-D Scatter Plot for Cork Data';
    title2 j=l 'Output 2.2';
    title3 'by weight of cork boring';
    title4 'Source: C.R. Rao (1948)';
    footnote1 j=l 'N:Cork boring in North'
              j=r 'E:Cork boring in East';
    footnote2 j=l 'S:Cork boring in South'
              j=r 'W:West boring is not shown';
    proc g3d data=cork;
    scatter n*s=e;
    run;

Example 2.2. Output 2.2


Notice that the SCATTER statement in Program 2.2 plots the values of the variables N and S on the horizontal plane and those of E on the axis perpendicular to that plane. The options J=L and J=R in the TITLE and FOOTNOTE statements left-justify and right-justify the text, respectively.

As in the two-dimensional scatter plot, if the points follow a pattern in space then there may be correlations between some or all pairs of the three variables. If the points are scattered randomly in space then there may be only weak or no correlation among the three variables. For example, the scatter plot of the three variables N, S, and E indicates that the points have an increasing pattern not only in the horizontal plane but also in the perpendicular direction. This seems to indicate that there is a positive correlation between N and S, between S and E, and between N and E.
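A pattern in three dimensions can be easier to judge from more than one viewing angle. The following sketch redraws the plot of Program 2.2 using the ROTATE=, TILT=, and SHAPE= options of the SCATTER statement. It assumes that the data set CORK and the GOPTIONS settings of Program 2.2 are in effect; the particular option values are illustrative choices, not recommendations.

```sas
/* Sketch only: alternative view of the plot in Program 2.2.        */
/* Assumes the data set CORK (variables n, e, s, w) already exists. */
proc g3d data=cork;
   scatter n*s=e / shape='balloon'   /* plotting symbol              */
                   rotate=30         /* rotation angle, in degrees   */
                   tilt=70;          /* tilt angle, in degrees       */
run;
```

Rerunning the step with a few different ROTATE= values is a simple way to check that an apparent pattern is not an artifact of one particular viewpoint.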

Program 2.3 generates a three-dimensional scatter plot for the three contrasts C1=N-E-W+S, C2=N-S, and C3=E-W shown in Output 2.3.

/* Program 2.3 */

filename gsasfile "prog23.graph";
    goptions gaccess=gsasfile autofeed dev=pslmono;
    goptions horigin=1in vorigin=2in;
    goptions hsize=6in vsize=8in;
    options ls=64 ps=40 nodate nonumber;
    data cork;
    infile 'cork.dat';
    input n e s w;
    c1=n-e-w+s;
    c2=n-s;
    c3=e-w;
    title1 h=1.5 'Three-Dimensional Scatter Plot for Cork Data';
    title2 j=l 'Output 2.3';
    title3 'Contrasts of weights of cork boring';
    title4 'Source: C.R. Rao (1948)';
    footnote1 j=l 'C1:Contrast N-E-W+S'
              j=r 'C2:Contrast N-S';
    footnote2 j=r 'C3:Contrast E-W';
    proc g3d data=cork;
    scatter c1*c2=c3;
    run;

This scatter plot seems to show weak or no correlation among the three contrasts except perhaps between C2 and C3.

2.2.3. Scatter Plot Matrix

For multivariate data with p variables, y1,...,yp, a scatter plot of each pair of variables can be displayed in a p by p matrix of scatter plots. In this matrix the scatter plot of two different variables yi and yj is in the (i, j)th position of the matrix. The diagonal positions are usually used for writing descriptive comments. The scatter plot matrix is a useful way of representing multivariate data on a single two-dimensional display. It simultaneously identifies the relationships between various variables. In this sense it is a graphic analog of a correlation matrix. However, it may sometimes be more effective in that apart from the strength of linear relationships, any nonlinearities can also be easily spotted.

Example 2.3. Output 2.3


A macro for drawing a scatter plot matrix is given in Friendly (1991). However, a version of a scatter plot matrix can also be drawn very easily using SAS/INSIGHT software. See Section 2.9 for a brief description of the software. Program 2.4 produces a scatter plot matrix in a compact lower triangular form presented in Output 2.4.

/* Program 2.4 */

filename gsasfile "prog24.graph";
    goptions reset=all gaccess=gsasfile autofeed dev=pslmono;
    goptions horigin=1in vorigin=2in;
    goptions hsize=6in vsize=8in;
    options ls=64 ps=40 nodate nonumber;
    title1 h=1.5 'Scatter Plot Matrix for Cork Data';
    title2 'Output 2.4';
    data cork;
    infile 'cork.dat';
    input n e s w;
    proc insight data=cork;
    scatter n e s w * n e s w;
    run;

Example 2.4. Output 2.4


The plot indicates that there is a positive correlation between every pair of variables in the four directions. The correlation seems to be strongest between the variables S and W, but weakest between the variables E and W.
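Because the scatter plot matrix is a graphic analog of the correlation matrix, visual impressions such as these can be checked against the sample correlations. A minimal sketch, assuming the data set CORK from Program 2.4 already exists:

```sas
/* Sketch only: sample correlation matrix for the four directions.  */
/* Assumes the data set CORK (variables n, e, s, w) already exists. */
proc corr data=cork nosimple;   /* NOSIMPLE omits the summary table  */
   var n e s w;
run;
```

The (i, j)th entry of the printed matrix corresponds to the panel in the (i, j)th position of the scatter plot matrix.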

It may sometimes be cumbersome to represent all the variables on a matrix plot, especially if the number of variables is large. In order to visually extract the maximum information possible from these plots it may be necessary to restrict the choice to a moderate number of variables (say 5 or 6) at a time.
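On SAS releases that include the SGSCATTER procedure (SAS 9.2 and later, which is an assumption about the reader's installation and not the software used in Program 2.4), a scatter plot matrix for a chosen subset of variables can be sketched as follows. The subset N, E, and S is only an illustration, and the data set CORK is assumed to exist.

```sas
/* Sketch only: scatter plot matrix for a subset of the variables.  */
/* Assumes the data set CORK (variables n, e, s, w) already exists. */
proc sgscatter data=cork;
   matrix n e s / diagonal=(histogram);  /* histograms on diagonal   */
run;
```

This procedure draws its output through ODS Graphics, so the GOPTIONS statements used in the earlier programs do not apply to it.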
