Chapter 24

Advance Descriptive Statistics

Abstract

SQL added descriptive stats to Standard SQL. This makes more elaborate reporting possible.

Keywords

SQL:2006

ALL

DISTINCT

CORR

Correlation coefficient

COVAR_POP

Population covariance

COVAR_SAMP

Sample covariance

Covariance

CUME_DIST

DISTINCT

PERCENTILE_CONT

PERCENTILE_DISC

Population

Population variance

REGR_AVGX

Average of independent variable

REGR_AVGY

Average of dependent variable

REGR_COUNT

REGR_INTERCEPT

y-intercept

REGR_R2

Square of correlation coefficient

REGR_SLOPE

Slope of least-squares-fit linear equation

REGR_SXX

Sum of squares of independent variable expression

REGR_SXY

Sum of products

REGR_SYY

Sum of squares of dependent variable expression

STDDEV_POP

Population standard deviation

STDDEV_SAMP

Sample standard deviation

VAR_POP population variance

VAR_POP

VAR_SAMP

STDDEV_POP

STDDEV_SAMP

VAR_SAMP

Sample variance

Descriptive statistics are better done in the presentation layers of the system than the database. Those tools will do floating point math in ways that prevent errors. However, the SQL:2006 Standard defines many common descriptive statistics in terms of the basic aggregate functions with a DISTINCT and OVER() clauses.

24.1 Binary Table Functions

They are elaborate and you need to know at least freshman statistics to use them, which I will not explain here. They use tables that have two columns, the < dependent variable expression > and the < independent variable expression >. The functions are

VAR_POP = population variance, defined as the sum of squares of the difference of < value expression > from the mean of < value expression >, divided by the number of rows.

VAR_SAMP = the sample variance of < value expression >, defined as the sum of squares of the difference of < value expression > from the mean of < value expression >, divided by the number of rows remaining minus 1.

STDDEV_POP = the population standard deviation of < value expression >, defined as the square root of the population variance.

STDDEV_SAMP = the sample standard deviation of < value expression >, defined as the square root of the sample variance.

Neither DISTINCT nor ALL are allowed to be specified for VAR_POP, VAR_SAMP, STDDEV_POP, or STDDEV_SAMP; redundant duplicates are not removed when computing these functions.

REGR_COUNT = the number of rows remaining in the group.

COVAR_POP = the population covariance.

COVAR_SAMP = the sample covariance, defined as the sum of products of the difference of < independent variable expression > from its mean times the difference of < dependent variable expression > from its mean, divided by the number of rows remaining minus 1 (one).

CORR = the correlation coefficient, defined as the ratio of the population covariance divided by the product of the population standard deviation of < independent variable expression > and the population standard deviation of < dependent variable expression >.

REGR_R2 = the square of the correlation coefficient.

REGR_SLOPE = the slope of the least-squares-fit linear equation determined by the (< independent variable expression >, < dependent variable expression >) pairs.

REGR_INTERCEPT = the y-intercept of the least-squares-fit linear equation determined by the (< independent variable expression >, < dependent variable expression >) pairs.

REGR_SXX = the sum of squares of < independent variable expression >.

REGR_SYY = the sum of squares of < dependent variable expression >.

REGR_SXY = the sum of products of < independent variable expression > times < dependent variable expression >.

REGR_AVGX = the average of < independent variable expression >.

REGR_AVGY = the average of < dependent variable expression >.

There are two inverse distribution functions, PERCENTILE_CONT and PERCENTILE_DISC. Both inverse distribution functions specify an argument and an ordering of a value expression. The value of the argument should be between 0 (zero) and 1 (one) inclusive. The value expression is evaluated for each row of the group, NULLs are discarded, and the remaining rows are ordered. The computation concludes:

PERCENTILE_CONT = by considering the pair of consecutive rows that are indicated by the argument, treated as a fraction of the total number of rows in the group, and interpolating the value of the value expression evaluated for these rows.

PERCENTILE_DISC = by treating the group as a window partition of the CUME_DIST window function, using the specified ordering of the value expression as the window ordering, and returning the first value expression whose cumulative distribution value is greater than or equal to the argument.

If you need other statistics, consult your vendor’s documentation for what functions they have implemented.

24.2 Correlation

A correlation is a measure of how much alike two sets of data are. There are many types of correlation in statistics, but the most common one is Pearson’s r. It is worth be able to do without using the CORR() function if your SQL odes not have it.

This version of Pearson’s r is due to van Heusden consulting in the Netherlands (http://www.vanheusden.com/misc/pearson.php). The data are the movie ratings of a set of films by all the pairs of reviewers. The population is (n) movies.

CREATE TABLE Movie_Reviews
(reviewer VARCHAR(15) NOT NULL,
movie_title VARCHAR(15) NOT NULL,
PRIMARY KEY (user, movie_title)
rating DECIMAL(2,1) NOT NULL
CHECK (rating BETWEEN 1.0 AND 5.0);

The computation is

SELECT reviewer_1, reviewer_2,
 ((psum - (sum1 * sum2 / n))
/ SQRT((sum1sq - POWER(sum1, 2.0) / n)
* (sum2sq - POWER(sum2, 2.0) / n))) AS pearson_r, n
FROM (SELECT N1.reviewer AS reviewer_1, N2.reviewer AS reviewer_2,
 SUM(N1.rating) AS sum1, SUM(N2.rating) AS sum2,
 SUM(N1.rating * N1.rating) AS sum1sq,
 SUM(N2.rating * N2.rating) AS sum2sq,
 SUM(N1.rating * N2.rating) AS psum,
 COUNT(*) AS N
FROM Movie_Reviews AS N1
 LEFT OUTER JOIN
 Movie_Reviews AS N2
 ON N1.movie_title = N2.movie_title
 WHERE N1.reviewer > N2.reviewer
 GROUP BY N1.reviewer, N2.reviewer)
AS step1;
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.251.70