Descriptive statistics are better done in the presentation layers of the system than the database. Those tools will do floating point math in ways that prevent errors. However, the SQL:2006 Standard defines many common descriptive statistics in terms of the basic aggregate functions with a DISTINCT and OVER() clauses.
They are elaborate and you need to know at least freshman statistics to use them, which I will not explain here. They use tables that have two columns, the < dependent variable expression > and the < independent variable expression >. The functions are
VAR_POP = population variance, defined as the sum of squares of the difference of < value expression > from the mean of < value expression >, divided by the number of rows.
VAR_SAMP = the sample variance of < value expression >, defined as the sum of squares of the difference of < value expression > from the mean of < value expression >, divided by the number of rows remaining minus 1.
STDDEV_POP = the population standard deviation of < value expression >, defined as the square root of the population variance.
STDDEV_SAMP = the sample standard deviation of < value expression >, defined as the square root of the sample variance.
Neither DISTINCT nor ALL are allowed to be specified for VAR_POP, VAR_SAMP, STDDEV_POP, or STDDEV_SAMP; redundant duplicates are not removed when computing these functions.
REGR_COUNT = the number of rows remaining in the group.
COVAR_POP = the population covariance.
COVAR_SAMP = the sample covariance, defined as the sum of products of the difference of < independent variable expression > from its mean times the difference of < dependent variable expression > from its mean, divided by the number of rows remaining minus 1 (one).
CORR = the correlation coefficient, defined as the ratio of the population covariance divided by the product of the population standard deviation of < independent variable expression > and the population standard deviation of < dependent variable expression >.
REGR_R2 = the square of the correlation coefficient.
REGR_SLOPE = the slope of the least-squares-fit linear equation determined by the (< independent variable expression >, < dependent variable expression >) pairs.
REGR_INTERCEPT = the y-intercept of the least-squares-fit linear equation determined by the (< independent variable expression >, < dependent variable expression >) pairs.
REGR_SXX = the sum of squares of < independent variable expression >.
REGR_SYY = the sum of squares of < dependent variable expression >.
REGR_SXY = the sum of products of < independent variable expression > times < dependent variable expression >.
REGR_AVGX = the average of < independent variable expression >.
REGR_AVGY = the average of < dependent variable expression >.
There are two inverse distribution functions, PERCENTILE_CONT and PERCENTILE_DISC. Both inverse distribution functions specify an argument and an ordering of a value expression. The value of the argument should be between 0 (zero) and 1 (one) inclusive. The value expression is evaluated for each row of the group, NULLs are discarded, and the remaining rows are ordered. The computation concludes:
PERCENTILE_CONT = by considering the pair of consecutive rows that are indicated by the argument, treated as a fraction of the total number of rows in the group, and interpolating the value of the value expression evaluated for these rows.
PERCENTILE_DISC = by treating the group as a window partition of the CUME_DIST window function, using the specified ordering of the value expression as the window ordering, and returning the first value expression whose cumulative distribution value is greater than or equal to the argument.
If you need other statistics, consult your vendor’s documentation for what functions they have implemented.
A correlation is a measure of how much alike two sets of data are. There are many types of correlation in statistics, but the most common one is Pearson’s r. It is worth be able to do without using the CORR() function if your SQL odes not have it.
This version of Pearson’s r is due to van Heusden consulting in the Netherlands (http://www.vanheusden.com/misc/pearson.php). The data are the movie ratings of a set of films by all the pairs of reviewers. The population is (n) movies.
CREATE TABLE Movie_Reviews (reviewer VARCHAR(15) NOT NULL, movie_title VARCHAR(15) NOT NULL, PRIMARY KEY (user, movie_title) rating DECIMAL(2,1) NOT NULL CHECK (rating BETWEEN 1.0 AND 5.0);
The computation is
SELECT reviewer_1, reviewer_2, ((psum - (sum1 * sum2 / n)) / SQRT((sum1sq - POWER(sum1, 2.0) / n) * (sum2sq - POWER(sum2, 2.0) / n))) AS pearson_r, n FROM (SELECT N1.reviewer AS reviewer_1, N2.reviewer AS reviewer_2, SUM(N1.rating) AS sum1, SUM(N2.rating) AS sum2, SUM(N1.rating * N1.rating) AS sum1sq, SUM(N2.rating * N2.rating) AS sum2sq, SUM(N1.rating * N2.rating) AS psum, COUNT(*) AS N FROM Movie_Reviews AS N1 LEFT OUTER JOIN Movie_Reviews AS N2 ON N1.movie_title = N2.movie_title WHERE N1.reviewer > N2.reviewer GROUP BY N1.reviewer, N2.reviewer) AS step1;
18.226.251.70