Introduction
What PostStat isPostStat is a set of statistics extensions for PostgreSQL. It includes routines that compute a number of cumulative probability distributions, linear regression, and statistical tests, both parametric and non-parametric. The extensions were implemented as a way to test statistical hypothesis in simple models and so the returned value from all routines can be -- direct or indirectly -- seen as a p-value.
PostStat is licensed under the MIT license (very similar to PostgreSQL's BSD license) and so can be freely used for academic and commercial purposes. Please refer to the Installation section for more details.
Probability distributions
Probability densities and cumulative distributionsThe motivation for providing probability densities (mass functions in the
case of discrete distributions) and cumulative distribution functions (CDFs) is
to allow the user the possibility of running quick statistical tests and
thus obtain p-values. Some statistical tests rely on point hypothesis, which
justifies the availability of their respective density functions. All
probability distribution routines return double precision
values.
Discrete distributions
dbinom(x, n, p)
pbinom(x, n, p)
Probability mass (dbinom
) and cumulative distribution
(pbinom
) functions for the binomial distribution with
input x
, the number of successes, and parameters
n
, the number of experiments, and p
, the
probability of success.
dhyper(x, w, b, n)
phyper(x, w, b, n)
Probability mass (dhyper
) and cumulative distribution
(phyper
) functions for the hypergeometric distribution
with input x
, the number of white balls drawn from an urn
that contains black and white balls, and parameters w
, the
number of white balls in the urn, b
, the number of black balls in
the urn, and n
, the number of balls drawn from the urn.
pnbinom(x, n, p)
CDF for the negative binomial distribution with input
x
, the number of failures before n
successes in a
sequence of Bernoulli experiments with probability of success
p
.
ppois(x, lambda)
CDF for the Poisson distribution with input x
and
parameter lambda
.
Continuous distributions
pexp(x, rate)
CDF for the exponential distribution with input x
and
parameter rate
.
pgamma(x, shape [, scale])
CDF for the gamma distribution with input x
and
parameters shape
and scale
. The
default value for scale
is 1.
pnorm(x [, mean [, sd]])
CDF for the normal distribution with input x
and
parameters mean
and sd
,
standard deviation. The default values for mean
is 0 and
for sd
is 1.
pchisq(x, df [, pnonc])
CDF for the chi-squared distribution with input x
and
parameters df
, the number of degrees of freedom, and
pnonc
, the non-centrality parameter. The default value
for pnonc
is 0.
pt(x, df)
CDF for the Student t distribution with input x
and
parameter df
, the number of degrees of freedom.
pf(x, dfn, dfd [, pnonc])
CDF for the F distribution with input x
and
parameters dfn
, the number of degrees of freedom in the
numerator, dfd
, the number of degrees of freedom in the
denominator, and pnonc
, the non-centrality
parameter. The default value for pnonc
is 0.
Non-parametric distributions
dsignrank(x, n)
psignrank(x, n)
Probability density (dsignrank
) and cumulative distribution
(psignrank
) functions for the distribution of the
Wilcoxon signed rank statistic with input x
and
parameter n
, the size of the sample.
dwilcox(x, m, n)
pwilcox(x, m, n)
Probability density (dwilcox
) and cumulative distribution
(pwilcox
) functions for the distribution of the
Wilcoxon rank sum statistic with input x
and
parameters m
and n
, the
sizes of the samples.
Linear regression
Simple linear fitSimple linear models can be computed using the low level functions
lsfit
and lssfit
. Both functions
compute the minimum-norm solution of linear least squares (LLS) problem
specified by the arguments and return the residual sum-of-squares (RSS) and
the rank of the design matrix. Note that the actual solution to the LLS
problem is not returned but only the sufficient statistics to generate a
p-value from a F-test using the higher level functions
lm
and lms
.
Aggregate functions vector_accum
and matrix_accum
can be used to build array arguments for the linear system functions
above. A fstat
type, defined as
CREATE TYPE fstat AS (f double precision, pvalue double precision);
is used to return values from lm
and lms
.
vector_accum(x)
Aggregate function that takes a set of double precision
values
and returns a one-dimensional array containing them.
matrix_accum(row)
Aggregate function that takes a set of one-dimensional
double precision
array as rows and returns a two-dimensional
array.
lsfit(y, X)
lssfit(y, X)
Takes a one-dimensional double precision
array y
and a two-dimensional double precision
array X
,
finds the solution beta
that minimizes the norm of
X * beta - y
, and returns a double precision
array
containing the minimum norm and the rank of X
in this order.
lsfit
computes the minimum norm using a complete orthogonal
factorization of X
, while lssfit
computes the
singular value decomposition (SVD) of X
.
lm(y, X [, X0])
lms(y, X [, X0])
Returns a fstat
type containing the F statistic and p-value
for testing y ~ X
against the null y ~ X0
.
y
and X
have the same interpretation and
specification as in lsfit
and lssfit
.
X0
is similar to X
and taken to be a zero array if
not provided. lm
uses lsfit
to derive RSS and
rank while lms
uses lssfit
.
Example
Suppose we want to linearly fit the following data:
# select * from linreg; id | x | y ----+---+--- 1 | 1 | 2 2 | 2 | 3 3 | 3 | 5 4 | 4 | 9 5 | 5 | 8 (5 rows)
Fitting y
to beta * x + alpha
gives:
# select lsfit(vector_accum(y), matrix_accum(ARRAY[1, x])) from linreg; lsfit --------- {4.8,2} (1 row)
To test against the null of beta = 0
we issue:
# select lm(vector_accum(y), matrix_accum(ARRAY[1, x]), # matrix_accum(ARRAY[1])) from linreg; lm ---------------------------- (20.25,0.0204904123444534) (1 row)
and so we can reject at 5% level since the p-value ~ 0.02.
Statistical tests
Parametric and non-parametric testsPostStat also provides other common tests that are more elaborate and cannot be easily cast in a CDF formulation.
fisher(t [, hybrid])
fisher2(t11, t12, t21, t22)
Returns the p-value of Fisher's exact test for testing the null of
independence of rows and columns in a contingency table t
with
fixed marginals. Table t
should be a two-dimensional double
precision array. Optional parameter hybrid
indicates if exact
probabilities should be computed (the default) or a hybrid approximation
should be assessed.
fisher2
is provided for convenience when running the test on 2
x 2 tables, and is equivalent to calling fisher(t)
when
t = ARRAY[[t11, t12], [t21, t22]]
.
shapiro(v)
Returns the p-value of the Shapiro-Wilk test of normality on the
one-dimensional single precision (real
) array v
.
Set enrichment analysis
Given a ranked list L
and a set S
of
elements in L
, the purpose of Set Enrichment Analysis (SEA) is to
assess how significant is the rank distribution in S
with respect
to a uniform rank distribution. S
partitions the elements in
L
by pertinence, and usually represents a category that
classifies elements.
Statistical significance is assessed by the p-value of a max deviation running sum statistic. More details about this statistic can be found at the paper that inspired this routine:
Keller, A., Backes, C., and Lenhof, H-P, (2007) "Computation of significance scores of unweighted Gene Set Enrichment Analyses", BMC Bioinformatics (8). http://www.biomedcentral.com/1471-2105/8/290
The p-value is computed similarly to the reference above, using dynamic programming, but uses a different formulation, is more memory efficient, and does not depend on external libraries for extended precision.
sea(m, l, s)
Returns the p-value of the max deviation running sum statistic
s
for a ranked list of size m
and a set of size
l
.
sea(rlist)
Aggregate function that takes an ordered boolean set rlist
as
a ranked list where true
signals pertinence to a set and returns
the p-value of its max deviation running sum statistic.
Installation
How to obtain and install PostStatPostStat is distributed as a source package and can be obtained at
PgFoundry.
After downloading and unpacking, our first task is to compile
libf77stat
. A FORTRAN compiler is needed, and you might have to
change the Makefile to suit your needs.
$ cd f77stat && make && make clean
The source package is then installed like any regular PostgreSQL module, that is, just run:
$ make && sudo make install $ psql -f poststat.sql <database>
BLAS and
LAPACK libraries are required to
build PostStat. For better performance it is recommended to install an
optimized version of these libraries, such as the ones provided by
the ATLAS project. As in the
libf77stat
compilation, you might need to edit the Makefile
according to your setup.
License
Copyright (c) 2008 Luis Carvalho
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.