Introduction
What PostBio isPostBio is a set of bioinformatics extensions for
PostgreSQL. It includes two data
types, int_interval
, an integer
interval used to represent biological sequence features, and
stree
, a suffix tree type to search for
maximum unique matches, and a set of
utilitary routines.
PostBio is licensed under the MIT license (very similar to PostgreSQL's BSD license) and so can be freely used for academic and commercial purposes. Please refer to the Installation section for more details.
Integer intervals
A type for biological sequence featuresAn int_interval
is not much different from a
line segment lseg
type in PostGIS, being an specialization of it:
only integer coordinates are allowed and the line lies in the unidimensional
space represented by a sequence, which gives the "interval" representation. As
such, integer intervals admit topological operations and can be indexed by
GiSTs.
Values of int_interval
are specified with
int_interval '( low , high )'
where low
and high
are integers. Implicit casts
from and to text
are possible, and an integer i
can
be casted to (i, i)
.
The following table describes all int_interval
operators,
where we use ii1
as int_interval'(l1, h1)'
and
ii2
as int_interval'(l2, h2)'
in the
example column. The operators are actually syntatic sugar for the functions in
the function column.
Operator | Description | Function | Example |
# |
Size (length) | int_size |
#ii1 = h1 - l1 + 1 |
#< |
Lower endpoint | int_low |
#<ii1 = l1 |
#> |
Upper endpoint | int_high |
#>ii1 = h1 |
|-| |
Distance | int_distance |
ii1 |-| ii2 = (|l1 - l2| + |h1 - h2| - #ii1 - #ii2) / 2
|
<-> |
Buffer | int_buffer |
ii1 <-> b = int_interval(max(0, l1 - b), h1 + b)
|
<< |
Strictly left? | int_left |
ii1 << ii2 is true iff
h1 < l2 |
&< |
Left? | int_overleft |
ii1 &< ii2 is true iff
h1 <= h2 |
&& |
Overlaps? | int_overlaps |
ii1 && ii2 is true iff
l1 <= h2 and h1 >= l2 |
&> |
Right? | int_overright |
ii1 &> ii2 is true iff
l1 >= l2 |
>> |
Strictly right? | int_right |
ii1 >> ii2 is true iff
l1 > h2 |
~= |
Same? | int_same |
ii1 ~= ii2 is true iff
l1 = l2 and h1 = h2 |
~ |
Contains? | int_contains |
ii1 ~ ii2 is true iff
l1 <= l2 and h1 >= h2 |
@ |
Contained? | int_contained |
ii1 @ ii2 is true iff
ii2 ~ ii1 |
+ |
Join | int_join |
ii1 + ii2 = int_interval(min(l1, l2), max(h1, h2)) if ii1 and ii2 overlap and NULL
otherwise
|
* |
Overlap | int_overlap |
ii1 * ii2 = int_interval(max(l1, l2), min(h1, h2)) if ii1 and ii2 overlap and NULL
otherwise
|
^ |
Span | int_span |
ii1 ^ ii2 = int_interval(min(l1, l2), max(h1, h2))
|
Operators join and overlap are pairwise union and
intersection respectively. Union and intersection are defined as aggregate
functions int_union
and int_intersection
returning
an int_interval
. The aggregator version of int_span
is int_cover
.
A function intpair_to_int
that takes integers l
and h
as arguments and returns int_interval'(l, h)'
is provided for convenience. There is also a version of
int_buffer
that allows the specification of left and right
offsets: int_buffer(ii, b1, b2)
, where ii = (l, h)
,
returns (max(0, l - b1), h + b2)
; if b1 = b2
we have
the version in the table.
Examples
A sequence feature corresponds to a triplet
(id, orient, ii)
where id
is a sequence identifier
(possibly the primary key for a sequence table), orient
is a
boolean indicating if the feature is in the same or contrary orientation of
the sequence, and ii
is the int_interval
representing the feature as a subsequence.
Now, as a genomic application, suppose we have two tables of sequence
features: genes
and probesets
, both with sequence
identifiers relative to the chromosome sequence table for some species:
CREATE TABLE genes ( id integer PRIMARY KEY, seq_id integer REFERENCES sequences(id), orient boolean, cds int_interval -- coding region ); CREATE INDEX genes_id_idx ON genes(seq_id, orient); CREATE INDEX genes_cds_idx ON genes USING gist(cds); CREATE TABLE probesets ( id integer PRIMARY KEY, seq_id integer REFERENCES sequences(id), orient boolean, region int_interval ); CREATE INDEX probesets_id_idx ON probesets(seq_id, orient); CREATE INDEX probesets_region_idx ON probesets USING gist(region);
Our first task is to select which probesets are contained within each gene:
SELECT p.seq_id, p.orient, p.region, g.cds FROM probesets p JOIN genes g ON g.seq_id=p.seq_id AND g.orient=p.orient -- same sequence and orientation AND p.region @ g.cds; -- probeset contained in CDS
As a more substantial application, let's now select probesets that are closest to a gene, within a specific maximum from the coding region (CDS) but not overlapping the CDS. We start with an auxiliary view:
\set offset 500 CREATE VIEW pset_buffer AS SELECT g.id AS gid, p.id AS pid, @(p.region |-| g.cds) AS dist FROM probesets p JOIN genes g ON g.seq_id=p.seq_id AND g.orient=p.orient -- same sequence and orientation -- overlapping buffer but not CDS: AND p.region && (g.cds <-> :offset) AND NOT p.region && g.cds;
Finally, we use pset_buffer
to select probesets that attain
minimum distance and report more details:
SELECT g.seq_id, g.orient, p.region, g.cds FROM genes g JOIN (SELECT b.gid, b.pid FROM pset_buffer b JOIN (SELECT gid, MIN(dist) AS dist FROM pset_buffer GROUP BY gid) AS q ON b.gid=q.gid AND b.dist=q.dist) AS j ON g.id=j.gid JOIN probesets p ON p.id=j.pid;
Suffix trees
Efficient nucleotide sequence searchA suffix tree for a reference sequence S
is
a data structure representing suffixes of S
in a way to allow
fast searching of maximal matches from query sequences. You can get
more general information on suffix trees at
Wikipedia.
The suffix tree data type, stree
, is a
serialized version of the suffix tree struct found in
MUMmer version 3.20, more
specifically in its core suffix tree library libstree implemented by
Stefan Kurtz. For convenience, libstree is provided in PostBio but you
are strongly advised to refer to the links for more details and
references.
Values of stree
are specified from a lower-case extended
nucleotide sequence, that is,
stree '[a | c | g | t | s | w | r | y | m | k | b | d | h | v | n]+'
Suffix trees can also be casted from and to text
. If you
decide to store your reference sequences as stree
s and plan on
retrieving many subsequences then
stree_subtext(stree, start, len)
is more
memory efficient than casting to text
and running
substr(reftext, start, len)
.
Maximal matches between a stree
and a query sequence can be
found with maxmatch
, which uses a composite type,
streematch
, to store each result. If only the number of matches
are wanted maxmatchcount
can be used.
CREATE TYPE streematch AS (matchlen integer, refstart integer, querystart integer);
maxmatch(stree, query, uniqueinref, minmatchlen)
SRF that returns a
maximal match (of type streematch
) per row between the reference
sequence in stree
and the query
sequence.
uniqueinref
is a boolean for whether the matches in the reference
are unique; true
is equivalent to specifying
"-mumcand
" in MUMmer (the default behavior) while
false
is equivalent to "-maxmatch
".
minmatchlen
is the minimum length of the match, being
equivalent to the "-l
" input in MUMmer; if
minmatchlen = 0
then the query sequence length is taken as the
actual minimum length.
maxmatchcount(stree, query, uniqueinref, minmatchlen, anymatch)
Returns the number of maximal matches in stree
from
query
with length at least minmatchlen
. The first
four arguments have the same interpretation as in maxmatch
.
anymatch
is a boolean for whether maxmatchcount
should only check if there is any match returning 0 or 1, or if the actual
number of matches should be returned.
Examples
A simple example would be:
# select maxmatch('acgtacgt', 'cgta', false, 2); maxmatch ---------- (4,2,1) (3,6,1) (2 rows)
If uniqueinref
were true
the result would be only
the first row. In the next example we have two tables, one with sequences and
other with motifs, and we wish to count the number of occurrences of each
motif in each sequence:
SELECT s.id, motif, maxmatchcount(seq::stree, motif, false, 0) AS nmatches FROM sequences s, motifs;
Utils
Utilitary routinesFor now PostBio offers only one utilitary routine:
revcomp(seq [, complement])
Returns the reverse complement of sequence seq
if
complement
is true
(the default value) and the
simple reverse otherwise. seq
should contain only extended
nucleotides, that is, only the same letters from the alphabet defining
stree
.
Installation
How to obtain and install PostBioPostBio is distributed as a source package and can be obtained at
PgFoundry.
After downloading and unpacking, our first task is to compile
libstree
:
$ cd stree && make && make clean
The source package is then installed like any regular PostgreSQL module, that is, just run:
$ make && sudo make install $ psql -f postbio.sql <database>
License
Copyright (c) 2008 Luis Carvalho
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.