Behind the SPARROW server
How does it work?
The predictors offered in the SPARROW server are the result of
statistical learning procedures with roots in Vladimir Vapnik's
statistical learning theory. To apply this theory, the problem of
protein secondary structure prediction is transformed into a
multiclass classification problem. The eight secondary structure
classes provided by DSSP serve as the reference classification. As
is commonly done, these eight classes are reduced to three secondary
structure classes, as described in the help section.
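The exact mapping applied by the server is the one documented in the
help section; the short sketch below only illustrates one commonly
used convention for such a reduction (helical classes to helix,
extended classes to strand, everything else to coil) and is not
necessarily identical to it.

    # Illustrative 8-to-3 reduction of DSSP classes; the mapping used by the
    # SPARROW server may differ (see the help section).
    DSSP_TO_THREE = {
        "H": "H", "G": "H", "I": "H",   # alpha-, 3-10- and pi-helix -> helix
        "E": "E", "B": "E",             # extended strand, beta-bridge -> strand
        "T": "C", "S": "C", "-": "C",   # turn, bend, unassigned -> coil
    }

    def reduce_dssp(eight_state: str) -> str:
        """Map a string of 8-state DSSP labels to 3-state labels."""
        return "".join(DSSP_TO_THREE.get(s, "C") for s in eight_state)

    print(reduce_dssp("HHHHTT-EEEEB-GGG"))   # -> HHHHCCCEEEEECHHH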
Once the problem is defined this way, supervised learning methods
from statistical learning theory can be used to produce classifiers
capable of predicting protein secondary structure. The optimal
classifiers are obtained using a finite learning dataset of proteins
with known secondary structure assignments. The learning procedure
consists of minimizing the classification error on this dataset.
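In more formal terms, the learning procedure can be summarized as
empirical risk minimization (the notation below is introduced here
only for illustration):

    \hat{f} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{F}}
    \;\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\, f(x_i) \neq y_i \,\right]

where x_i is the encoded sequence context of residue i, y_i its
assigned three-state class, N the number of residues in the learning
dataset and \mathcal{F} the family of scoring functions considered.
In practice a smooth surrogate of this misclassification error is
typically optimized.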
The SPARROW server provides two predictors based on such classifiers,
SPARROW and *SPARROW (pronounced "star-sparrow"), both operating in
three stages. The third stage is in both cases implemented by an
artificial neural network, while the first two differ slightly between
the two predictors. Both SPARROW and *SPARROW are trained with the
most recent release of the ASTRAL40 compendium (currently version
1.75, released June 5, 2009). The following paragraphs give a brief
insight into the methods used.
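As a purely conceptual illustration of such a staged architecture,
the sketch below assumes that the first two stages produce
per-residue class scores from sliding sequence windows and that the
third stage is a small feed-forward neural network operating on a
window of stage-two scores; all window sizes, dimensions and weights
are placeholders and do not reflect the actual SPARROW or *SPARROW
models.

    import numpy as np

    # Hypothetical three-stage pipeline: two scoring stages followed by a
    # small neural-network filter. All weights are random placeholders.
    rng = np.random.default_rng(0)
    N_CLASSES, WIN, N_FEAT = 3, 15, 20    # assumed window size and feature count

    W1 = rng.normal(size=(WIN * N_FEAT, N_CLASSES))     # stage-1 scoring weights
    W2 = rng.normal(size=(WIN * N_CLASSES, N_CLASSES))  # stage-2 scoring weights
    W3a = rng.normal(size=(WIN * N_CLASSES, 10))        # stage-3 neural network
    W3b = rng.normal(size=(10, N_CLASSES))

    def windows(x, w):
        """Stack a zero-padded sliding window of w rows around each row of x."""
        pad = np.zeros((w // 2, x.shape[1]))
        xp = np.vstack([pad, x, pad])
        return np.stack([xp[i:i + w].ravel() for i in range(len(x))])

    def predict(profile):
        """profile: (L, N_FEAT) per-residue features, e.g. a sequence profile."""
        s1 = windows(profile, WIN) @ W1       # stage 1: per-residue class scores
        s2 = windows(s1, WIN) @ W2            # stage 2: rescoring of stage-1 scores
        h = np.tanh(windows(s2, WIN) @ W3a)   # stage 3: small feed-forward network
        return (h @ W3b).argmax(axis=1)       # final three-state label per residue

    print(predict(rng.normal(size=(30, N_FEAT))))    # toy input: 30 residues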
SPARROW - divide et impera
SPARROW was developed by Francesco Bettella during his time as a
Ph.D. student in the AGKnapp group. The strategy behind the SPARROW
predictor is to split the multiclass problem into a set of two-class
problems. Each of these is addressed individually with the linear
discriminant approach introduced by R. A. Fisher in the 1930s. The
resulting discriminants (quadratic functions of the amino acid
sequence, but linear in their parameters), each optimized to
recognize one secondary structure type, provide a measure of the
propensity of a sequence to adopt that secondary structure. Direct
comparison of these propensities then yields the actual multiclass
prediction.
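The sketch below illustrates this one-against-rest strategy: one
Fisher-type discriminant is fitted per class on quadratic feature
expansions of an encoded sequence window, and the class with the
highest discriminant score (propensity) is predicted. The data, the
encoding and the use of scikit-learn are illustrative assumptions,
not SPARROW's actual parameterization or training procedure.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # One-against-rest sketch: one Fisher-type discriminant per secondary
    # structure class, acting on quadratic features of the encoded window.
    rng = np.random.default_rng(0)
    X_win = rng.normal(size=(500, 9))   # 500 residues, toy 9-dim window encoding
    y = rng.integers(0, 3, size=500)    # toy three-state labels 0=H, 1=E, 2=C

    X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_win)

    # One two-class discriminant per class (class vs. rest); its decision value
    # is read as the propensity for that secondary structure type.
    discriminants = [
        LinearDiscriminantAnalysis().fit(X, (y == c).astype(int)) for c in range(3)
    ]

    def propensities(x_quad):
        return np.column_stack([d.decision_function(x_quad) for d in discriminants])

    # Multiclass prediction: compare the propensities and take the largest.
    y_pred = propensities(X).argmax(axis=1)
    print("training accuracy on toy data:", (y_pred == y).mean())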
*SPARROW - a vectorial approach
In the course of 2010, Dawid Rasinski developed SPARROW's successor,
*SPARROW, as part of his Ph.D. project. Like SPARROW, *SPARROW uses
quadratic scoring functions to perform classifications in its first
two stages. However, instead of reducing the multiclass problem to a
set of two-class problems, *SPARROW solves the multiclass problem
directly. For this purpose a vector-valued scoring function is used
in each of the first two stages. Each scalar component of the
vector-valued function is a quadratic function of the sequence. The
number of component functions, and hence the dimension of the
function's output space, is one less than the number of classes in
the classification problem the function is applied to. In this
output space each class is represented by a unique vector, the
so-called class vector. Members of a given class should lie as close
as possible to the corresponding class vector. Training therefore
minimizes the mean squared distance between the outputs of the
vector-valued scoring function for the learning examples (residues)
and the class vectors of the classes these examples belong to. Once
fully trained, the vector-valued scoring function classifies unknown
samples by assigning each of them to the class whose class vector is
closest to the function's output.
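To make this concrete, the sketch below works through a toy
three-class example under assumed encodings (it is not *SPARROW's
actual model): the three class vectors are placed at the corners of
an equilateral triangle in a two-dimensional output space, the
vector-valued quadratic scoring function is fitted by least squares,
which minimizes the mean squared distance of the outputs to the
class vectors of their targets, and unknown samples are assigned to
the nearest class vector.

    import numpy as np

    rng = np.random.default_rng(0)
    X_win = rng.normal(size=(500, 6))   # toy window encoding, 500 residues
    y = rng.integers(0, 3, size=500)    # toy three-state labels

    def quad(X):
        """Quadratic feature map: bias, linear terms and all pairwise products."""
        cross = np.einsum("ni,nj->nij", X, X)
        iu = np.triu_indices(X.shape[1])
        return np.hstack([np.ones((len(X), 1)), X, cross[:, iu[0], iu[1]]])

    # K = 3 classes represented in a (K - 1) = 2 dimensional output space by
    # class vectors placed at the corners of an equilateral triangle.
    angles = 2 * np.pi * np.arange(3) / 3
    CLASS_VECTORS = np.column_stack([np.cos(angles), np.sin(angles)])   # (3, 2)

    # Training: least-squares fit of the vector-valued scoring function, i.e.
    # minimize the mean squared distance between its outputs and the class
    # vectors of the corresponding targets.
    Phi = quad(X_win)
    W, *_ = np.linalg.lstsq(Phi, CLASS_VECTORS[y], rcond=None)

    def predict(X_new):
        """Assign each sample to the class whose class vector is nearest."""
        out = quad(X_new) @ W                                   # (n, 2) outputs
        d2 = ((out[:, None, :] - CLASS_VECTORS[None]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    print("training accuracy on toy data:", (predict(X_win) == y).mean())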